Parallel parsing of large files into a tall table

I currently have a process that I run several times per week to download the latest version of a large dataset of JSON objects and then parse it in MATLAB into a few large tables. The dataset is getting larger every day, and parsing currently takes about 2 hours, operating entirely in memory.
I have ideas on how to speed this up and make it less memory-intensive by using parfor and tall tables / datastores, but I am having difficulty putting them into practice. I'm hoping the community has thoughts.
Let me give a high-level overview and describe the difficulty I'm having; for reference, more detail is at the end of this post. I'd like to do the following:
  • Use parfor to cycle through each of roughly 100 files, each containing 20,000 JSON objects:
      • Use jsondecode to convert each of the 20,000 JSON objects to a struct
      • Parse all of the structs into a table
      • Save the table to a datastore
  • At the end of the process, I want a single tall table in a datastore, vertically concatenated from all of the separate tables and preserving the order of the files. In other words, the first 20,000 rows of the resultant table are from File 1, the next 20,000 rows are from File 2, etc.
I cannot figure out how to do this, especially the parallel loading of multiple tables into a datastore. Any feedback would be much appreciated.
For more detail, see below. Thanks in advance!
More detail:
Let me give a simplified overview of the current process and my thoughts on how to improve it.
Description of the data (the JSON objects): there is a lot of detail below, but the only important point is the last one.
  • The json objects typically contain a significant amount of data that I don’t care about. I only want a subset of data.
  • Some of the fields that I care about are only present in a subset of the objects. For example, a field like "item.type" is present as text in most objects, but absent in some. Where it is absent, I want to default to 'NULL'. If a numeric field like "timestamp" is absent, I want to default to a numeric value such as 0 or -1.
  • Some of the fields that I care about are actually arrays. For example, there is a field called "status" which is actually an array of the history of the status of the object, and I also parse all of this out. The data is messy in that some objects lack any status array; some objects have a single status element which MATLAB interprets as a struct. Some objects have multiple consistently-formatted status entries that MATLAB interprets as an array of structs, and some objects have multiple inconsistently-formatted status entries that MATLAB interprets as a cell array of structs.
  • This is all to say that I'm not aware of any canned functions that can parse these objects in the way that I require. My current process handles all of this weirdness, and my new process will need to incorporate all of the same logic. (A condensed sketch of the per-object logic follows this list.)
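To make the weirdness concrete, here is roughly what the per-object logic looks like (jsonText stands in for the raw text of a single object, and the field names are just the examples from above):

s = jsondecode(jsonText);

% Default the scalar fields that may be missing
if isfield(s, 'item') && isfield(s.item, 'type')
    itemType = s.item.type;
else
    itemType = 'NULL';               % text default for a missing field
end
if isfield(s, 'timestamp')
    ts = s.timestamp;
else
    ts = -1;                         % numeric default for a missing field
end

% Normalize the messy "status" field to a cell array of structs
if ~isfield(s, 'status')
    statusList = {};                 % no status history at all
elseif iscell(s.status)
    statusList = s.status;           % inconsistent entries decode as a cell array
else
    statusList = num2cell(s.status); % a single struct or an N-by-1 struct array
end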
File prep - this process works, and does not need to be improved right now
  • The dataset is a file with 2M+ JSON objects, which takes 10+ GB on disk
  • I split the file into chunks of 20,000 JSON objects each, each of which takes ~150 MB (easily fits in memory)
  • Using parfor, I quickly (~1 minute) build an index that scans each of these files and identifies the starting point and ending point of each JSON object within the file (sketched below). (FYI I also create a hash of each object so I can tell which ones have changed and don't have to re-parse everything. This cuts my processing time by ~70%.)
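The index pass looks roughly like this (chunkDir is a placeholder for the chunk folder; for the sketch I assume one JSON object per line with a trailing newline in each file, and I've omitted the hashing):

files = dir(fullfile(chunkDir, '*.json'));    % the ~100 chunk files
idx   = cell(numel(files), 1);
parfor k = 1:numel(files)
    raw = fileread(fullfile(files(k).folder, files(k).name));
    nl  = [0, strfind(raw, newline)];         % object boundaries (assumes trailing newline)
    idx{k} = table((nl(1:end-1) + 1)', (nl(2:end) - 1)', ...
                   'VariableNames', {'startPos', 'endPos'});
end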
Current process - this process is slow and uses a huge amount of memory, and needs to be improved
  • I then initialize a structure with one vertical array for each of the fields that I care about, with a height equal to the total number of JSON objects. For example, my_structure.item_type is initialized as a cell array with the value {'NULL'} in every row, and my_structure.timestamp is initialized with the zeros() function.
  • I then use a for loop to go through each individual JSON object:
      • I use jsondecode() to create a struct from the JSON text
      • I then read the fields that I care about and parse them one by one into the structure of vertical arrays. To deal with the fact that some fields are missing, I use try/catch for error handling.
  • I then convert the structure of vertical arrays to a table using struct2table. This table is the desired final output of this script.
As described, the current process takes about 2 hours to run through the for loop, and it needs to keep the entire output in memory.
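Condensed to the two example fields, the current pass is essentially this (rawText, startIdx, and endIdx stand in for the raw text and the boundary index from the file-prep step):

nObj = numel(startIdx);
out.item_type = repmat({'NULL'}, nObj, 1);    % preallocated defaults
out.timestamp = zeros(nObj, 1);
for i = 1:nObj
    s = jsondecode(rawText(startIdx(i):endIdx(i)));
    try
        out.item_type{i} = s.item.type;
    catch
        % field absent: the 'NULL' default stays in place
    end
    try
        out.timestamp(i) = s.timestamp;
    catch
        % field absent: the zero default stays in place
    end
end
T = struct2table(out);                        % one row per JSON object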
Proposed new process:
  • Use parfor to cycle through each of the roughly 100 files of 20,000 JSON objects:
      • Parse all of the JSON in the file into a structure of vertical arrays, one for each field, the same as in the current process except that it only has 20,000 rows
      • Convert the structure to a table using struct2table, and save it to a datastore
  • Vertically concatenate all of the separate tables into a single tall table in a datastore, preserving the order of the files
I cannot figure out how to do this. How can I, in parallel, load multiple files into a datastore?
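To be concrete, the pattern I am imagining is something like the following, though I don't know if writing intermediate files is the right way to get parfor output into a datastore. (parseChunk is a hypothetical helper wrapping the per-file parsing above; I'm assuming the parsed table has only text and numeric columns, so it survives writetable, and the zero-padded names are meant to keep the datastore's sorted file order aligned with the chunk order.)

outDir = fullfile(tempdir, 'parsed_chunks');
if ~exist(outDir, 'dir'), mkdir(outDir); end
numFiles = 100;
parfor k = 1:numFiles
    T = parseChunk(k);                        % hypothetical: 20,000-row table for file k
    writetable(T, fullfile(outDir, sprintf('chunk_%04d.csv', k)));
end
ds = tabularTextDatastore(fullfile(outDir, 'chunk_*.csv'));
tt = tall(ds);                                % single tall table over all chunks

Is that the right approach, or is there a more direct way to get the per-worker tables into a single datastore?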
