parquetDatastore

Datastore for collection of Parquet files

Description

Use a ParquetDatastore object to manage a collection of Parquet files, where each individual Parquet file fits in memory, but the entire collection of files does not necessarily fit. You can create a ParquetDatastore object using the parquetDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Syntax

pds = parquetDatastore(location)
pds = parquetDatastore(location,Name,Value)

Description

pds = parquetDatastore(location) creates a datastore pds from the collection of Parquet files specified by location.

pds = parquetDatastore(location,Name,Value) specifies additional parameters and properties for pds using one or more name-value pair arguments.

Input Arguments

Files or folders included in the datastore, given by the location argument and specified as a path or a DsFileSet object.

  • path — Specify the path as a character vector, cell array of character vectors, string scalar, or a string array, containing the location of files or folders that are local or remote.

    • Local files or folders — Specify location as a local path to files or folders. If the files are not in the current folder, then the local path must be a full or relative path. Files within subfolders of the specified folder are not automatically included in the datastore. You can use the wildcard character (*) when specifying the local path. This character specifies that the datastore include all matching files or all files in the matching folders.

    • Remote files or folders — Specify location as the full path of the files or folders, written as an internationalized resource identifier (IRI) of the form hdfs:///path_to_file. For more information, see Work with Remote Data.

  • DsFileSet object — You also can specify location as a DsFileSet object. For more information, see matlab.io.datastore.DsFileSet.

When location represents a folder, the datastore includes only supported file formats and ignores any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions property.

The parquetDatastore function supports the .parquet file format.

Example: 'myfile.parquet'

Example: '../dir/data/myfile.parquet'

Example: {'C:\dir\data\myfile01.parquet','C:\dir\data\myfile02.parquet'}

Example: 's3://bucketname/path_to_files/*.parquet'
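The forms of location described above can be compared side by side. The following is a minimal sketch, assuming a scratch folder created with tempname; the file names part01.parquet and part02.parquet and the table contents are hypothetical and exist only for illustration.

```matlab
% Build a scratch folder with two small Parquet files (all names are hypothetical).
demoDir = tempname;
mkdir(demoDir);
T = table((1:6)', ["a";"b";"c";"d";"e";"f"], 'VariableNames', {'Id','Label'});
parquetwrite(fullfile(demoDir,'part01.parquet'), T(1:3,:));
parquetwrite(fullfile(demoDir,'part02.parquet'), T(4:6,:));

% 1) A folder: every supported file directly inside it is included.
pdsFolder = parquetDatastore(demoDir);

% 2) A wildcard: only files whose names match the pattern are included.
pdsWild = parquetDatastore(fullfile(demoDir,'part*.parquet'));

% 3) An explicit list of files as a cell array of character vectors.
pdsList = parquetDatastore({fullfile(demoDir,'part01.parquet'), ...
                            fullfile(demoDir,'part02.parquet')});

% Count the files that each datastore resolved.
numFiles = [numel(pdsFolder.Files), numel(pdsWild.Files), numel(pdsList.Files)];
```

In this layout all three calls should resolve to the same two files, so the choice between them is mostly a matter of how the file collection is organized on disk.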

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'IncludeSubfolders',true

Extensions to include in datastore, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array.

  • If you do not specify 'FileExtensions', then parquetDatastore automatically includes all files with .parquet and .parq extensions in the specified path.

  • If you want to include Parquet files with nonstandard file extensions in the parquetDatastore, then specify those extensions explicitly.

  • If you want to create a parquetDatastore for files without any extensions, then specify 'FileExtensions' as an empty character vector, ''.

Example: 'FileExtensions',{'.parquet','.parq'}

Example: 'FileExtensions','.myformat'

Example: 'FileExtensions',''

Data Types: char | cell | string

Subfolder inclusion flag, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true or false. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify 'IncludeSubfolders', then the default value is false.

Example: 'IncludeSubfolders',true

Data Types: logical | double
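The effect of 'IncludeSubfolders' and 'FileExtensions' can be sketched as follows. This example assumes a scratch folder layout made up for illustration (a top-level file current.parquet and a subfolder archive holding old.parquet); none of these names come from the shipped examples.

```matlab
% Scratch layout (hypothetical names): one file at the top level, one in a subfolder.
rootDir = tempname;
subDir = fullfile(rootDir,'archive');
mkdir(subDir);                       % also creates rootDir
T = table((1:4)', [10;20;30;40], 'VariableNames', {'Id','Value'});
parquetwrite(fullfile(rootDir,'current.parquet'), T(1:2,:));
parquetwrite(fullfile(subDir,'old.parquet'), T(3:4,:));

% Default behavior: files in subfolders are not included.
pdsTop = parquetDatastore(rootDir);

% Include subfolders and state the accepted extension explicitly.
pdsAll = parquetDatastore(rootDir, ...
    'IncludeSubfolders', true, ...
    'FileExtensions', '.parquet');

numTop = numel(pdsTop.Files);        % files found at the top level only
numAll = numel(pdsAll.Files);        % files found at every level
```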

Output data type, specified as the comma-separated pair consisting of 'OutputType' and 'auto', 'table', or 'timetable'.

  • 'auto' — Return a table or a timetable. The parquetDatastore detects whether the output should be a table or a timetable based on other name-value pairs that you specify. When you specify the RowTimes name-value pair, the parquetDatastore infers that the output is a timetable.

  • 'table' — Return a table. For more information on the table data type, see table.

  • 'timetable' — Return a timetable. For more information on the timetable data type, see timetable.

The value of OutputType selects the data type returned from the preview, read, and readall functions.

Example: 'OutputType','timetable'

Data Types: char | string

Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine but need to access and process the data on another machine (possibly running a different operating system). You must also use 'AlternateFileSystemRoots' to associate the root paths when you process data using Parallel Computing Toolbox™ and MATLAB® Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines that run a different platform.

  • To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify 'AlternateFileSystemRoots' as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell arrays of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of 'AlternateFileSystemRoots' must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Properties

ParquetDatastore properties describe the format of the files in a datastore object and control how the data is read from the datastore. With the exception of the Files property, you can specify the value of ParquetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use dot notation.

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument defines these files.

The first file specified in the cell array determines the variable names and format information for all files in the datastore.

Example: {'C:\dir\data\file1.ext';'C:\dir\data\file2.ext'}

Data Types: cell | string

Amount of data to read in a call to the read function, specified as 'rowgroup', 'file', or a positive integer.

  • 'rowgroup' — Each call to read reads the number of rows specified in the row groups of the Parquet file. To get the number of rows in row groups, see the RowGroupHeights property of the ParquetInfo object.

  • 'file' — Each call to read reads all of the data in one file.

  • positive integer — Each call to read reads a maximum of ReadSize rows.

When you change ReadSize from a positive integer to 'file' or 'rowgroup', or vice versa, MATLAB resets the datastore to an unread state where no data has been read from it.

Data Types: double | char | string
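A typical way to use a numeric ReadSize is to read the datastore in chunks until hasdata returns false. The following is a minimal sketch using the outages.parquet sample file; the chunk size of 500 rows is an arbitrary choice for illustration.

```matlab
% outages.parquet is a sample file; 500 is an arbitrary chunk size.
pds = parquetDatastore('outages.parquet','ReadSize',500);

totalRows = 0;
numReads = 0;
maxChunk = 0;
while hasdata(pds)
    chunk = read(pds);                        % at most 500 rows per call
    totalRows = totalRows + height(chunk);
    maxChunk = max(maxChunk, height(chunk));
    numReads = numReads + 1;
end

% Switching from a number to 'file' resets the datastore to an unread state,
% so this read starts again from the first row of the file.
pds.ReadSize = 'file';
wholeFile = read(pds);
```

Because the datastore holds a single file, the rows accumulated in the loop should match the height of wholeFile.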

Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first nonheader line in the first file. You can specify VariableNames as a character vector or string scalar; however, the datastore converts the value to a cell array of character vectors before storing it. When you modify the VariableNames property, the number of new variable names must match the number of original variable names.

If ReadVariableNames is false, then VariableNames defaults to {'Var1','Var2', ...}.

Example: {'Time','Date','Quantity'}

Data Types: char | cell | string

Variables to read from the file (the SelectedVariableNames property), specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

Example: {'Var3','Var7','Var4'}

Data Types: cell | string
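As a brief illustration, the SelectedVariableNames property can restrict reads to a subset of columns. This sketch assumes the outages.parquet sample file and uses the variable names Region and Loss that appear in its preview.

```matlab
pds = parquetDatastore('outages.parquet');

% Keep only two of the six variables; later reads return just these columns.
pds.SelectedVariableNames = {'Region','Loss'};
sample = preview(pds);
selectedNames = sample.Properties.VariableNames;
```

The same selection can be requested at creation time by passing 'SelectedVariableNames' as a name-value pair.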

Name of the row times variable in the Parquet data, specified as the comma-separated pair consisting of 'RowTimes' and a character vector or string array containing the variable name.

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain a datetime or a duration vector.

By default, parquetDatastore uses the first datetime or duration variable as row times for the timetable.
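The interaction between OutputType and RowTimes can be sketched as follows, assuming the outages.parquet sample file, whose OutageTime variable holds datetime values.

```matlab
% Request timetable output and name the variable that supplies the row times.
pds = parquetDatastore('outages.parquet', ...
    'OutputType','timetable', ...
    'RowTimes','OutageTime');

tt = preview(pds);
isTT = istimetable(tt);              % logical flag: is the preview a timetable?
```

With these settings, OutageTime should become the time vector of the timetable instead of an ordinary variable.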

Object Functions

hasdata          Determine if data is available to read
numpartitions    Number of datastore partitions
partition        Partition a datastore
preview          Subset of data in datastore
read             Read data in datastore
readall          Read all data in datastore
reset            Reset datastore to initial state
transform        Transform datastore
combine          Combine data from multiple datastores
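Several of these functions are commonly used together. The following is a minimal sketch, assuming the outages.parquet sample file; the Loss threshold of 300 is arbitrary and chosen only for illustration.

```matlab
pds = parquetDatastore('outages.parquet');

% Split the datastore into the default number of partitions and take the first.
n = numpartitions(pds);
firstPart = partition(pds, n, 1);
rowsInFirstPart = height(readall(firstPart));

% Apply a function to each block of data as it is read.
bigLoss = transform(pds, @(t) t(t.Loss > 300, :));
filtered = readall(bigLoss);

% Return the original datastore to its unread state before reusing it.
reset(pds);
```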

Examples

Create a ParquetDatastore object containing the file outages.parquet.

pds = parquetDatastore('outages.parquet')
pds = 
  ParquetDatastore with properties:

                       Files: {
                              ' .../devel/bat/Bdoc19a/build/matlab/toolbox/matlab/demos/outages.parquet'
                              }
               VariableNames: {1x6 cell}
       SelectedVariableNames: {1x6 cell}
                    ReadSize: 'rowgroup'
                  OutputType: 'table'
                    RowTimes: []
    AlternateFileSystemRoots: {}

Create a datastore for a sample Parquet file, and then read data from the file with different ReadSize values.

Create a datastore for outages.parquet, set ReadSize to 10 rows, and then read from the datastore. The value of ReadSize determines how many rows of data are read from the datastore with each call to the read function.

pds = parquetDatastore('outages.parquet','ReadSize',10);
read(pds)
ans=10×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"

Set the ReadSize property value to 'file' and read from the datastore. Each call to the read function now reads all the data from one file. Because this datastore contains a single file, one call returns all of its data.

pds.ReadSize = 'file';
data = read(pds)
data=1468×6 table
      Region            OutageTime          Loss     Customers       RestorationTime             Cause      
    ___________    ____________________    ______    __________    ____________________    _________________

    "SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
    "SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
    "SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
    "West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
    "MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
    "West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
    "West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
    "West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
    "NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
    "MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"
    "SouthEast"    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    "equipment fault"
    "West"         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    "equipment fault"
    "SouthEast"    01-Sep-2002 18:22:00    95.917         36759    01-Sep-2002 19:12:00    "severe storm"   
    "SouthEast"    27-Sep-2003 07:32:00       NaN    3.5517e+05    04-Oct-2003 07:02:00    "severe storm"   
    "West"         12-Nov-2003 06:12:00    254.09    9.2429e+05    17-Nov-2003 02:04:00    "winter storm"   
    "NorthEast"    18-Sep-2004 05:54:00         0             0                     NaT    "equipment fault"
      ⋮

You can also set the ReadSize property to 'rowgroup'. For more information, see the ReadSize property on the ParquetDatastore object reference page.

Limitations

If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

Alternatives

You also can create a ParquetDatastore object using the datastore function. For example, ds = datastore(location,'Type','parquet') creates a datastore from a collection of files specified by location.
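As a short sketch of this alternative, assuming the outages.parquet sample file:

```matlab
% Create the datastore through the generic datastore function.
ds = datastore('outages.parquet','Type','parquet');

dsClass = class(ds);                 % class name of the returned object
firstRows = preview(ds);             % first few rows, same as for parquetDatastore
```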

Introduced in R2019a