parquetwrite

Write columnar data to Parquet file

Description

parquetwrite(filename,T) writes a table or timetable T to a Parquet 2.0 file with the filename specified in filename.

parquetwrite(filename,T,Name,Value) specifies additional options with one or more name-value pair arguments. For example, you can specify "VariableCompression" to change the compression algorithm used, or "Version" to write the data to a Parquet 1.0 file.

Examples

Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.

Read the tabular data from the file outages.csv into a table.

T = readtable('outages.csv');

Write the data to Parquet file format. By default, the parquetwrite function uses the Snappy compression scheme. To specify another compression scheme, see the 'VariableCompression' name-value argument.

parquetwrite('outagesDefault.parquet',T)

Get the file sizes and compute the ratio of the size of the tabular data in the .csv format to the size of the same data in the .parquet format.

Get size of .csv file.

fcsv = dir(which('outages.csv'));
size_csv = fcsv.bytes
size_csv = 101040

Get size of .parquet file.

fparquet = dir('outagesDefault.parquet');
size_parquet = fparquet.bytes
size_parquet = 44881

Compute the ratio.

sizeRatio = (size_parquet/size_csv)*100;
disp(['Size Ratio = ', num2str(sizeRatio) '% of original size'])
Size Ratio = 44.419% of original size

Create nested data and write it to a Parquet file.

Create a table with one nested layer of data.

FirstName = ["Akane"; "Omar"; "Maria"];
LastName = ["Saito"; "Ali"; "Silva"];
Names = table(FirstName,LastName);
NumCourse = [5; 3; 6];
Courses = {["Calculus I"; "U.S. History"; "English Literature"; "Studio Art"; "Organic Chemistry II"];
            ["U.S. History"; "Art History"; "Philosophy"];
            ["Calculus II"; "Philosophy II"; "Ballet"; "Music Theory"; "Organic Chemistry I"; "English Literature"]};
data = table(Names,NumCourse,Courses)
data=3×3 table
            Names            NumCourse      Courses   
    FirstName    LastName                             
    _____________________    _________    ____________

     "Akane"     "Saito"         5        {5x1 string}
     "Omar"      "Ali"           3        {3x1 string}
     "Maria"     "Silva"         6        {6x1 string}

Write your nested data to a Parquet file.

parquetwrite("StudentCourseLoads.parq",data)

Read the nested Parquet data.

t2 = parquetread("StudentCourseLoads.parq")
t2=3×3 table
            Names            NumCourse      Courses   
    FirstName    LastName                             
    _____________________    _________    ____________

     "Akane"     "Saito"         5        {5x1 string}
     "Omar"      "Ali"           3        {3x1 string}
     "Maria"     "Silva"         6        {6x1 string}

Input Arguments

filename — Name of output Parquet file
character vector | string scalar

Name of output Parquet file, specified as a character vector or string scalar.

Depending on the location you are writing to, filename can take one of these forms.

Current folder — To write to the current folder, specify the name of the file in filename.

Example: 'myData.parquet'

Other folders — To write to a folder different from the current folder, specify the full or relative path name in filename.

Example: 'C:\myFolder\myData.parquet'

Example: 'dataDir\myData.parquet'

Remote location — To write to a remote location, filename must contain the full path of the file specified as a uniform resource locator (URL) of the form:

scheme_name://path_to_file/myData.parquet

Based on the remote location, scheme_name can be one of the values in this table.

Remote Location                  scheme_name
Amazon S3™                       s3
Windows Azure® Blob Storage      wasb, wasbs
HDFS™                            hdfs

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/myData.parquet'
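
For example, a minimal sketch of writing to Amazon S3, assuming your AWS credentials are supplied through environment variables (the credential values, bucket name, and path below are placeholders; replace them with your own):

```matlab
% Hypothetical credentials and bucket -- replace with your own values.
setenv('AWS_ACCESS_KEY_ID','YOUR_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY','YOUR_SECRET_ACCESS_KEY');
setenv('AWS_DEFAULT_REGION','us-east-1');

% Write a local table to the remote location.
T = readtable('outages.csv');
parquetwrite('s3://mybucket/datasets/outages.parquet',T)
```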

Data Types: char | string

T — Input data
table | timetable

Input data, specified as a table or timetable.

Use parquetwrite to export structured Parquet data. For more information on Parquet data types supported for writing, see Apache Parquet Data Type Mappings.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: parquetwrite(filename,T,'VariableCompression','gzip','Version','1.0')

VariableCompression — Compression scheme names
'snappy' (default) | 'brotli' | 'gzip' | 'uncompressed'

Compression scheme names, specified as one of these values:

  • 'snappy', 'brotli', 'gzip', or 'uncompressed' — If you specify one compression algorithm, then parquetwrite compresses all variables using that algorithm.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the name of the compression algorithm to use for each variable.

In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.

Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'brotli')

Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'brotli' 'snappy' 'gzip'})

VariableEncoding — Encoding scheme names
'auto' (default) | 'dictionary' | 'plain'

Encoding scheme names, specified as one of these values:

  • 'auto' — parquetwrite uses 'plain' encoding for logical variables and 'dictionary' encoding for all others.

  • 'dictionary', 'plain' — If you specify one encoding scheme, then parquetwrite encodes all variables with that scheme.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the name of the encoding scheme to use for each variable.

In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or number of unique values grows to be too big, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', 'plain')

Example: parquetwrite('myData.parquet', T, 'VariableEncoding', {'plain' 'dictionary' 'plain'})

RowGroupHeights — Number of rows to write per output row group
nonnegative numeric scalar | vector of nonnegative integers

Number of rows to write per output row group, specified as a nonnegative numeric scalar or vector of nonnegative integers.

  • If you specify a scalar, the scalar value sets the height of all row groups in the output Parquet file. The last row group may contain fewer rows if there is not an exact multiple.

  • If you specify a vector, each value in the vector sets the height of a corresponding row group in the output Parquet file. The sum of all the values in the vector must match the height of the input table.

A row group is the smallest subset of a Parquet file that can be read into memory at once. Reducing the row group height helps the data fit into memory when reading. Row group height also affects the performance of filtering operations on a Parquet dataset, because a larger row group height lets each read filter a larger amount of data at once.

If RowGroupHeights is unspecified and the input table exceeds 67108864 rows, the number of row groups in the output file is equal to floor(TotalNumberOfRows/67108864)+1.

Example: RowGroupHeights=100

Example: RowGroupHeights=[300, 400, 500, 0, 268]
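
As an illustrative sketch, you can confirm the row-group layout of a written file by inspecting it with parquetinfo (the filename and table contents here are arbitrary):

```matlab
% Write a 1000-row sample table split into two row groups.
T = array2table(rand(1000,3));
parquetwrite('grouped.parquet',T,'RowGroupHeights',[600 400])

% Inspect the written file's row-group layout.
info = parquetinfo('grouped.parquet');
info.NumRowGroups
```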

Version — Parquet version to use
'2.0' (default) | '1.0'

Parquet version to use, specified as either '1.0' or '2.0'. The default, '2.0', offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.

Caution

Parquet version 1.0 has a limitation that it cannot round-trip variables of type uint32 (they are read back into MATLAB® as int64).
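
To illustrate the round-trip behavior described above, a small sketch (the file name is arbitrary):

```matlab
% Write a uint32 variable to a Parquet 1.0 file.
T = table(uint32([1; 2; 3]),'VariableNames',{'x'});
parquetwrite('u32demo.parquet',T,'Version','1.0')

% Read it back and check the variable's class.
T2 = parquetread('u32demo.parquet');
class(T2.x)   % read back as 'int64' rather than 'uint32'
```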

Limitations

In some cases, parquetwrite creates files that do not represent the original array T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original table. For more information, see Apache Parquet Data Type Mappings.

Version History

Introduced in R2019a
