Read and Analyze Hadoop Sequence File

This example shows how to create a datastore for a Sequence file containing key-value data, and then read and process the data one chunk at a time. Sequence files are outputs of mapreduce operations that use Hadoop®.

Set the MATLAB_HADOOP_INSTALL environment variable to the location where Hadoop is installed.

setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')

hadoop-folder is the folder where Hadoop is installed and mypath is the path to that folder.
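Before creating the datastore, it can help to confirm that the variable is visible to MATLAB and that the folder exists. The following is a minimal sketch; the path is the same placeholder used above, and the variable name hadoopHome is introduced here only for illustration.

```matlab
% Placeholder location from the example above; replace it with the
% actual Hadoop installation folder on your machine.
hadoopHome = '/mypath/hadoop-folder';
setenv('MATLAB_HADOOP_INSTALL', hadoopHome);

% getenv returns the value MATLAB currently sees for the variable.
disp(getenv('MATLAB_HADOOP_INSTALL'))

% exist returns 7 when the name refers to an existing folder.
if exist(hadoopHome, 'dir') ~= 7
    warning('Hadoop folder not found: %s', hadoopHome);
end
```

The check only reports a missing folder; it does not verify that the folder contains a working Hadoop installation.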

Create a datastore from the sample file, mapredout.seq, using the datastore function. The sample file contains unique keys representing airline carrier codes and corresponding values that represent the number of flights operated by that carrier.

ds = datastore('mapredout.seq')
ds = 
  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'

datastore returns a KeyValueDatastore. The datastore function automatically determines the appropriate type of datastore to create.
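If you prefer not to rely on automatic detection, you can request the datastore type explicitly. The sketch below assumes the same sample file is on the MATLAB path and uses the 'Type' name-value argument of datastore, then uses preview to inspect the first key-value pairs without advancing the read position.

```matlab
% Request a key-value datastore explicitly instead of relying on
% automatic type detection.
ds = datastore('mapredout.seq', 'Type', 'keyvalue');

% Confirm the class of the returned object.
disp(class(ds))

% preview returns a small table of key-value pairs from the start of
% the datastore and does not change what the next read call returns.
preview(ds)
```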

Set the ReadSize property to six so that each call to read reads at most six key-value pairs.

ds.ReadSize = 6;

Read subsets of the data from ds using the read function in a while loop. For each subset of data, compute the sum of the values. Store the sum for each subset in an array named sums. The while loop executes until hasdata(ds) returns false.

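sums = [];                          % one partial sum per subset
while hasdata(ds)
    T = read(ds);                   % table with at most ReadSize rows
    T.Value = cell2mat(T.Value);    % convert cell array of values to a numeric column
    sums(end+1) = sum(T.Value);     % append the total for this subset
end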

View the last subset of key-value pairs read. It contains only five pairs because fewer than six remained in the datastore.

T
T = 

      Key       Value
    ________    _____

    'WN'        15931
    'XE'         2357
    'YV'          849
    'ML (1)'       69
    'PA (1)'      318

Compute the total number of flights operated by all carriers.

numflights = sum(sums)
numflights =

      123523
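As a sanity check, the chunked total can be compared with a total computed from a single readall call. This is a sketch that assumes the sample file is small enough to fit in memory; the variable names chunkTotal, allPairs, and fullTotal are introduced here only for illustration.

```matlab
% Recreate the datastore so the example runs on its own.
ds = datastore('mapredout.seq');
ds.ReadSize = 6;

% Accumulate the total one subset at a time.
chunkTotal = 0;
while hasdata(ds)
    T = read(ds);
    chunkTotal = chunkTotal + sum(cell2mat(T.Value));
end

% Return to the start of the datastore and read every pair at once.
reset(ds);
allPairs = readall(ds);
fullTotal = sum(cell2mat(allPairs.Value));

% Displays 1 (true) when both approaches agree.
disp(isequal(chunkTotal, fullTotal))
```

For data that does not fit in memory, only the chunked loop is practical; readall is used here purely to verify the result on a small file.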
