MATLAB Interface for WebHDFS
Updated 2 Sep 2021
- Getting Started
- Enhancement requests
WebHDFS is a protocol that defines a public HTTP REST API which permits clients to access Hadoop® Distributed File System (HDFS) over the Web. It retains the security the native Hadoop® protocol offers and uses parallelism for better throughput. To use this toolbox, WebHDFS needs to be enabled on the Hadoop® server.
This toolbox provides a set of functions that enable the user to directly work with files and folders stored in Hadoop® via a REST API and perform common operations such as read, write, upload, and download files.
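As a rough illustration of what the REST API looks like on the wire, the sketch below builds the URL for a directory-listing request. The /webhdfs/v1 prefix and the op query parameter come from the WebHDFS REST specification; the host, port, and path are simply the sandbox values used elsewhere on this page. Nothing here is meant to describe how the toolbox builds its requests internally.

```matlab
% Illustrative values only: host and port match the sandbox example on this page.
host = "sandbox-hdp.hortonworks.com";
port = 50070;
hdfsPath = "/data/myData";

% Every WebHDFS operation lives under the /webhdfs/v1 prefix and is
% selected with the "op" query parameter (here: list a directory).
url = sprintf("http://%s:%d/webhdfs/v1%s?op=%s", host, port, hdfsPath, "LISTSTATUS");
disp(url)
```

On a reachable, unsecured server, passing a URL like this to webread would return the decoded JSON response.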
WebHDFS is not the only way to work with Hadoop® files, and other tools may suit the task at hand better.
For Big Data applications, you can prototype an algorithm in MATLAB® using either tall arrays or our Spark API and deploy it directly on a Spark-enabled Hadoop® cluster.
You can access your files using Hive and Impala, and run any SQL or HQL command. This approach might be more suitable for running queries on large pieces of data.
These tools might be more suitable to run analytics on large sets of data, while the webhdfs interface might be a better tool to do small operations, since the data needs to travel back and forth over the internet.
Only base MATLAB® is required to run all the toolbox functionality. For users accessing Hadoop® via Kerberos authentication, R2019b or newer is recommended.
[optional] Database Toolbox™ is needed to run Hive and Impala
[optional] MATLAB® Compiler™ is needed to deploy Spark or mapreduce jobs
[optional] MATLAB® Parallel Server™ is needed to run interactive Spark jobs
The toolbox can be installed directly from the Add-On Explorer, or by double-clicking the mltbx file. All the functionality will then be accessible under the namespace
The toolbox will be updated regularly. To get the newest version, you can simply uninstall and re-install the toolbox directly from the Add-On Explorer in MATLAB.
This toolbox is licensed under an XLSA license. Please see LICENSE.txt.
Complete documentation for the toolbox can be found in the getting started guide. The most common tasks are also outlined below.
For more information, please look at the documentation of the toolbox:
The connection to a Hadoop® cluster is always done via the class WebHdfsClient. This class supports several optional arguments:
- root [optional]: Root folder to parent all requests. If unspecified, all requests are assumed to be relative to "".
- protocol [optional]: whether the connection is done via http or https (default).
- host [optional]: hostname of the server running Hadoop®
- port [optional]: port number where Hadoop® is running
- name [optional]: for unauthenticated servers only. Specify the name of the user.
client = WebHdfsClient(root = 'data', protocol = 'http', host = "sandbox-hdp.hortonworks.com", port = 50070);
To avoid having to set the connection details every time, you can save them so that future connections only require the root folder of your requests (if any). These preferences persist between MATLAB® sessions.
client.saveConnectionDetails();
client = WebHdfsClient("root", 'data');
These saved preferences can be removed at any point by running:
When you work with files and folders you can specify a relative or absolute path. If the specified path starts with "/", it will be interpreted as an absolute path. Otherwise, the code will interpret the path as relative to the "root" folder in the server. For example, the following line lists the status of the folder
client = WebHdfsClient("root", 'data');
status = client.hdfs_status("myData/testMW")
status =
    accessTime: 0
    blockSize: 0
    childrenNum: 0
    fileId: 2495973
    group: 'hdfs'
    length: 0
    modificationTime: 1.6262e+12
    owner: 'maria_dev'
    pathSuffix: ''
    permission: '755'
    replication: 0
    storagePolicy: 0
    type: 'DIRECTORY'
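The relative/absolute rule described above can be sketched in a few lines of base MATLAB. This is only an illustration of the stated behaviour, assuming the root folder 'data' maps to /data on the server (as the paths shown elsewhere on this page suggest). The variable names are hypothetical and this is not the toolbox's own code.

```matlab
% Hypothetical sketch of the path rule: a leading "/" means absolute,
% anything else is resolved against the client's root folder.
root = "data";                              % root folder passed to the client
paths = ["myData/testMW", "/tmp/other"];    % one relative, one absolute path

resolved = strings(size(paths));
for k = 1:numel(paths)
    if startsWith(paths(k), "/")
        resolved(k) = paths(k);                                 % absolute: use as is
    else
        resolved(k) = "/" + strip(root, "/") + "/" + paths(k);  % relative to root
    end
end
disp(resolved)
```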
There are two authentication mechanisms supported by the toolbox:
- For unauthenticated servers, you will need to specify your user name when creating the HDFS connection if you want to access any user-specific operations.
client = WebHdfsClient(root = 'data', name = "maria_dev");
- For Kerberos authentication, please use R2019b or newer. The authentication will be done automatically.
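On a cluster without security enabled, the WebHDFS REST API identifies the caller through the user.name query parameter. The sketch below shows what a request URL would look like under the assumption that the name argument is forwarded this way. The URL layout follows the WebHDFS specification, while the mapping from the toolbox argument to this parameter is an assumption.

```matlab
% Assumption: the client's "name" argument ends up as the user.name
% query parameter that WebHDFS uses on unsecured clusters.
baseUrl = "http://sandbox-hdp.hortonworks.com:50070/webhdfs/v1/data";
user = "maria_dev";

% Ask for the status of /data on behalf of the given user.
url = baseUrl + "?op=GETFILESTATUS" + "&user.name=" + user;
disp(url)
```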
The following section outlines the most common commands to work with HDFS files. It shows how to navigate the directories and how to open, download, and upload files directly in HDFS.
hdfs_list will give you information about the files inside a specific folder. For example, the following command returns the names and status of all the files within the folder
client = WebHdfsClient("root", 'data');
elements = client.hdfs_list("myData", status = true)
If status is set to false, only the names of the files are returned.
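For context, WebHDFS answers a LISTSTATUS request with a JSON document whose FileStatuses.FileStatus array holds one entry per child. The following sketch decodes a hand-written sample response with jsondecode to show where names and status fields come from. The sample values are illustrative, not real server output, and how the toolbox parses responses internally is not covered here.

```matlab
% Hand-written sample in the shape of a WebHDFS LISTSTATUS response
% (illustrative values, not captured from a real server).
json = ['{"FileStatuses":{"FileStatus":[', ...
        '{"pathSuffix":"petdata.csv","type":"FILE","length":184,"modificationTime":1626200000000},', ...
        '{"pathSuffix":"testMW","type":"DIRECTORY","length":0,"modificationTime":1626100000000}]}}'];

resp = jsondecode(json);
entries = resp.FileStatuses.FileStatus;     % struct array, one element per child

names = string({entries.pathSuffix});       % names only
isFile = string({entries.type}) == "FILE";  % status fields remain available
fileNames = names(isFile);
disp(names)
```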
Similarly, the method hdfs_recent_files allows you to find the most recently modified files in a directory. By default, the function returns one file, but you can specify the maximum number of files to return with the nfiles argument. If you set the nfiles argument to None, then you will get back a list of all files. This function returns only the file names. For example, to view the latest file added to the folder /data/myData, we can run:
files = client.hdfs_recent_files("myData", 1)
files =
    accessTime: 1.6262e+12
    blockSize: 134217728
    childrenNum: 0
    fileId: 2499552
    group: 'hdfs'
    length: 184
    modificationTime: 1.6262e+12
    owner: 'maria_dev'
    pathSuffix: 'petdata.csv'
    permission: '777'
    replication: 1
    storagePolicy: 0
    type: 'FILE'
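Selecting the most recently modified entries amounts to sorting a listing by modificationTime and keeping the first nfiles elements. A minimal sketch of that idea on a made-up listing follows. The struct array and its values are hypothetical, and this is not necessarily how hdfs_recent_files is implemented.

```matlab
% Hypothetical listing with the two fields needed for the selection.
entries = struct( ...
    'pathSuffix',       {'a.csv', 'b.csv', 'c.csv'}, ...
    'modificationTime', {1626100000000, 1626300000000, 1626200000000});
nfiles = 2;   % maximum number of files to keep

% Sort newest first, then keep at most nfiles entries.
[~, order] = sort([entries.modificationTime], 'descend');
recent = entries(order(1:min(nfiles, numel(entries))));
recentNames = string({recent.pathSuffix});
disp(recentNames)
```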
You can use the method hdfs_makedirs to create Hadoop® directories. It will recursively create intermediate directories if they are missing. For example, the following call:
would also create the directories one/two if they were missing. Additionally, this method accepts an optional overwrite parameter (true/false) to specify if the folder needs to be overwritten. Please note that all contents will be discarded if overwrite is set to true.
You can delete files and directories from Hadoop® with the hdfs_delete method. Files/directories are not moved to the HDFS Trash, so they will be permanently deleted. hdfs_delete will return true if the file/directory was deleted and false if the file/directory did not exist.
By default, non-empty directories will not be deleted. However, if you set the optional recursive argument to true, then files/directories will be deleted recursively.
Finally, you can move files/directories in Hadoop® with the
- If the destination is an existing directory, then the source file/directory will be moved into it.
- If the destination is an existing file, then this method will return false.
- If the parent destination is missing, then this method will return false.
With hdfs_download you can download files from a Hadoop® directory. For example, to download something into a temporary directory you can run:
If hdfs_path is a file, then that file will be downloaded. If the argument is a directory, then all the files and subfolders (together with their files) in that directory will be downloaded.
Note that wildcards are not supported, so you can download either the complete contents of a directory or individual files. If the local file or directory already exists, it will not be overwritten and an error will be raised. However, you can set an overwrite flag to force the download of the files:
client.hdfs_download(testFileName, tempdir(), overwrite=true);
Uploading files to HDFS works in the same way. You can upload local files and folders with the
- If the target HDFS path exists and is a directory, then the files will be uploaded into it.
- If the target HDFS path exists and is a file, then it will be overwritten if the optional overwrite argument is set to true.
For example, to upload a single file with the chosen permissions, you can run:
client.hdfs_upload("/data/one", "myfolder", overwrite = true, permission = 777)
With hdfs_open you can directly read data from files in Hadoop® folders. The files are read into memory, so no local copy of the file is created. Use the standard MATLAB® modes "r" for text files and "rb" for binary files like Parquet. For example:
testFileName = 'myData/matlab_WebHdfsPetdata.csv';
reader = client.hdfs_open(testFileName,'r')
reader =
  WebHdfsFile with properties:
    encoding: "utf-8"
    hdfs_path: "/data/myData/Petdata.csv"
    mode: 'r'
data = reader.read();
disp(data)
Row,Age,Weight,Height
Sanchez,38,176,71
Johnson,43,163,69
Lee,38,131,64
Diaz,40,133,67
Brown,49,119,64
Sanchez,38,176,71
Johnson,43,163,69
Lee,38,131,64
Diaz,40,133,67
Brown,49,119,64
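Because the text arrives in memory rather than in a local file, one option is to parse it directly with string functions. The sketch below does this for a shortened copy of the CSV text shown above, using only base MATLAB. It assumes every row has the same number of comma-separated fields and that all columns after the first are numeric. It is an illustration, not part of the toolbox.

```matlab
% Shortened copy of the CSV text shown above, held entirely in memory.
data = sprintf("Row,Age,Weight,Height\nSanchez,38,176,71\nJohnson,43,163,69\nLee,38,131,64");

lines  = splitlines(data);            % 4-by-1 string array: header + 3 rows
header = split(lines(1), ",");        % column names
rows   = split(lines(2:end), ",");    % 3-by-4 string array of fields

% First column holds row names; remaining columns are numeric.
T = array2table(str2double(rows(:, 2:end)), ...
    'VariableNames', cellstr(header(2:end))', ...
    'RowNames', cellstr(rows(:, 1)));
disp(T)
```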
If the data can be read using a standard MATLAB® command such as parquetread, you can pass this command (with its standard inputs or parameters) as:
data = reader.read(@(x) readtable(x,'ReadVariableNames',false))
With the hdfs_open function you can also write data to files in HDFS. Note that this is different from uploading files, because the file does not exist in a local path; it is created from the data in memory.
You can overwrite existing files by setting the mode to "wb" (binary) or "wt" (text), and you can append to an existing file by setting the mode to "at" (text) or "ab" (binary). Note that appending is supported with text files and some binary formats like Avro. Appending is not supported with Parquet files.
The toolbox has four helper methods to help you write data into the file depending on the format that you choose to write.
- "write": to add any type of text data to the file.
- "writeTable": to add/append tables to a file.
- "writeCell": to add/append cells to a file.
- "writeMatrix": to add/append matrices to a file.
testFileName = '/data/myData/petdata.csv';
file = client.hdfs_open(testFileName, 'w+');
writeTable(file, T)
file.read(@readtable)
By default, the function writeTable in "write" mode adds only the variable names (not the row names) to the file and uses UTF-8 encoding. These settings can also be specified explicitly as follows:
writeTable(file, T, 'WriteVariableNames', false, 'WriteRowNames', true, 'encoding', 'UTF-8')
file.read(@(x) readtable(x,'ReadRowNames', true))
Finally, if the file is opened in "append" mode, the data will be added at the end of the file. If the file does not exist, an empty file is created first.
testFileName = 'myData/petdata.csv';
file = client.hdfs_open(testFileName, 'a+');
Unlike write mode, the default settings in append mode do not add variable headers or row names to the file. You can, however, specify these same options in append mode.
file.writeTable(T,'WriteVariableNames', true, 'WriteRowNames', true)
file.writeTable(T, 'WriteRowNames', true)
file.read(@(x) readtable(x,'ReadRowNames', true))
It is possible to set the read/write/execute file permissions programmatically. You need to pass the file permissions as a three-digit octal number, as shown in Chmod Calculator (chmod-calculator.com).
client.hdfs_set_permission( 'myData/petdata.csv', 777);
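To make the octal notation concrete, the sketch below expands a three-digit permission into the familiar rwx form, with one digit each for owner, group, and others. It takes the permission as text, in the form it appears in the status output ('755'). The variable names are hypothetical and the code is independent of the toolbox.

```matlab
% Expand a three-digit octal permission into rwx notation.
permission = '755';   % owner, group, others
flags = 'rwx';

symbolic = '';
for d = permission
    bits = dec2bin(str2double(d), 3) == '1';   % e.g. '5' -> [1 0 1]
    part = repmat('-', 1, 3);
    part(bits) = flags(bits);                  % keep the letters whose bit is set
    symbolic = [symbolic part]; %#ok<AGROW>
end
disp(symbolic)   % rwxr-xr-x
```

In other words, 755 gives the owner full access while group and others can read and execute, whereas 777 grants everything to everyone.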
To report a bug or request an enhancement, please submit a GitHub issue.
Edu Benet Cerda (2022). MATLAB Interface for WebHDFS (https://github.com/mathworks/MATLAB-Interface-for-WebHDFS/releases/tag/v1.0.0), GitHub. Retrieved .
Platform Compatibility: Windows, macOS, Linux