How do I extract the contents of an HTML table on a web page into a MATLAB table?

65 visualizzazioni (ultimi 30 giorni)
I'd like to plot and analyze the TSA traveler data from this website: https://www.tsa.gov/coronavirus/passenger-throughput
The data is embedded on the page as an HTML table element.
How do I extract the table content into a MATLAB table?

Risposta accettata

Pat Canny
Pat Canny il 23 Giu 2020
You can extract the <table> content, which is all stored in a set of <td> tags, as a string array and go from there.
You first need to use findElement and extractHTMLText on an htmlTree object.
You then can use reshape to arrange the data, then use array2table to convert to a table.
Here is one approach:
travel_data = webread('https://www.tsa.gov/coronavirus/passenger-throughput');
travel_data_tree = htmlTree(travel_data);
selector = "td";
subtrees = findElement(travel_data_tree,selector);
str = extractHTMLText(subtrees);
table_data = str(4:end); % first three elements are just the column names
reshape_ncols = 3;
reshape_nrows = length(table_data)/reshape_ncols;
table_data_reshaped = reshape(table_data,reshape_ncols,reshape_nrows)';
% Convert to table
traveler_data_table = array2table(table_data_reshaped,'VariableNames',["Date" "Travelers_Today" "Travelers_Last_Year"]); % I got lazy with VariableNames, I know.
% Convert data types from strings to appropriate types
traveler_data_table.Date = datetime(traveler_data_table.Date);
traveler_data_table.Travelers_Today = str2double(traveler_data_table.Travelers_Today);
traveler_data_table.Travelers_Last_Year = str2double(traveler_data_table.Travelers_Last_Year);
traveler_data_table.Traveler_Ratio = traveler_data_table.Travelers_Today ./ traveler_data_table.Travelers_Last_Year;
% Plot the results
figure
plot(traveler_data_table.Date,traveler_data_table.Traveler_Ratio)
title("TSA Traveler Ratio by Date (2020 vs. 2019)")
grid on
% Some more fun analysis
% When did it bottom out?
[min_ratio,idx] = min(traveler_data_table.Traveler_Ratio);
min_ratio_pct = 100*min_ratio;
min_date = traveler_data_table.Date(idx);
disp("The minimum traveler ratio of " + min_ratio_pct + "% occurred on " + string(min_date))
latest_pct = 100*traveler_data_table.Traveler_Ratio(1);
disp("The current ratio is " + latest_pct + "%")

Più risposte (1)

Christopher Creutzig
Christopher Creutzig il 7 Giu 2022
Starting in R2021b, you can directly use readtable for HTML tables:
readtable("https://www.tsa.gov/coronavirus/passenger-throughput",...
FileType="html",ReadVariableNames=true,ThousandsSeparator=",")
ans = 364×5 table
Date 2022 2021 2020 2019 __________ __________ __________ __________ __________ 06/05/2022 2.3872e+06 1.9847e+06 4.4126e+05 2.6699e+06 06/04/2022 1.9814e+06 1.6812e+06 3.5302e+05 2.226e+06 06/03/2022 2.3326e+06 1.8799e+06 4.1968e+05 2.6498e+06 06/02/2022 2.2132e+06 1.8159e+06 3.9188e+05 2.6239e+06 06/01/2022 1.9991e+06 1.5879e+06 3.0444e+05 2.3702e+06 05/31/2022 2.1081e+06 1.6828e+06 2.6774e+05 2.2474e+06 05/30/2022 2.3122e+06 1.9002e+06 3.5326e+05 2.499e+06 05/29/2022 2.0965e+06 1.6505e+06 3.5295e+05 2.5556e+06 05/28/2022 1.9942e+06 1.6058e+06 2.6887e+05 2.1172e+06 05/27/2022 2.3847e+06 1.9596e+06 3.2713e+05 2.5706e+06 05/26/2022 2.3799e+06 1.8545e+06 3.2178e+05 2.4858e+06 05/25/2022 2.1477e+06 1.6182e+06 2.6117e+05 2.269e+06 05/24/2022 2.0207e+06 1.4708e+06 2.6484e+05 2.4536e+06 05/23/2022 2.329e+06 1.7474e+06 3.4077e+05 2.5122e+06 05/22/2022 2.3509e+06 1.8637e+06 2.6745e+05 2.0707e+06 05/21/2022 1.9888e+06 1.55e+06 2.5319e+05 2.1248e+06
  3 Commenti
Simon
Simon il 27 Gen 2025 alle 9:18
Modificato: Simon il 27 Gen 2025 alle 9:20
Thanks for this excellent answer. Is there a way to read in only the latest Number in the top row when readtable() fetches the data, without doing the extraction in Matlab table?
I want to read in the latest data from a table, which is updated daily.
Christopher Creutzig
Christopher Creutzig il 27 Gen 2025 alle 12:39
You can use DataRows=[1,1] to only read from the first row.
Note: The count here does not start where the import without the option starts. Depending on the data, you may have better luck with DataRows=[2,2] or something like that.
Why not just DataRows=2? Because by backward compatibility with other related options in readtable, DataRows=2 means “data rows start at n=2.”

Accedi per commentare.

Categorie

Scopri di più su Tables in Help Center e File Exchange

Tag

Prodotti


Release

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by