convert html content to table

Question

roberto il 22 Apr 2023

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/1951453-convert-html-content-to-table

Modificato: Stephen23 il 24 Apr 2023

Hi everyone! I have a file with this content :

<html xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas- microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40"> <head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> <table> <tr> <th>DATE</th><th>OPENING</th><th>MAXIMUM</th><th>MINIMUM</th><th>CLOSURE</th><th>VAR % CLOSE</th> <th>VOLUME</th> </tr> <tr> <td>10/24/22</td><td>12,900</td><td>13,042</td><td>12,761</td><td>12,964</td><td>0.00 </td><td>80421</td> </tr> <tr>.

this is a sample line of content. i would like to convert the file content into a table. thanks

7 Commenti
Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

Rik il 23 Apr 2023

If you have trouble with Matlab basics you may consider doing the Onramp tutorial (which is provided for free by Mathworks).

roberto il 23 Apr 2023

Spostato: Star Strider il 23 Apr 2023

volm.xls

if you open the file directly in matlab you will get an error message. instead opening it first in excel and saving it as xlsx you will be able to use it in matlab. when opening in excel there will be a warning pop up, but you can safely open the file is safe. if opened as txt, you can see the html content.

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

Star Strider il 22 Apr 2023

0
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/1951453-convert-html-content-to-table#answer_1221108

Modificato: Star Strider il 23 Apr 2023

Apri in MATLAB Online

volm.xls

It appears to be an Excel file in a website. Perhaps using readtable would work. If not, then perhaps first using websave might.

EDIT — (22 Apr 2023 at 18:01)

Use the URL of the website as the argument to readtable.

Example from a recent post:

T1 = readtable('https://www.mathworks.com/matlabcentral/answers/uploaded_files/1245967/PPG_sensor.csv', 'VariableNamingRule','preserve')
T1 = 5187×1 table
Wave____


















EDIT — (23 Apr 2023 at 14:20)

This is not the best or most efficient code I ever wrote, however it has the virtue of working —

str = fileread('volm.xls');

expression = '(\d+\/\d+\/\d+)|(\d*\.\d*)|\d*';

[tokens,matches] = regexp(str,expression,'tokens','match');

vmatches = matches(5:end);

arlim = numel(vmatches)

arlim = 504

v = reshape(vmatches, 8, []).';

dnv = str2double(v(:,2:end));

dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy');% + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));

T1 = array2table(dnv);

T1 = addvars(T1, dtv, 'Before',1);

T1.Properties.VariableNames = {'DATA','APERTURA','MASSIMO','MINIMO','CHISURA','VAR','% CLOSE','VOLUME'}

T1 = 63×8 table

DATA APERTURA MASSIMO MINIMO CHISURA VAR % CLOSE VOLUME ___________ ________ _______ ______ _______ ___ _______ ______ 23-Jan-2023 15.141 15.209 15.071 15.16 0 0 43887 24-Jan-2023 15.177 15.215 15.073 15.154 0 4 45015 25-Jan-2023 15.143 15.2 15.018 15.134 0 13 56891 26-Jan-2023 15.207 15.274 15.119 15.177 0 28 60531 27-Jan-2023 15.202 15.239 15.124 15.195 0 12 53538 30-Jan-2023 15.204 15.212 15.033 15.183 0 8 67499 31-Jan-2023 15.164 15.243 15.038 15.174 0 6 62442 01-Feb-2023 15.207 15.366 15.151 15.219 0 30 68697 02-Feb-2023 15.307 15.596 15.302 15.565 2 27 95973 03-Feb-2023 15.551 15.556 15.392 15.509 0 36 70726 06-Feb-2023 15.427 15.451 15.32 15.395 0 74 68405 07-Feb-2023 15.43 15.483 15.317 15.355 0 26 74440 08-Feb-2023 15.449 15.537 15.411 15.451 0 63 72839 09-Feb-2023 15.486 15.705 15.445 15.568 0 76 88021 10-Feb-2023 15.484 15.523 15.272 15.331 1 52 84766 13-Feb-2023 15.327 15.468 15.279 15.441 0 72 52535

VN = T1.Properties.VariableNames;

figure

plot(T1{:,1}, T1{:,[2 3 4]})

hold on

patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)

hold off

grid

axis('padded')

xlabel(VN{1})

ylabel('Value')

legend(VN{[2 3 4]}, 'Location','best')

The ‘DATA’ are a datetime array, the rest are double values.

EDIT — (23 Apr 2023 at 14:52)

Added plot.

EDIT — (23 Apr 2023 at 17:48)

Changed code to use reshape on the the original cell array ‘vmatches’ to eliminate the loop and make the code more efficient.

.

7 Commenti
Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

Star Strider il 23 Apr 2023

Apri in MATLAB Online

russfut.xls

This one also has a time variable, so that provided a bit more excitement.

Thsi works —

str = fileread('russfut.xls');

expression = '(\d+\/\d+\/\d+)|(\d+\:\d+\:\d+)|(\d*\.\d*)|\d*';

[tokens,matches] = regexp(str,expression,'tokens','match');

% matches

vmatches = matches(5:end);

arlim = numel(vmatches)

arlim = 97191

v = reshape(vmatches, 9, []).';

dnv = str2double(v(:,3:end));

dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy') + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));

T1 = array2table(dnv);

T1 = addvars(T1, dtv, 'Before',1);

T1.Properties.VariableNames = {'DATA ORA','APERTURA','MASSIMO','MINIMO','CHISURA','VAR','% CLOSE','VOLUME'}

T1 = 10799×8 table

DATA ORA APERTURA MASSIMO MINIMO CHISURA VAR % CLOSE VOLUME ____________________ ________ _______ ______ _______ ___ _______ ______ 21-Apr-2023 11:13:23 15.885 15.885 15.885 15.885 0 0 0 21-Apr-2023 11:13:27 15.884 15.884 15.884 15.884 0 1 1 21-Apr-2023 11:13:28 15.886 15.886 15.886 15.886 0 1 7 21-Apr-2023 11:13:32 15.885 15.885 15.885 15.885 0 1 1 21-Apr-2023 11:13:33 15.885 15.885 15.885 15.885 0 0 1 21-Apr-2023 11:13:34 15.885 15.885 15.885 15.885 0 0 1 21-Apr-2023 11:13:45 15.887 15.887 15.887 15.887 0 1 1 21-Apr-2023 11:13:49 15.887 15.887 15.887 15.887 0 0 1 21-Apr-2023 11:13:51 15.888 15.888 15.888 15.888 0 1 4 21-Apr-2023 11:13:56 15.886 15.886 15.886 15.886 0 1 1 21-Apr-2023 11:14:00 15.886 15.886 15.886 15.886 0 0 1 21-Apr-2023 11:14:01 15.887 15.887 15.887 15.887 0 1 2 21-Apr-2023 11:14:04 15.886 15.886 15.886 15.886 0 1 1 21-Apr-2023 11:14:09 15.886 15.886 15.886 15.886 0 0 1 21-Apr-2023 11:14:15 15.886 15.886 15.886 15.886 0 0 1 21-Apr-2023 11:14:17 15.886 15.886 15.886 15.886 0 0 2

VN = T1.Properties.VariableNames;

figure

plot(T1{:,1}, T1{:,[2 3 4]})

hold on

patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)

hold off

grid

axis('padded')

xlabel(VN{1})

ylabel('Value')

legend(VN{[2 3 4]}, 'Location','best')

If you have more of these, it may be necessary to experiment with them individually to see whhich of these works. I also changed the earlier code to be more efficient, once I got this working correctly. (I couldn’t get reshape to work correctly with it initially.)

.

Star Strider il 24 Apr 2023

Apri in MATLAB Online

russfut.xls

I learned something new, that being that this file (and presumably the others) have a ‘,’ decimal separator and a ‘.’ thousands separator. I was not aware of that previously, and did not suspect it. So the ‘str1’ - ‘str4’ conversion steps will likely have to be added to all the previous code as well.

This version also detects the variable names and the edits them to fit the final table (and intermediate operations, specifically combining the dates and times into a single datetime array, although I cannot determine if the other combinations of variable names to make them fit the column-size of the numeric data are correct).

Unfortunatley what I previous believed was reasonably robust code is not as robust as I thought it would be, and gives me new respect for the MathWorks folks who had to code readtable and readmatrix and the rest!

The latest version —

str = fileread('russfut.xls');

str2 = strrep(str,',','#');

str3 = strrep(str2,'.','');

str4 = strrep(str3,'#','.');

vars = regexp(str4,'([A-Z]\w*)','match');

VN1 = vars(:,7:end);

str = str4;

expression = '(\d+\/\d+\/\d+)|(\d+\:\d+\:\d+)|(\d*\.\d*)|\d*';

[tokens,matches] = regexp(str,expression,'tokens','match');

% matches

vmatches = matches(4:end);

arlim = numel(vmatches)

arlim = 69776

cols = numel(VN1)-1;

v = reshape(vmatches, cols, []).';

dnv = str2double(v(:,3:end));

dtv = datetime(v(:,1), 'InputFormat','dd/MM/yy') + timeofday(datetime(v(:,2), 'InputFormat','HH:mm:ss'));

T1 = array2table(dnv);

T1 = addvars(T1, dtv, 'Before',1);

VN1{1} = [VN1{1} ' ' VN1{2}];

VN1{8} = [VN1{7} ' ' VN1{8}];

VN1([2 7]) = [];

T1.Properties.VariableNames = VN1

T1 = 8722×7 table

DATA ORA APERTURA MASSIMO MINIMO CHIUSURA VAR CLOSE VOLUME ____________________ ________ _______ ______ ________ _________ ______ 24-Apr-2023 00:00:00 13068 13068 13068 13068 0 0 24-Apr-2023 00:00:01 13068 13068 13068 13068 0 1 24-Apr-2023 00:00:02 13068 13068 13068 13068 0 8 24-Apr-2023 00:00:03 13069 13069 13069 13069 0.01 8 24-Apr-2023 00:00:04 13069 13069 13069 13069 0 8 24-Apr-2023 00:00:05 13071 13071 13071 13071 0.01 1 24-Apr-2023 00:00:06 13070 13070 13070 13070 0 8 24-Apr-2023 00:00:08 13070 13070 13070 13070 0 14 24-Apr-2023 00:00:09 13069 13069 13069 13069 0.01 6 24-Apr-2023 00:00:10 13070 13070 13070 13070 0.01 6 24-Apr-2023 00:00:11 13069 13069 13069 13069 0.01 3 24-Apr-2023 00:00:12 13070 13070 13070 13070 0.01 2 24-Apr-2023 00:00:13 13069 13069 13069 13069 0.01 4 24-Apr-2023 00:00:14 13069 13069 13069 13069 0 4 24-Apr-2023 00:00:15 13068 13068 13068 13068 0.01 11 24-Apr-2023 00:00:16 13068 13068 13068 13068 0.01 4

VN = T1.Properties.VariableNames;

figure

plot(T1{:,1}, T1{:,[2 3 4]})

hold on

patch([T1{:,1}; flip(T1{:,1})], [T1{:,3}; flip(T1{:,4})], [1 1 1]*0.25, 'EdgeColor','none', 'FaceAlpha',0.25)

hold off

grid

axis('padded')

xlabel(VN{1})

ylabel('Value €')

legend(VN{[2 3 4]}, 'Location','best')

There could be endless variations on the formats of these files, and I cannot envision a robust code to handle all of them. It would be quite helpful to find a way for them to be made compatible with readmatrix. That, unfortunately, is beyond me.

.

Stephen23 il 24 Apr 2023

Modificato: Stephen23 il 24 Apr 2023

"I learned something new, that being that this file (and presumably the others) have a ‘,’ decimal separator and a ‘.’ thousands separator."

It is a simple XML file (not a XLS file, regardless of its file extension). Everything in an XML is text.

Accedi per commentare.

convert html content to table

7 Commenti
Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

Risposta accettata

7 Commenti
Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Community Treasure Hunt

convert html content to table

7 Commenti Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

Risposta accettata

7 Commenti Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Community Treasure Hunt

7 Commenti
Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti

7 Commenti
Mostra 5 commenti meno recentiNascondi 5 commenti meno recenti