Retrieve data from a website with multiple pages
Hi all,
I want to pull the data from this website into a table.
It has 185 pages, so I wrote a for loop to go through the entire table.
The problem is that I'm using webread, which seems to read everything into a char array.
What I want is for the table data to be read in each iteration of the for loop. How can this be done?
Thanks
4 Comments
Rik
on 19 Feb 2022
You can search the html for the text in the table and guess the structure from what you see.
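As a minimal sketch of that approach, the snippet below finds a known cell value in a small HTML fragment, inspects the surrounding markup, and then extracts every cell that follows the same pattern. The fragment and the class names `colA` and `colB` are made up for illustration; the real page will use different markup.

```matlab
% Hypothetical HTML fragment; the class names are invented for this example.
html = "<table><tr><td class=""colA"">123</td><td class=""colB"">Tel Aviv</td></tr>" + ...
       "<tr><td class=""colA"">456</td><td class=""colB"">Haifa</td></tr></table>";

% Step 1: locate a value that is visible in the rendered table.
idx = strfind(html, "Haifa");                 % start index of the known text
lo = max(idx(1) - 20, 1);                     % clamp the window to the string
hi = min(idx(1) + 10, strlength(html));
context = extractBetween(html, lo, hi);       % markup around the hit
disp(context)

% Step 2: once the delimiters are known, extract every matching cell.
colA = double(extractBetween(html, "<td class=""colA"">", "</td>"));
colB = extractBetween(html, "<td class=""colB"">", "</td>");
```

Here `colA` ends up numeric and `colB` stays a string column, which is convenient when the columns are later combined into a table.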
Accepted Answer
Ive J
on 19 Feb 2022
Edited: Ive J
on 20 Feb 2022
My answer doesn't totally solve your problem, but it addresses your main questions (hopefully!). Before the HTML itself can be parsed, there is another issue: webread doesn't return the content of the URL because the website uses some measures against bot attacks (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript), so that needs to be fixed first.
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    hdrs = {'X-AA-Challenge' char(Challenge); ...
            'X-AA-Challenge-ID' char(challenge_id); ...
            'X-AA-Challenge-Result' char(answer)};
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
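To run the same steps for every page, the URL has to change per iteration. The sketch below builds one URL per page. It assumes that the `PN` query parameter in the URL above is the page number, which the thread does not confirm, and it uses the page count of 185 from the question. The names `baseUrl`, `nPages`, and `pageUrls` are illustrative.

```matlab
% Assumption: PN in the query string selects the page number.
baseUrl = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=";
nPages = 185;                        % page count stated in the question

% Adding a numeric column to a string scalar appends each number as text,
% giving an nPages-by-1 string array of URLs.
pageUrls = baseUrl + (1:nPages)';

disp(pageUrls(1))
```

Each entry of `pageUrls` could then be passed as `url` to the code above, for instance from inside a function that takes the page index as input.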
10 Comments
Ive J
on 22 Feb 2022
Yes, that's for 3 pages.
Feel free to use the function above! Also be aware that when you send many requests to a website in a short time, it may block your IP (temporarily).
To track possible parsing bugs, you can also save each table as a MAT-file. That way, if you expect, say, 120 rows and get only 100, you can inspect each table individually. You can do this by adding these lines:
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains table for page 10
    unitab{i} = tab;
end
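Once the per-page tables are collected, they can be stacked into one table and the row counts checked per page. The sketch below is one possible way to do that. To keep it runnable offline, `readEachPage` is replaced by a hypothetical stand-in that returns a one-row dummy table. The pause length is an arbitrary choice, not a value known to avoid blocking.

```matlab
% Hypothetical stand-in for the real page reader (dummy data, no web access).
readEachPage = @(i) table(i, "row of page " + i, 'VariableNames', {'page', 'text'});

n = 3;                               % number of pages to read in this sketch
unitab = cell(n, 1);                 % preallocate one cell per page
for i = 1:n
    unitab{i} = readEachPage(i);
    pause(0.1)                       % arbitrary short delay between requests
end

% Stack all per-page tables; this requires identical variable names.
alltab = vertcat(unitab{:});

% Row count per page, to spot pages that parsed fewer rows than expected.
rowsPerPage = cellfun(@height, unitab);
```

If one page yields a table with different column names, `vertcat` will error, which is itself a useful signal that parsing went wrong on that page.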
More Answers (0)