retrieve data from a website with multiple pages

Question


Hi all,

I want to pull the data from this website into a table.

https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1

It has 185 pages, so I wrote a for loop to go through the entire table.

The problem is that I'm using webread, which seems to read everything into a char array.

What I want is for the data from this table to be read in each iteration of the for loop. How can this be done?

Thanks

4 Comments

Rik on 19 Feb 2022

You can search the html for the text in the table and guess the structure from what you see.
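
As a minimal sketch of that idea, suppose the downloaded page contained rows like the toy HTML below. The class names colId and colName are made up for illustration and are not taken from the site. Once you have spotted the text surrounding each cell, extractBetween can pull out the values:

```matlab
% Toy HTML standing in for a downloaded page; class names are hypothetical.
html = "<table>" + ...
    '<tr><td class="colId">101</td><td class="colName">Alpha</td></tr>' + ...
    '<tr><td class="colId">102</td><td class="colName">Beta</td></tr>' + ...
    "</table>";

% Take the text between each opening marker and the next closing </td>.
ids   = extractBetween(html, '<td class="colId">', "</td>");
names = extractBetween(html, '<td class="colName">', "</td>");

% Combine the columns into a table and convert the ID column to numbers.
T = table(double(ids), names, 'VariableNames', {'Id', 'Name'});
disp(T)
```

This kind of marker matching is brittle: it breaks as soon as the site changes its markup, so the markers should be rechecked against the real HTML.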

sani on 19 Feb 2022

I think it is a <table>, if I understand correctly. I tried setting the ContentType option in weboptions to 'table', but it says that there is only text.

I'm not sure that this is the right way to approach it, though.


Answer 1

Ive J on 19 Feb 2022

Edited: Ive J on 20 Feb 2022

My answer doesn't totally solve your problem, but it hopefully addresses your main questions. Before you can parse the HTML itself, note that webread does not return the actual content of the URL, because the website uses some measures against bot attacks (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript), so that needs to be fixed first.

url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table 
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
ans = 8×7 table
    מספר רישיון                שם יצרן                         כתובת                ישוב           מחוז                                                                  פרטים_סוג מזון (מהות היצור):                                                                        פרטים_קבוצת מזון:         
    ___________    ________________________________    _____________________    _____________    _________    __________________________________________________________________________________________________________________________________________________    ___________________________________

       55678       "א. הקר 2009 גלאט למהדרין בע"מ"     "מרכז ספיר 3 ירושלים"    "ירושלים"        "ירושלים"    "ייצור מוצרי בשר קפואים בלבד: בשר בקר טחון, בשר בעלי כנף טחון ומוצריהם, קישקע ממולא, בשר בקר מעובד, בשר בעלי כנף מעובד, ניסור ואריזת בשר בקר קפוא"    "הסעדה"                            
       68795       "א. כ. התעשיינים בע"מ"              "שד הסנהדרין 3 יבנה"     "יבנה"           "מרכז"       "בשר ומוצריו, לרבות עופות וצייד"                                                                                                                      "הסעדה (קיטרינג)"                  
       52319       "א.א בורקס ליאון"                   "איתן 24 ראשון לציון"    "ראשון לציון"    "מרכז"       "אחסנה בקירור"                                                                                                                                        "אחסון מזון בקירור"                
       69047       "א.א בליסימו בע"מ"                  "איתן 3 ראשון לציון"     "ראשון לציון"    "מרכז"       "קרחונים אכילים, כולל שרבט וסורבט"                                                                                                                    "מחסן קרור/מחסן בטמ' מבוקרת"       
       67457       "א.א מטעמים הכי טעים בע"מ"          "מודיעין 8 פתח תקווה"    "פתח תקווה"      "מרכז"       "ייצור בצקים ממולאים, ייצור עוגיות יבשות"                                                                                                             "לחם, לחמניות, עוגות שמרים ומאפים" 
       52312       "א.א. בליסימו בע"מ"                 "לזרוב 3 ראשון לציון"    "ראשון לציון"    "מרכז"       "מוצרי מאפה, תערובות להכנתם ובצקים"                                                                                                                   "לחמים ולחמניות מאודים"            
       50780       "א.א. דרך האוכל (חיפה) בע"מ"        "שנקר אריה 47 חיפה"      "חיפה"           "חיפה"       "אחסנת בצקים קפואים"                                                                                                                                  "יצור מוצרי בשר בקר וצאן טחון בלבד"
       52587       "א.א. לרנר מוצרי מזון העמק בע"מ"    "הפועלים 2 באר שבע"      "באר שבע"        "דרום"       "מחסן קרור/מחסן בטמ' מבוקרת"                                                                                                                          "בשר ומוצריו, לרבות עופות וצייד"   

10 Comments

Ive J on 21 Feb 2022

Edited: Ive J on 21 Feb 2022

I'm not sure if I get it right; do you mean you tried something like this?

function unitab = parseFoodAndNutrition(n)
if nargin < 1
    n = 3; % read only 3 pages
end
unitab = cell(n, 1);
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    unitab{i} = readEachPage(i);
end
unitab = vertcat(unitab{:});
unitab = convertvars(unitab, 1:width(unitab), @string);
unitab.(1) = double(unitab.(1));
end % END
%% subfunctions ===========================================================
function tab = readEachPage(n)
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=" + n;
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table 
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
% can be done at once in the end
% tab = convertvars(tab, 1:width(tab), @string);
% tab.(1) = double(tab.(1));
end

When I run the above function I get this:

size(unitab)
ans =
    36     7

sani on 22 Feb 2022

I actually put your entire script in a for loop and changed the URL as i increased. Then, in each iteration, I appended the result from your script to another table using vertcat. If I understand correctly, the result size(unitab) = (36,7) is for pages 1-3? If so, this is the dimension I'm expecting to receive.

Ive J on 22 Feb 2022

Yes, that's for 3 pages.

Feel free to use the function above! Also be aware that if you send too many requests to a website, it may (temporarily) block your IP.
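
One way to reduce that risk is to pause between requests and retry a failed page after a longer wait. The following is only a sketch under assumptions: the delay and retry values are guesses rather than anything the site documents, and readPage is a hypothetical offline stand-in so the loop can run without network access (swap in @readEachPage for real pages).

```matlab
% Hypothetical stand-in for readEachPage so the sketch runs offline.
readPage = @(k) table(k, "row from page " + k, 'VariableNames', {'Page', 'Text'});

n = 3;            % number of pages to read
delaySec = 0.5;   % assumed pause between requests; tune for the site
maxTries = 3;     % assumed retry limit per page

pages = cell(n, 1);
for k = 1:n
    for attempt = 1:maxTries
        try
            pages{k} = readPage(k);
            break                        % success: stop retrying this page
        catch err
            warning('page %d, attempt %d failed: %s', k, attempt, err.message)
            pause(delaySec * 2^attempt)  % wait longer after each failure
        end
    end
    pause(delaySec)                      % spread requests out over time
end

% Pages that failed on every attempt stay empty and are skipped here.
pages = pages(~cellfun(@isempty, pages));
result = vertcat(pages{:});
```

With 185 pages, even a half-second pause adds only about a minute and a half to the whole run.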

To track possible parsing bugs, you can also save each table as a MAT-file. That way, if you expect, say, 120 rows and get only 100, you can inspect each table individually. You can do this by changing the loop as follows:

for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains table for page 10
    unitab{i} = tab;
end
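
Once the per-page files exist, a short check can point to the pages worth inspecting. This is a sketch under assumptions: it first writes three toy files to a temporary folder so it runs offline, and it assumes 12 rows on a full page (36 rows over 3 pages above). The last page of the real site may legitimately be shorter.

```matlab
% Write three toy page files to a temporary folder so the sketch runs
% offline; with real data, point `folder` at the saved tab.page.N.mat files.
folder = tempname;
mkdir(folder)
rowsPerPage = [12 12 9];                 % made-up sizes; page 3 is short
for k = 1:numel(rowsPerPage)
    tab = table((1:rowsPerPage(k))', 'VariableNames', {'Row'});
    save(fullfile(folder, "tab.page." + k + ".mat"), "tab")
end

expectedRows = 12;                       % assumed rows on a full page
files = dir(fullfile(folder, "tab.page.*.mat"));
nRows = zeros(numel(files), 1);
for k = 1:numel(files)
    s = load(fullfile(files(k).folder, files(k).name), "tab");
    nRows(k) = height(s.tab);            % row count of this page's table
end

% Names of the files whose tables have fewer rows than expected.
short = string({files(nRows < expectedRows).name})
```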
