I want to extract the page buttons/widgets in a website using URLREAD.

Question

Ajpaezm on 13 Sep 2017

Edited: Cedric on 14 Sep 2017

I want to find the common expression for the buttons/widgets that contain the page numbers of a catalog, e.g. on this website. This capture shows the numbers I'd like to get using the URLREAD command.

Do you know how to do this? It would help me A LOT. I already tried printing everything into a .txt file, but I couldn't write the whole HTML code into it. My plan was to look for the common expression manually, but I couldn't print the whole output of URLREAD into the .txt file.

Thanks a lot,

Aquiles

3 Comments

Walter Roberson on 14 Sep 2017

The HTML for that section is

  <ul class='pagination'>
  <li class='disabled'><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=0'> < </a></li>
  <li class='active'><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>
  <li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>
  <li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=3'>3</a></li>
  <li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=4'>4</a></li>
  <li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=5'>5</a></li>
  <li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=6'>6</a></li>
  <li class='disabled'><span>...</span></li>
  <li><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=83'>83</a></li>
  <li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=2'> > </a></li>
  </ul>

So you want to look for index.php and page=\d+
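
A minimal sketch of that search in MATLAB, run on a trimmed, hypothetical copy of the listing above rather than on the live page (the shortened query strings and the variable names are assumptions for illustration):

```matlab
% Hypothetical sample of the pagination HTML (query strings shortened).
html = [ ...
    '<li class=''active''><a href=''/en/index.php?f=2222&exch=amex&page=1''>1</a></li>', ...
    '<li class=''''><a href=''/en/index.php?f=2222&exch=amex&page=2''>2</a></li>', ...
    '<li><a href=''/en/index.php?f=2222&exch=amex&page=83''>83</a></li>' ] ;

% Capture the digits that follow "page=" inside every index.php link.
tokens   = regexp( html, 'index\.php[^''"]*page=(\d+)', 'tokens' ) ;

% Each token is a 1x1 cell holding the digits as text; convert to numbers.
pageIds  = cellfun( @(t) str2double( t{1} ), tokens ) ;

% The largest page number is the last page of the catalog.
lastPage = max( pageIds ) ;
```

On the real page the "<" and ">" links repeat some page numbers, so taking the maximum (or calling UNIQUE first) is probably safer than counting matches.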

Ajpaezm on 14 Sep 2017

THANK YOU!

While I was writing "How did you do it?", I remembered Google Chrome had a source code viewer. It was that easy.

Thanks anyway for your time and help! :)

Walter Roberson on 14 Sep 2017

Yup, I just visited the page in Firefox and hit command-U and scrolled through the HTML.

Answer 1

Cedric on 14 Sep 2017

Edited: Cedric on 14 Sep 2017

When you start clicking on pages, the page ID is in the URL, e.g.

 https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17

you can see it as the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:

 urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
 for pageId = 1 : 83
    url  = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
 end
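
If the "do something" step is to save the raw HTML to a text file for manual inspection, as the question attempted, here is a sketch of one way to do it. It assumes (the original code is not shown) that the trouble came from passing the HTML as the format argument of FPRINTF: HTML usually contains % and \ characters, which FPRINTF would then interpret as format specifiers and escapes. Passing the text as data to '%s' avoids that. The sample string and file name are hypothetical.

```matlab
% Hypothetical stand-in for the output of URLREAD; note the % and \ characters.
html = '<td width="50%">a\b</td>' ;

% Write to a temporary file; any writable path would do.
outFile = [tempname, '.txt'] ;
fid = fopen( outFile, 'w' ) ;
assert( fid ~= -1, 'Could not open %s for writing.', outFile ) ;

% Pass the HTML as data, not as the format string.
fprintf( fid, '%s', html ) ;
fclose( fid ) ;

% Read the file back to check that nothing was dropped.
roundTrip = fileread( outFile ) ;
```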

Next, you may want to parse the HTML to get the table data, and you can use regular expressions for this. Practicing on page 1:

 pageId = 1 ;
 url    = sprintf( '%s%d', urlBase, pageId ) ;
 html   = urlread( url ) ;
 pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
 data = regexp( html, pattern, 'names' ) ;

With that you get:

 >> data
 data = 
  1×100 struct array with fields:
    ibSymbol
    externalUrl
    name
    symbol
    currency
 >> data(1)
 ans = 
  struct with fields:
       ibSymbol: 'AT'
    externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
           name: 'ATLANTIC POWER CORP'
         symbol: 'AT'
       currency: 'USD'

which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
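
To see how the named tokens behave without fetching anything, here is a small sketch that applies the same pattern to two hand-written rows. The rows and the example.com URLs are hypothetical; they only imitate the shape of the catalog table, so the real markup may differ.

```matlab
% Same pattern as above: one named token per table column of interest.
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;

% Hypothetical two-row sample in the same shape as the catalog table.
html = [ ...
    '<tr><td>AT</td><td><a href="javascript:NewWindow(''https://example.com/a'');">', ...
    'ATLANTIC POWER CORP</a></td><td>AT</td><td>USD</td></tr>', ...
    '<tr><td>XYZ</td><td><a href="javascript:NewWindow(''https://example.com/b'');">', ...
    'EXAMPLE CO</a></td><td>XYZ</td><td>USD</td></tr>' ] ;

% One struct per matched row, one field per named token.
data = regexp( html, pattern, 'names' ) ;

% Collect a single field across all rows into a cell array.
symbols = { data.symbol } ;
```

If a row fails to match on the real page, shortening the pattern one token at a time is a quick way to find which part of the markup differs.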

 html_ext = urlread( data(1).externalUrl ) ;
 pattern_ext = '...' ;
 data_ext = regexp( html_ext, pattern_ext, ... ) ;

I'll let you develop that part, though! Putting everything together, you get a crawler/parser for the whole thing:

 urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
 pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
 pattern_ext = '...' ;
 for pageId = 1 : 83
    url  = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
       html_ext = urlread( data(productId).externalUrl ) ;
       data_ext = regexp( html_ext, pattern_ext, ... ) ;
       % Do something.
    end
 end