MATLAB Answers

I want to extract the page buttons/widgets in a website using URLREAD.

Asked by Ajpaezm on 13 Sep 2017
Latest activity Edited by Cedric Wannaz
on 14 Sep 2017
I want to learn the common expression for the buttons/widgets that contain the page numbers of a catalog, e.g. like on this website. In this capture you can see the numbers I'd like to get using the URLREAD command.
Do you know how to do this? You'd help me A LOT if you can. I already tried printing everything into a .txt file, but I couldn't write the whole HTML code into it. My plan was to look for the common expression manually, but I couldn't print the whole output of URLREAD into the .txt file.
Thanks a lot,
Aquiles

  3 Comments

The HTML for that section is
<ul class='pagination'>
<li class='disabled'><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=0'> < </a></li>
<li class='active'><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=1'>1</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=2'>2</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=3'>3</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=4'>4</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=5'>5</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=6'>6</a></li>
<li class='disabled'><span>...</span></li>
<li><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=83'>83</a></li>
<li class=''><a href='/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=2'> > </a></li>
</ul>
So you want to look for index.php and page=\d+
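To illustrate the hint above, here is a minimal sketch that pulls the page numbers out of the pagination links with REGEXP. The `html` string is a shortened, hypothetical excerpt of the markup quoted above (the query string is abbreviated), and the variable names are only for illustration.

```matlab
% Shortened, hypothetical excerpt of the pagination markup quoted above.
html = [ ...
    '<li class=''active''><a href=''/en/index.php?f=2222&page=1''>1</a></li>', ...
    '<li class=''''><a href=''/en/index.php?f=2222&page=2''>2</a></li>', ...
    '<li><a href=''/en/index.php?f=2222&page=83''>83</a></li>' ] ;

% Capture the digits that follow "page=" inside every index.php link.
% [^'']* keeps the match inside one single-quoted href attribute.
tokens   = regexp( html, 'index\.php[^'']*page=(\d+)', 'tokens' ) ;

% Each token is a 1x1 cell holding the digits as text; convert to numbers.
pageIds  = cellfun( @(t) str2double( t{1} ), tokens ) ;

% The largest page number is a candidate for the upper bound of a crawl loop.
lastPage = max( pageIds ) ;
```

On the real page the same call would run on the full output of URLREAD instead of the sample string.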
THANK YOU!
While I was writing "How did you do it?", I remembered Google Chrome had a source code viewer. It was that easy.
Thanks anyways for your time and help! :)
Yup, I just visited the page in Firefox and hit command-U and scrolled through the HTML.


1 Answer

Answer by Cedric Wannaz
on 14 Sep 2017
Edited by Cedric Wannaz
on 14 Sep 2017
 Accepted Answer

When you start clicking on pages, the page ID is in the URL, e.g.
https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17
where it appears as the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
for pageId = 1 : 83
    url  = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
end
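As a small variation on the loop above, the URLs can also be built up front and inspected before any request is sent. This is only a sketch: the three-page range and the names `pageIds` and `urls` are assumptions for illustration. Note also that the MathWorks documentation for recent releases lists URLREAD as not recommended and points to WEBREAD (available since R2014b) instead; the commented line shows where that call would go.

```matlab
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;

% Illustrative range only; the thread uses 1 : 83.
pageIds = 1 : 3 ;

% Build one URL per page ID; 'UniformOutput', false returns a cell array of char.
urls = arrayfun( @(id) sprintf( '%s%d', urlBase, id ), pageIds, ...
    'UniformOutput', false ) ;

% Nothing is downloaded here. To fetch a page, one would call e.g.
%   html = urlread( urls{1} ) ;   % as in the answer
%   html = webread( urls{1} ) ;   % alternative in R2014b and later
```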
Then maybe you want to parse the HTML to get the table data, and you can use regular expressions for this. Training with page 1:
pageId = 1 ;
url = sprintf( '%s%d', urlBase, pageId ) ;
html = urlread( url ) ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
'(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
'</td>\s*<td>(?<currency>[^<]+)'] ;
data = regexp( html, pattern, 'names' ) ;
With that you get:
>> data
data =
1×100 struct array with fields:
ibSymbol
externalUrl
name
symbol
currency
>> data(1)
ans =
struct with fields:
ibSymbol: 'AT'
externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
name: 'ATLANTIC POWER CORP'
symbol: 'AT'
currency: 'USD'
which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
html_ext = urlread( data(1).externalUrl ) ;
pattern_ext = '...' ;
data_ext = regexp( html_ext, pattern_ext, ... ) ;
I'll let you develop that part, though! Putting everything together, you get a crawler/parser for the whole thing:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
pattern_ext = '...' ;
for pageId = 1 : 83
    url  = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
        html_ext = urlread( data(productId).externalUrl ) ;
        data_ext = regexp( html_ext, pattern_ext, ... ) ;
        % Do something.
    end
end
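Before running the full crawl, it can help to try the table pattern offline on a single row. The sketch below does that; the `row` string is a hypothetical table row written to fit the pattern (the real markup and the example.com URL are assumptions), so it only shows how the named tokens end up as struct fields.

```matlab
% Same pattern as in the crawler above.
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;

% Hypothetical table row; the markup served by the real site may differ.
row = ['<tr><td>AT</td> <td><a href="javascript:NewWindow(', ...
    '''https://example.com/details?id=1'',''Details'');">', ...
    'ATLANTIC POWER CORP</a></td> <td>AT</td> <td>USD</td></tr>'] ;

% With 'names', each (?<name>...) group becomes a field of the output struct.
data = regexp( row, pattern, 'names' ) ;

% An empty result means the pattern did not match; worth checking per page,
% since a change in the site's markup would otherwise go unnoticed.
if isempty( data )
    warning( 'No table rows matched the pattern.' ) ;
end
```

Checking `isempty( data )` inside the page loop is a cheap way to notice when a page could not be parsed.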
That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document and go through the documentation and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
