Matlab extract url from html source
Hi, I am trying to extract all URLs from HTML source code. I used the strfind command to find "http" as the start of each URL and ".html", ".php", or ".png" as its end, and then joined each start with an end to form a complete URL.
But this gives very poor results, because the starts and ends often get mismatched.
Is there an easier way to do this?
I'm thinking of a pattern search: a single command that returns all URLs that start with http:// and end with .html, .php, or .png.
The HTML source also contains URLs with other extensions, but I want to ignore all of those.
Thank you very much for any help.
0 Comments
Answers (3)
Jason Ross
on 2 May 2012
I would do this using a series of regular expressions. Take a look at "Parsing Strings with Regular Expressions" on the following page for an example. It uses email addresses, but doing it for a URL is very similar since you know how it starts and ends, and you care about what's in between.
1 Comment
Walter Roberson
on 28 Feb 2015
Walter Roberson
on 2 May 2012
regexp(TheString, 'http://.*?\.(html|php|png)', 'match')
However, this cannot tell that (say) http://mathworks.com/scripts.htmlx/logo.png should extend to the .png instead of stopping at the .html. To determine that you have reached the end of a URI, you need to know the list of characters that terminate a URI in your context, taking into account that sloppy pages often send URIs with embedded blanks, which is syntactically invalid...
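One way to handle the terminator problem is sketched below. It is only an illustration under stated assumptions: the terminator set (whitespace, quotes, angle brackets, ? and #) is a guess at what ends a URI in typical HTML attributes, the sample string is made up, and https is accepted in addition to http. The lookahead only accepts an extension when a terminator or the end of the text follows it.

```matlab
% Made-up HTML fragment standing in for real page source.
html = ['<a href="http://mathworks.com/scripts.htmlx/logo.png">logo</a> ' ...
        '<a href="http://example.com/index.html">home</a> ' ...
        '<img src="http://example.com/pic.gif"> ' ...
        '<a href="http://example.com/form.php?id=3">form</a>'];

% Body: any run of characters that are not whitespace, quotes, or angle
% brackets (assumed terminators). The lookahead (?=...) requires that the
% extension is followed by a terminator, ?, #, or the end of the text.
pattern = 'https?://[^\s"''<>]*?\.(?:html|php|png)(?=[\s"''<>?#]|$)';

urls = regexp(html, pattern, 'match');
disp(urls')
```

With this pattern the first link is matched through to logo.png, the .gif link is skipped, and the .php link is returned without its query string. URIs with embedded blanks would still be cut short, since whitespace is treated as a terminator here.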
0 Comments
Abhisar Ekka
on 13 Feb 2021
You can run this piece of code and it works.
html = webread("<----paste your url here ---->");
hyperlinks = regexp(html,'https?://[^"]+','match')'
Paste your URL inside the webread call. webread fetches the page and returns its HTML source as text. Applying regexp with the 'match' option to that text then returns every http and https link that appears in the page source.
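Since the original question only wanted links ending in .html, .php, or .png, the result can be filtered afterwards. The following is a minimal sketch, not a tested recipe for any particular site: the html string is a made-up stand-in for what webread would return, and the variable names keep and wanted are only illustrative.

```matlab
% Made-up page source standing in for the output of webread.
html = ['<a href="https://example.com/a.html">A</a>' ...
        '<img src="http://example.com/b.png">' ...
        '<script src="https://example.com/c.js"></script>'];

% Same extraction as above: everything from http(s):// up to the next quote.
hyperlinks = regexp(html, 'https?://[^"]+', 'match')';

% Keep only links whose text ends in one of the wanted extensions.
% With a cell array input and 'once', regexp returns one start index
% (or an empty array) per link.
hits   = regexp(hyperlinks, '\.(html|php|png)$', 'once');
keep   = ~cellfun(@isempty, hits);
wanted = hyperlinks(keep);
disp(wanted)
```

Anchoring on $ means a link with a query string after the extension (for example page.php?id=3) would be dropped, so the pattern may need loosening depending on the pages involved.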
1 Comment
Gobert
on 13 Jun 2021
How can one check a page's HTML for email addresses? For example, how can the code below be made to work?
html = webread("https://edition.cnn.com");
hyperlinks = regexp(html,'https?://[^"]+','match')';
rgx ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
emails = regexpi(hyperlinks,rgx,'match')';
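One possible reason the code above does not give a flat list is that regexpi, when given a cell array, returns one cell of matches per element, and http links rarely contain an email address in the first place. A hedged sketch of an alternative is to search the page text directly. The html string here is a made-up stand-in for the webread output, and the email pattern is a simplified assumption rather than a complete address grammar.

```matlab
% Made-up page source standing in for the output of webread.
html = ['<a href="mailto:Jane.Doe@example.com">mail</a> ' ...
        '<a href="https://example.com/contact.html">contact</a> ' ...
        'press@example.org'];

% Simplified email pattern (assumption: good enough for illustration only).
rgx = '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}';

% Search the whole page text, then drop duplicates.
emails = unique(regexpi(html, rgx, 'match'));
disp(emails(:))

% Searching the extracted links instead gives one cell per link,
% which has to be flattened before use.
hyperlinks = regexp(html, 'https?://[^"]+', 'match')';
perLink    = regexpi(hyperlinks, rgx, 'match');
fromLinks  = [perLink{:}];   % empty here: the only http link has no address
```

To look for addresses on each linked page, one would presumably have to call webread on every entry of hyperlinks and apply the same search to each result.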