wfejflefklefwefwefwe
Posted March 12, 2014

I want to use PHP to extract every A HREF URL on a page, together with the link text that the user clicks on. For example, given a link that displays as "This is Google", I would like to extract:

1. This is Google
2. http://google.com

I've looked into the simplehtmldom_1_5 library, but it only seems to return the URL and not the link text.

Thanks

Link to comment: https://forums.phpfreaks.com/topic/286899-extract-url-and-link-name-from-html-page/

Rifts
Posted March 12, 2014 (edited March 12, 2014 by Rifts)

You can use Simple HTML DOM to do this.

Ch0cu3r
Posted March 12, 2014

"I've looked into simplehtmldom_1_5 library but this just seems to get the URL but not the text overlay."

Try getting the link text using $element->innertext:

    // Find all links and print each URL with its link text
    foreach ($html->find('a') as $element)
        echo 'Href: ' . $element->href . ', Link-Text: ' . $element->innertext . '<br>';

Ansego
Posted March 12, 2014

PHP Simple HTML DOM Parser: CLICK

wfejflefklefwefwefwe (Author)
Posted March 12, 2014 (edited March 12, 2014 by wfejflefklefwefwefwe)

"Try getting the link text using $element->innertext

    // Find all links
    foreach ($html->find('a') as $element)
        echo 'Href: ' . $element->href . ', Link-Text: ' . $element->innertext . '<br>';"

This simply doesn't work. ->href does work; ->innertext does not.

Ch0cu3r
Posted March 12, 2014 (edited March 12, 2014 by Ch0cu3r)

Can you post your HTML here (in tags)?

wfejflefklefwefwefwe (Author)
Posted March 12, 2014

The HTML is the page source of the Guardian website, Guardian.co.uk.

I'm basically writing a PHP cURL script to download news sites, extract the headlines and URLs, and then put them all on one page. It's a convenient way to read a wide range of news sources without missing anything.

Here is a sample from the Guardian site as of now:

    <h1>
      <a href="http://www.theguardian.com/world/2014/mar/12/mh370-malaysia-airlines-search-expands-third-possible-sighting" class="link-text">Plane search expands after third possible sighting</a>
    </h1>

Ch0cu3r
Posted March 12, 2014 (edited March 12, 2014 by Ch0cu3r)

$element->innertext works fine for me:

    require_once 'simple_html_dom.php';

    // Parse the sample markup from the previous post
    $html = str_get_html('<h1> <a href="http://www.theguardian.com/world/2014/mar/12/mh370-malaysia-airlines-search-expands-third-possible-sighting" class="link-text">Plane search expands after third possible sighting</a> </h1>');

    // Print the URL and link text of every anchor
    foreach ($html->find('a') as $element)
        echo '<b>Href:</b> ' . $element->href . ', <b>Link-Text:</b> ' . $element->innertext . '<br>';

wfejflefklefwefwefwe (Author)
Posted March 12, 2014

It works well enough, I guess. I am running into cases where it shows me an image rather than link text. This happens when the link wraps an image instead of text. Is there some way to remove all images from the page?

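The image problem comes from innertext returning the markup inside the anchor, including any <img> tag. simple_html_dom also documents a plaintext property that returns text only. As a rough alternative, here is a minimal sketch that uses PHP's bundled DOM extension instead of simple_html_dom, so it is not the approach posted above. The helper name extractLinks() and the second sample link (example.com, "Story thumbnail") are made up for illustration, and falling back to the image's alt text is an assumption about what would be wanted for image-only links.

```php
<?php
// Sketch using PHP's built-in DOMDocument, not simple_html_dom.
// extractLinks() is a hypothetical helper name, not part of any library.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href === '') {
            continue; // skip anchors that have no target
        }
        // textContent holds text nodes only, so an <img> child contributes nothing.
        $text = trim(preg_replace('/\s+/', ' ', $anchor->textContent));
        if ($text === '') {
            // Assumption: use the image's alt text when the link wraps only an image.
            $img = $anchor->getElementsByTagName('img')->item(0);
            $text = $img !== null ? trim($img->getAttribute('alt')) : '';
        }
        $links[] = ['href' => $href, 'text' => $text];
    }
    return $links;
}

// First link is the Guardian sample from the thread; the second is invented.
$sample = '<h1> <a href="http://www.theguardian.com/world/2014/mar/12/mh370-malaysia-airlines-search-expands-third-possible-sighting" class="link-text">Plane search expands after third possible sighting</a> </h1>'
        . '<a href="http://example.com/story"><img src="thumb.jpg" alt="Story thumbnail"></a>';

foreach (extractLinks($sample) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Links whose text comes back empty (an image with no alt attribute) could simply be skipped when building the headline page, which avoids having to strip images from the source first.
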
Ansego
Posted March 12, 2014

Do they have an RSS feed? Would that not have the data you're looking for?

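If a site does publish an RSS feed, the headlines and URLs are already separated, so no HTML scraping is needed. The following is only a sketch under stated assumptions: the feed is a small invented RSS 2.0 sample embedded in the script (not real Guardian output, and no feed URL is assumed), and parseRssItems() is a hypothetical helper name. It relies on PHP's bundled SimpleXML extension.

```php
<?php
// Sketch: read headlines and links from an RSS 2.0 document with SimpleXML.
// The feed below is an invented sample for illustration only.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample news feed</title>
    <item>
      <title>Plane search expands after third possible sighting</title>
      <link>http://www.theguardian.com/world/2014/mar/12/mh370-malaysia-airlines-search-expands-third-possible-sighting</link>
    </item>
    <item>
      <title>Second sample headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

// parseRssItems() is a hypothetical helper, not a library function.
function parseRssItems(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return []; // the document could not be parsed as XML
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => trim((string) $item->title),
            'link'  => trim((string) $item->link),
        ];
    }
    return $items;
}

foreach (parseRssItems($rss) as $item) {
    echo $item['title'], ' - ', $item['link'], PHP_EOL;
}
```

In the cURL script described earlier in the thread, the downloaded feed body would take the place of the $rss sample string.
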