SJH Posted October 21, 2008

Hi everyone, I'm new here. I'm trying to pull the href attributes from all link elements in a remote HTML file. Here is the code I'm currently working with:

    $sSourceData = file_get_contents($_POST['url']);
    $oDOMDoc = new DOMDocument();
    @$oDOMDoc->loadHTML($sSourceData);
    $oNodeList = $oDOMDoc->getElementsByTagName('link');
    $aLinkHrefs = array();
    foreach ($oNodeList as $oLinkNode) {
        if( $oLinkNode->hasAttribute('href') ) {
            if( strlen($oLinkNode->getAttribute('href')) > 0 ) {
                $aLinkHrefs[] = $oLinkNode->getAttribute('href');
            }
        }
    }

This works fine most of the time, however sometimes the script fails to grab all of the href attributes as expected. If you take this page, for example, which contains the following lines of code...

    <link rel="alternate" type="application/rss+xml" title="RSS - sam hastings - Google News " href="http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=rss">
    <link rel="alternate" type="application/atom+xml" title="ATOM - sam hastings - Google News " href="http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=atom">

... the script doesn't seem to recognise the link elements at all. I'm fairly new to DOM XML and my PHP code is recycled from a snippet I found elsewhere. Does anyone know what I've done wrong?

Many thanks!

rhodesa Posted October 21, 2008

I copied and pasted your code, tested it on the page you provided, and it worked fine.

SJH Posted October 21, 2008

You mean it's returning an array containing the URLs in the link elements? It's not working at all for me for this particular URL. I set up a page with an input field so you can see for yourself: http://local.civicsurf.org.uk/action/test2.php

Very strange. Anyone else know what could be happening?

rhodesa Posted October 21, 2008

What is the entire code for the test2.php page? I ask because with this as my code:

    <?php
    $url = 'http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=';
    $sSourceData = file_get_contents($url);
    $oDOMDoc = new DOMDocument();
    @$oDOMDoc->loadHTML($sSourceData);
    $oNodeList = $oDOMDoc->getElementsByTagName('link');
    $aLinkHrefs = array();
    foreach ($oNodeList as $oLinkNode) {
        if( $oLinkNode->hasAttribute('href') ) {
            if( strlen($oLinkNode->getAttribute('href')) > 0 ) {
                $aLinkHrefs[] = $oLinkNode->getAttribute('href');
            }
        }
    }
    print_r($aLinkHrefs);
    ?>

I get:

    Array
    (
        [0] => http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=rss
        [1] => http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=atom
    )

SJH Posted October 21, 2008

Here's the entire code:

    if($_POST['submit'] == 'true') {
        $sSourceData = file_get_contents($_POST['url']);
        $oDOMDoc = new DOMDocument();
        @$oDOMDoc->loadHTML($sSourceData);
        $oNodeList = $oDOMDoc->getElementsByTagName('link');
        $aLinkHrefs = array();
        foreach ($oNodeList as $oLinkNode) {
            if( $oLinkNode->hasAttribute('href') ) {
                if( strlen($oLinkNode->getAttribute('href')) > 0 ) {
                    $aLinkHrefs[] = $oLinkNode->getAttribute('href');
                }
            }
        }
    }
    echo '<form action="test2.php" method="post">';
    echo '<input type="text" name="url" id="url" style="width:700px;" />';
    echo '<input type="submit" value="Submit" />';
    echo '<input type="hidden" name="submit" value="true" />';
    echo '<textarea style="width:800px;height:600px;display:block;">';
    if(is_array($aLinkHrefs)) {
        print_r($aLinkHrefs);
    }
    echo '</textarea>';
    echo '</form>';

Just to check, I just tried it with the exact code you provided and the array returned is empty. Could this be a problem with the PHP installation I have on my server? Here's my phpinfo() page.

rhodesa Posted October 21, 2008

Yeah, the code you sent me works fine. Try this and see where it fails:

    <?php
    if($_SERVER['REQUEST_METHOD'] == 'POST') {
        $sSourceData = file_get_contents($_POST['url']) or die("Failed to get URL contents");
        $oDOMDoc = new DOMDocument();
        $oDOMDoc->loadHTML($sSourceData);
        $oNodeList = $oDOMDoc->getElementsByTagName('link') or die("No Elements Found");
        $aLinkHrefs = array();
        foreach ($oNodeList as $oLinkNode) {
            if( $oLinkNode->hasAttribute('href') ) {
                if( strlen($oLinkNode->getAttribute('href')) > 0 ) {
                    $aLinkHrefs[] = $oLinkNode->getAttribute('href');
                }
            }
        }
    }
    echo '<form action="" method="post">';
    echo '<input type="text" name="url" id="url" style="width:700px;" />';
    echo '<input type="submit" value="Submit" />';
    echo '<textarea style="width:800px;height:600px;display:block;">';
    if(is_array($aLinkHrefs)) {
        print_r($aLinkHrefs);
    }
    echo '</textarea>';
    echo '</form>';
    ?>

P.S. I don't see anything in your phpinfo() that would prevent the script from working.

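One caveat about the debugging code above, shown as a minimal sketch: getElementsByTagName() returns a DOMNodeList object even when nothing matches, and objects are truthy in PHP, so an "or die(...)" placed after that call is not expected to fire. Checking the list's length property is one way to detect the empty case. The sample markup and the names $bTruthy and $iFound are made up for illustration.

```php
<?php
// Made-up input that contains no <link> elements at all.
$sSourceData = '<html><head><title>No feeds here</title></head><body></body></html>';

$oDOMDoc = new DOMDocument();
$oDOMDoc->loadHTML($sSourceData);

$oNodeList = $oDOMDoc->getElementsByTagName('link');

// The node list is an object, and objects cast to true,
// so "... or die('No Elements Found')" would never trigger here.
$bTruthy = (bool) $oNodeList;

// The length property reports how many elements actually matched.
$iFound = $oNodeList->length;

echo $bTruthy ? "node list is truthy\n" : "node list is falsy\n";
echo "link elements found: " . $iFound . "\n";
```

In the same spirit, comparing the result of file_get_contents() against false with === distinguishes a failed fetch from an empty response body.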
SJH Posted October 21, 2008

Your revised code brings up the following error:

    Warning: DOMDocument::loadHTML() [function.DOMDocument-loadHTML]: Opening and ending tag mismatch: td and font in Entity, line: 25 in /home/civicsur/public_html/local/test4.php on line 8

rhodesa Posted October 21, 2008

I get many other warnings, but not that one. And the page you are using is: http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=

SJH Posted October 21, 2008

rhodesa said:
    i get many other warnings, but not that....and the page you are using is: http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=

Yup. Very strange. I guess it's to do with validation errors in the page I'm providing. Perhaps there's a way of ignoring such errors in the code?

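On the question of ignoring such errors: one option, sketched here rather than taken from the thread, is libxml_use_internal_errors(), which tells libxml to buffer parse problems instead of raising PHP warnings. The parser still builds a document from the malformed markup as best it can. The sample HTML and the example.com URL are invented, and whether this would have fixed the empty array on SJH's server is not something the thread establishes.

```php
<?php
// Invented, deliberately malformed markup (mismatched td/font tags).
$sSampleHtml = '<html><head>'
    . '<link rel="alternate" type="application/rss+xml" href="http://example.com/feed.rss">'
    . '</head><body><table><tr><td><font>text</td></font></tr></table></body></html>';

// Buffer libxml errors internally rather than emitting PHP warnings.
$bPrevious = libxml_use_internal_errors(true);

$oDOMDoc = new DOMDocument();
$oDOMDoc->loadHTML($sSampleHtml);

// Fetch whatever the parser complained about, then empty the buffer
// and restore the previous error-handling mode.
$aParseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($bPrevious);

// Collect non-empty href values from every <link> element.
$aLinkHrefs = array();
foreach ($oDOMDoc->getElementsByTagName('link') as $oLinkNode) {
    $sHref = $oLinkNode->getAttribute('href');
    if ($sHref !== '') {
        $aLinkHrefs[] = $sHref;
    }
}

print_r($aLinkHrefs);
echo count($aParseErrors) . " parse problem(s) buffered\n";
```

Unlike the @ operator, this keeps the error details available in $aParseErrors, so they can be logged when a page yields fewer links than expected.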
rhodesa Posted October 21, 2008

Another option you can use:

    <?php
    if($_SERVER['REQUEST_METHOD'] == 'POST') {
        $sSourceData = file_get_contents($_POST['url']) or die("Failed to get URL contents");
        if(preg_match_all('/<link rel="alternate" type="application\/(\w+)\+xml" title="(.+?)" href="(.+?)">/',$sSourceData,$matches)){
            $aLinkHrefs = array();
            foreach(array_keys($matches[0]) as $n){
                $type = $matches[1][$n];
                $title = $matches[2][$n];
                $url = $matches[3][$n];
                $aLinkHrefs[] = $url;
            }
        }else{
            die("REGEX failed");
        }
    }
    echo '<form action="" method="post">';
    echo '<input type="text" name="url" id="url" style="width:700px;" />';
    echo '<input type="submit" value="Submit" />';
    echo '<textarea style="width:800px;height:600px;display:block;">';
    if(is_array($aLinkHrefs)) {
        print_r($aLinkHrefs);
    }
    echo '</textarea>';
    echo '</form>';
    ?>

SJH Posted October 21, 2008

While that code may work for this particular instance, there's no guarantee as to the order (or even presence) of all of the attributes in the link tag (e.g. <link href="..." title="..."> vs. <link title="..." href="...">).

Anyway, I've just got off the phone with the client I'm building this for, and he's happy to copy the feeds from the Google Alerts page into Google Reader, which publishes feeds that my code can read.

Thanks for all your help though, much appreciated.

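The attribute-order concern above is one reason a DOM query tends to be more robust than a regex here. As a rough sketch (not something posted in the thread), DOMXPath can select link elements by attribute value regardless of the order in which the attributes appear. The sample markup, the example.com URLs and the $aFeeds name are hypothetical.

```php
<?php
// Hypothetical markup: the first two link tags list their attributes
// in different orders; the third is not a feed link.
$sSampleHtml = '<html><head>'
    . '<link href="http://example.com/a.rss" title="A" type="application/rss+xml" rel="alternate">'
    . '<link rel="alternate" type="application/atom+xml" title="B" href="http://example.com/b.atom">'
    . '<link rel="stylesheet" href="http://example.com/style.css">'
    . '</head><body></body></html>';

// Keep any parser complaints out of the output.
libxml_use_internal_errors(true);
$oDOMDoc = new DOMDocument();
$oDOMDoc->loadHTML($sSampleHtml);
libxml_clear_errors();

// Match <link> elements that have rel="alternate" and an href attribute,
// whatever position those attributes occupy in the tag.
$oXPath = new DOMXPath($oDOMDoc);
$oNodeList = $oXPath->query('//link[@rel="alternate"][@href]');

$aFeeds = array();
foreach ($oNodeList as $oLinkNode) {
    $aFeeds[] = array(
        'type'  => $oLinkNode->getAttribute('type'),
        'title' => $oLinkNode->getAttribute('title'),
        'href'  => $oLinkNode->getAttribute('href'),
    );
}

print_r($aFeeds);
```

Missing attributes are also tolerated: getAttribute() returns an empty string when the attribute is absent, so a link without a title still ends up in $aFeeds.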