Hi everyone, I'm new here.

 

I'm trying to pull the href attributes from all link elements in a remote HTML file. Here is the code I'm currently working with:

 

$sSourceData = file_get_contents($_POST['url']);

$oDOMDoc = new DOMDocument();

@$oDOMDoc->loadHTML($sSourceData);

$oNodeList = $oDOMDoc->getElementsByTagName('link');

$aLinkHrefs = array();

foreach ($oNodeList as $oLinkNode)
{
	if( $oLinkNode->hasAttribute('href') )
	{
		if( strlen($oLinkNode->getAttribute('href')) > 0 )
		{
			$aLinkHrefs[] = $oLinkNode->getAttribute('href');
		}
	}
}

 

This works fine most of the time; however, sometimes the script fails to grab all of the href attributes as expected.

 

If you take this page, for example, which contains the following lines of code...

 

<link rel="alternate" type="application/rss+xml" title="RSS - sam hastings - Google News " href="http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=rss">
<link rel="alternate" type="application/atom+xml" title="ATOM - sam hastings - Google News " href="http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=atom">

 

... the script doesn't seem to recognise the link elements at all.

 

I'm fairly new to DOM XML and my PHP code is recycled from a snippet I found elsewhere. Does anyone know what I've done wrong?

 

Many thanks!

Link to comment
https://forums.phpfreaks.com/topic/129391-dom-xml-problems/
Share on other sites

You mean it's returning an array containing the URLs in the link elements? It's not working at all for me for this particular URL - I set up a page with an input field so you can see for yourself:

 

http://local.civicsurf.org.uk/action/test2.php

 

Very strange. Anyone else know what could be happening?


What is the entire code for the test2.php page? Because with this as my code:

<?php
$url = 'http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=';
$sSourceData = file_get_contents($url);
   
$oDOMDoc = new DOMDocument();
   
@$oDOMDoc->loadHTML($sSourceData);
   
$oNodeList = $oDOMDoc->getElementsByTagName('link');
   
$aLinkHrefs = array();
   
foreach ($oNodeList as $oLinkNode)
{
   if( $oLinkNode->hasAttribute('href') )
   {
      if( strlen($oLinkNode->getAttribute('href')) > 0 )
      {
         $aLinkHrefs[] = $oLinkNode->getAttribute('href');
      }
   }
}
print_r($aLinkHrefs);
?>

I get

Array
(
    [0] => http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=rss
    [1] => http://news.google.com/news?ie=UTF-8&oe=utf8&q=sam+hastings&hl=en&gl=&nolr=1&output=atom
)


Here's the entire code:

 

if($_POST['submit'] == 'true') {

$sSourceData = file_get_contents($_POST['url']);

$oDOMDoc = new DOMDocument();

@$oDOMDoc->loadHTML($sSourceData);

$oNodeList = $oDOMDoc->getElementsByTagName('link');

$aLinkHrefs = array();

foreach ($oNodeList as $oLinkNode)
{
	if( $oLinkNode->hasAttribute('href') )
	{
		if( strlen($oLinkNode->getAttribute('href')) > 0 )
		{
			$aLinkHrefs[] = $oLinkNode->getAttribute('href');
		}
	}
}	
}

echo '<form action="test2.php" method="post">';
echo '<input type="text" name="url" id="url" style="width:700px;" />';
echo '<input type="submit" value="Submit" />';
echo '<input type="hidden" name="submit" value="true" />';

echo '<textarea style="width:800px;height:600px;display:block;">';
if(is_array($aLinkHrefs)) { print_r($aLinkHrefs); }
echo '</textarea>';

echo '</form>';

 

Just to check, I tried it with the exact code you provided, and the array returned is empty. Could this be a problem with the PHP installation on my server? Here's my phpinfo() page.
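
When the same script behaves differently on two servers, a few settings can be compared directly rather than read off a phpinfo() page. The following is only a sketch: checkEnvironment() is a hypothetical helper name, and the settings it lists are assumptions about what this kind of script depends on, not a confirmed diagnosis of the problem in this thread.

```php
<?php
// Hypothetical helper: report settings that a fetch-and-parse script relies on.
function checkEnvironment(): array
{
    return array(
        // file_get_contents() can only open http:// URLs when this is enabled
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // DOMDocument is provided by the dom extension
        'dom_loaded'      => extension_loaded('dom'),
        // libxml2 version reported by PHP (null if the libxml extension is missing)
        'libxml_version'  => defined('LIBXML_DOTTED_VERSION') ? LIBXML_DOTTED_VERSION : null,
        'php_version'     => PHP_VERSION,
    );
}

print_r(checkEnvironment());
```

Running this on both machines and comparing the output would at least show whether the two installations differ in PHP or libxml2 version.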


Yeah, the code you sent me works fine. Try this and see where it fails.

 

<?php
if($_SERVER['REQUEST_METHOD'] == 'POST') {

   $sSourceData = file_get_contents($_POST['url'])
     or die("Failed to get URL contents");
   
   $oDOMDoc = new DOMDocument();
   $oDOMDoc->loadHTML($sSourceData);
   $oNodeList = $oDOMDoc->getElementsByTagName('link');
   // A DOMNodeList object is always truthy, so check its length instead
   if ($oNodeList->length == 0) die("No Elements Found");
   
   $aLinkHrefs = array();
   foreach ($oNodeList as $oLinkNode)
   {
      if( $oLinkNode->hasAttribute('href') )
      {
         if( strlen($oLinkNode->getAttribute('href')) > 0 )
         {
            $aLinkHrefs[] = $oLinkNode->getAttribute('href');
         }
      }
   }   
}

echo '<form action="" method="post">';
echo '<input type="text" name="url" id="url" style="width:700px;" />';
echo '<input type="submit" value="Submit" />';
echo '<textarea style="width:800px;height:600px;display:block;">';
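// $aLinkHrefs is only set after a POST; escape the output for the textarea
if(isset($aLinkHrefs) && is_array($aLinkHrefs)) { echo htmlspecialchars(print_r($aLinkHrefs, true)); }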
echo '</textarea>';
echo '</form>';
?>

 

P.S. I don't see anything in your phpinfo() that would prevent the script from working.


I get many other warnings, but not that... and the page you are using is:

http://news.google.com/news?ie=utf8&oe=utf8&q=sam+hastings&hl=en&gl=

 

Yup. Very strange. I guess it's to do with validation errors in the page I'm providing. Perhaps there's a way of ignoring such errors in the code?
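
On ignoring such errors: PHP's libxml functions can collect parse warnings internally instead of printing them, which also makes it possible to inspect what the parser complained about. This is a minimal sketch, assuming the warnings come from DOMDocument::loadHTML(); extractLinkHrefs() is a hypothetical helper name and the sample markup is made up for illustration. It silences and records the warnings, but whether that changes the result for the Google News page is not something this sketch can confirm.

```php
<?php
// Hypothetical helper: parse HTML leniently and return the link hrefs
// together with the parser warnings that were collected.
function extractLinkHrefs(string $sHtml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings
    $bPrevious = libxml_use_internal_errors(true);

    $oDOMDoc = new DOMDocument();
    $oDOMDoc->loadHTML($sHtml);

    // Keep the collected warnings for inspection, then reset the buffer
    $aErrors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($bPrevious);

    $aLinkHrefs = array();
    foreach ($oDOMDoc->getElementsByTagName('link') as $oLinkNode) {
        // getAttribute() returns an empty string when the attribute is missing
        $sHref = $oLinkNode->getAttribute('href');
        if ($sHref !== '') {
            $aLinkHrefs[] = $sHref;
        }
    }

    return array('hrefs' => $aLinkHrefs, 'errors' => $aErrors);
}

// Made-up sample with unescaped ampersands in the href, as in the feed links
$sSample = '<html><head><link rel="alternate" type="application/rss+xml" '
         . 'href="http://example.com/news?q=a&hl=en&output=rss"></head>'
         . '<body></body></html>';

$aResult = extractLinkHrefs($sSample);
print_r($aResult['hrefs']);
foreach ($aResult['errors'] as $oError) {
    // Each entry is a LibXMLError object with line and message properties
    echo 'line ', $oError->line, ': ', trim($oError->message), "\n";
}
```

Compared with the @ operator, this keeps the warnings available for debugging, so it is easier to see whether the parser gave up on part of the document.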


Another option you can use:

 

<?php
if($_SERVER['REQUEST_METHOD'] == 'POST') {

  $sSourceData = file_get_contents($_POST['url'])
    or die("Failed to get URL contents");
  
  if(preg_match_all('/<link rel="alternate" type="application\/(\w+)\+xml" title="(.+?)" href="(.+?)">/',$sSourceData,$matches)){
    $aLinkHrefs = array();
    foreach(array_keys($matches[0]) as $n){
      $type = $matches[1][$n];
      $title = $matches[2][$n];
      $url = $matches[3][$n];
      $aLinkHrefs[] = $url;
    }
  }else{
    die("REGEX failed");
  }
}

echo '<form action="" method="post">';
echo '<input type="text" name="url" id="url" style="width:700px;" />';
echo '<input type="submit" value="Submit" />';
echo '<textarea style="width:800px;height:600px;display:block;">';
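// $aLinkHrefs is only set after a POST; escape the output for the textarea
if(isset($aLinkHrefs) && is_array($aLinkHrefs)) { echo htmlspecialchars(print_r($aLinkHrefs, true)); }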
echo '</textarea>';
echo '</form>';
?>


While that code may work for this particular instance, there's no guarantee as to the order (or even presence) of all of the attributes in the link tag (e.g. <link href="..." title="..."> vs. <link title="..." href="...">).
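
To illustrate the attribute-order point, here is a sketch that selects the feed links with an XPath query instead of a regular expression, so the order of rel, title and href does not matter. extractAlternateHrefs() is a hypothetical helper name, the sample markup is invented, and matching rel="alternate" exactly is an assumption (a rel attribute holding several tokens would need a looser test).

```php
<?php
// Hypothetical helper: return the hrefs of <link rel="alternate"> elements,
// regardless of the order in which their attributes are written.
function extractAlternateHrefs(string $sHtml): array
{
    // Keep parser warnings out of the output while loading
    $bPrevious = libxml_use_internal_errors(true);
    $oDOMDoc = new DOMDocument();
    $oDOMDoc->loadHTML($sHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($bPrevious);

    $oXPath = new DOMXPath($oDOMDoc);
    $aLinkHrefs = array();
    // Select the href attribute nodes of matching link elements
    foreach ($oXPath->query('//link[@rel="alternate"]/@href') as $oAttr) {
        if ($oAttr->value !== '') {
            $aLinkHrefs[] = $oAttr->value;
        }
    }
    return $aLinkHrefs;
}

// Invented sample: the same kind of link written with two attribute orders
$sSample = '<html><head>'
         . '<link rel="alternate" title="RSS" href="http://example.com/rss">'
         . '<link href="http://example.com/atom" title="ATOM" rel="alternate">'
         . '<link rel="stylesheet" href="style.css">'
         . '</head><body></body></html>';

print_r(extractAlternateHrefs($sSample));
```

Both feed links are picked up and the stylesheet link is skipped, because the query tests attribute values rather than their position in the tag.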

 

Anyway, I've just got off the phone with the client I'm building this for, and he's happy to copy the feeds from the Google Alerts page into Google Reader, which publishes feeds that my code can read. Thanks for all your help though, much appreciated.

