Help fixing relative links in HTML pulled from CURL


linuxdream

Hello all,
Thanks in advance for your help. I have been having some problems trying to fix all relative links within HTML pulled from various web sites. I have written a class called RipCURL that makes web parsing and extracting of data quick and easy. However, the original method I used to fix the relative links did not work for all types of links allowed (single, double or no quote after the =, etc.). For instance, with my class you can do something like this:

[code]
<?php
$url = "http://www.google.com";
$html = $rip->ripRun($url, 1);
echo $html;
?>
[/code]

and the output would be the HTML provided by the $url. However, I'm having a problem with the preg_replace(). Here is the actual method. It takes the base URL used to fix all links (overridden if there is a <BASE> tag in the document) as well as the actual HTML to parse. It returns the parsed data and also sets a class variable with the clean HTML:

[code]
<?php
public function fixLinks($baseUrl, $html = null){
    if(is_null($html)){
        $html = $this->rawHtml;
    }

    $tagAttributes = array(
        'table'  => 'background',
        'td'     => 'background',
        'tr'     => 'background',
        'th'     => 'background',
        'body'   => 'background',
        'a'      => 'href',
        'link'   => 'href',
        'area'   => 'href',
        'form'   => 'action',
        'script' => 'src',
        'img'    => 'src',
        'iframe' => 'src',
        'frame'  => 'src',
        'embed'  => 'src');

    // Get hostname for document-root URLs
    $host = parse_url($baseUrl);
    $host = $host['scheme'] . "://" . $host['host'] . "/";

    // A <base> tag in the document overrides the base URL passed in
    if(preg_match('/<base(?:.*?)href=["\']?([^\'"\s]*)[\'"\s]?/is', $html, $base)){
        $baseUrl = $base[1];
        $host = $baseUrl;
    }

    // Append a trailing slash to the url if it doesn't exist
    if (substr($baseUrl, -1, 1) != '/'){
        $baseUrl .= '/';
    }

    // Works for everything but relative paths like href="someimage.jpg"
    foreach($tagAttributes as $tag=>$attribute){
        $pattern="/<$tag([^>]*)$attribute=[\"']*(?!http:|ftp:|https:|javascript:)\/([^\"'\s>]*)[\"']?/is";
        $replace="<$tag\${1}$attribute=\"$host\${2}\"";
        $html=preg_replace($pattern, $replace, $html);
    }

    // This was added recently to make the direct paths work... but I don't know why it doesn't match???
    // Does not work for direct paths like href=someimage.jpg
    foreach($tagAttribute as $tag=>$attribute){
        $pattern="/<$tag([^>]*)$attribute=[\"']*(?!http:|ftp:|https:|javascript:)([^\"'\s>]*)[\"']?/is";
        $replace="<$tag\${1}$attribute=\"$baseUrl\${2}\"";
        $html=preg_replace($pattern, $replace, $html);
    }

    $this->rawHtml = $html;
    return $html;
}
?>[/code]

Now I know it's not the most efficient way, but at this point I'm just trying to get it to work; then I'll work on efficiency. The problem is that relative paths never seem to match the second regex. Document-root links like href=/mydir/image.gif and /newfile.php always work and get replaced with the proper href=http://www.mysite.com/mydir/image.gif, etc. But relative ones don't. They simply remain relative, without the added host or base URL.
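
A detail worth checking in the method above: the first loop iterates `$tagAttributes`, while the second iterates `$tagAttribute` (no trailing "s"), a variable that is never defined in the code shown. PHP does not stop on this; it reports a notice or warning and skips the loop body, which would match the symptom described. The following is a minimal sketch of that behaviour in isolation, assuming PHP 5.3 or later for the closure. The error handler and the `$iterations` and `$messages` names are illustrative only and are not part of RipCURL.

```php
<?php
// The array as defined in fixLinks(), reduced to two entries.
$tagAttributes = array('a' => 'href', 'img' => 'src');

// Collect notices/warnings instead of printing them.
$messages = array();
set_error_handler(function ($errno, $errstr) use (&$messages) {
    $messages[] = $errstr;
    return true;
});

// Same shape as the second loop: the variable name lacks the trailing "s",
// so it is undefined and the body never runs.
$iterations = 0;
foreach ($tagAttribute as $tag => $attribute) {
    $iterations++;
}

restore_error_handler();

echo "iterations: $iterations\n";
print_r($messages);
```

This prints `iterations: 0` followed by the collected messages, whose exact wording depends on the PHP version.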

Any ideas would be a great help. This project is on SourceForge if anyone is interested: http://sourceforge.net/projects/ripcurl/

Again, thanks a lot. This is one of the last little things that has been really bugging me with this project.

Brandon C.

Make sure the enclosing quotes match (the addition of \2), make the slash optional, and require content (changed * to +).

[code]
<pre>
<?php
// Hardcoded for testing
$host = 'www.phpfreaks.com/';
// Reduced for testing
$tagAttributes=array('a'=>'href');
// Test data
$tests = array(
'<a href="/abc.jpg">...</a>',
'<a target="_blank" href="../page.php">...</a>',
"<a href='../../123.html' target='_top'>...</a>",
'<a href=1.cgi>...</a>',
'<a target="_top" href="http://www.google.com">...</a>',
);
// Run tests
foreach ($tests as $test) {
    foreach ($tagAttributes as $tag => $attribute) {
        // Group 2 = opening quote (if any), group 3 = path without the optional leading slash
        $pattern="/<$tag([^>]*)$attribute=([\"'])?(?!https?:|ftp:|javascript:)\/?([^\"'\s>]+)(?(2)\\2)/is";
        $replace="<$tag\${1}$attribute=\"$host\${3}\"";
        // $html changed to $test
        $test = preg_replace($pattern, $replace, $test);
        echo htmlspecialchars($test), '<br>';
    }
}
?>
</pre>
[/code]

[b]Yields:[/b]

[code]
<a href="www.phpfreaks.com/abc.jpg">...</a>
<a target="_blank" href="www.phpfreaks.com/../page.php">...</a>
<a href="www.phpfreaks.com/../../123.html" target='_top'>...</a>
<a href="www.phpfreaks.com/1.cgi">...</a>
<a target="_top" href="http://www.google.com">...</a>
[/code]
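
The `../` segments in the output above are prefixed but not resolved. A rough sketch of collapsing them once the path is absolute follows. `collapseDotSegments` is a hypothetical helper, not something from this thread or from PHP itself; it assumes the path starts with `/`, and it is a simplification rather than a full RFC 3986 implementation.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments in an absolute URL path.
function collapseDotSegments($path)
{
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            // Go up one level, but never above the root
            // (the leading empty segment produced by the first "/").
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    $result = implode('/', $out);
    return $result === '' ? '/' : $result;
}

echo collapseDotSegments('/links/../page.php'), "\n";   // /page.php
echo collapseDotSegments('/a/b/../../123.html'), "\n";  // /123.html
```

The helper only touches the path, so the scheme and host would have to be split off first (for example with `parse_url()`) and re-attached afterwards.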

Thanks for the quick response, effigy. It works really well when the host is a bare base URL with no directories after it. With a directory it goes wrong: change $host to www.phpfreaks.com/links/ and a link to /myimage.gif should read www.phpfreaks.com/myimage.gif, not www.phpfreaks.com/links/myimage.gif. So my only other question is about the host/base URL that gets substituted. Towards the top of the method I set both a $host, which is simply the host part of the URL passed into the function and is used for document-root links (/myweb/mypage.htm), and a $baseUrl, which is used for relative links (myweb/mypage.htm).
So my question is, since the root address can be different depending on if the link is relative or is a document root link, how do I test for that and place the right value in there? I tried working with preg's if/else (which I saw you so elegantly use in your solution) but unfortunately my knowledge of that special feature is rather limited.
Can you think of anything similar to this? I know the replace expression can't use regex, but I'm hoping you get what I mean:
[code]<?php
$pattern="/<$tag([^>]*)$attribute=([\"'])?(?!https?:|ftp:|javascript:)(\/)([^\"'\s>]+)(?(2)\\2)/is";
$replace="<$tag\${1}$attribute=\"(?(3)$host|$baseUrl)\${4}\"";
[/code]
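
PCRE conditionals are only understood in the pattern, not in the replacement string, so the idea above cannot work as written. One possible workaround, sketched here under assumptions, is two passes with plain preg_replace(): the first requires the leading slash and prefixes `$host`, the second handles whatever is still relative and prefixes `$baseUrl`. Links fixed in the first pass start with `http:` and are skipped by the lookahead in the second. The function name `fixLinksTwoPass` and the `$head`/`$tail` split are made up for illustration; the pattern pieces are taken from effigy's version.

```php
<?php
// Sketch: two passes, one per kind of link. Not the RipCURL implementation.
function fixLinksTwoPass($html, array $tagAttributes, $host, $baseUrl)
{
    foreach ($tagAttributes as $tag => $attribute) {
        // Shared pieces: group 1 = other attributes, group 2 = opening quote (if any),
        // group 3 = path; (?(2)\2) requires the matching closing quote.
        $head = "/<$tag([^>]*)$attribute=([\"'])?(?!https?:|ftp:|javascript:)";
        $tail = "([^\"'\s>]+)(?(2)\\2)/is";

        // Pass 1: leading slash required, so this is a document-root link.
        $html = preg_replace($head . "\/" . $tail,
                             "<$tag\${1}$attribute=\"$host\${3}\"", $html);

        // Pass 2: anything still relative gets the base URL.
        $html = preg_replace($head . $tail,
                             "<$tag\${1}$attribute=\"$baseUrl\${3}\"", $html);
    }
    return $html;
}

$html = '<a href="/myimage.gif">a</a><a href=page.php>b</a>';
echo fixLinksTwoPass($html, array('a' => 'href'),
                     'http://www.phpfreaks.com/',
                     'http://www.phpfreaks.com/links/'), "\n";
```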

Thanks again for your help and ideas.

Brandon C.

Thanks ShogunWarrior. That's a pretty freaking cool function(s). So my basic layout right now is like so:

[code]
<?php
$host = "http://www.somesite.com/";

foreach($tagAttributes as $tag=>$attribute){
    $pattern="/<$tag([^>]*)$attribute=[\"']?(?!https?:|ftp:|javascript:)([^\"'\s>]+)(?(2)\\2)/is";
    $html =  preg_replace_callback($pattern, 'preg_callback', $html);
    }
   
function preg_callback($matches){
    global $tag, $attribute;
    $url = $this->abs_url($host, $matches[2], 1); //abs_url is included into the class as a private method below
    $replace = "<$tag\${1}$attribute=\"$url\${3}\"";
    return $replace;
}

echo $html; //Should output corrected links, src's, etc. but doesn't. Just outputs the same input, so basically nothing matches.
?>[/code]

But it's still not replacing the links with the proper full URL. I'm sure it's because I'm not using the callback function properly... any corrections you see? The output is simply the same as the input, so nothing is matching. I have ZERO experience with preg's callback; I didn't even know it existed until effigy pointed it out. The docs say to return the replacement value, so this should be working.
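
As far as can be told from the code shown, a few things in the snippet above would each keep it from working. The opening quote is no longer captured, so `(?(2)\\2)` now refers to the URL group and requires the URL text to occur twice in a row. The callback's return value is used literally, so `${1}` is not expanded and the captures have to be read from `$matches`. And `$host` and `$this` are not available inside a plain function. Below is a minimal sketch of one way to arrange it, assuming PHP 5.3 or later for closures. `rewriteAttribute` and `$resolver` are hypothetical names; `$resolver` stands in for `abs_url`, whose code is not shown in this thread.

```php
<?php
// Sketch: rewrite one tag/attribute pair, delegating URL resolution to $resolver.
function rewriteAttribute($html, $tag, $attribute, $resolver)
{
    // Group 1 = other attributes, group 2 = opening quote (if any), group 3 = URL.
    $pattern = "/<$tag([^>]*)$attribute=([\"'])?(?!https?:|ftp:|javascript:)([^\"'\s>]+)(?(2)\\2)/is";

    return preg_replace_callback(
        $pattern,
        // "use" hands the outer variables to the closure; no globals needed.
        function ($matches) use ($tag, $attribute, $resolver) {
            $url = call_user_func($resolver, $matches[3]);
            return '<' . $tag . $matches[1] . $attribute . '="' . $url . '"';
        },
        $html
    );
}

// Placeholder resolver for the example only: prefix a fixed host.
$resolver = function ($url) {
    return 'http://www.somesite.com/' . ltrim($url, '/');
};

echo rewriteAttribute('<img src=logo.gif>', 'img', 'src', $resolver), "\n";
```

Inside a class, a method can be passed as the resolver with `array($this, 'abs_url')`, provided the method is callable from where it is invoked.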

Thanks,
B