linuxdream Posted January 17, 2007

Hello all,

Thanks in advance for your help. I have been having some problems trying to fix all relative links within HTML pulled from various web sites. I have written a class called RipCURL that makes web parsing and extracting of data quick and easy. However, the original method I used to fix the relative links did not work for all types of links allowed (single, double, or no quote after the =, etc.). For instance, with my class you can do something like this:

[code]<?php
$url = "http://www.google.com";
$html = $rip->ripRun($url, 1);
echo $html;
?>[/code]

and the output would be the HTML provided by the $url. However, I'm having a problem with the preg_replace(). Here is the actual method. It takes the base URL used to fix all links (overridden if there is a <BASE> tag in the document) as well as the actual HTML to parse. It returns the parsed data and also sets a class variable with the clean HTML:

[code]<?php
public function fixLinks($baseUrl, $html = null){
    if(is_null($html)){
        $html = $this->rawHtml;
    }
    $tagAttributes=array(
        'table'=>'background',
        'td'=>'background',
        'tr'=>'background',
        'th'=>'background',
        'body'=>'background',
        'a'=>'href',
        'link'=>'href',
        'area'=>'href',
        'form'=>'action',
        'script'=>'src',
        'img'=>'src',
        'iframe'=>'src',
        'frame'=>'src',
        'embed'=>'src');

    //Get hostname for relative URLs
    $host = parse_url($baseUrl);
    $host = $host['scheme'] . "://" . $host['host'] . "/";

    if(preg_match('/<base(?:.*?)href=["\']?([^\'"\s]*)[\'"\s]?/is', $html, $base)){
        $baseUrl = $base[1];
        $host = $baseUrl;
    }

    // Append a trailing slash to the url if it doesn't exist
    if (substr($baseUrl, -1, 1) !='/'){
        $baseUrl.='/';
    }

    //Works for everything but relative paths like href="someimage.jpg"
    foreach($tagAttributes as $tag=>$attribute){
        $pattern="/<$tag([^>]*)$attribute=[\"']*(?!http:|ftp:|https:|javascript:)\/([^\"'\s>]*)[\"']?/is";
        $replace="<$tag\${1}$attribute=\"$host\${2}\"";
        $html=preg_replace($pattern, $replace, $html);
    }

    //This was added recently to make the direct paths work...but I don't know why it doesn't match???
    //Does not work for direct paths like href=someimage.jpg
    foreach($tagAttribute as $tag=>$attribute){
        $pattern="/<$tag([^>]*)$attribute=[\"']*(?!http:|ftp:|https:|javascript:)([^\"'\s>]*)[\"']?/is";
        $replace="<$tag\${1}$attribute=\"$baseUrl\${2}\"";
        $html=preg_replace($pattern, $replace, $html);
    }

    $this->rawHtml = $html;
    return $html;
}
?>[/code]

Now I know it's not the most efficient way, but at this point I'm just trying to get it to work; then I'll work on efficiency. The problem is that any relative path never seems to match the second regex. So everything like href=/mydir/image.gif and /newfile.php always works and gets replaced with the proper href=http://www.mysite.com/mydir/image.gif, etc. But relative ones don't. They simply remain relative without the added host or base URL.

Any ideas would be a great help. This project is on SourceForge if anyone is interested: http://sourceforge.net/projects/ripcurl/

Again, thanks a lot. This is one of the last little things that has been really bugging me with this project.

Brandon C.
effigy Posted January 17, 2007

Make sure the enclosing quotes match (the addition of \2), make the slash optional, and require content (changed * to +).

[code]<pre>
<?php
    // Hardcoded for testing
    $host = 'www.phpfreaks.com/';
    // Reduced for testing
    $tagAttributes=array('a'=>'href');
    // Test data
    $tests = array(
        '<a href="/abc.jpg">...</a>',
        '<a target="_blank" href="../page.php">...</a>',
        "<a href='../../123.html' target='_top'>...</a>",
        '<a href=1.cgi>...</a>',
        '<a target="_top" href="http://www.google.com">...</a>',
    );
    // Run tests
    foreach ($tests as $test) {
        //Works for everything but relative paths like href="someimage.jpg"
        foreach($tagAttributes as $tag =>$attribute){
            $pattern="/<$tag([^>]*)$attribute=([\"'])?(?!https?:|ftp:|javascript:)\/?([^\"'\s>]+)(?(2)\\2)/is";
            $replace="<$tag\${1}$attribute=\"$host\${3}\"";
            // $html changed to $test
            $test = preg_replace($pattern, $replace, $test);
            echo htmlspecialchars($test), '<br>';
        }
    }
?>
</pre>[/code]

[b]Yields:[/b]

[code]<a href="www.phpfreaks.com/abc.jpg">...</a>
<a target="_blank" href="www.phpfreaks.com/../page.php">...</a>
<a href="www.phpfreaks.com/../../123.html" target='_top'>...</a>
<a href="www.phpfreaks.com/1.cgi">...</a>
<a target="_top" href="http://www.google.com">...</a>[/code]
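To see why the change from * to + matters, here is a small comparison. It is an illustration only: compareLinkPatterns() and the http://example.com/ base are assumptions, not code from the thread, and it uses type declarations from newer PHP versions than the thread dates from. With *, the engine can backtrack to an empty URL and still match href=" in front of an absolute link, so the base gets injected before the real URL. With + and the back-referenced quote, the absolute link is left alone. Separately, the second loop in the original method iterates over $tagAttribute (no trailing s), which is never defined in that method, so that loop's body would not run at all.

```php
<?php
// Illustration only: compareLinkPatterns() and the example.com base are hypothetical.
function compareLinkPatterns(string $html, string $base = 'http://example.com/'): array
{
    // Second pattern from the original method: quotes and URL may both be empty.
    $loose = '/<a([^>]*)href=["\']*(?!http:|ftp:|https:|javascript:)([^"\'\s>]*)["\']?/is';
    // Revised pattern: captured quote, optional slash, at least one URL character.
    $strict = '/<a([^>]*)href=(["\'])?(?!https?:|ftp:|javascript:)\/?([^"\'\s>]+)(?(2)\2)/is';

    return array(
        'loose'  => preg_replace($loose, '<a${1}href="' . $base . '${2}"', $html),
        'strict' => preg_replace($strict, '<a${1}href="' . $base . '${3}"', $html),
    );
}

$result = compareLinkPatterns('<a href="http://www.google.com">...</a>');
echo $result['loose'], "\n";  // the base is injected in front of the absolute URL
echo $result['strict'], "\n"; // the absolute link is left unchanged
```

Both patterns rewrite an unquoted relative link such as href=1.cgi the same way; they only differ on input where the URL part can end up empty.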
linuxdream Posted January 17, 2007 Author

Thanks for the quick response, effigy. It works really well when the host is a base URL with no other directories after it. It breaks when there are directories: change $host to www.phpfreaks.com/links/, and a link to /myimage.gif should read www.phpfreaks.com/myimage.gif, not www.phpfreaks.com/links/myimage.gif. So my only other question is about the host/base URL that gets substituted. Towards the top of the method, I set both a $host, which is simply the host part of the URL passed into the function and is used for document-root links (/myweb/mypage.htm), and a $baseUrl, which is used for relative links (myweb/mypage.htm). So my question is: since the prefix differs depending on whether the link is relative or a document-root link, how do I test for that and substitute the right value? I tried working with preg's if/else (which I saw you so elegantly use in your solution), but unfortunately my knowledge of that special feature is rather limited.

Can you think of anything similar to this? I know the replace expression can't use regex, but I'm hoping you get what I mean:

[code]<?php
$pattern="/<$tag([^>]*)$attribute=([\"'])?(?!https?:|ftp:|javascript:)(\/)([^\"'\s>]+)(?(2)\\2)/is";
$replace="<$tag\${1}$attribute=\"(?(3)$host|$baseUrl)\${4}\"";[/code]

Thanks again for your help and ideas.

Brandon C.
effigy Posted January 17, 2007

Use [tt]preg_replace_callback[/tt] to get full programming capabilities during the replace.
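A minimal sketch of that suggestion, assuming a free-standing helper named absolutizeLinks() (a hypothetical name, not part of RipCURL) and a PHP version with closures and type declarations, both of which postdate this thread. The leading slash is captured in its own optional group, and the callback picks $host or $baseUrl depending on whether that group matched. Protocol-relative links (//host/path), other schemes such as mailto:, and dot segments are not handled here.

```php
<?php
// Sketch only: absolutizeLinks() is a hypothetical helper, not RipCURL's API.
function absolutizeLinks(string $html, string $host, string $baseUrl, array $tagAttributes): string
{
    foreach ($tagAttributes as $tag => $attribute) {
        // Groups: 1 = preceding attributes, 2 = optional quote,
        //         3 = optional leading slash, 4 = rest of the URL.
        $pattern = '/<' . $tag . '([^>]*)' . $attribute
                 . '=(["\'])?(?!https?:|ftp:|javascript:)(\/)?([^"\'\s>]+)(?(2)\2)/is';

        $html = preg_replace_callback(
            $pattern,
            function (array $m) use ($tag, $attribute, $host, $baseUrl) {
                // A leading slash means "relative to the document root": use the host.
                // Anything else is relative to the base URL.
                $prefix = ($m[3] === '/') ? $host : $baseUrl;
                return '<' . $tag . $m[1] . $attribute . '="' . $prefix . $m[4] . '"';
            },
            $html
        );
    }
    return $html;
}

echo absolutizeLinks(
    '<a href="/myimage.gif">root</a> <a href=page.htm>relative</a>',
    'http://www.phpfreaks.com/',
    'http://www.phpfreaks.com/links/',
    array('a' => 'href')
), "\n";
```

Inside the callback, groups that did not take part in the match but are followed by a matching group arrive as empty strings, which is why the comparison against '/' is enough to tell the two cases apart.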
ShogunWarrior Posted January 17, 2007

I wrote a function that makes absolute URLs for a similar project. It creates all URLs according to the RFC specification.

http://daviddoranmedia.com/projects/abs-url

Hope it helps, that's what I wrote it for.
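For readers who cannot reach that link, here is a rough sketch of what RFC-style resolution involves, loosely following the reference-resolution and dot-segment rules of RFC 3986. The names resolveUrl() and removeDotSegments() are hypothetical and this is not the code from the abs-url project. It assumes an http-style base URL with a scheme and host, and it ignores user info.

```php
<?php
// Sketch only: hypothetical helpers, not the functions from the linked project.

// Collapse "." and ".." segments in a path that starts with "/".
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = end($segments);
    $output = array();
    foreach ($segments as $segment) {
        if ($segment === '..') {
            if (count($output) > 1) {
                array_pop($output);      // go up one level, but never above the root
            }
        } elseif ($segment !== '.') {
            $output[] = $segment;
        }
    }
    if ($last === '.' || $last === '..') {
        $output[] = '';                  // a trailing dot segment leaves a trailing slash
    }
    return implode('/', $output);
}

function resolveUrl(string $base, string $relative): string
{
    // Already absolute (has a scheme such as http: or mailto:): leave it alone.
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $relative)) {
        return $relative;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (substr($relative, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $relative;   // protocol-relative reference
    }

    // Split off ?query and #fragment so only the path is merged and normalised.
    $cut = strcspn($relative, '?#');
    $path = substr($relative, 0, $cut);
    $suffix = substr($relative, $cut);
    $basePath = $parts['path'] ?? '/';

    if ($path === '') {
        // Query-only or fragment-only reference: keep the base path as it is.
        $keepQuery = ($suffix === '' || $suffix[0] === '#') && isset($parts['query']);
        return $root . $basePath . ($keepQuery ? '?' . $parts['query'] : '') . $suffix;
    }
    if ($path[0] === '/') {
        $merged = $path;                             // document-root reference
    } else {
        // Relative reference: replace everything after the last slash of the base path.
        $merged = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }
    return $root . removeDotSegments($merged) . $suffix;
}

echo resolveUrl('http://www.phpfreaks.com/links/', '../page.php'), "\n";
```

Unlike plain prefixing, this approach turns ../page.php into a clean path instead of leaving the dot segments in the result.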
linuxdream Posted January 17, 2007 Author

Thanks, ShogunWarrior. That's a pretty freaking cool function(s). So my basic layout right now is like so:

[code]<?php
$host = "http://www.somesite.com/";
foreach($tagAttributes as $tag=>$attribute){
    $pattern="/<$tag([^>]*)$attribute=[\"']?(?!https?:|ftp:|javascript:)([^\"'\s>]+)(?(2)\\2)/is";
    $html = preg_replace_callback($pattern, 'preg_callback', $html);
}

function preg_callback($matches){
    global $tag, $attribute;
    $url = $this->abs_url($host, $matches[2], 1); //abs_url is included into the class as a private method below
    $replace = "<$tag\${1}$attribute=\"$url\${3}\"";
    return $replace;
}

echo $html; //Should output corrected links, src's, etc. but doesn't. Just outputs the same input, so basically nothing matches.
?>[/code]

But it's still not replacing the links with the full URL. I'm sure it's because I'm not using the callback function properly... any corrections you see? The output is simply the same as the input, so nothing is matching. I have ZERO experience with preg's callback; I didn't even know it existed until effigy pointed it out. The docs say to return the replacement value, so this should be working.

Thanks,
B
ShogunWarrior Posted January 17, 2007

Yeah, it should just be abs_url(); there is no need for [b]$this->[/b], and it should work fine. I saw your comment about it being included in the class, so as long as the other utility function [b]parse_segments[/b] and the constants are also included, it should work.
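A few other things in the callback layout above look likely to get in the way, independent of how abs_url() is called. In that pattern the quote is no longer captured, so group 2 is the URL itself and (?(2)\2) demands the URL be repeated, which would explain why almost nothing matches. A plain function also has no $this, and global only sees true globals, not variables local to a method. Finally, ${1} inside the callback's return value is plain text, so the captured pieces have to be read from $matches. The sketch below shows one way to arrange it inside a class on a newer PHP version (closures did not exist when the thread was written). LinkFixer and its absUrl() method are hypothetical stand-ins, and the real abs_url() from the linked project is not reproduced.

```php
<?php
// Sketch only: LinkFixer and its stand-in absUrl() are hypothetical.
class LinkFixer
{
    private $tagAttributes = array('a' => 'href', 'img' => 'src');

    public function fixLinks(string $host, string $html): string
    {
        foreach ($this->tagAttributes as $tag => $attribute) {
            // Group 2 captures the quote so that (?(2)\2) refers to it, not to the URL.
            $pattern = '/<' . $tag . '([^>]*)' . $attribute
                     . '=(["\'])?(?!https?:|ftp:|javascript:)([^"\'\s>]+)(?(2)\2)/is';

            // The closure can use $this and receives the loop variables through use().
            $html = preg_replace_callback(
                $pattern,
                function (array $m) use ($tag, $attribute, $host) {
                    $url = $this->absUrl($host, $m[3]);
                    // Backreferences such as ${1} are not expanded here; read $m instead.
                    return '<' . $tag . $m[1] . $attribute . '="' . $url . '"';
                },
                $html
            );
        }
        return $html;
    }

    // Stand-in resolver: joins host and path with exactly one slash.
    private function absUrl(string $host, string $path): string
    {
        return rtrim($host, '/') . '/' . ltrim($path, '/');
    }
}

$fixer = new LinkFixer();
echo $fixer->fixLinks('http://www.somesite.com/', '<a href="/a/b.htm">x</a><img src=pic.gif>'), "\n";
```

If abs_url() is kept as a global function instead, as suggested above, the call inside the closure would simply drop $this->.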