gerkintrigg Posted December 17, 2009

Hello, I'm writing a screen scraper application and want to be able to get absolute addresses for images from relative links. So a link like this:

<img src="../e-commerce_in_a_box_small.jpg" alt="E-Commerce" width="100" height="134" border="0" />

might link to http://www.myointernational.com/furniture/e-commerce_in_a_box_small.jpg

If I am analysing a web address, I understand that the pseudo code would be something like this:

<?php
$string='<img src="../../e-commerce_in_a_box_small.jpg" alt="E-Commerce" width="100" height="134" border="0" />';
// we need to find the system root and replace the ../ with REAL values.
$url='http://www.myointernational.com/test_dir/';
if($string contains '../'){
    $number_of_them=count(the number of them);
}
$i=1
while($i<=$number_of_them){
    $tmp_url=go up one level from the $url;
    $i++;
}
?>
<img src="<?php echo $tmp_url;?>" alt="E-Commerce" width="100" height="134" border="0" />

How would I go about finding the code to make the pseudo code work?

Link to comment: https://forums.phpfreaks.com/topic/185447-writing-a-screen-scraper/
cags Posted December 17, 2009

I doubt this is perfect and it's probably not the best way of doing it, but something like this might work.

$url = "http://www.google.com/something1/something2/something3";
$path = "../../hello.jpg";
$url = rtrim($url, '/');
$path = ltrim($path, '\\');
if(($num_of_them = substr_count($path, '../')) > 0) {
    $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
    $path = $url . '/'. str_replace('../', '', $path);
    echo $path;
}
gerkintrigg Posted December 19, 2009 (Author)

Thanks cags, that's great! The only issue I have with this is that I want to replace the path for every src="WHATEVER". Is there a way I can embed this within another preg_replace() function to loop through an entire document?

Also, it doesn't work if the path contains no ../ but that's no big deal because I can just replace ../ with the URL using str_replace().

Thanks again.
cags Posted December 19, 2009

If the path contains no ../ then there are no real complications: you simply need an else block that checks whether a full path needs concatenating to the start or not, and job done.

With regards to looping through the page, I'm not sure. The solution might be to use preg_replace_callback, but I don't have any real experience using it. Another method would be to use preg_match_all to create an array of items to replace, loop through them creating the replacements, then call preg_replace, passing in the two arrays.
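The else block described above could be sketched roughly like this. The helper name needs_base_url() is made up for illustration and is not from the thread; it assumes that anything starting with a scheme (http:, https:, data:) or with // is already absolute. Root-relative links such as /images/a.png are not handled here and would need the host only, not the full directory URL.

```php
<?php
// Hypothetical helper (not from the thread): does this src value still
// need the page's base URL put in front of it?
function needs_base_url($path)
{
    // Protocol-relative links such as //cdn.example.com/a.png already name a host.
    if (strpos($path, '//') === 0) {
        return false;
    }
    // A leading scheme (http:, https:, data:, ...) means the link is already absolute.
    return preg_match('#^[a-z][a-z0-9+.-]*:#i', $path) !== 1;
}

$url  = 'http://www.myointernational.com/furniture';
$path = 'health_check_box.jpg';

if (substr_count($path, '../') > 0) {
    // ... the ../ handling from the earlier post would go here ...
} elseif (needs_base_url($path)) {
    // Plain relative link: put the base URL in front of it.
    $path = rtrim($url, '/') . '/' . ltrim($path, '/');
}

echo $path, "\n";
```

Links that are already absolute fall through both branches untouched, which is the "or not" half of the check.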
thebadbad Posted December 19, 2009

Have a look at this post: http://www.phpfreaks.com/forums/index.php/topic,276359.msg1307247.html#msg1307247
gerkintrigg Posted December 19, 2009 (Author)

Thanks cags, I sorted the issue if there is no '../' by doing this:

<?php
$url = "myointernational.com/1/2/3/4/";
$path = '../../../../furniture/health_check_box.jpg';
$url = rtrim($url, '/');
$path = ltrim($path, '\\');
if(($num_of_them = substr_count($path, '../')) > 0) {
    $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
    $path = $url . '/'. str_replace('../', '', $path);
}
else{
    $path=str_replace($path,($url.'/'.$path),$path);
}
echo $path;
?>

I'll look into preg_match_all() though. Thanks.
gerkintrigg Posted December 19, 2009 (Author)

Oh, now that is very handy! Thanks thebadbad
gerkintrigg Posted December 19, 2009 (Author)

I think that this may work within a preg_replace function:

<?php
$url='http://www.myointernational.com/1/2/3/4/5/6/7';
$path = '../../../../../../../furniture/health_check_box.jpg';
?>
<?php
function get_src($url,$path){
    $url = rtrim($url, '/');
    $path = ltrim($path, '\\');
    if(($num_of_them = substr_count($path, '../')) > 0) {
        $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
        $path = $url . '/'. str_replace('../', '', $path);
    }
    else{
        $path=str_replace($path,($url.'/'.$path),$path);
    }
    return $path;
}
echo '<img src="'.get_src($url,$path).'">';
?>
gerkintrigg Posted December 19, 2009 (Author)

I wonder if someone could provide the right syntax for this?

# now replace the "src=" urls with real ones:
$pattern='~src="(.?*)"~';
$new_url='~src="get_src($1,$path)"~';
$page = preg_replace($pattern, $new_url, $page);
cags Posted December 19, 2009

To my knowledge there is no 'right syntax' for calling a function on a captured value inside a preg_replace replacement pattern, which is why I suggested preg_replace_callback.
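For what it's worth, a preg_replace_callback version might look something like the sketch below. It is only an illustration: get_src_demo() is a simplified stand-in for the get_src() function posted earlier so that the example runs on its own, and the sample $page markup is invented. Note that the lazy quantifier is written .*? (the (.?*) in the question is not a valid pattern), and that the captured src value is the path argument, not the URL. The anonymous function needs PHP 5.3 or later.

```php
<?php
// Simplified stand-in for the thread's get_src($url, $path); assumption
// for this sketch only, it does not handle ../ segments.
function get_src_demo($url, $path)
{
    return rtrim($url, '/') . '/' . ltrim($path, '/');
}

$url  = 'http://www.myointernational.com/furniture';
$page = '<p><img src="a.jpg" alt="A"> <img src="img/b.png" alt="B"></p>';

// preg_replace_callback() passes each match to the callback as an array:
// $m[0] is the whole match, $m[1] is the first capture group (the src value).
$page = preg_replace_callback(
    '~src="(.*?)"~i',
    function ($m) use ($url) {
        return 'src="' . get_src_demo($url, $m[1]) . '"';
    },
    $page
);

echo $page, "\n";
```

Swapping get_src_demo() for the real get_src() should be enough to run it over a whole scraped document, since the callback is invoked once per src attribute found.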
thebadbad Posted December 25, 2009

@OP Why aren't you using the method from the link I posted? It's much more robust.
gerkintrigg Posted December 26, 2009 (Author)

@thebadbad, mainly because I didn't understand it and couldn't get it to work... if I used your function to parse the url of the root page (not necessarily a directory, but also a file) as $absolute and the link as $relative; would that work? I can try it, but need to test the current system first... I have a few issues with Firefox and ajax which need sorting before I implement any changes to the current script.
gerkintrigg Posted December 27, 2009 (Author)

Now, it is pretty-much working as I want it to, but I do have an issue with file name extensions: If the URL is http://www.myointernational.com or http://www.myointernational.com/ (with the forward slash), both work fine. If, however it is: http://www.myointernational.com/index.php then the grabbing of images fails. Any suggestions? (And I re-ask thebadbad about the post above too).
oni-kun Posted December 27, 2009

> Now, it is pretty-much working as I want it to, but I do have an issue with file name extensions: If the URL is http://www.myointernational.com or http://www.myointernational.com/ (with the forward slash), both work fine. If, however it is: http://www.myointernational.com/index.php then the grabbing of images fails. Any suggestions? (And I re-ask thebadbad about the post above too).

Can you not strip out 'index.(php|php5|htm|html|asp|aspx)' from the current path (root or folder)?
gerkintrigg Posted December 27, 2009 (Author)

In theory... I'm unsure of the syntax because, rather than just index.(whatever), it would also need to cope with whatever.whatever
gerkintrigg Posted December 28, 2009 (Author)

Eventually I did it like this... I'm not sure if it's the best way, but it works just fine:

$url_file_name=explode('/',$url);
foreach($url_file_name as $key=>$value){
    $my_file_name=$value;
}
$url=str_replace($my_file_name,'',$url);
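One thing to watch with the snippet above: for a URL with no path at all (http://www.myointernational.com) the last segment is the host name, and str_replace() removes every occurrence of the segment, not just the last one. A possible alternative, sketched with PHP's parse_url(), is below. The function name base_dir_of() is made up for illustration, and it assumes a well-formed absolute http(s) URL. It treats a last segment without a trailing slash as a file, the way browsers do.

```php
<?php
// Hypothetical helper (not from the thread): return the "directory" part
// of a page URL, always ending in a slash.
function base_dir_of($url)
{
    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    $port  = isset($parts['port']) ? ':' . $parts['port'] : '';

    // Keep everything up to and including the last slash of the path,
    // which drops a trailing file name such as index.php. The query
    // string and fragment are discarded because parse_url() separates them.
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $parts['scheme'] . '://' . $parts['host'] . $port . $dir;
}

echo base_dir_of('http://www.myointernational.com/index.php'), "\n";
echo base_dir_of('http://www.myointernational.com/furniture/page.php?id=3'), "\n";
```

Because only the path component is trimmed, the host name is never touched, whether or not the URL ends in a file name.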
thebadbad Posted December 28, 2009

> @thebadbad, mainly because I didn't understand it and couldn't get it to work... if I used your function to parse the url of the root page (not necessarily a directory, but also a file) as $absolute and the link as $relative; would that work? I can try it, but need to test the current system first... I have a few issues with Firefox and ajax which need sorting before I implement any changes to the current script.

Yes, that would work. And it's actually really simple to do it (given the relative2absolute() function), and by far the best solution as far as I know.

echo relative2absolute('http://example.com/folder/page.php', '../relative/link/file.php');
//http://example.com/relative/link/file.php
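The relative2absolute() function itself lives in the linked post and is not reproduced in this thread. For readers who want a feel for what such a resolver does, here is a rough, independent sketch. It is NOT thebadbad's function: resolve_link_demo() is a made-up name, it assumes a well-formed absolute base URL, and it ignores edge cases such as empty links and query strings or fragments that contain slashes.

```php
<?php
// Illustrative only; not the relative2absolute() from the linked post.
function resolve_link_demo($base, $relative)
{
    // Already absolute (has a scheme): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $b = parse_url($base);

    // Protocol-relative link (//host/path): borrow only the scheme.
    if (strpos($relative, '//') === 0) {
        return $b['scheme'] . ':' . $relative;
    }

    $root     = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = isset($b['path']) ? $b['path'] : '/';

    if (strpos($relative, '/') === 0) {
        // Root-relative link: ignore the base path entirely.
        $path = $relative;
    } else {
        // Drop the base's file name (e.g. page.php), then append the link.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Walk the segments, resolving "." and ".."; going above the root is clamped.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }

    // Preserve a trailing slash so directory links stay directory links.
    $tail = (substr($path, -1) === '/' && count($out) > 0) ? '/' : '';

    return $root . '/' . implode('/', $out) . $tail;
}

echo resolve_link_demo('http://example.com/folder/page.php', '../relative/link/file.php'), "\n";
// http://example.com/relative/link/file.php
```

Working on path segments rather than on a regex over the URL means directory names with dots or underscores, and base URLs that end in a file name, are handled the same way as the simple cases.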