gerkintrigg Posted December 19, 2009

I want to get absolute URLs for my screen-scraper app, which fetches HTML code and renders it in the browser with a few changes to spelling etc. To make sure it grabs CSS and images properly, I have played about with different ways of getting absolute URLs. I ended up with the following code:

    <?php
    $url='http://www.google.com/one_level_up/2_levels_up/3levels_up/';
    $path = '../../intl/en/images/logo.gif'; #this should display
    $path2 = '../../../intl/en/images/logo.gif'; # this should not
    $page='<img src="'.$path.'"> Don\'t Display This: <img src="'.$path2.'">';

    #------------------------- Function Below --------------------------------
    function get_src($tmp_url,$path){
        $tmp_url = rtrim($tmp_url, '/');
        $path = ltrim($path, '\\');
        if(($num_of_them = substr_count($path, '../')) > 0) {
            $tmp_url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $tmp_url);
            $path = $tmp_url . '/'. str_replace('../', '', $path);
        }
        else{
            $path=str_replace($path,($tmp_url.'/'.$path),$path);
        }
        return $path;
    }
    ?>Display This: <?php
    #-------------------------- get the right URLs ------------------------------
    function real_links($matches){
        return 'src="'.get_src('http://www.google.com/one/two/',$matches[1]).'"';
        # you'll note that the Google url needs to be defined, rather than a variable... why is that?
    }
    $page=preg_replace_callback('~src="(.*?)"~','real_links',$page);
    echo $page;
    ?>

You'll note that the Google URL needs to be hard-coded rather than passed as a variable... why is that? How can I replace it with the $url variable at the top of the code without causing an error?
dreamwest Posted December 20, 2009

getcwd();
gerkintrigg Posted December 20, 2009

Thanks, but the $url variable should be able to be any URL, not only a local one...
dreamwest Posted December 20, 2009

So you have 2 paths. The URL (which is dynamic):

    http://www.google.com/one_level_up/2_levels_up/3levels_up/
    or http://www.google.com/one_level_up/2_levels_up/
    or http://www.google.com/one_level_up/

and the path (which is also dynamic):

    ../../intl/en/images/logo.gif

So what you're looking for is the full image URL? E.g.:

    http://www.google.com/one_level_up/2_levels_up/intl/en/images/logo.gif
gerkintrigg Posted December 20, 2009

In my first example I showed how I could get the correct URLs from any web page. By replacing the $url variable with $_POST['url'], this can easily be changed to react to user input. The problem is that the (currently static) URL in the callback function doesn't like being made a variable. While the web page URL can be defined from a form post, the callback needs to be defined in the code itself. I'm not sure whether the syntax is wrong or I am just doing something that's not strictly allowed.

To clarify, this line:

    return 'src="'.get_src('http://www.google.com/one/two/',$matches[1]).'"';

will not work if it reads like this:

    return 'src="'.get_src($tmp_url,$matches[1]).'"';
trq Posted December 20, 2009

That is because $tmp_url is not defined within real_links(). Considering it's a callback and can only have specific arguments passed to it, you may need to use the global keyword to have it see the $tmp_url variable.
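As an alternative to global, and assuming PHP 5.3 or later, an anonymous function can import the variable into the callback's scope with use. The sketch below only illustrates the scoping; resolve_stub() and the example.com URL are hypothetical stand-ins, not the get_src() function from the thread.

```php
<?php
// Sketch only: assumes PHP 5.3+ (anonymous functions).
// resolve_stub() is a hypothetical stand-in for get_src(), kept trivial
// so that variable scoping is the only thing being demonstrated.
function resolve_stub($base, $path) {
    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

$base = 'http://www.example.com/one/two/';
$page = '<img src="images/logo.gif">';

// "use ($base)" copies $base into the callback's scope. Without it, $base
// would be undefined inside the function body, just like $tmp_url above.
$page = preg_replace_callback(
    '~src="(.*?)"~',
    function ($matches) use ($base) {
        return 'src="' . resolve_stub($base, $matches[1]) . '"';
    },
    $page
);

echo $page, "\n";
```

Compared with global, this keeps the base URL local to the call site, so the same code could be run with several different base URLs in one script.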
gerkintrigg Posted December 20, 2009

Thanks Thorpe. That looks like a handy tip. I tried that out and read through the literature for the global keyword, but the following code doesn't strip the folder levels like the get_src() function should, and doesn't act quite like the hard-coded version:

    <?php
    $url='http://www.google.com/one_level_up/2_levels_up/3levels_up/';
    $my_url=$url;
    $path = '../../intl/en/images/logo.gif'; #this should display
    $path2 = '../../../intl/en/images/logo.gif'; # this should not
    $page='<img src="'.$path.'"> Don\'t Display This: <img src="'.$path2.'">';

    #------------------------- Function Below --------------------------------
    function get_src($tmp_url,$path){
        $tmp_url = rtrim($tmp_url, '/');
        $path = ltrim($path, '\\');
        if(($num_of_them = substr_count($path, '../')) > 0) {
            $tmp_url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $tmp_url);
            $path = $tmp_url . '/'. str_replace('../', '', $path);
        }
        else{
            $path=str_replace($path,($tmp_url.'/'.$path),$path);
        }
        return $path;
    }
    ?>Display This: <?php
    #-------------------------- get the right URLs ------------------------------
    function real_links($matches){
        global $my_url;
        return 'src="'.get_src($my_url,$matches[1]).'"';
    }
    $page=preg_replace_callback('~src="(.*?)"~','real_links',$page);
    echo $page;
    ?>
trq Posted December 20, 2009

What do you mean by cuts off the url? Can you show us an example?
gerkintrigg Posted December 20, 2009

Sorry, I edited that last post because I thought it was confusing too... It outputs this code:

    Display This: <img src="http://www.google.com/one_level_up/2_levels_up/3levels_up/intl/en/images/logo.gif"> Don't Display This: <img src="http://www.google.com/one_level_up/2_levels_up/3levels_up/intl/en/images/logo.gif">

The $my_url variable doesn't have the operations applied to it that the get_src() function is supposed to perform (and DOES perform when the URL is hard-coded)...

If it's not a global, it returns:

    Display This: <img src="/intl/en/images/logo.gif"> Don't Display This: <img src="/intl/en/images/logo.gif">

which is not right either... I think the global is handy. Can I define the get_src() function as global too?
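A likely explanation for the difference, offered as a reading of the code rather than something confirmed in the thread: the character class [a-z0-9-] in get_src() has no underscore, so preg_replace() finds nothing to strip from segments such as /2_levels_up and hands the URL back unchanged. The hard-coded http://www.google.com/one/two/ contains no underscores, which would explain why it behaved. The sketch below isolates that one step; strip_levels() is a hypothetical helper, and widening the class to [\w-] is my assumption about the intent.

```php
<?php
// Hypothetical helper isolating the preg_replace() step from get_src().
// $segment is the character class allowed inside one path segment.
function strip_levels($url, $levels, $segment) {
    $url = rtrim($url, '/');
    return preg_replace("#(/{$segment}+){{$levels}}$#iD", '', $url);
}

$url = 'http://www.google.com/one_level_up/2_levels_up/3levels_up/';

// Original class: "_" is not allowed, so no match and nothing is removed.
$original = strip_levels($url, 2, '[a-z0-9-]');

// Widened class (an assumption about the intent): \w also covers "_".
$widened = strip_levels($url, 2, '[\w-]');

echo $original, "\n"; // http://www.google.com/one_level_up/2_levels_up/3levels_up
echo $widened, "\n";  // http://www.google.com/one_level_up
```

As for the last question, functions defined at the top level are already callable from inside other functions in PHP, so get_src() does not need to be declared global; only variables do.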
gerkintrigg Posted December 20, 2009

Okay, so I still don't know how to get the variable to work, but I removed the need for it by putting the content of the get_src() function inside the callback like this:

    #-------------------------- get the right URLs ------------------------------
    function real_links($matches){
        global $my_url;
        #------------ function --------------
        $tmp_url = rtrim($my_url, '/');
        $path = ltrim($matches[1], '\\');
        if(($num_of_them = substr_count($path, '../')) > 0) {
            $tmp_url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $tmp_url);
            $path = $tmp_url . '/'. str_replace('../', '', $path);
        }
        else{
            $path=str_replace($path,($tmp_url.'/'.$path),$path);
        }
        #--------- end function ---------
        return 'src="'.$path.'"';
    }
    $page=preg_replace_callback('~src="(.*?)"~','real_links',$page);
    echo $page;
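For comparison, the ../ handling can also be done without a regular expression by walking the path one segment at a time. This is a minimal sketch under stated assumptions, not the approach used in the thread: resolve_relative() is a hypothetical name, the base is assumed to be an absolute URL ending in a directory, only document-relative paths are handled (no leading slash, no scheme), and ports, query strings and fragments are ignored. Extra ../ beyond the root are simply dropped, which is my assumption about the desired behaviour.

```php
<?php
// Sketch: resolve a document-relative path against a base URL by segments.
// Hypothetical helper; see the assumptions listed in the text above.
function resolve_relative($base, $path) {
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];

    // Directory segments of the base, e.g. array('one', 'two').
    $basePath = isset($parts['path']) ? trim($parts['path'], '/') : '';
    $segments = $basePath === '' ? array() : explode('/', $basePath);

    foreach (explode('/', $path) as $piece) {
        if ($piece === '..') {
            array_pop($segments);   // climb one level; does nothing at the root
        } elseif ($piece !== '.' && $piece !== '') {
            $segments[] = $piece;   // descend into a directory or name the file
        }
    }

    return $root . '/' . implode('/', $segments);
}

$base = 'http://www.google.com/one_level_up/2_levels_up/3levels_up/';

echo resolve_relative($base, '../../intl/en/images/logo.gif'), "\n";
echo resolve_relative($base, '../../../intl/en/images/logo.gif'), "\n";
```

Because segments are compared as whole strings, characters such as underscores or dots in directory names need no special treatment.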