
writing a screen scraper


gerkintrigg


Hello,

 

I'm writing a screen scraper application and want to be able to get absolute addresses for images from relative links.

 

So a link like this:

<img src="../e-commerce_in_a_box_small.jpg" alt="E-Commerce" width="100" height="134" border="0" />

might link to http://www.myointernational.com/furniture/e-commerce_in_a_box_small.jpg

 

If I am analysing a web address, I understand that the pseudo code would be something like this:

<?php 

$string='<img src="../../e-commerce_in_a_box_small.jpg" alt="E-Commerce" width="100" height="134" border="0" />';
// we need to find the system root and replace the ../ with REAL values.

$url='http://www.myointernational.com/test_dir/';
if($string contains '../'){
$number_of_them=count(the number of them);
}
$i=1
while($i<=$number_of_them){
$tmp_url=go up one level from the $url;
$i++;
}
?>
<img src="<?php echo $tmp_url;?>" alt="E-Commerce" width="100" height="134" border="0" />

 

How would I go about finding the code to make the pseudo code work?


I doubt this is perfect and it's probably not the best way of doing it, but something like this might work.

 

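$url = "http://www.google.com/something1/something2/something3";
$path = "../../hello.jpg";

// Treat $url as a directory: drop any trailing slash before trimming levels.
$url = rtrim($url, '/');
$path = ltrim($path, '\\');
// Count the "../" steps and strip that many trailing directories from $url.
if(($num_of_them = substr_count($path, '../')) > 0) {
    // Note: [a-z0-9-] will not match directory names containing dots or underscores.
    $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
    $path = $url . '/'. str_replace('../', '', $path);
    echo $path; // http://www.google.com/something1/hello.jpg
}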


Thanks cags, that's great!

 

The only issue I have with this is that I want to replace the path for every

src="WHATEVER"

Is there a way I can embed this within another preg_replace() function to loop through an entire document?

 

Also, it doesn't work if the path contains no ../, but that's no big deal because I can just replace ../ with the URL using str_replace().

 

Thanks again.


If the path contains no ../ then there are no real complications: you simply need an else block that checks whether the full URL needs concatenating to the start, and the job is done. As for looping through the whole page, I'm not sure. One solution might be preg_replace_callback, but I don't have any real experience with it. Another would be to use preg_match_all to build an array of the items to replace, loop through them to create the replacements, then call preg_replace, passing in the two arrays.
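For what it's worth, here is a rough, untuned sketch of the preg_replace_callback idea. The names make_absolute() and absolutise_srcs() are made up for illustration, the src pattern only handles double-quoted attributes, and the base URL is assumed to be a directory, as in the snippet earlier in the thread.

```php
<?php
// Hypothetical helper: same "count the ../" approach as above,
// treating $url as a directory.
function make_absolute($url, $path) {
    // Already an absolute http(s) link? Leave it alone.
    if (preg_match('#^https?://#i', $path)) {
        return $path;
    }
    $url = rtrim($url, '/');
    $num_of_them = substr_count($path, '../');
    if ($num_of_them > 0) {
        // Strip one trailing directory from $url per "../".
        $url = preg_replace('#(/[a-z0-9-]+){' . $num_of_them . '}$#iD', '', $url);
        $path = str_replace('../', '', $path);
    }
    return $url . '/' . $path;
}

// Hypothetical wrapper: rewrite every src="..." found in $html.
function absolutise_srcs($html, $url) {
    return preg_replace_callback(
        '#\bsrc="([^"]*)"#i',
        function ($m) use ($url) {
            // $m[1] is the original attribute value.
            return 'src="' . make_absolute($url, $m[1]) . '"';
        },
        $html
    );
}

$html = '<img src="../a.jpg" /><img src="b.jpg" />';
echo absolutise_srcs($html, 'http://www.example.com/one/two/');
```

The callback receives each match in turn, so the per-link logic stays in one place and the whole document is handled in a single pass.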


Thanks cags, I sorted out the case where there is no '../' by doing this:

<?php
$url = "myointernational.com/1/2/3/4/";
$path = '../../../../furniture/health_check_box.jpg';

$url = rtrim($url, '/');
$path = ltrim($path, '\\');
if(($num_of_them = substr_count($path, '../')) > 0) {
    // Strip one trailing directory from $url per "../" in $path.
    $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
    $path = $url . '/'. str_replace('../', '', $path);
}
else {
    // No "../": the path is relative to the current directory, so just prepend $url.
    $path = $url . '/' . $path;
}
echo $path;
?>

 

I'll look into preg_match_all() though.

 

Thanks.


I think that this may work within a preg_replace function:

<?php
$url='http://www.myointernational.com/1/2/3/4/5/6/7';
$path = '../../../../../../../furniture/health_check_box.jpg';
?>

<?php
// Returns an absolute address for $path, treating $url as the current directory.
function get_src($url,$path){
    $url = rtrim($url, '/');
    $path = ltrim($path, '\\');
    if(($num_of_them = substr_count($path, '../')) > 0) {
        // Strip one trailing directory from $url per "../" in $path.
        $url = preg_replace("#(/[a-z0-9-]+){{$num_of_them}}$#iD", '', $url);
        $path = $url . '/'. str_replace('../', '', $path);
    }
    else {
        // No "../": just prepend $url.
        $path = $url . '/' . $path;
    }
    return $path;
}

echo '<img src="'.get_src($url,$path).'">';
?>


@thebadbad, mainly because I didn't understand it and couldn't get it to work...

If I used your function, passing the URL of the root page (not necessarily a directory, but possibly a file) as $absolute and the link as $relative, would that work? I can try it, but I need to test the current system first... I have a few issues with Firefox and Ajax which need sorting before I implement any changes to the current script.


Now, it is pretty much working as I want it to, but I do have an issue with file name extensions:

If the URL is http://www.myointernational.com or http://www.myointernational.com/ (with the forward slash), both work fine. If, however, it is http://www.myointernational.com/index.php, then the grabbing of images fails.

 

Any suggestions? (And I re-ask thebadbad about the post above too).



Can you not strip out 'index.(php|php5|htm|html|asp|aspx)' from the current path? (root or folder)
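A slightly broader take on the same idea, as a rough sketch: reduce the page URL to its containing directory before resolving any links. The function name base_directory() is made up, and it assumes the URL includes a scheme and that a last segment containing a dot is a file name (so a directory called "v1.2" would be wrongly dropped). Ports and query strings are ignored.

```php
<?php
// Hypothetical helper: reduce a page URL to the directory that contains it.
function base_directory($url) {
    $parts = parse_url($url);
    $path = isset($parts['path']) ? $parts['path'] : '/';
    // Assumption: a final segment with a dot in it (index.php, page.html) is a file.
    $path = preg_replace('#/[^/]*\.[^/]*$#', '/', $path);
    // Make sure the result always ends in a slash.
    if (substr($path, -1) !== '/') {
        $path .= '/';
    }
    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

echo base_directory('http://www.myointernational.com/index.php');
```

With that in place, all three forms of the URL mentioned above should come out as the same directory.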


@thebadbad, mainly because I didn't understand it and couldn't get it to work...

If I used your function, passing the URL of the root page (not necessarily a directory, but possibly a file) as $absolute and the link as $relative, would that work? I can try it, but I need to test the current system first... I have a few issues with Firefox and Ajax which need sorting before I implement any changes to the current script.

 

Yes, that would work. And it's actually really simple to do it (given the relative2absolute() function), and by far the best solution as far as I know.

 

echo relative2absolute('http://example.com/folder/page.php', '../relative/link/file.php');
//http://example.com/relative/link/file.php
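For anyone reading without that function to hand, here is a minimal stand-in that behaves the same way for the example above. It is NOT the relative2absolute() function itself; resolve_url() is a made-up name. It is a simplified take on the usual resolution steps (drop the base file name, append the link, collapse "." and ".." segments), and it assumes the base URL has a scheme and host. It ignores ports and does not keep a trailing slash on directory links.

```php
<?php
// Hypothetical stand-in, not the relative2absolute() function mentioned above.
function resolve_url($base, $relative) {
    // Already absolute (has a scheme): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $relative)) {
        return $relative;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    $base_path = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($relative, 0, 1) === '/') {
        // Root-relative link: ignore the base path entirely.
        $path = $relative;
    } else {
        // Drop everything after the last "/" of the base (the file name),
        // then append the relative link.
        $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $relative;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    return $root . '/' . implode('/', $out);
}

echo resolve_url('http://example.com/folder/page.php', '../relative/link/file.php');
```

Because the base file name is dropped before the link is appended, a base of index.php and a base of its directory resolve the same way, which is the case that tripped up the "count the ../" approach.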
