file_get_contents($url) vs WWW::Mechanize::Firefox - a methodological question!

dilbertone · December 11, 2011

Good day, dear PHP Freaks,

This post is related to an image-display topic. I have a list of 5,500 websites and need to grab a small screenshot of each one, to create a thumbnail that is ready to show on a website. How do I do that?

Dynamically, by using file_get_contents($url):

$url = 'http://www.example.com';

$output = file_get_contents($url);

Or should I first download all the images,

secondly store them in a folder on the server (as thumbnails), and

thirdly retrieve them with a certain call?
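The second option (download first, store in a folder, retrieve with a call) could look roughly like the sketch below. It is only an illustration: the function names, the ./thumb folder, the md5-based file name and placeholder.png are all assumptions made for the example.

```php
<?php
// Hypothetical helper: maps a site URL to the path of its stored thumbnail.
function thumbPathFor(string $url, string $dir = './thumb'): string
{
    // Hashing the URL gives a safe, fixed-length file name,
    // whatever characters the URL contains.
    return rtrim($dir, '/') . '/' . md5($url) . '.png';
}

// Returns the stored thumbnail if it exists, otherwise a placeholder image
// (placeholder.png is an assumed file name).
function thumbOrPlaceholder(
    string $url,
    string $dir = './thumb',
    string $placeholder = 'placeholder.png'
): string {
    $path = thumbPathFor($url, $dir);
    return is_file($path) ? $path : $placeholder;
}

// Usage: the gallery page only ever touches local files.
echo thumbOrPlaceholder('http://www.example.com'), "\n";
```

With this layout the page that shows the gallery never contacts the 5,500 sites itself; a separate job fills the folder.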

The goal: I want to retrieve the image of a given website as a screenshot. As an example of what I have in mind, have a look at www.drupal.org and see "Sites Made with Drupal". The image there changes from time to time; it changes on every visit (I guess). How do they do that? What's the solution?

But with PHP it is easy to get the HTML contents of a web page by using file_get_contents($url):

$url = 'http://www.example.com';

$output = file_get_contents($url);

Some musings about the method:

What do you think? Can I add a list of URLs to a database and let the image gallery mentioned above make a call and show the image, or should I fetch all the images with a Perl program (see below) or HTTrack, store them locally, and make calls to the local files? I hope you understand my question; or do I have to explain it more? Which method is smarter, less difficult, and easier to accomplish? The task itself is pretty easy: no scraping that goes deep into the site. With the second piece of code I can store the files in a folder using the corresponding names.

To sum it up: this is a question about method. Either fetch the data on the fly, e.g. with $output = file_get_contents($url); or get the data (more than 5,500 images that are screenshots of given web pages, nothing more, nothing less), store it locally, and make calls to it.

Which method is smarter?

love to hear from you

greetings

dilbertone

Note: I only need the screenshots, nothing more. That's pretty easy: no scraping that goes deep into the site. Thank god it is that easy!

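Here is a Perl solution:

#!/usr/bin/perl

    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    # Drives a running Firefox instance (needs the MozRepl add-on).
    my $mech = WWW::Mechanize::Firefox->new();

    open(my $in, '<', 'urls.txt') or die "Can't open file: $!";

    while (<$in>) {
      chomp;                               # strip the newline from the URL
      $mech->get($_);                      # load the page in the browser
      my $png = $mech->content_as_png();   # rendered page as PNG data
    }
    close($in);
    exit;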

From the docs: Returns the given tab or the current page rendered as PNG image. All parameters are optional. $tab defaults to the current tab. If the coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries, left, top, width, height. Well, this is specific to WWW::Mechanize::Firefox.

Currently, the data transfer between Firefox and Perl is done Base64-encoded. It would be beneficial to find what's necessary to make JSON handle binary data more gracefully.

Well, the source list is here:

Filename: urls.txt (for example, as shown here)

    www.google.com
    www.cnn.com
    www.msnbc.com
    news.bbc.co.uk
    www.bing.com
    www.yahoo.com

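Inside the while loop, each screenshot can then be saved to a file named after the URL:

open my $out, '>', "$_.png" or die "could not open '$_.png' for output $!";
binmode $out;   # PNG is binary data
print $out $png;
close $out;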

Again, note: I only need the screenshots, nothing more. That's pretty easy: no scraping that goes deep into the site. Thank god it is that easy! And the alternative is the dynamic solution, using file_get_contents($url):

$url = 'http://www.example.com';
$output = file_get_contents($url);

Which is the smarter solution?

love to hear from you!

dilbertone · December 11, 2011

Hello, dear folks,

me again. I am musing about the cleverest way to do this job...

Hmm, I guess there is a major difference between retrieving HTML (on the one hand) and retrieving an image (on the other hand).
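To make that difference concrete, here is a small sketch. file_get_contents($url) on a web page hands back markup, not picture data, which a quick look at the first bytes shows. The isPng() helper and the sample string are assumptions for illustration; the only fixed fact is the 8-byte signature that every PNG file starts with.

```php
<?php
// The 8-byte signature at the start of every PNG file.
const PNG_SIGNATURE = "\x89PNG\r\n\x1a\n";

// Hypothetical helper: true if the given bytes begin with the PNG signature.
function isPng(string $bytes): bool
{
    return strncmp($bytes, PNG_SIGNATURE, 8) === 0;
}

// Stands in for what file_get_contents('http://www.example.com') returns:
// the HTML source of the page, which still has to be rendered by a browser.
$html = '<html><body>Hello</body></html>';

var_dump(isPng($html)); // bool(false)
```

So fetching the HTML is only the first step of what a browser does; the screenshot needs the rendering step as well.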

Retrieving an image with the Perl code and Firefox (see the code that includes the Firefox part in Mechanize) seems to be much smarter than, for example, doing it with HTTrack (the famous tool). With the little Perl snippet we are able to get proper rendering, including interpretation of CSS/JS. A regular (automated) browser such as Firefox does a good job here. On a side note: for this fetching job the little Perl snippet is far more powerful than HTTrack, since this is not something HTTrack would do easily. HTTrack can only grab (parts of) websites; it cannot do rendering of any sort, nor interpret CSS/JS.

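#!/usr/bin/perl

    use strict;
    use warnings;
    use WWW::Mechanize::Firefox;

    # Drives a running Firefox instance (needs the MozRepl add-on).
    my $mech = WWW::Mechanize::Firefox->new();

    open(my $in, '<', 'urls.txt') or die "Can't open file: $!";

    while (<$in>) {
      chomp;                               # strip the newline from the URL
      $mech->get($_);                      # load the page in the browser
      my $png = $mech->content_as_png();   # rendered page as PNG data
    }
    close($in);
    exit;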

Well: there is absolutely no need to fetch HTML contents.

Caching the image is done easily with the Perl snippet. Therefore HTTrack is (absolutely) not the tool I should take into consideration.

What do you think?

Drongo_III · December 11, 2011

Hi mate

I reckon this should put you on the right track: http://www.php.net/manual/en/function.imagegrabwindow.php

Drongo

dilbertone · December 11, 2011

Hi there,

good day - great to hear from you, Drongo!

Good catch and great thoughts; I am happy to hear that from you. I will dig deeper and try to solve things with your ideas.

I will come back later today.

greetings

db1

Update:

Great catch, it all looks interesting!

<?php
$browser = new COM("InternetExplorer.Application");
$handle = $browser->HWND;
$browser->Visible = true;
$im = imagegrabwindow($handle);
$browser->Quit();
imagepng($im, "iesnap.png");
imagedestroy($im);
?>

Capture a window (IE for example) but with its content

<?php
$browser = new COM("InternetExplorer.Application");
$handle = $browser->HWND;
$browser->Visible = true;
$browser->Navigate("http://www.libgd.org");

/* Still working? */
while ($browser->Busy) {
    com_message_pump(4000);
}
$im = imagegrabwindow($handle, 0);
$browser->Quit();
imagepng($im, "iesnap.png");
imagedestroy($im);
?>

Well, I guess I should try this out. Interesting stuff indeed.

QuickOldCar · December 12, 2011

Here's a script I use to take thumbnails of websites. I save them under the md5 hash of the URL; you could name them something else if you wanted to.

I run a different script for display purposes. It checks whether the image exists under several variants of the URL, because that's the way URLs work, and then resizes them with GD.

<div align="center">
<form action="" method="GET" align="center">
<input type="text" name="url" size="100" id="url" placeholder="Insert a Url" />
<br />
<input type="submit" value="Snap IT" />
<br />
</form>

<?php
if (isset($_GET['url'])){

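//parse the url to host
function getparsedHost($new_parse_url) {
                $parsedUrl = parse_url(trim($new_parse_url));
                if (!empty($parsedUrl['host'])) {
                    return trim($parsedUrl['host']);
                }
                //no host found: fall back to the first path segment
                $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
                return trim($parts[0]);
            }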

//get website url from the query string
//(no database is involved here, so no SQL escaping is needed)
$input_url = trim($_GET['url']);

//clean the url (longer prefixes first, so "https://www." is stripped as a whole)
$input_url = str_ireplace(array("https://www.","http://www.","https://","http://","feed://","ftp://"), "", $input_url);
$input_url = rtrim($input_url, "/");
$url = "http://$input_url";

//use parsed url versus full urls
$url = "http://".getparsedHost($url);

//if empty url show message
if($url == "" || $url == "http://"){
echo "Insert a valid url.";
die;
}

//make md5 hash for filename
$md5_url = md5($url);

//resize function
function resize($img, $w, $h, $newfilename) {

//Check if GD extension is loaded
if (!extension_loaded('gd') && !extension_loaded('gd2')) {
  trigger_error("GD is not loaded", E_USER_WARNING);
  return false;
}

//Get Image size info
$imgInfo = getimagesize($img);
switch ($imgInfo[2]) {
  case 1: $im = imagecreatefromgif($img); break;
  case 2: $im = imagecreatefromjpeg($img);  break;
  case 3: $im = imagecreatefrompng($img); break;
  //stop here: without a source image there is nothing to resize
  default:  trigger_error('Unsupported filetype!', E_USER_WARNING);  return false;
}

//If image dimension is smaller, do not resize
if ($imgInfo[0] <= $w && $imgInfo[1] <= $h) {
  $nHeight = $imgInfo[1];
  $nWidth = $imgInfo[0];
}else{
                //yeah, resize it, but keep it proportional
  if ($w/$imgInfo[0] > $h/$imgInfo[1]) {
   $nWidth = $w;
   $nHeight = $imgInfo[1]*($w/$imgInfo[0]);
  }else{
   $nWidth = $imgInfo[0]*($h/$imgInfo[1]);
   $nHeight = $h;
  }
}
$shrink = 0.40;//scale factor: output is 40% of the computed size
//scale first, then round, so GD receives whole pixel values
$nWidth = max(1, (int) round($nWidth * $shrink));
$nHeight = max(1, (int) round($nHeight * $shrink));

$newImg = imagecreatetruecolor($nWidth, $nHeight);

/* Check if this image is PNG or GIF, then set if Transparent*/  
if(($imgInfo[2] == 1) OR ($imgInfo[2]==3)){
  imagealphablending($newImg, false);
  imagesavealpha($newImg,true);
  $transparent = imagecolorallocatealpha($newImg, 255, 255, 255, 127);
  imagefilledrectangle($newImg, 0, 0, $nWidth, $nHeight, $transparent);
}
imagecopyresampled($newImg, $im, 0, 0, 0, 0, $nWidth, $nHeight, $imgInfo[0], $imgInfo[1]);

//Generate the file, and rename it to $newfilename
switch ($imgInfo[2]) {
  case 1: imagegif($newImg,$newfilename); break;
  case 2: imagejpeg($newImg,$newfilename);  break;
  case 3: imagepng($newImg,$newfilename); break;
  default:  trigger_error('Failed resize image!', E_USER_WARNING);  break;
}
   
   return $newfilename;
}

//load url fullscreen in IE browser
$browser = new COM("InternetExplorer.Application") or die ("Could not initiate IE object."); 
$handle = $browser->HWND;
$browser->Visible = true;
$browser->FullScreen = true; 
$browser->Navigate($input_url);

$seconds = 7;
$delay_time = $seconds * 1000;

if($browser->Busy) {
com_message_pump($delay_time);
}

$im = imagegrabwindow($handle, 0);
//$im = imagegrabscreen($handle, 0);//grabs entire primary window
$browser->Quit();
$browser=null;
unset($browser);
imagepng($im, "./thumb/$md5_url.png");

//image location
$image_location = "./thumb/$md5_url.png";
    
//browser snap size in fullscreen
$w = 1024;
$h = 768;
    
//resize the image
$thumbnail = resize($image_location, $w, $h, $image_location);
    
//show the thumbnail and href links (escape the url before printing it)
$safe_url = htmlspecialchars($url, ENT_QUOTES);
echo "<a href='$safe_url' target='_blank'><img src='$image_location' alt='$safe_url' /></a><br />";
echo " <a href='$safe_url' target='_blank'>$safe_url</a><br />";
echo "<a href='thumb/$md5_url.png'>Thumb Location</a>";

//always destroy the temp image in GD
imagedestroy($im);

}
?>
</div>

There is also a good Firefox plugin that works:

Pearl Crescent Page Saver

You can install the basic version and save the files as %5 for md5. I set them to 40% of size, which is 401 pixels.

I run a command like this to save as PNG:

exec("Psexec.exe -i -d ./firefox/firefox.exe -savepng $url -savedelay 3000");

You could also check out WebShot.

It can snap all your images from a list and save them at certain sizes.

I also wanted to add: the only way to render everything on a page correctly is to use a browser.

Using Firefox with Adblock is a nice way to block the ads.
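For the display side mentioned at the top of this post (checking whether a thumbnail exists under several forms of the URL), a rough sketch could look like this. It is not the actual display script: urlVariants() and findThumb() are made-up names, and it assumes thumbnails were stored as md5("http://" + host) + ".png" in one folder, as the snap script above does.

```php
<?php
// Hypothetical helper: the forms under which a host may have been snapped.
// Assumption: only the "www." prefix varies; the scheme is always http://.
function urlVariants(string $host): array
{
    $host = strtolower(trim($host));
    $bare = preg_replace('/^www\./', '', $host);
    return ["http://$bare", "http://www.$bare"];
}

// Returns the path of the first thumbnail found for any variant, or null.
function findThumb(string $host, string $dir = './thumb'): ?string
{
    foreach (urlVariants($host) as $url) {
        $path = rtrim($dir, '/') . '/' . md5($url) . '.png';
        if (is_file($path)) {
            return $path;
        }
    }
    return null;
}

// Usage: a null result means the site still has to be snapped.
$thumb = findThumb('www.example.com');
echo $thumb === null ? "no thumbnail yet\n" : "$thumb\n";
```

Trying each variant in turn keeps one stored file usable whether the visitor typed the host with or without "www.".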

dilbertone · December 12, 2011

Hello dear QuickOldCar, well, in one word:

many thanks for sharing your code - this looks damned cool! Very, very well done!

Thank you so much for all these interesting lines of code! You are great! You've created a monster, congratulations!

I had a quick look at the code. It looks great and impressive, and contains all the necessary things.

Well, I have to make up my mind about saving the results as md5; I never did this. But this is a very, very cool method!

Dear QuickOldCar, thank you for the service you provide the ham community.

At the weekend I will give your code and your plan a try, and then I will come back and report everything. Until then,

Have a great week!

Best regards and warm greetings

dilbertone
