
Scraping with PHP


phpsycho


I am trying to build a small script that crawls pages for other links and images. It already checks robots.txt, and I'm using cURL rather than file_get_contents().

I am using file() to read robots.txt, though, so that I can process it line by line.
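For reference, a minimal sketch of the line-by-line robots.txt approach. The helper name disallowedPaths() and the sample rules are made up for illustration, and it is deliberately simplified: it only collects Disallow paths from groups addressed to "User-agent: *" and ignores Allow rules, wildcards and agent-specific groups. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: collect Disallow paths that apply to all agents ("*").
// Accepts the array of lines that file() returns.
function disallowedPaths(array $lines): array
{
    $applies = false;       // true while inside a group that names "*"
    $inAgentLines = false;  // true while reading consecutive User-agent lines
    $paths = [];
    foreach ($lines as $line) {
        // Drop comments and surrounding whitespace (including the newline from file()).
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if (!$inAgentLines) {
                $applies = false; // a new group starts here
            }
            $applies = $applies || $value === '*';
            $inAgentLines = true;
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== '') {
                $paths[] = $value;
            }
        }
    }
    return $paths;
}

// Sample lines standing in for file('http://.../robots.txt').
$sample = [
    "User-agent: *\n",
    "Disallow: /private/ # keep out\n",
    "Disallow:\n",
    "\n",
    "User-agent: BadBot\n",
    "Disallow: /\n",
];
$blocked = disallowedPaths($sample);
```

With the sample above, only /private/ ends up in $blocked, because the empty Disallow line and the BadBot group are skipped.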

 

The problem is that no links are being added. Also, when I save an image to my server, I want to read its details: whether it is in color, its width, height, extension, etc.

But I keep getting these errors:

IMG ADDED: http://***.com/images/status-busy.png
PHP Warning:  exif_read_data(5525611904.png): File not supported in /var/www/alpha/my/bots/crawl10.php on line 172
IMG ADDED: http://***.com/images/status-busy.png
PHP Warning:  exif_read_data(2322371467.png): File not supported in /var/www/alpha/my/bots/crawl10.php on line 172
IMG ADDED: http://***.com/images/status-busy.png
PHP Fatal error:  Cannot break/continue 1 level in /var/www/alpha/my/bots/crawl10.php on line 120

 

I just noticed it's trying to add the same image more than once, which is odd, but the last error is the one I'm really wondering about.

Around line 120:

<?php
$parse = parse_url($url);
if (isset($parse['path'])) {
    $haystack = pathinfo($parse['path'], PATHINFO_EXTENSION);
    if (!preg_match("/(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)/is", $haystack)) { continue; }
}
?>

 

I also have a question about this query:

<?php
	mysql_query("INSERT INTO `search_images` (`url`,`file`,`name`,`from`,`width`,`height`,`color`,`size`,`type`,`datetime`) 
	values ('$img[2]','$file','$name','$link[2]','$width','$height','$color','$size','$extention','$datetime')");
?>

 

It runs inside a foreach ($imgs as $img) loop. Can I just nest another foreach ($links as $link) inside that one, so that I can get $link[2], which is the page the image came from?
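On the nested-loop question: if the INSERT sat inside an inner foreach ($links as $link), it would run once per link for every image, pairing each image with every link rather than with the page it came from. One alternative, sketched below, is to record the source page at the moment each image is found and to skip image URLs already seen. The function name collectImages(), the input shape (page URL mapped to the image URLs found on that page) and the example.com URLs are assumptions for illustration, not the structure of the original $imgs and $links arrays.

```php
<?php
// Hypothetical helper. $pages maps a page URL to the image URLs found on that page.
// Returns one row per distinct image URL, remembering the first page it was seen on.
function collectImages(array $pages): array
{
    $seen = [];   // image URL => true, used to skip repeats
    $rows = [];
    foreach ($pages as $pageUrl => $imageUrls) {
        foreach ($imageUrls as $imageUrl) {
            if (isset($seen[$imageUrl])) {
                continue; // already recorded, do not insert it twice
            }
            $seen[$imageUrl] = true;
            $rows[] = ['url' => $imageUrl, 'from' => $pageUrl];
        }
    }
    return $rows;
}

// Placeholder data standing in for crawl results.
$rows = collectImages([
    'http://example.com/a' => ['http://example.com/x.png', 'http://example.com/x.png'],
    'http://example.com/b' => ['http://example.com/x.png', 'http://example.com/y.png'],
]);
```

Each entry in $rows then carries both values the INSERT needs (url and from), so a single loop over $rows is enough.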


http://www.php.net/manual/en/function.exif-read-data.php

 

PNG is not supported by exif_read_data(); it reads EXIF headers from JPEG and TIFF files.

 

You can use GD locally on the image after you download it.

http://www.php.net/manual/en/function.gd-info.php

http://www.php.net/manual/en/function.getimagesize.php

http://www.php.net/manual/en/function.image-type-to-mime-type.php
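A minimal sketch of reading those details from a saved file with getimagesize() and image_type_to_extension(), which work on PNG as well as JPEG. The helper name describeImage() and the tiny embedded 1x1 PNG are only for illustration; in the crawler, the path would be the file that was just downloaded. It does not answer the "is it color" part, and it assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: basic facts about a local image file, or null if it
// cannot be read as an image.
function describeImage(string $file): ?array
{
    $info = @getimagesize($file); // false for missing or unrecognized files
    if ($info === false) {
        return null;
    }
    return [
        'width'     => $info[0],
        'height'    => $info[1],
        'mime'      => $info['mime'],
        'extension' => image_type_to_extension($info[2], false), // no leading dot
        'size'      => filesize($file),                          // bytes on disk
    ];
}

// Demo: write a 1x1 PNG to a temp file as a stand-in for a downloaded image.
$png = base64_decode(
    'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg=='
);
$tmp = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($tmp, $png);
$details = describeImage($tmp);
unlink($tmp);
```

getimagesize() identifies the type from the file contents, so it does not matter that the saved file name has no meaningful extension.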

 

Also, please post the rest of your code so we can see how the MySQL inserts are associated with the links and images.


Ah, okay. Those errors are fixed, but I still get:

PHP Fatal error:  Cannot break/continue 1 level in /var/www/alpha/my/bots/crawl10.php on line 120

Fatal error: Cannot break/continue 1 level in /var/www/alpha/my/bots/crawl10.php on line 120

 

It must be something with my if statement, but I think I am doing it right.

 

<?php
$parse = parse_url($url);
if (isset($parse['path'])) {
    $haystack = pathinfo($parse['path'], PATHINFO_EXTENSION);
    if (!preg_match("/(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)/is", $haystack)) { continue; }
}
?>

It's the second if statement.


I only see that message when the extension doesn't match the file types in the preg_match() pattern.

 

$parse = parse_url($url);
if (isset($parse['path'])) {
    $haystack = pathinfo($parse['path'], PATHINFO_EXTENSION);
    if (!preg_match("/(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)/is", $haystack)) {
        echo "Didn't match file type";
        die;
    } else {
        echo $haystack;
        //rest of code
    }
}
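"Cannot break/continue 1 level" is what PHP 5 reports when a continue runs outside any enclosing loop or switch, which fits the error appearing only when the extension check fails. In PHP 7 and later the same mistake is rejected at compile time. A minimal sketch of keeping the check but running the continue inside a loop follows. The helper name looksLikePage(), the $urls list and the example.com addresses are placeholders, and anchoring the pattern with ^ and $ is an assumption about the intent (the original pattern also matches extensions that merely contain one of the names).

```php
<?php
// Hypothetical helper: true when the URL should be crawled as a page.
// Mirrors the original logic: URLs without a path are let through.
function looksLikePage(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === false) {
        return false; // seriously malformed URL
    }
    if ($path === null) {
        return true;  // no path component, nothing to check
    }
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    // Anchored so that "php" matches but "phps" does not (an assumption).
    return preg_match('/^(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)$/i', $ext) === 1;
}

// Placeholder URLs standing in for the scraped links.
$urls = [
    'http://example.com/index.php',
    'http://example.com/images/status-busy.png',
    'http://example.com',
];

$pages = [];
foreach ($urls as $url) {
    if (!looksLikePage($url)) {
        continue; // legal here: the statement is inside the foreach
    }
    $pages[] = $url;
}
```

If the original snippet lives inside a function that is called from the loop, continue cannot reach the caller's loop; returning a boolean as above, or a plain return, is the usual way out.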

 


