Scraping with PHP


phpsycho

I'm trying to build a small script that crawls pages for links and images. It already checks robots.txt, and I'm using cURL rather than file_get_contents().

I do use file() to read robots.txt, though, so I can process it line by line.

The problem is that no links are being added. Also, when I save an image to my server I want to read its properties: whether it's in color, its width, height, extension, etc.

But I keep getting these errors:

IMG ADDED: http://***.com/images/status-busy.png
PHP Warning:  exif_read_data(5525611904.png): File not supported in /var/www/alpha/my/bots/crawl10.php on line 172
IMG ADDED: http://***.com/images/status-busy.png
PHP Warning:  exif_read_data(2322371467.png): File not supported in /var/www/alpha/my/bots/crawl10.php on line 172
IMG ADDED: http://***.com/images/status-busy.png
PHP Fatal error:  Cannot break/continue 1 level in /var/www/alpha/my/bots/crawl10.php on line 120

 

Just noticed it's trying to add the images more than once, which is odd, but the last error is the one I'm really wondering about.

Lines 120ish:

<?php
$parse = parse_url($url);
if (isset($parse['path'])) {
    $haystack = pathinfo($parse['path'], PATHINFO_EXTENSION);
    if (!preg_match("/(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)/is", $haystack)) {
        continue;
    }
}
?>

 

And for this query here..

<?php
	mysql_query("INSERT INTO `search_images` (`url`,`file`,`name`,`from`,`width`,`height`,`color`,`size`,`type`,`datetime`) 
	values ('$img[2]','$file','$name','$link[2]','$width','$height','$color','$size','$extention','$datetime')");
?>

 

I have that query inside a foreach loop, foreach($imgs as $img). Can I just nest another loop inside it, foreach($links as $link), so I can get $link[2], which is the page the image came from?


http://www.php.net/manual/en/function.exif-read-data.php

 

exif_read_data() does not support PNG; it reads EXIF headers from JPEG and TIFF files.

Once the image is downloaded, you can inspect the local file with GD or getimagesize() instead.

http://www.php.net/manual/en/function.gd-info.php

http://www.php.net/manual/en/function.getimagesize.php

http://www.php.net/manual/en/function.image-type-to-mime-type.php
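
As a rough illustration of the functions linked above, here is a minimal sketch that reads width, height, MIME type, extension and file size from an image already saved to disk. The helper name describe_image() and the array keys are made up for this example and are not from the original script. getimagesize() itself does not need the GD extension. Detecting whether an image is in color is not covered here.

```php
<?php
// Hypothetical helper (not from the thread): describe a downloaded image file.
function describe_image(string $path): ?array
{
    // getimagesize() returns false when the file is not a recognised image.
    $info = @getimagesize($path);
    if ($info === false) {
        return null;
    }

    return [
        'width'     => $info[0],                                  // pixels
        'height'    => $info[1],                                  // pixels
        'mime'      => image_type_to_mime_type($info[2]),         // e.g. image/png
        'extension' => image_type_to_extension($info[2], false),  // no leading dot
        'size'      => filesize($path),                           // bytes on disk
    ];
}
```

Because the type comes from the file contents rather than the URL, this should also work when the saved file name has the wrong extension.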

 

Also, post the rest of your code so we can see how the images are associated with the MySQL inserts.
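
On the question of nesting foreach($links as $link) inside foreach($imgs as $img): if the INSERT sits inside the inner loop, every image would be paired with every link, which may be part of why images get added more than once. One alternative, sketched below under the assumption that the crawler can record which page each image was found on, is to carry the source page along with the image. The function name build_image_rows() and the shape of $pages are hypothetical and not taken from the original script.

```php
<?php
// Hypothetical sketch: $pages maps each crawled page URL to the image URLs
// found on that page, e.g. ['http://a.example/' => ['http://a.example/x.png']].
function build_image_rows(array $pages): array
{
    $rows = [];
    $seen = [];
    foreach ($pages as $pageUrl => $imageUrls) {
        foreach ($imageUrls as $imageUrl) {
            if (isset($seen[$imageUrl])) {
                continue; // this image is already queued, skip the duplicate
            }
            $seen[$imageUrl] = true;
            // 'from' is the page the image was first seen on.
            $rows[] = ['url' => $imageUrl, 'from' => $pageUrl];
        }
    }
    return $rows;
}
```

Each returned row then has both values the INSERT needs, so a single loop over the rows is enough.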


Ah, okay. Those warnings are fixed now, but I still get:

PHP Fatal error:  Cannot break/continue 1 level in /var/www/alpha/my/bots/crawl10.php on line 120

Fatal error: Cannot break/continue 1 level in /var/www/alpha/my/bots/crawl10.php on line 120

 

It must be something with my if statement, but I think I am doing it right..

 

<?php
$parse = parse_url($url);
if (isset($parse['path'])) {
    $haystack = pathinfo($parse['path'], PATHINFO_EXTENSION);
    if (!preg_match("/(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)/is", $haystack)) {
        continue;
    }
}
?>

It's the second if statement.


I only see that message when the extension doesn't match the file types in the preg_match() pattern.

 

$parse = parse_url($url);
if (isset($parse['path'])) {
    $haystack = pathinfo($parse['path'], PATHINFO_EXTENSION);
    if (!preg_match("/(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)/is", $haystack)) {
        echo "Didn't match file type";
        die;
    } else {
        echo $haystack;
        //rest of code
    }
}
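
For context, "Cannot break/continue 1 level" is typically what PHP reports when continue (or break) runs with no enclosing loop in the current scope, for example inside a function or an included file, even if the calling code is looping. If the goal is to skip non-page URLs rather than stop the script, one option is to keep the check inside the loop itself. The sketch below assumes a hypothetical helper filter_page_urls() that is not part of the original script. It also anchors the pattern so that an extension such as "phps" does not match, and it treats URLs with no extension as pages, which is an assumption.

```php
<?php
// Hypothetical helper: keep only URLs whose path looks like an HTML page.
function filter_page_urls(array $urls): array
{
    $pages = [];
    foreach ($urls as $url) {
        // parse_url() returns null (or false) when there is no usable path.
        $path = parse_url($url, PHP_URL_PATH);
        $ext  = is_string($path) ? pathinfo($path, PATHINFO_EXTENSION) : '';

        // Anchored with ^ and $ so only whole extensions match. An empty
        // extension (e.g. "/" or "/about") is kept, by assumption.
        if ($ext !== '' && !preg_match('/^(php|html|htm|asp|aspx|shtml|php4|php5|cfm|pl|jsp)$/i', $ext)) {
            continue; // legal here: we are inside the foreach
        }
        $pages[] = $url;
    }
    return $pages;
}
```

Called with a list of crawled URLs, this returns the ones worth fetching as pages and silently drops image and other asset URLs.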

 
