Archived

This topic is now archived and is closed to further replies.

beba

Download photos from a website

Hi

 

I've got a project I'm working on where I need to collect a whole bunch of images from a website (photo.net), and decided a PHP script might help automate the process. A few other people may work on the project in the future, so I'm leaving detailed comments on most lines to help them understand the code.

So far I'm stuck on a few aspects of the coding process:
1. I have a URL, but need to know how to open a random page on that site, specifically a page with a picture on it, such as http://photo.net/photodb/photo?photo_id=18012098. I have thought about using "http://photo.net/photodb/photo?photo_id=" as the base URL and randomly generating an 8-digit number at the end, but I tested this manually and a lot of combinations result in an error page (such as http://photo.net/photodb/photo?photo_id=10012098), so I would need a more reliable method.

 

2. As far as the file name goes, there are certain criteria: I'll need the name of the photo, the rating, and the number of votes (all of which are located in the details tab of each photo). These details are combined to create a file name; for the example photo http://photo.net/photodb/photo?photo_id=18012098, the filename I'd need to generate would be something like "Chameleon Snatches Hornworm 6.23 13.jpg".
I'm having trouble figuring out how to get my PHP script to scan a page for info and store it in a variable.
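For the scanning part, the usual approach is to fetch the page HTML into a string and pull values out with preg_match. Here is a minimal sketch using an inline HTML snippet and a hypothetical pattern; the real photo.net markup will differ, so the regex has to be adjusted against the actual page source:

```php
<?php
// Minimal sketch of scraping a value into a variable, assuming the photo
// title sits in a <title> tag (hypothetical; check the real page source).
// In the real script the HTML would come from something like:
//   $html = file_get_contents('http://photo.net/photodb/photo?photo_id=18012098');
$html = '<html><head><title>Chameleon Snatches Hornworm</title></head></html>';

if (preg_match('~<title>([^<]+)</title>~i', $html, $m)) {
    $picturename = trim($m[1]); // the scraped value, stored in a variable
}

echo $picturename; // Chameleon Snatches Hornworm
```

The same pattern-and-capture idea works for the rating and vote count once you know what HTML surrounds them on the page.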

 

3. And finally, I need to know how to actually rip the image from the page, saving it under the $folderpath and $filename variables.
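For the saving step, once you have the direct image URL (the photo page itself returns HTML, not JPEG bytes), it is just a read plus a write. A sketch under that assumption; the gallery URL in the usage comment is hypothetical and would need to be scraped from the page:

```php
<?php
// Sketch of "ripping" an image: file_get_contents reads the remote file,
// file_put_contents writes the raw bytes under the chosen name.
function save_image($imageurl, $filepath)
{
    $data = file_get_contents($imageurl); // download the raw image bytes
    if ($data === false) {
        return false;                     // download failed
    }
    return file_put_contents($filepath, $data) !== false;
}

// Usage (hypothetical direct URL; the real one must be scraped from the page):
// save_image("http://gallery.photo.net/photo/18012098-lg.jpg",
//            $folderpath . $filename);
```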

 

 

The code I've written so far is below. Any help I can get on any of these three issues would be very much appreciated.

thank you,

 

function download_image()
{
    // File path variable; change to the path you wish to download files to (be sure to end with a backslash "\")
    $folderpath = "E:\\Pictures\\"; // backslashes must be escaped inside double quotes

    // Website variable; change to the website you're downloading photos from
    $url = "http://photo.net/";

    // Number of photos; change this to the number of photos you wish to download at any one time
    $photos = 10;

    // Main loop: finds photos on the site, assigns a file name, and saves each photo to the specified file path
    for ($i = 0; $i < $photos; $i++) {
        // Opens a random picture page from the URL
        // TODO

        // Assigns the filename
        // Gets the name of the picture (adjust if using a website other than photo.net)
        $picturename = ""; // TODO
        // Gets the rating of the picture (adjust if using a website other than photo.net)
        $rating = ""; // TODO
        // Gets the number of votes for the picture (adjust if using a website other than photo.net)
        $votes = ""; // TODO
        // Combines all these variables to create the filename
        $filename = $picturename . " " . $rating . " " . $votes . ".jpg";

        // Checks the rating; if "Critique Only" then move on to another photo
        if ($rating == "Critique Only") { // == for comparison, not = (assignment)
            // If there is no proper rating, deduct 1 from $i
            $i--;
        } else {
            // Checks the folder to see if the file already exists (if so, deduct one from $i;
            // this ensures the right number of photos are downloaded without downloading doubles)
            $filepath = $folderpath . $filename;
            if (file_exists($filepath)) {
                // If the file exists, deduct 1 from $i
                $i--;
            } else {
                // If the file doesn't exist, download the picture to the folder
                // TODO
            }
        }
    }
}


"Unless otherwise indicated, all photographs on photo.net are copyrighted by the photographers, whose permission is required for any usage."


Yet it's okay to view them? And you can download them by clicking on the image and selecting Save As.

Either way, I do not plan on publicizing them or using them for any profit-making purpose.

What I intend to do with them is a long story (it involves a deep learning package and the question "can a computer distinguish between a good photo and a bad one?"), but it is nothing nefarious. These photos will not be used in any public way, nor in any way that would be much different from viewing them on the webpage itself.


By the way, just an update: I'm continuing to work on the script and have made progress on the first part. I was able to use a program to create an XML sitemap of photo.net, so now all I need is to loop through the XML file to get the URLs from it. This I believe I can do on my own, so I no longer need help with step 1.
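If the sitemap follows the standard <url><loc> layout, looping through it with simplexml is straightforward. A sketch with inline XML (point simplexml_load_file at the real sitemap file instead); note that real sitemap files usually declare an xmlns namespace, so if ->url comes back empty the namespace has to be registered first:

```php
<?php
// Sketch: pull every page URL out of a sitemap, then pick one at random.
// Inline sample XML; a real sitemap would be loaded with simplexml_load_file().
$xml = simplexml_load_string(
    '<urlset>
       <url><loc>http://photo.net/photodb/photo?photo_id=18012098</loc></url>
       <url><loc>http://photo.net/photodb/photo?photo_id=17635199</loc></url>
     </urlset>'
);

$urls = array();
foreach ($xml->url as $entry) {
    $urls[] = (string) $entry->loc; // collect every page URL
}

// A random photo page instead of guessing photo_id values:
$randompage = $urls[array_rand($urls)];
```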

I'm still stuck on 2 & 3, though.


Yet it's okay to view them? And you can download them by clicking on the image and selecting Save As.

Either way, I do not plan on publicizing them or using them for any profit-making purpose.

What I intend to do with them is a long story (it involves a deep learning package and the question "can a computer distinguish between a good photo and a bad one?"), but it is nothing nefarious. These photos will not be used in any public way, nor in any way that would be much different from viewing them on the webpage itself.

 

Yes, it's okay to view them. No, actually, it's not okay to right-click and download them. That's a breach of copyright unless you have explicit written permission from the image owner. But here's the thing: most sites know some people will download an image here and there, and it's let slide because a cost-versus-return model makes it unrealistic to go chasing every person that nicks an image here and there. What you're talking about is scraping/ripping the site, and that's just wrong. It's 99.9% certain that they will have a monitor on traffic volumes, and when they see hundreds of megabytes getting sucked down a single IP address, or even the multiple connection streams that will be established, they will bring their legal guys round and rip you a new one before you can finish checking that all those nice pictures even downloaded.

 

The only thing more <insert word(s) here> than thinking it's okay to actively rip off people for their photos just because you want them for nothing (not even the time it takes you to right-click and manually steal them) is that you would come on here, advertise the fact, and ask people to help you do it. You want free photos? Go take them yourself. You want other people's photos for free? Then contact the photographer and make a request: explain your endeavour and take what you get (which in most cases will be nice, generic, "nothing special", "seen that a hundred times already" kinds of images that they know are worth nothing financially, but will still serve for your claimed purposes).


Okay, firstly: yes, I can take my own photos (in fact I already have). However, for the purposes of my project I need the ratings of the photos as much as I need the photos themselves (without the rating, the photo is pointless). And yes, "nothing special", "seen that a hundred times already" photos will suit me just fine (in fact, that's EXACTLY what I need), but again, without the ratings they're meaningless.

 

Also, as I require literally thousands of photos, the idea of spending hours working manually on each one is pretty ridiculous; in fact, one member of our group has spent over a week putting together 200 photos, and it has been very time-consuming.

After over two hours on Google, this is pretty much the only site I've found that allows users to upload photos and rate others', then averages the ratings together (something that is crucial to the project). If anyone knows of any other sites that do not have a disclaimer, then please let me know; I'd literally love nothing more.
 

 

The only thing more <insert word(s) here> than thinking it's okay to actively rip off people for their photos just because you want them for nothing (not even the time it takes you to right-click and manually steal them) is that you would come on here, advertise the fact, and ask people to help you do it.

 

I had not seen the disclaimer until joel24 posted it; to be honest, I never even thought of checking. So your comment is irrelevant, as I asked before I knew anything was even wrong. This is an assignment to me, period, and one I'm working hard to get through. I'm meeting with my lecturer tomorrow, so I'll let him know of your opinions; however, since he's already approved this site, I think he would've already seen the disclaimer himself.

Either way, the fact remains: I need a script for the project (whether or not it will actually be used this semester is uncertain) and will continue working on it.
Any help on the script will be appreciated; any derogatory comments regarding me or my "claimed purposes" will be ignored. The script can easily be adapted to work on other sites (in theory) and probably will be when another group takes over this assignment next semester (provided they can find another site that meets the criteria).
Thank you.


Next semester, so... which educational institute is getting you to do this?

 

My point is, if you're struggling at this part, then how are you going to achieve the next part?


OK, firstly:

 

"I had not seen the disclaimer until joel24 posted it; to be honest, I never even thought of checking. So your comment is irrelevant, as I asked before I knew anything was even wrong. This is an assignment to me, period."
isn't an excuse. It's your responsibility, no one else's, just yours, to make sure that you are using any service (online or otherwise) within the agreed usage of the provider. You should have checked before spending a week stealing other people's property off the internet.

 

Secondly: if your tutor has consented to you taking this course of action in full knowledge that it is against the law, I would very much like to know which educational establishment this person is employed in that would encourage such an action.

 

Thirdly: it sounds like you're after more than just the images from the site, but the creative input of its members as well. I'd suggest you actually contact the website directly and present them with what you are trying to do (just don't open with the fact that you have already stolen 200+ images off their site) and see what support they are willing to give you. Short of that, there's not too much you're going to get out of the people on here.


The project you describe will be pretty advanced in the end and not as simple as one would think.

Make an image scraper/crawler

You can make a site specific one or try to make something more general to do additional websites.

 

I personally wouldn't download and store the images, but instead just save each image's URL location.

You would be able to examine these images initially and also at later times if desired.

It's entirely possible to save these images into a folder and record each one's location/filename in your database.

If you need to examine an image in more detail, it needs to be downloaded (or in some cases just partially downloaded), but you don't need to keep that image: set an expiry time on the image for deletion, and refer back to the original site's image for display purposes.

 

Discover every page a site has (all sites are different with their URL patterns and pagination; also, the same page could have different dynamic content than the last time you visited it).

Discover all images per page, and save each image's href, title, and alt into a database, using a unique constraint to try to eliminate duplicates.
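The unique-constraint idea above can be sketched with SQLite via PDO (assuming the pdo_sqlite extension; any database works, SQLite's INSERT OR IGNORE is just a compact way to show the dedup):

```php
<?php
// Sketch: a UNIQUE constraint on the image URL makes repeated inserts of
// the same image no-ops, which eliminates duplicates automatically.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE images (
               src   TEXT UNIQUE,  -- image href; UNIQUE blocks duplicates
               title TEXT,
               alt   TEXT
           )');

$stmt = $db->prepare('INSERT OR IGNORE INTO images (src, title, alt)
                      VALUES (?, ?, ?)');
$stmt->execute(array('http://gallery.photo.net/photo/17635199-lg.jpg',
                     'SMP_0016a copy', 'photo'));
// Inserting the same src again is silently skipped:
$stmt->execute(array('http://gallery.photo.net/photo/17635199-lg.jpg',
                     'duplicate', ''));

echo $db->query('SELECT COUNT(*) FROM images')->fetchColumn(); // 1
```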

 

You can try the random approach, but it takes longer and does not guarantee you will even get every page.

If the goal is to just amass piles of images without caring what site had them, just start visiting random URLs from domain or URL lists and grab whatever you find.

 

It would be possible to scrape every href found on a page and save them to a session, list, or database, then select a random one from that set the next round.

It's possible on some sites to follow their pagination by doing a +1 on the page numbers; you can stop your scraper when it no longer finds content, or set a max limit manually.
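The +1 pagination loop can be sketched as a small helper; $fetch stands in for whatever fetches a page (curl, file_get_contents), and the URL pattern in the usage comment is hypothetical:

```php
<?php
// Sketch: request page 1, 2, 3, ... and stop when a page comes back empty
// or a manual cap is reached. $fetch is any callable that returns the page
// HTML, or an empty string / false once the pages run out.
function crawl_pages($baseurl, $fetch, $maxpages = 100)
{
    $pages = array();
    for ($page = 1; $page <= $maxpages; $page++) {
        $html = $fetch($baseurl . $page); // e.g. http://example.com/gallery?page=3
        if ($html === '' || $html === false) {
            break;                        // no more content: stop crawling
        }
        $pages[] = $html;
    }
    return $pages;
}

// Usage (hypothetical URL pattern):
// $pages = crawl_pages('http://example.com/gallery?page=', 'file_get_contents');
```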

For this approach, you would most likely want to store only href URLs from that domain, or possibly even from certain sections of the site.

 

If your goal is to make something like an image search across many sites, I would make a system that just stores image locations scraped from any page visited; then you can automate that script in many possible ways. Use your imagination.

 

Not every site has a sitemap, and even if it does, the sitemap does not usually contain every single URL that exists on the site; usually it's just the latest data, such as a feed, or merely the links to their sections.

 

So it will be up to you, depending on how the website is designed, to determine the best way to find their content.

 

As for the scraping aspect:

curl (to me, the best method to connect; it can also follow redirects)

file_get_contents (fast and easy; you can create a stream context, but it's still limited in what you can do, and it will fail a lot)

preg_match or preg_match_all

simplehtmldom

dom

simplexml

 

You will also have to fix relative URLs, and determine and convert/replace character, language, and document encoding.

 

I have my own search engine, website index, and piles of tools and automation for scraping; they're not items I'm willing to just hand out, but if you have any particular questions about something, feel free to post.

 

I'm trying to give you some information to research.

One method that can work for this particular site is to gather the photo-related links, read their OpenGraph data, and save that information.

 

As an example:

http://photo.net/photodb/photo?photo_id=17635199

<meta property="og:title" content="SMP_0016a copy: Photo by Photographer Jiri Subrt">
<meta property="og:type" content="article">
<meta property="og:url" content="http://photo.net/photodb/photo?photo_id=17635199"/>
<meta property="og:image" content="http://gallery.photo.net/photo/17635199-lg.jpg">

Some code to find the OpenGraph data and place it into an array:

<?php
$html = file_get_contents('http://photo.net/photodb/photo?photo_id=17635199');
preg_match_all('~<\s*meta\s+property=["\'](og:[^"\']+)["\']\s+content=["\']([^"\']*)~is', $html, $matches);

$og_array = array();

foreach ($matches[1] as $k => $match) {
    $match = str_replace(":", "_", $match);
    //echo "(".$match.") ".$matches[2][$k]."<br />";

    $og_array[trim($match)] = trim($matches[2][$k]);
    $$match                 = trim($matches[2][$k]);
}

echo "<pre>";
print_r($og_array);
echo "</pre>";
?>

Results:

Array
(
    [og_title] => SMP_0016a copy: Photo by Photographer Jiri Subrt
    [og_type] => article
    [og_url] => http://photo.net/photodb/photo?photo_id=17635199
    [og_image] => http://gallery.photo.net/photo/17635199-lg.jpg
)

