file_get_contents & parsing - review

dilbertone · May 17, 2011

Hello dear community,

I am trying to find a way to use file_get_contents to download a set of pages. Can anybody review my approach?

Since I expect to find all 790 result pages within a certain range, between Id=0 and Id=100000, I thought I could go with a loop:

How do I mechanize this with a loop from 0 to 10000 and throw out the 404 responses?
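
A rough sketch of such a loop in PHP, assuming the result pages sit behind a URL with the Id in the query string. The URL pattern and the function names (last_status_code, http_fetch, collect_pages) are made up for illustration. The idea is to read the status code from the response headers and keep only the pages that answer with 200, so the 404s fall away:

```php
<?php
// Return the status code of the last "HTTP/..." status line in a header list
// (after redirects there can be more than one), or 0 if there is none.
function last_status_code(array $headers): int
{
    $code = 0;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Fetch one URL and return array(status code, body).
// 'ignore_errors' makes file_get_contents return the body even for a 404.
function http_fetch(string $url): array
{
    $context = stream_context_create(array(
        'http' => array('method' => 'GET', 'ignore_errors' => true),
    ));
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the calling scope.
    $headers = isset($http_response_header) ? $http_response_header : array();
    return array(last_status_code($headers), $body === false ? '' : $body);
}

// Walk the Id range and keep only the pages that answer with 200.
// $urlTemplate is a sprintf pattern with one %d placeholder (hypothetical).
function collect_pages(string $urlTemplate, int $from, int $to, callable $fetch): array
{
    $pages = array();
    for ($id = $from; $id <= $to; $id++) {
        list($status, $body) = $fetch(sprintf($urlTemplate, $id));
        if ($status === 200) {
            $pages[$id] = $body;
        }
    }
    return $pages;
}
```

A call could look like collect_pages('http://www.example.com/result.php?Id=%d', 0, 10000, 'http_fetch'), with the real URL pattern put in place of the example one. Passing the fetcher in as a callable makes it possible to try the loop with a fake fetcher before hitting the real server.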

Once we reach a page, we could then use BeautifulSoup to get the content (in our case the image file address).

But we could also just loop through the images directly with simple web requests.
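
BeautifulSoup is a Python library. If everything stays in PHP, the built-in DOMDocument class can do a similar job. A minimal sketch, assuming the image address is the src attribute of an img tag on the result page (the function name image_urls is made up):

```php
<?php
// Return the src attribute of every <img> element found in an HTML string.
function image_urls(string $html): array
{
    if (trim($html) === '') {
        return array(); // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML: collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return $urls;
}
```

Relative addresses come back exactly as written in the page, so they still have to be joined with the site's base URL before they can be requested.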

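Well, how to proceed? Like this:

<?php
// Options for the HTTP request: method and extra headers.
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);

// Create the stream context from the options.
$context = stream_context_create($opts);

// Fetch the page using the HTTP headers set above.
$file = file_get_contents('http://www.example.com/', false, $context);
?>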

After downloading the images we will need to OCR them to extract any useful info, so at some stage we need to look at OCR libraries.

I think Google open-sourced one, and since it comes from Google there is a good chance it has a Python API.
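
Before any OCR can run, the images have to land on disk. A minimal sketch of that download step, assuming the target directory already exists. The function names local_image_path and download_image are made up for illustration:

```php
<?php
// Derive a local file name from an image URL,
// e.g. "http://www.example.com/scans/42.png" -> "images/42.png".
function local_image_path(string $dir, string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    $name = is_string($path) ? basename($path) : '';
    if ($name === '') {
        // No usable file name in the URL: fall back to a hash of the URL.
        $name = md5($url);
    }
    return rtrim($dir, '/') . '/' . $name;
}

// Download one image; returns the local path, or null if anything fails.
// $context is an optional stream context from stream_context_create().
function download_image(string $url, string $dir, $context = null): ?string
{
    $data = @file_get_contents($url, false, $context);
    if ($data === false) {
        return null; // request failed (a 404 also ends up here by default)
    }
    $target = local_image_path($dir, $url);
    return file_put_contents($target, $data) === false ? null : $target;
}
```

The saved files can then be handed to whatever OCR tool gets picked in the end.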

Can anybody review the approach? I look forward to hearing from you.
