dilbertone Posted May 17, 2011 Share Posted May 17, 2011 hello dear community i try to find a way to use file_get_contents: a download of set of pages: Can any body review my approach .. and as i thought i can find all 790 resultpages within a certain range between Id= 0 and Id= 100000 i thought, that i can go the way with a loop: http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html How to mechanize with a loop from 0 to 10000 and throw out 404 responses once you reach the page we then could use beautifulsoup to get the content (in our case the image file address) but we also could just loop trough the images directely with simple webrequests. Well - how to proceed: like this: <?php // creating a stream! $opts = array( 'http'=>array( 'method'=>"GET", 'header'=>"Accept-language: en\r\n" . "Cookie: foo=bar\r\n" ) ); // opens a file $file = file_get_contents('http://www.example.com/', false, $context); ?> a typical page is http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Html and the related image is at http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Image after downloading the images we will need to OCR them to extract any useful info, so at some stage we need to look at OCR libs. I think google opensourced one, and since its google it has a good chance it has a python API can anybody review the approach - look forward to hear from you Link to comment https://forums.phpfreaks.com/topic/236686-file_get_contents-parsing-review/ Share on other sites More sharing options...
Recommended Posts
Archived
This topic is now archived and is closed to further replies.