
file_get_contents: & parsing - review


dilbertone


Hello dear community,

 

 

I am trying to find a way to use file_get_contents to download a set of pages. Can anybody review my approach?

 

As far as I can tell, all 790 result pages lie within a certain range, between Id=0 and Id=100000, so I thought I could go through them with a loop:

 

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html

http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html

http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html

http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html

http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html

 

How can I automate this with a loop from 0 to 100000 and throw out the 404 responses?

Once we reach a page, we could use BeautifulSoup to get the content (in our case, the address of the image file),

but we could also just loop through the images directly with simple web requests.
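The loop described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the function names (buildDetailsUrl, statusCodeFromHeaders, fetchDetails, findExistingIds) are made up for illustration, and it assumes the site really answers missing Ids with a non-200 status. If it returns a normal page with an error text instead, the check would have to look at the body.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from the site.

// Build the details URL for one Id; $type is "Html" or "Image".
function buildDetailsUrl(int $id, string $type = 'Html'): string
{
    $query = http_build_query(
        ['Id' => $id, 'InterfaceLanguage' => '', 'Type' => $type],
        '',
        '&'
    );
    return 'http://www.foundationfinder.ch/ShowDetails.php?' . $query;
}

// Pull the numeric status code out of the response header lines.
// The last "HTTP/..." line wins, so a redirect chain ends on the final status.
function statusCodeFromHeaders(array $headers): int
{
    $code = 0;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Fetch one page; return the body on status 200, otherwise null.
function fetchDetails(int $id, string $type = 'Html'): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'ignore_errors' => true, // do not fail on 4xx/5xx, we check the code ourselves
            'timeout'       => 10,
        ],
    ]);
    $body = @file_get_contents(buildDetailsUrl($id, $type), false, $context);
    // PHP fills $http_response_header in the local scope after an HTTP request.
    $headers = $http_response_header ?? [];
    if ($body === false || statusCodeFromHeaders($headers) !== 200) {
        return null;
    }
    return $body;
}

// Walk an Id range and collect the Ids for which $fetch returns a page.
function findExistingIds(int $from, int $to, callable $fetch): array
{
    $found = [];
    for ($id = $from; $id <= $to; $id++) {
        if ($fetch($id) !== null) {
            $found[] = $id;
        }
    }
    return $found;
}
```

A call such as findExistingIds(0, 100000, 'fetchDetails') would then run the whole range. For a run of that size it is probably wise to add a short pause (for example usleep) between requests so the server is not hammered.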

 

So, how to proceed? Something like this:

 

 

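<?php
// Define the HTTP options for the request
$opts = array(
  'http' => array(
    'method' => "GET",
    'header' => "Accept-language: en\r\n" .
                "Cookie: foo=bar\r\n"
  )
);

// Create the stream context from the options (this step was missing)
$context = stream_context_create($opts);

// Fetch the page using the HTTP options set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>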

 

 

A typical page is http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Html

and the related image is at http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Image
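Since the two URLs differ only in the Type parameter, the image address can be derived from the page address without parsing any HTML. A minimal sketch, assuming this pattern holds for every Id (the names imageUrlFromPageUrl and saveImage are hypothetical):

```php
<?php
// Hypothetical helper: switch the Type query parameter of a details URL to "Image".
function imageUrlFromPageUrl(string $pageUrl): string
{
    $parts = parse_url($pageUrl);
    parse_str($parts['query'] ?? '', $query);
    $query['Type'] = 'Image';
    return $parts['scheme'] . '://' . $parts['host'] . $parts['path']
        . '?' . http_build_query($query, '', '&');
}

// Hypothetical helper: download one image and write it to disk.
// Returns false if the download fails or the response is empty.
function saveImage(string $imageUrl, string $targetFile): bool
{
    $data = @file_get_contents($imageUrl);
    if ($data === false || $data === '') {
        return false;
    }
    return file_put_contents($targetFile, $data) !== false;
}
```

The file extension to use for $targetFile depends on what the server actually sends, which I have not checked.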

 

After downloading the images we will need to run OCR on them to extract any useful information,

so at some stage we need to look at OCR libraries.

 

 

I think Google open-sourced one (Tesseract, if I am not mistaken), and since it is Google's, there is a good chance that it has a Python API.
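To stay in PHP, one option would be to call the Tesseract command-line tool instead of a library binding. The sketch below assumes that the tesseract binary is installed and on the PATH, that the downloaded images are in a format it can read, and that English is a suitable language setting; the function names are hypothetical.

```php
<?php
// Hypothetical helper: build the shell command that makes Tesseract print
// the recognized text to standard output ("stdout" as the output base).
function buildOcrCommand(string $imageFile, string $lang = 'eng'): string
{
    return 'tesseract ' . escapeshellarg($imageFile)
        . ' stdout -l ' . escapeshellarg($lang);
}

// Hypothetical helper: run OCR on one image file and return the text,
// or null if the command produced no output.
function ocrImage(string $imageFile, string $lang = 'eng'): ?string
{
    $output = shell_exec(buildOcrCommand($imageFile, $lang));
    if (!is_string($output) || trim($output) === '') {
        return null;
    }
    return $output;
}
```

How good the recognized text is will depend heavily on the image quality, so it is worth testing this on a handful of images before processing all of them.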

 

 

Can anybody review the approach? I look forward to hearing from you.

 
