Everything posted by dilbertone

1. Good day dear PHP freaks! This is a posting related to an image-display topic. I have a list of 5,500 websites and need to grab a little screenshot of each of them, to create a thumbnail that is ready to show - as a thumbnail, of course - on a website. How do I do that? Dynamically, by using file_get_contents($url):

$url = 'http://www.example.com';
$output = file_get_contents($url);

Or should I first download all the images, secondly store them in a folder on the server (as thumbnails), and thirdly retrieve them with a certain call?

The goal: I want to retrieve the image of a given website, as a screenshot. As an example of what I have in mind, have a look at www.drupal.org and the section "Sites Made with Drupal" there. You can see that the image changes from time to time - on every visit, I guess. Well, how do they do that? What is the solution?

With PHP it is easy to get the HTML contents of a web page by using file_get_contents($url), as shown above. Some musings about the method: what do you think - can I add a list of URLs to a database and then let the above-mentioned image gallery do a call and show the image, or should I fetch all the images with a Perl program (see below) or HTTrack and store them locally, to do calls to the locally stored files? Hmm, I hope that you understand my question... or do I have to explain it more? Which method is smarter, less difficult and easier to accomplish? It is pretty easy in any case: no scraping that goes into the depths of the site. Thank God it is that easy! With the second approach I can store the files in a folder, using the corresponding names.

To sum it up: this question is about choosing a method - fetching the data on the fly, e.g. with $output = file_get_contents($url), or getting the data (more than 5,500 images that are screenshots of given web pages, nothing more, nothing less), storing it locally, and doing calls to the local files. Which method is smarter? Love to hear from you! Greetings, dilbertone

Note: I only need the screenshots, nothing more.

Here is a Perl solution:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open(INPUT, "urls.txt") or die "Can't open file: $!";
while (<INPUT>) {
    chomp;
    $mech->get($_);
    my $png = $mech->content_as_png();
    open my $out, '>', "$_.png" or die "could not open '$_.png' for output: $!";
    binmode $out;
    print $out $png;
    close $out;
}
close(INPUT);
exit;

Filename: urls.txt (for example, like shown here):

www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com

From the docs: content_as_png() returns the given tab or the current page rendered as a PNG image. All parameters are optional; $tab defaults to the current tab. If the coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries: left, top, width, height. Well, this is specific to WWW::Mechanize::Firefox. Currently, the data transfer between Firefox and Perl is done Base64-encoded; it would be beneficial to find out what is necessary to make JSON handle binary data more gracefully.
And the alternative is to work with the dynamic solution, by using file_get_contents($url):

$url = 'http://www.example.com';
$output = file_get_contents($url);

Which is the smarter solution? Love to hear from you!
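Update: a minimal PHP sketch of the second approach (store locally), assuming the wkhtmltoimage command-line tool is installed; the tool choice, the thumbs/ folder and the naming scheme are assumptions for illustration, and any page-to-image renderer could be substituted:

<?php
// Sketch of the "download and store locally" approach. Assumes the
// wkhtmltoimage command-line tool is installed (an assumption; any
// page-to-image renderer could take its place).
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $url) {
    // Build a file name from the URL (hypothetical naming scheme)
    $file = 'thumbs/' . preg_replace('/[^a-z0-9.-]/i', '_', $url) . '.png';

    // Render the page to a PNG; escapeshellarg() guards the shell call
    $cmd = 'wkhtmltoimage ' . escapeshellarg('http://' . $url)
         . ' ' . escapeshellarg($file);
    exec($cmd, $output, $status);

    if ($status !== 0) {
        echo "Failed to render $url\n";
    }
}
?>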
2. On openSUSE 11.4 I installed and configured Apache and MySQL (as a LAMP stack), following this guide: http://mntechblog.wordpress.com/2011/04/29/webserver-installation-unter-opensuse-11-4/ Then I got the result in the browser when I entered localhost: it works! Well, that is great. Now I need to know how to "talk" to the MySQL server, which is also configured. By the way: what about installing phpMyAdmin in addition, as a tool that helps me manage a MySQL database? That is the most important question now! Look forward to hearing from you.
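Update: a minimal sketch of "talking" to the MySQL server from PHP with the mysqli extension. The credentials and database name below are placeholders; substitute whatever was set up during the installation:

<?php
// Minimal connection test; user, password and database are placeholders.
$db = new mysqli('localhost', 'root', 'secret', 'test');

if ($db->connect_error) {
    die('Connection failed: ' . $db->connect_error);
}

// Ask the server for its version as a simple "are you there?" check
$result = $db->query('SELECT VERSION()');
$row = $result->fetch_row();
echo 'Connected. MySQL server version: ' . $row[0] . "\n";

$db->close();
?>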
3. Hello dear community, good day! I need to get all the data out of this site; see the target: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder I am trying to scrape the data from the page, and I also need the data behind each record's link. I want to store the data in a MySQL DB for the sake of better retrieval! See an example of the records:

Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung

I need to have the data that is "behind" the link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie? Well, I could do this with XPath, in PHP or Perl (with Mechanize). I started with a PHP approach, but if I run the code (see below) I get this result:

martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
PHP Fatal error: Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5

caused by this code here:

<?php
// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// split it via body, so you only get to the contents inside the body tag
$split = split('<body>', $html);
// it is usually at the top of the array, but just check to be sure
$body = $split[1];
// split again with, say, <p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1];

// Find all links from the original html
foreach($html->find('a') as $element) {
    $link = $element->href;
    // check if this link is in our data table
    if(substr_count($data, $link) > 0) {
        // link is in our data table, follow the link
        $html = file_get_html($link);
        // do what you have to do
    }
}
?>

Some musings about my approach. The standard practice for scraping the pages would be:

1. Read the page into a string (file_get_html or whatever is being used now).
2. Split the string. This depends on the page structure: first split it via <body>, so one element of the array will contain the body, and so on until we get to our target. I'm guessing the final split would be by <p class="divider">A</p>, since it has the link we described above.
3. If we wish to follow the link, just repeat the same process, but using the link.
4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.

Well, my question is: what can cause these errors? I have no clue... it would be great if you have an idea. Look forward!

Update: Hmm, I could try this, admitting that it doesn't get any simpler than using simple_html_dom:
$records = array();
foreach($html->find('#content dl') as $contact) {
    $record = array();
    $record["name"] = $contact->find("dt", 0)->plaintext;
    foreach($contact->find("dd") as $field) {
        /* parse each $field->plaintext in order to obtain $fieldname */
        $record[$fieldname] = $field->plaintext;
    }
    $records[] = $record;
}

Well, I'll try to work from here. Perhaps I'll use a recent version of PHP to get the jQuery-like syntax... hmmm... any ideas? Look forward!
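Update: for comparison, a sketch of the same extraction using the native DOM extension with XPath instead of simple_html_dom. The "#content dl" structure is taken from the snippet above; whether the live page really uses that markup is an assumption:

<?php
// Sketch: extract the contact records with the native DOM extension.
// Assumes the page wraps each record in a <dl> inside id="content",
// as the simple_html_dom snippet above suggests.
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

$xpath = new DOMXPath($doc);
$records = array();

foreach ($xpath->query('//*[@id="content"]//dl') as $dl) {
    $record = array();
    // <dt> holds the foundation name, the <dd> nodes hold the fields
    foreach ($xpath->query('.//dt', $dl) as $dt) {
        $record['name'] = trim($dt->textContent);
    }
    foreach ($xpath->query('.//dd', $dl) as $i => $dd) {
        $record['field' . $i] = trim($dd->textContent);
    }
    $records[] = $record;
}

print_r($records);
?>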
4. Hello community, many many thanks for running this board. I love this site; it has helped me so often! You are great fellows. What I am doing today is working on a little PHP parser. I need to get all the data out of this site; see the target: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder I am trying to scrape the data from the page, and I also need the data behind each record's link. I want to store the data in a MySQL DB for the sake of better retrieval! I need to have the data that is "behind" the link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie? Well, I could do this with XPath, in PHP or Perl (with Mechanize). I started with a PHP approach, but if I run the code (see below) I get these results:

martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> cd foundations
martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
PHP Parse error: syntax error, unexpected '*' in /home/martin/perl/foundations/arbie_finder_de.php on line 3
martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
PHP Parse error: syntax error, unexpected T_FOREACH in /home/martin/perl/foundations/arbie_finder_de.php on line 17

caused by this code here:

<?php
// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// split it via body, so you only get to the contents inside the body tag
$split = split('<body>', $html);
// it is usually at the top of the array, but just check to be sure
$body = $split[1];
// split again with, say, <p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1]

// Find all links from the original html
foreach($html->find('a') as $element) {
    $link = $element->href;
    // check if this link is in our data table
    if(substr_count($data, $link) > 0) {
        // link is in our data table, follow the link
        $html = file_get_html($link);
        // do what you have to do
    }
}
?>

Some musings about my approach. The standard practice for scraping the pages would be:

1. Read the page into a string (file_get_html or whatever is being used now).
2. Split the string. This depends on the page structure: first split it via <body>, so one element of the array will contain the body, and so on until we get to our target. I'm guessing the final split would be by <p class="divider">A</p>, since it has the link we described above.
3. If we wish to follow the link, just repeat the same process, but using the link.
4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.

Well, my question is: what can cause these errors? I have no clue... it would be great if you have an idea. Look forward!
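Update: a guess at the console output above, with a corrected version of the snippet. These are readings of the pasted code, not verified fixes: the "unexpected T_FOREACH" would fit the missing semicolon after $data = $split[1], and the include of simple_html_dom.php is added so that file_get_html() is defined at all (assuming the library file sits next to the script). explode() stands in for the deprecated split(), since both delimiters are literal strings:

<?php
// Corrected version of the snippet above; two readings of the errors,
// not verified fixes.
include 'simple_html_dom.php'; // defines file_get_html()

$html = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// explode() instead of the deprecated split(); the delimiters are literal
$split = explode('<body>', $html);
$body  = $split[1];

$split = explode('<p class="divider">A</p>', $body);
$data  = $split[1];   // <-- semicolon added here

foreach ($html->find('a') as $element) {
    $link = $element->href;
    if (substr_count($data, $link) > 0) {
        $html = file_get_html($link);
        // do what you have to do
    }
}
?>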
5. Hi Mikesta, thanks for the quick answer. I will try to write the code for 1. fetching, 2. parsing, 3. storing in a local MySQL database. I'll come back and report everything; then you can have a closer look and give a code review... Greetings, db1
6. Hello dear mikesta, many thanks for the quick answer. Great to hear from you! You help me a lot! One question, since I am a PHP newbie: can I apply this code to another target as well? http://buergerstiftungen.de/cps/rde/xchg/SID-F8780E81-ABF20567/buergerstiftungen/hs.xsl/db.htm Is it possible to fetch the results there too (approx. 230 different records) and parse them? Look forward to hearing from you. Best regards, db1
7. Good day dear community, this is a big issue. I have to decide between the native PHP DOM extension and the Simple HTML DOM parser. I want to parse this site: http://buergerstiftungen.de/cps/rde/xchg/SID-A7DCD0D1-702CE0FA/buergerstiftungen/hs.xsl/db.htm I would suggest using the native PHP DOM extension instead of the Simple HTML DOM parser, since it will be much faster and easier. What do you think about this one here:

$doc = new DOMDocument();
@$doc->loadHTMLFile('...URL....'); // Using the @ operator to hide parse errors
$contents = $doc->getElementById('content')->nodeValue; // Text contents of #content

Look forward to hearing from you. Best regards, db1
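Update: a slightly fuller sketch of that idea, with a guard in case the document has no element with the id "content"; whether the target page actually uses that id is an assumption:

<?php
// Fuller sketch of the native-DOM approach; the id "content" is an
// assumption about the target page's markup.
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://buergerstiftungen.de/cps/rde/xchg/SID-A7DCD0D1-702CE0FA/buergerstiftungen/hs.xsl/db.htm');

$node = $doc->getElementById('content');
if ($node === null) {
    die("No element with id 'content' found; inspect the page source.\n");
}

echo trim($node->nodeValue) . "\n";
?>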
8. Hi, many thanks for the answer! I will try to figure it out; at the weekend I want to figure it all out. I'll come back and report all my findings... Regards, db1
9. I am trying to scrape the data from a webpage, and I need to get all the data behind the links on http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder (see the link, follow it, and get the full details):

include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
$info1 = $html1->find('b[class=[what to enter here?]', 0);

What is wanted: I need to have all the data out of this site, for example:

Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung

What is needed: I need the data that is "behind" the link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie? That would be more than great. One word regarding regex: I do not have too much experience, but I guess DOMDocument is the smarter way here...
10. Hi folks, I am trying to scrape the price data from a webpage, but how do I get the variable? See the target here: http://www.buergerstiftungen.de/cps/rde/xchg/SID-635E10F9-BFA1A4BF/buergerstiftungen/hs.xsl/db.htm

include 'simple_html_dom.php';
$html1 = file_get_html('http://www.buergerstiftungen.de/cps/rde/xchg/SID-635E10F9-BFA1A4BF/buergerstiftungen/hs.xsl/db.htm');
$info1 = $html1->find('b[class=[what to enter here?]', 0);

The variable then also contains data such as <b class="info"> info </b>. Is there a way to trim the unwanted data out? I just need all the infos out of the site. I am unsure whether I should do it during the find, or perhaps specify what I want when I echo out the variable. And how do I do it without a regex, but with DOMDocument?
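Update: a sketch of what the find() call might look like, assuming the wanted pieces really sit in <b class="info"> elements as in the fragment quoted above, and that simple_html_dom.php lies next to the script. The ->plaintext property returns the text without the surrounding tags, so no extra trimming of markup should be needed:

<?php
// Sketch: pull the text out of every <b class="info"> element.
// That selector is taken from the fragment above and is an assumption.
include 'simple_html_dom.php';

$html1 = file_get_html('http://www.buergerstiftungen.de/cps/rde/xchg/SID-635E10F9-BFA1A4BF/buergerstiftungen/hs.xsl/db.htm');

foreach ($html1->find('b.info') as $b) {
    // ->plaintext strips the <b> tags and returns only the text
    echo trim($b->plaintext) . "\n";
}
?>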
11. Hello CueL3SS, many many thanks! Great to read you and your ideas! I will try everything that is written in the thread! Greetings
12. Hello dear xyph, many many thanks for the quick reply! Thanks for the hint; I will try it out! With this:

function do_reg($text, $regex)
{
    if (preg_match($regex, $text, $regs)) {
        return $regs[0];
    }
    return "";
}

or this, which collects and returns all matches:

function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result[0]; // all full-pattern matches
}

I will try them out and see which regex fits best. Again, many thanks for any and all help! db1
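Update: a throwaway usage sketch for the first helper; the pattern and the sample text are made up purely for illustration:

<?php
// Illustrative only: pull a phone entry out of a scraped record
// using the first do_reg() helper defined above.
$text  = 'Telefon: 0241 - 4500130 Telefax: 0241 - 4500131';
$match = do_reg($text, '/Telefon:\s*[\d\s-]+/');
echo $match . "\n"; // prints the Telefon part of the string
?>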
13. Hello gizmola, many many thanks, I will have a look! On a side note: I also want to check whether the parser part runs... I guess the part that FETCHES the data is not as complicated as the part that parses the data... Anyway, many many thanks for any and all help! I'll come back and report all my findings! Greetings, db1

UPDATE: Hello dear friends, this gives back something very, very interesting... well, see for yourself (I have posted only a part; the whole output is too big to post here...). But wait: what can I do now? Note: I want to get all the data out of this database: http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de At the moment we have these results. How to proceed? ... and so forth and so forth!
14. I have to parse the stuff in order to get the following data; see the site with examples: http://www.aktive-buergerschaft.de/buergerstiftungsfinder Note the link ">> Weitere Details zu dieser Stiftung" there: I need to grab the data that is "behind" this link!
15. Hello dear friends. First of all: a big, big SORRY for being the newbie. @gizmola: I made the changes; well, I guess that I made the changes... but I had no luck. Look and see, this is so mystic, I do not know what is going on here:

<?php
// Original PHP code by Chirp Internet: http://www.chirp.com.au
// Please acknowledge use of this code by including this header.
$url = "http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de";

//$input = @file_get_contents($url) or die("Could not access file: $url");
$input = file_get_contents($url) or die("Could not access file: $url");

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        // $match[2] = all the data i want to collect...
        // $match[3] = text that i need to collect - see a detail-page
        print_r($matches)
    }
}
?>

This gives back:

martin@suse-linux:~/perl/foundations> php fondations_de.php
PHP Parse error: syntax error, unexpected '}' in /home/martin/perl/foundations/fondations_de.php on line 20
16. Hi gizmola, hello xyph, thanks for answering! I am happy, many many thanks. That means I have to rewrite this part? gizmola, I am a newbie; I will try it. But perhaps you can help me? Update: I will definitely try this one, print_r($matches), to see what is going on. Look forward. Best regards, martin
17. Hello dear community, good day. I have a problem: a parser that does not parse. It does not work! It gives back nothing at all!

<?php
// Original PHP code by Chirp Internet: http://www.chirp.com.au
// Please acknowledge use of this code by including this header.
$url = "http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de";

//$input = @file_get_contents($url) or die("Could not access file: $url");
$input = file_get_contents($url) or die("Could not access file: $url");

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        // $match[2] = all the data i want to collect...
        // $match[3] = text that i need to collect - see a detail-page
    }
}
?>

Well, it goes a bit over my head what I have done here: it does not give back any results!? I look forward to hearing from you! Regards
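Update: one observation about the snippet above: the foreach body is empty, so even when the regex matches, nothing is ever printed. Here is the same loop with output added; whether the server answers a plain file_get_contents() request at all is a separate question:

<?php
// Same regex loop as above, but actually printing the captures.
// An empty loop body would explain the silent run.
$url = "http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de";
$input = file_get_contents($url) or die("Could not access file: $url");

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo "href: " . $match[2] . "\n"; // link target
        echo "text: " . $match[3] . "\n"; // link text
    }
} else {
    echo "No links matched; check what \$input actually contains.\n";
}
?>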
18. I am currently writing a little parser & harvester that collects the data of this website: http://www.aktive-buergerschaft.de/buergerstiftungsfinder I want to have all foundations that are listed on this page (see the examples in my earlier posts). Well, I think that I need to choose between file_get_contents and cURL to fetch the data. And I have to use some ideas for a parser; I do not know which one I should use here. Can you give me some hints?

First I present my FETCHING part, with cURL. I have never needed to use cURL myself, but the obvious resource is php.net's example (plus CURLOPT_RETURNTRANSFER, so that $data actually receives the page):

<?php
// create a new cURL resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it

// grab the URL
$data = curl_exec($ch);

// close cURL resource, and free up system resources
curl_close($ch);

// Then you can use $data for parsing
?>

Well, to be frank: if we do not have cURL, a slower function is file_get_contents(); this will work too! I think it is just about 1-2 seconds slower, but the call is much easier:

<?php
$html = file_get_contents('http://www.example.com');
// now all the html is in $html
?>

Anyway, I think the much more interesting part is the parsing. I have to parse the stuff in order to get the data; see the site with examples: http://www.aktive-buergerschaft.de/buergerstiftungsfinder Note the link ">> Weitere Details zu dieser Stiftung" there: I need to grab the data that is "behind" this link!
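Update: a sketch of the parsing part, collecting the "Weitere Details zu dieser Stiftung" links with the native DOM extension. That the detail pages are reachable through ordinary <a> elements containing that phrase is an assumption about the page's markup:

<?php
// Sketch: collect all "Weitere Details" detail-page links.
$html = file_get_contents('http://www.aktive-buergerschaft.de/buergerstiftungsfinder');

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ hides warnings from sloppy real-world markup

$xpath = new DOMXPath($doc);
$links = array();

foreach ($xpath->query("//a[contains(., 'Weitere Details')]") as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links); // detail-page URLs, ready to be fetched one by one
?>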
19. Hello dear friends, hello community. I want to parse the site http://www.aktive-buergerschaft.de/buergerstiftungsfinder and I have a true beginner question regarding arrays, in order to work with the data. The results should be stored in a MySQL database. You see, I have a cURL approach, and I want to hand the parsed data over into an array (if viewing the output in a web browser, do view-source to see the structured array). Here are the musings that led to the code: depending on the number of dimensions, we could do foreach($result as $k => $v) { echo $k . " " . $v; } to view and work with the data, adding sub-foreach-loops accordingly if $v is itself an array. What do you think:

$ch = curl_init("http://www.aktive-buergerschaft.de/buergerstiftungsfinder");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$buffer = curl_exec($ch);
curl_close($ch);

$result = explode(",", $buffer);
foreach ($result as $k => $v) {
    echo $k . " " . $v;
}
// print_r($result);

Look forward to hearing from you!! db1
20. Well, if we enlarge the code to this, I need some ideas on how to store the data into MySQL:

$ch = curl_init("http://www.aktive-buergerschaft.de/buergerstiftungsfinder");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$buffer = curl_exec($ch);
curl_close($ch);

$result = explode(",", $buffer);
print_r($result);

What is needed now: I need the details for the records, in order to store all the results in a MySQL DB. Can anybody enlarge the code bits above and give me a hint? I look forward to this...
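Update: a sketch of the storage step with mysqli prepared statements. The credentials, the table "foundations" and its two columns are placeholders for whatever schema is actually set up:

<?php
// Storage sketch: write parsed records into MySQL. Credentials,
// table and column names are placeholders.
$db = new mysqli('localhost', 'root', 'secret', 'test');
if ($db->connect_error) {
    die('Connection failed: ' . $db->connect_error);
}

// Example shape of the parsed data (one made-up record):
$records = array(
    array('name' => 'Bürgerstiftung Lebensraum Aachen', 'details' => '...'),
);

$stmt = $db->prepare('INSERT INTO foundations (name, details) VALUES (?, ?)');

foreach ($records as $r) {
    $stmt->bind_param('ss', $r['name'], $r['details']);
    $stmt->execute();
}

$stmt->close();
$db->close();
?>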
21. Hello dear community, today I want to apply cURL to a very simple example that gets an HTTP page, in order to harvest its data. See this totally easy page: http://www.aktive-buergerschaft.de/buergerstiftungsfinder Here we have a list of foundations: a bunch of approx. 1,000 records. Well, my intention is to store the data in a local database (my favourite DB is MySQL). Here is my simple approach; what is missing are two parts: the processing of the results, and the storing of the parser's results in the MySQL DB. This part goes somewhat over my head. The result of the fetching with cURL should be put into arrays, shouldn't it? If someone can help me here I would be glad!

<?php
// A very simple example that gets a HTTP page.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.aktive-buergerschaft.de/buergerstiftungsfinder");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // added: return the page so it can be processed
$data = curl_exec($ch);
curl_close($ch);
// well here i want to put the results into an array... [multidimensional or not!?]
?>

See the example on that page (at the top): the link "Weitere Details zu dieser Stiftung", which in English is "More details on this foundation". This link has to be followed and the results also parsed. If you follow the link you see there are some additional infos that should be stored too! What is needed now: I need the details for the arrays. Can anybody enlarge the code bits above and give me a hint? I look forward to this... db1
22. Hello dear folks, good evening dear community. I need a starting point! A German DB that collects the data of all German foundations; see: http://www.suche.stiftungen.org/index.php?strg=87_124&baseID=129 Here we find all foundations in Germany: 8,074 different foundations. You get the full results if you choose % as a wildcard in the search field. How to do this with PHP? I think that we have to do this with cURL or with file_get_contents; those are the best methods for it. What do you think, personally? I am curious to hear your ideas, so please let me know what you think! By the way, XPath and the DOM technique can probably be used too, I guess?

On a side note: if you do the wildcard search, you get some kind of overflow... 350 results are the limit; more cannot be shown. So the question is: how can we create a spider that runs across the site and asks step by step, so that we get all 8,074 results?

The second question: we get datasets like the following:

Name: Allers'sche Tagelöhnerstiftung Landesstube des alten Landes Wursten
Street: Westerbüttel 13
Postal code and town: 27632 Dorum
Additional info: Fördernd: Ja
Additional info: Operativ: Ja
Webpage: http://www.sglandwursten.de
Main area of work: Aufgabengebiete: Mildtätigkeit, Kinder-/Jugendhilfe
Regional restrictions: Regionale Einschränkungen: PLZ 27632, 27637, 27638, 27607, Mitgliedsgemeinden im Bereich der Samtgemeinde Land Wursten, Nordholz, Imsum, verschiedene Gemeinden im Bereich der Samtgemeinde Land Wursten, Gemeinde Nadholz
Target group: Zielgruppen: Feste Destinatäre: Bewohner DRK-Alten- und Pflegeheim, Kinder, Jugendliche, Landarbeiter

All the datasets are similar; they seem to look exactly like this... The question is: can this be stored directly in a MySQL DB? Note: some descriptions are very, very long; I guess an Excel sheet would be overloaded by this. What do you think, is this doable? Love to hear from you. Best regards, db1
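Update: a sketch of the step-by-step spider idea. The search parameter name "search" below is a pure placeholder; the real form field name would have to be read from the page source:

<?php
// Spider sketch for working around the 350-result cap: query the
// search with every two-letter combination. "search" is a placeholder
// parameter name; inspect the real form to find the correct one.
$base = 'http://www.suche.stiftungen.org/index.php?strg=87_124&baseID=129';

foreach (range('a', 'z') as $first) {
    foreach (range('a', 'z') as $second) {
        $term = $first . $second;
        $page = @file_get_contents($base . '&search=' . urlencode($term));

        if ($page === false) {
            continue; // skip failed requests
        }

        // Hand $page to the parsing step here; if a pair still hits
        // the 350-result limit, recurse with three letters (aca, acb, ...)
        echo "fetched results for '$term'\n";
    }
}
?>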
23. Hello dear community, good evening! This is for the purpose of scraping a dataset with 2,700+ records on foundations in Switzerland; you see it here: http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de

<?php
// Original PHP code by Chirp Internet: www.chirp.com.au
// Please acknowledge use of this code by including this header.
$url = "http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de";
$input = @file_get_contents($url) or die("Could not access file: $url");

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        // $match[2] = all the data i want to collect...
        // $match[3] = text that i need to collect - see a detail-page
    }
}
?>

Well, to be frank, I am not sure: my console gives back some bad errors... Can you please help me with this issue? Love to hear from you. db1

By the way, see a detail page: http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=3221 with the following information:

Name: "baiji.org" Foundation
Schlüsselwort: BAIJI
Adresse: Seefeldstr. 94, 8008 Zürich
Mail: august@baiji.com
Zweck:

And a translation of the field labels: Name -> name, Schlüsselwort -> keyword, Adresse -> address, Mail -> mail, Zweck -> purpose.
24. A German DB that collects the data of all German foundations; see: http://www.suche.stiftungen.org/index.php?strg=87_124&baseID=129 Here we find all foundations in Germany: 8,074 different foundations. You get the full results if you choose % as a wildcard in the search field. But if we do that, we get some kind of overflow... 350 results are the limit; more cannot be shown. So the question is: how can we create a spider that runs across the site and asks step by step, so that we get all 8,074 results?

The way to get through this database is to search combinations of letters, e.g. "ac", and select "search only titles". Then go through every pair of letters. If you still get too many results for a particular pair, use 3 letters: aca, acb, ... Can I do this with file_get_contents or with cURL (e.g. multi-cURL)? Well, I want to make a little script that does this; I need to create a little automation that performs the task automatically.

Regarding the destination database, it is all going into SQLite, which we believe can handle large enough sets of data without any problems. We can download the database as a file too. For its capabilities, see here: http://www.sqlite.org/limits.html The question is: how to create the first approach to the parser? Can anybody assist?
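Update: a sketch of the SQLite side with PDO; the file name and the one-table schema are placeholders for whatever the parser actually produces:

<?php
// SQLite storage sketch using PDO; file name and schema are placeholders.
$pdo = new PDO('sqlite:foundations.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE IF NOT EXISTS foundations (
    name TEXT,
    street TEXT,
    town TEXT,
    webpage TEXT
)');

$stmt = $pdo->prepare('INSERT INTO foundations (name, street, town, webpage)
                       VALUES (:name, :street, :town, :webpage)');

// One made-up record to show the call shape:
$stmt->execute(array(
    ':name'    => "Allers'sche Tagelöhnerstiftung",
    ':street'  => 'Westerbüttel 13',
    ':town'    => '27632 Dorum',
    ':webpage' => 'http://www.sglandwursten.de',
));
?>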
25. Hello dear community. I am trying to find a way to use file_get_contents for a download of a set of pages. Can anybody review my approach? As I thought, I can find all 790 result pages within a certain range between Id=0 and Id=100000, so I thought I could go through them with a loop:

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html

How do I mechanize this with a loop from 0 to 10000 and throw out the 404 responses? Once we reach a valid page, we could use BeautifulSoup to get the content (in our case the image file address), but we could also just loop through the images directly with simple web requests. Well, how to proceed? Like this:

<?php
// creating a stream context
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);

// fetch a page using the context
$file = file_get_contents('http://www.example.com/', false, $context);
?>

A typical page is http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Html and the related image is at http://www.foundationfinder.ch/ShowDetails.php?Id=134&InterfaceLanguage=&Type=Image After downloading the images we will need to OCR them to extract any useful info, so at some stage we need to look at OCR libs. I think Google open-sourced one, and since it is Google there is a good chance it has a Python API. Can anybody review the approach? Look forward to hearing from you.
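Update: a sketch of the Id loop that throws out 404 responses. cURL is used here because it exposes the HTTP status code; the 0 to 10000 range comes from the post above, and the half-second pause is a politeness guess:

<?php
// Loop over the Id range and skip everything that is not a valid page.
for ($id = 0; $id <= 10000; $id++) {
    $url = "http://www.foundationfinder.ch/ShowDetails.php?Id=$id&InterfaceLanguage=&Type=Html";

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $page = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($page === false || $code == 404) {
        continue; // no record behind this Id
    }

    // Valid page: parse it here, or fetch the matching &Type=Image URL
    echo "Id $id looks valid\n";

    usleep(500000); // half a second between requests, out of politeness
}
?>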