
Constructing the URLs in a loop and using PHP to fetch up to a thousand pages


dilbertone


 

I am new to PHP and want to learn something about it.

 

Currently I have a little project: I want to collect the links that this site presents:

http://www.educa.ch/dyn/79363.asp?action=search

[search with the wildcard %]

 

I parse the search page like this:

<?php
// Fetch the search page and read the total number of results
$data  = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');
$regex = '/Page 1 of (.+?) results/';
preg_match($regex, $data, $match);

var_dump($match);
echo $match[1];   // the result count captured by the regex
?>

in order to then fetch the detail pages, for example:

 

http://www.educa.ch/dyn/79376.asp?id=4438

http://www.educa.ch/dyn/79376.asp?id=2939

 

If we are looping over a set of values, then we need to supply them as an array. I would guess something like this.

 

As I am not sure which ids are actually filled with content, I have to loop from 1 to 10000. That way I make sure I get all the data.

 

What do you think!?

 

for ($i = 1; $i <= 10000; $i++) {
    // body of loop
}

 

 

according to the following description: http://www.php.net/manual/en/control-structures.for.php
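To make that more concrete, here is a minimal sketch of what the loop body could do, assuming the detail pages live at 79376.asp?id=N and that an unused id simply returns an empty page (an assumption that would need to be checked against the real server):

<?php
// Minimal sketch: loop over the id range and fetch each detail page.
// Assumes an unused id returns an empty body; verify this before relying on it.
for ($i = 1; $i <= 10000; $i++) {
    $url  = 'http://www.educa.ch/dyn/79376.asp?id=' . $i;
    $html = @file_get_contents($url);    // @ suppresses warnings for failed requests

    if ($html === false || trim($html) === '') {
        continue;                         // nothing there, try the next id
    }

    // Save the raw page for later parsing (the file name is just an example)
    file_put_contents('page_' . $i . '.html', $html);

    sleep(1);                             // be polite to the remote server
}
?>

Fetching 10,000 pages this way will take a while; the sleep() call is only there so the script does not hammer the server.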


Hi Joel24,

 

Thanks for writing.

 

You asked: "To try and read 10,000 pages from an external source and search those pages for content will take up a lot of resources and time. What exactly are you trying to get from those pages?"

 

See the pages - this is an open server that everybody is free to read and use: a governmental database run in Switzerland. The server provides addresses for schools.

 

Have a closer look:

 

http://www.educa.ch/dyn/79376.asp?id=4438

 

http://www.educa.ch/dyn/79376.asp?id=2939

 

 

nothing harmful

 

I want to read the addresses with PHP or Perl.

 

 

As you probably know, the 'Detail' link displays the address. The link is triggered by a JavaScript onclick handler with a dynamic id at the end, which opens the detail page:

<a href="#73" onclick="javascript: window.open('79376.asp?id=375','Detail','width=400,height=300,left=0,top=0');">Detail</a>

 

To lessen the server load, I would set up a database and then create a program to crawl educa.ch, using regular expressions to extract each URL ('79376.asp?id=375', '79376.asp?id=324', etc.) from the onclick handlers, and then store the contents in the database, preferably sorted into corresponding fields: address, email, etc.
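Perhaps something roughly like this for the URL extraction - untested, and only a sketch, assuming the onclick attributes all look like the example above:

<?php
// Rough sketch: pull the '79376.asp?id=NNN' URLs out of the onclick
// attributes of a fetched search result page.
$html = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');

preg_match_all("/window\.open\('(79376\.asp\?id=\d+)'/", $html, $matches);

foreach ($matches[1] as $relative) {
    echo 'http://www.educa.ch/dyn/' . $relative . "\n";   // full detail URL
}
?>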

 

Then you would need to extract the address from that detail page; how you would go about separating the address from the other content I am unsure. A crafty regular expression may do the job. You could easily pull the email, as it is an anchor link with href='mailto:[email protected]'.
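For the email, a pattern like this might do it - again untested, and the exact markup of the detail page would have to be checked:

<?php
// Sketch: pull the e-mail address out of one detail page via its mailto: link.
$detail = file_get_contents('http://www.educa.ch/dyn/79376.asp?id=4438');

// Accept either single or double quotes around the href value
if (preg_match('/href=["\']mailto:([^"\']+)["\']/i', $detail, $match)) {
    echo 'E-mail: ' . $match[1] . "\n";
}
?>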

 

I'm not experienced enough with regular expressions to take these much further, so you'll have to find someone who is. Good luck!

Hello Joel24,

 

Many thanks for the reply. Regex is one solution. I am currently reading some docs that cover DOMDocument, which is probably a solution for the parser job.
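For what it's worth, a first attempt with DOMDocument could look roughly like this - only a sketch that grabs the mailto link; extracting the address fields would need the real page structure:

<?php
// Sketch: parse a detail page with DOMDocument instead of a regex.
$html = file_get_contents('http://www.educa.ch/dyn/79376.asp?id=4438');

$doc = new DOMDocument();
@$doc->loadHTML($html);                     // @ hides warnings about sloppy HTML

// Walk all anchors and pick out the mailto: link
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, 'mailto:') === 0) {
        echo substr($href, strlen('mailto:')) . "\n";
    }
}
?>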

 

Concerning the fetching, I am thinking about using cURL. It is pretty powerful.
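A basic cURL fetch of one detail page might look like this - just a sketch to compare with file_get_contents:

<?php
// Sketch: fetch one detail page with cURL, which allows timeouts and
// proper error handling.
$ch = curl_init('http://www.educa.ch/dyn/79376.asp?id=4438');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // give up after 10 seconds

$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>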

 

I will come back and report all my findings.

 

regards

 

