
Constructing the URLs in a loop and using PHP to fetch up to a thousand pages


dilbertone


 

I am new to PHP and want to learn something about it.

 

Currently I have a little project: I want to collect the links that this site presents:

http://www.educa.ch/dyn/79363.asp?action=search

[search with the wildcard %]

 

I parse the search page like this:

<?php
// Fetch the search page and read the total number of results
$data  = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');
$regex = '/Page 1 of (.+?) results/';
preg_match($regex, $data, $match);

var_dump($match);
echo $match[1];   // the result count captured by the regex
?>

in order to then fetch the detail pages, for example:

 

http://www.educa.ch/dyn/79376.asp?id=4438

http://www.educa.ch/dyn/79376.asp?id=2939

 

If we are looping over a set of values, then we need to supply them as an array. I would guess something like this.

 

As I am not sure which ids are actually filled with content, I have to loop from 1 to 10000. That way I make sure I get all the data.

 

What do you think!?

 

for ($i = 1; $i <= 10000; $i++) {
    // body of loop
}

 

 

according to the following description: http://www.php.net/manual/en/control-structures.for.php
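To make that more concrete, here is a minimal sketch of what the loop body could do, assuming the detail pages live at 79376.asp?id=N and that an unused id simply returns an empty page (an assumption that would need to be checked against the real server):

<?php
// Minimal sketch: loop over the id range and fetch each detail page.
// Assumes an unused id returns an empty body; verify this before relying on it.
for ($i = 1; $i <= 10000; $i++) {
    $url  = 'http://www.educa.ch/dyn/79376.asp?id=' . $i;
    $html = @file_get_contents($url);    // @ suppresses warnings for failed requests

    if ($html === false || trim($html) === '') {
        continue;                         // nothing there, try the next id
    }

    // Save the raw page for later parsing (the file name is just an example)
    file_put_contents('page_' . $i . '.html', $html);

    sleep(1);                             // be polite to the remote server
}
?>

Fetching 10,000 pages this way will take a while; the sleep() call is only there so the script does not hammer the server.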


Hi Joel24,

 

Thanks for writing.

 

You asked: "To try and read 10,000 pages from an external source and search those pages for content will take up a lot of resources and time. What exactly are you trying to get from those pages?"

 

See the pages - this is an open server that everybody is free to read and use: a governmental database run in Switzerland. The server provides addresses for schools.

 

Have a closer look:

 

http://www.educa.ch/dyn/79376.asp?id=4438

 

http://www.educa.ch/dyn/79376.asp?id=2939

 

 

nothing harmful

 

I want to read the addresses with PHP or Perl.

 

 

As you probably know, the 'Detail' link displays the address. The link is triggered by a JavaScript onclick handler with a dynamic id at the end, which opens the detail page:

<a href="#73" onclick="javascript: window.open('79376.asp?id=375','Detail','width=400,height=300,left=0,top=0');">Detail</a>

 

To lessen the server load, I would set up a database and then create a program to crawl educa.ch, using regular expressions to extract each URL ('79376.asp?id=375', '79376.asp?id=324', etc.) from the onclick handlers, and then store the contents in the database, preferably sorted into corresponding fields: address, email, etc.
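Perhaps something roughly like this for the URL extraction - untested, and only a sketch, assuming the onclick attributes all look like the example above:

<?php
// Rough sketch: pull the '79376.asp?id=NNN' URLs out of the onclick
// attributes of a fetched search result page.
$html = file_get_contents('http://www.educa.ch/dyn/79363.asp?action=search');

preg_match_all("/window\.open\('(79376\.asp\?id=\d+)'/", $html, $matches);

foreach ($matches[1] as $relative) {
    echo 'http://www.educa.ch/dyn/' . $relative . "\n";   // full detail URL
}
?>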

 

Then you would need to extract the address from that detail page; how you would go about separating the address from the other content I am unsure. A crafty regular expression may do the job. You could easily pull the email, as it is an anchor link with href='mailto:[email protected]'.
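For the email, a pattern like this might do it - again untested, and the exact markup of the detail page would have to be checked:

<?php
// Sketch: pull the e-mail address out of one detail page via its mailto: link.
$detail = file_get_contents('http://www.educa.ch/dyn/79376.asp?id=4438');

// Accept either single or double quotes around the href value
if (preg_match('/href=["\']mailto:([^"\']+)["\']/i', $detail, $match)) {
    echo 'E-mail: ' . $match[1] . "\n";
}
?>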

 

I'm not experienced enough with regular expressions to take these much further, so you'll have to find someone who is. Good luck!

Hello Joel24,

 

Many thanks for the reply. Regex is one solution. I am currently reading some docs that cover DOMDocument, which is probably a solution for the parser job.
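For what it's worth, a first attempt with DOMDocument could look roughly like this - only a sketch that grabs the mailto link; extracting the address fields would need the real page structure:

<?php
// Sketch: parse a detail page with DOMDocument instead of a regex.
$html = file_get_contents('http://www.educa.ch/dyn/79376.asp?id=4438');

$doc = new DOMDocument();
@$doc->loadHTML($html);                     // @ hides warnings about sloppy HTML

// Walk all anchors and pick out the mailto: link
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, 'mailto:') === 0) {
        echo substr($href, strlen('mailto:')) . "\n";
    }
}
?>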

 

Concerning the fetching, I am thinking about using cURL. It is pretty powerful.
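A basic cURL fetch of one detail page might look like this - just a sketch to compare with file_get_contents:

<?php
// Sketch: fetch one detail page with cURL, which allows timeouts and
// proper error handling.
$ch = curl_init('http://www.educa.ch/dyn/79376.asp?id=4438');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // give up after 10 seconds

$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);
?>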

 

I will come back and report all my findings.

 

regards

 

