DOMDocument parser: I need a starting point


dilbertone

Good day, dear PHPFreaks, and hello to everybody.

 

 

I want to create a link parser, and I have chosen to do the fetching with cURL. I have a few lines together now and would love to hear your review. Since I am new to programming, I would appreciate some hints from experienced devs.

 

Here are some details: there are several hundred result pages derived from this one: http://www.educa.ch/dyn/79362.asp?action=search

 

Note: I want to iterate over the result pages with a loop. Two examples of such pages:

 

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

 

 

I use this loop:

// $match[1] is assumed to hold the number of pages (it is set elsewhere)
for ($i = 1; $i <= $match[1]; $i++) {
    $url = "http://www.example.com/page?page={$i}";
    // access new sub-page, extract necessary data
}

 

What do you think? What about the loop over the target URLs?

 

BTW: as you can see, some of the pages will be empty. The empty pages should be thrown away; I do not want to store "empty" stuff.
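A minimal sketch of the loop over the target URLs plus an "is this page empty?" check. Two things here are assumptions rather than facts about the site: that a page counts as empty when its body contains no visible text, and the helper names detailUrls() and isEmptyPage(), which are made up for illustration. The URL pattern is the one from the example links above.

```php
<?php
// Build the detail-page URLs for a range of ids.
// The URL pattern is taken from the example links in this post.
function detailUrls(int $firstId, int $lastId): array
{
    $urls = [];
    for ($id = $firstId; $id <= $lastId; $id++) {
        $urls[$id] = "http://www.educa.ch/dyn/79376.asp?id={$id}";
    }
    return $urls;
}

// ASSUMPTION: a page is "empty" when its <body> has no text at all.
// The real empty pages may look different (e.g. a fixed error message),
// in which case this check has to be adapted.
function isEmptyPage(string $html): bool
{
    if (trim($html) === '') {
        return true; // nothing was returned at all
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null || trim($body->textContent) === '';
}
```

In the loop, each fetched page would first go through isEmptyPage() and only be parsed and stored when that returns false.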

 

Well, this is what I want to do. And now I need a good parser script.

 

Note: this is a three-part job:

 

1. fetching the sub-pages

2. parsing them

3. storing the data in a mysql-db
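For step 3, a rough sketch of how a parsed record could be stored with PDO and a prepared statement. The table name schools, its columns, and the function names are hypothetical; nothing like that exists yet, so adapt them to the real schema. Records in which every field is blank are skipped instead of inserted.

```php
<?php
// True when every field of the record is blank after trimming.
function isEmptyRecord(array $record): bool
{
    foreach ($record as $value) {
        if (trim((string) $value) !== '') {
            return false;
        }
    }
    return true;
}

// Insert one record into a HYPOTHETICAL table "schools".
// Returns false when the record was skipped or the insert failed.
function saveRecord(PDO $pdo, array $record): bool
{
    if (isEmptyRecord($record)) {
        return false; // nothing worth storing
    }

    $stmt = $pdo->prepare(
        'INSERT INTO schools (name, street, city, email, phone, fax)
         VALUES (:name, :street, :city, :email, :phone, :fax)'
    );

    // Missing fields are stored as empty strings.
    return $stmt->execute([
        ':name'   => $record['name']   ?? '',
        ':street' => $record['street'] ?? '',
        ':city'   => $record['city']   ?? '',
        ':email'  => $record['email']  ?? '',
        ':phone'  => $record['phone']  ?? '',
        ':fax'    => $record['fax']    ?? '',
    ]);
}
```

The $pdo handle would come from something like new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', $user, $pass), where the host, database name, and credentials are placeholders.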

 

Well, the problem: some of the above-mentioned pages are empty, so I need to find a way to leave them aside, because I do not want to fill my MySQL DB with useless entries.

 

BTW, the parsing should be doable with DOMDocument. What do you think? I need to combine the first part with the second. Can you give me some starting points and hints on how to do this?

 

The fetching job should be done with cURL, and the fetched data then handed over to a DOMDocument parser job.

The fetching is no problem, but how do I do the DOMDocument job?
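A minimal sketch of how the two parts could be combined: cURL fetches the page as a string, and DOMDocument::loadHTML() turns that string into a DOM tree. The function names fetchHtml() and loadDom() and the 15 second timeout are my own choices for illustration, not anything prescribed.

```php
<?php
// Fetch a URL with cURL; returns the body, or null on any failure
// (transport error or a status code other than 200).
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds; an arbitrary choice
    ]);
    $html   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($html !== false && $status === 200) ? $html : null;
}

// Parse an HTML string into a DOMDocument.
// Note: loadHTML() rejects an empty string, so check for that first.
function loadDom(string $html): DOMDocument
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    return $dom;
}
```

A call would then look like $html = fetchHtml($url); followed by $dom = loadDom($html); once $html is known to be a non-empty string.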

 

I have installed Firebug in Firefox...

 

Now I have the XPaths for these pages:

http://www.educa.ch/dyn/79376.asp?id=1187

http://www.educa.ch/dyn/79376.asp?id=2939

http://www.educa.ch/dyn/79376.asp?id=1515

http://www.educa.ch/dyn/79376.asp?id=1469

 

 

Altes Schulhaus Ossingen :: /html/body/div[2]

Guntibachstrasse 10 :: /html/body/div[4]

8475 Ossingen :: /html/body/div[6]

[email protected] :: /html/body/div[9]/a

Tel: 052 317 15 45 :: /html/body/div[11]

Fax: 052 317 04 42 :: /html/body/div[12]

 

 

But how do I apply these XPaths with Simple HTML DOM? I want to use this library: http://simplehtmldom.sourceforge.net/
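One possible way to apply the XPaths listed above is PHP's built-in DOMXPath class. Note that this swaps in DOMDocument for the library linked above: as far as I know, Simple HTML DOM's find() takes CSS-style selectors rather than XPath expressions. The function name extractFields() and the field labels (name, street, ...) are made up for illustration; the XPath expressions are the ones from Firebug.

```php
<?php
// Run each XPath against the page and collect the trimmed text of the
// first matching node. Fields without a match come back as ''.
function extractFields(string $html, array $xpaths): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $result = [];
    foreach ($xpaths as $field => $expression) {
        $nodes = $xpath->query($expression);
        $node  = ($nodes !== false) ? $nodes->item(0) : null;
        $result[$field] = ($node !== null) ? trim($node->textContent) : '';
    }
    return $result;
}

// XPaths as reported by Firebug; the array keys are my own labels.
$xpaths = [
    'name'   => '/html/body/div[2]',
    'street' => '/html/body/div[4]',
    'city'   => '/html/body/div[6]',
    'email'  => '/html/body/div[9]/a',
    'phone'  => '/html/body/div[11]',
    'fax'    => '/html/body/div[12]',
];
```

Positional paths like div[4] break as soon as a page has one div more or less than the sample page, so it is worth checking the result on several ids before trusting it.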

 

 

I look forward to a hint that gives me a starting point.

 

 

 
