Parser script with cURL & XPath needs a final review - please have a look!

dilbertone · November 23, 2010

Hi dear Freaks

I am very new to programming and I want to code a little project, so I have some things to learn in PHP.

I am currently playing around with http://simplehtmldom.sourceforge.net/ and struggling a bit with my project!

I would like you to take a closer look at the parser script with cURL & XPath. I have all the parts, but I guess I have messed things up a bit. I need a final review: have a look and give me some hints for the final arrangement of the code!

Thanks in advance!

The aim: I want to create a parser. Here are the parts:

a. the fetching part

b. the parser part (see below)

c. the storing part (into a MySQL DB)

The fetching part: I have chosen to do it with cURL, since it is pretty powerful.

I have some lines together now. Eugene, I would love to hear your review. Since I am new to programming, I would love to get some hints from experienced devs. Here are some details: we have several hundred result pages derived from this one:

http://www.educa.ch/dyn/79362.asp?action=search

Note: I want to iterate over the result pages with a loop.

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

I use this loop:

PHP Code:
for($i=1;$i<=$match[1];$i++)
{
  $url = "http://www.example.com/page?page={$i}";
  // access new sub-page, extract necessary data
}

As an example we can use this domain: http://www.educa.ch/dyn/79362.asp?action=search

Note that we have lots of targets:

http://www.educa.ch/dyn/79376.asp?id=1568

http://www.educa.ch/dyn/79376.asp?id=2149

and lots more.

What do you think? What about the loop over the target URLs?
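The loop over the target URLs could be sketched roughly as below. This is only a sketch: the helper name `educa_detail_url` and the id range 1568 to 1570 are assumptions for illustration, and the real list of ids would have to come from the search result page.

```php
<?php
// Hypothetical helper: builds the detail-page URL for one school id.
function educa_detail_url(int $id): string
{
    return 'http://www.educa.ch/dyn/79376.asp?' . http_build_query(array('id' => $id));
}

// Assumed id range, for illustration only.
$urls = array();
for ($id = 1568; $id <= 1570; $id++) {
    $urls[] = educa_detail_url($id);
}

foreach ($urls as $url) {
    echo $url, "\n";
}
```

Building the URL in one place keeps the loop itself free of string handling, so the same loop works whether the ids come from a range or from a list scraped off the search page.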

BTW: as you can see, some pages will be empty. The empty pages should be thrown away. I do not want to store "empty" stuff.

Well, this is what I want to do. And now I need a good parser script.

Note: this is a three-part job:

1. fetching the sub-pages

2. parsing them, and if all goes well we have a third part:

3. storing the data in a MySQL DB

b. The parser part:

Well, the problem: some of the above-mentioned pages are empty, so I need a way to skip them, since I do not want to populate my MySQL DB with empty entries.
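One way to leave the empty pages aside might be to check the parsed record before storing it. The sketch below assumes that each parsed page ends up as an array of field strings in UTF-8; the helper name `is_empty_record` is made up for illustration.

```php
<?php
// Hypothetical helper: true when every field is blank after trimming
// ordinary whitespace and (UTF-8 encoded) non-breaking spaces.
function is_empty_record(array $record): bool
{
    foreach ($record as $value) {
        $value = str_replace("\xC2\xA0", ' ', (string) $value);
        if (trim($value) !== '') {
            return false;
        }
    }
    return true;
}

$records = array(
    array('name' => 'Altes Schulhaus Ossingen', 'tel' => '052 317 15 45'),
    array('name' => '', 'tel' => '  '),
);

// Keep only the records worth storing.
$keep = array_values(array_filter($records, function ($r) {
    return !is_empty_record($r);
}));
echo count($keep), " of ", count($records), " records kept\n";
```

Checking the parsed fields rather than the raw HTML matters here, because an "empty" page on this site still carries the full header and layout markup.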

BTW: the parsing could be done with DOMDocument. What do you think?

I need to combine the first part with the second. Can you give me some starting points and hints on how to do this?

The fetching job should be done with cURL, and the data then processed by a DOMDocument parser. No problem there, but how do I do the DOMDocument part?

I have installed Firebug in Firefox.

Now I have the XPaths for these pages:

http://www.educa.ch/dyn/79376.asp?id=1187

http://www.educa.ch/dyn/79376.asp?id=2939

see the details:

Altes Schulhaus Ossingen :: /html/body/div[2]

Guntibachstrasse 10 :: /html/body/div[4]

8475 Ossingen :: /html/body/div[6]

sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a

Tel:052 317 15 45 :: /html/body/div[11]

Fax:052 317 04 42 :: /html/body/div[12]

But how do I apply them with Simple HTML DOM? I want to use this: http://simplehtmldom.sourceforge.net/
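As far as I can tell, Simple HTML DOM selects elements with CSS-style selectors rather than XPath, so the XPaths from Firebug fit more directly with PHP's bundled DOMDocument and DOMXPath classes. The sketch below assumes that route; the function name `extract_fields`, the field names and the shortened sample HTML (with different div positions than the real page) are illustrative only.

```php
<?php
// Sketch: pull one text value per XPath out of an HTML string.
// The function name and the field names are assumptions for illustration.
function extract_fields(string $html, array $xpaths): array
{
    $record = array_fill_keys(array_keys($xpaths), '');
    if (trim($html) === '') {
        return $record; // nothing to parse
    }

    // Real-world HTML is rarely valid; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpaths as $field => $expression) {
        $nodes = $xpath->query($expression);
        if ($nodes === false || $nodes->length === 0) {
            continue; // path not present on this page
        }
        // Turn non-breaking spaces into plain spaces, then collapse whitespace.
        $text = str_replace("\xC2\xA0", ' ', $nodes->item(0)->textContent);
        $record[$field] = trim(preg_replace('/\s+/', ' ', $text));
    }
    return $record;
}

$html = '<html><body><div>&#160;</div><div>Altes Schulhaus Ossingen  </div>'
      . '<div><a href="mailto:sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div>'
      . '</body></html>';

$record = extract_fields($html, array(
    'name'  => '/html/body/div[2]',
    'email' => '/html/body/div[3]/a',
    'fax'   => '/html/body/div[12]', // missing in this tiny sample
));
print_r($record);
```

One caveat: Firebug shows the DOM after the browser has repaired the markup, so a path copied from it may need adjusting against the raw HTML that cURL returns.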

If we already have the XPaths, we can use them. In PHP there are literally a thousand ways to skin a cat (no cruelty intended, I love cats). If the data we return looks like this:

Altes Schulhaus Ossingen    :: /html/body/div[2]
Guntibachstrasse 10  :: /html/body/div[4]
8475  Ossingen  :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 ::  /html/body/div[11]
Fax:052 317 04 42 ::  /html/body/div[12]

Solution: we can clean it up a bit using the trim() and preg_replace() functions:

$data = " Altes Schulhaus Ossingen    :: /html/body/div[2]
Guntibachstrasse 10  :: /html/body/div[4]
8475  Ossingen  :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 ::  /html/body/div[11]
Fax:052 317 04 42 ::  /html/body/div[12]";

// preg_replace() patterns need delimiters (here ~);
// [0-9]+ also covers two-digit indexes such as div[11]
$cleanthis = array(
                       '~[ \t]*::[ \t]*/html/body/div\[[0-9]+\](/a)?~',
                       '~Tel:~',
                       '~Fax:~'
                       );
$cleandata = trim(preg_replace($cleanthis, "", $data));

This should give us the following

Altes Schulhaus Ossingen

Guntibachstrasse 10

8475 Ossingen

sekretariat.psossingen@bluewin.ch

052 317 15 45

052 317 04 42

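Then we can split it into lines and fields:

// one field per line, whatever the line-ending style
list($arr['name'], $arr['address1'], $arr['address2'], $arr['email'],
$arr['tel'], $arr['fax']) = array_map('trim', preg_split('/\R/', $cleandata));
// split on the first run of whitespace only, so "8475  Ossingen" gives two parts
list($arr['postcode'], $arr['town']) = preg_split('/\s+/', $arr['address2'], 2);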

This should give us the following array:

array(
       'name' => 'Altes Schulhaus Ossingen',
       'address1' => 'Guntibachstrasse 10',
       'address2' => '8475  Ossingen',
       'email' => 'sekretariat.psossingen@bluewin.ch',
       'tel' => '052 317 15 45',
       'fax' => '052 317 04 42',
       'postcode' => '8475',
       'town' => 'Ossingen',
       );

Now, we can wrap it in a nice function:

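function parse_data($data) {
       // patterns need delimiters; [0-9]+ also covers two-digit indexes such as div[11]
       $cleanthis = array(
                               '~[ \t]*::[ \t]*/html/body/div\[[0-9]+\](/a)?~',
                               '~Tel:~',
                               '~Fax:~'
                               );
       $cleandata = trim(preg_replace($cleanthis, "", $data));
       $arr = array();
       // one field per line, whatever the line-ending style
       list($arr['name'], $arr['address1'], $arr['address2'],
$arr['email'], $arr['tel'], $arr['fax']) = array_map('trim', preg_split('/\R/', $cleandata));
       // split on the first run of whitespace only, so multi-word towns survive
       list($arr['postcode'], $arr['town']) = preg_split('/\s+/', $arr['address2'], 2);
       return $arr;
}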

Now that we have nicely formatted results, it's time to save the data:

CREATE TABLE IF NOT EXISTS my_table (
`school_id` int(11) NOT NULL auto_increment,
`school_title` text default NULL,
`school_address1` text default NULL,
`school_postcode` varchar(29) default NULL,
`school_town` varchar(255) default NULL,
`school_email` varchar(255) default NULL,
`school_tel` varchar(15) default NULL,
`school_fax` varchar(15) default NULL,
PRIMARY KEY  (`school_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

The matching INSERT, with $db assumed to be an open mysqli connection and the keys taken from the array returned by parse_data():

INSERT INTO my_table(school_title, school_address1, school_town,
school_postcode, school_email, school_tel, school_fax)
VALUES(
'".mysqli_real_escape_string($db, $arr['name'])."',
'".mysqli_real_escape_string($db, $arr['address1'])."',
'".mysqli_real_escape_string($db, $arr['town'])."',
'".mysqli_real_escape_string($db, $arr['postcode'])."',
'".mysqli_real_escape_string($db, $arr['email'])."',
'".mysqli_real_escape_string($db, $arr['tel'])."',
'".mysqli_real_escape_string($db, $arr['fax'])."'
);
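As an alternative to escaping every value by hand, the mapping from the parsed array to the table columns could be kept in one place and handed to a prepared statement. The sketch below only builds the SQL and the parameter array; the helper name `build_school_insert` and the choice to store blank fields as NULL are assumptions, not something taken from the original code.

```php
<?php
// Hypothetical helper: maps the parsed array keys onto the table columns and
// returns the SQL with named placeholders plus the matching parameter array.
function build_school_insert(array $arr): array
{
    $map = array(
        'school_title'    => 'name',
        'school_address1' => 'address1',
        'school_town'     => 'town',
        'school_postcode' => 'postcode',
        'school_email'    => 'email',
        'school_tel'      => 'tel',
        'school_fax'      => 'fax',
    );

    $params = array();
    foreach ($map as $column => $key) {
        // Missing or blank fields are stored as NULL (an assumption of this sketch).
        $value = isset($arr[$key]) ? trim($arr[$key]) : '';
        $params[':' . $column] = ($value === '') ? null : $value;
    }

    $columns = implode(', ', array_keys($map));
    $sql = "INSERT INTO my_table ($columns) VALUES (" . implode(', ', array_keys($params)) . ")";

    return array($sql, $params);
}

list($sql, $params) = build_school_insert(array(
    'name'     => 'Altes Schulhaus Ossingen',
    'postcode' => '8475',
    'town'     => 'Ossingen',
));
echo $sql, "\n";
```

With a PDO connection in a variable such as `$pdo` (assumed here, not shown), the pair could then be run as `$pdo->prepare($sql)->execute($params);`, which leaves the quoting to the driver.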

Here's the wrapper:

// $db is assumed to be an open mysqli connection
for($i=1;$i<=$match[1];$i++) {

$url = "http://www.example.com/page?page={$i}";
// perform our cURL request, access the new sub-page and extract the necessary data to $data

$data = <--results variable from your dom-->

$arr = parse_data($data);

// the array keys are the ones returned by parse_data()
mysqli_query($db, "INSERT INTO my_table(
school_title, school_address1, school_town, school_postcode, school_email,
school_tel, school_fax
)
VALUES(
'".mysqli_real_escape_string($db, $arr['name'])."',
'".mysqli_real_escape_string($db, $arr['address1'])."',
'".mysqli_real_escape_string($db, $arr['town'])."',
'".mysqli_real_escape_string($db, $arr['postcode'])."',
'".mysqli_real_escape_string($db, $arr['email'])."',
'".mysqli_real_escape_string($db, $arr['tel'])."',
'".mysqli_real_escape_string($db, $arr['fax'])."'
)");

}

BTW: cURL is definitely the way to go, and I presume that you are returning the output from cURL?

function get_page_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
// close the handle before returning, otherwise this line is never reached
curl_close($ch);
// the page body as a string, or false if the request failed
return $output;
}

This will output:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS -
http://www.webweaver.de">
<title>educa.ch</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="101.htm">
<script src="102.htm">
</script>
<script language="JavaScript">
<!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// -->
</script>
</head>
<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0"
marginheight="0" onload="check();">
<table cellspacing="0" cellpadding="0" border="0" width="100%">
<tr><td width="15" class="popuphead">
<img src="/0.gif" alt="" width="15" height="16">
</td><td width="99%" class="popuphead">
Adresse - Schulen in der Schweiz
</td><td width="20" class="popuphead" valign="middle">
<a href="#" title="Print" onclick="window.print(); return false;">
<img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13">
</a>
</td><td width="20" class="popuphead" valign="middle">
<a href="#" title="close" onclick="window.close(); return false;">
<img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13">
</a>
</td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4">
<img src="/0.gif" alt="" width="1" height="1">
</td></tr>
</table>
<div class="leerzeile">&#160;</div>
<div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Ecoles
primaire et enfantine de Bassecourt    </div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8"></div>
<div><img src="/0.gif" alt="" width="15" height="8"></div>
<div><img src="/0.gif" alt="" width="15"
height="8">2854&#160;Bassecourt</div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8"><a href=""
target="_blank"></a></div>
<div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto:
ep.bassecourt@ju.educanet2.ch">ep.bassecourt@ju.educanet2.ch</a></div>
<div class="leerzeile">&#160;</div>
<div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif"
alt="" width="6" height="8">032 426 74 72</div>
<div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif"
alt="" width="4" height="8"></div>
<div>&#160;</div>
</body>
</html>

First of all, we want to remove any redundant data, for example the header and footer.
So: [I'm doing a quick cheat here]

$url = 'http://www.educa.ch/dyn/79376.asp?id=1568';

$data = get_page_data($url);

if($data) {
// This will clean all the unneeded top and bottom content and return only
// the table and divs data

$cleaned = string_between('onload="check();">', '</body>', $data);

// From here it's easy: clean out any unneeded content such as images.
// The second parameter lets us specify which tags NOT to remove,
// i.e. tables, divs, paragraphs etc.
// If we don't want any HTML tags, simply leave it as strip_tags($cleaned);
// that removes ALL the HTML tags and returns only the content between them.
$result = strip_tags($cleaned, '<table><tr><td><div>');
}

And now you will be left with something like this (the blank spacer divs are omitted here):

<table cellspacing="0" cellpadding="0" border="0" width="100%">
<tr><td width="15" class="popuphead">
</td><td width="99%" class="popuphead">
Adresse - Schulen in der Schweiz
</td><td width="20" class="popuphead" valign="middle">
</td><td width="20" class="popuphead" valign="middle">
</td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4">
</td></tr>
</table>
<div class="leerzeile">Ecoles primaire et enfantine de Bassecourt    </div>
<div>2854&#160;Bassecourt</div>
<div>ep.bassecourt@ju.educanet2.ch</div>
<div>Tel: 032 426 74 72</div>
<div>Fax: </div>
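From markup like this, the text of each div could be collected with a small helper. The sketch below is one possible approach, assuming the markup has already been converted to UTF-8; the function name `div_texts` is made up for illustration.

```php
<?php
// Sketch: return the non-empty text of every <div> in the cleaned markup.
// The function name is an assumption for illustration.
function div_texts(string $cleaned): array
{
    preg_match_all('~<div[^>]*>(.*?)</div>~is', $cleaned, $matches);

    $texts = array();
    foreach ($matches[1] as $inner) {
        // Decode entities such as &#160;, then treat the resulting
        // non-breaking space like a normal space.
        $text = html_entity_decode(strip_tags($inner), ENT_QUOTES, 'UTF-8');
        $text = trim(str_replace("\xC2\xA0", ' ', $text));
        if ($text !== '') {
            $texts[] = $text;
        }
    }
    return $texts;
}

$cleaned = '<div class="leerzeile">Ecoles primaire et enfantine de Bassecourt    </div>'
         . '<div>2854&#160;Bassecourt</div>'
         . '<div>&#160;</div>'
         . '<div>Tel: 032 426 74 72</div>';

print_r(div_texts($cleaned));
```

Because blank divs are dropped, the position of a value in the result shifts when a field such as the street address is missing, so labels like "Tel:" and "Fax:" are safer anchors than array indexes.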

Let us quickly sum that up:

function string_between($start, $end, $string, $return=NULL){
// returns the text between $start and $end, or "" if either marker is missing
$ini = strpos($string,$start);
if($ini===false)
   return "";
$ini += strlen($start);
$stop = strpos($string,$end,$ini);
if($stop===false)
   return "";
$len = $stop - $ini;
if($return)
   return $start.substr($string,$ini,$len).$end;
else
   return substr($string,$ini,$len);
}

function get_page_data($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
// close the handle before returning, otherwise this line is never reached
curl_close($ch);
// the page body as a string, or false if the request failed
return $output;
}

for($i=1;$i<=$match[1];$i++)
{
$url = "http://www.example.com/page?page={$i}";
$data = get_page_data($url);
if($data) {
   $cleaned = string_between('onload="check();">', '</body>', $data);
   // keep only the table and div tags
   $result = strip_tags($cleaned, '<table><tr><td><div>');
}
}
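The three parts (fetch, parse, store) might be tied together along the lines of the sketch below. It is not a finished script: the function name `run_parser` and the callable signatures are assumptions, and the demo uses stand-ins instead of real cURL, DOM and database code so that it runs on its own.

```php
<?php
// Sketch of the overall flow. $fetch, $parse and $store are callables so each
// part can be swapped; their names and signatures are assumptions.
function run_parser(array $urls, callable $fetch, callable $parse, callable $store): int
{
    $stored = 0;
    foreach ($urls as $url) {
        $html = $fetch($url);
        if (!is_string($html) || trim($html) === '') {
            continue; // fetch failed or page is empty
        }
        $record = $parse($html);
        $filled = array_filter($record, function ($value) {
            return trim((string) $value) !== '';
        });
        if (count($filled) === 0) {
            continue; // nothing useful on this page
        }
        $store($record);
        $stored++;
    }
    return $stored;
}

// Stand-ins for cURL, the DOM parser and the database, for demonstration.
$pages = array(
    'http://www.example.com/page?page=1' => '<div>Altes Schulhaus Ossingen</div>',
    'http://www.example.com/page?page=2' => '',
);
$rows = array();

$count = run_parser(
    array_keys($pages),
    function ($url) use ($pages) { return $pages[$url]; },
    function ($html) { return array('name' => trim(strip_tags($html))); },
    function ($record) use (&$rows) { $rows[] = $record; }
);
echo "$count record(s) stored\n";
```

In a real run the first callable would wrap get_page_data(), the second the parsing step, and the third the INSERT; keeping them separate makes it easy to test the skipping of empty pages without touching the network or the database.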

Well, I am a bit confused. Can anybody clear this up a bit and put the snippets together in the right order?

I'd love to hear from you.

Greetings,

dilbertone
