dilbertone
Posted November 23, 2010

Hi dear Freaks,

I am very new to programming and I want to code a little project, so I have some things to learn in PHP. I am currently playing around with http://simplehtmldom.sourceforge.net/ and struggling a bit with my project.

I would like you to have a closer look at my parser script with cURL and XPath. I have all the parts, but I guess I have messed them up a bit. I need a final review: have a look and give me some hints for the final arrangement of the code. Thanks in advance!

The aim: I want to create a parser. These are the parts:

a. the fetching part
b. the parser part (see below)
c. the storing part (into a MySQL DB)

The fetching part: I have chosen to do it with cURL, since cURL is pretty powerful. I have some lines together now. Eugene, I would love to hear your review. Since I am new to programming, I would love to get some hints from experienced devs.

Some details: we have several hundred result pages derived from this one:

http://www.educa.ch/dyn/79362.asp?action=search

I want to iterate over the result pages with a loop. Two of the targets:

http://www.educa.ch/dyn/79376.asp?id=1568
http://www.educa.ch/dyn/79376.asp?id=2149

There are lots of other targets like these. I use this loop:

PHP Code:

    for ($i = 1; $i <= $match[1]; $i++) {
        $url = "http://www.example.com/page?page={$i}";
        // access new sub-page, extract necessary data
    }

What do you think? What about the loop over the target URLs?

BTW: some of the pages will be empty. The empty pages should be thrown away; I do not want to store "empty" stuff. And now I need a good parser script.

Note: this is a three-part job:

1. fetching the sub-pages
2. parsing them
3. if all goes well, storing the data in a MySQL DB

b. The parser part: the problem is that some of the pages mentioned above are empty, so I need a way to leave them aside, since I do not want to fill my MySQL DB with useless entries.

Btw: parsing should be doable with DOMDocument. What do you think? I need to combine the first part with the second. Can you give me some starting points and hints? The fetching job should be done with cURL, and the fetched data should then be processed by a DOMDocument parser. Fetching is no problem, but how do I do the DOMDocument job?

I have installed Firebug in Firefox, and now I have the XPaths for these sites:

http://www.educa.ch/dyn/79376.asp?id=1187
http://www.educa.ch/dyn/79376.asp?id=2939

See the details:

Altes Schulhaus Ossingen :: /html/body/div[2]
Guntibachstrasse 10 :: /html/body/div[4]
8475 Ossingen :: /html/body/div[6]
[email protected] :: /html/body/div[9]/a
Tel:052 317 15 45 :: /html/body/div[11]
Fax:052 317 04 42 :: /html/body/div[12]

But how do I apply them with Simple HTML DOM? I want to use this: http://simplehtmldom.sourceforge.net/

If we already have the XPaths, we can use them. In PHP there are literally a thousand ways to skin a cat (no cruelty intended, I love cats). Suppose the data we return looks like the list above.

Solutions: we can clean it up a bit with the trim() and preg_replace() functions:

    $data = "Altes Schulhaus Ossingen :: /html/body/div[2]
    Guntibachstrasse 10 :: /html/body/div[4]
    8475 Ossingen :: /html/body/div[6]
    [email protected] :: /html/body/div[9]/a
    Tel:052 317 15 45 :: /html/body/div[11]
    Fax:052 317 04 42 :: /html/body/div[12]";

    // preg_replace() needs delimited patterns; [0-9]+ also matches div[11] and div[12]
    $cleanthis = array(
        '~ *:: /html/body/div\[[0-9]+\](/a)?~',
        '~Tel: *~',
        '~Fax: *~',
    );
    $cleandata = trim(preg_replace($cleanthis, '', $data));

This should give us the following:

Altes Schulhaus Ossingen
Guntibachstrasse 10
8475 Ossingen
[email protected]
052 317 15 45
052 317 04 42

Then we can split it:

    // split on any line ending (not only "\r") and trim each line
    list($arr['name'], $arr['address1'], $arr['address2'], $arr['email'], $arr['tel'], $arr['fax'])
        = array_map('trim', preg_split('/\R/', $cleandata));
    // limit 2 keeps town names that consist of several words together
    list($arr['postcode'], $arr['town']) = explode(" ", $arr['address2'], 2);

This should give us the following array:

    array(
        'name'     => 'Altes Schulhaus Ossingen',
        'address1' => 'Guntibachstrasse 10',
        'address2' => '8475 Ossingen',
        'email'    => '[email protected]',
        'tel'      => '052 317 15 45',
        'fax'      => '052 317 04 42',
        'postcode' => '8475',
        'town'     => 'Ossingen',
    );

Now we can wrap it in a nice function:

    function parse_data($data) {
        $cleanthis = array(
            '~ *:: /html/body/div\[[0-9]+\](/a)?~',
            '~Tel: *~',
            '~Fax: *~',
        );
        $cleandata = trim(preg_replace($cleanthis, '', $data));
        $arr = array();
        list($arr['name'], $arr['address1'], $arr['address2'], $arr['email'], $arr['tel'], $arr['fax'])
            = array_map('trim', preg_split('/\R/', $cleandata));
        list($arr['postcode'], $arr['town']) = explode(" ", $arr['address2'], 2);
        return $arr;
    }

Now that we have nicely formatted results, it's time to save the data:

    CREATE TABLE IF NOT EXISTS my_table (
        `school_id` int(11) NOT NULL auto_increment,
        `school_title` text default NULL,
        `school_address1` text default NULL,
        `school_postcode` varchar(29) default NULL,
        `school_town` varchar(255) default NULL,
        `school_email` varchar(255) default NULL,
        `school_tel` varchar(15) default NULL,
        `school_fax` varchar(15) default NULL,
        PRIMARY KEY (`school_id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    INSERT INTO my_table(school_title, school_address1, school_town, school_postcode, school_email, school_tel, school_fax)
    VALUES(
        '".mysql_real_escape_string($arr['name'])."',
        '".mysql_real_escape_string($arr['address1'])."',
        '".mysql_real_escape_string($arr['town'])."',
        '".mysql_real_escape_string($arr['postcode'])."',
        '".mysql_real_escape_string($arr['email'])."',
        '".mysql_real_escape_string($arr['tel'])."',
        '".mysql_real_escape_string($arr['fax'])."'
    );

Here's the wrapper:

    // The mysql_* functions need PHP 5 (they were removed in PHP 7);
    // the array keys are the ones returned by parse_data().
    for ($i = 1; $i <= $match[1]; $i++) {
        $url = "http://www.example.com/page?page={$i}";
        // perform our cURL request, access the new sub-page and extract the necessary data to $data
        $data = ''; // placeholder: the results variable from your DOM goes here
        $arr = parse_data($data);
        mysql_query("INSERT INTO my_table(
                school_title, school_address1, school_town, school_postcode,
                school_email, school_tel, school_fax
            ) VALUES(
                '".mysql_real_escape_string($arr['name'])."',
                '".mysql_real_escape_string($arr['address1'])."',
                '".mysql_real_escape_string($arr['town'])."',
                '".mysql_real_escape_string($arr['postcode'])."',
                '".mysql_real_escape_string($arr['email'])."',
                '".mysql_real_escape_string($arr['tel'])."',
                '".mysql_real_escape_string($arr['fax'])."'
            )");
    }

BTW: cURL is definitely the way to go, and I presume that you are returning the output from cURL?

    function get_page_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        curl_close($ch); // close the handle before returning
        // curl_exec() returns false on failure
        return ($output !== false) ? $output : null;
    }

This will output:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de">
    <title>educa.ch</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <link rel="stylesheet" href="101.htm">
    <script src="102.htm"> </script>
    <script language="JavaScript">
    <!--
    var did='d79376';
    var root=new Array('d200','d205','d73137','d1566','d79376','d');
    var usefocus = 1;
    function check() { if ((self.focus) && (usefocus)) { self.focus(); } }
    // -->
    </script>
    </head>
    <body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();">
    <table cellspacing="0" cellpadding="0" border="0" width="100%">
    <tr><td width="15" class="popuphead"> <img src="/0.gif" alt="" width="15" height="16"> </td><td width="99%" class="popuphead"> Adresse - Schulen in der Schweiz </td><td width="20" class="popuphead" valign="middle"> <a href="#" title="Print" onclick="window.print(); return false;"> <img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"> </a> </td><td width="20" class="popuphead" valign="middle"> <a href="#" title="close" onclick="window.close(); return false;"> <img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"> </a> </td></tr>
    <tr bgcolor="#B2B2B2"><td colspan="4"> <img src="/0.gif" alt="" width="1" height="1"> </td></tr>
    </table>
    <div class="leerzeile"> </div>
    <div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Ecoles primaire et enfantine de Bassecourt </div>
    <div class="leerzeile"> </div>
    <div><img src="/0.gif" alt="" width="15" height="8"></div>
    <div><img src="/0.gif" alt="" width="15" height="8"></div>
    <div><img src="/0.gif" alt="" width="15" height="8">2854 Bassecourt</div>
    <div class="leerzeile"> </div>
    <div><img src="/0.gif" alt="" width="15" height="8"><a href="" target="_blank"></a></div>
    <div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: [email protected]">[email protected]</a></div>
    <div class="leerzeile"> </div>
    <div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">032 426 74 72</div>
    <div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8"></div>
    <div> </div>
    </body>
    </html>

First of all, we want to remove any redundant data, for example the header and footer. So (I'm doing a quick cheat here):

    $url = 'http://www.educa.ch/dyn/79376.asp?id=1568';
    $data = get_page_data($url);
    if ($data) {
        // This cuts away the unneeded top and bottom content and returns only the table and div data
        $cleaned = string_between('onload="check();">', '</body>', $data);
        // From here it's easy: clean out any unneeded content such as images.
        // The second parameter lets us specify which tags NOT to remove, i.e. tables, divs, paragraphs etc.
        // If we don't want any HTML tags at all, simply use strip_tags($cleaned);
        // that removes ALL the HTML tags and returns only the content between them.
        $result = strip_tags($cleaned, '<table><tr><td><div>');
    }

And now you will be left with roughly this:

    <table cellspacing="0" cellpadding="0" border="0" width="100%">
    <tr><td width="15" class="popuphead"> </td><td width="99%" class="popuphead"> Adresse - Schulen in der Schweiz </td><td width="20" class="popuphead" valign="middle"> </td><td width="20" class="popuphead" valign="middle"> </td></tr>
    <tr bgcolor="#B2B2B2"><td colspan="4"> </td></tr>
    </table>
    <div class="leerzeile">Ecoles primaire et enfantine de Bassecourt </div>
    <div>2854 Bassecourt</div>
    <div>[email protected]</div>
    <div>Tel: 032 426 74 72</div>
    <div>Fax: </div>

Let us quickly sum that up:

    function string_between($start, $end, $string, $return = NULL) {
        $ini = strpos($string, $start);
        if ($ini === false) return ""; // start marker not found
        $ini += strlen($start);
        $pos = strpos($string, $end, $ini);
        if ($pos === false) return ""; // end marker not found
        $len = $pos - $ini;
        // with $return set, the markers themselves are included in the result
        if ($return) return $start.substr($string, $ini, $len).$end;
        else return substr($string, $ini, $len);
    }

    function get_page_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        curl_close($ch); // close the handle before returning
        return ($output !== false) ? $output : null;
    }

    for ($i = 1; $i <= $match[1]; $i++) {
        $url = "http://www.example.com/page?page={$i}";
        $data = get_page_data($url);
        if ($data) {
            $cleaned = string_between('onload="check();">', '</body>', $data);
            $result = strip_tags($cleaned, '<table><tr><td><div>');
        }
    }

Well, I am a bit confused. Can anybody clear this up a bit and put the snippets together in the right manner?

I would love to hear from you.

Greetings,
dilbertone
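
One possible way to connect the fetching part with the DOMDocument part is sketched below. It is only a sketch under assumptions: the function name parse_school_page() and the field names are made up for illustration, and it assumes that every detail page uses the same div layout as the XPaths found with Firebug above. It returns null for an empty page (no school name), so the loop can skip such pages instead of storing them.

```php
<?php
// Sketch only: parse_school_page() and its field names are hypothetical.
// It assumes every detail page uses the div layout found with Firebug above.
function parse_school_page(string $html): ?array
{
    if (trim($html) === '') {
        return null; // nothing was fetched
    }

    $doc = new DOMDocument();
    // Old HTML 4 pages are rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    // Field name => XPath, taken from the list in the post.
    $paths = [
        'name'     => '/html/body/div[2]',
        'address1' => '/html/body/div[4]',
        'address2' => '/html/body/div[6]',
        'email'    => '/html/body/div[9]/a',
        'tel'      => '/html/body/div[11]',
        'fax'      => '/html/body/div[12]',
    ];

    $xpath = new DOMXPath($doc);
    $row = [];
    foreach ($paths as $field => $path) {
        // string(...) yields '' when the node does not exist.
        $text = $xpath->evaluate("string($path)");
        // Collapse whitespace (including non-breaking spaces), then trim.
        $row[$field] = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $text));
    }

    // Drop the "Tel:" and "Fax:" labels.
    $row['tel'] = trim(preg_replace('/^Tel:/', '', $row['tel']));
    $row['fax'] = trim(preg_replace('/^Fax:/', '', $row['fax']));

    if ($row['name'] === '') {
        return null; // empty page: the caller should skip it
    }

    // "8475 Ossingen" => postcode "8475", town "Ossingen".
    [$row['postcode'], $row['town']] =
        array_pad(explode(' ', $row['address2'], 2), 2, '');

    return $row;
}
```

Inside the loop this could be called as parse_school_page((string) get_page_data($url)), moving on to the next id whenever the result is null. The cast is there because get_page_data() may not return a string when the request fails.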