dilbertone
Posted November 23, 2010

Hi dear Freaks,

I am very new to programming and I want to code a little project, so I have some things to learn in PHP. I am currently playing around with http://simplehtmldom.sourceforge.net/ and struggling a bit with my project.

I would like you to have a closer look at the parser script with cURL and XPath. I have all the parts, but I guess I have messed them up a bit. I need a final review: have a look and give me some hints for the final arrangement of the code. Thanks in advance!

What I am aiming for: I want to create a parser. These are the parts:

a. the fetching part
b. the parser part (see below)
c. the storing part (into a MySQL DB)

a. The fetching part

I have chosen to do it with cURL, since it is pretty powerful. I have some lines together now. Eugene, I would love to hear your review. Since I am new to programming, I love to get hints from experienced devs.

Here are some details. We have several hundred result pages derived from this one:

http://www.educa.ch/dyn/79362.asp?action=search

Note: I want to iterate over the result pages with a loop.

http://www.educa.ch/dyn/79376.asp?id=1568
http://www.educa.ch/dyn/79376.asp?id=2149

I use this loop:

PHP Code:
    // $match[1] is assumed to hold the number of result pages
    for ($i = 1; $i <= $match[1]; $i++) {
        $url = "http://www.example.com/page?page={$i}";
        // access new sub-page, extract necessary data
    }

As the example we can plug in this domain:

http://www.educa.ch/dyn/79362.asp?action=search

Note that we have lots of targets:

http://www.educa.ch/dyn/79376.asp?id=1568
http://www.educa.ch/dyn/79376.asp?id=2149

and lots more. What do you think? What about the loop over the target URLs?

BTW: some of the pages will be empty. The empty pages should be thrown away, because I do not want to store "empty" stuff. That is what I want, and now I need a good parser script.

Note: this is a three-part job:

1. fetching the sub-pages
2. parsing them, and if all goes well we would have a third part:
3. storing the data in a MySQL DB

b. The parser part

The problem: some of the pages mentioned above are empty, so I need a way to leave them aside, because I do not want to fill my MySQL DB with useless records.

BTW: parsing should be doable with DomDocument. What do you think? I need to combine the first part with the second. Can you give me some starting points and hints? The fetching job should be done with cURL, and the data should then be processed by a DomDocument parser job. No problem so far, but how do I do the DomDocument job?

I have installed Firebug in Firefox, and now I have the XPaths for these pages:

http://www.educa.ch/dyn/79376.asp?id=1187
http://www.educa.ch/dyn/79376.asp?id=2939

See the details:

Altes Schulhaus Ossingen :: /html/body/div[2]
Guntibachstrasse 10 :: /html/body/div[4]
8475 Ossingen :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 :: /html/body/div[11]
Fax:052 317 04 42 :: /html/body/div[12]

But how do I apply them in Simple HTML DOM? I want to use this here: http://simplehtmldom.sourceforge.net/

If we already have the XPaths, we can use them; in PHP there are literally a thousand ways to skin a cat (no cruelty intended, I love cats). Suppose the data we return looks like this:

Altes Schulhaus Ossingen :: /html/body/div[2]
Guntibachstrasse 10 :: /html/body/div[4]
8475 Ossingen :: /html/body/div[6]
sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
Tel:052 317 15 45 :: /html/body/div[11]
Fax:052 317 04 42 :: /html/body/div[12]

Solutions:

We can clean it up a bit by using the trim() and preg_replace() functions:

PHP Code:
    $data = "
    Altes Schulhaus Ossingen :: /html/body/div[2]
    Guntibachstrasse 10 :: /html/body/div[4]
    8475 Ossingen :: /html/body/div[6]
    sekretariat.psossingen@bluewin.ch :: /html/body/div[9]/a
    Tel:052 317 15 45 :: /html/body/div[11]
    Fax:052 317 04 42 :: /html/body/div[12]";

    // preg_replace() patterns need delimiters (~ here); [0-9]+ also covers div[11] and div[12]
    $cleanthis = array(
        '~[ \t]*::[ \t]*/html/body/div\[[0-9]+\](/a)?~',
        '~Tel:~',
        '~Fax:~'
    );
    $cleandata = trim(preg_replace($cleanthis, "", $data));

This should give us the following:

Altes Schulhaus Ossingen
Guntibachstrasse 10
8475 Ossingen
sekretariat.psossingen@bluewin.ch
052 317 15 45
052 317 04 42

Then we can explode it:

PHP Code:
    // split on any line ending and trim each line
    $lines = array_map('trim', preg_split('/\R/', $cleandata));
    list($arr['name'], $arr['address1'], $arr['address2'], $arr['email'], $arr['tel'], $arr['fax']) = $lines;
    // limit of 2 keeps town names with spaces in one piece
    list($arr['postcode'], $arr['town']) = explode(" ", $arr['address2'], 2);

This should give us the following array:

    array(
        'name' => 'Altes Schulhaus Ossingen',
        'address1' => 'Guntibachstrasse 10',
        'address2' => '8475 Ossingen',
        'email' => 'sekretariat.psossingen@bluewin.ch',
        'tel' => '052 317 15 45',
        'fax' => '052 317 04 42',
        'postcode' => '8475',
        'town' => 'Ossingen',
    );

Now we can wrap it in a nice function:

PHP Code:
    function parse_data($data) {
        $cleanthis = array(
            '~[ \t]*::[ \t]*/html/body/div\[[0-9]+\](/a)?~',
            '~Tel:~',
            '~Fax:~'
        );
        $cleandata = trim(preg_replace($cleanthis, "", $data));
        $lines = array_map('trim', preg_split('/\R/', $cleandata));
        $arr = array();
        list($arr['name'], $arr['address1'], $arr['address2'], $arr['email'], $arr['tel'], $arr['fax']) = $lines;
        list($arr['postcode'], $arr['town']) = explode(" ", $arr['address2'], 2);
        return $arr;
    }

Now that we have the nicely formatted results, it's time to save the data:

    CREATE TABLE IF NOT EXISTS my_table (
        `school_id` int(11) NOT NULL auto_increment,
        `school_title` text default NULL,
        `school_address1` text default NULL,
        `school_postcode` varchar(29) default NULL,
        `school_town` varchar(255) default NULL,
        `school_email` varchar(255) default NULL,
        `school_tel` varchar(15) default NULL,
        `school_fax` varchar(15) default NULL,
        PRIMARY KEY (`school_id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

PHP Code:
    // $db is assumed to be an open mysqli connection; the keys are the ones parse_data() returns
    $sql = "INSERT INTO my_table(school_title, school_address1, school_town, school_postcode, school_email, school_tel, school_fax)
    VALUES(
        '".mysqli_real_escape_string($db, $arr['name'])."',
        '".mysqli_real_escape_string($db, $arr['address1'])."',
        '".mysqli_real_escape_string($db, $arr['town'])."',
        '".mysqli_real_escape_string($db, $arr['postcode'])."',
        '".mysqli_real_escape_string($db, $arr['email'])."',
        '".mysqli_real_escape_string($db, $arr['tel'])."',
        '".mysqli_real_escape_string($db, $arr['fax'])."'
    )";

Here's the wrapper:

PHP Code:
    for ($i = 1; $i <= $match[1]; $i++) {
        $url = "http://www.example.com/page?page={$i}";
        // perform our cURL request, access the new sub-page and extract the necessary data to $data
        $data = ''; // <-- results variable from your DOM step goes here
        $arr = parse_data($data);
        mysqli_query($db, "INSERT INTO my_table(
            school_title, school_address1, school_town, school_postcode, school_email, school_tel, school_fax
        ) VALUES(
            '".mysqli_real_escape_string($db, $arr['name'])."',
            '".mysqli_real_escape_string($db, $arr['address1'])."',
            '".mysqli_real_escape_string($db, $arr['town'])."',
            '".mysqli_real_escape_string($db, $arr['postcode'])."',
            '".mysqli_real_escape_string($db, $arr['email'])."',
            '".mysqli_real_escape_string($db, $arr['tel'])."',
            '".mysqli_real_escape_string($db, $arr['fax'])."'
        )");
    }

BTW: cURL is definitely the way to go, and I presume that you are returning the output from cURL?
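Coming back to the empty pages: one way to keep them out of the table might be to check the parsed array before running the INSERT. This is only a sketch. The helper name is_empty_record() is hypothetical, and treating a record as "empty" when every field is blank is my own assumption about what an empty page looks like; the array keys are the ones parse_data() returns.

```php
<?php
// Hypothetical helper (not part of the snippets in this post): returns true
// when every field of a parsed record is blank after trimming.
// Assumption: such a record comes from one of the "empty" pages.
function is_empty_record(array $arr): bool
{
    foreach ($arr as $value) {
        if (trim((string) $value) !== '') {
            return false; // at least one field carries data
        }
    }
    return true;
}

// Usage: one filled record and one blank record, as parse_data() might return them.
$records = array(
    array('name' => 'Altes Schulhaus Ossingen', 'address1' => 'Guntibachstrasse 10',
          'email' => 'sekretariat.psossingen@bluewin.ch', 'tel' => '052 317 15 45', 'fax' => '052 317 04 42'),
    array('name' => '', 'address1' => ' ', 'email' => '', 'tel' => '', 'fax' => ''),
);

$to_store = array();
foreach ($records as $arr) {
    if (is_empty_record($arr)) {
        continue; // skip the page, nothing goes into the DB
    }
    $to_store[] = $arr; // this is where the INSERT would run
}
echo count($to_store), " record(s) to store\n";
```

In the wrapper loop the same check would sit between the parse_data() call and the query, so a blank page simply moves on to the next $i.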
Here is the fetch function:

PHP Code:
    function get_page_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        // close the handle before returning, otherwise this line is never reached
        curl_close($ch);
        // curl_exec() returns false on failure
        return ($output !== false) ? $output : null;
    }

This will output:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de">
    <title>educa.ch</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <link rel="stylesheet" href="101.htm">
    <script src="102.htm"> </script>
    <script language="JavaScript">
    <!--
    var did='d79376';
    var root=new Array('d200','d205','d73137','d1566','d79376','d');
    var usefocus = 1;
    function check() {
      if ((self.focus) && (usefocus)) {
        self.focus();
      }
    }
    // -->
    </script>
    </head>
    <body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();">
    <table cellspacing="0" cellpadding="0" border="0" width="100%">
    <tr><td width="15" class="popuphead"> <img src="/0.gif" alt="" width="15" height="16"> </td><td width="99%" class="popuphead"> Adresse - Schulen in der Schweiz </td><td width="20" class="popuphead" valign="middle"> <a href="#" title="Print" onclick="window.print(); return false;"> <img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"> </a> </td><td width="20" class="popuphead" valign="middle"> <a href="#" title="close" onclick="window.close(); return false;"> <img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"> </a> </td></tr>
    <tr bgcolor="#B2B2B2"><td colspan="4"> <img src="/0.gif" alt="" width="1" height="1"> </td></tr>
    </table>
    <div class="leerzeile"> </div>
    <div class="leerzeile"><img src="/0.gif" alt="" width="15" height="8">Ecoles primaire et enfantine de Bassecourt </div>
    <div class="leerzeile"> </div>
    <div><img src="/0.gif" alt="" width="15" height="8"></div>
    <div><img src="/0.gif" alt="" width="15" height="8"></div>
    <div><img src="/0.gif" alt="" width="15" height="8">2854 Bassecourt</div>
    <div class="leerzeile"> </div>
    <div><img src="/0.gif" alt="" width="15" height="8"><a href="" target="_blank"></a></div>
    <div><img src="/0.gif" alt="" width="15" height="8"><a href="mailto: ep.bassecourt@ju.educanet2.ch">ep.bassecourt@ju.educanet2.ch</a></div>
    <div class="leerzeile"> </div>
    <div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">032 426 74 72</div>
    <div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8"></div>
    <div> </div>
    </body>
    </html>

First of all, we want to remove any redundant data, for example the header and footer. So: [I'm doing a quick cheat here]

PHP Code:
    $url = 'http://www.educa.ch/dyn/79376.asp?id=1568';
    $data = get_page_data($url);
    if ($data) {
        // This will clean all the unneeded top and bottom content and return only the table and divs data
        $cleaned = string_between('onload="check();">', '</body>', $data);
        // From here it's easy: clean out any unneeded content such as images and links.
        // The second parameter lets us specify which tags NOT to remove, i.e. tables, divs, paragraphs etc.
        // If we don't want any HTML tags at all, simply use strip_tags($cleaned);
        // that removes ALL the HTML tags and returns only the content between them.
        $result = strip_tags($cleaned, '<table><tr><td><div>');
    }

And now you will only be left with:

    <table cellspacing="0" cellpadding="0" border="0" width="100%">
    <tr><td width="15" class="popuphead"> </td><td width="99%" class="popuphead"> Adresse - Schulen in der Schweiz </td><td width="20" class="popuphead" valign="middle"> </td><td width="20" class="popuphead" valign="middle"> </td></tr>
    <tr bgcolor="#B2B2B2"><td colspan="4"> </td></tr>
    </table>
    <div class="leerzeile">Ecoles primaire et enfantine de Bassecourt </div>
    <div>2854 Bassecourt</div>
    <div>ep.bassecourt@ju.educanet2.ch</div>
    <div>Tel: 032 426 74 72</div>
    <div>Fax: </div>

Let us quickly sum that up:

PHP Code:
    // Returns the part of $string between $start and $end ("" if either marker is missing).
    // With $return set, the markers themselves are included in the result.
    function string_between($start, $end, $string, $return = NULL) {
        $ini = strpos($string, $start);
        if ($ini === false) return "";
        $ini += strlen($start);
        $stop = strpos($string, $end, $ini);
        if ($stop === false) return "";
        $len = $stop - $ini;
        if ($return) return $start.substr($string, $ini, $len).$end;
        else return substr($string, $ini, $len);
    }

    function get_page_data($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        curl_close($ch);
        return ($output !== false) ? $output : null;
    }

    // $match[1] is assumed to hold the number of result pages
    for ($i = 1; $i <= $match[1]; $i++) {
        $url = "http://www.example.com/page?page={$i}";
        $data = get_page_data($url);
        if ($data) {
            $cleaned = string_between('onload="check();">', '</body>', $data);
            $result = strip_tags($cleaned, '<table><tr><td><div>');
            // $result is what would go on to the parsing and storing steps
        }
    }

Well, I am a bit confused. Can anybody clear this up a bit and put the snippets together in the right manner?

Love to hear from you.

Greetings,
dilbertone
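For the DomDocument step itself, something along these lines might be a starting point. It is a rough sketch that uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and it feeds the Firebug XPaths from this post straight into the query. The function name extract_school() and the cut-down sample page are my own assumptions, and the XPaths come from the browser's view of the page, so they may need adjusting against the HTML that cURL actually returns.

```php
<?php
// Hypothetical helper (name and field list are assumptions): pull the address
// fields out of one detail page with DOMDocument and DOMXPath.
function extract_school(string $html): array
{
    // XPaths as reported by Firebug for the detail pages.
    $paths = array(
        'name'     => '/html/body/div[2]',
        'address1' => '/html/body/div[4]',
        'address2' => '/html/body/div[6]',
        'email'    => '/html/body/div[9]/a',
        'tel'      => '/html/body/div[11]',
        'fax'      => '/html/body/div[12]',
    );
    $record = array_fill_keys(array_keys($paths), '');
    if (trim($html) === '') {
        return $record; // nothing fetched, nothing to parse
    }

    // Old, loosely valid HTML triggers libxml warnings; collect them quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($paths as $key => $path) {
        $nodes = $xpath->query($path);
        if ($nodes === false || $nodes->length === 0) {
            continue; // node missing: keep the empty default
        }
        $text = $nodes->item(0)->textContent;
        $text = str_replace("\xC2\xA0", ' ', $text);         // non-breaking space to plain space
        $text = preg_replace('/^\s*(Tel|Fax):/', '', $text); // drop the Tel:/Fax: labels
        $record[$key] = trim($text);
    }
    return $record;
}

// Usage with a cut-down page that mimics the div layout of the site.
$html = '<html><body>'
      . '<div> </div><div>Altes Schulhaus Ossingen</div>'
      . '<div> </div><div>Guntibachstrasse 10</div>'
      . '<div></div><div>8475 Ossingen</div>'
      . '<div> </div><div><a href=""></a></div>'
      . '<div><a href="mailto:sekretariat.psossingen@bluewin.ch">sekretariat.psossingen@bluewin.ch</a></div>'
      . '<div> </div><div>Tel: 052 317 15 45</div><div>Fax:</div>'
      . '</body></html>';

$record = extract_school($html);
print_r($record);
```

In the loop, the string returned by get_page_data() would be passed to this function in place of the sample $html. Working on the DOM directly would also make the string_between() and strip_tags() cleanup unnecessary for this step.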