Pramit Posted December 27, 2007
I wrote code that follows each link in the sitemap, opens the page, reads and parses the page structure, and then inserts the required data into my database. But it is crap, since it takes a hell of a lot of time. Any suggestions for an easier and faster method are most welcome.
php? Posted December 27, 2007
Sorry, accidental post.
tinker Posted December 27, 2007
It'd be good to show some code! Or at least explain what method you're using?
Pramit (Author) Posted December 27, 2007
I have tried using DOM, plain PHP file_get_contents(), and cURL, but all in vain.
tinker Posted December 27, 2007
How are you parsing the page, and what info do you extract? Please show an example. How many pages are there, how do you store and correlate the links, and do you have any timings?
P.S. I don't use cURL.
Pramit (Author) Posted December 27, 2007
function parseHTML($path) {
    /* The data will come as XML format. So check whether valid XML loaded or not. */
    $file = file_get_contents($path);
    $property_image = null;
    $property_beds = null;
    // file_get_contents() returns false on failure, so test for that first.
    if ($file !== false && strlen($file) > 0) {
        //echo strstr($file,"photoframe");
        # for extracting property image
        $pics = explode("photoframe", $file);
        if (isset($pics[1])) {
            $img = explode("src=\"", $pics[1]);
            // Guard against a "photoframe" marker with no src attribute after it.
            if (isset($img[1])) {
                $img = explode("\"", $img[1]);
                $property_image = $img[0];
                //echo $property_image;
            }
        }
        # for extracting property bedrooms
        $beds = explode("<p class=\"bedrooms\">", $file);
        if (isset($beds[1])) {
            //echo $beds[1];
            $beds = explode(" ", $beds[1]);
            // The value is taken as the second space-separated token.
            if (isset($beds[1])) {
                $property_beds = $beds[1];
            }
        }
    }
    // Assumption: the posted snippet is cut off before this point; returning the values is a guess.
    return array($property_image, $property_beds);
}
?>
This is sample code using plain PHP. Here $path can be any one of a few hundred URLs, and I have to navigate further into each $path to get more similar paths.
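For comparison, here is a minimal sketch of the same two extractions done with PHP's DOM extension, which Pramit mentions having tried. It is not the poster's code. The function name extractProperty() is hypothetical, and the markup is an assumption: "photoframe" is taken to be a class on the image or on an element wrapping it, and the bedroom count is taken to be the first number inside the bedrooms paragraph. It works on an HTML string, so fetching the page is left to the caller.

```php
<?php
// Hypothetical helper: pull the property image URL and bedroom count from an HTML string.
function extractProperty(string $html): array
{
    $result = ['image' => null, 'bedrooms' => null];
    if ($html === '') {
        return $result;
    }

    // Real-world pages are rarely valid markup, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Assumption: "photoframe" is a class on the <img> itself or on an ancestor of it.
    $src = $xpath->query('//*[contains(@class, "photoframe")]/descendant-or-self::img/@src')->item(0);
    if ($src !== null) {
        $result['image'] = $src->nodeValue;
    }

    // Assumption: the bedroom count is the first run of digits in <p class="bedrooms">.
    $beds = $xpath->query('//p[@class="bedrooms"]')->item(0);
    if ($beds !== null && preg_match('/\d+/', $beds->textContent, $m)) {
        $result['bedrooms'] = (int) $m[0];
    }

    return $result;
}
```

Parsing a single page is usually cheap either way. With a few hundred URLs the time most likely goes into downloading them one after another, so it is worth timing the fetch and the parse separately before rewriting the parser.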
cooldude832 Posted December 27, 2007
I've done a few XHTML -> database reverse-engineering jobs, and the best method is this:
1) file_get_contents() the raw data
2) Regex all tags down to bare tags, i.e. <div class="fish"> becomes <div>
3) strip_tags() everything except divs or tds
4) explode() at the divs/tds
5) Parse based on this; dynamic portions are then handled with while loops that run until a strstr() no longer matches
It works fairly well so long as the structure is constant.
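The steps above could be sketched roughly like this. It is one reading of the description, not cooldude832's actual code. The function names splitIntoCells() and collectWhileMatching() are hypothetical, step 1 (fetching the raw data) is left to the caller so the functions work on a plain string, and preg_split() is swapped in for explode() so that opening and closing div/td tags are handled in one pass.

```php
<?php
// Hypothetical helper covering steps 2 to 4: reduce HTML to the text chunks between div/td tags.
function splitIntoCells(string $html): array
{
    // Step 2: drop attributes so <div class="fish"> becomes <div>.
    // Caveat: an attribute value that contains ">" will confuse this pattern.
    $bare = preg_replace('/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/', '<$1>', $html);

    // Step 3: strip every tag except div and td.
    $bare = strip_tags($bare, '<div><td>');

    // Step 4: split at opening and closing div/td tags.
    $parts = preg_split('/<\/?(?:div|td)>/i', $bare);

    // Trim whitespace and drop the empty chunks left between adjacent tags.
    $parts = array_map('trim', $parts);
    return array_values(array_filter($parts, function ($s) {
        return $s !== '';
    }));
}

// Hypothetical helper for step 5: gather consecutive cells for as long as strstr() keeps matching.
function collectWhileMatching(array $cells, int $start, string $needle): array
{
    $matched = [];
    $i = $start;
    while (isset($cells[$i]) && strstr($cells[$i], $needle) !== false) {
        $matched[] = $cells[$i];
        $i++;
    }
    return $matched;
}
```

As the post says, this only holds up while the page structure stays constant: the cell positions and the marker text passed as $needle are tied to one specific layout.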