Pramit (posted December 27, 2007)
Thread: https://forums.phpfreaks.com/topic/83321-webscraper/
I wrote some code that follows each link in the sitemap, opens those pages, reads and parses the page structure, and then inserts the required data into my database. But it's crap, since it takes a hell of a lot of time. Any suggestions for an easier and faster method are most welcome.

php? (posted December 27, 2007)
Sorry, accidental post.

tinker (posted December 27, 2007)
It'd be good to show some code, or at least explain what method you're using.

Pramit (author, posted December 27, 2007)
I have tried using DOM, plain PHP file_get_contents(), and cURL, but all in vain.

tinker (posted December 27, 2007)
How are you parsing the page, and what info do you extract? Please show an example. How many pages are there, how do you store and correlate the links, and do you have any timings?
P.S. I don't use cURL.

Pramit (author, posted December 27, 2007)

    function parseHTML($path)
    {
        /* The data will come in XML format, so check whether valid XML was loaded or not. */
        $file = file_get_contents($path);
        if (strlen($file) > 0) {
            //echo strstr($file, "photoframe");

            # for extracting the property image
            $pics = explode("photoframe", $file);
            if (isset($pics[1])) {
                $img = explode("src=\"", $pics[1]);
                $img = explode("\"", $img[1]);
                $property_image = $img[0];
                //echo $property_image;
            }

            # for extracting the property bedrooms
            $beds = explode("<p class=\"bedrooms\">", $file);
            if (isset($beds[1])) {
                //echo $beds[1];
                $beds = explode(" ", $beds[1]);
                $property_beds = $beds[1];
            }
        }
    }

This is sample code using plain PHP. Here $path can be any of a few hundred URLs, and I have to navigate further into each $path to get more similar paths.

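Since timings were asked for, here is a minimal sketch of how the parsing step could be pulled out of parseHTML() and timed on its own, which might help show whether the time goes into parsing or into fetching the pages. The function name extractProperty(), the sample markup, and the assumption that the bedroom count is the first number after the bedrooms tag are all made up for illustration; they are not from the original code, which takes the second space-separated token instead.

```php
<?php
// Hypothetical helper (not from the original post): extracts the same two
// fields as parseHTML(), but from a string that has already been fetched,
// so fetching and parsing can be measured separately.
function extractProperty(string $html): array
{
    $result = ['image' => null, 'beds' => null];

    // First src="..." value that appears after the word "photoframe".
    if (preg_match('/photoframe.*?src="([^"]*)"/s', $html, $m)) {
        $result['image'] = $m[1];
    }

    // ASSUMPTION: the bedroom count is the first run of digits in the text
    // right after <p class="bedrooms">.
    if (preg_match('/<p class="bedrooms">[^<]*?(\d+)/', $html, $m)) {
        $result['beds'] = (int) $m[1];
    }

    return $result;
}

// Made-up sample markup, only for illustration.
$sample = '<div class="photoframe"><img src="/img/house1.jpg" alt=""></div>'
        . '<p class="bedrooms">Bedrooms: 3</p>';

// Time the parsing step alone.
$start = microtime(true);
$property = extractProperty($sample);
$elapsed = microtime(true) - $start;

printf("image=%s beds=%d parse=%.6fs\n", $property['image'], $property['beds'], $elapsed);
```

Wrapping the file_get_contents($path) call in the same microtime(true) pair would give the fetch time per page for comparison.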
cooldude832 (posted December 27, 2007)
I've done a few reverse-engineering jobs of XHTML -> database, and the best method is this:
1) file_get_contents() the raw data
2) Regex all tags down to just raw tags, i.e. <div class="fish"> becomes <div>
3) strip_tags() everything except divs or tds
4) explode() at the divs/tds
5) Parse based on this; dynamic portions are then treated with while loops until a strstr() is not matched
It works fairly well so long as the structure is constant.

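Steps 2 to 4 above could look roughly like the sketch below. This is only one reading of the description, not cooldude832's actual code: the function name splitIntoCells() and the sample markup are hypothetical, the fetch in step 1 is replaced by an inline string, and step 5 is left out because it depends on the page being scraped. The attribute-stripping regex is naive and will trip over a ">" inside an attribute value.

```php
<?php
// Hypothetical helper (name and details are assumptions for illustration).
function splitIntoCells(string $html): array
{
    // Step 2: drop attributes from opening tags,
    // so <div class="fish"> becomes <div>.
    $bare = preg_replace('/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/', '<$1>', $html);

    // Step 3: strip every tag except div and td.
    $bare = strip_tags($bare, '<div><td>');

    // Step 4: split at the remaining div/td tags (opening or closing).
    $parts = preg_split('#</?(?:div|td)>#', $bare);
    $parts = array_map('trim', $parts);

    // Drop empty pieces; 'strlen' keeps a literal "0", unlike the default filter.
    return array_values(array_filter($parts, 'strlen'));
}

// Made-up sample markup standing in for the result of file_get_contents().
$sample = '<table><tr><td class="label">Bedrooms</td><td>3</td></tr></table>'
        . '<div id="x"><b>Price</b>: 100</div>';

print_r(splitIntoCells($sample));
```

Using preg_split() instead of explode() lets opening and closing div/td tags be handled in one pass; plain explode() on a single tag string would work the same way when only one kind of tag is left.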