webscraper

Pramit · December 27, 2007

I wrote code that follows each link in the sitemap, opens the page, reads and parses the page structure, and then inserts the required data into my database. But it's no good, since it takes a hell of a lot of time. Any suggestions for an easier and faster method are most welcome.

php? · December 27, 2007

Sorry, accidental post.

tinker · December 27, 2007

It'd be good to show some code, or at least explain what method you're using.

Pramit · December 27, 2007

I have tried using DOM, plain PHP file_get_contents(), and cURL, but all in vain.

tinker · December 27, 2007

How are you parsing the page, and what info are you extracting? Please show an example.

How many pages are there, how do you store and correlate the links, and do you have any timings?

P.S. I don't use cURL.
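One way to get the timings asked about above is to time the fetch and the parse separately, which shows whether the network or the string handling dominates. This is only a minimal sketch: `timed()` is a hypothetical helper, and the stand-in HTML string replaces the real `file_get_contents($path)` call so the snippet runs without network access.

```php
<?php
// Hypothetical helper: runs a callable and returns array(result, seconds).
function timed(callable $fn)
{
    $start = microtime(true);
    $result = $fn();
    return array($result, microtime(true) - $start);
}

// Stand-in for the real fetch; in the actual script this would be
// file_get_contents($path).
$fetch = function () {
    return '<div class="photoframe"><img src="house.jpg"></div>';
};

list($html, $fetchSeconds) = timed($fetch);

// Time the same kind of explode() call the parser relies on.
list($parts, $parseSeconds) = timed(function () use ($html) {
    return explode('photoframe', $html);
});

printf("fetch: %.4fs, parse: %.4fs\n", $fetchSeconds, $parseSeconds);
```

If the fetch time is far larger than the parse time, changing the parsing method is unlikely to help much.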

Pramit · December 27, 2007

function parseHTML($path)
{
    /* Fetch the raw page; file_get_contents() returns false on failure. */
    $file = file_get_contents($path);
    $property_image = null;
    $property_beds = null;

    if ($file !== false && strlen($file) > 0)
    {
        # for extracting property image: first src="..." after "photoframe"
        $pics = explode("photoframe", $file);
        if (isset($pics[1]))
        {
            $img = explode("src=\"", $pics[1]);
            if (isset($img[1]))
            {
                $img = explode("\"", $img[1]);
                $property_image = $img[0];
            }
        }

        # for extracting property bedrooms: second space-separated token
        # after <p class="bedrooms">
        $beds = explode("<p class=\"bedrooms\">", $file);
        if (isset($beds[1]))
        {
            $beds = explode(" ", $beds[1]);
            if (isset($beds[1]))
            {
                $property_beds = $beds[1];
            }
        }
    }

    return array('image' => $property_image, 'beds' => $property_beds);
}

This is sample code using plain PHP. There can be a few hundred values of $path, and I have to navigate further into each $path to get more similar paths.
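Following paths that lead to more paths can be sketched as a queue with a "seen" set, so that no page is fetched twice. This is an illustration under assumptions, not the code from this thread: `extractLinks()` and `crawl()` are hypothetical names, links are assumed to be double-quoted href attributes, and the `$pages` array stands in for real fetches so the snippet runs offline.

```php
<?php
// Hypothetical helper: collect double-quoted href values from anchor tags.
function extractLinks($html)
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1];
}

// Breadth-first crawl; $fetch is any callable that maps a path to HTML
// (or false on failure). $maxPages caps the number of pages visited.
function crawl($start, callable $fetch, $maxPages = 500)
{
    $queue = array($start);
    $seen = array($start => true);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $path = array_shift($queue);
        $html = $fetch($path);
        $visited[] = $path;
        if ($html === false || $html === '') {
            continue;
        }
        foreach (extractLinks($html) as $link) {
            // Queue each link only the first time it is encountered.
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Stand-in pages; a real run would pass a file_get_contents() wrapper.
$pages = array(
    '/a' => '<a href="/b">b</a><a href="/c">c</a>',
    '/b' => '<a href="/a">a</a><a href="/c">c</a>',
    '/c' => '<p>no links</p>',
);
$order = crawl('/a', function ($path) use ($pages) {
    return isset($pages[$path]) ? $pages[$path] : false;
});
print_r($order);
```

Each page is fetched once even though /a and /b link to each other, which matters when a few hundred paths share many of the same links.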

cooldude832 · December 27, 2007

I've reverse-engineered XHTML into a database a few times, and the method that has worked best for me is this:

1) Fetch the raw data with file_get_contents()

2) Use a regex to reduce every tag to its bare form, i.e. <div class="fish"> becomes <div>

3) strip_tags() everything except divs or tds

4) explode() at the divs/tds

5) Parse based on this; dynamic portions are then handled with while loops that run until a strstr() no longer matches

It works fairly well so long as the structure is constant.
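The steps above could look roughly like the sketch below. It is one possible reading of the description rather than the exact code it refers to: `splitCells()` is a hypothetical name, the sample markup is made up, `preg_split()` is used in place of `explode()` so that opening and closing tags are handled in one pass, and step 1 is replaced by an inline string so the snippet runs offline.

```php
<?php
// Hypothetical helper covering steps 2 to 4.
function splitCells($html)
{
    // Step 2: drop attributes, so <div class="fish"> becomes <div>.
    $html = preg_replace('/<(\w+)[^>]*>/', '<$1>', $html);
    // Step 3: remove every tag except div and td.
    $html = strip_tags($html, '<div><td>');
    // Step 4: split at the remaining div/td tags.
    $cells = preg_split('/<\/?(?:div|td)>/', $html);
    $cells = array_map('trim', $cells);
    // Discard empty pieces and reindex from zero.
    return array_values(array_filter($cells, 'strlen'));
}

// Step 1 stand-in: made-up markup instead of file_get_contents($path).
$html = '<div class="listing"><p class="bedrooms">3 bedrooms</p></div>'
      . '<div class="listing"><p class="bedrooms">2 bedrooms</p></div>';

$cells = splitCells($html);

// Step 5: walk the dynamic portion until strstr() stops matching.
$beds = array();
$i = 0;
while (isset($cells[$i]) && strstr($cells[$i], 'bedrooms') !== false) {
    $beds[] = (int) $cells[$i]; // leading digits of e.g. "3 bedrooms"
    $i++;
}
print_r($beds);
```

Note that step 2 throws away attribute values, so anything stored in an attribute (an image src, for example) would have to be captured before the tags are reduced.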
