webscraper


Pramit


<?php
function parseHTML($path)
{
    /* The data will come in XML format, so check that something was actually loaded. */
    $file = file_get_contents($path);
    if ($file !== false && strlen($file) > 0)
    {
        //echo strstr($file,"photoframe");

        # for extracting property image
        $pics = explode("photoframe", $file);
        if (isset($pics[1]))
        {
            // Take the first src="..." value that follows the "photoframe" marker.
            $img = explode("src=\"", $pics[1]);
            if (isset($img[1]))
            {
                $img = explode("\"", $img[1]);
                $property_image = $img[0];
                //echo $property_image;
            }
        }

        # for extracting property bedrooms
        $beds = explode("<p class=\"bedrooms\">", $file);
        if (isset($beds[1]))
        {
            //echo $beds[1];
            // Take the second space-separated token after the opening tag.
            $beds = explode(" ", $beds[1]);
            if (isset($beds[1]))
            {
                $property_beds = $beds[1];
            }
        }
    }
}
?>

 

This is sample code using plain PHP. There can be a few hundred values of $path, and I have to navigate further into each $path to find more similar paths.


I've done a few reverse-engineering jobs going from XHTML to a database, and the best method is this:

1) file_get_contents() the raw data

2) Regex all tags down to bare tags, i.e. <div class="fish"> becomes <div>

3) strip_tags() everything except divs or tds

4) explode() at the divs/tds

5) Parse based on this; dynamic portions are then handled with while loops that run until a strstr() is no longer matched

Works fairly well so long as the structure is constant.
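A rough sketch of how the steps above could look in PHP, in case it helps. The function names extractCells() and collectRun(), the sample markup and the "Room:" marker are all made up for illustration, and step 1 is replaced by an inline string so the snippet runs on its own; in real use $html would come from file_get_contents($path).

```php
<?php
// Hypothetical helper: reduce markup to a flat list of div/td text cells.
function extractCells(string $html): array
{
    // Step 2: drop attributes, e.g. <div class="fish"> becomes <div>.
    $bare = preg_replace('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/', '<$1$2>', $html);

    // Step 3: remove every tag except div and td.
    $kept = strip_tags($bare, '<div><td>');

    // Step 4: turn every remaining tag into one delimiter, then explode.
    $kept = str_replace(['</div>', '<td>', '</td>'], '<div>', $kept);
    $cells = array_map('trim', explode('<div>', $kept));

    // Discard empty cells and reindex from 0.
    return array_values(array_filter($cells, 'strlen'));
}

// Hypothetical helper for step 5: collect a run of consecutive cells
// that contain $marker.
function collectRun(array $cells, string $marker): array
{
    $i = 0;
    $count = count($cells);

    // Skip ahead to the first cell that matches.
    while ($i < $count && strstr($cells[$i], $marker) === false) {
        $i++;
    }

    // The dynamic portion: keep going until strstr() no longer matches.
    $run = [];
    while ($i < $count && strstr($cells[$i], $marker) !== false) {
        $run[] = $cells[$i];
        $i++;
    }
    return $run;
}

// Step 1 stand-in: made-up sample markup instead of file_get_contents($path).
$html = '<div class="listing"><p class="bedrooms">3 bedrooms</p></div>'
      . '<table><tr><td class="x">Room: kitchen</td><td>Room: lounge</td>'
      . '<td>Price: 100</td></tr></table>';

$cells = extractCells($html);
$rooms = collectRun($cells, 'Room:');
print_r($rooms);
```

With this sample, $cells ends up holding four strings and $rooms holds the two "Room:" cells. Note that step 2 throws away every attribute, so anything that lives in an attribute (an image src, for example) has to be pulled out before that step.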
