Get a specific portion of HTML source with cURL: problems retrieving the right content

abrilfluke · August 4, 2012

I'm building a website that presents different products, and I've run into a few problems using cURL. Basically, I need to get some portions of HTML from different websites and display them on my site, e.g. title, model, description, user reviews, etc. I managed to get some of the code working, but it stops working when I change the source URL, even though the source site is the same. My code:

$url = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=2819129&CatId=4938";

//$url = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1808177&csid=_61"; //this one is not working....

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);

$source = curl_exec ($ch);

$start_description1 = "</tr>
</tbody>
</table>




<p>";
$end_description1 = "</div>
</div>
<div id=\"Videos\" style=\"display:inline;\">";
$description1_start_pos = strpos($source, $start_description1) + strlen($start_description1);
$description1_end_pos = strpos($source, $end_description1) - $description1_start_pos;
$description1 = substr($source, $description1_start_pos, $description1_end_pos);
echo $description1;

It works perfectly, but if I change the URL it stops working. The problem is the $start_description1 HTML: on other pages the markup differs.

instead of:

</tr>
</tbody>
</table>




<p>

the new page has:

</tr>
</tbody>
</table>


<p>

or:

</tr>
</tbody>
</table>

<p>

How can I avoid this error? What should I do to avoid cURL errors and retrieve the content I want?

thank you!

gizmola · August 4, 2012

It looks to me like your problem doesn't involve cURL at all. It lies instead in parsing out the portions of the data you want from the TigerDirect markup. Trying to find variable data inside HTML markup using simple string matching or regular expressions is notoriously painful and error-prone. A much better solution is to take the page and use the DOM functions to find and extract the portions you need.

abrilfluke · August 4, 2012

Could you please paste an example using DOM?

thank you!

gizmola · August 4, 2012

It's not clear to me exactly what you are after. Also, TigerDirect's pages are pretty messy. In this example I just dump out the 'ProductReview' portion of the DOM:

error_reporting(E_ERROR); // hide the warnings loadHTML() emits for messy markup
$urls[] = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=2819129&CatId=4938";
$urls[] = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1808177&csid=_61";

// Fetch a page and return its HTML as a string (false on failure).
function curlLoad($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); // newer cURL versions no longer accept 1
        $source = curl_exec($ch);
        return $source;
}

foreach ($urls as $url) {
        $source = curlLoad($url);
        if ($source === false || $source === '') {
                echo "Could not load $url\n";
                continue;
        }
        // loadHTML() must be called on an instance; the static call fails on current PHP
        $dom = new DOMDocument();
        $dom->loadHTML($source);
        $prodReviewElement = $dom->getElementById('ProductReview');
        if ($prodReviewElement === null) {
                echo "No ProductReview element in $url\n";
                continue;
        }
        $prodReview = $dom->saveXML($prodReviewElement);
        echo "***********************************************\n\n";
        echo "$url\n";
        echo "***********************************************\n\n";
        echo $prodReview;
}

Have a look at the DOM section of the PHP manual (DOMDocument etc.). The only tricky thing I saw was that they often don't use ids, so if you plan to extract individual elements, many of them would have to be found by class. For that you'd do an XPath search, which is a bit more complicated, but still the best approach.
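As a rough sketch of what a class-based XPath lookup could look like: the class name productDesc, the sample markup, and the helper name textByClass are all made up for illustration and are not taken from TigerDirect's pages. The three sample strings differ only in the blank lines and extra classes around the target element, which is the kind of variation that breaks a strpos() match but should not matter to a DOM query.

```php
<?php
// Hypothetical sample pages: same structure, different whitespace and class lists.
$pages = [
    "<table><tr><td>specs</td></tr></table>\n\n\n\n\n<div class=\"productDesc\"><p>First product</p></div>",
    "<table><tr><td>specs</td></tr></table>\n\n<div class=\"productDesc wide\"><p>Second product</p></div>",
    "<table><tr><td>specs</td></tr></table>\n<div class=\"wide productDesc\">\n<p>Third product</p>\n</div>",
];

// Return the trimmed text of every element whose class attribute
// contains $class as a whole, space-separated word.
// $class is assumed to be a plain class name without quote characters.
function textByClass($html, $class) {
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Padding with spaces keeps "productDesc" from matching "productDescription".
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    $results = [];
    foreach ($xpath->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

foreach ($pages as $html) {
    echo implode(", ", textByClass($html, 'productDesc')), "\n";
}
```

With real pages you would pass the string returned by curl_exec() instead of the sample strings, and replace productDesc with whatever class the site actually uses for the element you want.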
