
Get a specific portion of HTML source with cURL: problems retrieving the right content



I'm building a website that presents different products, and I've run into a few problems using cURL. Basically, I need to pull some portions of HTML from different websites and display them on my site, e.g. title, model, description, user reviews, etc. I got part of the code working, but when I change the source URL it stops working, even though the source site is the same. My code:

 

$url = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=2819129&CatId=4938";

//$url = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1808177&csid=_61"; //this one is not working....

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);

$source = curl_exec ($ch);

$start_description1 = "</tr>
</tbody>
</table>




<p>";
$end_description1 = "</div>
</div>
<div id=\"Videos\" style=\"display:inline;\">";
$description1_start_pos = strpos($source, $start_description1) + strlen($start_description1);
$description1_end_pos = strpos($source, $end_description1) - $description1_start_pos;
$description1 = substr($source, $description1_start_pos, $description1_end_pos);
echo $description1;

 

It works perfectly, but if I change the URL it stops working. The problem is the HTML in $start_description1: on other pages the markup differs.

 

Instead of:

 

</tr>
</tbody>
</table>




<p>

 

the new page has:

 

</tr>
</tbody>
</table>


<p>

 

or:

 

</tr>
</tbody>
</table>

<p>

 

How can I avoid this error? Or what should I do to avoid cURL errors and retrieve the content I want?

 

Thank you!

 

 

It looks to me like your problem doesn't involve cURL at all. It's instead in trying to parse out the portions of the data you want from the TigerDirect markup. Trying to find variable data inside HTML markup using simple string matching or regular expressions is notoriously painful and error-prone. A much better solution is to take the page and use the DOM functions to find and extract the portions you need.

 

 

It's not clear to me exactly what you are after. Also, TigerDirect's pages are pretty messy. In this example I just dump out the "ProductReview" portion of the DOM:

 

// Collect libxml parse warnings quietly; real-world markup is rarely valid HTML.
libxml_use_internal_errors(true);
$urls[] = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=2819129&CatId=4938";
$urls[] = "http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1808177&csid=_61";

// Fetch a URL with cURL and return the response body (false on failure).
// Both URLs are plain http, so the SSL options from the question are not needed.
function curlLoad($url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $source = curl_exec($ch);
        return $source;
}

foreach ($urls as $url) {
        $source = curlLoad($url);
        if ($source === false) {
                echo "Could not fetch $url\n";
                continue;
        }
        // loadHTML() has to be called on an instance; the static call fails in PHP 8.
        $dom = new DOMDocument();
        $dom->loadHTML($source);
        $prodReviewElement = $dom->getElementById('ProductReview');
        // getElementById() returns null when the page has no such element.
        $prodReview = $prodReviewElement ? $dom->saveXML($prodReviewElement) : "(no ProductReview element found)\n";
        echo "***********************************************\n\n";
        echo "$url\n";
        echo "***********************************************\n\n";
        echo $prodReview;
}

 

Have a look at the DOM manual, DOMDocument, etc. The only tricky thing I saw was that they often don't use ids, so if you plan to extract individual elements, many of them would have to be found by class. That means doing an XPath search, which is a bit more complicated, but still the best approach.
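
To give an idea of what a search by class could look like, here is a minimal sketch using DOMXPath. The markup, the class names (prodName, shortDesc) and the helper textByClass() are all made up for illustration; they are not taken from TigerDirect's pages, so you would have to look at the real source and substitute whatever classes it actually uses. The helper also assumes a plain class name with no quote characters in it.

```php
<?php
// Hypothetical markup standing in for a product page; the class names are invented.
$html = <<<HTML
<html><body>
<div class="prodName"><h1>Example Widget</h1></div>
<div class="shortDesc highlight"><p>A small widget.</p></div>
<div class="shortDescExtra"><p>Not this one.</p></div>
</body></html>
HTML;

// Return the trimmed text of every element whose class attribute
// contains $class as a whole, space-separated token.
function textByClass(string $html, string $class): array
{
    // Keep libxml parse warnings out of the output while loading.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Padding the class list with spaces makes "shortDesc" match
    // "shortDesc highlight" but not "shortDescExtra".
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $results = [];
    foreach ($xpath->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

print_r(textByClass($html, 'shortDesc'));
print_r(textByClass($html, 'prodName'));
```

Because the lookup goes through the parsed tree, it doesn't care how many blank lines sit between the closing table tag and the paragraph, which is exactly what broke the strpos() approach in the question. In a real script you would pass in the string returned by curl_exec() instead of the inline sample.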
