
Screen Scraping


ctcmedia


I don't know if this is the right section, but it does involve regex.

 

What I am trying to do is scrape http://www.channel4.com/programmes/4od/all to get the titles and links, so I can recurse through the website picking up information.

 

I know how to get the title names and hrefs through regex, but there is one major problem: there are a lot of spans with the same class name. So my question is: how can I loop through the results to get all the title names, links, etc.?

 

 

Here is some code I am experimenting with, which has the problem:

 

 

<?php
require('function.php'); // Has some custom functions in this file

$url = "http://www.channel4.com/programmes/4od/all"; // Url we are searching

$getUrl = curlEngine($url);


$content = $getUrl['content'];

preg_match('#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>.*?</span>#s', $content, $match);

foreach($match as $match){

print $match;

}
?>

 

Any help would be appreciated

 

Thanks

 

Paul


Fair enough. In that case, the pattern you have will match 692 show names from that page, assuming you switch to preg_match_all.

 

preg_match_all('#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>.*?</span>#s', $content, $match);

echo '<pre>';
print_r($match[0]);
echo '</pre>';
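
If you also want the link that goes with each title, one option is to add capture groups and loop over the result with PREG_SET_ORDER. This is only a rough sketch: the sample markup below is made up (I am assuming each span sits directly inside its anchor, which may not be how the real page is built), and extractProgrammes() is just a name I picked for illustration.

```php
<?php
// Hypothetical sample markup; the real page structure may well differ.
$content = <<<'HTML'
<li><a href="/programmes/peep-show/4od"><span class="programme-info">Peep Show</span></a></li>
<li><a href="/programmes/example-show/4od"><span class="programme-info">Example Show</span></a></li>
HTML;

// Illustrative helper (not part of function.php): returns one
// ['href' => ..., 'title' => ...] entry per matching anchor/span pair.
function extractProgrammes(string $html): array
{
    // Group 2 captures the href, group 4 the span contents.
    // Assumes the span immediately follows the opening <a> tag.
    $pattern = '#<a\b[^>]*href=(["\'])(.*?)\1[^>]*>\s*'
             . '<span\b[^>]*class=(["\'])programme-info\3[^>]*>(.*?)</span>#s';

    // PREG_SET_ORDER gives one array per match, so each $m holds
    // the href and title that belong together.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $programmes = [];
    foreach ($matches as $m) {
        $programmes[] = [
            'href'  => $m[2],
            'title' => trim(strip_tags($m[4])),
        ];
    }
    return $programmes;
}

foreach (extractProgrammes($content) as $programme) {
    echo $programme['title'], ' => ', $programme['href'], "\n";
}
```

Note the loop variable has a different name from the array being looped over; reusing $match for both, as in the original snippet, overwrites the results while you iterate.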


Cool, I forgot the _all :)

 

I got it parsing. All I wanted from that page was the links, so I can spider out and gather information like the title, description, etc. All is good apart from trying to extract the flashvars :(

 

This is an example of the markup I am trying to extract from:

 

<div id="flashContainer" class="cb flashEnabled hidden">
<object id="catchUpPlayer" width="625" height="1" type="application/x-shockwave-flash" data="/static/programmes/asset/flash/swf/4odplayer-4.30.2.swf">
<param name="align" value="top"/>
<param name="scale" value="noscale"/>
<param name="salign" value="lt"/>
<param name="allowFullScreen" value="true"/>
<param name="bgcolor" value="#000000"/>
<param name="allowScriptAccess" value="always"/>
<param name="wmode" value="opaque"/>
<param name="flashvars" value="brandTitle=Peep%20Show&wsBrandTitle=peep-show&primaryColor=0xCC0000&secondaryColor=0xCC0000&invertSkin=false&preSelectAsset=3005546&preSelectAssetGuidance=Strong%20language%20and%20adult%20humour&preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg&pinRequestCallback=C4.PinController.doPinChecks"/>
</object>

 

 

and I am trying to get this part out of it:

preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg

I am trying preg_match again, but nothing seems to be working:

 

<?php
require('function.php');

$url = "http://www.channel4.com/programmes/peep-show/4od";

$page = curlEngine($url);

// Find the catchUpPlayer object inside flashContainer
preg_match('#<object\b[^>]*id=([\'"])?catchUpPlayer(?(1)\1)[^>]*>.*?</object>#s', $page['content'], $bigImage);

?>

 

Any hints? :)

Thanks

 

Paul
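
One possible approach, sketched here rather than tested against the live site: instead of matching the whole object, match just the flashvars param and let parse_str() split the value into key/value pairs. The sample markup is trimmed from the post above; extractFlashVars() is a hypothetical helper name, and the pattern assumes the name attribute comes before value, as it does in that sample.

```php
<?php
// Sample markup trimmed from the post above.
$html = <<<'HTML'
<object id="catchUpPlayer" width="625" height="1" type="application/x-shockwave-flash">
<param name="wmode" value="opaque"/>
<param name="flashvars" value="brandTitle=Peep%20Show&wsBrandTitle=peep-show&preSelectAsset=3005546&preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg&pinRequestCallback=C4.PinController.doPinChecks"/>
</object>
HTML;

// Illustrative helper (not part of function.php): returns the flashvars
// as an associative array, or an empty array if the param is not found.
function extractFlashVars(string $html): array
{
    // Group 3 captures the value attribute of the flashvars param.
    // Assumes name="flashvars" appears before value="..." in the tag.
    $pattern = '#<param\b[^>]*name=(["\'])flashvars\1[^>]*value=(["\'])(.*?)\2#is';
    if (!preg_match($pattern, $html, $m)) {
        return [];
    }

    // In case the page encodes the separators as &amp; rather than a bare &.
    $query = html_entity_decode($m[3]);

    // parse_str() splits on & and URL-decodes each value.
    parse_str($query, $vars);
    return $vars;
}

$vars = extractFlashVars($html);
echo $vars['preSelectAssetImageURL'] ?? 'not found', "\n";
```

With the real page you would pass $page['content'] in place of $html. Since parse_str() gives you every flashvar at once, the same array should also hold brandTitle, preSelectAsset and the rest if you need them later.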
