
Screen Scraping


ctcmedia


I don't know if this is the right section, but it does involve regex.

 

What I am trying to do is scrape http://www.channel4.com/programmes/4od/all

 

to get the titles and links so I can recurse through the website picking up information

 

I know how to get the title names and hrefs with regex, but there is one major problem: there are a lot of spans with the same class name. So my question is, how can I loop through the results to get all the title names, links, etc.?

 

 

Here is some code I am experimenting with, which has the problem:

 

 

<?php
require('function.php'); // Has some custom functions in this file

$url = "http://www.channel4.com/programmes/4od/all"; // Url we are searching

$getUrl = curlEngine($url);


$content = $getUrl['content'];

preg_match('#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>.*?</span>#s', $content, $match);

foreach($match as $match){

print $match;

}
?>

 

Any help would be appreciated

 

Thanks

 

Paul

https://forums.phpfreaks.com/topic/179960-screen-scraping/

Fair enough. In that case the pattern you have will match 692 show names from that page assuming you switch to preg_match_all.

 

preg_match_all('#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>.*?</span>#s', $content, $match);

echo '<pre>';
print_r($match[0]);
echo '</pre>';
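
To loop over the results and pull out the link and title separately, rather than the whole tag, capture groups combined with PREG_SET_ORDER are one option. The following is only a minimal sketch: the sample markup in $content (an anchor wrapping a programme-info span) is an assumption for illustration and may not match the real Channel 4 page, and extractProgrammes() is a hypothetical helper name, not something from function.php.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$content = <<<'HTML'
<li><a href="/programmes/peep-show/4od"><span class="programme-info">Peep Show</span></a></li>
<li><a href="/programmes/time-team/4od"><span class='programme-info'>Time Team</span></a></li>
HTML;

/**
 * Hypothetical helper: returns a list of ['href' => ..., 'title' => ...],
 * one entry per anchor that directly wraps a programme-info span.
 */
function extractProgrammes(string $html): array
{
    // Group 2 = href value, group 4 = span contents.
    // Groups 1 and 3 remember the opening quote so the closing quote must match.
    $pattern = '#<a\b[^>]*href=(["\'])(.*?)\1[^>]*>\s*'
             . '<span\b[^>]*class=(["\'])programme-info\3[^>]*>(.*?)</span>#s';

    // PREG_SET_ORDER groups the results per match, so each $m is one programme.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $programmes = [];
    foreach ($matches as $m) {
        $programmes[] = [
            'href'  => $m[2],
            'title' => trim(strip_tags($m[4])),
        ];
    }
    return $programmes;
}

foreach (extractProgrammes($content) as $programme) {
    echo $programme['title'], ' => ', $programme['href'], PHP_EOL;
}
```

Without PREG_SET_ORDER, preg_match_all groups the results per capture group instead ($matches[2] holds every href, $matches[4] every title), which is why print_r($match[0]) above lists the full matches.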

https://forums.phpfreaks.com/topic/179960-screen-scraping/#findComment-949370

Cool, I forgot the _all :)

 

I got it parsing. All I wanted from that page was the links, so I can spider out and gather information like the title, description, etc. All is good apart from trying to extract the flashvars :(

 

This is an example of the markup I am trying to parse:

 

<div id="flashContainer" class="cb flashEnabled hidden">
<object id="catchUpPlayer" width="625" height="1" type="application/x-shockwave-flash" data="/static/programmes/asset/flash/swf/4odplayer-4.30.2.swf">
<param name="align" value="top"/>
<param name="scale" value="noscale"/>
<param name="salign" value="lt"/>
<param name="allowFullScreen" value="true"/>
<param name="bgcolor" value="#000000"/>
<param name="allowScriptAccess" value="always"/>
<param name="wmode" value="opaque"/>
<param name="flashvars" value="brandTitle=Peep%20Show&wsBrandTitle=peep-show&primaryColor=0xCC0000&secondaryColor=0xCC0000&invertSkin=false&preSelectAsset=3005546&preSelectAssetGuidance=Strong%20language%20and%20adult%20humour&preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg&pinRequestCallback=C4.PinController.doPinChecks"/>
</object>

 

 

and I am trying to get this part out of it:

"preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg"

I am trying preg_match again, but nothing seems to be working.

 

<?php
require('function.php');


$url = "http://www.channel4.com/programmes/peep-show/4od";

$page = curlEngine($url);
    
    // Find Image flashContainer
    
preg_match('#<object\b[^>]*id=([\'"])?catchUpPlayer(?(1)\1)[^>]*>.*?</object>#s', $page['content'], $bigImage);

?>

 

Any hints? :)

 

thanks

 

Paul
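
One possible approach to the flashvars question, offered as a sketch rather than a confirmed fix: match the value attribute of the flashvars param directly instead of the whole object tag, then let parse_str() split the query string. This assumes the attribute order (name before value) and the double quotes shown in the example markup above. $html is a trimmed stand-in for $page['content'], and extractFlashVars() is a hypothetical helper name.

```php
<?php
// Trimmed stand-in for $page['content'], based on the example markup above.
$html = <<<'HTML'
<object id="catchUpPlayer" width="625" height="1" type="application/x-shockwave-flash">
<param name="wmode" value="opaque"/>
<param name="flashvars" value="brandTitle=Peep%20Show&wsBrandTitle=peep-show&preSelectAsset=3005546&preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg&pinRequestCallback=C4.PinController.doPinChecks"/>
</object>
HTML;

/**
 * Hypothetical helper: returns the flashvars as an associative array,
 * or an empty array if no flashvars param is found.
 */
function extractFlashVars(string $html): array
{
    // Capture only the value attribute of the param named "flashvars".
    // [^>]* cannot cross a ">", so name and value must sit in the same tag.
    $pattern = '#<param\b[^>]*name="flashvars"[^>]*value="([^"]*)"#i';
    if (!preg_match($pattern, $html, $m)) {
        return [];
    }

    // Some pages encode "&" as "&amp;" inside attributes; normalise it first.
    $query = str_replace('&amp;', '&', $m[1]);

    // parse_str() splits on "&" and URL-decodes each value.
    parse_str($query, $vars);
    return $vars;
}

$vars = extractFlashVars($html);
echo $vars['preSelectAssetImageURL'] ?? 'not found', PHP_EOL;
```

The returned array also holds the other variables (brandTitle, preSelectAsset and so on), so the same call covers them if they are needed later.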

https://forums.phpfreaks.com/topic/179960-screen-scraping/#findComment-949445

Archived

This topic is now archived and is closed to further replies.
