ctcmedia Posted November 2, 2009

I don't know if this is the right section, but it does involve regex.

I am trying to scrape http://www.channel4.com/programmes/4od/all to get the titles and links so I can recurse through the website picking up information. I know how to get the title names and hrefs through regex, but there is one major problem: there are a lot of spans with the same class name. So my question is, how can I loop through the results to get all the title names, links, etc.?

Here is the code I am experimenting with, which has the problem:

    <?php
    require('function.php'); // Has some custom functions in this file
    $url = "http://www.channel4.com/programmes/4od/all"; // URL we are searching
    $getUrl = curlEngine($url);
    $content = $getUrl['content'];
    preg_match('#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>.*?</span>#s', $content, $match);
    foreach($match as $match){
        print $match;
    }
    ?>

Any help would be appreciated.

Thanks,
Paul
cags Posted November 2, 2009

As there are multiple items, you would want to use preg_match_all. I do wonder, though, whether regular expressions are the best method. You might be better off using something like DOMDocument.
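For anyone reading along, here is a minimal sketch of what the DOMDocument route could look like. The sample markup is hypothetical (the real page may nest its spans and links differently), and the assumption that each programme-info span sits inside an <a> element is mine, not something confirmed in this thread.

```php
<?php
// Hypothetical sample markup; the structure of the real page may differ.
$html = <<<HTML
<ul>
  <li><a href="/programmes/peep-show/4od"><span class="programme-info">Peep Show</span></a></li>
  <li><a href="/programmes/father-ted/4od"><span class="programme-info">Father Ted</span></a></li>
</ul>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$programmes = [];

// Select every span whose class attribute contains the token "programme-info".
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " programme-info ")]';
foreach ($xpath->query($query) as $span) {
    // Assumption: the link is the nearest enclosing <a>, if there is one.
    $link = $xpath->query('ancestor::a[1]', $span)->item(0);
    $programmes[] = [
        'title' => trim($span->textContent),
        'href'  => $link ? $link->getAttribute('href') : null,
    ];
}

print_r($programmes);
```

With a live page you would pass the fetched HTML to loadHTML() instead of the sample string.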
ctcmedia Posted November 2, 2009

I was thinking of going down the DOM route using XPath, but thought I would give preg_match a go first. I'll try preg_match_all; if that fails, it's time to read up on XPath, lol.
cags Posted November 2, 2009

Fair enough. In that case, the pattern you have will match 692 show names from that page, assuming you switch to preg_match_all.

    preg_match_all('#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>.*?</span>#s', $content, $match);
    echo '<pre>';
    print_r($match[0]);
    echo '</pre>';
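To loop over the individual results, one option is to add a capture group around the span body and ask preg_match_all for one array per match. The sketch below uses the same pattern from this thread with that one extra group; the sample markup and the $titles variable are hypothetical stand-ins for the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page content.
$content = '<span class="programme-info">Peep Show</span>'
         . "<span id='x' class='programme-info'>Father Ted</span>"
         . '<span class="other">ignored</span>';

// Same pattern as above, with a capture group added around the span body.
$pattern = '#<span\b[^>]*class=([\'"])?programme-info(?(1)\1)[^>]*>(.*?)</span>#s';

// PREG_SET_ORDER returns one array per match rather than one per capture group.
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$titles = [];
foreach ($matches as $m) {
    // $m[0] is the whole span, $m[1] the quote character, $m[2] the inner content.
    $titles[] = trim(strip_tags($m[2]));
}

print_r($titles);
```

Note that the loop variable has a different name from the array being iterated, unlike the foreach($match as $match) in the first post.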
ctcmedia Posted November 2, 2009

Cool, I forgot the "all". I got it parsing. All I wanted from that page was the links, so I can spider out and gather information like title, description, etc. All is good apart from trying to extract the flash vars. This is an example of the markup I am trying to read:

    <div id="flashContainer" class="cb flashEnabled hidden">
    <object id="catchUpPlayer" width="625" height="1" type="application/x-shockwave-flash" data="/static/programmes/asset/flash/swf/4odplayer-4.30.2.swf">
    <param name="align" value="top"/>
    <param name="scale" value="noscale"/>
    <param name="salign" value="lt"/>
    <param name="allowFullScreen" value="true"/>
    <param name="bgcolor" value="#000000"/>
    <param name="allowScriptAccess" value="always"/>
    <param name="wmode" value="opaque"/>
    <param name="flashvars" value="brandTitle=Peep%20Show&wsBrandTitle=peep-show&primaryColor=0xCC0000&secondaryColor=0xCC0000&invertSkin=false&preSelectAsset=3005546&preSelectAssetGuidance=Strong%20language%20and%20adult%20humour&preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg&pinRequestCallback=C4.PinController.doPinChecks"/>
    </object>

I am trying to get "preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg" out of it. I am trying preg_match again, but nothing seems to be working:

    <?php
    require('function.php');
    $url = "http://www.channel4.com/programmes/peep-show/4od";
    $page = curlEngine($url);
    // Find Image flashContainer
    preg_match('#<object\b[^>]*id=([\'"])?catchUpPlayer(?(1)\1)[^>]*>.*?</object>#s', $page['content'], $bigImage);
    ?>

Any hints?

Thanks,
Paul
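If the flashvars param is actually present in the HTML you have fetched, one possible approach is to capture its value attribute and let parse_str split the query string. This is only a sketch: the sample string is a shortened copy of the markup quoted above, the pattern assumes name comes before value inside the tag, and $flashVars is a name chosen for illustration.

```php
<?php
// Shortened copy of the markup quoted in the post above.
$html = '<object id="catchUpPlayer">'
      . '<param name="wmode" value="opaque"/>'
      . '<param name="flashvars" value="brandTitle=Peep%20Show&wsBrandTitle=peep-show'
      . '&preSelectAsset=3005546'
      . '&preSelectAssetImageURL=/assets/programmes/images/peep-show/series-6/episode-6/3b010dd1-69d3-490c-9985-19ad79f20557_625x352.jpg'
      . '&pinRequestCallback=C4.PinController.doPinChecks"/>'
      . '</object>';

// Assumption: the name attribute appears before the value attribute.
$pattern = '#<param\b[^>]*name=(["\'])flashvars\1[^>]*value=(["\'])(.*?)\2#s';

$flashVars = [];
if (preg_match($pattern, $html, $m)) {
    // Decode entities such as &amp; in case the page escapes its ampersands,
    // then split the query string into an associative array.
    parse_str(html_entity_decode($m[3]), $flashVars);
}

if (isset($flashVars['preSelectAssetImageURL'])) {
    echo $flashVars['preSelectAssetImageURL'], "\n";
}
```

parse_str also URL-decodes the values, so brandTitle comes back as "Peep Show" rather than "Peep%20Show".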
ctcmedia Posted November 2, 2009

Actually, I just found the problem: the flash player only loads if you have JavaScript and Flash enabled, and I am only curling the page. Are there any options to enable these, or am I just stuck with what I have?
cags Posted November 2, 2009

A quick Google search came up with this: http://curl.haxx.se/docs/faq.html#Does_curl_support_Javascript_or