Modernvox Posted January 6, 2010
Link to topic: https://forums.phpfreaks.com/topic/187372-webscraping-multiple-pages-in-sequence/

Just want to know if it's possible to scrape multiple pages in sequence with PHP, or do I need to use cURL as well? Example:

    if ($x == $z) $open = "mysite.com/zzy, mysite.com/ssy, mysite.com/wwy";

This would have to be an array, right? Or how could I approach this?

trq Posted January 6, 2010

You don't need curl unless you need to authenticate.

    // Include the scheme; without "http://" file_get_contents() treats the string as a local path
    $pages = array("http://mysite.com/zzy", "http://mysite.com/ssy", "http://mysite.com/wwy");
    foreach ($pages as $page) {
        $html = file_get_contents($page);
        // do whatever with $html
    }
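
A minimal sketch of the loop above with a failure check added, since `file_get_contents()` returns `false` when a page cannot be read. The helper name `fetch_pages` is hypothetical and not from the thread, and fetching URLs this way assumes `allow_url_fopen` is enabled on the server.

```php
<?php
// Hypothetical helper: read each source in order and skip the ones that fail.
// file_get_contents() returns false on failure, so compare with ===.
function fetch_pages(array $sources): array
{
    $pages = array();
    foreach ($sources as $source) {
        // @ suppresses the warning; the false check below handles the failure
        $html = @file_get_contents($source);
        if ($html === false) {
            continue;
        }
        $pages[$source] = $html;
    }
    return $pages;
}

// Usage with the placeholder URLs from the thread (needs allow_url_fopen):
// $pages = fetch_pages(array("http://mysite.com/zzy", "http://mysite.com/ssy"));
// foreach ($pages as $url => $html) { /* do whatever with $html */ }
```

Keying the result by source keeps track of which page each chunk of HTML came from, which is handy once more than one site is scraped in the same run.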

Modernvox Posted January 6, 2010

Thanks, Thorpe.

Modernvox Posted January 6, 2010

The word "Array" is being included as part of the URLs. How do I strip it out of there?

Error:

    Warning: file_get_contents(Array/muc/) [function.file-get-contents]: failed to open stream: No such file or directory in /home/a7250761/public_html/page2.php on line 131

rajivgonsalves Posted January 6, 2010

What does your code look like?

Modernvox Posted January 6, 2010

    <?php
    if(isset($_POST['submit']))
    $st = $_POST['state'];
    if ($st == "AL") {
        $url = array("http://auburn.craigslist.org", "http://bham.craigslist.org");
    } else if ($st == "AK") {
        $url= "http://anchorage.craigslist.org";
    } else if ($st == "AZ") {
        $url= "http://anchorage.craigslist.org";
    }
    $html = file_get_contents("$url/muc/");
    preg_match_all('/<a href="([^"]+)">([^<]+)<\/a><font size="-1">([^"]+)<\/font>/s', $html,$posts,PREG_SET_ORDER);
    //echo "<pre>";print_r($posts);
    foreach ($posts as $post) {
        //print $post[0]; //HTML
        $post[2] = str_ireplace($url,"",$post[2]); //remove domain
        echo "<a href=\"$url{$post[1]}\">{$post[2]}<font size=\"3\">{$post[3]}<br />";
        print "<BR />\n";
    }
    ?>

rajivgonsalves Posted January 6, 2010

There is a logical error in your code: $url is an array for AL but a string for the other states, so "$url/muc/" becomes "Array/muc/". Try this:

    <?php
    $st = isset($_POST['submit']) ? $_POST['state'] : '';
    $urls = array(); // always an array, even for a single site
    if ($st == "AL") {
        $urls = array("http://auburn.craigslist.org", "http://bham.craigslist.org");
    } else if ($st == "AK") {
        $urls = array("http://anchorage.craigslist.org");
    } else if ($st == "AZ") {
        $urls = array("http://anchorage.craigslist.org");
    }
    // Fetch and parse each site in turn
    foreach ($urls as $url) {
        $html = file_get_contents("$url/muc/");
        preg_match_all('/<a href="([^"]+)">([^<]+)<\/a><font size="-1">([^"]+)<\/font>/s', $html, $posts, PREG_SET_ORDER);
        //echo "<pre>";print_r($posts);
        foreach ($posts as $post) {
            //print $post[0]; //HTML
            $post[2] = str_ireplace($url, "", $post[2]); //remove domain
            // Close the <a> and <font> tags so each link stands on its own
            echo "<a href=\"{$url}{$post[1]}\">{$post[2]}</a><font size=\"3\">{$post[3]}</font><br />";
            print "<BR />\n";
        }
    }
    ?>
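
The "Array" in the failing URL comes from PHP converting an array to a string, which always yields the literal text "Array". One way to guard against that, sketched here under assumptions, is to replace the if/else chain with a lookup table and cast each entry to an array before looping. The names `$sitesByState` and `listing_urls` are hypothetical; the hosts are the ones that appear in the thread.

```php
<?php
// Hypothetical lookup table; hosts are taken from the thread.
$sitesByState = array(
    "AL" => array("http://auburn.craigslist.org", "http://bham.craigslist.org"),
    "AK" => "http://anchorage.craigslist.org", // a bare string is tolerated
);

// Hypothetical helper: accepts one base URL or an array of them and
// returns the full listing URLs. The (array) cast turns a string into a
// one-element array, so the loop never sees a mixed type.
function listing_urls($sites, string $path = "/muc/"): array
{
    $urls = array();
    foreach ((array) $sites as $base) {
        $urls[] = rtrim($base, "/") . $path;
    }
    return $urls;
}

// Usage sketch:
// $st = isset($_POST['state']) ? $_POST['state'] : '';
// $sites = isset($sitesByState[$st]) ? $sitesByState[$st] : array();
// foreach (listing_urls($sites) as $url) { $html = file_get_contents($url); }
```

An unknown state falls through to an empty array, so the loop simply does nothing instead of requesting a malformed URL.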

Modernvox Posted January 6, 2010

Great, rajiv, thanks. Now, if I wanted to display just 50 per page, how would I approach that?

rajivgonsalves Posted January 6, 2010

I would suggest parsing all the data out and putting it in a database. It will be faster and more controllable, and you can then use a simple pagination script to show 50 per page.

You can do it with the code you already have, but you would be parsing the pages every time. One alternative would be to parse the pages once, put the results in the session, and read them back from the session when paginating instead of parsing everything again. However, if there is a lot of data, that would be slow.
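
A minimal sketch of the 50-per-page slicing for the case where the parsed links are already held in memory (for example, in the session as described above). The helper name `paginate` and the `page` query parameter are assumptions for illustration, not part of the thread's code.

```php
<?php
// Hypothetical helper: return one page of an already-parsed result list.
function paginate(array $items, int $page, int $perPage = 50): array
{
    $pageCount = max(1, (int) ceil(count($items) / $perPage));
    $page = min(max(1, $page), $pageCount); // clamp out-of-range requests
    return array(
        'page'      => $page,
        'pageCount' => $pageCount,
        'items'     => array_slice($items, ($page - 1) * $perPage, $perPage),
    );
}

// Usage sketch: $posts holds the links parsed earlier.
// $result = paginate($posts, isset($_GET['page']) ? (int) $_GET['page'] : 1);
// foreach ($result['items'] as $post) { /* echo the link */ }
// if ($result['page'] < $result['pageCount']) { /* show a "Next" link */ }
```

Clamping the page number means a stale or hand-edited link lands on the last valid page rather than on an empty one.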

Modernvox Posted January 6, 2010

I'm only linking back to CL. In the biggest cities the result should be between 50 and 10,000 links. If you still suggest using a DB, I will. I was under the assumption that I could just display 50 scraped links, open a new page (if needed), display the next 50, and so on, maybe with a button labeled "Next".

Thanks again for your guidance.

rajivgonsalves Posted January 6, 2010

You should study the target site (browse through it) and see if it has pagination. If it does, you can modify your script to add parameters so that you fetch only a limited amount of data on each call. This keeps your script light on resources, because it only processes the data it displays.

If you can't do this, I would suggest putting the data in a database:

1) Write a script that processes the data and adds/updates the database every half hour or hour, or whatever interval you want (so that your data is up to date).
2) Write simple scripts to fetch the data from the database and list it.

This will keep the listing on your site fast. Hope it's helpful.
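
The two database steps above could look roughly like the sketch below. The thread does not name a database, so SQLite through PDO is an assumption here, and the table name, column names, and function names are all hypothetical. Scheduling the scraper (for example with cron) is left out.

```php
<?php
// Assumption: SQLite via PDO; table and column names are hypothetical.
function open_store(string $dsn = 'sqlite::memory:'): PDO
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS links (
        href TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        fetched_at INTEGER NOT NULL
    )');
    return $db;
}

// Step 1: called by the scheduled scraper. Re-running it updates rows
// that already exist instead of duplicating them.
function save_links(PDO $db, array $links): void
{
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO links (href, title, fetched_at) VALUES (?, ?, ?)'
    );
    foreach ($links as $link) {
        $stmt->execute(array($link['href'], $link['title'], time()));
    }
}

// Step 2: called by the listing page; reads only the rows it will display.
function fetch_page(PDO $db, int $page, int $perPage = 50): array
{
    $stmt = $db->prepare('SELECT href, title FROM links ORDER BY href LIMIT ? OFFSET ?');
    $stmt->bindValue(1, $perPage, PDO::PARAM_INT);
    $stmt->bindValue(2, (max(1, $page) - 1) * $perPage, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

With this split, the listing page never contacts the remote site, so its speed depends only on the local query.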