
WebScraping Multiple pages in sequence?


Modernvox


Just want to know if it's possible to scrape multiple pages in sequence with plain PHP, or do I need to use cURL as well?

Example:

if ($x == $z)
$open = "mysite.com/zzy, mysite.com/ssy, mysite.com/wwy";

This would have to be an array, right? Or how should I approach this?


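You don't need cURL for plain GET requests. file_get_contents() can fetch a URL as long as allow_url_fopen is enabled; cURL is mainly worth it when you need authentication, cookies, or custom headers.

// Each URL needs its scheme, otherwise PHP looks for a local file.
$pages = array("http://mysite.com/zzy", "http://mysite.com/ssy", "http://mysite.com/wwy");
foreach ($pages as $page) {
  $html = file_get_contents($page);
  if ($html === false) {
    continue; // fetch failed, skip this page
  }
  // do whatever with $html
}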
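As a rough sketch of the same loop wrapped in a reusable function: the name fetch_pages() and the timeout value are assumptions for illustration, not something from this thread. It returns the page bodies keyed by URL and silently skips any page that cannot be loaded.

```php
<?php
// Hypothetical helper (not from the thread): fetch every page in $pages and
// return the bodies keyed by URL. Pages that fail to load are skipped.
function fetch_pages(array $pages, $timeoutSeconds = 10)
{
    // The 'http' context options are also used for https:// URLs.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));

    $results = array();
    foreach ($pages as $page) {
        // @ hides the warning; failure is detected through the false return value.
        $html = @file_get_contents($page, false, $context);
        if ($html === false) {
            continue;
        }
        $results[$page] = $html;
    }
    return $results;
}
```

Calling fetch_pages(array("http://mysite.com/zzy", "http://mysite.com/ssy")) would then give an array you can loop over with foreach ($results as $url => $html).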


The word "Array" is being included as part of the URLs. How do I strip it out?

Error:

Warning: file_get_contents(Array/muc/) [function.file-get-contents]: failed to open stream: No such file or directory in /home/a7250761/public_html/page2.php on line 131


What does your code look like?

 <?php
if(isset($_POST['submit'])) 
$st = $_POST['state'];

if ($st == "AL")
{
$url = array("http://auburn.craigslist.org", "http://bham.craigslist.org");
}
else if ($st == "AK") 
{
$url= "http://anchorage.craigslist.org";
}
else if ($st == "AZ") 
{
$url= "http://anchorage.craigslist.org";
}
$html = file_get_contents("$url/muc/");

preg_match_all('/<a href="([^"]+)">([^<]+)<\/a><font size="-1">([^"]+)<\/font>/s', $html,$posts,PREG_SET_ORDER);
//echo "<pre>";print_r($posts);


foreach ($posts as $post) {

    //print $post[0]; //HTML
    $post[2] = str_ireplace($url,"",$post[2]); //remove domain
    echo "<a href=\"$url{$post[1]}\">{$post[2]}<font size=\"3\">{$post[3]}<br />";
    print "<BR />\n";

}
?>


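There is a logic error in your code: $url is an array in the AL branch but a string in the others, and an array placed inside a double-quoted string becomes the literal text "Array". Make it an array in every branch and loop over it. Try this: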

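<?php
// Read the selected state only when the form was actually submitted.
$st = isset($_POST['submit'], $_POST['state']) ? $_POST['state'] : '';

// Always keep the sites in an array, even when a state has only one.
$urls = array();

if ($st == "AL")
{
    $urls = array("http://auburn.craigslist.org", "http://bham.craigslist.org");
}
else if ($st == "AK")
{
    $urls = array("http://anchorage.craigslist.org");
}
else if ($st == "AZ")
{
    // Same URL as AK, carried over from the original post; replace it with the AZ sites.
    $urls = array("http://anchorage.craigslist.org");
}

foreach ($urls as $url) {
    $html = file_get_contents("$url/muc/");
    if ($html === false) {
        continue; // skip sites that could not be fetched
    }

    preg_match_all('/<a href="([^"]+)">([^<]+)<\/a><font size="-1">([^"]+)<\/font>/s', $html, $posts, PREG_SET_ORDER);
    //echo "<pre>";print_r($posts);

    foreach ($posts as $post) {
        //print $post[0]; //HTML
        $post[2] = str_ireplace($url, "", $post[2]); //remove domain
        // Close the <a> and <font> tags that the original echo left open.
        echo "<a href=\"$url{$post[1]}\">{$post[2]}</a><font size=\"3\">{$post[3]}</font><br />";
        print "<br />\n";
    }
}
?>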
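As more states are added, the if/else chain could be replaced by a lookup table. This is only a sketch: $sitesByState and urls_for_state() are hypothetical names, and the table lists just the URLs that appear in this thread.

```php
<?php
// Hypothetical lookup table; only the URLs mentioned in the thread are listed.
$sitesByState = array(
    'AL' => array('http://auburn.craigslist.org', 'http://bham.craigslist.org'),
    'AK' => array('http://anchorage.craigslist.org'),
);

// Return the list of sites for a state code, or an empty array if it is unknown.
// The result is always an array, so it is always safe to foreach over it.
function urls_for_state(array $sitesByState, $state)
{
    $state = strtoupper(trim((string) $state));
    return isset($sitesByState[$state]) ? $sitesByState[$state] : array();
}
```

The scraping loop would then start with foreach (urls_for_state($sitesByState, $st) as $url), and adding a state means adding one line to the table.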

 



Great rajiv, thanks :shy:

Now, if I wanted to display just 50 links per page, how would I approach that?


I would suggest parsing all the data out and putting it in a database. It will be faster and easier to control, and you can then use a simple pagination script to show 50 per page. You can do it with the code you already have, but you would be fetching and parsing the pages on every request.

An alternative would be to parse the pages once, store the results in the session, and read them back from the session when paginating instead of parsing everything again. However, if there is a lot of data it would be slow.
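A minimal sketch of the "slice an already-parsed list" idea, assuming the scraped links are held in a plain PHP array (for example one kept in the session). The function name paginate() and the shape of its return value are made up for illustration.

```php
<?php
// Hypothetical helper: return one page of $links plus the total page count.
function paginate(array $links, $page, $perPage = 50)
{
    // There is always at least one page, even when the list is empty.
    $totalPages = max(1, (int) ceil(count($links) / $perPage));
    // Clamp the requested page into the valid range 1..$totalPages.
    $page = min(max(1, (int) $page), $totalPages);
    $offset = ($page - 1) * $perPage;

    return array(
        'page'       => $page,
        'totalPages' => $totalPages,
        'items'      => array_slice($links, $offset, $perPage),
    );
}
```

With the links stored in, say, $_SESSION['links'] after session_start(), the listing script could call paginate($_SESSION['links'], isset($_GET['page']) ? $_GET['page'] : 1) and print only the returned 'items'.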



I'm only linking back to CL. In the biggest cities the result should be between 50 and 10,000 links.

If you suggest a DB again, I'll use one. I was under the assumption I could just display 50 scraped links, then open a new page (if needed) and display the next 50, and so on, maybe with a "Next" button.

Thanks again for your guidance.
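The "Next" button idea could look roughly like this. It is a sketch only: pager_links() is a hypothetical name, the ?page= query parameter is an assumption, and page2.php is simply the script name from the error message earlier in the thread.

```php
<?php
// Hypothetical helper: build the "Previous"/"Next" links for a listing page.
function pager_links($page, $totalPages, $baseUrl = 'page2.php')
{
    $links = array();
    if ($page > 1) {
        $links[] = '<a href="' . htmlspecialchars($baseUrl . '?page=' . ($page - 1)) . '">Previous</a>';
    }
    if ($page < $totalPages) {
        $links[] = '<a href="' . htmlspecialchars($baseUrl . '?page=' . ($page + 1)) . '">Next</a>';
    }
    // On a single-page listing this returns an empty string.
    return implode(' | ', $links);
}
```

The script would read the current page number from $_GET['page'], show that page's 50 links, and echo pager_links() underneath them.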


You should study the target site (browse through it) and see whether it has its own pagination. If it does, you can modify your script to add parameters so that you fetch only a limited amount of data on each call. That keeps your script light on resources, since it processes only as much data as it displays.

However, if you can't do this, I would suggest putting it in the database:

1) Write a script that processes the data and adds to or updates the database every half hour, every hour, or at whatever interval you like (so that your data stays up to date).

2) Write simple scripts that fetch the data from the database and list it.

This will keep the listing on your site fast.

Hope that helps.
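The two steps above can be sketched with PDO. Everything here is an assumption for illustration: the links table and its columns, the function names, and the use of SQLite (INSERT OR REPLACE is SQLite syntax; other databases spell the upsert differently).

```php
<?php
// Assumed schema (hypothetical): one row per scraped link.
function create_links_table(PDO $db)
{
    $db->exec('CREATE TABLE IF NOT EXISTS links (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE,
        title TEXT NOT NULL
    )');
}

// Step 1: called by the periodic scraper. A link that is already stored
// is replaced instead of being inserted a second time.
function save_link(PDO $db, $url, $title)
{
    $stmt = $db->prepare('INSERT OR REPLACE INTO links (url, title) VALUES (:url, :title)');
    $stmt->execute(array(':url' => $url, ':title' => $title));
}

// Step 2: called by the listing page. Only the rows for one page are fetched.
function fetch_page(PDO $db, $page, $perPage = 50)
{
    $offset = (max(1, (int) $page) - 1) * $perPage;
    $stmt = $db->prepare('SELECT url, title FROM links ORDER BY id LIMIT :limit OFFSET :offset');
    $stmt->bindValue(':limit', (int) $perPage, PDO::PARAM_INT);
    $stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The scraper would call save_link() for each match from a scheduled job, and the listing page would call fetch_page($db, $page) and print the 50 rows it gets back, so no scraping happens while a visitor waits.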
