Scraping with PHP and cURL

beepingKeyboard · February 12, 2013

Hello everyone,

Would like some direction, as I want to start a project and I'm not even sure if I'm headed the right way.

There's a local news site from which I would like to "scrape" various news items.

I already talked with their webmaster, and he said it's good to go.

Ok, so I believe (please correct me if I'm wrong) that a good tool for the job would be PHP and cURL.

What about using PHP Simple HTML DOM Parser?

I ask because I'm just not sure of where to head.

I'm a n00b at this, so diving into this project means many hours of work... before I even know whether what I'm doing will work or not.

So, that's the general direction.

Should I use PHP and cURL? (a reference doc I found here)

Also, I don't know how this works, but I would like to "scrape" the page 4-5 times per day (at pre-set times), and then save the info on my server.

So when a user visits my website, I serve the scraped information from my own site (as opposed to re-scraping it from the original site).

Any thoughts on this project?

Thank you very much everyone!

Jessica · February 12, 2013

I have recently used the Simple HTML DOM Parser and found it to be VERY useful.

beepingKeyboard · February 12, 2013

Jessica said:
> I have recently used the Simple HTML DOM Parser and found it to be VERY useful.

Thanks for the reply!

I'm reading the documentation right now (and trying to make sense of it).

Question: With the Simple HTML DOM Parser... will the data be scraped, and saved to my website's webserver?

Or does the programming do the "scraping" every time a visitor requests a page on my site?

Thank you

Jessica · February 12, 2013

The parser just helps you *parse* the HTML. What you do with it after that is up to you.
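To make "parse" concrete, here is a minimal sketch of turning an HTML string into a PHP array. It uses PHP's built-in DOMDocument and DOMXPath classes rather than Simple HTML DOM Parser, so that it runs without extra downloads. The sample markup, the `news-item` class name, and the `extractHeadlines()` helper are all assumptions for illustration; the real news site will have its own structure.

```php
<?php
// Hypothetical sample markup; a real news page will be structured differently.
$html = <<<HTML
<html><body>
  <div class="news-item"><h2><a href="/story/1">First headline</a></h2></div>
  <div class="news-item"><h2><a href="/story/2">Second headline</a></h2></div>
</body></html>
HTML;

/**
 * Pull the link text and href out of every assumed "news-item" block.
 */
function extractHeadlines(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $items = [];
    foreach ($xpath->query('//div[@class="news-item"]//a') as $link) {
        $items[] = [
            'title' => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
        ];
    }
    return $items;
}

$headlines = extractHeadlines($html);
print_r($headlines);
```

At this point `$headlines` is just an array in memory. Nothing has been saved anywhere yet.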

tibberous · February 12, 2013

You actually need some pretty advanced knowledge to do this. First, getting a PHP script to run at set intervals takes a cron daemon (typically; I know there are other ways).

You probably don't need cURL. Unless you're doing stuff like spoofing cookies or sending POST variables, you can just do: $file = file_get_contents("page.php?p=1&section=whatever");

Biggest thing is probably going to be regex though.

Btw, you might want to see if the news site has an RSS feed; that would be easier to parse.
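A slightly fuller sketch of the file_get_contents() approach, assuming nothing beyond plain PHP. The `fetchPage()` helper, the timeout, and the User-Agent string are made up for illustration. Note that for a remote page the argument has to be a full URL including `http://`; a bare `page.php?...` would be looked up as a local file.

```php
<?php
/**
 * Fetch a page body as a string, or return null on failure.
 * The timeout and User-Agent values are illustrative assumptions.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'MyNewsScraper/0.1', // hypothetical name
        ],
    ]);

    // For http:// URLs, file_get_contents() needs allow_url_fopen enabled.
    // The @ hides the warning so the caller can just check for null.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Hypothetical URL; replace it with the page the webmaster approved.
// $html = fetchPage('http://news.example.com/page.php?p=1&section=whatever');
```

If the host has allow_url_fopen turned off, cURL would be the usual fallback.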

beepingKeyboard · February 12, 2013

Jessica said:
> The parser just helps you *parse* the HTML. What you do with it after that is up to you.

Thanks... I guess my next question is "what" happens to that data.

Does it go into a file (.txt?)... can it go into a .txt if I want it to?

Or does it stay "in memory"?

tibberous said:
> You actually need some pretty advanced knowledge to do this. First, getting a PHP script to run at set intervals takes a cron daemon (typically; I know there are other ways).
>
> You probably don't need cURL. Unless you're doing stuff like spoofing cookies or sending POST variables, you can just do: $file = file_get_contents("page.php?p=1&section=whatever");
>
> Biggest thing is probably going to be regex though.
>
> Btw, you might want to see if the news site has an RSS feed; that would be easier to parse.

Thanks for the reply!

The cronjob I can do fine (thankfully!).

Both scheduling them and pointing them at a script to run at a given interval (or hour).

Question, the

$file = file_get_contents("page.php?p=1&section=whatever");

example... is that with PHP?

Or with the HTML DOM Parser?

I already checked for an RSS feed on the website, but they don't have one running.

Thanks everyone!

I will work on this today, and keep you posted.

Jessica · February 12, 2013

beepingKeyboard said:
> Thanks... I guess my next question is "what" happens to that data.

Whatever you do with it.

$a = 'Bob';

What happens to $a? Nothing if you don't store it somewhere.
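As a rough sketch of "storing it somewhere", the value can be written to a text file by one script and read back by another. The cache path below is an assumption; any location the script can write to would do.

```php
<?php
// Hypothetical cache location; any path writable by the script works.
$cacheFile = sys_get_temp_dir() . '/news_cache.txt';

$a = 'Bob';

// Done by the scheduled script: persist the value to disk.
// LOCK_EX takes an exclusive lock on the file while writing.
file_put_contents($cacheFile, $a, LOCK_EX);

// Done by the visitor-facing page: read the saved copy.
// No scraping happens at this point.
$stored = is_readable($cacheFile) ? file_get_contents($cacheFile) : '';
echo $stored; // prints "Bob"
```

Until file_put_contents() runs, `$a` only exists in memory for the lifetime of that one script run.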

beepingKeyboard · February 12, 2013

Jessica,

Thanks for the help.

Yet I'm still... at a loss.

Maybe I haven't expressed how much of a n00b I am at this.

Any links you might suggest I go read?

In your example, how could I store $a in a text file?

Or... do I want to save it to a text file?

I'm still checking out the HTML DOM Parser documentation.

Tonight when I get home I will try and do some examples, to see if I am successful.

I will update on how it goes.

Thanks!

Jessica · February 12, 2013

You probably want to save the data you get from the other site into a database. There are lots of basic MySQL tutorials you can use. Look into using mysqli or PDO in PHP.
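A minimal PDO sketch of that idea follows. To stay self-contained it uses an in-memory SQLite database instead of MySQL, which is an assumption made only for the example; with MySQL the DSN would look something like 'mysql:host=localhost;dbname=news' (hypothetical names) plus a username and password. The table and column names are also made up.

```php
<?php
// In-memory SQLite keeps the sketch runnable on its own (assumption);
// swap the DSN and add credentials to use MySQL instead.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table layout for scraped items.
$pdo->exec('CREATE TABLE news_items (
    url TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    scraped_at TEXT NOT NULL
)');

// Scheduled script: insert a scraped row using a prepared statement,
// so scraped text is never pasted directly into the SQL string.
$insert = $pdo->prepare(
    'INSERT INTO news_items (url, title, scraped_at)
     VALUES (:url, :title, :scraped_at)'
);
$insert->execute([
    ':url'        => '/story/1',
    ':title'      => 'First headline',
    ':scraped_at' => date('Y-m-d H:i:s'),
]);

// Visitor-facing page: read the stored rows.
// The original site is not contacted here.
$rows = $pdo->query('SELECT title, url FROM news_items ORDER BY scraped_at DESC')
            ->fetchAll(PDO::FETCH_ASSOC);
```

The cron job would run the insert part 4-5 times a day, and the public page would run only the select part.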
