Hello everyone,

 

I'd like some direction: I want to start a project, and I'm not even sure I'm headed the right way.

 

There is a local news site from which I would like to "scrape" various news items.

I already talked with their webmaster, and he said it's good to go.

 

OK, so I believe (please correct me if I'm wrong) that a good tool for the job would be PHP and cURL.

What about using PHP Simple HTML DOM Parser?

 

I ask because I'm just not sure of where to head.

I'm a n00b at this, so diving into this project could cost me many hours before I even know whether what I'm doing will work.

 

So, that's the general direction.

Should I use PHP and cURL? (a reference doc I found here)

 

Also, I don't know how this works yet, but I would like to "scrape" the page 4-5 times per day (at pre-set times) and then save the info on my server.

 

So when a user visits my website, would I serve the scraped information from my own site (as opposed to re-scraping the original site)?

 

Any thoughts on this project?

 

Thank you very much everyone!


I have recently used the Simple HTML DOM Parser and found it to be VERY useful.

Thanks for the reply!

 

I'm reading the documentation right now (and trying to make sense of it).

Question: with the Simple HTML DOM Parser, will the data be scraped and saved to my website's web server?

Or does the code do the "scraping" every time a visitor requests a page on my site?

 

Thank you

You actually need some pretty advanced knowledge to do this. First, getting a PHP script to run at set intervals takes a cron daemon (typically; I know there are other ways).

 

You probably don't need cURL. Unless you're doing stuff like spoofing cookies or sending POST variables, you can just do: $file = file_get_contents("page.php?p=1&section=whatever");
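To make the two options concrete, here is a minimal sketch of both fetch styles. The function names, the user-agent string, and the timeout are assumptions for illustration, not anything from the thread. Note that file_get_contents() only fetches over HTTP when it is given a full URL (and allow_url_fopen is enabled); a bare "page.php?..." string would be treated as a local file name.

```php
<?php
// Fetch a page with plain PHP. For http(s) URLs this needs allow_url_fopen=On.
// Returns the body as a string, or null on failure.
function fetchWithFileGetContents(string $url): ?string
{
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Fetch a page with the cURL extension, which gives control over timeouts,
// redirects, cookies, and headers. Returns null on failure.
function fetchWithCurl(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,               // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,               // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,    // give up after this many seconds
        CURLOPT_USERAGENT      => 'MyNewsScraper/0.1', // hypothetical identifier
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical usage (the URL is a placeholder):
// $html = fetchWithFileGetContents('https://news.example.com/page.php?p=1&section=whatever');
```

For a simple GET of a public page either function should do; cURL mostly earns its keep once timeouts, cookies, or POST requests matter.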

 

 

Biggest thing is probably going to be regex though.
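As a rough illustration of the regex route, the sketch below pulls links out of a made-up markup pattern. The `<h2 class="headline">` structure and the function name are assumptions; the real site's HTML will differ, and a regex like this breaks as soon as attribute order or quoting changes.

```php
<?php
// Extract headline links with a regular expression.
// Assumes (hypothetically) markup like:
//   <h2 class="headline"><a href="/news/1">Title</a></h2>
function extractHeadlinesWithRegex(string $html): array
{
    // Group 1 captures the href, group 2 the link text; the "s" flag lets "." span newlines.
    $pattern = '~<h2 class="headline">\s*<a href="([^"]*)">(.*?)</a>~s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $items = [];
    foreach ($matches as $m) {
        $items[] = [
            'title' => trim(strip_tags($m[2])),
            'url'   => $m[1],
        ];
    }
    return $items;
}
```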

 

Btw, you might want to see if the news site has an RSS feed; that would be easier to parse.
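If a feed does exist, parsing it can be as short as the sketch below, which uses PHP's SimpleXML extension. It assumes a standard RSS 2.0 layout (channel, item, title, link); the function name is made up for the example.

```php
<?php
// Parse an RSS 2.0 document into a list of ['title' => ..., 'url' => ...] entries.
// Returns an empty array when the input is not usable XML.
function parseRssItems(string $xml): array
{
    // Collect XML errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($feed === false || !isset($feed->channel->item)) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => trim((string) $item->title),
            'url'   => trim((string) $item->link),
        ];
    }
    return $items;
}
```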

The parser just helps you *parse* the HTML. What you do with it after that is up to you.
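To show what "parse" means in practice, here is a small sketch. It deliberately uses PHP's built-in DOMDocument and DOMXPath rather than the Simple HTML DOM Parser, so it runs without extra files. The `<h2 class="headline">` markup and the function name are assumptions for illustration only.

```php
<?php
// Turn an HTML string into a plain PHP array of headlines.
// Assumes (hypothetically) markup like:
//   <h2 class="headline"><a href="/news/1">Title</a></h2>
function extractHeadlines(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $items = [];
    foreach ($xpath->query('//h2[@class="headline"]/a') as $link) {
        $items[] = [
            'title' => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
        ];
    }
    return $items;
}
```

The result is just an array in memory; storing it, printing it, or discarding it is a separate step.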

 

Thanks... I guess my next question is "what" happens to that data.

Does it go into a file (.txt?)... can it go into a .txt if I want it to?

Or does it stay "in memory"?
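On the file-versus-memory question: PHP variables only live for the duration of one script run, so anything that should survive has to be written somewhere. One possible sketch, assuming the scraped items are an array and a JSON file is acceptable storage (the function names and the file format are choices made for this example; a .txt file or a database would work too):

```php
<?php
// Write scraped items to disk as JSON. The data is written to a temporary
// file first and then renamed, so readers never see a half-written file.
function saveItems(array $items, string $path): bool
{
    $json = json_encode($items, JSON_PRETTY_PRINT);
    if ($json === false) {
        return false;
    }
    $tmp = $path . '.tmp';
    if (file_put_contents($tmp, $json, LOCK_EX) === false) {
        return false;
    }
    return rename($tmp, $path);
}

// Read the items back; returns an empty array if the file is missing or invalid.
function loadItems(string $path): array
{
    if (!is_file($path)) {
        return [];
    }
    $data = json_decode((string) file_get_contents($path), true);
    return is_array($data) ? $data : [];
}
```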

 


Thanks for the reply!

The cronjob I can do fine (thankfully!).

I can both schedule them and point them at a script to run at a given interval (or hour).
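With cron handling the schedule, the visitor-facing page only has to read whatever the last run saved. A sketch under stated assumptions: the crontab line, the paths, the JSON cache format, and the function name are all hypothetical.

```php
<?php
// Hypothetical crontab entry (five runs a day; the paths are placeholders):
//   0 6,10,14,18,22 * * * /usr/bin/php /path/to/scrape.php

// Build an HTML list from a cached JSON file of ['title' => ..., 'url' => ...]
// entries. No request is made to the original news site here.
function renderCachedItems(string $cachePath): string
{
    $items = is_file($cachePath)
        ? json_decode((string) file_get_contents($cachePath), true)
        : null;

    if (!is_array($items) || $items === []) {
        return '<p>No news items cached yet.</p>';
    }

    $html = "<ul>\n";
    foreach ($items as $item) {
        // Escape scraped text before echoing it into your own page.
        $title = htmlspecialchars($item['title'] ?? '', ENT_QUOTES, 'UTF-8');
        $url   = htmlspecialchars($item['url'] ?? '', ENT_QUOTES, 'UTF-8');
        $html .= "  <li><a href=\"{$url}\">{$title}</a></li>\n";
    }
    return $html . "</ul>\n";
}
```

This way the original site is hit only as often as cron runs, no matter how many visitors load the page.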

 

Question, the

$file = file_get_contents("page.php?p=1&section=whatever");

example... is that with PHP?

Or with the HTML DOM Parser?

 

I already checked the website for an RSS feed, but they don't have one.

 

Thanks everyone!

I will work on this today, and keep you posted.

Jessica,

Thanks for the help.

 

Yet I'm still... at a loss.

 

Maybe I haven't expressed how much of a n00b I am at this.

 

Any links you might suggest I go read?

In your example, how could I store $a in a text file?

Or... do I want to save it to a text file?

 

I'm still checking out the HTML DOM Parser documentation.

Tonight when I get home I will try and do some examples, to see if I am successful.

 

I will update on how it goes.

Thanks!
