How resource intensive is PHP and PERL


anon

Download XHTML pages for archival? You mean grab all of the source content. For that you are going to need a crawler and an XHTML parser. PHP and Perl would both be pretty good at it. I prefer PHP for anything I can use it for, but there won't be much of a difference between the two here.

"No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site. "

 

That is a contradictory statement. It cannot download only the files listed in the database and also all of the files on the site, unless the list names every file on the site.

It sounds to me like you want to visit a site, make a copy of its pages, get the information out of them, and then store each page in a database.

It's going to be easier to just take a screenshot of the page. In that case you're going to want to look at Perl; it has a really good module for capturing images of specific web pages.

Hang on. Let me think this through :-\

 

A database gives the crawler a list of sites to crawl. By crawling, I mean visiting a site, storing the site's HTML files, then repeating the process on the next site in the database's list.

 

Does this make sense?
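The loop described above (read a list from the database, visit each entry, store its HTML, move on to the next) could be sketched roughly like this. It is written in Python only to keep it short; in PHP the download step would be file_get_contents(). The `sites` table, its `url` column, and the file naming are assumptions made up for the example, and it saves only the one page at each listed URL rather than following links through the whole site.

```python
import sqlite3
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def crawl(conn, out_dir, fetch_page=fetch):
    """Visit every URL in the (hypothetical) `sites` table and save its HTML."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    # Read the whole list first so the loop does not hold a cursor open.
    urls = [row[0] for row in conn.execute("SELECT url FROM sites ORDER BY id")]
    for url in urls:
        try:
            html = fetch_page(url)
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print(f"skipping {url}: {err}")
            continue
        # Assumed naming scheme: one file per host, e.g. example.com.html.
        # Two URLs on the same host would overwrite each other.
        target = out / (urlparse(url).netloc + ".html")
        target.write_bytes(html)
        saved.append(target)
    return saved
```

Passing the download function in as `fetch_page` is just a convenience so the loop can be tried against canned pages before pointing it at real sites.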

 

@businessman.........1

 

I only want the text located in head and body.

Then you just create the basic crawler using file_get_contents() on the URLs. Once you have the data from the XHTML files, you're going to write a parser. Google "regex" and go through some tutorials on it. You'll want to use regex to pull everything out of the head and body sections and discard the rest. From there you can customize it to be more specific, based on exactly what you need.
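As a rough illustration of the regex step, here is a minimal sketch that pulls out the head and body sections and then strips the tags to leave only text. It is in Python for brevity; PHP's preg_match() and preg_replace() accept very similar patterns. The function names are made up for the example, and the patterns assume reasonably well-formed markup with a single head and a single body, so treat it as a starting point rather than a robust parser.

```python
import re

# Capture whatever sits between the opening and closing tag, across newlines.
HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head>", re.IGNORECASE | re.DOTALL)
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
# Script and style blocks hold code, not readable text, so drop them whole.
CODE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def section(pattern, html):
    """Return the inner markup matched by pattern, or '' if it is absent."""
    match = pattern.search(html)
    return match.group(1) if match else ""


def text_only(fragment):
    """Remove script/style blocks and all tags, then collapse whitespace."""
    fragment = CODE_RE.sub(" ", fragment)
    fragment = TAG_RE.sub(" ", fragment)
    return " ".join(fragment.split())


page = (
    "<html><head><title>Hi</title></head>"
    "<body class='x'><p>Hello <b>world</b></p>"
    "<script>var a = 1;</script></body></html>"
)
head_text = text_only(section(HEAD_RE, page))
body_text = text_only(section(BODY_RE, page))
print(head_text)
print(body_text)
```

Regex is fine for a quick first pass like this, but it tends to break on messy real-world pages, so a proper HTML parser may be worth switching to later.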

 

Note that copying pages like this may infringe the site's copyright, and it is against the terms of use of many sites. Be ready to have your server's IP banned from ever visiting the site again if you misuse what you are trying to create.

