How resource intensive is PHP and PERL


anon

"No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site. "

 

That statement contradicts itself. It cannot download only the files listed in the database and also every file on the site, unless the list includes every file on the site.


It sounds to me like you want to visit a site, copy its pages, extract the information from them, and then store each page in a database.

It's going to be easier to just take a screenshot of the page. In that case you're going to want to look at Perl; it has a really good class for capturing images of specific web pages.


Hang on. Let me think this through :-\

 

A database gives the crawler a list of sites to crawl. By crawling, I mean visiting a site, storing the site's HTML files, then repeating the process on the next site in the database's list.
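
As a rough sketch of that loop (written in Python here rather than PHP or Perl): the `sites` and `pages` table names, the SQLite storage, and the `crawl` and `fetch_html` helpers are all made up for illustration, and it only fetches the one URL listed per site, not every page linked from it.

```python
import sqlite3
from urllib.request import urlopen


def fetch_html(url):
    """Download one page as text (roughly what PHP's file_get_contents does for a URL)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(conn, fetch=fetch_html):
    """Visit each URL in the hypothetical `sites` table and store its HTML in `pages`.

    Returns the number of pages that were stored successfully.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    # Read the whole list first so we are not iterating a cursor while inserting.
    urls = [row[0] for row in conn.execute("SELECT url FROM sites")]
    stored = 0
    for url in urls:
        try:
            html = fetch(url)
        except OSError as exc:  # network errors from urlopen are OSError subclasses
            print(f"skipping {url}: {exc}")
            continue
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )
        stored += 1
    conn.commit()
    return stored
```

To try it, open a database with `sqlite3.connect("crawler.db")`, create a `sites` table with a `url` column, add a few rows, and call `crawl(conn)`. The `fetch` parameter is only there so the download step can be swapped out, for example when testing without a network.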

 

Does this make sense?

 

@businessman.........1

 

I only want the text located in head and body.


Then you just create the basic crawler using file_get_contents on the URLs. When you get the data from the XHTML files, you're going to write a parser. Google "regex" and go through some tutorials on it. You'll want to use regex to get everything out of the head and body sections, leaving the rest. Then from there you can customize it to be more specific based on exactly what you need.
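
A minimal sketch of that regex step, in Python rather than PHP: the function names `section_text` and `head_and_body_text` are invented for this example, and it assumes reasonably well-formed pages with a single head and body. Regex is fragile on messy markup, so treat this as a starting point and not a full parser.

```python
import re
from html import unescape

# Capture whatever sits between the opening and closing tag of each section.
HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.IGNORECASE | re.DOTALL)
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)
# Script and style blocks hold code, not readable text, so drop them whole.
NOISE_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def section_text(page, section_re):
    """Return the plain text inside one section, or "" if the section is missing."""
    match = section_re.search(page)
    if match is None:
        return ""
    inner = NOISE_RE.sub(" ", match.group(1))
    text = TAG_RE.sub(" ", inner)   # strip the remaining tags
    text = unescape(text)           # turn entities such as &amp; back into characters
    return " ".join(text.split())   # collapse runs of whitespace


def head_and_body_text(page):
    """Return a (head_text, body_text) pair for one HTML page."""
    return section_text(page, HEAD_RE), section_text(page, BODY_RE)
```

Calling `head_and_body_text(html)` on a stored page gives the two strings to put in the database. From there the patterns can be tightened, for example to pull only the title or specific meta tags out of the head.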

 

Note that copying pages this way can infringe copyright and is against many sites' own policies. Be ready to get the IP of your server banned from ever visiting the site again if you misuse what you are trying to create.

