It sounds like you're trying to decide between the two. I've seen a similar post before. What do you want a web crawler for? Give us the reason and we can give you better feedback on which would be the best way to proceed.

Downloading XHTML pages for archival: you mean grabbing all of the source content. For that you are going to need a crawler and an XHTML parser. PHP and Perl would both be pretty good at it. I prefer PHP for anything I can use it for, but there won't be much of a difference.

"No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site. "

 

That is a contradictory statement. It cannot download only the files listed in the database and also all of the files on the site, unless you list every file on the site in the database.

It sounds to me like you want to visit a site, make a copy of its pages, get the information out of them, then store each page in a database.

It's going to be easier to just take a screenshot of the page. In that case you're going to want to look at Perl; it has a really good class for making image copies of specific web pages.

Hang on. Let me think this through :-\

 

A database gives the crawler a list of sites to crawl. By crawling, I mean visiting a site, storing the site's HTML files, then repeating the process on the next site in the database's list.
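
To make that loop concrete, here is a minimal sketch in Python (not PHP, purely for illustration). The table names `sites` and `pages`, the use of SQLite, and the helper names `fetch_html` and `crawl` are all assumptions made for the example; nothing in this thread specifies them. It only fetches the exact URLs listed in the database and does not follow links.

```python
import sqlite3
import urllib.request


def fetch_html(url, timeout=10):
    """Download one page and return its source as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(conn, fetch=fetch_html):
    """Visit every URL in the (hypothetical) `sites` table, store its HTML
    in a `pages` table, and return how many pages were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
    )
    # Read the whole list first, then work through it one site at a time.
    urls = [row[0] for row in conn.execute("SELECT url FROM sites ORDER BY rowid")]
    stored = 0
    for url in urls:
        try:
            html = fetch(url)
        except OSError as exc:  # urllib's URLError/HTTPError are OSError subclasses
            print(f"skipping {url}: {exc}")
            continue
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )
        stored += 1
    conn.commit()
    return stored
```

You would call it with something like `crawl(sqlite3.connect("crawler.db"))`, where `crawler.db` is a placeholder file that already contains a `sites` table. Passing the fetch function in as a parameter is just a convenience so the loop can be tried without touching the network.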

 

Does this make sense?

 

@businessman.........1

 

I only want the text located in head and body.

Then you just create the basic crawler using file_get_contents on the URLs. Once you have the data from the XHTML files, you're going to write a parser. Google "regex" and go through some tutorials on it. You'll want to use regex to get everything out of the head and body sections, leaving the rest. From there you can customize it to be more specific based on exactly what you need.
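
As a rough illustration of the regex step, here is a small Python sketch (Python rather than PHP, only for the example). The function name `extract_sections` and both patterns are assumptions of mine, not something from this thread, and they expect reasonably well-formed markup with a single head and a single body.

```python
import re

# Non-greedy match of everything between the opening and closing tag.
# DOTALL lets "." span newlines; IGNORECASE accepts <HEAD>, <Body>, etc.
HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head\s*>", re.IGNORECASE | re.DOTALL)
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_sections(html):
    """Return the inner content of <head> and <body>, or "" if a section is missing."""
    head = HEAD_RE.search(html)
    body = BODY_RE.search(html)
    return {
        "head": head.group(1).strip() if head else "",
        "body": body.group(1).strip() if body else "",
    }


if __name__ == "__main__":
    page = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
    print(extract_sections(page))
```

Regex is fine for a first pass like this, but it gets fragile on messy pages, so a real HTML parser may be the safer choice once you start customizing.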

 

Note that this can violate copyright and goes against the terms of use of many sites. Be ready to get the IP of your server banned from ever visiting the site again if you misuse what you are trying to create.
