anon Posted December 20, 2007

Hi, I want to build a crawler out of PHP or Perl. I want to know which one, on average, is the least resource intensive?
trq Posted December 20, 2007

There are just way too many variables in play; it really depends on what you're doing and how you're doing it.
Ninjakreborn Posted December 20, 2007

It sounds like you're trying to decide between the two. I've seen a similar post before. What do you want a web crawler for? Give the reason and we can give you better feedback on which way to proceed.
anon Posted December 20, 2007

I want the crawler to download HTML files from sites specified in an already existing database.
Ninjakreborn Posted December 20, 2007

Download XHTML pages for archival; you mean grab all of the source content? For that you are going to need a crawler and an XHTML parser. PHP and Perl would both be pretty good at it. I prefer PHP for anything I can, but there won't be much of a difference.
trq Posted December 20, 2007

"I want the crawler to download HTML files from sites specified in an already existing database"

Does it actually need to crawl, or just go to specific addresses?
anon Posted December 20, 2007

The database tells the crawler what sites to index. So it downloads the HTML page and stores it in another database. Kinda like Google.
trq Posted December 20, 2007

I'll ask that again. Does it need to crawl links within each site, or just save a specific file?
anon Posted December 20, 2007

No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site.

Sorry, wasn't sure what you meant.
Jessica Posted December 20, 2007

"No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site."

That is a contradictory statement. It cannot download only the files listed in the database and also all of the files on the site, unless you list every file on the site in the database.
Ninjakreborn Posted December 20, 2007

It sounds to me like you want to visit a site, make a copy of the pages on the site, get the information out of them, and then store the pages in a database. It might be easier to just take a screenshot of each page. In that case you're going to want to look at Perl; it has a really good module for capturing images of specific web pages.
anon Posted December 20, 2007

Hang on. Let me think this through :-\

A database gives the crawler a list of sites to crawl. By crawling, I mean visiting the site, storing its HTML files, then repeating the process on the next site in the database's list.

Does this make sense?

@businessman.........1 I only want the text located in the head and body.
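A rough sketch of that loop in PHP might look like the following. This is only an illustration: the sites and pages table names, columns, and connection details are made up for the example, not taken from this thread.

```php
<?php
// Sketch only: 'sites' and 'pages' are placeholder table names.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');

$store = $db->prepare('INSERT INTO pages (url, html) VALUES (?, ?)');

// Walk the list of sites the database provides.
foreach ($db->query('SELECT url FROM sites') as $row) {
    $html = @file_get_contents($row['url']); // fetch the page source
    if ($html === false) {
        continue;                            // skip sites that fail to load
    }
    $store->execute(array($row['url'], $html));
}
```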
Jessica Posted December 20, 2007

(I responded to the wrong post, sorry)
Ninjakreborn Posted December 20, 2007

Then you just create the basic crawler using file_get_contents() on the URLs.

Once you have the data from the XHTML files, you're going to write a parser. Google "regex" and go through some tutorials on it. You'll want to use regex to pull everything out of the head and body sections, leaving the rest.

From there you can customize it to be more specific based on exactly what you need. See the sketch below this post.

Note that this is against the copyright terms of most sites, as well as against individual site policies. Be ready to get your server's IP banned from ever visiting the site again if you misuse what you are trying to create.
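A minimal sketch of that parsing step, assuming the page source is already in $html (e.g. from file_get_contents()); the patterns are illustrative, not a complete HTML parser:

```php
<?php
// $html is assumed to hold the fetched page source.
$head = '';
$body = '';

// The /i flag ignores case, /s lets "." match across newlines.
if (preg_match('#<head[^>]*>(.*?)</head>#is', $html, $m)) {
    $head = $m[1];
}
if (preg_match('#<body[^>]*>(.*?)</body>#is', $html, $m)) {
    $body = $m[1];
}

// strip_tags() leaves only the text, per "I only want the text located in head and body".
$headText = trim(strip_tags($head));
$bodyText = trim(strip_tags($body));
```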
anon Posted December 20, 2007

Why is this illegal?
phpSensei Posted December 20, 2007

"Why is this illegal?"

Some sites just don't allow it, as it is against their policies.
anon Posted December 20, 2007

But Google does it.
trq Posted December 21, 2007

"But Google does it."

What?