anon Posted December 20, 2007

Hi, I want to build a crawler out of PHP or Perl. I want to know which one, on average, is the least resource intensive?
trq Posted December 20, 2007

There are just way too many variables in play; it really depends on what you're doing and how you're doing it.
Ninjakreborn Posted December 20, 2007

It sounds like you're trying to decide between the two. I've seen a similar post before. What do you want a web crawler for? Give the reason and we can give you better feedback on which way to proceed.
anon Posted December 20, 2007

I want the crawler to download HTML files from sites specified in an already existing database.
Ninjakreborn Posted December 20, 2007

Download XHTML pages for archival; you mean grab all of the source content? For that you are going to need a crawler and an XHTML parser. PHP and Perl would both be pretty good at it. I prefer PHP for anything I can, but there won't be much of a difference.
trq Posted December 20, 2007

"I want the crawler to download HTML files from sites specified in an already existing database"

Does it actually need to crawl, or just go to specific addresses?
anon Posted December 20, 2007

The database tells the crawler what sites to index. So it downloads the HTML page and stores it in another database. Kinda like Google.
trq Posted December 20, 2007

I'll ask that again. Does it need to crawl links within each site, or just save a specific file?
anon Posted December 20, 2007

No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site.

Sorry, wasn't sure what you meant.
Jessica Posted December 20, 2007

"No. I want it to only download HTML files specified by the database. So it downloads all HTML files within that site."

That is a contradictory statement. It cannot download only the files listed in the database and also all of the files on the site, unless you list every file on the site in the database.
Ninjakreborn Posted December 20, 2007

It sounds to me like you want to visit a site, make a copy of the pages on the site, get the information out of them, and then store the pages in a database. It might be easier to just take a screenshot of each page. In that case you're going to want to look at Perl; it has a really good module for capturing images of specific web pages.
anon Posted December 20, 2007

Hang on. Let me think this through :-\

A database gives the crawler a list of sites to crawl. By crawling, I mean visiting the site, storing its HTML files, then repeating the process on the next site in the database's list.

Does this make sense?

@businessman.........1 I only want the text located in the head and body.
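A rough sketch of that loop in PHP might look like the following. This is only an illustration: the sites and pages table names, columns, and connection details are made up for the example, not taken from this thread.

```php
<?php
// Sketch only: 'sites' and 'pages' are placeholder table names.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');

$store = $db->prepare('INSERT INTO pages (url, html) VALUES (?, ?)');

// Walk the list of sites the database provides.
foreach ($db->query('SELECT url FROM sites') as $row) {
    $html = @file_get_contents($row['url']); // fetch the page source
    if ($html === false) {
        continue;                            // skip sites that fail to load
    }
    $store->execute(array($row['url'], $html));
}
```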
Jessica Posted December 20, 2007

(I responded to the wrong post, sorry)
Ninjakreborn Posted December 20, 2007

Then you just create the basic crawler using file_get_contents() on the URLs.

Once you have the data from the XHTML files, you're going to write a parser. Google "regex" and go through some tutorials on it. You'll want to use regex to pull everything out of the head and body sections, leaving the rest.

From there you can customize it to be more specific based on exactly what you need. See the sketch below this post.

Note that this is against the copyright terms of most sites, as well as against individual site policies. Be ready to get your server's IP banned from ever visiting the site again if you misuse what you are trying to create.
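A minimal sketch of that parsing step, assuming the page source is already in $html (e.g. from file_get_contents()); the patterns are illustrative, not a complete HTML parser:

```php
<?php
// $html is assumed to hold the fetched page source.
$head = '';
$body = '';

// The /i flag ignores case, /s lets "." match across newlines.
if (preg_match('#<head[^>]*>(.*?)</head>#is', $html, $m)) {
    $head = $m[1];
}
if (preg_match('#<body[^>]*>(.*?)</body>#is', $html, $m)) {
    $body = $m[1];
}

// strip_tags() leaves only the text, per "I only want the text located in head and body".
$headText = trim(strip_tags($head));
$bodyText = trim(strip_tags($body));
```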
anon Posted December 20, 2007

Why is this illegal?
phpSensei Posted December 20, 2007

"Why is this illegal?"

Some sites just don't allow it, as it is against their policies.
anon Posted December 20, 2007

But Google does it.
trq Posted December 21, 2007

"But Google does it."

What?