hackalive

Members
  • Posts

    652
  • Joined

  • Last visited

Everything posted by hackalive

  1. Thanks to ignace and MrAdam I have managed to complete my web crawler script for a search engine. Considering the scale this may grow to, I need a clean and efficient DB design, so I am asking how you guys would do it. It would need somewhere for all the keywords and be able to link them to certain sites. One table (tbl_site) might work like this: |id|url|title|description|. Just looking for your opinions on how you think this DB could/should work. Thanks in advance
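For what it's worth, a common shape for this is three tables: one for sites, one for keywords, and a join table linking them (many-to-many). A sketch with assumed table names, using an in-memory SQLite database via PDO so it runs standalone (a production crawler would presumably use MySQL or similar):

```php
<?php
// Hypothetical schema: tbl_site, tbl_keyword, and a join table
// tbl_site_keyword so one keyword can map to many sites and vice versa.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE tbl_site (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE,
    title TEXT,
    description TEXT
)');
$db->exec('CREATE TABLE tbl_keyword (
    id INTEGER PRIMARY KEY,
    word TEXT UNIQUE
)');
// Many-to-many link between sites and keywords.
$db->exec('CREATE TABLE tbl_site_keyword (
    site_id INTEGER,
    keyword_id INTEGER,
    PRIMARY KEY (site_id, keyword_id)
)');

// Example: index one page under one keyword.
$db->exec("INSERT INTO tbl_site (url, title, description)
           VALUES ('http://example.com', 'Example', 'An example page')");
$db->exec("INSERT INTO tbl_keyword (word) VALUES ('example')");
$db->exec('INSERT INTO tbl_site_keyword VALUES (1, 1)');

// Search: find site URLs for a keyword.
$stmt = $db->prepare('SELECT s.url FROM tbl_site s
    JOIN tbl_site_keyword sk ON sk.site_id = s.id
    JOIN tbl_keyword k ON k.id = sk.keyword_id
    WHERE k.word = ?');
$stmt->execute(['example']);
$url = $stmt->fetchColumn();
echo $url; // http://example.com
```

The join table is the part that keeps it clean at scale: keywords are stored once each, and a lookup is a single indexed join rather than a LIKE scan over a text column.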
  2. PS thanks very very very much to all those who have helped me so much in the past, especially ignace who is my saviour, thanks a million. And now that this post is closing, thanks go to ignace and MrAdam
  3. Well okay thorpe, give me an email address and I'll send you the Beta when it's ready (of course, as with all OS virtualisation projects, this is probably a few months away). And I was warned about you by people on other forums and in person about being a ... well, frankly, a stuck-up smartass. At least people like ignace and MrAdam and many others offer advice and comments without the stuck-up smartass attitude; no wonder people leave this forum and never return.
  4. Don't listen to thorpe, anything is possible, it just depends how much work, time and effort you are willing to put in. Time and time again I hear on this and other forums "that's not possible because...", but guess what, most have ended up working (with loads of work and stress), and for the others I have some ideas on how they might work (but I don't have a lot of time to do it at the moment). So if you are determined, it will happen
  5. Well thorpe, my OS virtualisation project has started and is going well, so umm yeah. Just trying to clean this cURL thing out of my inbox (so to speak)
  6. <?php $ch = curl_init(); curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler'); ?> Yeah, so if I do the above then do the crawl using cURL, the User-Agent will be recorded as HackAliveCrawler, yes?
  7. okay thorpe so what do I set $ch as?
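$ch is just the handle that curl_init() returns; every subsequent curl_setopt() call applies to that handle. A minimal sketch putting the pieces together (the URLs are placeholders, and the actual network fetch is left commented out):

```php
<?php
// curl_init() returns the cURL handle; all options are set against it.
$ch = curl_init('http://example.com/');
$ok = curl_setopt_array($ch, [
    CURLOPT_USERAGENT      => 'HackAliveCrawler http://mysite.com/hackalivecrawler',
    CURLOPT_RETURNTRANSFER => true,  // return the page body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects
]);
// $html = curl_exec($ch);  // performs the request, sending the UA above
curl_close($ch);
```

With CURLOPT_USERAGENT set like this, the target server's access logs and stats packages will record the request under that User-Agent string.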
  8. MrAdam, for this code <?php curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler'); echo $_SERVER['HTTP_USER_AGENT']; echo "|"; echo $_SERVER['USER_AGENT']; ?>
  9. yeah, I am just doing it on a sample page at the moment with no cURL, just exactly what I have posted
  10. Okay, slight problem: this code <?php header('User-Agent: HackAliveCrawler http://mysite.com/hackalivecrawler'); echo $_SERVER['HTTP_USER_AGENT']; ?> does not return the User-Agent I set, any reason why? And if I use USER_AGENT instead of HTTP_USER_AGENT it returns nothing. thanks once again
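The reason this fails: header() sets the *response* headers your script sends back to the browser, while $_SERVER['HTTP_USER_AGENT'] reads the *request* header the client sent in. The two never meet, so the User-Agent has to be set on the client (cURL) side, not in the receiving script. A small illustration, run from the CLI where no browser request exists:

```php
<?php
// header() queues a header on the OUTGOING response; it never changes
// the INCOMING request headers that populate $_SERVER.
header('User-Agent: HackAliveCrawler http://mysite.com/hackalivecrawler');

// On the CLI there is no HTTP request at all, so no HTTP_USER_AGENT key
// exists regardless of what header() was called with:
$request_ua = $_SERVER['HTTP_USER_AGENT'] ?? '(no request UA)';
echo $request_ua; // (no request UA)
```

(USER_AGENT returns nothing for the same reason: PHP only ever exposes the client's header as HTTP_USER_AGENT, and only when a client actually sent one.)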
  11. Yes, with PHP. Thanks a million once again to ignace (who many a time has saved me from trawling the entire internet) and also to MrAdam, thanks.
  12. Because MLBot has this; it is what I want to achieve
  13. Okay, so I get that it is part of the request headers; how then can I set my own, like MLBot and Google Crawler do, for when it is executed via a cron job?
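It makes no difference whether the script is started by cron or by a browser: the crawler itself is the HTTP client, so the User-Agent belongs on each outgoing request it makes. With cURL that is CURLOPT_USERAGENT, as above; the same idea with PHP's stream wrappers (no cURL extension needed) looks like this:

```php
<?php
// A stream context carries the User-Agent for file_get_contents() etc.
$context = stream_context_create([
    'http' => [
        'user_agent' => 'HackAliveCrawler http://mysite.com/hackalivecrawler',
    ],
]);
// $html = file_get_contents('http://example.com/', false, $context);  // placeholder URL

// Confirm what the context will send:
$opts = stream_context_get_options($context);
echo $opts['http']['user_agent'];
```

Whether cron invokes `php crawler.php` or a browser hits the same script over HTTP, this client-side setting is what the crawled sites' stats will record.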
  14. Oh, by the way, it will be running as part of a cron job and sometimes via direct browser. Thought I might just add this as it may affect the User-Agent directive mentioned
  15. Okay, thanks once again ignace. I'm just unsure how to set the User-Agent directive, so if you know a good link for this or can tell me, it will be much appreciated. Thanks again ignace
  16. Okay, so thorpe has locked my post and pointed out it is far too vague. So here I go again. Firstly, I know how to build a basic PHP cURL web page crawler. Now my question is more specific: how do I get site stats to recognise the crawler, the way ones such as Google Bot and MLBot are listed (MLBot http://www.metadatalabs.com/mlbot, so mine would be HackAliveCrawler http://mysite.com/hackalivecrawler or similar)? And how do I get it to recognise ROBOTS meta tags, i.e. NOINDEX, NOFOLLOW, or INDEX, FOLLOW etc., and robots.txt? (To do this I need the first part to work, the name-setting part, e.g. GoogleBot or MLBot etc.) Hope this is less vague and can produce some answers. Links to tutorials or other forums that will achieve these desired results, as well as personal opinions and comments, are most welcome. Many thanks in advance.
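For reference, a minimal sketch of the second half of the question: the robots checks. The function names here are made up for illustration, and real robots.txt matching has more rules (wildcards, Allow precedence, longest-match); this only handles simple prefix Disallow lines and a basic ROBOTS meta tag:

```php
<?php
// Return true if $path is allowed for $agent by the robots.txt text.
// Simplified: prefix-only Disallow matching, no Allow/wildcard support.
function robots_txt_allows(string $robotsTxt, string $agent, string $path): bool {
    $applies = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
            $ua = trim($m[1]);
            $applies = ($ua === '*' || stripos($agent, $ua) !== false);
        } elseif ($applies && preg_match('/^Disallow:\s*(\S*)/i', $line, $m)) {
            if ($m[1] !== '' && strpos($path, $m[1]) === 0) {
                return false; // path falls under a Disallow prefix
            }
        }
    }
    return true;
}

// Return [index, follow] flags from a page's ROBOTS meta tag.
function meta_robots_flags(string $html): array {
    if (preg_match('/<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)/i', $html, $m)) {
        $c = strtolower($m[1]);
        return [strpos($c, 'noindex') === false, strpos($c, 'nofollow') === false];
    }
    return [true, true]; // no tag means index + follow by default
}

$robots = "User-agent: HackAliveCrawler\nDisallow: /private/";
var_dump(robots_txt_allows($robots, 'HackAliveCrawler', '/private/page.html')); // false
var_dump(meta_robots_flags('<meta name="robots" content="noindex, follow">'));  // [false, true]
```

The crawler would fetch /robots.txt once per host, call robots_txt_allows() before each fetch, and call meta_robots_flags() on each fetched page to decide whether to index it and whether to queue its links.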
  17. Hello, I am looking to build a web crawler such as MLBot. It must recognise robots.txt and the ROBOTS meta tag, and when a site such as WordPress shows visitor stats it should list the crawler (e.g. MLBot http://www.metadatalabs.com/mlbot). So how can I build a crawler that will be listed as HackAliveBot or HackAliveCrawler and will recognise robots.txt and the ROBOTS meta tag? Thanks so much in advance
  18. Maybe so people can see where this is headed, here is the first "tag" I want to implement: <ha:group gid="66666"></ha:group> It would be used like this: <ha:group gid="66666">Hello, you are part of group 66666<ha:else>you are NOT part of group 66666</ha:else></ha:group> So what I am after is how to make the XMLNS sheet that recognises and converts the tags (how to parse this, or any tags). All suggestions are welcome. I know I need an XML namespace (XMLNS) but need to know how to build up the XMLNS sheet etc. to achieve the above sample tag. Thanks guys in advance
  19. also if anyone knows another forum or site to post my question on please let me know, thanks
  20. Anyone know how I can implement the XML parser and XML namespace (XSLT template) to make this all work?
  21. Perfect, thanks so much ignace and Tazerenix, very very very much appreciated
  22. Okay, what you suggested ignace works except..... now all the theming etc. is gone (it has replaced <PL1> with the file, but all the surrounding stuff for <PL1>, which should stay, is also gone). I have: ob_start(); $A = str_replace('<A>', require_once($dir.'/'.$file), $A); $A = ob_get_contents(); return $A;
  23. @ignace, yes it is returning a 1. @Tazerenix, yes that PHP needs to be there; I need to achieve this somehow with the file structure I have going. So if anyone can think of a way to achieve this, let me know please, thanks in advance
  24. Tazerenix, I have just tried your suggestion and it doesn't handle the PHP
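For anyone hitting the same wall: require_once() returns the included file's *return value* (1 by default), not its printed output, which is why the str_replace() above splices a literal "1" into the template and why the thread sees "returning a 1". Capturing the include's output with output buffering first, then doing the replace, fixes it. A self-contained demo (the temp file and template stand in for the real theme files):

```php
<?php
// Create a stand-in include file that prints something when required.
$file = tempnam(sys_get_temp_dir(), 'pl');
file_put_contents($file, '<?php echo "rendered page"; ?>');

$template = '<div class="theme"><PL1></div>';

// Capture the include's OUTPUT rather than its return value.
ob_start();
require $file;                 // its echo goes into the buffer
$content = ob_get_clean();     // "rendered page"

// Now splice the captured output into the template; the surrounding
// theme markup around <PL1> is preserved.
$page = str_replace('<PL1>', $content, $template);
echo $page; // <div class="theme">rendered page</div>

unlink($file);
```

ob_get_clean() both returns the buffer and ends buffering, so the template markup around the placeholder is never swallowed by a still-open buffer.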