What do you mean by "detect"?

If by detect you mean finding new domains, you don't. You can visit sites where people submit new domains and scrape them, or work with the domains that are submitted to your own engine.

 

If someone searches your engine for a domain you don't have indexed, or if someone submits a URL, then your bot, slurper, scraper, crawler, or whatever you want to call it goes to the site and checks whether it's alive. If it is, it grabs the robots.txt file and goes from there. Some bots only index the root of the site; those are mostly called "directories". Others follow links while respecting the disallows in the robots.txt file; these are mostly termed "search engines".
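
To make the robots.txt step concrete, here is a minimal sketch in Python (not the PHP mentioned later in this post) using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "ExampleBot" user agent are made up for illustration; a real bot would download the file from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this
# from http://<host>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs this user agent is allowed to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


rules = build_rules(ROBOTS_TXT)
candidates = [
    "http://example.com/",
    "http://example.com/private/notes.html",
    "http://example.com/about.html",
]
crawlable = allowed_urls(rules, "ExampleBot", candidates)
print(crawlable)
```

A "directory" style bot would stop after the root URL; a "search engine" style bot would run every link it finds through a check like allowed_urls() before following it.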

 

What you do is index the root, then follow links and index them as you go along. If you want to do it like Google, you match the keywords against the page headings (<h1>, <h2>, in succession), and then the text of the page against the <h> tags and keywords, to come up with a ranking. Then, like Google, you use secret incantations, voodoo dolls and bat blood drippings to decide the final rankings.
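
As a rough illustration of matching keywords against headings and page text, here is a small Python sketch. It is not how Google ranks anything; the weights (3 for a heading hit, 1 for a body hit), the class name PageText and the sample pages are all assumptions picked for the example.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect words inside <h1>-<h3> separately from the rest of the page."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.heading_words = []
        self.body_words = []
        self._open_headings = 0  # how many heading tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._open_headings += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._open_headings > 0:
            self._open_headings -= 1

    def handle_data(self, data):
        words = data.lower().split()
        if self._open_headings:
            self.heading_words.extend(words)
        else:
            self.body_words.extend(words)


def score(html, keywords, heading_weight=3, body_weight=1):
    """Count keyword hits, weighting heading hits more than body hits."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    total = 0
    for keyword in (k.lower() for k in keywords):
        total += heading_weight * parser.heading_words.count(keyword)
        total += body_weight * parser.body_words.count(keyword)
    return total


# Two made-up pages to rank for the query "flights".
pages = {
    "a": "<h1>Cheap flights</h1><p>Book flights to Rome today.</p>",
    "b": "<h1>Travel blog</h1><p>We took two flights last year.</p>",
}
ranked = sorted(pages, key=lambda name: score(pages[name], ["flights"]), reverse=True)
print(ranked)
```

Page "a" ranks first because the keyword appears in its heading as well as in its text. Anything beyond this (the incantations part) is where real engines differ.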

 

As to the code: use cURL or wget, or even just use file() or file_get_contents(). From there on it's up to you how complex you want your code to be and how you want to break up the work to be done.
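
For comparison, the same "go and get the page" step looks like this in Python with only the standard library. This is a sketch, not a full crawler: the function name fetch, the 10 second timeout and the "ExampleBot/0.1" user agent are assumptions. Returning None on any failure doubles as a crude "is the site alive" check.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10.0, user_agent="ExampleBot/0.1"):
    """Return the decoded body of url, or None if it cannot be retrieved."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (urllib.error.URLError, ValueError, OSError):
        # Covers dead hosts, HTTP errors, timeouts and malformed URLs.
        return None
```

Usage would be something like body = fetch("http://example.com/") followed by a check for None before parsing anything.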

 

As to indexing itself: whatever scheme you come up with that works. Start simple and develop from there.
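
One simple scheme to start from is an inverted index: a map from each word to the set of pages containing it. The Python sketch below keeps everything in memory and uses made-up page data; a real engine would store this in a database and add ranking on top.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Made-up page texts, as they might look after stripping the HTML.
pages = {
    "http://example.com/": "Cheap flights to Rome",
    "http://example.com/blog": "Our trip to Rome by train",
}
index = build_index(pages)
print(search(index, "rome flights"))
```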

 

One thing: the more complex you make your script, the deeper you go into each site, and the more work your script does, the sooner you will need one or more dedicated servers. Your host will quickly tire of the demands your script puts on a shared server, and your script will not be able to function properly under the restrictions of a shared server. Not to mention that, if you are serious about running a search engine, you will quickly run out of disk space and bandwidth. What your crawler pulls in will also count against your bandwidth.

 

 

HTH

Teamatomic

Thanks for your help, but I still have a question.

 

How does Google index websites? How does it recognize new websites? That's my problem.

 

thanks

 

IF:

  a) Your site is registered: it will appear on a list of newly registered domains (usually the case).

  b) Your site is linked to by any known site: the spider will crawl that link and discover it.

  c) Your site is on a shared host (more often the case): Google will crawl the IP and find your site.
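
Case b) is the one you can reproduce yourself. Here is a minimal Python sketch of it: parse a page from a site you already know, collect its links, and keep the hostnames you have not seen before. The names LinkCollector and new_hosts and the example hostnames are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def new_hosts(page_url, html, known_hosts):
    """Return hostnames linked from the page that are not yet known."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    collector.close()
    found = set()
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme in ("http", "https") and parsed.hostname:
            if parsed.hostname not in known_hosts:
                found.add(parsed.hostname)
    return found


html = (
    '<a href="/about">About</a> '
    '<a href="http://new-site.example/page">A new site</a> '
    '<a href="mailto:someone@known.example">Mail</a>'
)
discovered = new_hosts("http://known.example/", html, {"known.example"})
print(discovered)
```

Each discovered hostname would then go onto the queue of sites to check, fetch and index.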


Thanks for your help. Now, where can I find the list of new domains that have been submitted?

 

 

Good question, but due to competition you cannot see who has just submitted their info to Google to be listed in its search engine.

 

There is no list of newly added domain names available to anybody except Google staff.

 

Nor is there any information on domain names waiting to be added by Google.

 

Building a search engine (a proper one) takes years and costs thousands or even millions to create, especially the infrastructure and network behind it.

 

Scripts that scrape info from Google are not real search engines.

 

Real search engines are massive, very massive.

 

The code for indexing websites in a search engine is massive; you have to be able to walk and talk MySQL...

 
