miniramen

How do I find a sitemap of a website -_-?

Hello,

 

People have been telling me that if I want to do any crawling, I need to know the sitemap... and that I need to do XML parsing...

 

Then I realized that a sitemap is written in XML... sorry, my noobness is unbearable even for me sometimes.

 

So if I need to find a sitemap of a website, how do I go about doing it?


Sitemaps are usually stored in the base directory of the website, which is where Google prefers them. So try http://www.website.com/sitemap.xml or http://www.website.com/Sitemap.xml

 

Not sure if you can do a search for it?
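
For what it's worth, here is a rough Python sketch of the "look in the base directory" idea. The function name candidate_sitemap_urls is made up for illustration, and the two file names are just the ones mentioned above; a site is free to put its sitemap somewhere else.

```python
from urllib.parse import urljoin


def candidate_sitemap_urls(page_url):
    """Return the two usual sitemap locations at the root of page_url's site."""
    # A leading "/" makes urljoin resolve against the site root,
    # no matter how deep page_url is.
    return [urljoin(page_url, "/" + name) for name in ("sitemap.xml", "Sitemap.xml")]


print(candidate_sitemap_urls("http://www.website.com/some/page.html"))
```

You would then request each candidate in turn and keep the first one that actually returns XML.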


You might look for the Sitemap: entries in the robots.txt file. The entry was not part of the original robots.txt convention, but Google and the other major search engines support it. You ARE checking that file anyway, right? You should be.
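
A minimal sketch of pulling those entries out in Python, assuming you have already downloaded robots.txt as a string. The function name sitemap_urls_from_robots and the sample text are my own, not from any library.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return the URLs listed on Sitemap: lines of a robots.txt string."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop comments, then split "key: value" on the first colon only,
        # so the colon inside "http://" stays in the value.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """User-agent: *
Disallow: /users/
# sitemap lines may use any capitalization
Sitemap: http://www.example.com/sitemap.xml
"""
print(sitemap_urls_from_robots(sample))  # ['http://www.example.com/sitemap.xml']
```

An empty list just means the site did not advertise a sitemap there.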



You don't need the sitemap. Start by looking for a robots.txt file and respect the rules it specifies. Then read index.html and obey the <meta name="ROBOTS"> tag if present. Fill your queue with any URLs you find, store whatever you think is relevant, and continue with the next URL in the queue.
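
Roughly, that loop could look like the Python sketch below. It is only an outline under some assumptions: fetch is a function you supply that returns the page text (or None on failure), "example-bot" is a placeholder user agent, and the names crawl and PageScanner are invented for this post. It stays on one host, treats "noindex" as "don't store" and "nofollow" as "don't queue the links", and leaves out politeness delays and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class PageScanner(HTMLParser):
    """Collect <a href> links and the content of <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.robots_meta = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_meta = (attrs.get("content") or "").lower()


def crawl(start_url, fetch, user_agent="example-bot", max_pages=50):
    """Breadth-first crawl of one host; returns the URLs that were stored."""
    root = urlparse(start_url)
    rules = RobotFileParser()
    robots_txt = fetch(f"{root.scheme}://{root.netloc}/robots.txt") or ""
    rules.parse(robots_txt.splitlines())

    queue, seen, stored, fetched = deque([start_url]), {start_url}, [], 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        if not rules.can_fetch(user_agent, url):
            continue  # robots.txt forbids this path
        html = fetch(url)
        if html is None:
            continue
        fetched += 1
        page = PageScanner()
        page.feed(html)
        if "noindex" not in page.robots_meta:
            stored.append(url)
        if "nofollow" in page.robots_meta:
            continue  # the page asks us not to follow its links
        for href in page.links:
            link = urljoin(url, href).split("#", 1)[0]
            if urlparse(link).netloc == root.netloc and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

Passing fetch in as a parameter is just a convenience: you can try the loop against a dict of fake pages (crawl(url, pages.get)) before pointing it at a real site.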


If you have a website with a URL like www.example.com, you can find its robots.txt by appending /robots.txt, like this: www.example.com/robots.txt

Then if you see something like this:

User-agent: *
Allow: /
Disallow: /inbox/
Disallow: /levels/
Disallow: /levels/extras/userpass.txt
Disallow: /users/

User-agent: Mediapartners-Google
Disallow:
#Begin Attracta SEO Tools Sitemap. Do not remove
sitemap: http://cdn.attracta.com/sitemap/2165581.xml.gz
#End Attracta SEO Tools Sitemap. Do not remove

Then you have found the sitemap. Most websites have a robots.txt at their root, and it can tell you which paths the administrator wants crawlers to stay out of, but you won't always find a sitemap listed in it.
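
Once you have the sitemap file itself, reading it is ordinary XML parsing. Here is a small Python sketch, assuming you already downloaded the file as bytes. The function name urls_from_sitemap is my own invention; the namespace is the standard sitemaps.org one. It also unpacks gzip first, since sitemaps are often served as .xml.gz.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(data):
    """Return the <loc> URLs from sitemap bytes (plain or gzip-compressed)."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    # <loc> appears under <url> in a normal sitemap and under <sitemap>
    # in a sitemap index, so iterate over all of them.
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
```

If the root element is a sitemap index, the URLs you get back are further sitemaps, so you would fetch each of those and run the same function again.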

