miniramen (Posted June 15, 2010)
Hello,

People have been telling me that if I want to do any crawling, I need to know the sitemap, and that I need to do XML parsing. Then I realized that sitemaps are written in XML. Sorry, my noobness is unbearable even for me sometimes.

So if I need to find the sitemap of a website, how do I go about doing it?
Bottyz (Posted June 15, 2010)
Sitemaps are usually stored in the base directory of the website, which is also where Google prefers them. So try http://www.website.com/sitemap.xml or http://www.website.com/Sitemap.xml. I'm not sure whether you can search for it any other way.
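To make the "try the base directory" idea concrete, here is a minimal Python sketch that builds the two candidate URLs mentioned above from any page address on the site. The function name and the constant are made up for illustration, and a site is free to keep its sitemap somewhere else entirely.

```python
from urllib.parse import urljoin

# The two file names suggested in the post above; other names are possible.
COMMON_SITEMAP_NAMES = ["sitemap.xml", "Sitemap.xml"]


def candidate_sitemap_urls(page_url):
    """Return guesses for the sitemap location in the site's base directory."""
    root = urljoin(page_url, "/")  # strip the path, keep scheme and host
    return [urljoin(root, name) for name in COMMON_SITEMAP_NAMES]


if __name__ == "__main__":
    for url in candidate_sitemap_urls("http://www.website.com/forum/index.php"):
        print(url)
```

Each candidate still has to be requested to see whether it exists; a 404 just means the guess was wrong.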
DavidAM (Posted June 15, 2010)
You might look for the Sitemap: entries in the robots.txt file. The entry is not part of the original robots.txt standard (it comes from the sitemaps.org protocol), but Google uses it. You ARE checking that file anyway, right? You should be.
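Pulling the Sitemap: entries out of a robots.txt body takes only a few lines. The sketch below is one possible way to do it in Python, assuming the file has already been downloaded into a string; the function name is hypothetical.

```python
def sitemaps_from_robots(robots_txt):
    """Collect the URLs named on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /inbox/\nSitemap: http://www.example.com/sitemap.xml\n"
    print(sitemaps_from_robots(sample))
```

The field name is matched case-insensitively because real files use both Sitemap: and sitemap:. In Python 3.8 and later, urllib.robotparser.RobotFileParser also offers a site_maps() method that returns the same information.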
ignace (Posted June 15, 2010)
Quoting miniramen: "So if I need to find a sitemap of a website, how do I go about doing it?"

You don't need the sitemap. Start by fetching robots.txt and respecting the rules it specifies. Then read index.html and obey the <meta name="ROBOTS"> tag if it is present. Fill your queue with every URL you find, store whatever you think is relevant, and continue with the next URL in the queue.
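The queue-based approach described above could look roughly like this in Python. It is a sketch under several assumptions, not a production crawler: the `fetch` callable, the `crawl` and `PageParser` names, the same-host restriction, and the `max_pages` limit are all illustrative choices, "relevant" is simplified to "not marked noindex", and the standard library's RobotFileParser stands in for the robots.txt check.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class PageParser(HTMLParser):
    """Collects link targets and the content of <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.robots_meta = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_meta = (attrs.get("content") or "").lower()


def crawl(start_url, fetch, robots_txt="", agent="*", max_pages=50):
    """Breadth-first crawl of one host; fetch(url) returns HTML text or None."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    host = urlparse(start_url).netloc
    queue, seen, stored = deque([start_url]), {start_url}, []

    while queue and len(stored) < max_pages:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # robots.txt disallows this URL
        html = fetch(url)
        if html is None:
            continue  # page missing or not retrievable
        page = PageParser()
        page.feed(html)
        if "noindex" not in page.robots_meta:
            stored.append(url)  # keep the page
        if "nofollow" in page.robots_meta:
            continue  # do not queue links found on this page
        for href in page.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library, so it can be tried against a dictionary of canned pages before being pointed at a real site.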
YTxMasterModzx (Posted November 3, 2012)
If you have a website with a URL like www.example.com, you can find its robots.txt by appending the file name: www.example.com/robots.txt

You might then see something like this:

    User-agent: *
    Allow: /
    Disallow: /inbox/
    Disallow: /levels/
    Disallow: /levels/extras/userpass.txt
    Disallow: /users/

    User-agent: Mediapartners-Google
    Disallow:

    #Begin Attracta SEO Tools Sitemap. Do not remove
    sitemap: http://cdn.attracta.com/sitemap/2165581.xml.gz
    #End Attracta SEO Tools Sitemap. Do not remove

The sitemap: line tells you where the sitemap is. Note that robots.txt is a file in the site root rather than a directory, and not every website has one. It can also show which areas the administrator wants crawlers to stay out of, but you won't always find a sitemap entry in it.
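Since the sitemap entry in this example points at a .xml.gz file, here is a small Python sketch of reading the page URLs out of a sitemap that may or may not be gzip-compressed. It assumes the file follows the sitemaps.org 0.9 schema and that its raw bytes have already been downloaded; the function name is made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(raw_bytes):
    """Return the <loc> values from a sitemap or sitemap index, gzipped or not."""
    if raw_bytes[:2] == b"\x1f\x8b":  # gzip magic number
        raw_bytes = gzip.decompress(raw_bytes)
    root = ET.fromstring(raw_bytes)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

The bytes could come from something like `urllib.request.urlopen(sitemap_url).read()`. If the file turns out to be a sitemap index, the returned URLs are themselves sitemaps and can be fed through the same function.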