miniramen Posted June 15, 2010
Hello, People have been telling me that if I want to do any crawling, I need to know the sitemap, and that I need to do XML parsing. Then I realized that a sitemap is written in XML. Sorry, my noobness is unbearable even for me sometimes. So if I need to find the sitemap of a website, how do I go about doing it?
Bottyz Posted June 15, 2010
Sitemaps are usually stored in the base directory of the website, which is what Google prefers. So try http://www.website.com/sitemap.xml or http://www.website.com/Sitemap.xml. I'm not sure whether you can do a search for it.
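To make the suggestion above concrete, here is a minimal Python sketch. It builds the two root-level URLs mentioned in the post and pulls the `<loc>` entries out of a sitemap once you have downloaded it. The function names `candidate_sitemap_urls` and `extract_locs` are made up for illustration, and the download step is left out on purpose.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# The two root-level locations suggested in the post; a site may use neither.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/Sitemap.xml"]


def candidate_sitemap_urls(base_url):
    """Return the root-level sitemap URLs worth probing for a site."""
    return [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]


def extract_locs(sitemap_xml):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    locs = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locs.append(element.text.strip())
    return locs
```

You would request each candidate URL in turn and pass the body of the first one that answers to `extract_locs`.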
DavidAM Posted June 15, 2010
You might look for the Sitemap: entries in the robots.txt file. They are not part of the original robots.txt standard, but Google uses them. You ARE checking that file anyway, right? You should be.
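A rough sketch of that check in Python, assuming you have already fetched the robots.txt body as a string. `sitemaps_from_robots` is a hypothetical helper name, and the field name is matched case-insensitively because sites write it both ways.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")   # split at the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

An empty list simply means the site does not advertise a sitemap this way; it may still have one at a conventional location.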
ignace Posted June 15, 2010
You don't need the sitemap. Start by looking for a robots.txt file and respect the rules it specifies. Then read index.html and obey the <meta name="ROBOTS"> tag if present. Fill your queue with every URL you find. Store whatever you think is relevant and continue with the next URL in the queue.
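The loop described above could be sketched like this in Python. It is only one way to do it, under a few assumptions: `fetch` is a function you supply that returns a page's HTML or `None`, the robots.txt body is passed in as a string, the agent name `example-bot` is a placeholder, and "store whatever is relevant" is reduced to remembering the URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.robotparser import RobotFileParser


class PageParser(HTMLParser):
    """Collects link targets and the content of <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.robots_meta = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_meta = (attrs.get("content") or "").lower()


def crawl(start_url, fetch, robots_txt, agent="example-bot", max_pages=50):
    """Breadth-first crawl of one host; returns the URLs that were stored."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    host = urlparse(start_url).netloc
    queue, seen, stored = deque([start_url]), {start_url}, []
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue  # robots.txt forbids this path
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        page = PageParser()
        page.feed(html)
        if "noindex" not in page.robots_meta:
            stored.append(url)  # stand-in for storing the relevant content
        if "nofollow" in page.robots_meta:
            continue  # the page asks us not to follow its links
        for href in page.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored
```

Passing `fetch` in as a parameter keeps the queue logic separate from the HTTP code, so you can try it against a dictionary of fake pages before pointing it at a real site.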
YTxMasterModzx Posted November 3, 2012
If you have a website with a URL like www.example.com, you can find its robots.txt by adding the file name to the end: www.example.com/robots.txt

If you then see something like this:

    User-agent: *
    Allow: /
    Disallow: /inbox/
    Disallow: /levels/
    Disallow: /levels/extras/userpass.txt
    Disallow: /users/

    User-agent: Mediapartners-Google
    Disallow:

    #Begin Attracta SEO Tools Sitemap. Do not remove
    sitemap: http://cdn.attracta.com/sitemap/2165581.xml.gz
    #End Attracta SEO Tools Sitemap. Do not remove

then the sitemap: line tells you where the sitemap is. Most websites have a robots.txt file in their root, and its Disallow lines can reveal paths the administrator wants kept away from crawlers, but you won't always find a sitemap listed in it.
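Note that the sitemap in this example ends in `.xml.gz`, so it is gzip-compressed and has to be unpacked before any XML parsing. A small Python sketch, assuming you already have the downloaded bytes (`sitemap_bytes_to_text` is a hypothetical helper name, and UTF-8 is assumed as the default encoding):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sitemap_bytes_to_text(data, encoding="utf-8"):
    """Decode a downloaded sitemap body, decompressing it first if gzipped."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data.decode(encoding)
```

Checking the leading bytes rather than the file extension means the same function also works for a plain, uncompressed sitemap.xml.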