d.shankar Posted September 11, 2007 I have a spider program that crawls all pages and retrieves all links, but I have to align the links properly with the domain name. In doing so, my line count has grown too large. For example, if I give it www.google.com, it shows all links and subdomains, such as form.py, mail.google.com ... I have to filter out the subdomains and add the domain name in front of the file links. I hope that makes sense. Link to comment https://forums.phpfreaks.com/topic/68791-rectify-this-crawler/
btherl Posted September 11, 2007 Can you give some more detail? I don't understand your question.
d.shankar Posted September 11, 2007 <html> <head> <meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8"> <title>More Google products</title> <style> <!-- body,td,div,p,a{font-family:arial,sans-serif } a:link{color:#00c} a:visited{color:#551a8b} a:active{color:#f00} //.q {color:#0000cc;} .header { font-size:100%; font-weight: bold; } --> </style> </head> <body bgcolor=#ffffff onLoad="document.gs.reset()" topmargin=2 marginheight=2> <table border=0 cellpadding=0 cellspacing=2 width=100%> <tr> <td width="1%" valign=top><a href="/webhp?hl=en"><img src=/images/google_sm.gif alt="Go to Google Home" width=143 height=59 hspace=3 vspace=5 border=0></a></td> <td> </td> <td valign=top><table width="100%" border=0 cellpadding=0 cellspacing=0> <tr> <td height=14 valign=bottom><script><!-- function qs(el) {if (window.RegExp && window.encodeURIComponent) {var qe=encodeURIComponent(document.f.q.value);if (el.href.indexOf("q=")!=-1) {el.href=el.href.replace(new RegExp("q=[^&$]*"),"q="+qe);} else {el.href+="&q="+qe;}}return 1;} // --> </script> <table border=0 cellpadding=4 cellspacing=0> <tr> <td class=q><font size=-1><a id=0a class=q href="/webhp?hl=en&tab=iw" onClick="return qs(this);">Web</a> <a id=1a class=q href="/imghp?hl=en&tab=wi" onClick="return qs(this);">Images</a> <a id=2a class=q href="/grphp?hl=en&tab=wg" onClick="return qs(this);">Groups</a> <a id=4a class=q href="/nwshp?hl=en&tab=wn" onClick="return qs(this);">News</a> <b>more »</b></font></td> </tr> </table></td> </tr> <tr> <td nowrap><form name=gs method=GET action=/search> <input type=text name=q size=40 maxlength=2048> <input type=submit name="btnG" value="Search the Web"> </form></td> </tr> </table> </td> </tr> </table> <table width=100% border=0 cellpadding=0 cellspacing=0> <tr> <td bgcolor=#3366cc><img width=1 height=1 alt=""></td> </tr> </table><table width=100% border=0 cellpadding=2 cellspacing=0> <tr> <td colspan=4 bgcolor=#e5ecf9><b>More Google 
products</b></td> </tr> </table> <table width="100%" border="0" cellpadding="0"> <tr> <td align="center" valign="top" width="50%"><table width="95%" border="0" cellpadding="5"> <tr> <td colspan="2" style="padding-top:15px;"><span class="header">Search</span></td> </tr> <tr> <td valign=top width=35><a href="/alerts?hl=en"><img src="/options/icons/alerts.gif" alt="Alerts" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/alerts?hl=en">Alerts</a> <br> <font color="333333">Receive news and search results via email</font></font></td> </tr> <tr> <td valign=top><a href="http://www.google.co.in/blogsearch?hl=en"><img src="/options/icons/blogsearch.gif" alt="Blog Search" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://www.google.co.in/blogsearch?hl=en">Blog Search</a> <br> <font color="333333">Find blogs on your favorite topics</font></font></td> </tr> <tr> <td valign=top><a href="/books?hl=en"><img src="/options/icons/print.gif" alt="Book Search" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/books?hl=en">Book Search</a> <br> <font color="333333">Search the full text of books</font></font></td> </tr> <tr> <td valign=top><a href="http://desktop.google.com/en/GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB"><img src="/options/icons/desktop.gif" alt="Desktop" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://desktop.google.com/en/GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB">Desktop</a> <br> <font color="333333">Search your own computer</font></font></td> </tr> <tr> <td valign=top><a href="/dirhp?hl=en"><img src="/options/icons/directory.gif" alt="Directory" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/dirhp?hl=en">Directory</a> <br> <font color="333333">Browse the web by topic</font></font></td> </tr> <tr> <td valign=top><a 
href="/imghp?hl=en"><img src="/options/icons/images.gif" alt="Images" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/imghp?hl=en">Images</a> <br> <font color="333333">Search for images on the web</font></font></td> </tr> <tr> <td valign=top><a href="/nwshp?hl=en"><img src="/options/icons/news.gif" alt="News" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/nwshp?hl=en">News</a> <br> <font color="333333">Search thousands of news stories</font></font></td> </tr> <tr> <td valign=top><a href="http://www.google.co.in/notebook/?hl=en"><img src="/options/icons/notebook.gif" alt="Notebook" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://www.google.co.in/notebook/?hl=en">Notebook</a><sup style="color:red">New!</sup> <br> <font color="333333">Clip and collect information as you surf the web</font></font></td> </tr> <tr> <td valign=top><a href="http://scholar.google.com"><img src="/options/icons/scholar.gif" alt="Scholar" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://scholar.google.com">Scholar</a> <br> <font color="333333">Search scholarly papers</font></font></td> </tr> <tr> <td valign=top><a href="/options/specialsearches.html"><img src="/options/icons/special.gif" alt="Specialized Searches" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/options/specialsearches.html">Specialized Searches</a> <br> <font color="333333">Search within specific topics</font></font></td> </tr> <tr> <td valign=top><a href="http://toolbar.google.com/T4/intl/en-GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB&tbbrand=GZEZ"><img src="/options/icons/toolbar.gif" alt="Toolbar" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a 
href="http://toolbar.google.com/T4/intl/en-GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB&tbbrand=GZEZ">Toolbar</a> <br> <font color="333333">Add a search box to your browser</font></font></td> </tr> <tr> <td valign=top><a href="/intl/en/options/universities.html"><img src="/options/icons/univ.gif" alt="University Search" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/intl/en/options/universities.html">University Search</a> <br> <font color="333333">Search a specific school's website</font></font></td> </tr> <tr> <td valign=top><a href="/webhp?hl=en"><img src="/options/icons/web.gif" alt="Web Search" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/webhp?hl=en">Web Search</a> <br> <font color="333333">Search over 8 billion web pages</font></font></td> </tr> <tr> <td valign=top><a href="/intl/en/help/features.html"><img src="/options/icons/calc_img.gif" alt="Web Search Features" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/intl/en/help/features.html">Web Search Features</a> <br> <font color="333333">Do more with search</font></font></td> </tr> <tr> <td> </td> <td> </td> </tr> </table></td> <td align="center" valign="top" width="50%"><table width="95%" border="0" cellpadding="5"> <tr> <td colspan="2" style="padding-top:15px;"><span class="header">Explore and innovate</span></td> </tr> <tr> <td valign=top><a href="http://code.google.com"><img src="/options/icons/code.gif" alt="Code" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://code.google.com">Code</a> <br> <font color="333333">Download APIs and open source code</font></font></td> </tr> <tr> <td valign=top><a href="http://labs.google.co.in"><img src="/options/icons/labs.gif" alt="Labs" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://labs.google.co.in">Labs</a> <br> <font color="333333">Try 
out new Google products</font></font></td> </tr> <tr> <td colspan="2" style="padding-top:15px;"><span class="header">Communicate, show & share</span></td> </tr> <tr> <td valign=top width=35><a href="http://www.blogger.com/start?hl=en"><img src="/options/icons/blogger.gif" alt="Blogger" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://www.blogger.com/start?hl=en">Blogger</a> <br> <font color="333333">Express yourself online</font></font></td> </tr> <tr> <td valign=top><a href="https://www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den_GB&hl=en_GB&utm_source=en_GB-more&utm_medium=more&utm_campaign=en_GB"><img src="/options/icons/calendar.gif" alt="Calendar" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="https://www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den_GB&hl=en_GB&utm_source=en_GB-more&utm_medium=more&utm_campaign=en_GB">Calendar</a> <br> <font color="333333">Organise your schedule and share events with friends</font></font></td> </tr> <tr> <td valign=top><a href="https://www.google.com/accounts/ServiceLogin?service=writely&passive=true&continue=http%3A%2F%2Fdocs.google.com%2F%3Fhl%3Den_GB&hl=en_GB<mpl=homepage&nui=1&utm_source=en_GB-more&utm_medium=more&tm_campaign=en_GB"><img src="/options/icons/dns.gif" alt="Docs & Spreadsheets" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="https://www.google.com/accounts/ServiceLogin?service=writely&passive=true&continue=http%3A%2F%2Fdocs.google.com%2F%3Fhl%3Den_GB&hl=en_GB<mpl=homepage&nui=1&utm_source=en_GB-more&utm_medium=more&tm_campaign=en_GB">Docs & Spreadsheets</a> <br> <font color="333333">Create and share documents online and access them from anywhere</font></font></td> </tr> <tr> <td valign=top><a 
href="http://mail.google.com/mail?hl=en&utm_source=en-et-more&utm_medium=et&utm_campaign=en"><img src="/options/icons/gmail.gif" alt="Gmail" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://mail.google.com/mail?hl=en&utm_source=en-et-more&utm_medium=et&utm_campaign=en">Gmail</a> <br> <font color="333333">Fast, searchable email with less spam</font></font></td> </tr> <tr> <td valign=top><a href="/grphp?hl=en"><img src="/options/icons/groups.gif" alt="Groups" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/grphp?hl=en">Groups</a> <br> <font color="333333">Create mailing lists and discussion groups</font></font></td> </tr> <tr> <td valign=top><a href="http://picasa.google.co.in/intl/en/#utm_source=en-all-more&utm_campaign=en-pic&utm_medium=et"><img src="/options/icons/picasa.gif" alt="Picasa" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://picasa.google.co.in/intl/en/#utm_source=en-all-more&utm_campaign=en-pic&utm_medium=et">Picasa</a> <br> <font color="333333">Find, edit and share your photos</font></font></td> </tr> <tr> <td valign=top><a href="http://www.google.com/talk/intl/en-GB/#utm_source=en-et-more&utm_medium=et&utm_campaign=en-GB"><img src="/options/icons/talk.gif" alt="Talk" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://www.google.com/talk/intl/en-GB/#utm_source=en-et-more&utm_medium=et&utm_campaign=en-GB">Talk</a> <br> <font color="333333">IM and call your friends through your computer</font></font></td> </tr> <tr> <td valign=top><a href="http://translate.google.com/translate_t"><img src="/options/icons/translate.gif" alt="Translate" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://translate.google.com/translate_t">Translate</a> <br> <font color="333333">View web pages in other languages</font></font></td> </tr> <tr> <td colspan="2" 
style="padding-top:15px;"><span class="header">Go mobile</span></td> </tr> <tr> <td valign=top><a href="/mobile"><img src="/options/icons/mobile.gif" alt="Mobile" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="/mobile">Mobile</a> <br> <font color="333333">Use Google on your mobile phone</font></font></td> </tr> <tr> <td colspan="2" style="padding-top:15px;"><span class="header">Make your computer work better</span></td> </tr> <tr> <td valign=top><a href="http://pack.google.com/intl/en-gb/pack_installer.html?hl=en-gb&gl=in&utm_source=en_gb_IN-et-more&utm_medium=et&utm_campaign=en_gb_IN"><img src="/options/icons/pack.gif" alt="Pack" width=35 height=35 vspace=1 border=0></a></td> <td valign=top><font size="-1"><a href="http://pack.google.com/intl/en-gb/pack_installer.html?hl=en-gb&gl=in&utm_source=en_gb_IN-et-more&utm_medium=et&utm_campaign=en_gb_IN">Pack</a> <br> <font color="333333">A free collection of essential software</font></font></td> </tr> </table></td> </tr> </table> <center> <table width=100% border=0 cellpadding=0 cellspacing=0> <tr> <td bgcolor=#3366cc><img width=1 height=1 alt=""></td> </tr> </table> <table width=100% border=0 cellpadding=2 cellspacing=0 bgcolor=e5ecf9> <tr> <td bgcolor=e5ecf9 nowrap><table width=100% border=0 cellpadding=0 cellspacing=0> <tr> <td align=center nowrap><font size=-1>©2007 Google <a href="/webhp?hl=en">Google Home</a> - <a href="/ads/">Advertising Programs</a> - <a href="/about.html">About Google</a></font></td> </tr> </table></td> </tr> </table> </center> </body> </html> This is the source code of http://www.google.co.in/intl/en/options/ Note that the relative href attributes have no "www.google.com" at the beginning. I need to prepend it to each and every such link and also filter out the subdomains.
btherl Posted September 12, 2007 OK, I understand the problem now, but I don't see why it is difficult. If an href does not have a hostname, you need to add the hostname. If it does have one, you don't need to add it. For the subdomains, you can extract the hostname from the link and check whether it is a subdomain. What approach are you using now, and why is this difficult?
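The rule described above (add the hostname only when the href lacks one) could be sketched in PHP roughly as follows. This is a minimal sketch, not the original poster's code: the function name `absolutize_href` is made up for illustration, and it only covers absolute, protocol-relative, root-relative, and simple document-relative hrefs. It does not resolve `../` segments or special schemes such as `mailto:`.

```php
<?php
// Hypothetical helper (not from the thread): turn an href found on
// $pageUrl into an absolute URL. Not a full RFC 3986 resolver.
function absolutize_href(string $href, string $pageUrl): string
{
    // Already absolute (scheme and host present): leave it alone.
    if (is_string(parse_url($href, PHP_URL_SCHEME))
        && is_string(parse_url($href, PHP_URL_HOST))) {
        return $href;
    }

    $base = parse_url($pageUrl);
    $origin = $base['scheme'] . '://' . $base['host'];

    // Protocol-relative link such as //mail.google.com/mail
    if (strpos($href, '//') === 0) {
        return $base['scheme'] . ':' . $href;
    }

    // Root-relative link such as /webhp?hl=en
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Document-relative link: append to the directory of the current page.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

For the page quoted earlier in the thread, `absolutize_href('/webhp?hl=en', 'http://www.google.co.in/intl/en/options/')` would give `http://www.google.co.in/webhp?hl=en`, while a link that already names a host, such as `http://scholar.google.com`, comes back unchanged.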
d.shankar Posted September 12, 2007 How do I check whether it's a subdomain?
Azu Posted September 12, 2007 It's a subdomain if there is more than one dot in the hostname and it doesn't start with www.
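As a rough illustration, the dot-counting rule above might look like the first function below. One caveat worth noting: a host such as `google.co.in` also has more than one dot and no leading `www.`, so the rule would flag it as a subdomain. The second function is an alternative sketch that assumes you already know the base domain you are crawling; both function names are hypothetical.

```php
<?php
// The heuristic from the post above: more than one dot in the hostname,
// and no leading "www.".
function looks_like_subdomain(string $host): bool
{
    return substr_count($host, '.') > 1 && strpos($host, 'www.') !== 0;
}

// Hypothetical alternative: check whether $host sits below a base domain
// that the caller supplies (for example "google.com").
function is_subdomain_of(string $host, string $baseDomain): bool
{
    $suffix = '.' . strtolower($baseDomain);
    $host = strtolower($host);

    // True only when $host ends with ".<baseDomain>" and has a label before it.
    return strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;
}
```

Note that `is_subdomain_of('www.google.com', 'google.com')` is true in this sketch, so a crawler that wants to treat `www` as the main site would need to allow it explicitly.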
d.shankar Posted September 12, 2007 Hehe, I know that. I was asking for a specific function. I use strpos().
Azu Posted September 12, 2007 I guess I didn't understand your question. I thought you were asking how to check if it's a subdomain, so I told you how to check if it's a subdomain. Maybe rephrase your question?
d.shankar Posted September 12, 2007 There's no need to rephrase the question. First I need to find out whether it is a subdomain, and then I have to filter it.
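The two steps named here (detect links on another host, then filter them out) could be combined along these lines. This is only a sketch under assumptions: it uses PHP's DOM extension to pull out the href attributes, treats "subdomain" as "any host other than the one being crawled", and skips document-relative hrefs. The function name `same_host_links` is made up for illustration.

```php
<?php
// Hypothetical helper: return absolute links that stay on the crawled host.
// Links to any other host (mail.google.com, www.blogger.com, ...) are dropped.
function same_host_links(string $html, string $pageUrl): array
{
    $base = parse_url($pageUrl);
    $origin = $base['scheme'] . '://' . $base['host'];

    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);

        if (is_string($host)) {
            // The link names a host: keep it only if it is the crawled host.
            if (strcasecmp($host, $base['host']) !== 0) {
                continue;
            }
            $scheme = parse_url($href, PHP_URL_SCHEME);
            $links[] = is_string($scheme) ? $href : $base['scheme'] . ':' . $href;
        } elseif (strpos($href, '/') === 0) {
            // Root-relative link: put the crawled host in front.
            $links[] = $origin . $href;
        }
        // Document-relative and empty hrefs are ignored in this sketch.
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($links));
}
```

Called with the page source posted earlier and `http://www.google.co.in/intl/en/options/` as `$pageUrl`, this would keep links such as `/alerts?hl=en` (with the host prepended) and drop ones such as `http://mail.google.com/...`.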
Azu Posted September 12, 2007 Hmm.. that's what I thought you were asking.. and I told you how to do that.. *not sure what your problem is now* =S