
Rectify this crawler!


d.shankar

I have a spider program that crawls every page of a site and retrieves all the links.

However, I need to normalize those links against the domain name, and doing so has made my code far too long.

For example, if I give it www.google.com, it returns every link it finds, including relative file names such as form.py and subdomains such as mail.google.com.

I need to filter out the subdomains and add the domain name in front of the relative file links.

I hope that makes sense.


<html>
<head>
<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">
<title>More Google products</title>
<style>
<!--
body,td,div,p,a{font-family:arial,sans-serif }
a:link{color:#00c}
a:visited{color:#551a8b}
a:active{color:#f00}
//.q {color:#0000cc;}
.header {
font-size:100%;
font-weight: bold;
}
-->
</style>

</head>
<body bgcolor=#ffffff onLoad="document.gs.reset()" topmargin=2 marginheight=2>
<table border=0 cellpadding=0 cellspacing=2 width=100%>
  <tr>
    <td width="1%" valign=top><a href="/webhp?hl=en"><img src=/images/google_sm.gif alt="Go to Google Home" width=143 height=59 hspace=3 vspace=5 border=0></a></td>
    <td>  </td>

    <td valign=top><table width="100%" border=0 cellpadding=0 cellspacing=0>
        <tr>
          <td height=14 valign=bottom><script><!--
function qs(el) {if (window.RegExp && window.encodeURIComponent) {var qe=encodeURIComponent(document.f.q.value);if (el.href.indexOf("q=")!=-1) {el.href=el.href.replace(new RegExp("q=[^&$]*"),"q="+qe);} else {el.href+="&q="+qe;}}return 1;}
// -->
</script>
              <table border=0 cellpadding=4 cellspacing=0>
                <tr>
                  <td class=q><font size=-1><a id=0a class=q href="/webhp?hl=en&tab=iw" onClick="return qs(this);">Web</a>    <a id=1a class=q href="/imghp?hl=en&tab=wi" onClick="return qs(this);">Images</a>    <a id=2a class=q href="/grphp?hl=en&tab=wg" onClick="return qs(this);">Groups</a>    <a id=4a class=q href="/nwshp?hl=en&tab=wn" onClick="return qs(this);">News</a>    <b>more »</b></font></td>

                </tr>
            </table></td>
        </tr>
        <tr>
          <td nowrap><form name=gs method=GET action=/search>
	 <input type=text name=q size=40 maxlength=2048>
              <input type=submit name="btnG" value="Search the Web">
          </form></td>
        </tr>

      </table>
    </td>
  </tr>
</table>
<table width=100% border=0 cellpadding=0 cellspacing=0>
  <tr>
    <td bgcolor=#3366cc><img width=1 height=1 alt=""></td>
  </tr>
</table><table width=100% border=0 cellpadding=2 cellspacing=0>
   <tr>

    <td colspan=4 bgcolor=#e5ecf9><b>More Google products</b></td>

  </tr>
</table>

<table width="100%"  border="0" cellpadding="0">
  <tr>
    <td align="center" valign="top" width="50%"><table width="95%"  border="0" cellpadding="5">
      <tr>
        <td colspan="2" style="padding-top:15px;"><span class="header">Search</span></td>
      </tr>
        <tr>
          <td valign=top width=35><a href="/alerts?hl=en"><img src="/options/icons/alerts.gif" alt="Alerts" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/alerts?hl=en">Alerts</a> <br>
              <font color="333333">Receive news and search results via email</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://www.google.co.in/blogsearch?hl=en"><img src="/options/icons/blogsearch.gif" alt="Blog Search" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://www.google.co.in/blogsearch?hl=en">Blog Search</a> <br>
              <font color="333333">Find blogs on your favorite topics</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/books?hl=en"><img src="/options/icons/print.gif" alt="Book Search" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/books?hl=en">Book Search</a> <br>
              <font color="333333">Search the full text of books</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://desktop.google.com/en/GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB"><img src="/options/icons/desktop.gif" alt="Desktop" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://desktop.google.com/en/GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB">Desktop</a> <br>
              <font color="333333">Search your own computer</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/dirhp?hl=en"><img src="/options/icons/directory.gif" alt="Directory" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/dirhp?hl=en">Directory</a> <br>
              <font color="333333">Browse the web by topic</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/imghp?hl=en"><img src="/options/icons/images.gif" alt="Images" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/imghp?hl=en">Images</a> <br>
              <font color="333333">Search for images on the web</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/nwshp?hl=en"><img src="/options/icons/news.gif" alt="News" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/nwshp?hl=en">News</a> <br>
              <font color="333333">Search thousands of news stories</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://www.google.co.in/notebook/?hl=en"><img src="/options/icons/notebook.gif" alt="Notebook" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://www.google.co.in/notebook/?hl=en">Notebook</a><sup style="color:red">New!</sup> <br>
              <font color="333333">Clip and collect information as you surf the web</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://scholar.google.com"><img src="/options/icons/scholar.gif" alt="Scholar" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://scholar.google.com">Scholar</a> <br>
              <font color="333333">Search scholarly papers</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/options/specialsearches.html"><img src="/options/icons/special.gif" alt="Specialized Searches" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/options/specialsearches.html">Specialized Searches</a> <br>
              <font color="333333">Search within specific topics</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://toolbar.google.com/T4/intl/en-GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB&tbbrand=GZEZ"><img src="/options/icons/toolbar.gif" alt="Toolbar" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://toolbar.google.com/T4/intl/en-GB/?utm_source=en_GB-et-more&utm_medium=et&utm_campaign=en_GB&tbbrand=GZEZ">Toolbar</a> <br>
              <font color="333333">Add a search box to your browser</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/intl/en/options/universities.html"><img src="/options/icons/univ.gif" alt="University Search" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/intl/en/options/universities.html">University Search</a> <br>
              <font color="333333">Search a specific school's website</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/webhp?hl=en"><img src="/options/icons/web.gif" alt="Web Search" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/webhp?hl=en">Web Search</a> <br>
              <font color="333333">Search over 8 billion web pages</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/intl/en/help/features.html"><img src="/options/icons/calc_img.gif" alt="Web Search Features" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/intl/en/help/features.html">Web Search Features</a> <br>
              <font color="333333">Do more with search</font></font></td>
        </tr>
        <tr>
        <td> </td>
        <td> </td>
      </tr>
    </table></td>
    <td align="center" valign="top" width="50%"><table width="95%"  border="0" cellpadding="5">
      <tr>
        <td colspan="2" style="padding-top:15px;"><span class="header">Explore and innovate</span></td>
      </tr>
      <tr>
          <td valign=top><a href="http://code.google.com"><img src="/options/icons/code.gif" alt="Code" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://code.google.com">Code</a> <br>
              <font color="333333">Download APIs and open source code</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://labs.google.co.in"><img src="/options/icons/labs.gif" alt="Labs" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://labs.google.co.in">Labs</a> <br>
              <font color="333333">Try out new Google products</font></font></td>
        </tr>
      <tr>
        <td colspan="2" style="padding-top:15px;"><span class="header">Communicate, show & share</span></td>
      </tr>
      <tr>
          <td valign=top width=35><a href="http://www.blogger.com/start?hl=en"><img src="/options/icons/blogger.gif" alt="Blogger" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://www.blogger.com/start?hl=en">Blogger</a> <br>
              <font color="333333">Express yourself online</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="https://www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den_GB&hl=en_GB&utm_source=en_GB-more&utm_medium=more&utm_campaign=en_GB"><img src="/options/icons/calendar.gif" alt="Calendar" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="https://www.google.com/accounts/ServiceLogin?service=cl&passive=true&nui=1&continue=http%3A%2F%2Fwww.google.com%2Fcalendar%2Frender%3Fhl%3Den_GB&hl=en_GB&utm_source=en_GB-more&utm_medium=more&utm_campaign=en_GB">Calendar</a> <br>
              <font color="333333">Organise your schedule and share events with friends</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="https://www.google.com/accounts/ServiceLogin?service=writely&passive=true&continue=http%3A%2F%2Fdocs.google.com%2F%3Fhl%3Den_GB&hl=en_GB&ltmpl=homepage&nui=1&utm_source=en_GB-more&utm_medium=more&tm_campaign=en_GB"><img src="/options/icons/dns.gif" alt="Docs & Spreadsheets" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="https://www.google.com/accounts/ServiceLogin?service=writely&passive=true&continue=http%3A%2F%2Fdocs.google.com%2F%3Fhl%3Den_GB&hl=en_GB&ltmpl=homepage&nui=1&utm_source=en_GB-more&utm_medium=more&tm_campaign=en_GB">Docs & Spreadsheets</a> <br>
                <font color="333333">Create and share documents online and access them from anywhere</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://mail.google.com/mail?hl=en&utm_source=en-et-more&utm_medium=et&utm_campaign=en"><img src="/options/icons/gmail.gif" alt="Gmail" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://mail.google.com/mail?hl=en&utm_source=en-et-more&utm_medium=et&utm_campaign=en">Gmail</a> <br>
          <font color="333333">Fast, searchable email with less spam</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="/grphp?hl=en"><img src="/options/icons/groups.gif" alt="Groups" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/grphp?hl=en">Groups</a> <br>
              <font color="333333">Create mailing lists and discussion groups</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://picasa.google.co.in/intl/en/#utm_source=en-all-more&utm_campaign=en-pic&utm_medium=et"><img src="/options/icons/picasa.gif" alt="Picasa" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://picasa.google.co.in/intl/en/#utm_source=en-all-more&utm_campaign=en-pic&utm_medium=et">Picasa</a> <br>
              <font color="333333">Find, edit and share your photos</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://www.google.com/talk/intl/en-GB/#utm_source=en-et-more&utm_medium=et&utm_campaign=en-GB"><img src="/options/icons/talk.gif" alt="Talk" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://www.google.com/talk/intl/en-GB/#utm_source=en-et-more&utm_medium=et&utm_campaign=en-GB">Talk</a> <br>
              <font color="333333">IM and call your friends through your computer</font></font></td>
        </tr>
        <tr>
          <td valign=top><a href="http://translate.google.com/translate_t"><img src="/options/icons/translate.gif" alt="Translate" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://translate.google.com/translate_t">Translate</a> <br>
              <font color="333333">View web pages in other languages</font></font></td>
        </tr>
      <tr>
        <td colspan="2" style="padding-top:15px;"><span class="header">Go mobile</span></td>
      </tr>
      <tr>
          <td valign=top><a href="/mobile"><img src="/options/icons/mobile.gif" alt="Mobile" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="/mobile">Mobile</a> <br>
              <font color="333333">Use Google on your mobile phone</font></font></td>
        </tr>
      <tr>
        <td colspan="2" style="padding-top:15px;"><span class="header">Make your computer work better</span></td>
      </tr>
        <tr>
          <td valign=top><a href="http://pack.google.com/intl/en-gb/pack_installer.html?hl=en-gb&gl=in&utm_source=en_gb_IN-et-more&utm_medium=et&utm_campaign=en_gb_IN"><img src="/options/icons/pack.gif" alt="Pack" width=35 height=35 vspace=1 border=0></a></td>
          <td valign=top><font size="-1"><a href="http://pack.google.com/intl/en-gb/pack_installer.html?hl=en-gb&gl=in&utm_source=en_gb_IN-et-more&utm_medium=et&utm_campaign=en_gb_IN">Pack</a> <br>
              <font color="333333">A free collection of essential software</font></font></td>
        </tr>
    </table></td>
  </tr>
</table>
<center>
   <table width=100% border=0 cellpadding=0 cellspacing=0>
  <tr>
    <td bgcolor=#3366cc><img width=1 height=1 alt=""></td>
  </tr>

</table>
  <table width=100% border=0 cellpadding=2 cellspacing=0 bgcolor=e5ecf9>
    <tr>
      <td bgcolor=e5ecf9 nowrap><table width=100% border=0 cellpadding=0 cellspacing=0>
          <tr>
            <td align=center nowrap><font size=-1>©2007 Google  <a href="/webhp?hl=en">Google Home</a> - <a href="/ads/">Advertising
                Programs</a> - <a href="/about.html">About Google</a></font></td>

          </tr>
      </table></td>
    </tr>
  </table>
</center>
</body>
</html>

 

 

This is the source code of http://www.google.co.in/intl/en/options/

Note that many of the href attributes do not have a hostname such as "www.google.com" at the beginning.

Now I need to prepend the hostname to each of those links and also filter out the subdomains.


Ok, I understand the problem now.  But I don't see why it is difficult.

 

If an href does not have a hostname, then you need to add the hostname.  If it does have a hostname, then you don't need to add it.
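As a rough sketch of that rule, assuming the crawler is written in Python 3 (the thread does not say), the standard library's `urljoin` already does both halves: it resolves an href without a hostname against the page URL and leaves an href that already has a hostname unchanged. The helper name `absolutize` is made up for illustration; the page URL and hrefs come from the source posted above.

```python
from urllib.parse import urljoin

def absolutize(base_url, href):
    """Return href as an absolute URL, resolved against the page it came from.

    Hypothetical helper: hrefs without a hostname inherit the scheme and
    host of base_url; hrefs that already have a hostname are returned as-is.
    """
    return urljoin(base_url, href)

# URL of the page whose source was posted in this thread.
base = "http://www.google.co.in/intl/en/options/"

print(absolutize(base, "/alerts?hl=en"))
# -> http://www.google.co.in/alerts?hl=en
print(absolutize(base, "http://scholar.google.com"))
# -> http://scholar.google.com
print(absolutize(base, "form.py"))
# -> http://www.google.co.in/intl/en/options/form.py
```

Resolving against the full page URL rather than just the hostname matters for hrefs like `form.py`, which are relative to the current directory and not to the site root.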

 

For the subdomains, you can extract the hostname from each link and check whether it is a subdomain.
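One possible way to do that check, again assuming Python 3, is to parse each link with `urlparse` and compare its hostname with the hostname of the start page. The function name `classify_link` and the three labels are invented for this sketch, and treating "the start host minus a leading www." as the site's domain is a simplifying assumption; a robust crawler would need a public-suffix list to find the registered domain.

```python
from urllib.parse import urljoin, urlparse

def classify_link(base_url, href):
    """Label href as 'same-host', 'subdomain', or 'external' relative to base_url.

    Hypothetical helper. Assumes the site's domain is the start host
    with any leading 'www.' removed.
    """
    base_host = (urlparse(base_url).hostname or "").lower()
    # Resolve first so hrefs without a hostname inherit the page's host.
    host = (urlparse(urljoin(base_url, href)).hostname or "").lower()
    site = base_host[4:] if base_host.startswith("www.") else base_host

    if host == base_host or host == site:
        return "same-host"
    if host.endswith("." + site):
        return "subdomain"
    return "external"

page = "http://www.google.co.in/intl/en/options/"
for href in ("/alerts?hl=en",
             "http://labs.google.co.in",
             "http://www.blogger.com/start?hl=en"):
    print(href, "->", classify_link(page, href))
# /alerts?hl=en -> same-host
# http://labs.google.co.in -> subdomain
# http://www.blogger.com/start?hl=en -> external
```

A crawler that should stay on one host would then keep only the links labelled `same-host` and skip the rest.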

 

What is the approach you are using now, and why is this difficult?

