
Crawling Google search result pages


rishiraj


I want to write a crawler in PHP that collects Google search results for a given list of keywords. The procedure would be something like this:

 

1. A file (CSV or another format) holds a list of thousands of keywords.

2. The crawler queries google.co.in for each keyword in the file.

3. The title, description, and URL of each of the top 10 results are collected and stored in a MySQL database.

4. After some delay the crawler moves on to the next keyword, and the loop continues until it reaches a daily limit on the number of keywords crawled. The next day it starts again.

 

I need some suggestions on:

1. How to crawl pages without using any add-ons.

(I am going to run this on a free server, not my own machine, so I will only have PHP, MySQL, and the standard features.)

2. What kind of parsing I should use to extract the titles, descriptions, and URLs from the HTML.

3. What the delay and the daily crawl limit should be. (I don't want to get banned by Google for automated queries.)

 

I would be really thankful for any kind of help. Links to relevant articles are most welcome.
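The keyword loop in steps 1 to 4 can be sketched roughly as follows. This is only an outline under stated assumptions: `crawl_keywords`, the `$fetch` and `$store` callables, and the limit and delay values are hypothetical names and numbers chosen for illustration, and nothing here says what delay or daily limit Google tolerates. The demo uses stub callables so it runs without touching the network or a database.

```php
<?php
/**
 * Process keywords one by one, pausing between requests and stopping
 * at the daily limit. Returns how many keywords were processed so the
 * caller can resume from that offset the next day.
 *
 * $fetch  : hypothetical hook, takes a keyword and returns the page HTML.
 * $store  : hypothetical hook, saves the HTML (or parsed rows) somewhere.
 */
function crawl_keywords(
    array $keywords,
    callable $fetch,
    callable $store,
    int $dailyLimit,
    int $delaySeconds
): int {
    $done = 0;
    foreach ($keywords as $keyword) {
        if ($done >= $dailyLimit) {
            break; // daily limit reached, continue tomorrow
        }
        $keyword = trim($keyword);
        if ($keyword === '') {
            continue; // skip blank lines in the keyword file
        }
        if ($done > 0 && $delaySeconds > 0) {
            sleep($delaySeconds); // wait between consecutive requests
        }
        $store($keyword, $fetch($keyword));
        $done++;
    }
    return $done;
}

// Demo with stubs: a limit of 2 keywords and no delay (assumed values).
$stored = [];
$processed = crawl_keywords(
    ['cellphone', ' web hosting ', '', 'laptop'],
    function (string $keyword): string {
        return "<html>stub page for $keyword</html>";
    },
    function (string $keyword, string $html) use (&$stored): void {
        $stored[$keyword] = $html;
    },
    2,
    0
);
echo $processed, "\n";
```

In a real run `$fetch` would wrap the cURL request and `$store` would insert into MySQL; the keyword array could come from `file('keywords.csv', FILE_IGNORE_NEW_LINES)`, where the file name is again just an assumption.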


Initially I tried that, but the Google AJAX Search API was no help for getting sponsored results.

Because the AJAX search is driven by JavaScript, I never get the HTML for the sponsored results: when I view the page source, there is no code for the search results at all.

If there is any way to get the sponsored results from the Google AJAX Search API, let me know.


I am using cURL to open the Google search page:

 

// Search URL for the keyword "cellphone"
$filelocation = "http://www.google.com/search?q=cellphone&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $filelocation);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
$html = curl_exec($ch); // false if the request failed
curl_close($ch);

 

Now I want to pull all the sponsored results out of $html into PHP variables, like this:

$title[0]="Cell phone"; // ad title

$adurl[0]="http://www.unfoundation.org/vodafone/index.asp";

// the ad URL is appended after href=/url?sa or href=/pagead/iclk?sa

$addescription[0]="Improving telecommunications to help in times of disaster.";

$displayurl[0]="www.UNFoundation.org/vodafone";

 

I am not able to parse the ad data out of the HTML because I can't write the regular expression for it.

I need help writing a regex that parses the ad data from the HTML.

 

I can pay for it.

 

P.S. Google's sponsored results appear above or to the right of the natural results.


Dude, it's not click fraud.

I work for an SEM company that wants to check the competitors for a particular keyword before bidding, so that it can bid wisely.

This tool will be used to find the competitors for a particular keyword and the keywords each competitor is bidding on.

If you need further explanation, PM me.


I need a regular expression to extract the details from the following code.

Code:

<a id=an5 href=/pagead/iclk?sa=l&ai=Bjt-Pnum=8&adurl=http://www.westhost.com/package-compare.html%3FDgoo-gene>
$3.95 <b>Web Hosting</b></a></font><br>VPS, Huge Disk Space and Bandwidth!<br>
Fall Special ends soon...<br><span class=a>www.westhost.com</span>

<a id=pa3 href=/url?sa=L&ai=B0MF0&q=http://www.3ix.com/%3Fso onmouseover="return true">
2GB <b>Web Hosting</b> $1/Rs.40</a><br>
<font size=-1><span class=a>www.3ix.in</span>

 

My document contains only the two kinds of markup shown above,

and I want to extract the following data from them.

 

Example:

exact url: http://www.westhost.com/package-compare.html

Title: $3.95 Web Hosting

Description : VPS, Huge Disk Space and Bandwidth! Fall Special ends soon...

Domain: www.westhost.com

 

I can sketch the logic, but I can't turn it into an exact regular expression:

<a id=(an|pa)[0-9] href=/[^&q|&adurl] (&q|&adurl)=$exacturl%[^ ]> $title </a> <span>$Domain </span>$description </font>

 

 

I need a regular expression that parses this data from my HTML code.

With that regular expression I can use preg_match_all to get the data.

 

P.S. For reference, see http://www.google.com/search?hl=en&q...=Google+Search

That is where I got the HTML code. The exact URL ends at the % sign.
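One possible pattern for the two snippets quoted above, offered as a sketch and not a definitive answer: it assumes the markup always has exactly those two shapes (an `an`/`pa` id, the target after `&adurl=` or `&q=` ending at the first `%`, and the display domain inside `<span class=a>`). The function names `parse_sponsored_ads` and `clean_fragment` are made up for this example, and any change in Google's markup would break the pattern.

```php
<?php
// Sample markup copied from the two snippets quoted in this post.
$sampleHtml = <<<'HTML'
<a id=an5 href=/pagead/iclk?sa=l&ai=Bjt-Pnum=8&adurl=http://www.westhost.com/package-compare.html%3FDgoo-gene>
$3.95 <b>Web Hosting</b></a></font><br>VPS, Huge Disk Space and Bandwidth!<br>
Fall Special ends soon...<br><span class=a>www.westhost.com</span>

<a id=pa3 href=/url?sa=L&ai=B0MF0&q=http://www.3ix.com/%3Fso onmouseover="return true">
2GB <b>Web Hosting</b> $1/Rs.40</a><br>
<font size=-1><span class=a>www.3ix.in</span>
HTML;

// Turn <br> into spaces, drop remaining tags, collapse whitespace.
function clean_fragment(string $fragment): string
{
    $text = strip_tags(str_ireplace('<br>', ' ', $fragment));
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: returns one array per ad found in $html.
function parse_sponsored_ads(string $html): array
{
    // 1: target URL up to the first "%"   2: link text (title)
    // 3: markup between </a> and the span 4: display domain
    $pattern = '~<a\s+id=(?:an|pa)\d+\s+href=/[^\s>]*?&(?:q|adurl)=([^%\s>]+)[^>]*>'
             . '(.*?)</a>(.*?)<span class=a>(.*?)</span>~s';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $ads = [];
    foreach ($matches as $m) {
        $ads[] = [
            'url'         => $m[1],
            'title'       => clean_fragment($m[2]),
            'description' => clean_fragment($m[3]),
            'domain'      => clean_fragment($m[4]),
        ];
    }
    return $ads;
}

$ads = parse_sponsored_ads($sampleHtml);
print_r($ads);
```

For the first snippet this yields the URL, title, description, and domain listed in the example above; the second snippet has no text between the link and the span, so its description comes back as an empty string.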

 

Thanks for any kind of help
