rishiraj Posted October 15, 2007

I want to make a crawler in PHP that will crawl Google search results for given keywords. The procedure will be something like this:

1. There will be a list of thousands of keywords in a file, in CSV or another format.
2. The crawler will query google.co.in for each keyword in the file.
3. The title, description, and URL of the top 10 results will be collected and stored in a MySQL database.
4. The crawler will move on to the next keyword after some delay, and the loop will continue until it reaches the daily limit of keywords to crawl. The next day it will start again.

I need some suggestions on:

1. How to crawl pages without using any add-ons. (I am going to run this from a free server, not my own machine, so I will only have PHP, MySQL, and general features.)
2. What kind of parsing I should use to extract the titles, descriptions, and URLs from the HTML code.
3. What the delay and daily crawl limit should be. (I don't want to get banned by Google for automated queries.)

I will be really thankful for any kind of help. Links to articles are most welcome.

Link to comment https://forums.phpfreaks.com/topic/73303-crawling-google-search-result-pages/
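The loop in steps 1 to 4 above could be sketched roughly as follows. This is only an illustration under stated assumptions: `crawlBatch()` is a hypothetical helper, the keywords are assumed to be loaded into an array already, and `$fetch` stands in for whatever code actually requests and stores the results. The limit and delay values shown are placeholders, not recommendations.

```php
<?php
// Sketch only: crawlBatch() is a hypothetical helper, not an existing API.
// It processes at most $dailyLimit keywords, pausing $delaySeconds between
// requests, and returns the keywords left over for the next run.
function crawlBatch(array $keywords, callable $fetch, $dailyLimit, $delaySeconds)
{
    $batch   = array_slice($keywords, 0, $dailyLimit);
    $last    = count($batch) - 1;
    $results = [];

    foreach ($batch as $i => $keyword) {
        // $fetch is supplied by the caller (e.g. a cURL request plus parsing).
        $results[] = ['keyword' => $keyword, 'data' => $fetch($keyword)];

        // Wait between requests, but not after the final one.
        if ($i < $last && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    return [
        'results'   => $results,
        'remaining' => array_slice($keywords, $dailyLimit),
    ];
}

// Usage with a stub fetcher; the limit (2) and delay (0) are placeholders.
$stubFetch = function ($keyword) {
    return 'results for ' . $keyword;
};
$run = crawlBatch(['web hosting', 'cellphone', 'laptop'], $stubFetch, 2, 0);
print_r($run['remaining']);
```

Saving the `remaining` list (to a file or a MySQL table) is one way for the next day's run to pick up where this one stopped.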
kenrbnsn Posted October 15, 2007

Instead of trying to reinvent the wheel, why don't you use the tools Google provides for that purpose? One is the Google AJAX Search API. All of the Google APIs can be found at Google APIs.

Ken
rishiraj Posted October 15, 2007 Author

I tried that initially, but I didn't find the Google AJAX Search API helpful for getting sponsored results. Because AJAX search is based on JavaScript, I never get the HTML code for the sponsored results: when I view the source, there is no code for the search results. If there is any way to get the sponsored result code from Google AJAX search, let me know.
rishiraj Posted October 16, 2007 Author

Any other suggestions, Ken?
rishiraj Posted October 18, 2007 Author

I am using cURL to open the Google search page:

    $filelocation = "http://www.google.com/search?q=cellphone&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $filelocation);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string
    $html = curl_exec($ch);
    curl_close($ch);

Now I want to get all the sponsored results from $html into PHP variables, like:

    $title[0] = "Cell phone"; // ad title
    $adurl[0] = "http://www.unfoundation.org/vodafone/index.asp"; // ad URL is appended after href=/url?sa or href=/pagead/iclk?sa
    $addescription[0] = "Improving telecommunications to help in times of disaster.";
    $displayurl[0] = "www.UNFoundation.org/vodafone";

I am not able to parse the ad data from the HTML code, because I can't write the regular expression for it. I need some help writing a regex to parse the ad data from the HTML. I can pay for it.

P.S. Google sponsored results appear at the top or to the right of the natural results.
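Since the crawler has to run the same request for many keywords, the hard-coded URL in the post above could be built from the keyword instead. A minimal sketch, assuming only the `q`, `ie`, and `oe` parameters from that URL are needed; `buildSearchUrl()` is a hypothetical helper, not part of the original post.

```php
<?php
// Sketch only: buildSearchUrl() is a hypothetical helper. The q, ie, and oe
// parameters are copied from the URL in the post; the others are left out.
function buildSearchUrl($keyword, $host = 'www.google.com')
{
    $params = [
        'q'  => $keyword,
        'ie' => 'utf-8',
        'oe' => 'utf-8',
    ];

    // http_build_query() URL-encodes each value, so spaces and punctuation
    // in the keyword cannot break the query string.
    return 'http://' . $host . '/search?' . http_build_query($params, '', '&');
}

echo buildSearchUrl('cell phone'), "\n";
```

The returned string can be passed to `CURLOPT_URL` in place of `$filelocation`.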
adrianTNT Posted October 18, 2007

I smell click fraud
rishiraj Posted October 18, 2007 Author

Dude, it's not click fraud. I work for an SEM company that wants to check the competitors for a particular keyword before bidding, so it can bid wisely. This tool will be used to find the competitors for a particular keyword and which keywords each competitor is bidding on. If you need further explanation, PM me.
adrianTNT Posted October 18, 2007

I think this could help you extract some data from the HTML tags (links, titles, etc.):

Class: Get tag value
http://www.phpclasses.org/browse/package/4033.html

I haven't tested it, but it looks like what you need.
rishiraj Posted October 19, 2007 Author

I just need some snippets of code; the rest I can do myself.
mattal999 Posted October 19, 2007

Try using this: http://www.weberdev.com/get_example-4678.html

I've seen it work.
rishiraj Posted October 19, 2007 Author

I need a regular expression to get the details.

Code:

    <a id=an5 href=/pagead/iclk?sa=l&ai=Bjt-Pnum=8&adurl=http://www.westhost.com/package-compare.html%3FDgoo-gene> $3.95 <b>Web Hosting</b></a></font><br>VPS, Huge Disk Space and Bandwidth!<br> Fall Special ends soon...<br><span class=a>www.westhost.com</span>

    <a id=pa3 href=/url?sa=L&ai=B0MF0&q=http://www.3ix.com/%3Fso onmouseover="return true"> 2GB <b>Web Hosting</b> $1/Rs.40</a><br> <font size=-1><span class=a>www.3ix.in</span>

I have only the above two types of code in my document, and I want to extract the following data from it.

Example:

Exact URL: http://www.westhost.com/package-compare.html
Title: $3.95 Web Hosting
Description: VPS, Huge Disk Space and Bandwidth! Fall Special ends soon...
Domain: www.westhost.com

I can work out the rough logic, but I can't write the exact regular expression:

    <a id=(an|pa)[0-9] href=/[^&q|&adurl] (&q|&adurl)=$exacturl%[^ ]> $title </a> <span>$Domain </span>$description </font>

I need a regular expression to parse this data from my HTML code. With a regular expression I can use preg_match_all to get the data.

P.S. The HTML code came from http://www.google.com/search?hl=en&q...=Google+Search (for reference). The exact URL ends at the % sign.

Thanks for any kind of help.
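One possible pattern for the two snippet types quoted above, used with `preg_match_all` as the post suggests. This is a sketch fitted only to those two samples: it assumes the ad link always has an `id` of `an` or `pa` followed by digits, that the target follows `adurl=` or `q=` and ends at the first `%`, and that the domain sits in the next `<span class=a>`. `extractAds()` and `cleanAdText()` are hypothetical helper names, and markup that differs from the samples will need a different pattern.

```php
<?php
// Sketch only: fitted to the two sample ad snippets from the post. It is an
// assumption about the markup, not a general HTML parser.

// Hypothetical helper: replaces tags with spaces and collapses whitespace.
function cleanAdText($fragment)
{
    $text = preg_replace('~<[^>]+>~', ' ', $fragment);
    $text = preg_replace('~\s+~', ' ', $text);
    return trim($text);
}

// Hypothetical helper: returns one array per ad found in $html.
function extractAds($html)
{
    // 1: target URL up to the first "%"   2: link text (title)
    // 3: markup between </a> and the span 4: span text (display domain)
    $pattern = '~<a\s+id=(?:an|pa)\d+\s+href=/[^\s>]*?[&?](?:adurl|q)=([^%\s>]+)'
             . '[^>]*>(.*?)</a>(.*?)<span\s+class=a>(.*?)</span>~is';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $ads = [];
    foreach ($matches as $m) {
        $ads[] = [
            'url'         => $m[1],
            'title'       => cleanAdText($m[2]),
            'description' => cleanAdText($m[3]),
            'domain'      => cleanAdText($m[4]),
        ];
    }
    return $ads;
}

// The two sample snippets, in a nowdoc so "$3.95" is not treated as a variable.
$sample = <<<'HTML'
<a id=an5 href=/pagead/iclk?sa=l&ai=Bjt-Pnum=8&adurl=http://www.westhost.com/package-compare.html%3FDgoo-gene> $3.95 <b>Web Hosting</b></a></font><br>VPS, Huge Disk Space and Bandwidth!<br> Fall Special ends soon...<br><span class=a>www.westhost.com</span>
<a id=pa3 href=/url?sa=L&ai=B0MF0&q=http://www.3ix.com/%3Fso onmouseover="return true"> 2GB <b>Web Hosting</b> $1/Rs.40</a><br> <font size=-1><span class=a>www.3ix.in</span>
HTML;

print_r(extractAds($sample));
```

The description is taken as whatever text sits between the closing `</a>` and the domain span, so it comes back empty for the second snippet, where no text appears there.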