phplemon Posted October 8, 2013

Hi, I'm trying to get some search results from Google by using cURL and preg_match.

<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://www.google.se/#q=horses");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($curl);
curl_close($curl);

if (preg_match_all('#<cite>(.*)</cite>#', $result, $cite)) {
    foreach ($cite[0] as $cite) {
        echo $cite . '<br />';
    }
}
?>

It doesn't work. I've used this code on other websites to get other things and it works there. What is the problem?

Thank you
jazzman1 Posted October 8, 2013 (edited)

Save the HTML output of this search link into a file. Then read the content of this file (line by line) and use DOM or a regex to extract the data you want. The script below is one I wrote in Bash for a friend; it searches for open ports on the web (for satellite TVs), and you can take ideas from it. In any case, you need to save this data somewhere before processing it, not just display it in the browser.

#!/bin/bash
# Scan a range of /24 networks for hosts with port 80 or 8080 open,
# then keep those that serve an Enigma web interface.
echo -n 'Enter your ip address and its range: '
read b1 b2 r1 r2

if [ "$r1" -lt 256 ] && [ "$r2" -lt 256 ] && [ "$b1" -lt 256 ] && [ "$b2" -lt 256 ]; then
    # Start with an empty host list
    if [ -f hosts.txt ]; then
        rm -f hosts.txt
    fi
    START=$(date +%s)

    # Collect the nmap report lines for every network in the range
    for n in $(seq "$r1" "$r2"); do
        HOST="$b1.$b2.$n.0/24"
        nmap --max-retries 0 -p T:80,8080 "$HOST" | grep --basic-regexp 'Nmap scan report for' &>> hosts.txt
    done

    # Reduce the report lines to bare IP addresses
    sort hosts.txt | grep --only-matching '[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*' | sort --output=hosts.txt

    # Make sure the results file exists and is writable (a previous run leaves it read-only)
    touch outputs
    chmod 600 outputs

    # Probe each host and record the ones whose page mentions the Enigma web interface
    while read -r line; do
        curl --basic --max-time 1 "$line" | grep --perl-regexp --ignore-case '.*(enIgma weB Interface)' &> /dev/null
        if [ $? -eq 0 ]; then
            echo "$line" &>> outputs
        fi
    done < hosts.txt

    clear
    sort --numeric-sort --unique --output=outputs outputs
    chmod 400 outputs
    END=$(date +%s)
    DIFF=$(expr $END - $START)
    cat outputs
    echo "This script has been executed in $DIFF seconds"
else
    echo 'There is something wrong! Please try again!'
    exit
fi

Edited October 8, 2013 by jazzman1
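For the original PHP question, the "save it, then use DOM" suggestion might look roughly like the sketch below. It is only an illustration: the sample markup is made up, the page is written to a temporary file rather than fetched, and the file is read in one go instead of line by line.

```php
<?php
// Made-up sample markup standing in for a saved results page.
$sample = '<html><body><cite>a.example/one</cite><cite>b.example/two</cite></body></html>';

// Save the page first (here to a temporary file), then work from the file.
$path = tempnam(sys_get_temp_dir(), 'serp');
file_put_contents($path, $sample);

// Real-world HTML is rarely valid, so keep libxml's parser warnings quiet.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($path));

// Collect the text of every <cite> element.
$cites = [];
foreach ($doc->getElementsByTagName('cite') as $node) {
    $cites[] = $node->textContent;
}

unlink($path);   // clean up the temporary file
print_r($cites);
```

Compared with a regex, the DOM parser does not care whether the markup sits on one line or many, which is one reason it tends to be the more robust choice for HTML.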
.josh Posted October 8, 2013

Your first issue is that you are making a secure (SSL) request and you didn't set any configuration options for it. The *easy* way is to turn off peer certificate verification and hope it's enough:

curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

Second, you may need to set cURL to follow redirects:

curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

Third, you may need to fake your user agent. Many sites do not like bots scraping them and look at things like the user agent. IDK if Google accepts it or not (I haven't really had a need to scrape Google results), but they are most certainly aware of it (after all, that's a huge portion of being a search engine: having bots that do exactly what you are attempting to do), and I have a feeling they likely have measures against it, since it undermines the money they make off the PPC and SEO tools they provide.

curl_setopt($curl, CURLOPT_USERAGENT, "user agent string here"); // search for user agent strings and pick a common browser's

If it's still not working, use curl_error and curl_errno to help figure out why cURL is failing.
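Pulled together, the options from this post plus the curl_error / curl_errno check could be sketched as follows. The helper names build_curl_options and fetch_page are hypothetical; only the curl_* functions and CURLOPT_* constants come from PHP itself. Certificate verification is left at its default here, which differs from the quick workaround in the post and is a choice of this sketch.

```php
<?php
// Hypothetical helpers for illustration; not part of any library.

/** Build the option set discussed in the post for a given URL. */
function build_curl_options(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent,
    ];
}

/** Fetch a page, throwing with cURL's own error details on failure. */
function fetch_page(string $url, string $userAgent): string
{
    $curl = curl_init();
    curl_setopt_array($curl, build_curl_options($url, $userAgent));

    $body = curl_exec($curl);
    if ($body === false) {
        // curl_errno() gives the numeric code, curl_error() the message.
        throw new RuntimeException(
            sprintf('cURL error %d: %s', curl_errno($curl), curl_error($curl))
        );
    }

    return $body;
}
```

A call such as fetch_page('https://www.google.se/search?q=horses', 'user agent string here') would then either return the page body or raise an exception that says why the transfer failed, instead of silently handing false to preg_match_all.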
.josh Posted October 8, 2013

Also, a few more things:

1) Are you sure you are even using the right URL? When I try to go to https://www.google.se/#q=horses with JS disabled (cURL will not execute JS), I just get the initial Google page; it doesn't actually give me the search results page. If I change the # to a ?, I still get the same page but with the search field filled in with "horses" - but still no search results. I have to actually click the search button to get to the search results page. On the other hand, if I go to this URL:

https://www.google.se/search?q=horses

I go straight to the SERP.

2) Now about your actual regex:

preg_match_all('#<cite>(.*)</cite>#', $result, $cite)

This is not going to give you the expected results. If you look at the source of the SERP, you will see that Google basically has everything on one line. You are using a greedy match-all .*, so it's going to make a single match from the first <cite> all the way to the last </cite>. You will want to change that to a lazy match-all .*? so that each match only runs up to the next </cite>.
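Both points can be checked offline with a small sketch. The markup below is a made-up one-line sample, not real Google output, and parse_url is used only to show which part of each URL is a fragment and which is a query string.

```php
<?php
// Made-up sample markup on a single line, standing in for a results page.
$html = '<cite>a.example/one</cite><p>x</p><cite>b.example/two</cite>';

// Greedy: .* runs to the LAST </cite>, so everything collapses into one match.
preg_match_all('#<cite>(.*)</cite>#', $html, $greedy);

// Lazy: .*? stops at the NEXT </cite>, giving one match per element.
preg_match_all('#<cite>(.*?)</cite>#', $html, $lazy);

// The part after # is a fragment: it is handled on the client side and is not
// part of the HTTP request. A ?query string, by contrast, is sent to the server.
$fragmentUrl = parse_url('https://www.google.se/#q=horses');
$queryUrl    = parse_url('https://www.google.se/search?q=horses');

print_r($greedy[1]);   // one long capture spanning both elements
print_r($lazy[1]);     // two captures, one per <cite>
print_r($fragmentUrl); // has a 'fragment' key and no 'query' key
print_r($queryUrl);    // has a 'query' key
```

This is why the first URL works in a browser, where JavaScript reads the fragment and loads the results, but returns only the start page to cURL.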
requinix Posted October 9, 2013

If you really want to use Google's search, do it the right way: Custom Search, which also has an API.
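As a rough sketch of what a request to the Custom Search JSON API might look like, the helper below only builds the request URL. The function name custom_search_url is hypothetical, and YOUR_API_KEY and YOUR_ENGINE_ID are placeholders for credentials you would have to create yourself; check the current API documentation for quotas and response format before relying on it.

```php
<?php
// Hypothetical helper for illustration; the key and engine ID are placeholders.
function custom_search_url(string $apiKey, string $engineId, string $query): string
{
    // http_build_query() takes care of URL-encoding the parameter values.
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,    // API key
        'cx'  => $engineId,  // search engine ID
        'q'   => $query,     // search terms
    ]);
}

echo custom_search_url('YOUR_API_KEY', 'YOUR_ENGINE_ID', 'horses'), "\n";
```

The response is JSON, so it can be decoded with json_decode rather than scraped with a regex.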
phplemon Posted October 9, 2013 (Author)

Thank you for the answers. It turns out I only needed to change the Google URL (even though the URL I was using worked in browsers, it did not work with PHP cURL). So problem solved. Thanks.