Preg_match a google search - help


Hi, I'm trying to get some search results from Google using cURL and preg_match_all.


    $curl = curl_init();
    curl_setopt ($curl, CURLOPT_URL, "https://www.google.se/#q=horses");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

    $result = curl_exec ($curl);
    curl_close ($curl);

    if (preg_match_all('#<cite>(.*)</cite>#', $result, $cite))
        foreach ($cite[0] as $cite)
            echo $cite . '<br />';

It doesn't work. I've used this code on other websites to get other things, and it works there. What is the problem?


Thank you


Save the HTML output of this search URL to a file. Then read the file's content (line by line) and use DOM or a regex to extract the data you want.
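As a rough illustration of the save-then-parse idea, here is a minimal PHP sketch. The sample HTML, the temp file, and the variable names are all made up for the example; a real results page will be messier, and its markup may not use `<cite>` the same way.

```php
<?php
// Hypothetical stand-in for a saved results page (not real Google markup).
$html = '<html><body>'
      . '<div><cite>a.example/one</cite></div>'
      . '<div><cite>b.example/two</cite></div>'
      . '</body></html>';

// Step 1: save the HTML to a file before handling it.
$file = tempnam(sys_get_temp_dir(), 'serp');
file_put_contents($file, $html);

// Step 2: load the file with DOM instead of matching it with a regex.
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
libxml_clear_errors();

// Step 3: collect the text of every <cite> element.
$urls = [];
foreach ($doc->getElementsByTagName('cite') as $node) {
    $urls[] = trim($node->textContent);
}

unlink($file); // clean up the temp file

foreach ($urls as $url) {
    echo $url, PHP_EOL;
}
```

The DOM route avoids the greedy/lazy regex pitfalls entirely, at the cost of needing PHP's DOM extension.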


I wrote the script below in Bash for a friend. It searches for open ports on the web (for satellite TV receivers), and you can get some ideas from it.


Either way, you definitely have to save the data somewhere before you process it, not just display it in the browser.


    echo -n 'Enter your ip address and its range: '
    read -r b1 b2 r1 r2

    if [ "$r1" -lt 256 ] && [ "$r2" -lt 256 ] && [ "$b1" -lt 256 ] && [ "$b2" -lt 256 ]; then
        # Start from a clean host list
        if [ -f hosts.txt ]; then
            rm -f hosts.txt
        fi

        START=$(date +%s)

        for n in $(seq "$r1" "$r2"); do
            # NOTE: the line that assigns $HOST did not survive in this post;
            # it was presumably built from $b1, $b2 and $n.
            nmap --max-retries 0 -p T:80,8080 "$HOST" | grep --basic-regexp 'Nmap scan report for' &>> hosts.txt
        done

        # Keep only the IP addresses from the nmap report lines
        sort hosts.txt | grep --only-matching '[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*' | sort --output=hosts.txt

        while read -r line; do
            curl --basic --max-time 1 "$line" | grep --perl-regexp '.*(enIgma weB Interface)' --ignore-case &> /dev/null
            if [ $? -eq 0 ]; then
                # A previous run leaves outputs read-only, so make it writable first
                [ -f outputs ] && chmod 600 outputs
                echo "$line" >> outputs
            fi
        done < hosts.txt

        sort --numeric-sort --unique --output=outputs outputs
        chmod 400 outputs

        END=$(date +%s)
        DIFF=$(expr $END - $START)

        cat outputs
        echo "This script has been executed in $DIFF seconds"
    else
        echo 'There is something wrong! Please try again!'
    fi


Edited by jazzman1

Your first issue is that you are making a secure (SSL) request and you didn't set any configuration options for it. The *easy* way is to turn off peer certificate verification and hope it's enough:


    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

2nd, you may need to set cURL to follow redirects:


    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

3rd, you may need to fake your user agent. Many sites do not like bots scraping them and check things like the user agent. I don't know whether Google accepts cURL's default or not (I haven't really needed to scrape Google results), but they are certainly aware of scraping (after all, having bots that do exactly what you are attempting is a huge part of being a search engine). I have a feeling they have measures against it, since it undermines the money they make from the PPC and SEO tools they provide.


    curl_setopt($curl, CURLOPT_USERAGENT, "user agent string here"); // search for user agent strings and pick a common browser's

If it's still not working, use curl_error and curl_errno to help figure out why cURL is failing.
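Putting those options together with the error check, a minimal sketch might look like the following. The function name `fetch_page` and the user agent string are placeholders I made up, and whether these settings are enough for Google specifically is untested.

```php
<?php
/**
 * Fetch a URL using the cURL options discussed above.
 * fetch_page() is a hypothetical helper name, not part of cURL.
 */
function fetch_page(string $url): string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    // Quick workaround from above: skips certificate verification.
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    // Placeholder value; substitute a common browser's user agent string.
    curl_setopt($curl, CURLOPT_USERAGENT, 'ExampleBrowser/1.0 (placeholder)');

    $result = curl_exec($curl);
    if ($result === false) {
        // curl_errno() gives the numeric code, curl_error() the readable message.
        throw new RuntimeException(
            sprintf('cURL error %d: %s', curl_errno($curl), curl_error($curl))
        );
    }
    return $result;
}
```

Usage would be something like `$result = fetch_page($url);` inside a `try`/`catch`, so a failed request shows the cURL error message instead of silently handing an empty result to the regex.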

Also a few more things:


1) Are you sure you are even using the right URL? When I go to https://www.google.se/#q=horses with JS disabled (cURL will not execute JS), I just get the initial Google page; it doesn't actually give me the search results page. Everything after the # is a URL fragment, which is never sent to the server, so the request is really just for the homepage. If I change the # to a ?, I still get the same page but with the search field filled in with "horses", and still no search results. I have to actually click the search button to get to the search results page. On the other hand, if I go to this URL: https://www.google.se/search?q=horses I go straight to the SERP.
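To see why the original URL cannot work from cURL, here is a small sketch that splits it with `parse_url` and then builds the `/search?q=` form instead. The variable names are just for illustration.

```php
<?php
$original = 'https://www.google.se/#q=horses';
$parts = parse_url($original);

// The search term lives in the fragment, which stays on the client side.
echo 'path: ', $parts['path'], PHP_EOL;         // path: /
echo 'fragment: ', $parts['fragment'], PHP_EOL; // fragment: q=horses

// Build the URL that goes straight to the results page.
// http_build_query() also takes care of URL-encoding the search term.
$searchUrl = 'https://www.google.se/search?' . http_build_query(['q' => 'horses']);
echo $searchUrl, PHP_EOL; // https://www.google.se/search?q=horses
```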


2) Now about your actual regex..


    preg_match_all('#<cite>(.*)</cite>#', $result, $cite)

This is not going to give you the expected results. If you view the source of the SERP, you will see that Google puts basically everything on one line. You are using a greedy match-all .*, so it produces a single match that runs from the first <cite> all the way to the last </cite>. You will want to change that to a lazy match-all .*? so that each match stops at the first </cite> it reaches.
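The difference is easy to see on a small one-line sample. The HTML string below is invented for the example and is much simpler than a real results page.

```php
<?php
// Hypothetical one-line sample with two <cite> elements.
$result = '<cite>a.example/one</cite><b>x</b><cite>b.example/two</cite>';

// Greedy: .* runs to the LAST </cite>, so everything collapses into one match.
$greedyCount = preg_match_all('#<cite>(.*)</cite>#', $result, $greedy);
echo $greedyCount, PHP_EOL; // 1

// Lazy: .*? stops at the FIRST </cite>, giving one match per element.
$lazyCount = preg_match_all('#<cite>(.*?)</cite>#', $result, $lazy);
echo $lazyCount, PHP_EOL; // 2

// Index 1 holds the captured text without the tags; index 0 holds the full match.
foreach ($lazy[1] as $url) {
    echo $url, PHP_EOL;
}
```

Note that looping over index 0, as in the original code, would print the matches with their `<cite>` tags still attached.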
