Preg_match a google search - help

phplemon · October 8, 2013

Hi, I'm trying to get some search results from google by using cURL and preg_match.

<?php

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, "https://www.google.se/#q=horses");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

    $result = curl_exec($curl);
    curl_close($curl);

    if (preg_match_all('#<cite>(.*)</cite>#', $result, $cite))
    {
        foreach ($cite[0] as $cite)
        {
            echo $cite . '<br />';
        }
    }
?>

It doesn't work. I've used this code on other websites to get other things, and it works there. What is the problem?

Thank you

jazzman1 · October 8, 2013

Save the HTML output of this search URL to a file. Then read the contents of that file (line by line) and use DOM or a regex to extract the data you want.

I wrote the Bash script below for a friend; it searches the web for open ports (for satellite TVs). You may get some ideas from it.

Either way, you need to save the data somewhere before you process it, not just display it in the browser.

#!/bin/bash

# Scan a range of /24 networks for hosts with port 80 or 8080 open,
# then keep the hosts whose web page mentions "Enigma Web Interface".

echo -n 'Enter your ip address and its range: '
read -r b1 b2 r1 r2

if [ "$r1" -lt 256 ] && [ "$r2" -lt 256 ] && [ "$b1" -lt 256 ] && [ "$b2" -lt 256 ]; then

    # Start from an empty host list.
    rm -f hosts.txt

    START=$(date +%s)

    for n in $(seq "$r1" "$r2"); do
        HOST="$b1.$b2.$n.0/24"
        nmap --max-retries 0 -p T:80,8080 "$HOST" | grep --basic-regexp 'Nmap scan report for' &>> hosts.txt
    done

    # Reduce each report line to its bare IP address.
    sort hosts.txt | grep --only-matching '[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*' | sort --output=hosts.txt

    # The results file is left read-only at the end of a run,
    # so create it if needed and make it writable again.
    [ -f outputs ] || touch outputs
    chmod 600 outputs

    while read -r line; do
        # Record the host if its page contains the marker text.
        if curl --basic --max-time 1 "$line" | grep --perl-regexp --ignore-case 'enigma web interface' &> /dev/null; then
            echo "$line" >> outputs
        fi
    done < hosts.txt

    clear

    sort --numeric-sort --unique --output=outputs outputs
    chmod 400 outputs

    END=$(date +%s)
    DIFF=$((END - START))

    cat outputs
    echo "This script has been executed in $DIFF seconds"

else

    echo 'There is something wrong! Please try again!'
    exit 1

fi
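
For the DOM route mentioned at the top of this post, a minimal PHP sketch might look like the following. It assumes the page has already been fetched (or read back from a saved file) into a string, and that the values of interest sit in <cite> elements, as in the original question. The function name cite_texts is made up for illustration.

```php
<?php
// Sketch: pull the text of every <cite> element out of an HTML string
// using DOMDocument instead of a regex.
// cite_texts() is a hypothetical helper, not part of any library.
function cite_texts(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    // Real-world markup is rarely valid; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('cite') as $node) {
        // textContent drops nested tags such as <b>...</b>
        $texts[] = trim($node->textContent);
    }
    return $texts;
}
```

Usage would be along the lines of cite_texts(file_get_contents('saved_page.html')), where saved_page.html is a placeholder name for wherever the output was stored.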

.josh · October 8, 2013

Your first issue is that you are making a secure (SSL) request and you didn't set any configuration options for it. The *easy* way is to try this and hope it's enough:

curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

2nd, you may possibly need to set cURL to follow redirects:

curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

3rd, you may possibly need to fake your user agent. Many sites do not like bots scraping them and look at things like the user agent. I don't know whether Google accepts cURL's default or not (I haven't really had a need to scrape Google results), but they are most certainly aware of scraping (after all, having bots that do exactly what you are attempting is a huge part of being a search engine), and I have a feeling they likely have measures against it, since it undermines the money they make off the PPC and SEO tools they provide.

curl_setopt($curl, CURLOPT_USERAGENT, "user agent string here"); // search for user agent strings and pick a common browser's

If it's still not working, use curl_error and curl_errno to help figure out why cURL is failing.
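
To make the curl_error / curl_errno suggestion concrete, here is a rough sketch of a fetch helper that reports why a request failed. The function name fetch_page, the default user agent string, and the timeout values are placeholders chosen for this example. It leaves SSL verification at its default, so the CURLOPT_SSL_VERIFYPEER line above would have to be added separately if it turns out to be needed.

```php
<?php
// Sketch: fetch a URL and return the body together with cURL's error info.
// fetch_page() and the default user agent are hypothetical placeholders.
function fetch_page(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)'): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);     // seconds, arbitrary choice
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);           // seconds, arbitrary choice

    $body = curl_exec($curl); // false on failure

    // errno is 0 and error is '' when the transfer succeeded.
    // The handle is released when $curl goes out of scope.
    return [
        'body'  => $body === false ? null : $body,
        'errno' => curl_errno($curl),
        'error' => curl_error($curl),
    ];
}
```

A caller could then check the result before running any regex, for example: if $page['errno'] !== 0, print $page['error'] and stop, rather than passing a failed response to preg_match_all.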

.josh · October 8, 2013

Also a few more things:

1) Are you sure you are even using the right URL? When I go to https://www.google.se/#q=horses with JS disabled (cURL will not execute JS), I just get the initial Google page; it doesn't actually give me the search results page. If I change the # to a ?, I still get the same page but with the search field filled in with "horses", and still no search results. I have to actually click the search button to get to the search results page. On the other hand, if I go to https://www.google.se/search?q=horses I go straight to the SERP.
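
The reason the two URLs behave so differently is that everything after # is a fragment, which is handled by the browser (here, by JavaScript) and never sent to the server. A small sketch with parse_url shows the difference, and http_build_query can assemble the /search form of the URL. The helper names url_parts and build_search_url are invented for this example.

```php
<?php
// Sketch: show which part of a URL is query (sent to the server)
// and which is fragment (kept client-side).
// url_parts() and build_search_url() are hypothetical helpers.
function url_parts(string $url): array
{
    $parts = parse_url($url);
    return [
        'path'     => $parts['path'] ?? '',
        'query'    => $parts['query'] ?? '',
        'fragment' => $parts['fragment'] ?? '',
    ];
}

// Build the results-page URL, letting http_build_query encode the term.
function build_search_url(string $term): string
{
    return 'https://www.google.se/search?' . http_build_query(['q' => $term]);
}
```

url_parts('https://www.google.se/#q=horses') gives an empty query and the fragment 'q=horses', while build_search_url('horses') returns https://www.google.se/search?q=horses, which is the URL that leads straight to the results.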

2) Now about your actual regex..

preg_match_all('#<cite>(.*)</cite>#', $result, $cite)

This is not going to give you the expected results. If you view the source of the SERP, you will see that Google has basically everything on one line. You are using a greedy match-all .*, so it will produce a single match running from the first <cite> all the way to the last </cite>. (On pages where each <cite> sits on its own line this goes unnoticed, because . does not match newlines by default.) Change it to a lazy match-all .*? so that it only matches up to the first </cite>.
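
The difference is easy to see on a one-line sample. The sketch below runs both patterns against a made-up string; the helper name cite_matches and the sample markup are assumptions for illustration, not real Google output. It returns capture group 1, which holds the text without the surrounding tags.

```php
<?php
// Sketch: compare greedy (.*) and lazy (.*?) matching of <cite> contents.
// cite_matches() is a hypothetical helper.
function cite_matches(string $html, bool $lazy): array
{
    $pattern = $lazy ? '#<cite>(.*?)</cite>#' : '#<cite>(.*)</cite>#';
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // group 1: the text between the tags
}

// Made-up single-line markup with two <cite> elements.
$sample = '<cite>a.example</cite><span>x</span><cite>b.example</cite>';

$greedy = cite_matches($sample, false);
$lazy   = cite_matches($sample, true);
```

Here $greedy holds one element, 'a.example</cite><span>x</span><cite>b.example', while $lazy holds the two separate values 'a.example' and 'b.example'.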

requinix · October 9, 2013

If you really want to use Google's search, do it the right way: Custom Search, which also has an API.
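
As a rough idea of what the API route involves: the Custom Search JSON API is called with an API key, a search engine ID, and the query, and it returns JSON whose items entries each carry a link. The sketch below only builds the request URL and extracts links from a response body. The helper names and the placeholder credentials are hypothetical, and the API documentation should be checked for current parameters and quotas.

```php
<?php
// Sketch: build a Custom Search JSON API request URL and read links
// from a response. Both helper names are hypothetical.
function custom_search_url(string $apiKey, string $engineId, string $term): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,   // your API key (placeholder)
        'cx'  => $engineId, // your search engine ID (placeholder)
        'q'   => $term,
    ]);
}

// Collect the "link" field of each entry under "items", if present.
function result_links(string $json): array
{
    $data = json_decode($json, true);
    $links = [];
    foreach ($data['items'] ?? [] as $item) {
        if (isset($item['link'])) {
            $links[] = $item['link'];
        }
    }
    return $links;
}
```

The response body itself can be fetched with the same cURL calls used earlier in the thread, then passed to result_links, so no HTML parsing or regex is needed.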

phplemon · October 9, 2013

Thank you for the answers.

Turns out I only needed to change the Google URL (even though the URL I was using worked in browsers, it did not work with PHP cURL).

So problem solved. Thanks.
