Google REGEX

slpctrl · December 27, 2008

Alright, I'm doing a little test/project thing and what I need to do is first take a randomly given search string. I have to search for that string in Google and then return the URL of a randomly given search result (1-10, since there are 10 results per page). I'm coming along well, but now I'm at the point where I need to find the random search result on the first page (1-10) and store the corresponding URL. Here's the full code:

<?php
//first CURL query
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://site.com/index.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11");
$result = curl_exec($ch);
curl_close($ch);

//regex
preg_match_all("|location of the (.*) search result|U", $result, $nums);
preg_match_all("|location of the search keyword: (.*) goes here|U", $result, $searches);

//variables for the result number and what to search for
$number = $nums[1][0];
$search = $searches[1][0];

//google
$ch1 = curl_init();
curl_setopt($ch1, CURLOPT_URL,"http://www.google.com/search?hl=en&q=" . urlencode($search) . "&btnG=Google+Search&meta=");
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch1, CURLOPT_VERBOSE, 1);
curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch1, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11");
$result1 = curl_exec($ch1);
curl_close($ch1);
preg_match_all("|DON'T KNOW WHAT (.*) GOES HERE|U", $result1, $matches);
?>

Okay, here's how to make sense of all this. I'm doing a challenge on a site; it gives you which search result to return and which search keyword to use. So, I grabbed that from the first page, all that works fine blah blah. So, I did the first page and stored the 2 variables (search result number and keyword), and now I've CURLed Google for the keyword given. Here is the format for how Google gives its links:

<a href="http://www.site.com" class=l onmousedown="return clk(this.href,'','','res','1','')">

Where the whole 'res','1' near the end stands for result 1, etc. Now, I need to use REGEX to find the result number, then match it up with the URL given so that I can return it back to the challenge site using CURL once again, but I'm stuck here. I can't wrap my head around how to use REGEX to do this. Can anyone help me? Thanks a bunch.

corbin · December 28, 2008

~<a href="([^"]+)"~

Then the order in which you match them would be the number of the link.

preg_match_all will probably help you ;p.

slpctrl · January 1, 2009

That didn't work. Alright, here's what I NEED to do. I NEED to get only the 10 links in the array, and here are the rules to what links I need to find:

It must start with http://

it must contain:

nmousedown="return clk(this.href,'','','res','1','')">

^^This is the end of the actual results that I need, it goes to 10 and those are the ONLY LINKS I NEED. If I can just match the links with only that in it, then I'll only have 10 links in the array (0-9) and I can determine which link is which by its position in the array, but so far I haven't been able to do this. I'm still reading up on regex, and I feel as if I've become a virtual pro with it, but I still can't find what I need to match that up. I think the 'mousedown="return clk' etc etc can be cut down so that all I need to do is match: ^<a href="http://(.*)" and I should only need to include maybe as little as 'res','[1-10]'...blah, I dunno. If anyone could still help, that'd be great.

slpctrl · January 1, 2009

I donno, here's what I thought might work:

preg_match_all("|<a href=\"http://(.*)+nmousedown=\"return clk(this.href,'','','res','[0-10],'')\">\z|U", $result, $matches);

You've got the start of the link...must contain <a href="http:// plus an array of characters, and must end (hence \z) with nmousedown=\"return clk(this.href,'','','res','[0-10],'')\">\z. So I'm not sure why I can't get this to work, but it'd be really nice if someone could help.

nrg_alpha · January 1, 2009

preg_match_all("|<a href=\"http://(.*)+nmousedown=\"return clk(this.href,'','','res','[0-10],'')\">\z|U", $result, $matches);

I was just in the neighbourhood and saw your latest pattern, and immediately some aspects jumped out (I'll get to (.*) in a second).

[0-10]??? You do realize that this will only match zeros and ones, right? It is a character class that basically says: a range from 0 to 1, or 0. So instead of the [0-10] found in 'res','[0-10],''), you can try (?:10|[0-9])
You have your function brackets clk(.........), but brackets inside a regex pattern create a set of capturing parentheses... if you want literal brackets, you must escape them: clk\(.......\)
The dots inside your function brackets are match-all metacharacters... if you want a literal dot, you must escape it as well. Example: \.
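
A minimal sketch of the three points above, for anyone who wants to see them run. The sample values and the $attr string are made up for illustration, and [1-9] is used instead of [0-9] on the assumption that result numbers start at 1.

```php
<?php
// Hypothetical values for the result-number slot, not taken from a real page.
$samples = ['0', '1', '5', '10'];

$classOnly   = [];
$alternation = [];
foreach ($samples as $s) {
    // [0-10] is a character class: the range 0-1 plus a literal 0, one character only.
    $classOnly[$s] = preg_match('/^[0-10]$/', $s);
    // Try "10" first, otherwise a single digit from 1 to 9.
    $alternation[$s] = preg_match('/^(?:10|[1-9])$/', $s);
}
print_r($classOnly);   // only 0 and 1 match
print_r($alternation); // 1, 5 and 10 match; 0 does not

// Unescaped ( ) open a capture group and . matches any character;
// escaped, they match the literal text of the JavaScript call.
$attr      = "return clk(this.href,'','','res','1','')";
$unescaped = preg_match('/clk(this.href)/', $attr);  // looks for "clkthis?href", finds nothing
$escaped   = preg_match('/clk\(this\.href/', $attr); // matches the literal "clk(this.href"
var_dump($unescaped, $escaped);
```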

Getting back to (.*), this is simply very bad to use like this... the reason is that you are saying: match anything, zero or more times (and then you have a + next to it, which repeats the whole group one or more times... the logic is flawed). A greedy .* matches all the way to the end of the line you are checking (the dot stops at a newline unless you add the s modifier), then has to start backtracking, relinquishing one matched character at a time and checking what follows (.*)+, which in this case is the n in nmousedown.... You may want to consider using a lazy quantifier instead: (.*?) and if you don't plan on using those captured characters, make it a non-capturing group: (?:.*?). One caveat: your U modifier inverts greediness, so under U a plain .* is already lazy and .*? turns greedy; use the lazy quantifier or the modifier, not both.
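
To make the greedy/lazy difference concrete, here is a small sketch on an invented one-line subject with two links (the example.com-style hosts are placeholders). It also shows the effect of the U modifier and, as an alternative not mentioned above, a negated character class.

```php
<?php
// Hypothetical one-line subject with two links.
$html = '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>';

// Greedy: .* runs to the end of the line, then backtracks to the LAST quote.
preg_match('~href="(.*)"~', $html, $greedy);
// Lazy: .*? grows one character at a time and stops at the FIRST closing quote.
preg_match('~href="(.*?)"~', $html, $lazy);
// Under the U modifier the meanings are swapped, so .*? is greedy again.
preg_match('~href="(.*?)"~U', $html, $swapped);
// A negated class can never cross a quote, with or without U.
preg_match('~href="([^"]+)"~', $html, $negated);

echo $greedy[1], "\n";  // overshoots into the second link
echo $lazy[1], "\n";    // http://a.example/
echo $swapped[1], "\n"; // same overshoot as the greedy version
echo $negated[1], "\n"; // http://a.example/
```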

It gets even worse when you consider that there may be other tags that start with <a (<abbr> or <address>, for example, or perhaps an XML tag).. so it is not a bad idea to use \b (word boundary) after <a... Then, to safeguard yourself from overmatching past the current tag, making use of [^>]* doesn't hurt either.. so without testing this, off the top of my head, I would start with something like this:

preg_match_all('#<a\b[^>]*href="http://(?:.*?)nmousedown="return clk\(this\.href,\'\',\'\',\'res\',\'(?:10|[0-9])\',\'\'\)"[^>]*>#', $result, $matches);

I tend to use single quotes, so I escaped the single quotes inside the pattern (not sure if I got it all correct or not). You will have to comb through this and make sure I didn't mess up any quotes, as I have a sense I may have... Now even assuming the regex above is correct, it will not necessarily guarantee it will work (it has to be perfectly set up to work). But really understand that you cannot simply dump just anything inside a pattern and expect it to work.. you have to rethink how you present ranges of digits, and understand that brackets and dots as literals must be escaped, otherwise they are meta characters that serve very different purposes.

You also have \z (end of subject) at the end of your pattern.. I question this, as you are using preg_match_all (which is meant to find every occurrence of your pattern in the string), and \z lets a match succeed only at the very end of the subject.
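
Pulling those points together, here is a hedged sketch rather than a guaranteed fix. The sample HTML is invented in the link format quoted earlier in the thread, so it only applies to pages that still use that markup. It swaps in [^"]+ for the lazy dot, spells the attribute name out as onmousedown, drops \z and U, assumes result numbers run from 1 to 10, and captures both the URL and the result number.

```php
<?php
// Hypothetical markup in the format quoted earlier in the thread.
$html = <<<'HTML'
<a href="http://www.google.com/preferences">Preferences</a>
<a href="http://first.example/" class=l onmousedown="return clk(this.href,'','','res','1','')">First</a>
<a href="http://second.example/page" class=l onmousedown="return clk(this.href,'','','res','2','')">Second</a>
HTML;

// Group 1 is the URL, group 2 is the result number from the clk() call.
$pattern = '#<a\b[^>]*href="(http://[^"]+)"[^>]*onmousedown="return clk\(this\.href,\'\',\'\',\'res\',\'(10|[1-9])\',\'\'\)"[^>]*>#';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Index the URLs by result number so a given number can be looked up directly.
$urlByResult = [];
foreach ($matches as $m) {
    $urlByResult[(int) $m[2]] = $m[1];
}
print_r($urlByResult);
```

Keying the array by the captured number means the lookup does not depend on the order in which the links appear in the page.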

So with all this untested, I can only rest a hand on your shoulder and bid you good luck... (time to get ready for sleep... it's very, VERY late and I'm tired).

slpctrl · January 1, 2009

Hahahahahahha, yeah I just realized that. I posted this last night and I was kind of toasted from a New Year's party. I'm gonna continue reading up on more regex though.

slpctrl · January 13, 2009

So nobody at all can figure out how to build a regex for only the Google results (i.e. the 10 actual results)? :'( I came back to it, and I guess I can't figure it out.

effigy · January 13, 2009

Does Google not have an API for this?

Try %<a[^>]+?href="http://[^"]+"[^>]+?class=l\s%. You can reduce the results with array_slice.
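
A sketch of how that suggestion might be used, on invented sample HTML in the format quoted earlier in the thread. The capture group around the URL is an addition for illustration, and $number is a hypothetical stand-in for the result number handed out by the challenge page.

```php
<?php
// Hypothetical markup: one non-result link followed by three result links.
$html = <<<'HTML'
<a href="http://www.google.com/preferences">Preferences</a>
<a href="http://first.example/" class=l onmousedown="return clk(this.href,'','','res','1','')">First</a>
<a href="http://second.example/page" class=l onmousedown="return clk(this.href,'','','res','2','')">Second</a>
<a href="http://third.example/" class=l onmousedown="return clk(this.href,'','','res','3','')">Third</a>
HTML;

// Same pattern as suggested, with parentheses added so the URL lands in $m[1].
preg_match_all('%<a[^>]+?href="(http://[^"]+)"[^>]+?class=l\s%', $html, $m);

// Keep at most the first ten result links.
$links = array_slice($m[1], 0, 10);

// Result numbers are 1-based, array indexes are 0-based.
$number = 2; // hypothetical value from the challenge page
$chosen = isset($links[$number - 1]) ? $links[$number - 1] : null;
echo $chosen, "\n";
```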

cwarn23 · January 14, 2009

I have just made a script which will place the first 10 URLs, and only the URLs, in an array. The script is as follows:

preg_match_all('/href="[^"]+/',$result,$matches);
array_splice($matches[0], 10);
$matches=$matches[0];
unset($matches[0]);
$matches=str_replace('href="','',$matches);

And then display them all:

foreach ($matches as $match) {
    echo $match."<br>";
}
unset($match);
