Need help with preg match...

physaux · April 18, 2010

Here is a sample line of what I would have to find:

somedomain.com/name-meaning/XXXXX">

I want to write a preg_match expression that finds the XXXXX for me (it is a name, so roughly 1 to 20 characters).

Could someone help me please!

physaux · April 19, 2010

Ok so here is my preg match so far, but I am getting an error that says something to do with a delimiter.

preg_match('/somedomain.com/name-meaning/(^)">/',$pagedata,$matches);
print_r($matches);

I really have no clue what I am doing. Could anyone please fix that regex expression for me please (and tell me what is wrong)? Thanks!

andrewgauger · April 19, 2010

The / in the middle should be \/ because / is the delimiter here, so an unescaped one ends the pattern early. Are you trying to return the XXXXX into a variable?

First: preg_match returns at most one match.

preg_match_all will do a better job if you need multiple matches.

I'm no regex wiz, but I think if you change the / to \/ you should get the matches you want.

Oh yeah, and replace (^) with .+

so you probably want:

preg_match_all('/somedomain.com\/name-meaning\/.+">/',$pagedata,$matches);
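To get just the XXXXX part rather than the whole matched string, a capture group is one option. This is only a minimal sketch, assuming the name never contains a double quote. The $pagedata sample is made up for illustration, and ~ is used as the delimiter so the slashes need no escaping:

```php
<?php
// Hypothetical sample input; the real $pagedata comes from the scraped page.
$pagedata = '<a href="http://somedomain.com/name-meaning/Alice">Alice</a> '
          . '<a href="http://somedomain.com/name-meaning/Bob">Bob</a>';

// ([^"]+) captures one or more characters up to (not including) the closing quote.
// The dot in "somedomain.com" is escaped so it matches a literal dot only.
$count = preg_match_all('~somedomain\.com/name-meaning/([^"]+)">~', $pagedata, $matches);

// $matches[0] holds the full matches, $matches[1] holds the captured names.
print_r($matches[1]);
```

Using [^"]+ instead of .+ keeps the match from running past the first closing quote when several links sit on the same line.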

physaux · April 19, 2010

OK thanks, that one is working now. Now I am trying a second part, which I will now (try to) describe below.

Second part

I need a new regex. Please ignore the previous posts; they were for the first regex. I now have a second one that is causing me problems. (The bold heading is just to prevent confusion.)

So here is the raw text that this script will be chewing through:

<a href="/browse/letter/a?page=2">2</a>  
<a href="/browse/letter/a?page=3">3</a>  
<a href="/browse/letter/a?page=4">4</a>  
<a href="/browse/letter/a?page=5">5</a> 
<a href="/browse/letter/a?page=6">6</a>  
<a href="/browse/letter/a?page=7">7</a>  
<a href="/browse/letter/a?page=8">8</a>  
<a href="/browse/letter/a?page=9">9</a> 
<a href="/browse/letter/a?page=10">10</a>  
<a href="/browse/letter/a?page=11">11</a>  
<a href="/browse/letter/a?page=12">12</a>  
<a href="/browse/letter/a?page=13">13</a>  
<a href="/browse/letter/a?page=14">14</a>  
<a href="/browse/letter/a?page=15">15</a>

(there are no newline characters, I added them just so that the code above does not get squished into a single line)

I want to extract how many pages there are. So I would want the result array to be 2,3,4,...,15

So from my understanding, I am looking for something that starts with browse/letter/a?page= and ends with "

..right?

I now tried to change my delimiter to "~", here is what I have so far:

$regex = "~browse~/letter~/$letter?page=(.*)\"~Us";
echo $regex."\n\n";
preg_match($regex,$page,$matches);
print_r($matches);

But I am getting an error

Unknown modifier '/'

Thanks for the help before! How about this one?

andrewgauger · April 19, 2010

$ is still a special character.

I got:

preg_match_all("/browse\/letter\/a\?page=[0-9]+/",$page,$match);

to match and assemble an array such that:

Array
(
    [0] => Array
        (
            [0] => browse/letter/a?page=2
            [1] => browse/letter/a?page=3
            [2] => browse/letter/a?page=4
            [3] => browse/letter/a?page=5
            [4] => browse/letter/a?page=6
            [5] => browse/letter/a?page=7
            [6] => browse/letter/a?page=8
            [7] => browse/letter/a?page=9
            [8] => browse/letter/a?page=10
            [9] => browse/letter/a?page=11
            [10] => browse/letter/a?page=12
            [11] => browse/letter/a?page=13
            [12] => browse/letter/a?page=14
            [13] => browse/letter/a?page=15
        )

)

I'm trying to figure out how to assemble the array of numbers.

cags · April 19, 2010

Changing the delimiter to a character other than the forward slash is a good idea when working with paths, because otherwise every forward slash in the pattern has to be escaped. The reason the OP's code didn't work is that a ~ was also put in place of each escaping backslash, so the pattern ended at the second ~ and the / that followed was read as a modifier. All you needed to do was replace the first and last forward slash (the delimiters) and completely remove the backslashes that were escaping the other forward slashes, leaving...

$regex = "~browse/letter/$letter?page=(.*)\"~Us";

As andrewgauger has pointed out, $ is a metacharacter, BUT I believe in this instance that is irrelevant. I assume that by using $letter you want a character stored in a variable to be inserted there. Since this is a double-quoted string, PHP will have replaced $letter with its value before the PCRE engine receives the pattern. The ? does need escaping, though, because unescaped it only makes the preceding character optional. To capture the numbers, all you need to do is add a capture group around the number part, so combining andrewgauger's pattern with the OP's...

$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
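As a quick check of what the PCRE engine actually receives, here is a small sketch. It assumes $letter holds a single letter such as 'a'. The preg_quote call is an extra precaution for the case where the variable might contain regex metacharacters; it is not needed for a plain letter.

```php
<?php
$letter = 'a'; // assumed value, for illustration

// Double-quoted string: PHP builds the final pattern before PCRE sees it.
// "\?" is not a PHP escape sequence, so the backslash survives and escapes the "?".
$regex = "~browse/letter/" . preg_quote($letter, '~') . "\?page=([0-9]+)\"~";
echo $regex, "\n";

// With "~" as the delimiter, the pattern ends at the second unescaped "~";
// anything after it is read as modifiers. That is why "~browse~/letter..."
// failed with "Unknown modifier '/'".
$ok = preg_match($regex, '<a href="/browse/letter/a?page=7">7</a>', $m);
print_r($m);
```

Echoing the pattern like this is an easy way to confirm that the variable and the escapes ended up where you expected.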

physaux · April 19, 2010

Thanks for the corrections, and the very detailed explanation. It is no longer telling me about an error, but it is still not printing out the data that I wanted. Just to be safe, I printed out the contents of $page on the screen, as well as printed out the contents of $regex, as well as the resulting array.

RELEVANT $page DATA COPIED FROM "view source" of the output of my code(This is all on one line. I only added new lines to make it easier for you to see):

<p> 
1  
<a href="/browse/letter/a?page=2">2</a>  
<a href="/browse/letter/a?page=3">3</a>  
<a href="/browse/letter/a?page=4">4</a> 
<a href="/browse/letter/a?page=5">5</a>  
<a href="/browse/letter/a?page=6">6</a>  
<a href="/browse/letter/a?page=7">7</a>  
<a href="/browse/letter/a?page=8">8</a>  
<a href="/browse/letter/a?page=9">9</a>  
<a href="/browse/letter/a?page=10">10</a>  
<a href="/browse/letter/a?page=11">11</a>  
<a href="/browse/letter/a?page=12">12</a>  
<a href="/browse/letter/a?page=13">13</a>  
<a href="/browse/letter/a?page=14">14</a>  
<a href="/browse/letter/a?page=15">15</a>  
<a href="/browse/letter/a?page=2">next»</a> 
</p>

Here is the printed out $regex

~browse/letter/a\?page=([0-9]+)"~

And here is the printed out $matches result

Array
(
    [0] => browse/letter/a?page=2"
    [1] => 2
)

And once again, here is all my code:

echo $page;
echo "AFTERPAGE\n\n\n<br/><br/>\n\n";
//$regex = "~browse~/letter~/".$letter."?page=.*\">(.*)<~/a>~Us";
$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
echo $regex."\n\n";
preg_match($regex,$page,$matches);
print_r($matches);

Oh, and if it matters, $page is fetched using cURL. I'm sure that part works fine, because the printed values are the same as what I see when I view the URL I'm scraping.

So, does anyone still see a problem? I want the resulting array to contain 2,3,4,...,14,15. But it doesn't!

cags · April 19, 2010

Note the use of preg_match_all in andrewgauger's code. preg_match stops after the first match; you want *all* matches, so you should use preg_match_all. You will then want the contents of $matches[1].
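Roughly, that would look like the sketch below. The $page string is a shortened version of the markup posted above, and $letter is assumed to be 'a'. Dropping duplicates and taking the highest number are my own additions, since the "next" link repeats page 2:

```php
<?php
// Shortened sample of the markup from the thread (all on one line in the real page).
$page = '<p> 1 <a href="/browse/letter/a?page=2">2</a> '
      . '<a href="/browse/letter/a?page=3">3</a> '
      . '<a href="/browse/letter/a?page=15">15</a> '
      . '<a href="/browse/letter/a?page=2">next»</a> </p>';
$letter = 'a'; // assumed value

$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
preg_match_all($regex, $page, $matches);

// $matches[1] contains every captured page number as a string.
// Convert to integers and drop the duplicate caused by the "next" link.
$pages = array_values(array_unique(array_map('intval', $matches[1])));
print_r($pages);

// The highest linked page number is one way to get the page count.
$lastPage = max($pages);
echo $lastPage, "\n";
```

$matches[0] still holds the full matched strings, so only index 1 is needed for the numbers.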

physaux · April 19, 2010

yipee :D

Thanks, it works perfectly now!

andrewgauger · April 19, 2010

cags said:
To capture the numbers all you need to do is add a capture group around the number part, so combining andrewgauger's pattern with the OP's...
$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";

Thanks, use () to designate a capture group. Got it!
