
Need help with preg match...


physaux


Ok so here is my preg match so far, but I am getting an error that says something to do with a delimiter.

preg_match('/somedomain.com/name-meaning/(^)">/',$pagedata,$matches);
print_r($matches);

 

I really have no clue what I am doing. Could anyone fix that regex for me (and tell me what is wrong)? Thanks!

The / in the middle should be a \/ because it is a special character (delimiter).  Are you trying to return the XXXX into a variable?

 

First: preg_match returns at most one match.

 

preg_match_all will do a better job when you need multiple matches.

 

I'm no regex wiz but I think if you change the / to a \/ you should get the matches you want.

 

Oh yeah and replace (^) with .+

 

so you probably want:

 

preg_match_all('/somedomain.com\/name-meaning\/.+">/',$pagedata,$matches);
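A hedged sketch of the same idea with two small tightenings: the dot in the domain is escaped so it only matches a literal ".", and `[^"]+` stops at the first closing quote instead of letting `.+` run on to the last `">` on the line. The sample `$pagedata` below is made up for illustration, since the real markup isn't shown in the thread, and the capture group is my addition to pull out just the part after name-meaning/.

```php
<?php
// Hypothetical sample input; the real page markup is not shown in the thread.
$pagedata = '<a href="http://somedomain.com/name-meaning/alice">Alice</a> '
          . '<a href="http://somedomain.com/name-meaning/bob">Bob</a>';

// "/" is the delimiter here, so every literal "/" in the pattern is escaped as "\/".
// The parentheses capture only the part after "name-meaning/".
$count = preg_match_all('/somedomain\.com\/name-meaning\/([^"]+)">/', $pagedata, $matches);

print_r($matches[1]); // the captured name parts, one per link
```

$matches[0] still holds the full matched text for each link; $matches[1] holds only what the parentheses captured.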

OK thanks, that one is working now. Now I am trying a second part, which I will now (try to) describe below.

 

second part

 

I need a new regex. Please ignore the previous posts; they were for the first regex. I now have a second one that is causing me problems. The bold heading is just to prevent confusion :)

 

So here is the raw text that this script will be chewing through:

<a href="/browse/letter/a?page=2">2</a>  
<a href="/browse/letter/a?page=3">3</a>  
<a href="/browse/letter/a?page=4">4</a>  
<a href="/browse/letter/a?page=5">5</a> 
<a href="/browse/letter/a?page=6">6</a>  
<a href="/browse/letter/a?page=7">7</a>  
<a href="/browse/letter/a?page=8">8</a>  
<a href="/browse/letter/a?page=9">9</a> 
<a href="/browse/letter/a?page=10">10</a>  
<a href="/browse/letter/a?page=11">11</a>  
<a href="/browse/letter/a?page=12">12</a>  
<a href="/browse/letter/a?page=13">13</a>  
<a href="/browse/letter/a?page=14">14</a>  
<a href="/browse/letter/a?page=15">15</a>

(there are no newline characters, I added them just so that the code above does not get squished into a single line)

 

I want to extract how many pages there are. So I would want the result array to be 2,3,4,...,15

 

So from my understanding, I am looking for something that starts with browse/letter/a?page= and ends with "

 

..right?

 

 

I now tried to change my delimiter to "~", here is what I have so far:

 

$regex = "~browse~/letter~/$letter?page=(.*)\"~Us";
echo $regex."\n\n";
preg_match($regex,$page,$matches);
print_r($matches);

 

But I am getting an error

Unknown modifier '/' 

 

Thanks for the help before! How about this one?

$ is still a special character.

 

I got:

preg_match_all("/browse\/letter\/a\?page=[0-9]+/",$page,$match);

to match and assemble an array such that:

 

Array ( [0] => Array ( [0] => browse/letter/a?page=2 [1] => browse/letter/a?page=3 [2] => browse/letter/a?page=4 [3] => browse/letter/a?page=5 [4] => browse/letter/a?page=6 [5] => browse/letter/a?page=7 [6] => browse/letter/a?page=8 [7] => browse/letter/a?page=9 [8] => browse/letter/a?page=10 [9] => browse/letter/a?page=11 [10] => browse/letter/a?page=12 [11] => browse/letter/a?page=13 [12] => browse/letter/a?page=14 [13] => browse/letter/a?page=15 ) )

 

I'm trying to figure out how to assemble the array of numbers.

 

Changing the delimiter to a character other than the forward slash is always a good idea when working with paths, because otherwise you have to keep escaping forward slashes. The OP's code didn't work because a ~ was put in front of every forward slash (where the escaping backslashes used to be), so the second ~ closed the pattern and the / after it was read as a modifier. All you needed to do was replace the first and last slash (the delimiters) with ~ and completely remove the backslashes that were escaping the forward slashes, leaving...

 

$regex = "~browse/letter/$letter?page=(.*)\"~Us";
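To illustrate the delimiter point, here is a minimal sketch: the two patterns below are the same regex, once with "/" delimiters and escaped slashes, once with "~" delimiters and no slash escaping. The sample string is a single link copied from the markup quoted earlier, and the letter is hard-coded as "a" for simplicity.

```php
<?php
// Sample input modelled on the links quoted earlier in the thread.
$page = '<a href="/browse/letter/a?page=2">2</a>';

// "/" as delimiter: every literal slash inside the pattern needs a backslash.
$withSlash = '/browse\/letter\/a\?page=[0-9]+/';

// "~" as delimiter: only the first and last character change; slashes stay bare.
$withTilde = '~browse/letter/a\?page=[0-9]+~';

preg_match($withSlash, $page, $a);
preg_match($withTilde, $page, $b);

var_dump($a[0] === $b[0]); // both patterns match the same text
```

Either way the "?" still needs its backslash, because it is a quantifier inside the pattern regardless of which delimiter is used.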

 

As andrewgauger has pointed out, $ is a metacharacter, BUT I believe in this instance that is irrelevant. I assume that by using $letter you want a character stored in a variable to be inserted there. Since this is a double-quoted string, PHP will have replaced $letter with its value before the PCRE engine receives the pattern. To capture the numbers, all you need to do is add a capture group around the number part, so combining andrewgauger's pattern with the OP's...

 

$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
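A small sketch of what the PCRE engine actually receives, assuming $letter holds "a" (the thread doesn't show where it is set). The preg_quote() call is optional hardening that I've added, in case the variable ever holds a regex metacharacter; it isn't part of the pattern suggested above.

```php
<?php
$letter = 'a'; // assumed value; in the thread it comes from the surrounding script

// PHP interpolates $letter while building the double-quoted string.
// "\?" is not a PHP escape sequence, so the backslash survives and reaches PCRE.
$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
echo $regex, "\n"; // ~browse/letter/a\?page=([0-9]+)"~

// Optional hardening (my addition): preg_quote() escapes regex metacharacters
// in the variable; "~" is passed so the delimiter gets escaped as well.
$safe = "~browse/letter/" . preg_quote($letter, '~') . "\?page=([0-9]+)\"~";

preg_match($safe, '<a href="/browse/letter/a?page=7">7</a>', $m);
echo $m[1], "\n"; // 7
```

Echoing the finished pattern, as the OP already does, is the quickest way to check that the variable was substituted the way you expected.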

Thanks for the corrections, and the very detailed explanation. It is no longer telling me about an error, but it is still not printing out the data that I wanted. Just to be safe, I printed out the contents of $page on the screen, as well as printed out the contents of $regex, as well as the resulting array.

 

Relevant $page data, copied from "view source" of the output of my code (this is all on one line; I only added newlines to make it easier for you to see):

<p> 
1  
<a href="/browse/letter/a?page=2">2</a>  
<a href="/browse/letter/a?page=3">3</a>  
<a href="/browse/letter/a?page=4">4</a> 
<a href="/browse/letter/a?page=5">5</a>  
<a href="/browse/letter/a?page=6">6</a>  
<a href="/browse/letter/a?page=7">7</a>  
<a href="/browse/letter/a?page=8">8</a>  
<a href="/browse/letter/a?page=9">9</a>  
<a href="/browse/letter/a?page=10">10</a>  
<a href="/browse/letter/a?page=11">11</a>  
<a href="/browse/letter/a?page=12">12</a>  
<a href="/browse/letter/a?page=13">13</a>  
<a href="/browse/letter/a?page=14">14</a>  
<a href="/browse/letter/a?page=15">15</a>  
<a href="/browse/letter/a?page=2">next»</a> 
</p>

 

Here is the printed out $regex

~browse/letter/a\?page=([0-9]+)"~

 

And here is the printed out $matches result

Array
(
    [0] => browse/letter/a?page=2"
    [1] => 2
)

 

And once again, here is all my code:

echo $page;
echo "AFTERPAGE\n\n\n<br/><br/>\n\n";
//$regex = "~browse~/letter~/".$letter."?page=.*\">(.*)<~/a>~Us";
$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
echo $regex."\n\n";
preg_match($regex,$page,$matches);
print_r($matches);

 

Oh, and if it matters, $page is fetched using 'curl'. But I'm sure that works fine, because the outputted values are the same as what I see when I view the URL I'm scraping.

 

Soo, does anyone still see a problem? I want the resulting array to contain 2,3,4,...,14,15. But it doesn't!

To capture the numbers, all you need to do is add a capture group around the number part, so combining andrewgauger's pattern with the OP's...

 

$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";

 

Thanks, use () to designate a capture group--got it!
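Looking at the printed $matches earlier in the thread: it holds a single match because the posted code calls preg_match(), which stops after the first hit, as noted near the top. A minimal sketch with preg_match_all() and the same pattern follows. The $page string is a shortened copy of the quoted markup, $letter = 'a' is assumed from the printed regex, and the de-duplication step is my addition to deal with the "next" link repeating page=2.

```php
<?php
$letter = 'a'; // assumed value, as in the printed regex above
// Shortened copy of the markup quoted in the thread (all on one line).
$page = '<p> 1 <a href="/browse/letter/a?page=2">2</a> '
      . '<a href="/browse/letter/a?page=3">3</a> '
      . '<a href="/browse/letter/a?page=15">15</a> '
      . '<a href="/browse/letter/a?page=2">next»</a> </p>';

$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";

// preg_match_all() keeps scanning after the first hit.
// With the default PREG_PATTERN_ORDER flag, $matches[1] holds every capture-group value.
preg_match_all($regex, $page, $matches);

// The "next" link repeats page=2, so drop duplicates and convert to integers.
$pages = array_values(array_unique(array_map('intval', $matches[1])));
print_r($pages);        // 2, 3, 15 for this shortened sample
echo max($pages), "\n"; // highest page number seen: 15
```

If only the page count is needed, max($pages) is probably enough, since page 1 has no link of its own in the quoted markup.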

Archived

This topic is now archived and is closed to further replies.
