physaux, posted April 18, 2010:

Here is a sample line of what I need to find:

    somedomain.com/name-meaning/XXXXX">

I want to write a preg_match expression that finds the XXXXX for me (it is a name, so roughly 1 to 20 characters). Could someone help me, please?

physaux (author), posted April 19, 2010:

OK, here is my preg_match so far, but I am getting an error that has something to do with a delimiter.

    preg_match('/somedomain.com/name-meaning/(^)">/',$pagedata,$matches);
    print_r($matches);

I really have no clue what I am doing. Could anyone fix that regex for me (and tell me what is wrong)? Thanks!

andrewgauger, posted April 19, 2010:

Each / in the middle should be \/ because / is the delimiter here, which makes it a special character. Are you trying to return the XXXXX into a variable? First: preg_match returns at most one match; preg_match_all does a better job when there are several. I'm no regex wiz, but I think if you change each / to \/ you should get the matches you want. Oh yeah, and replace (^) with .+ so you probably want:

    preg_match_all('/somedomain.com\/name-meaning\/.+">/',$pagedata,$matches);

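A minimal sketch of the escaping point above, run against made-up input. The sample `$pagedata` and the names in it are hypothetical. The pattern also goes a step beyond the post: the escaped dot and the `([^"]+)` capture group are additions assumed here so that only the name is returned, not something the thread has settled on at this point.

```php
<?php
// Hypothetical sample input; the real $pagedata would be the scraped page.
$pagedata = '<a href="http://somedomain.com/name-meaning/Alice">Alice</a> '
          . '<a href="http://somedomain.com/name-meaning/Bob">Bob</a>';

// "/" is the delimiter, so every "/" inside the pattern is escaped as "\/".
// "\." matches only a literal dot, and ([^"]+) captures everything up to the
// closing quote instead of running on to the last quote on the line.
$pattern = '/somedomain\.com\/name-meaning\/([^"]+)">/';

preg_match_all($pattern, $pagedata, $matches);

// $matches[0] holds the full matches, $matches[1] the captured names.
$names = $matches[1];
print_r($names);
```

With a greedy `.+` and no newlines in the input, a single match could stretch from the first link to the last `">` on the line, which is why the character class `[^"]` is used in this sketch.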
physaux (author), posted April 19, 2010:

OK thanks, that one is working now. Now I am trying a second part, which I will (try to) describe below.

SECOND PART: I need a new regex. Please ignore the previous posts; they were for the first regex. I now have a second one that is causing me problems. (The emphasis is just to prevent confusion.)

So here is the raw text that this script will be chewing through:

    <a href="/browse/letter/a?page=2">2</a>
    <a href="/browse/letter/a?page=3">3</a>
    <a href="/browse/letter/a?page=4">4</a>
    <a href="/browse/letter/a?page=5">5</a>
    <a href="/browse/letter/a?page=6">6</a>
    <a href="/browse/letter/a?page=7">7</a>
    <a href="/browse/letter/a?page=8">8</a>
    <a href="/browse/letter/a?page=9">9</a>
    <a href="/browse/letter/a?page=10">10</a>
    <a href="/browse/letter/a?page=11">11</a>
    <a href="/browse/letter/a?page=12">12</a>
    <a href="/browse/letter/a?page=13">13</a>
    <a href="/browse/letter/a?page=14">14</a>
    <a href="/browse/letter/a?page=15">15</a>

(There are no newline characters in the real data; I added them so that the code above does not get squished into a single line.)

I want to extract how many pages there are, so I want the result array to be 2,3,4,...,15. From my understanding, I am looking for something that starts with browse/letter/a?page= and ends with " ...right?

I tried to change my delimiter to "~". Here is what I have so far:

    $regex = "~browse~/letter~/$letter?page=(.*)\"~Us";
    echo $regex."\n\n";
    preg_match($regex,$page,$matches);
    print_r($matches);

But I am getting the error: Unknown modifier '/'

Thanks for the help before! How about this one?

andrewgauger, posted April 19, 2010:

$ is still a special character. I got:

    preg_match_all("/browse\/letter\/a\?page=[0-9]+/",$page,$match);

to match and assemble an array such that:

    Array
    (
        [0] => Array
            (
                [0] => browse/letter/a?page=2
                [1] => browse/letter/a?page=3
                [2] => browse/letter/a?page=4
                [3] => browse/letter/a?page=5
                [4] => browse/letter/a?page=6
                [5] => browse/letter/a?page=7
                [6] => browse/letter/a?page=8
                [7] => browse/letter/a?page=9
                [8] => browse/letter/a?page=10
                [9] => browse/letter/a?page=11
                [10] => browse/letter/a?page=12
                [11] => browse/letter/a?page=13
                [12] => browse/letter/a?page=14
                [13] => browse/letter/a?page=15
            )
    )

I'm trying to figure out how to assemble the array of numbers.

cags, posted April 19, 2010:

Changing the delimiter to a character other than the forward slash is always a good idea when working with paths, because otherwise you have to keep escaping forward slashes. The reason the OP's code didn't work is that a ~ was put in front of every forward slash. The second ~ then ends the pattern, and the / that follows it is read as a modifier, hence the "Unknown modifier '/'" error. All you needed to do was replace the first and last slashes (the delimiters) and completely remove the backslashes that were escaping the forward slashes, leaving...

    $regex = "~browse/letter/$letter?page=(.*)\"~Us";

As andrewgauger has pointed out, $ is a metacharacter, BUT I believe in this instance that is irrelevant. I assume that by using $letter you want a character stored in a variable to be inserted there. Since this is a double-quoted string, that dollar sign will have been evaluated out of the string before the PCRE engine receives the pattern.

To capture the numbers, all you need to do is add a capture group around the number part. Combining andrewgauger's pattern with the OP's (note that the ? is escaped as well, since it is a quantifier):

    $regex = "~browse/letter/$letter\?page=([0-9]+)\"~";

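A short sketch of what the pattern string looks like once PHP has finished with the double-quoted string, assuming `$letter` holds `'a'` as in the sample links. The `preg_quote()` variant is an optional extra that nobody in the thread uses; it is shown only as one way to guard against a variable that contains regex metacharacters.

```php
<?php
// Hypothetical value; in the thread $letter holds the letter being browsed.
$letter = 'a';

// Double-quoted string: PHP substitutes $letter and turns \" into " before
// PCRE ever sees the pattern. "~" is the delimiter, so "/" needs no escaping.
$regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
echo $regex, "\n";

// Optional extra (an assumption, not from the thread): preg_quote() escapes
// regex metacharacters in the variable, including the chosen delimiter.
$safe = '~browse/letter/' . preg_quote($letter, '~') . '\?page=([0-9]+)"~';
echo $safe, "\n";

// Quick check against one link in the format the OP posted.
preg_match($regex, '<a href="/browse/letter/a?page=7">7</a>', $m);
echo $m[1], "\n";
```

For a plain letter such as `a` both strings come out identical; the two only differ when the variable contains characters like `.`, `?` or `~`.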
physaux (author), posted April 19, 2010:

Thanks for the corrections and the very detailed explanation. It is no longer reporting an error, but it is still not printing the data that I wanted. Just to be safe, I printed the contents of $page, the contents of $regex, and the resulting array.

Relevant $page data, copied from "view source" of my script's output (this is all on one line; I only added newlines to make it easier to read):

    <p> 1
    <a href="/browse/letter/a?page=2">2</a>
    <a href="/browse/letter/a?page=3">3</a>
    <a href="/browse/letter/a?page=4">4</a>
    <a href="/browse/letter/a?page=5">5</a>
    <a href="/browse/letter/a?page=6">6</a>
    <a href="/browse/letter/a?page=7">7</a>
    <a href="/browse/letter/a?page=8">8</a>
    <a href="/browse/letter/a?page=9">9</a>
    <a href="/browse/letter/a?page=10">10</a>
    <a href="/browse/letter/a?page=11">11</a>
    <a href="/browse/letter/a?page=12">12</a>
    <a href="/browse/letter/a?page=13">13</a>
    <a href="/browse/letter/a?page=14">14</a>
    <a href="/browse/letter/a?page=15">15</a>
    <a href="/browse/letter/a?page=2">next»</a>
    </p>

Here is the printed $regex:

    ~browse/letter/a\?page=([0-9]+)"~

And here is the printed $matches result:

    Array
    (
        [0] => browse/letter/a?page=2"
        [1] => 2
    )

And once again, here is all my code:

    echo $page;
    echo "AFTERPAGE\n\n\n<br/><br/>\n\n";
    //$regex = "~browse~/letter~/".$letter."?page=.*\">(.*)<~/a>~Us";
    $regex = "~browse/letter/$letter\?page=([0-9]+)\"~";
    echo $regex."\n\n";
    preg_match($regex,$page,$matches);
    print_r($matches);

Oh, and if it matters, $page is fetched using curl. I'm sure that part works, because the output values are the same as what I see when I view the URL I'm scraping.

So, does anyone still see a problem? I want the resulting array to contain 2,3,4,...,14,15, but it doesn't!

cags, posted April 19, 2010:

Note the use of preg_match_all in andrewgauger's code. preg_match only returns the first match; you want *all* matches, so you should use preg_match_all. You will then want the contents of $matches[1].

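The difference can be sketched side by side on a shortened, hypothetical copy of the page data. The de-duplication and `max()` steps at the end are assumptions about what "how many pages there are" means in practice; they are not something the thread spells out.

```php
<?php
// Shortened, hypothetical stand-in for the scraped $page (all on one line).
$page = '<p> 1 <a href="/browse/letter/a?page=2">2</a> '
      . '<a href="/browse/letter/a?page=3">3</a> '
      . '<a href="/browse/letter/a?page=4">4</a> '
      . '<a href="/browse/letter/a?page=2">next</a> </p>';

$regex = '~browse/letter/a\?page=([0-9]+)"~';

// preg_match() stops after the first match.
preg_match($regex, $page, $first);

// preg_match_all() collects every match; index 1 holds the first capture group.
preg_match_all($regex, $page, $all);

// Assumed post-processing: cast to integers and drop the duplicate that the
// "next" link introduces, then take the highest page number.
$pages = array_values(array_unique(array_map('intval', $all[1])));
$lastPage = max($pages);

print_r($first);
print_r($pages);
echo $lastPage, "\n";
```

Since page 1 is not a link in the sample markup, it never appears in the captured list; the highest captured number is used here as the page count.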
physaux (author), posted April 19, 2010:

Yippee :D Thanks, it works perfectly now!

andrewgauger, posted April 19, 2010:

cags wrote:

    To capture the numbers all you need to do is add a capture group around the number part, so combining andrewgaugers pattern with the OPs....
    $regex = "~browse/letter/$letter\?page=([0-9]+)\"~";

Thanks, use () to designate a capture group. Got it!