thewired Posted March 21, 2009
I am scraping the source of a site and want to grab the text that says "WANT". I am a regex noob; can I get help writing a regex for this? The code looks like this, with the TD on a separate line. WANT and the random.com values will be different every time.
<td class="file">
<a href="random.com" title="random.com">WANT</a>
DJTim666 Posted March 21, 2009
Try
<?php
$stringToSearch = "<a href=\"lalala.com\" title=\"lalala.com\">Go Here</a>";
// $matches receives the full match followed by each captured group
preg_match("/\<a href\=\"(.*?)\" title\=\"(.*?)\"\>(.*?)\<\/a\>/i", $stringToSearch, $matches);
print_r($matches);
?>
Simple code, really. $matches[0] will be the whole match, $matches[1] the first captured group, $matches[2] the second, and so on.
thewired Posted March 21, 2009 Author
Thanks, I appreciate the response. That is pretty close to what I want; however, it will grab any link on the page, and I only want links preceded by the line: <td class="file"> A little more help?
thewired Posted March 21, 2009 Author
I tried using this for the regex, but no luck. I am assuming it is because the td class tag is on a separate line.
'/\<td class\=\"file\"\>\<a href\=\"(.*?)\" title\=\"(.*?)\"\>(.*?)\<\/a\>/i'
.josh Posted March 21, 2009
'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>~is'
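For anyone following along, here is a minimal sketch of how that pattern might be used. The sample markup and the variable names ($html, $matches) are assumptions for illustration; the real page is not shown in the thread.

```php
<?php
// Hypothetical sample markup: one link inside the "file" cell, one outside it.
$html = <<<HTML
<td class="file">
    <a href="random.com" title="random.com">WANT</a>
</td>
<a href="other.com" title="other.com">IGNORED</a>
HTML;

// ~ as the delimiter means the / in </a> needs no escaping.
// \s* absorbs the line break and indentation between <td> and <a>.
$pattern = '~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>~is';

// PREG_SET_ORDER groups the results per match:
// [0] full match, [1] href, [2] title, [3] link text.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[3], ' => ', $m[1], PHP_EOL;
}
```

Only the link that directly follows the "file" cell is picked up; the second link is skipped because nothing before it satisfies the <td class="file"> part of the pattern.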
DJTim666 Posted March 21, 2009
<?php
$stringToSearch = "<td class=\"file\">\n<a href=\"lalala.com\" title=\"lalala.com\">Go Here</a>";
// \n matches exactly one line feed, so indentation or \r\n between the tags will still break this
preg_match("/\<td class\=\"file\"\>\n\<a href\=\"(.*?)\" title\=\"(.*?)\"\>(.*?)\<\/a\>/i", $stringToSearch, $matches);
print_r($matches);
?>
thewired Posted March 22, 2009 Author
Thanks for the help, guys. I tried DJTim's and Crayon's patterns; DJTim, your new code didn't work, so I'm using Crayon's. Can someone explain the difference between ([^"]*) and (.*?)? Also, it seems to work as is, but does it need the backslashes like DJTim's code?
.josh Posted March 22, 2009
Backslashes are used to escape things. For instance, if you have this: $string = "some "random" thing"; you are going to get a parse error, because php will think the 2nd quote is the end of the string. In order to tell php that no, that's not the end of the string, you escape it like this: $string = "some \"random\" thing"; That is the general principle of the backslash.

Within a regex pattern, there are several things that need to be escaped. For one thing, quotes you may be trying to match within the pattern, just like I mentioned above. I don't have the quotes escaped in the pattern I gave, because I used single quotes around the pattern. Since I used single quotes, the double quotes don't need to be escaped, because php doesn't match single quotes to double quotes like that. Now, if there was a single quote in the pattern, I would have had to escape it, since I used single quotes around the pattern.

Next thing is the pattern delimiter. The delimiter is what tells the regex engine where the pattern starts and ends. You can use pretty much any non-alphanumeric character for the pattern delimiter (other than a backslash or whitespace). DJ chose to use / as the delimiter. Since he chose that, he has to escape any instance of it in the pattern (like in closing html tags), so that the regex engine knows, for instance, that the / in </a> is not the end of the pattern, but part of the pattern. So it would have to look like this: <\/a>. / is a pretty common character to pop up in patterns, because running regexes on html content is pretty common. I usually use ~ because it is a character that doesn't come up often, and it instantly makes one less thing I have to escape in the pattern, as far as dealing with html content.

On top of that, putting a backslash in front of certain letters denotes special characters. For instance, \n stands for a new line. \s stands for any whitespace character (space, tab, new line). \d stands for a digit. \w stands for any lower or uppercase letter, digit, or underscore.

There are several things in DJ's regex that do not need escaping, because he doesn't use them as delimiters, nor do they mean anything special to the regex engine (=, >, and <). Escaping them doesn't necessarily hurt anything, but it makes for an ugly regex and also gives away noobness.

([^"]*) means to match and capture 0 or more of anything that is not a ". It's pretty simple and straightforward. Is the next character a "? No? Okay, it matches. Keep on going.

(.*?) means to match and capture 0 or more of anything except a new line, unless you use a modifier (s) to tell it to match new lines too. It will keep matching until it reaches the first position at which the rest of the pattern after it can be matched. So in order to get a final match, the engine has to try the rest of the pattern after every single character it takes, and each time that fails it has to back up, take one more character, and try again.

So the really, really short answer is that the first one is more efficient and less likely to produce unexpected matches, so you should use negated character classes ([^...]) instead of nongreedy match-alls (.*?) whenever possible.
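To make the "unexpected matches" point concrete, here is a small sketch comparing the two forms. The input string and the variable names are made up for illustration.

```php
<?php
// Hypothetical input: the first link has no title attribute, the second does.
$html = '<a href="a.com">one</a> <a href="b.com" title="B">two</a>';

// Lazy match-all: .*? is allowed to run past the closing quote of a.com,
// because that is the only way the rest of the pattern (" title=") can match
// from that starting point.
preg_match('~href="(.*?)" title="~', $html, $lazy);

// Negated class: [^"]* can never cross a quote, so the first link is rejected
// and only the second link's href is captured.
preg_match('~href="([^"]*)" title="~', $html, $negated);

echo $lazy[1], PHP_EOL;    // a.com">one</a> <a href="b.com
echo $negated[1], PHP_EOL; // b.com
```

Both patterns "work" on well-behaved input; the difference only shows up when the markup deviates from what was expected, which is exactly when the lazy version quietly swallows too much.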
thewired Posted March 22, 2009 Author
Thanks for your response, Crayon; it was very informative. I have now changed my regex a bit and am having problems, which I hope you or someone else can help me fix. The regex looks like this:
'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^"]*)</td>\s*\s*<td class="seeds"><b>([^"]*)</b></td>\s*<td class="conns"><b>([^"]*)</b></td>~is'
And it is grabbing info from source code that looks like this:
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>
<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
My problem is that some of the code on the page looks like this:
<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
That code is ending up in one of the strings in the array that holds the link names. This is a problem. I do not want that code to be taken into consideration at all; I want my regex to completely ignore it and not take any values from those blocks of html. Help?
thewired Posted March 22, 2009 Author
I believe this is because it is grabbing the name along with all the source code that follows it. Here's an example to make my problem clearer. The array looks like this:
[12]=> string(35) "url name 12"
[13]=> string(29) "url name 13"
[14]=> string(2077) "url name</a> </td> <td class="crcsize">X</td> <td class="seeds" colspan="7"></td> </tr>
.josh Posted March 22, 2009
The very first thing I see wrong in your pattern is this: <td class="crcsize">([^"]*)</td> I think maybe you missed the point of negated char classes vs. match-alls. ([^"]*) is specific to getting stuff between quotes. For example, href="([^"]*)" means to keep matching until you hit a double quote. Does that really make sense within this context? <td class="crcsize">([^"]*)</td> says to match <td class="crcsize">, then any run of characters that are not a double quote, and then </td>. A double quote is not what ends the cell's content, so the class is excluding the wrong character: it can run straight through tags until the next quote and only then back up to the last </td> it passed, which is not necessarily the one that closes this cell.
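A quick way to see how the choice of excluded character plays out inside a cell is to run both classes over the same row. This is only a sketch; the row below is hypothetical (it adds an attribute-less cell that the original page may not have).

```php
<?php
// Hypothetical row with an extra cell that carries no attributes (no quotes).
$html = '<td class="crcsize">grab2</td><td>extra</td><td class="seeds"><b>grab3</b></td>';

// [^"]* runs until the next double quote, then backs up to the nearest </td>
// before it, which here belongs to the following cell.
preg_match('~<td class="crcsize">([^"]*)</td>~', $html, $quoteClass);

// [^<]* stops at the first tag, so the capture cannot leave the cell.
preg_match('~<td class="crcsize">([^<]*)</td>~', $html, $tagClass);

echo $quoteClass[1], PHP_EOL; // grab2</td><td>extra
echo $tagClass[1], PHP_EOL;   // grab2
```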
thewired Posted March 22, 2009 Author
Hmm, so how do you recommend I fix it? It should be something along the lines of ([^<]*) (keep matching until it hits <), right? Well, this didn't fix the problem, if it's even valid... I tried:
'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is'
thewired Posted March 22, 2009 Author
Actually, when I replace (.*?) with ([^<]*), it fixes the problem! Yay. So now my code is:
'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">([^<]*)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is'
Of course, the (potential) problem I see with this is that if there is a < in the link title, it won't grab the whole title. Would something along these lines be valid: ([^</]*)? My objective is for it to stop only when it gets to a </ (ending html tag).
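For readers wondering why swapping (.*?) for ([^<]*) made the stray rows disappear, the sketch below runs both variants over two rows shaped like the ones posted above. The URLs and variable names are made up, and the pattern is split into $head and $tail purely for readability (the doubled \s*\s* is written once, which is equivalent).

```php
<?php
// Hypothetical page fragment: an incomplete row followed by a complete one.
$html = <<<HTML
<tr>
<td class="file">
<a href="bad.com" title="bad">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
</tr>
<tr>
<td class="file">
<a href="good.com" title="good">grab1</a>
</td>
<td class="crcsize">grab2</td>
<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
</tr>
HTML;

$head = '~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">';
$tail = '</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*'
      . '<td class="seeds"><b>([^<]*)</b></td>\s*'
      . '<td class="conns"><b>([^<]*)</b></td>~is';

// With the s modifier, (.*?) may cross lines and tags, so the match that
// starts in the incomplete row stretches until the tail fits in the next row.
preg_match_all($head . '(.*?)' . $tail, $html, $lazy, PREG_SET_ORDER);

// ([^<]*) cannot cross a tag, so the incomplete row fails outright and the
// engine starts over at the next <td class="file">.
preg_match_all($head . '([^<]*)' . $tail, $html, $strict, PREG_SET_ORDER);

echo $lazy[0][1], ' / ', strlen($lazy[0][3]), ' chars of link text', PHP_EOL;
echo $strict[0][1], ' / ', $strict[0][3], PHP_EOL; // good.com / grab1
```

In the lazy run, bad.com ends up paired with the second row's values, and the "link text" capture holds everything in between, which matches the oversized array entry described earlier.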
.josh Posted March 22, 2009
No. A negated character class only matches one character at a time, so [^</] matches any single character that is not a < or a /, not the two-character string "</". What you want is a negative lookahead, checked before each character. Something like ((?:(?!</a>).)*)
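One common way to write such a negative lookahead is to test it before every character that gets consumed. The sketch below assumes a made-up link whose text contains a literal < and compares that form with the plain negated class.

```php
<?php
// Hypothetical markup: the first link's text contains a literal < character.
$html = '<a href="x.com" title="x">a < b</a><a href="y.com" title="y">c</a>';

// Negated class: [^<]* stops at the stray <, the </a> that must follow is not
// there, so the first link is skipped and the second one is matched instead.
preg_match('~title="[^"]*">([^<]*)</a>~', $html, $neg);

// Lookahead before each character: consume one character at a time, but only
// while the text at the current position does not start with </a>.
preg_match('~title="[^"]*">((?:(?!</a>).)*)</a>~s', $html, $look);

echo $neg[1], PHP_EOL;  // c
echo $look[1], PHP_EOL; // a < b
```

The lookahead version is slower than a negated class, since it runs an extra check at every position, so it is probably only worth using where a stray < in the text is a real possibility.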
.josh Posted March 22, 2009
Another problem with your pattern, though, is that it doesn't take into consideration other things that might be in your td tags, or certain tds not being there at all. Take the example you posted:
<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
That has a colspan in the seeds td, and the conns td is missing. Either of those things will make your regex fail to match.
.josh Posted March 22, 2009
You can do it in one regex, doing something like this:
'~<td.*?class="(?:file|crcsize|seeds|conns)"[^>]*>\s*(?:(?:<a href="([^"]*)" title="([^"]*)">(.*?)</a>)|(?:<b>)?(.*?)(?:</b>)?)\s*</td>\s*~is'
What that basically does is look for any td with class file, crcsize, seeds, or conns. Then it will either look for a link tag and match the stuff inside it, or just do a generic match-everything, to accommodate the different scenarios. This pattern will match all of your info: the href, the title, the stuff between the link tags, the general stuff between the td tags, with optional bold tags, for any of those 4 classes.

The main problem with this pattern is that it will make for some funky-ass result formatting. Try it out and do a print_r on the results to see what I mean. There are a whole lot of empty elements, for the groups that don't apply to any given td. Your best bet is to break it down into 2 different regexes: first match the link stuff, then match the other class tds.
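A rough sketch of what the two-regex approach could look like is below. The patterns are an illustration under the assumption that the cells look like the posted sample, not the exact ones anyone in the thread ended up using, and tying each group of cells back to its link row is left out.

```php
<?php
// Sample row in the shape posted earlier in the thread.
$html = <<<HTML
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>
<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
HTML;

// Pass 1: the link inside the "file" cell -> [full, href, title, text].
preg_match_all(
    '~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">([^<]*)</a>~i',
    $html,
    $links,
    PREG_SET_ORDER
);

// Pass 2: the other cells -> [full, class, content].
// [^>]* tolerates extra attributes such as colspan; the <b> wrapper is optional.
preg_match_all(
    '~<td class="(crcsize|seeds|conns)"[^>]*>\s*(?:<b>)?([^<]*)(?:</b>)?\s*</td>~i',
    $html,
    $cells,
    PREG_SET_ORDER
);

foreach ($links as $l) {
    echo 'link: ', $l[3], ' (', $l[1], ')', PHP_EOL;
}
foreach ($cells as $c) {
    echo $c[1], ': ', $c[2], PHP_EOL;
}
```

Each pass returns only the groups it actually uses, so the result arrays have no empty filler elements, at the cost of having to line the two result sets up afterwards.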
thewired Posted March 22, 2009 Author
Yeah, you're right about the funky results. For my uses, however, my code should work fine. I can see how it would break, like you said, if tags I can't predict show up, but that shouldn't happen in my case. Anyway, thanks for all the help with my regex questions!
redarrow Posted March 23, 2009
What a post, my god, that's great info. If you get time, can you explain all the delimiters, please? If you can add any other fantastic advanced info, please do so. That was much quicker than reading 10 books.