regex code for grabbing info from link


thewired

I have a site I am scraping source from and want to grab the text that says "WANT". The TD is on a separate line from the link. Can I get help making a regex for this? I am a regex noob  :P

The code looks like this; WANT and the random.com's will be different every time.

 

<td class="file">

<a href="random.com" title="random.com">WANT</a>


Try

 

<?php
$stringToSearch = "<a href=\"lalala.com\" title=\"lalala.com\">Go Here</a>";
// Pass the same variable that was defined above to preg_match().
preg_match("/\<a href\=\"(.*?)\" title\=\"(.*?)\"\>(.*?)\<\/a\>/i", $stringToSearch, $matches);

print_r($matches);
?>

 

Simple code, really. $matches[0] will be the whole text that matched the pattern, $matches[1] will be the first captured group (the href), $matches[2] the second (the title), and so on.
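To make the indexing concrete, here is a minimal sketch using the same sample string and the same three capture groups as the code above. The variable name $found is made up for illustration.

```php
<?php
// Same sample string as in the post above.
$stringToSearch = '<a href="lalala.com" title="lalala.com">Go Here</a>';

// Three capture groups: href, title, and the link text.
// $found is a hypothetical name; preg_match() returns 1 on a match, 0 otherwise.
$found = preg_match('/<a href="(.*?)" title="(.*?)">(.*?)<\/a>/i', $stringToSearch, $matches);

if ($found === 1) {
    echo $matches[0], "\n"; // the whole matched <a ...>...</a> text
    echo $matches[1], "\n"; // first group: the href value
    echo $matches[2], "\n"; // second group: the title value
    echo $matches[3], "\n"; // third group: the link text
}
```

print_r($matches) would list the same four elements, with index 0 holding the full match.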


Thanks for the help guys. I tried DJTim's and Crayon's, but your new code didn't work, DJTim. So I'm using Crayon's.

Can someone explain to me what the difference is between ([^"]*) and (.*?) ? Also it seems to work as is but does it need the backslashes like DJTim's code?


backslashes are used to escape things.  For instance, if you have this:

 

$string = "some "random" thing";

 

you are going to get a parse error, because php will think the 2nd quote is the end of the string.  In order to tell php that no, that's not the end of the string, you escape it like this:

 

$string = "some \"random\" thing";

 

That is the general principle of the backslash.  Within a regex pattern, there are several things that need to be escaped.  For one thing, quotes you may be trying to match within the pattern, just like I mentioned above.  I don't have the quotes escaped in the pattern I gave, because I used single quotes around the pattern.  Since I used single quotes, the double quotes don't need to be escaped, because php doesn't match single quotes to double quotes like that.  Now, if there was a single quote in the pattern, I would have had to escape it, since I used single quotes around the pattern.
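A small sketch of the quoting point, assuming a made-up one-link input: the same pattern written once inside double quotes and once inside single quotes. The variable names are only for illustration.

```php
<?php
$html = '<a href="random.com">WANT</a>';

// Double-quoted PHP string: every literal " in the pattern must be escaped for PHP.
$doubleQuoted = "~href=\"([^\"]*)\"~";

// Single-quoted PHP string: the double quotes can be written as-is.
$singleQuoted = '~href="([^"]*)"~';

// Both strings hold the same pattern text, so they behave identically.
var_dump($doubleQuoted === $singleQuoted);

preg_match($singleQuoted, $html, $m);
echo $m[1], "\n";
```

The backslashes in the double-quoted version are consumed by PHP itself, so the regex engine never sees them.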

 

Next thing is the pattern delimiter.  The delimiter is what tells the regex engine where the pattern starts and ends.  You can use pretty much any non-alphanumeric character (other than a backslash or whitespace) for the pattern delimiter.  DJ chose to use / as the delimiter.  Since he chose to use that, he has to escape any instance of it in the pattern (like in closing html tags), so that the regex engine knows, for instance, that the / in </a> is not the end of the pattern, but part of the pattern.  So it would have to look like this: <\/a>.  / is a pretty common character to pop up in patterns, because running regexes on html content is pretty common.  I usually use ~ because it is a character that doesn't come up often, and it instantly makes one less thing I have to escape in the pattern, as far as dealing with html content.
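As a rough illustration of the delimiter choice, the sketch below runs the same match twice on a made-up string, once with / and once with ~ as the delimiter.

```php
<?php
$html = '<b>bold</b>';

// With / as the delimiter, the / inside </b> has to be escaped.
preg_match('/<b>([^<]*)<\/b>/', $html, $withSlash);

// With ~ as the delimiter, the closing tag can be written as-is.
preg_match('~<b>([^<]*)</b>~', $html, $withTilde);

echo $withSlash[1], "\n";
echo $withTilde[1], "\n";
```

Both calls capture the same text; only the amount of escaping differs.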

 

On top of that, putting a backslash in front of certain letters gives them a special meaning.  For instance, \n stands for a new line.  \s stands for any whitespace character (space, tab, new line, and so on).  \d stands for a digit.  \w stands for any letter, digit, or underscore.
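A quick way to check what these shorthands accept is to try them on small strings. The inputs below are invented for the purpose.

```php
<?php
// \s matches any whitespace character, so it also matches the newline here.
$newlineIsWhitespace = preg_match('~^a\sb$~', "a\nb");

// \d matches one digit; \d+ matches a run of digits.
preg_match('~\d+~', 'size: 700 MB', $digits);

// \w matches letters, digits, and underscore, so the run stops at the hyphen.
preg_match('~\w+~', 'file_01-final', $word);

var_dump($newlineIsWhitespace); // 1 if \s accepted the newline
echo $digits[0], "\n";
echo $word[0], "\n";
```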

 

There are several things in DJ's regex that do not need escaping (=, >, and <), because he doesn't use them as delimiters, nor do they mean anything special to the regex engine on their own.  Escaping them doesn't hurt anything, but it makes for an ugly regex and also gives away noobness :P
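To see that the extra backslashes change nothing, one can run an escaped and an unescaped version of a pattern on the same made-up input and compare the results.

```php
<?php
$html = '<a href="x.com">go</a>';

// Escaped =, < and > (in the style of DJ's pattern) ...
preg_match('~\<a href\="([^"]*)"\>~', $html, $escaped);

// ... and the same pattern without the redundant backslashes.
preg_match('~<a href="([^"]*)">~', $html, $plain);

// Identical result arrays: the backslashes were only noise.
var_dump($escaped === $plain);
```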

 

([^"]*) means to match and capture 0 or more of anything that is not a ".  It's pretty simple and straight forward.  Is the next character a "? No? okay it matches. Keep on going. 

 

(.*?) means to match and capture 0 or more of anything except a new line, unless you use a modifier (s) to tell it to match new lines too.  Because it is non-greedy, it starts out matching nothing and only takes one more character each time the rest of the pattern fails to match from where it stopped.  So to get a final match, the engine has to keep trying the rest of the pattern, backing up, and extending the match by a character until it reaches the first spot where everything after it can be matched.  Nothing stops it from running past quotes or tags along the way.

 

So the really really short answer is the first one is more efficient and less likely to produce unexpected matches, so you should use negated character classes ([^]) instead of nongreedy match-alls (.*?) whenever possible.
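The "unexpected matches" part is easy to reproduce. The sketch below assumes a made-up input in which the first link has no title attribute; the two patterns differ only in the capture groups.

```php
<?php
// Hypothetical input: the first link has no title attribute, the second one does.
$html = '<a href="a.com">one</a> <a href="b.com" title="B">two</a>';

// Non-greedy match-all: the first group keeps growing until " title=" shows up.
preg_match('~<a href="(.*?)" title="(.*?)">~', $html, $lazy);

// Negated character class: each group has to stay inside its own pair of quotes.
preg_match('~<a href="([^"]*)" title="([^"]*)">~', $html, $negated);

echo $lazy[1], "\n";    // a.com">one</a> <a href="b.com
echo $negated[1], "\n"; // b.com
```

The non-greedy version starts at the first link and drags markup into the href group, while the negated class rejects the first link and matches only the second.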

 


Thanks for your response Crayon it was very informative.

I have now changed my regex a bit, and I am having problems, which I hope you or someone else can help me fix.

 

The regex looks like this:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^"]*)</td>\s*\s*<td class="seeds"><b>([^"]*)</b></td>\s*<td class="conns"><b>([^"]*)</b></td>~is'

 

And it is grabbing info from source code that looks like this:

<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>

<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>

 

My problem is that some of the code on the page looks like this:

<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>

 

That code is getting placed in a string in the array containing the link names. This is a problem. I do not want that code to be taken into consideration at all; I want my regex to completely ignore it and not take any values from those blocks of HTML. Help?


I believe this is because it is grabbing the name along with all the source code that follows it.

Here's an example to help make my problem more clear. The array looks like this:

[12]=>
    string(35) "url name 12"
    [13]=>
    string(29) "url name 13"
    [14]=>
    string(2077) "url name</a>
</td>
<td class="crcsize">X</td>

<td class="seeds" colspan="7"></td>
</tr>


very first thing I see wrong in your pattern is this: <td class="crcsize">([^"]*)</td>

 

I think maybe you missed the point of negated char classes vs. match-alls.  ([^"]*) is specific to getting stuff between quotes.  For example:

 

href="([^"]*)"

 

means to keep matching until you hit a double quote.  Well does that really make sense within the context of this?

 

<td class="crcsize">([^"]*)</td>

 

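That says to match <td class="crcsize">, then keep matching anything that isn't a double quote, and then </td>.  A quote has nothing to do with where the contents of that td end, so ([^"]*) is free to run right past the closing tag and into whatever follows, up to the next quote, and the engine then has to backtrack to find a </td>.  That happens to work on your posted example, but the character that actually marks the end of the contents is <, not ".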

Hmm, so how do you recommend I fix it? It should be something along the lines of ([^<]*) (keep matching till it hits <), right? Well, this didn't help the problem, if it's even valid...

I tried:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is'


Actually, when I replace (.*?) with ([^<]*), it fixes the problem! Yay  ;D

So now my code is:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">([^<]*)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is'
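Why the swap helps can be reproduced with a small test. The sketch below assumes a page where an incomplete block sits directly before a complete one; the $html sample is stitched together from the two snippets posted earlier, and the $head and $tail names are only there to avoid repeating the shared parts of the pattern.

```php
<?php
// Made-up page fragment: an incomplete block followed by a complete one.
$html = <<<'HTML'
<td class="file">
<a href="skip.com" title="skip">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>

<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
HTML;

// Everything before and after the link text group is identical in both patterns.
$head = '~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">';
$tail = '</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*'
      . '<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is';

preg_match_all($head . '(.*?)' . $tail, $html, $lazy);
preg_match_all($head . '([^<]*)' . $tail, $html, $fixed);

// With (.*?) and the s modifier, the link text group starts in the incomplete
// block and swallows markup until the complete block finally lets the rest match.
echo strlen($lazy[3][0]), "\n";

// With ([^<]*), the incomplete block cannot match at all and is skipped.
echo $fixed[1][0], ' ', $fixed[3][0], "\n";
```

In the first result the href comes from the incomplete block while the remaining fields come from the complete one, which is the kind of oversized string seen in the array dump.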

 

Of course the (potential) problem I see with this is that if there is a < in the link title, it won't grab the whole title. Would something along these lines be valid: ([^</]*) ? My objective with that is for it to stop only when it gets to a </ (ending html tag).
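On the ([^</]*) idea: the pattern is valid, but a character class is a set of single characters, so [^</] means "anything that is neither < nor /" rather than "anything up to the sequence </". The sketch below shows that on a made-up link, and then one possible alternative using a negative lookahead. The lookahead version is only a suggestion and does not come from anyone in this thread.

```php
<?php
// Hypothetical link text containing both a "/" and a nested tag.
$html = '<a href="x.com" title="t">24/7 <b>news</b></a>';

// [^</] stops at the first "<" OR the first "/", whichever comes first.
preg_match('~">([^</]*)~', $html, $class);
echo $class[1], "\n"; // 24

// Alternative: take one character at a time as long as </a> does not start there.
preg_match('~">((?:(?!</a>).)*)</a>~s', $html, $tempered);
echo $tempered[1], "\n"; // 24/7 <b>news</b>
```

The lookahead form stops only at the closing </a>, at the cost of a slower and harder to read pattern.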


another problem with your pattern though is that it doesn't take into consideration other things that might be in your td tags, or certain ones not being there at all.  For example, the example you posted:

 

<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>

 

That has a colspan in your seeds td, and your conns td is missing entirely.  Either of those things will make your regex fail.


You can do it in one regex, doing something like this:

'~<td.*?class="(?:file|crcsize|seeds|conns)"[^>]*>\s*(?:(?:<a href="([^"]*)" title="([^"]*)">(.*?)</a>)|(?:<b>)?(.*?)(?:</b>)?)\s*</td>\s*~is'

 

What that basically does is look for any td with class file, crcsize,seeds, or conns.  Then it will either look for a link tag and match the stuff inside it, or just do a generic match everything, to accommodate the different scenarios.  This pattern will match all of your info.  It will match the href, title, stuff between link tags, general stuff between the td tags, check for bold tags, etc.. for any of those 4 classes. 

 

The main problem with this pattern is that it will make for some funky ass result formatting. Try it out and do a print_r on the results to see what I mean.  There's a whole lot of empty elements, for things that don't match for any given td.
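For anyone who wants to see the result shape without setting up a page, here is a minimal sketch that runs the combined pattern on the complete sample block from earlier in the thread. It assumes the nested non-capturing groups are written out in full as (?:(?: at the start of the alternation.

```php
<?php
$html = <<<'HTML'
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>

<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
HTML;

$pattern = '~<td.*?class="(?:file|crcsize|seeds|conns)"[^>]*>\s*'
         . '(?:(?:<a href="([^"]*)" title="([^"]*)">(.*?)</a>)|(?:<b>)?(.*?)(?:</b>)?)'
         . '\s*</td>\s*~is';

$count = preg_match_all($pattern, $html, $m);

// One match per td; groups 1-3 are only filled for the link cell,
// group 4 only for the other cells, so each column is mostly empty strings.
print_r($m[1]);
print_r($m[4]);
```

Every td produces its own row in every group, which is where the empty elements come from.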

 

Your best bet is to break it down into 2 different regexes.  First match the link stuff, then match the other class td's. 
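A two-pass version could look roughly like the sketch below. The two patterns are illustrative guesses built from the sample markup in this thread, not patterns anyone posted, and they assume each cell holds plain text with at most one <b> wrapper.

```php
<?php
$html = <<<'HTML'
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>

<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
HTML;

// Pass 1: the link inside the "file" cell.
preg_match_all(
    '~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">([^<]*)</a>~i',
    $html, $links, PREG_SET_ORDER
);

// Pass 2: the remaining cells, allowing extra attributes and an optional <b> wrapper.
preg_match_all(
    '~<td class="(crcsize|seeds|conns)"[^>]*>\s*(?:<b>)?([^<]*)(?:</b>)?\s*</td>~i',
    $html, $cells, PREG_SET_ORDER
);

foreach ($links as $link) {
    echo $link[1], ' | ', $link[2], ' | ', $link[3], "\n";
}
foreach ($cells as $cell) {
    echo $cell[1], ': ', $cell[2], "\n";
}
```

With PREG_SET_ORDER each match arrives as its own small array, so there are no empty filler elements to skip over. Pairing links with their cells still has to be done by position, which only works if every row has the same cells.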

 

 

Yeah, you're right about the funky results.  :P

For my uses however, my code should work fine. I can see how it would break like you said if some tags I can't predict show up and whatnot, but that shouldn't happen in my case. Anyway thanks for all the help with my regex questions!  O0


 

What a post, my god, that's great info.

If you get time, can you explain all the delimiters, please?

If you can add any other fantastic advanced info, please do so.

Well, much quicker than 10 books.
