Regex to match and return all anchor tags in a string

kyleabaker · December 4, 2013

I'm trying to write regex to take an input string and return results of each anchor tag that is found. For example, in the following string, it should return 3 results:

This is an test <a href="link1.html" data-modal="sdfdsf87ds87fdsf8bds8fb">string</a> example to <a href="link2.html">parse</a> and return <a class="someclass" href="link3.html">some</a> anchor tag results.

My expected results are:

1. <a href="link1.html" data-modal="sdfdsf87ds87fdsf8bds8fb">string</a>

2. <a href="link2.html">parse</a>

3. <a class="someclass" href="link3.html">some</a>

I'm testing this at http://regexpal.com/ and the problem is that my regex ( <a (.+)[^<]*</a> ) selects everything from the start of the first anchor tag to the end of the last anchor tag, and I can't figure out how to split these apart.

Any suggestions so it returns each tag as a separate result in the match array?

Thanks in advance!

dalecosp · December 4, 2013

Try spaces, newlines, etc?

Or, perhaps, use something like DOMDocument to read the HTML instead of a regexp.

requinix · December 4, 2013

dalecosp said: "Or, perhaps, use something like DOMDocument to read the HTML instead of a regexp."

That. Very that.

Not only are regular expressions the wrong tool for dealing with HTML, DOMDocument is actually better at doing what you want.

See DOMDocument::getElementsByTagName().

Edited December 4, 2013 by requinix
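For anyone landing here later, a minimal sketch of the DOMDocument route, assuming the dom extension is enabled and PHP 5.4 or newer (short array syntax; saveHTML() with a node argument needs 5.3.6+). The variable names are just illustrative, and the input is the sample string from the first post.

```php
<?php
// Sample input copied from the original question.
$string = 'This is an test <a href="link1.html" data-modal="sdfdsf87ds87fdsf8bds8fb">string</a> example to <a href="link2.html">parse</a> and return <a class="someclass" href="link3.html">some</a> anchor tag results.';

$doc = new DOMDocument();

// Collect libxml warnings internally instead of emitting them,
// since real-world HTML is rarely perfectly formed.
libxml_use_internal_errors(true);
$doc->loadHTML($string);
libxml_clear_errors();

$anchors = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // saveHTML() with a node argument serializes only that node,
    // including its opening and closing tags.
    $anchors[] = $doc->saveHTML($a);
}

print_r($anchors);
```

This should give one array entry per anchor tag. Note that the parser re-serializes the markup, so on messier input the output may not be byte-for-byte identical to the source.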

dalecosp · December 4, 2013

Yeah, that's what I use; he didn't say if it was a requirement ... never can tell when people are doing coursework ;)

.josh · December 5, 2013

I agree that in general a DOM parser would be better for general DOM parsing/manipulation, but regex isn't a bad alternative if what you are looking for is regular. If all you want is each complete anchor tag, this regex should work ($anchors will hold the results):

preg_match_all('~<a\s+.*?</a>~is',$string,$anchors);

If however you want to parse individual attributes or just the "text" of the anchor etc. then using a DOM parser would definitely be better.
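As a rough illustration of that case, here is a sketch that pulls the href and the text out of each anchor with DOMDocument. It assumes the dom extension is available; the $links array and its keys are made up for the example.

```php
<?php
// Sample input copied from the original question.
$string = 'This is an test <a href="link1.html" data-modal="sdfdsf87ds87fdsf8bds8fb">string</a> example to <a href="link2.html">parse</a> and return <a class="someclass" href="link3.html">some</a> anchor tag results.';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($string);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        // getAttribute() returns an empty string if the attribute is missing.
        'href' => $a->getAttribute('href'),
        // textContent is the text inside the tag, with any nested markup stripped.
        'text' => $a->textContent,
    ];
}

print_r($links);
```

Attribute order, quoting style and extra attributes such as data-modal don't matter here, which is the main reason the parser is more comfortable than a regex for this kind of job.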

Since you are testing on regexpal, <a\s+.*?</a> is the actual pattern, and i and s are modifiers: i makes the match case-insensitive and s lets the dot match newline chars, in the event that the "text" inside the anchor tags has newline chars. IOW, make sure to set the equivalent options in the tester as well.
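To show why the original pattern grabbed everything, here is a small sketch that runs both patterns against the sample string from the first post. The $greedy and $lazy names are only for the example.

```php
<?php
// Sample input copied from the original question.
$string = 'This is an test <a href="link1.html" data-modal="sdfdsf87ds87fdsf8bds8fb">string</a> example to <a href="link2.html">parse</a> and return <a class="someclass" href="link3.html">some</a> anchor tag results.';

// Original pattern: (.+) is greedy, so it runs to the end of the line and
// backtracks only as far as the LAST </a>, producing one big match.
preg_match_all('~<a (.+)[^<]*</a>~', $string, $greedy);

// Lazy pattern: .*? expands one character at a time and stops at the
// FIRST </a> after each opening tag, producing one match per anchor.
preg_match_all('~<a\s+.*?</a>~is', $string, $lazy);

echo count($greedy[0]), "\n"; // 1
echo count($lazy[0]), "\n";   // 3

print_r($lazy[0]);
```

The only real difference is the ? after the quantifier. Keep in mind that the lazy version still assumes anchors are not nested and that every <a ...> has a closing </a>.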
