gamblor01 Posted October 8, 2009

Hi everyone,

I have a regex that looks for URLs in a line of text (actually I have several, but only one is giving me problems that I know of). The idea is that I want to find strings of the form something <dot> something <dot> (2 or 3 characters). This will allow me to find things like:

a.b.cd
www.google.com
hello.world.tv

So here is the current line of code I am using:

preg_match('/.+\..+\.([a-zA-Z]{2}|[a-zA-Z]{3})/i', $msg)

The problem is that it matches strings such as "forever...or". I see why, I just don't know how to avoid it. Is there any way to specify in my regex that it should match everything of that form EXCEPT a string that contains "..."? Or do I simply need to rewrite the entire regular expression in a different way?

Thanks!
Garethp Posted October 8, 2009

Use this:

'~[^.]+\.[^.]+\.([a-zA-Z]{2,3})~i'
fooDigi Posted October 8, 2009

Instead of using ".+", try specifying the valid domain characters explicitly, because .+ will also match a '.'. Also, you don't need the pipe to match 2 or 3 chars; see below...

$msg = "skdaj www.google.com asd hello.world.tv jaksdjflajsk forever...or dfas as d.s.as askajjsl ke jhash kj";
// Collect every match in $msg, not just the first one.
preg_match_all('/[0-9a-zA-Z-]+\.[0-9a-zA-Z]+\.[a-zA-Z]{2,3}/i', $msg, $matches);
print_r($matches);

edit: btw... I posted this in Opera and it stripped certain characters; should be good now...
gamblor01 (Author) Posted October 9, 2009

Thanks garethp, your expression worked like a champ! I know you're anchoring to the beginning of the expression, but I'll have to really sit down and look it over at some point to truly understand what the heck it's doing.
Garethp Posted October 9, 2009

Actually, I'm not. I'll show you:

'~[^.]+\.[^.]+\.([a-zA-Z]{2,3})~i'

[^.] means any character that's not a dot (if ^ is the first character in a character class, it means whatever is NOT in this class)
+ means more than one
\. is a literal dot
{2,3} is two OR three times

So it all means: match any character that's not a dot, any number of times. Then a dot. Then anything that's not a dot, any number of times. Then a dot. Then [a-zA-Z] two or three times.
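For anyone who wants to check that walkthrough against the strings from the original question, here is a minimal sketch. The `$samples` and `$results` names are made up for illustration; the pattern is the one posted above, unchanged.

```php
<?php
// Pattern from the post above, unchanged.
$pattern = '~[^.]+\.[^.]+\.([a-zA-Z]{2,3})~i';

// Hypothetical sample list: three strings that should match,
// plus the false positive from the original question.
$samples = ['a.b.cd', 'www.google.com', 'hello.world.tv', 'forever...or'];

$results = [];
foreach ($samples as $s) {
    // preg_match returns 1 on a match; $m[0] then holds the matched text.
    $results[$s] = preg_match($pattern, $s, $m) === 1 ? $m[0] : null;
}

print_r($results);
```

"forever...or" is rejected because after the first dot the pattern needs at least one non-dot character, and the next character is another dot.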
fooDigi Posted October 9, 2009

garethp, a correction to your last post... I know it's small, but why not be accurate... the plus (+) actually matches the previous expression ONE or MORE times, not more than one.
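A quick way to see the difference, as a small sketch (the variable names are only for illustration): a single character satisfies `+`, while "more than one" would have to be written with `{2,}`.

```php
<?php
// "+" means one or more: a single non-dot character is enough.
$onePlus = preg_match('~^[^.]+$~', 'a');

// "More than one" would have to be spelled {2,}, which rejects a lone character.
$twoPlus = preg_match('~^[^.]{2,}$~', 'a');

// Because "+" accepts one character, one-letter labels like "a" and "b" still match.
$short = preg_match('~[^.]+\.[^.]+\.([a-zA-Z]{2,3})~i', 'a.b.cd');

var_dump($onePlus, $twoPlus, $short);
```

This is why "a.b.cd" from the original question is matched by the pattern above.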