bbmak Posted July 16, 2012
Is there code or a function that extracts all links from a string? I can't figure it out. I am trying to extract all the <img src= values from an XML document, but the links are mixed into the <description> tag. I can use strip_tags() to get the description text, but not the links. Can anyone help?
cpd Posted July 16, 2012
preg_replace("/<a(.*?)>|<\/a>/", "", $string); should remove any anchor tags from your text.
ignace Posted July 16, 2012
if (preg_match_all('!(href|src)="([^"]+)"!', $string, $matches)) {
    foreach ($matches[2] as $location) {
        // use $location
    }
}
ManiacDan Posted July 16, 2012
Ignace's expression is correct as long as all the URLs are "double quoted" and none are 'single quoted' or unquoted. This should get all of them:
if (preg_match_all('!(href|src)=[\'"]?([^\'"\s]+)[\'"\s>]!', $string, $matches)) {
    foreach ($matches[2] as $location) {
        // use $location
    }
}
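For anyone trying this out, here is a minimal sketch of the idea on a made-up sample string. The sample markup and the $locations variable are assumptions for illustration only. It uses a slight variation of the pattern above: the character class also excludes > so that an unquoted value sitting right before the tag's closing bracket is not captured with the bracket attached.

```php
<?php
// Hypothetical sample input mixing double-quoted, single-quoted and unquoted attributes.
$string = '<img src="http://example.com/a.png"> '
        . "<a href='http://example.com/page'>link</a> "
        . '<img src=http://example.com/b.png>';

$locations = [];

// '!' is the pattern delimiter. Group 1 is the attribute name, group 2 is the URL.
// The URL is everything up to the next quote, whitespace character or '>'.
if (preg_match_all('!(href|src)=[\'"]?([^\'"\s>]+)!', $string, $matches)) {
    $locations = $matches[2];
}

print_r($locations);
```

This is still a regex heuristic rather than a real HTML parser, so unusual markup may slip through.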
ignace Posted July 16, 2012
Quote (ManiacDan): Ignace's expression is correct as long as all the URLs are "double quoted" and none are 'single quoted' or unquoted at all.
Thought about that! But I didn't want to bother, as more experienced regex gurus would surely follow up.
Quick note to the OP: [^\'"\s]+ stops at whitespace, so if you have unescaped hrefs like:
http://www.domain.top/files/Report 2011.pdf
which should be:
http://www.domain.top/files/Report+2011.pdf
OR
http://www.domain.top/files/Report%202011.pdf
the expression would return a URL like:
http://www.domain.top/files/Report
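A small sketch of the two encodings mentioned above, using PHP's built-in rawurlencode() and urlencode(). The $base and $file variables are just illustrative names; the URL is the one from the post.

```php
<?php
$base = 'http://www.domain.top/files/';
$file = 'Report 2011.pdf'; // file name containing a space

// rawurlencode() percent-encodes the space as %20.
$percentEncoded = $base . rawurlencode($file);

// urlencode() uses the form-style encoding, where a space becomes +.
$plusEncoded = $base . urlencode($file);

echo $percentEncoded, "\n"; // http://www.domain.top/files/Report%202011.pdf
echo $plusEncoded, "\n";    // http://www.domain.top/files/Report+2011.pdf
```

The + form is the convention for form data and query strings, so for a file name in the path part of a URL the %20 form is probably the safer choice.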
bbmak Posted July 16, 2012
Thank you very much. It works great. May I ask why you use [2] there, as in $matches[2]? I tried the others: [1] is src, [2] is the link, and [3] gives an error. I also tried indexing $location: $location[0] is h, $location[1] is t, $location[2] is t, and $location[3] is p. Can you give a bit of an explanation? Sorry, I am a newbie.
ManiacDan Posted July 16, 2012
Ignace, those URLs wouldn't work anyway since the HTML would be invalid.
bbmak, preg_match_all puts an array of matches into the third argument, in this case $matches. $matches[0] is an array of the matches for the full pattern; $matches[1], $matches[2], and so on are arrays of the sub-pattern matches, one for each pair of parentheses in the expression. When you index a string like $location as if it were an array, you get the single character at that offset, which is why you were getting the individual letters of the string as results. Regular expressions and array access are both complex topics, so this is a very basic answer.
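To make the layout of $matches concrete, here is a minimal sketch using ignace's original pattern on a made-up one-tag sample (the sample string is an assumption for illustration). It also shows why [3] fails and why indexing $location gives single letters.

```php
<?php
$string = '<img src="http://example.com/a.png">'; // hypothetical sample

preg_match_all('!(href|src)="([^"]+)"!', $string, $matches);

echo $matches[0][0], "\n"; // full match:   src="http://example.com/a.png"
echo $matches[1][0], "\n"; // first group:  src
echo $matches[2][0], "\n"; // second group: http://example.com/a.png

// The pattern has only two pairs of parentheses, so there is no index 3.
var_dump(isset($matches[3])); // bool(false)

// Indexing a string returns the character at that offset.
$location = $matches[2][0];
$firstFour = $location[0] . $location[1] . $location[2] . $location[3];
echo $firstFour, "\n"; // http
```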
ignace Posted July 16, 2012
Quote (ManiacDan): Ignace, those URLs wouldn't work anyway since the HTML would be invalid.
Yeah, but neither the browser, the OP, nor the regex would care about that. I just wanted to point it out in case the OP has to deal with such URLs, and to avoid a repeat question: "Why does this regex not match blabla".
ManiacDan Posted July 16, 2012
The browser won't care? If I have a space in my URL...it shouldn't work.
bbmak Posted July 16, 2012
Thank you. I see. I ran echo '<pre>', print_r($matches, true), '</pre>'; and can see all the arrays, and the array under [2] has all the links. Is there a website that lists all the preg_match() modifiers? I tried php.net, but I do not understand it at all. Is there any website that explains it in simple English?
bbmak Posted July 16, 2012
What do the two !s stand for in !(href|src)="([^"]+)"! ?
ManiacDan Posted July 17, 2012
I told you already, but I'll try to explain again: Regular expressions are their own language entirely. They are not PHP, SQL, or anything else you're familiar with. Inside a regular expression, parts of the expression can be wrapped in parentheses. When the expression is used in preg_match, those parentheses act as "capture groups," and each capture group gets its own entry in the $matches array.
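On the question about the two ! characters: in PHP's preg_* functions the pattern has to be wrapped in a pair of delimiters, and ! is simply the delimiter chosen in this thread; it is not part of the expression itself. The sketch below, on a made-up sample string, suggests that swapping the delimiter changes nothing, while changing the parentheses does change where the URL lands in the matches array. The variable names $a, $b and $c are illustrative.

```php
<?php
$string = '<a href="http://example.com/">x</a>'; // hypothetical sample

// Same expression, two different delimiters: '!' and '#'.
preg_match_all('!(href|src)="([^"]+)"!', $string, $a);
preg_match_all('#(href|src)="([^"]+)"#', $string, $b);
var_dump($a === $b); // bool(true)

// (?:...) groups without capturing, so the URL becomes group 1 instead of group 2.
preg_match_all('!(?:href|src)="([^"]+)"!', $string, $c);
echo $c[1][0], "\n"; // http://example.com/
```

A delimiter other than / is handy here because URLs are full of slashes, which would otherwise need escaping inside the pattern.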