matt.xx Posted December 3, 2009

Hi guys,

I need to build a preg_match_all expression that will search through a given string and pick out filenames with their extensions. A typical link could be http://www.website.com/resources/publicdocuments/marketing/sales.pdf

The part I would like to extract when a URL to one of these file types is found is the filename and extension at the end (sales.pdf in the example above). I would like to be able to do this for the following formats: pdf, ppt, doc, exe, XLS, jpg, db, gif, mov, avi, mpg, dot, jpeg, mgmf, swf, wmv,mp3, zip, mp4, flv, txt, pps, rtf, tif, bmp, htm, js, docx, bak-2 and pptx.

Any help would be greatly appreciated.

Many thanks,
Matt
cags Posted December 3, 2009

What does a standard input string look like? You say preg_match_all, so I assume there is more than one filename in the string? Will the filenames always have a URL path before them (by this I mean, will there never be a plain 'bob.pdf')?
matt.xx Posted December 3, 2009

Yeah, basically we need to run this check through our content management system. A standard string would be the HTML of a page, so the expression needs to search through that HTML to find the results. (We are trying to find all documents contained within our website, checking the "content" data row by row.)

There will always be a path before them; there won't ever be just the single file name.

Thanks for your quick reply.
cags Posted December 3, 2009

How about something like this? It may occasionally pick up false positives. Basically, it searches for a forward slash followed by one or more of a-z_-, then a full stop, followed by one of your extensions.

"#/\K[a-z_-]+\.(?:pdf|ppt|doc|exe|XLS|jpg|db|gif|mov|avi|mpg|dot|jpeg|mgmf|swf|wmv,mp3|zip|mp4|flv|txt|pps|rtf|tif|bmp|htm|js|docx|bak-2|pptx)#"
matt.xx Posted December 3, 2009

That's awesome, just what I wanted! How would I go about the foreach loop? I can never work the foreach syntax out.

Thanks mate
cags Posted December 3, 2009

What do you wish to do with them?
matt.xx Posted December 3, 2009

I basically just want to display all of the files it finds within the website's content as one large list.

Thanks
cags Posted December 3, 2009

$pattern = "What i put in my last post, can't be arsed to copy it";
preg_match_all($pattern, $input, $output);

echo '<pre>';
print_r($output);
echo '</pre>';

// or alternatively
foreach($output[0] as $k=>$v) {
    echo $v . '<br/>';
}
matt.xx Posted December 3, 2009

Thanks for your reply. This is pretty much what I had, but it will only display the first result (Page ID: 3032 - File Found - 0 -> /Konstantine.jpg).

for($i3=0;$i3<$Num3;$i3++){
    $Content = mysql_result($Result3, $i3, "content");
    $Id = mysql_result($Result3, $i3, "id");

    preg_match_all("#/\K[a-z_-]+\.(?:pdf|ppt|doc|exe|XLS|jpg|db|gif|mov|avi|mpg|dot|jpeg|mgmf|swf|wmv,mp3|zip|mp4|flv|txt|pps|rtf|tif|bmp|htm|js|docx|bak-2|pptx)#", $Content, $output);

    foreach($output[0] as $k=>$v) {
        echo "Page ID: <b>".$Id."</b> - "."File Found - <b>".$k." -> ".$v."</b><br/>";
        //mysql_query("UPDATE racnew.doc_items SET is_active = 1 WHERE doc_id = ".$matches[1][$k]."") or die ("<p>Query Failed</p><br />");
        $Docs++;
    }
}

The query etc. is correct, because I've used it with the previous foreach loop that I obtained from this website for a similar task. Are there any visible problems in there that would make it output only the one result?

Thanks
salathe Posted December 3, 2009

Are you using an older version of PHP (say, less than 5.2.4)?
matt.xx Posted December 3, 2009

phpinfo says PHP Version 4.4.1. Would this really make that big of a difference?

Matt
cags Posted December 3, 2009

Yes. According to the documentation, the \K assertion has only been present since PHP 5.2.4, which makes the pattern I gave you invalid on your version. Try changing the front half of the pattern from...

#/\K[a-z_-]+\.

...to...

#(?<=/)[a-z_-]+\.

I'm new to lookbehind assertions, but in my weary state that looks about right.
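A minimal sketch of the swap described above, for anyone who wants to check that the two forms agree. The sample HTML and the shortened extension list are assumptions made up for illustration; only the `#/\K` and `#(?<=/)` prefixes come from the thread. The `\K` variant needs a PHP build whose PCRE supports it.

```php
<?php
// Hypothetical sample input; these URLs are invented for the example.
$html = '<a href="http://www.website.com/resources/marketing/sales.pdf">Sales</a> '
      . '<img src="http://www.website.com/images/logo.gif">';

// Shortened extension list, purely to keep the example readable.
$withK          = '#/\K[a-z_-]+\.(?:pdf|gif|jpg)#';     // relies on \K
$withLookbehind = '#(?<=/)[a-z_-]+\.(?:pdf|gif|jpg)#';  // lookbehind form

preg_match_all($withK, $html, $a);
preg_match_all($withLookbehind, $html, $b);

// Both calls collect the full matches in index 0 of the result array.
print_r($b[0]);
```

Both patterns leave the slash out of the reported match: `\K` resets the start of the match after the slash has been consumed, while the lookbehind only asserts that a slash precedes the filename.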
matt.xx Posted December 4, 2009

Worked perfectly, thanks!
nrg_alpha Posted December 5, 2009

(?:pdf|ppt|doc|exe|XLS|jpg|db|gif|mov|avi|mpg|dot|jpeg|mgmf|swf|wmv,mp3|zip|mp4|flv|txt|pps|rtf|tif|bmp|htm|js|docx|bak-2|pptx)

I was drinking water and almost choked when I saw that monster alternation. In general, large alternations like this can be a tad heavy in calculation terms. I thought about streamlining things by writing the above as something like:

(?:(?:pp(?:tx?|s))|(?:pd|[gt]i|mgm|sw|rt)f|(?:do|tx)t|exe|(?:mo|fl)v|wmv,mp3|XLS|jpe?g|zip|mp4|htm|js|docx|bak-2)

I'm sure that can be redone more efficiently as well (I think even the order of things within the alternation can have an impact on its speed, and in some cases on its accuracy).

Then I thought about possibly making use of the 'S' modifier. The purpose of this modifier is to have the regex engine pre-analyze the pattern in question to make matching more efficient. There are exceptions that nullify its effectiveness, but I think it could apply in this case. In some situations the S modifier is applicable but may not amount to much of a time saving, if any, because of the pattern or the target content being checked. All in all, this pretty much behaves like an initial-character discrimination optimization.

What exactly does that mean? To give an example from the Mastering Regular Expressions book, the idea is simple. Assuming you needed to match an abbreviated month, you might typically see a pattern like:

(?:Jan|Feb|Mar... etc)

We can mimic the optimization by amending the pattern with a lookahead assertion:

(?=[JFMASOND])(?:Jan|Feb|Mar... etc)

This forces an initial-character requirement before anything after the assertion is even considered. In other words, if the regex engine doesn't match one of those initial characters, we know we are not matching any of the months in question, which could save some time.

In either case, when I see large alternations like the one in this thread, I can't help but think things can be optimized (perhaps in more ways than one). I suppose that so long as it does the job, that's what matters most, optimized or not. Depending on how much content these kinds of patterns have to check against, an optimization might make things considerably faster, but I suspect this is highly circumstantial.
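The lookahead idea from the month example can be tried out with a short sketch like the one below. The sample text is an assumption made up for illustration, and the full month list is filled in from the "Jan|Feb|Mar... etc" shorthand. Whether the guarded form is actually faster depends on the regex engine and the input, so this only shows that both forms find the same matches.

```php
<?php
// Plain alternation over abbreviated month names.
$plain = '/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/';

// Same alternation guarded by a lookahead on the possible first letters,
// so a position that cannot start a month is rejected by a single check.
$guarded = '/(?=[JFMASOND])(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/';

// Hypothetical text, invented for the example.
$text = 'Posted Dec 3, 2009; edited Jan 4.';

preg_match_all($plain, $text, $m1);
preg_match_all($guarded, $text, $m2);

// The guard changes how candidates are filtered, not what is matched.
print_r($m2[0]);
```

Timing both patterns against real content is the only reliable way to tell whether the guard is worth the extra noise in the pattern.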
cags Posted December 5, 2009

[ot]Thanks for the information; interesting example with the lookahead assertion. If I'm honest, I didn't even think of optimization when I wrote it. I simply copied what the OP had entered and did a find-and-replace to swap each ', ' for a '|'. I'm sure that if I'd put my mind to it I could have come up with something closer to your first example. Having said that, I'm not exactly clued in on the internals of regex, so in some cases I wouldn't know for sure which 'optimizations' would actually improve performance. That's the main reason my first attempt tends to be the most 'obvious' solution (as also seen in the 'valid date' thread a while back).[/ot]
salathe Posted December 6, 2009

Quoting nrg_alpha: "I was drinking water and almost choked when I saw that monster alternation"

I'm sure folks looking at your regular expression might well experience the same.

nrg_alpha's expression is slightly more optimised than cags', but there are a few points to consider. The first, and I believe foremost, is that most people would be able to look at cags' and know which extensions are in the list to be matched. Can anyone glance at nrg_alpha's and say the same? (That's a rhetorical question; I'm sure some particularly gifted person could.) I believe this carries far more weight than any of the improvements that might come from optimisations like the one nrg_alpha presented.

Next, onto the optimisation itself. The biggie in terms of optimising the expression is that the number of alternations is reduced (20 vs 29), resulting in less backtracking even in the worst-case scenario (i.e. the last possible alternative). However, as is often the case, the impact is negligible [1] in the grand scheme of things (in real-time terms). As was alluded to in nrg_alpha's post above, an optimisation with a larger net effect would be to re-order the alternations so that the most common file extensions are checked first [2]. Of course, that would not improve the speed of execution for the extensions at the end of the list, nor would the benefit be apparent if the numbers of files of each extension were similar.

There are a couple of things to point out. First is the danger of copy-and-pasting: cags made (what I interpret as) a mistake in his post (…wmv,mp3…), which was then repeated in nrg_alpha's post and, in all likelihood, in the OP's code. Second is that cags' pattern will stop once it has found a match: given …/abc.pptx it will report matching abc.ppt, since that alternative comes first and is successfully matched.

Depending on the specific needs of the OP, it might simply be worth anchoring the pattern with a $ (and, of course, the D modifier) or using some other determinant to make sure that only the one correct extension is matched.

--
[1] This is mostly based on experience and also a very, very unscientific benchmark.
[2] Ditto the note above.
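The two pitfalls pointed out in this post (the `wmv,mp3` slip and `ppt` being tried before `pptx`) can be reproduced with a small sketch. The extension list is shortened and the input paths are hypothetical, invented only to trigger each case.

```php
<?php
// Shortened version of the thread's pattern, deliberately keeping the
// "wmv,mp3" slip and the ppt-before-pptx / doc-before-docx ordering.
$pattern = '#(?<=/)[a-z_-]+\.(?:pdf|ppt|doc|wmv,mp3|zip|docx|pptx)#';

// Hypothetical paths, made up for illustration.
$input = '/files/abc.pptx /files/song.mp3 /files/clip.wmv /files/notes.docx';

preg_match_all($pattern, $input, $found);

// abc.pptx and notes.docx are reported with truncated extensions, and the
// .mp3 and .wmv files are missed because "wmv,mp3" is a single alternative.
print_r($found[0]);
```

With this input the pattern reports `abc.ppt` and `notes.doc`, and finds neither `song.mp3` nor `clip.wmv`.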
cags Posted December 6, 2009

Quoting salathe: "There are a couple of things to point out. First is the danger of copy-and-pasting: cags made (what I interpret as) a mistake in his post (…wmv,mp3…), which was then repeated in nrg_alpha's post and, in all likelihood, in the OP's code."

Good spot, salathe. It's so obvious now you've pointed it out, but I didn't see that at all.

Quoting salathe: "Depending on the specific needs of the OP, it might simply be worth anchoring the pattern with a $ (and, of course, the D modifier) or using some other determinant to make sure that only the one correct extension is matched."

I don't see it being possible to use the anchor since, from what I can tell, the OP is matching the strings within a larger string. Perhaps appending \b to the end of the pattern would be a better solution.
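A minimal sketch of the `\b` suggestion, assuming the `wmv,mp3` alternative is split into `wmv|mp3`. The extension list is shortened and the paths are hypothetical. Note that `\b` only requires a word/non-word transition, so a name such as `sales.pdf-old` would still be reported as `sales.pdf`.

```php
<?php
// Word boundary after the extension: "ppt" can no longer match inside
// "pptx", so the engine falls through to the longer alternative.
$pattern = '#(?<=/)[a-z_-]+\.(?:pdf|ppt|doc|wmv|mp3|zip|docx|pptx)\b#';

// Hypothetical paths; "page.pdfx" is not a listed extension.
$input = '/files/abc.pptx /files/song.mp3 /files/notes.docx /files/page.pdfx';

preg_match_all($pattern, $input, $found);

// Full extensions are reported and the unknown ".pdfx" is skipped.
print_r($found[0]);
```

Listing the longer extensions first (`pptx` before `ppt`, `docx` before `doc`) would fix the truncation too, but unlike `\b` it would still report `page.pdf` for `page.pdfx`.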
salathe Posted December 6, 2009

Quoting cags: "I don't see it being possible to use the anchor since, from what I can tell, the OP is matching the strings within a larger string. Perhaps appending \b to the end of the pattern would be a better solution."

I did prefix that with "Depending on the specific needs of the OP". The post alluded to the subject string being a larger document, but we all know what posters are like with their details and requirements.

Quoting cags: "Good spot, salathe. It's so obvious now you've pointed it out, but I didn't see that at all."

I believe there's a technical term for the problem of not seeing mistakes created by your own hand, but right now it escapes me. A fresh pair of eyes (be they your own or someone else's) is always useful, and besides, mistakes give us something to post about.
nrg_alpha Posted December 6, 2009

Quoting salathe (replying to "I was drinking water and almost choked when I saw that monster alternation"): "I'm sure folks looking at your regular expression might well experience the same."

lol Well, with a monster alternation like that, it becomes hard not to!

Quoting salathe: "The first, and I believe foremost, is that most people would be able to look at cags' and know which extensions are in the list to be matched. Can anyone glance at nrg_alpha's and say the same? (That's a rhetorical question; I'm sure some particularly gifted person could.)"

Yeah, my attempt was certainly not an improvement in 'readability', that's for sure (it made it worse, in fact). The idea was to simplify things instead of listing largely redundant options (re-ordering the content was obviously not taken into account). But it certainly is ugly, I'll admit.

Quoting salathe: "There are a couple of things to point out. First is the danger of copy-and-pasting: cags made (what I interpret as) a mistake in his post (…wmv,mp3…), which was then repeated in nrg_alpha's post and, in all likelihood, in the OP's code."

D'oh! I must have been brain dead on that one! Hehe..