
Preg_Match_All for filepaths


neuroxik


Hi everyone,

 

I'll include some details, as I've just read the sticky and really want to respect the guidelines. This is a PCRE (preg) flavor regex, as suggested by preg_match_all(), which I want to use.

 

And foremost: I am not someone looking for a ready-made answer (although I wouldn't refuse one). Any guidance toward building the regex I need will be greatly appreciated: hints, explanations. The reason is I'm a pretty good PHP programmer, but I suck at regex, although I know I'll have to catch up on it one day.

 

My problem: (and some contextual background, as per the *sticky* instructions)

I am working on a file analyser for Mixcraft (a DAW) project files, to be precise. The "problem" in Mixcraft is that one can COPY project files (WAV/OGG/etc.) to a new location, but there is no MOVE option that would delete the old, unused files. What I did (manually) for years was make a list in Notepad, then skim through my hard drive for these filenames and delete them one by one. Some filenames are hard to discern from within the program when they are cross-faded (musical term) into others and cut short. To make it simple: I wanted a simple (PHP-based) solution where I upload the project file, which contains references to the sound files (mostly WAVs) as strings scattered amongst the gibberish characters one might expect when opening a file generated to be interpreted by a desktop application (probably programmed in C# or something). I'll give you a glimpse of the file I am trying to read, replacing the gibberish characters with question marks because they seem to parse improperly when pasted here.

 

Short file sample:

????????????????????????????????????????????????????????????????C:\Documents and Settings\neuroxik\My Documents\My Recordings\01 Audio

Track-8.wav????????????????????????????????????????????????????????????????????????????????????????

??????????????????????????????????????????????\..\..\..\..\Catch Me\bounce-track 01 6-3-2011 ID3.WAV?????????????????????

 

I've highlighted the searched strings above in red. What I want is a list (preferably in an array, which is why I chose preg_match_all, to find ALL occurrences) of the file paths found in these files. I've been working on the rest and will keep doing so, but I'm stuck on the regex part. Here are some important points:

  • They're ALWAYS (Windows) filepaths, either relative (like the second one above) or absolute.
  • More than one (or none at all) filepaths can be found per line
  • A filepath can start on one line and end on the other, as between lines 1 and 2 in my example above
  • The file extension at the end needs to be case-insensitive (WAV|wav) and, if possible, I'd like to use an array of predetermined file extensions, such as WAV, OGG, MP3, etc.
  • The drive letter is not bound to be C, but can be, when the filename happens to be absolute
  • If this is too complicated, I would be VERY HAPPY even if I can only match on something like this: \Audio File.wav (backslash, filename which may include numbers, letters, spaces, apostrophes and some "legal filename characters", followed by a dot and an array of wav|ogg|mp3 file extension)
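
The fallback in the last bullet (backslash, filename, dot, extension) could be sketched roughly as below. This is only an illustration: the sample string, the variable names and the choice of excluded characters (the ones Windows forbids in filenames) are assumptions, not something taken from the real project file.

```php
<?php
// Hypothetical sample: '?' stands in for the binary filler, as in the post.
$sample = '??C:\\My Recordings\\Audio File.wav??\\..\\Catch Me\\it\'s ID3.OGG??';

// Assumed extension list; the user's real list may differ.
$extensions = array('wav', 'ogg', 'mp3');

// A backslash, then one or more characters that are legal in a Windows
// filename (anything except \ / : * ? " < > |), then a dot and an extension.
// The 'i' modifier makes the extension case-insensitive.
$pattern = '~\\\\[^\\\\/:*?"<>|]+\.(' . implode('|', $extensions) . ')~i';

preg_match_all($pattern, $sample, $found);

// $found[0] holds the full matches: the bare "\filename.ext" parts only.
print_r($found[0]);
```

Because the character class excludes the backslash, each match starts at the last backslash before the filename, so folders and drive letters are left out.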

 

If any further data or code is needed to help me, just tell me. I am more than grateful for any suggestions, hints, code snippets or "how to's". Thanks a lot in advance!

 


try

<?php
$test ='????????????????????????????????????????????????????????????????C:\Documents and Settings\neuroxik\My Documents\My Recordings\01 Audio 
Track-8.wav????????????????????????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????\..\..\..\..\Catch Me\bounce-track 01 6-3-2011 ID3.Ogg?????????????????????';
$extensions = array('wav','ogg','mp3');
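// (.:)? optionally matches a drive prefix such as "C:"; 's' lets the dot cross line breaks, 'i' ignores case
preg_match_all('~(.:)?\\\\.+?\.('.implode('|',$extensions).')~is', $test, $out);
echo '<pre>', print_r($out[0], true), '</pre>'; // true makes print_r() return its output instead of printing it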
?>


 

Wow, this actually works perfectly! Thanks. I mean, really: Thank you!!

 

Just a few questions, if you don't mind:

 

Last night, I was reading a lot about PCRE, trying to figure this out (I would never have come up with something as clever and nice as yours). About the modifiers: I wanted to use 'm' for multiline, and I also wanted it to be case-insensitive (i), but I didn't know you could just use more than one like you did above! Okay, so my question: you didn't use the 'm' modifier above, yet it gets the first occurrence, which spans two lines. How does it accomplish that without the 'm'?
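
A small sketch of the difference between the two modifiers, under the assumption of a simplified pattern and a made-up two-line string (neither is from the thread): 'm' only changes where ^ and $ may match, while 's' is the modifier that lets the dot match a line break.

```php
<?php
// Hypothetical sample: a path broken across two lines.
$text = "??C:\\My Recordings\\01 Audio\nTrack-8.wav??";

// Simplified pattern: a backslash, then as few characters as possible, then ".wav".
$base = '~\\\\.+?\.wav~';

// No modifier: the dot stops at the newline, so nothing matches.
$plain = preg_match($base, $text);

// 'm' (multiline) only affects the ^ and $ anchors; the dot still stops at the newline.
$multiline = preg_match($base . 'm', $text);

// 's' (dotall) lets the dot match the newline, so the split path is found.
$dotall = preg_match($base . 's', $text, $m);

var_dump($plain, $multiline, $dotall);
```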

 

And (if you've still got some patience with my questions), the questions marks in this part:

(.:)?\\\\.+?

, do they represent the greedy thing? (I'm sorry I don't have the proper term, I didn't quite understand how it can be "greedy" on some parts and not on others).
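
As a hedged illustration of the two roles a question mark can play (the sample strings below are made up): after a group or a single character it means "optional", and after a quantifier such as + it makes that quantifier lazy instead of greedy.

```php
<?php
// Hypothetical sample with two candidate endings.
$text = 'a.wav b.wav';

// Greedy: .+ grabs as much as it can, then backs off to the LAST ".wav".
preg_match('~.+\.wav~', $text, $greedy);

// Lazy: .+? grabs as little as it can, so it stops at the FIRST ".wav".
preg_match('~.+?\.wav~', $text, $lazy);

// Optional: (.:)? matches a drive prefix when present and nothing otherwise.
preg_match('~(.:)?\\\\~', 'C:\\dir', $absolute);   // absolute path start
preg_match('~(.:)?\\\\~', '\\..\\dir', $relative); // relative path start

print_r(array($greedy[0], $lazy[0], $absolute[0], $relative[0]));
```

Greediness is a property of each quantifier separately, which is why one pattern can be greedy in one place and lazy in another.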

 

If you've still got enough patience at this point: I'm guessing .+?\. represents the folders between the drive name and the file itself in the string. My question is: do the dots in that snippet represent escapes, because I saw them around your implode function for the file extensions array, a bit like one would use: echo "bla".$var."bla"; // or maybe I'm on the wrong track.
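
To separate the two kinds of dots, here is a sketch that assumes the pattern from the answer with an optional drive prefix. The dots outside the quotes are PHP's string concatenation operator, exactly as in echo "bla".$var."bla"; inside the pattern, a bare dot means "any character" and \. means a literal dot.

```php
<?php
$extensions = array('wav', 'ogg', 'mp3');

// The dots around implode() are PHP concatenation: they glue three strings
// together before the regex engine ever sees the result.
$pattern = '~(.:)?\\\\.+?\.(' . implode('|', $extensions) . ')~is';

// Show the finished pattern as the regex engine receives it.
echo $pattern, "\n";

// Inside a pattern: a bare dot matches any character, \. only a real dot.
$anyChar      = preg_match('~a.c~', 'abc');  // dot matches "b"
$literalMiss  = preg_match('~a\.c~', 'abc'); // needs a real dot, finds none
$literalMatch = preg_match('~a\.c~', 'a.c'); // real dot present
```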

 

Anyhow, even if you don't have time for my questions, I'll understand, but thank you so much again.


Sorry for bumping; for some reason the "edit" button has disappeared from my previous post.

 


 

I've run the regex on the file in question and ran into some trouble. I've printed the matches numbered and, as you can see here, TWO paths were matched in the 1st array index. Also, lines 3, 5, and 15 have leading characters included in the string. The analysed file is ANSI, not UTF; I mention it because I thought the encoding might have been the problem. Do you have any clue why this happens?

 

Code used:

require_once('../config.php');
$to_scan = file_get_contents("../".UPD_COPIED_FPATH);
$extensions = array('wav','ogg','mp3');
preg_match_all('~(.:)?\\\\.+?\.('.implode('|',$extensions).')~is', $to_scan, $out); // no modification here except for var name
$i = 1;
foreach($out[0] as $k=>$v) {
echo "<b style=\"color:red;display:block; width:20px; float:left;\">".$i."</b>".$v."<br />";
$i++;
}

 

On the non-regex side, I'm guessing the problem may have arisen from using file_get_contents(), so I'll try with binary-safe fread() and come back. Hopefully the edit button will still be here.
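
For what it's worth, file_get_contents() is itself documented as binary-safe, so it is unlikely to be the culprit. A minimal round-trip sketch, where the byte string and the temporary file are made up for the test:

```php
<?php
// Hypothetical binary-ish content: a NUL byte, a high byte, a path, CRLF, Ctrl-Z.
$bytes = "\x00\xFF??C:\\a\\take.wav\r\n\x1A";

// Write it to a temporary file and read it back unchanged.
$tmp = tempnam(sys_get_temp_dir(), 'mx');
file_put_contents($tmp, $bytes);
$read = file_get_contents($tmp);
unlink($tmp);

// true when every byte survived the round trip.
var_dump($read === $bytes);
```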


Try changing the . (dot), which matches any character, to [^?], which matches any character except ?:

'~(.:)?\\\\[^?]+?\.('

btw

your 1st link has wrong extension
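
The effect of swapping the dot for [^?] can be sketched on a made-up string. It assumes the filler really is a literal "?" as in the posted sample (in the real binary file the filler bytes may differ) and that one path has an extension outside the list.

```php
<?php
// Hypothetical sample: a .txt path (not in the list), filler, then a .wav path.
$text = '??C:\\a\\notes.txt????D:\\b\\take.wav??';
$exts = 'wav|ogg|mp3';

// With the dot, the match starts at C:\ and runs through the filler
// until the first listed extension, swallowing two paths in one match.
preg_match_all('~(.:)?\\\\.+?\.(' . $exts . ')~is', $text, $withDot);

// With [^?], a match can never cross a question mark, so the attempt that
// starts at C:\ fails and only the real .wav path is returned.
preg_match_all('~(.:)?\\\\[^?]+?\.(' . $exts . ')~is', $text, $withClass);

print_r($withDot[0]);
print_r($withClass[0]);
```

A negated class also matches line breaks without the 's' modifier, which the dot does not.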

 

This is CRAZY !

Since I was running out of time, I worked around the problem with a bit of overkill. I had noticed the problem often arose AFTER a question mark, so I wrote this quickly: (you don't have to read it, I'm just trying to make a point)

function stripLeadingQM($arr=null) { // strip leading question mark and gibberish
// okay, so some instances aren't the ?C:\ but [gibberish][drive_letter]:\ .... so, take out the "?" and just search [A-Z]:\ ?
if(is_null($arr)||empty($arr)) return $arr;
if(is_array($arr)) {
	foreach($arr as $k=>$v) { // chk later in the str :: find occurrence of ?[A-Z]: or ?\
		$strpos = strpos($v,'?C');
		if($strpos!==FALSE) $arr[$k] = substr($v,$strpos + 1);
	}
}
else { // single string: $arr is the string itself ($v only exists inside the loop above)
	if(substr($arr,0,1)=='?') $arr = substr($arr,1);
	else { // chk later in the str
		$strpos = strpos($arr,'?C');
		if($strpos!==FALSE) $arr = substr($arr,$strpos + 1);
	}
}
return $arr;
}
// ....
preg_match_all('~([A-Z]:)?\\\\[^?]+?\.('.implode('|',$legal_exts).')~i', $to_scan, $out); // no 's' modifier needed: [^?] matches line breaks by itself
$i = 1;
$filepaths_tmp = array('full_path' => array());
foreach($out[0] as $k=>$v) {
$scan_this_line = preg_match_all('~\?([^\?]+)([A-Z]:)?\\\\.+?\.('.implode('|',$legal_exts).')~i', $v, $out2ndpass); 
if(!empty($scan_this_line)) { // for cases where 1st regex doesn't strip all leading gibberish
	$out2ndtreated = stripLeadingQM($out2ndpass[0]);
	$v = $out2ndtreated[0];

	$patt = '~[A-Z]:\\\\~';
	$sp = preg_match($patt,$v,$outvar,PREG_OFFSET_CAPTURE);
	if($sp===1) { // preg_match() returns 1 on a match; $outvar itself is always an array
		$v = substr($v,$outvar[0][1]);
	}
}
if(!in_array($v,$filepaths_tmp['full_path'])) { // de-duplicate against the list that is actually filled
	if(substr($v,0,1)!="\\") $filepaths_tmp['full_path'][] = $v; // skip relative paths :: then sort by diff folder (etc) somewhere later
}
$i++;
}

 

Now this was working fine, but took more resources than ONE regex. I've tried what you said by replacing the dot by a [^?] and it works EXACTLY the same, but without all that redundant looping and extra string treatment. THANK YOU ENORMOUSLY! This is really like magic.

 

You can mark this topic as solved if you have nothing to add. Thanks so much.

 
