Match a duplicate folder in a path (back reference not working?)

nathanziarek · October 2, 2008

I cannot figure out why this isn't working, but I know it must be something incredibly simple.

I accidentally ruined some paths in a database and want to fix them.

Should be:

I want to match and then erase one of the duplicate folders (they may be trailing or in the middle of the path).

Here is what I've been trying, but it simply doesn't work:

@(/[^/]+)\1@

(using @ as the delimiter)

I've also tried:

@/([^/]+)/\1@

(using @ as the delimiter)

Thoughts?

Thanks!

Nate

DarkWater · October 2, 2008

<?php
$url = "http://www.example.com/nate/joe/joe/tom";
$new_url = preg_replace('!/([^/]+)/\1!', '/$1', $url);
echo $new_url;
?>

Works for me.
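
One caveat worth checking against your own data: as written, the pattern also fires when the second folder merely starts with the first (/joe/joey becomes /joey), and it collapses a triple (/joe/joe/joe) only down to a double. A minimal sketch of a stricter variant, assuming only whole-folder repeats should be collapsed; the helper name collapse_repeated_folders and the lookahead are additions for illustration, not part of the post above:

```php
<?php
// Hypothetical helper: collapse runs of an identical folder name into one.
function collapse_repeated_folders(string $path): string
{
    // (?:/\1)+  one or more repeats of the captured folder
    // (?=/|$)   the last repeat must end at a slash or at the end of the string
    return preg_replace('!/([^/]+)(?:/\1)+(?=/|$)!', '/$1', $path);
}

echo collapse_repeated_folders('http://www.example.com/nate/joe/joe/tom'), PHP_EOL;
echo collapse_repeated_folders('/nate/joe/joe/joe/tom'), PHP_EOL;
echo collapse_repeated_folders('/nate/joe/joey/tom'), PHP_EOL; // left unchanged
```

The single quotes matter here too, for both the pattern and the replacement.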

nathanziarek · October 2, 2008

I should note my code:

preg_match_all("@/([^/]+)/\1@", "http://www.example.com/nate/joe/joe/tom/", $m);
print_r($m);
Array (
 [0] => Array (
 )
 [1] => Array (
 )
)

Oddly, removing the back reference treats the dual "joe" folders as non-unique and only lists one. I am clearly missing something here!

preg_match_all("@/([^/]+)/@", "http://www.example.com/nate/joe/joe/tom/", $m);
print_r($m);
Array (
 [0] => Array (
    [0] => /www.example.com/
    [1] => /joe/
    [2] => /tom/
 )
 [1] => Array (
    [0] => www.example.com
    [1] => joe
    [2] => tom

 )
)

I've also tried preg_match, thinking there was some difference I am unaware of. It delivers similar results.

DarkWater · October 2, 2008

Why are you using preg_match() or preg_match_all()? You should be using preg_replace().

nathanziarek · October 2, 2008

Works for me.

Yeah, me too. Something with the double quotes. Once I changed them to single quotes (as you had) it all started working. Don't get why, but I know enough to just not ask questions sometimes.

Thanks DW!

nathanziarek · October 2, 2008

Why are you using preg_match() or preg_match_all()? You should be using preg_replace().

I'm not 100% sure that all of the duplicate folders like that are actually wrong (these aren't my files or folder structure), so I wanted to just print out a list I could quickly scan before making the change permanent. I've already done enough damage.

nz
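
Since the goal is a list to eyeball before changing anything, here is a sketch of that review step using preg_match_all with a single-quoted pattern. The $paths array and the function name find_duplicate_folders are made up for the example; the real rows would come from the database.

```php
<?php
// Hypothetical sample rows; in practice these would be read from the database.
$paths = [
    'http://www.example.com/nate/joe/joe/tom/',
    'http://www.example.com/nate/joe/tom/',
    'http://www.example.com/a/b/b',
];

// Return the folder names that appear twice in a row in $path.
function find_duplicate_folders(string $path): array
{
    preg_match_all('@/([^/]+)/\1@', $path, $m);
    return $m[1]; // capture group 1 holds the repeated folder name
}

// Print only the rows that contain a doubled folder, for manual review.
foreach ($paths as $path) {
    $dupes = find_duplicate_folders($path);
    if ($dupes) {
        echo $path, ' => ', implode(', ', $dupes), PHP_EOL;
    }
}
```

Nothing is modified here, so the list can be checked by hand before running preg_replace.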

DarkWater · October 2, 2008

Always use single quotes for regexes. =P I didn't even notice you used double quotes.
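
The reason, for anyone finding this later: inside a double-quoted PHP string, \1 is read as an octal character escape, so the regex engine never sees a back reference; it sees a literal byte with value 1. A quick sketch showing the difference (the variable names are just for illustration):

```php
<?php
$url = 'http://www.example.com/nate/joe/joe/tom/';

$double  = "@/([^/]+)/\1@";  // PHP turns \1 into the single byte chr(1)
$single  = '@/([^/]+)/\1@';  // the backslash survives, PCRE sees a back reference
$escaped = "@/([^/]+)/\\1@"; // double quotes also work if the backslash is doubled

var_dump(strlen($double), strlen($single)); // int(12) and int(13)
var_dump(preg_match_all($double, $url));    // int(0)
var_dump(preg_match_all($single, $url));    // int(1)
var_dump($single === $escaped);             // bool(true)
```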

ghostdog74 · October 4, 2008

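$string = "http://example.com/nate/joe/joe/tom/";
$a = explode("//", $string);               // split the scheme from the rest
$b = explode("/", $a[1]);                  // split host and path into segments
$removed = implode("/", array_unique($b)); // drop repeated segments and rejoin
echo $a[0] . "//$removed";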
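
One thing to keep in mind with array_unique: it drops every later repeat of a value, not only a repeat that sits right next to the original, so a path that legitimately reuses a folder name further down would be changed too. A sketch of an adjacent-only variant, assuming that is the behaviour wanted (collapse_adjacent_segments is a hypothetical name, and the sample path is invented):

```php
<?php
// Hypothetical helper: remove a segment only when it repeats its direct neighbour.
function collapse_adjacent_segments(string $path): string
{
    $out = [];
    $prev = null;
    foreach (explode('/', $path) as $segment) {
        // Empty segments (from "//" or a trailing slash) are always kept.
        if ($segment !== '' && $segment === $prev) {
            continue;
        }
        $out[] = $segment;
        $prev = $segment;
    }
    return implode('/', $out);
}

$path = 'photos/2008/photos/photos/img';
echo implode('/', array_unique(explode('/', $path))), PHP_EOL; // photos/2008/img
echo collapse_adjacent_segments($path), PHP_EOL;               // photos/2008/photos/img
```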

nrg_alpha · October 5, 2008

This tends to be a lot of work for a simple task that is best done using Darkwater's solution.

Keeping the code small with the least amount of steps is most recommended.

ghostdog74 · October 6, 2008

i don't agree with you. yes it looks like it's a lot of work, but these string/array methods are so common that anyone looking at the code knows what's going on and what it's doing. it's true regex is short and sharp, but when it comes to troubleshooting, anyone who has to read your code with lots of regex is going to have a hard time. anyway, the simplest solution might be using parse_url, instead of wasting time constructing regex.

nrg_alpha · October 6, 2008

i don't agree with you. yes it looks like it's a lot of work, but these string/array methods are so common that anyone looking at the code knows what's going on and what it's doing.

Code readability has no bearing on code performance. Clean and efficiently written code (whether some can read / understand it or not) will work better than code that may be more readable yet employ multiple extra steps. Good code should never be 'dumbed down' for others who might have problems troubleshooting. The point of writing good clean, fast and efficient code is to achieve a solution to a problem in the most optimal fashion. If anything, good code should encourage those who are not comfortable with it to further educate themselves so that they may embrace the level of better coding practices. There are many ways to solve a problem when dealing with programming. It is the most robust and optimal solutions that should matter most.

it's true regex is short and sharp, but when it comes to troubleshooting, anyone who has to read your code with lots of regex is going to have a hard time. anyway, the simplest solution might be using parse_url, instead of wasting time constructing regex.

Regex may not be the most readable.. but again.. it boils down to speed and efficiency. It is an invaluable tool in any serious PHP programmer's arsenal. Well written patterns prove to be quite efficient and effective.

And finally, parse_url would not in itself solve the OP's problem. That function merely breaks down a URL into multiple components.

consider:

$url = 'http://example.com/nate/joe/joe/tom/';
$urlParsed = parse_url($url);
echo '<pre>';
print_r($urlParsed);
echo '</pre>';

Output:

Array
(
    [scheme] => http
    [host] => example.com
    [path] => /nate/joe/joe/tom/
)

This in no way solves the issue at hand. You would need additional code, all the while Darkwater's pattern solves this quite nicely using one simple line.
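
For what it's worth, the two ideas combine: parse_url isolates the path and the same pattern is then applied to the path only. A minimal sketch, assuming the URL has nothing but a scheme, a host and a path (a port, query string or fragment would need to be put back as well):

```php
<?php
$url = 'http://example.com/nate/joe/joe/tom/';

// Break the URL apart, then fix only the path component.
$parts = parse_url($url);
$parts['path'] = preg_replace('!/([^/]+)/\1!', '/$1', $parts['path']);

// Reassemble; assumes only scheme, host and path are present.
$fixed = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];
echo $fixed, PHP_EOL;
```

Restricting the pattern to the path keeps it away from the host name, at the cost of the extra reassembly step.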

ghostdog74 · October 6, 2008

Code readability has no bearing on code performance. Clean and efficiently written code (whether some can read / understand it or not) will work better than code that may be more readable yet employ multiple extra steps.

work better? in what ways? anyway, i am not arguing with you on code performance. I am merely showing OP there are simpler ways to do things than spending time constructing complex regular expressions, which he seems to have a headache with, since he needed to post for help.

Good code should never be 'dumbed down' for others who might have problems troubleshooting. The point of writing good clean, fast and efficient code is to achieve a solution to a problem in the most optimal fashion. If anything, good code should encourage those who are not comfortable with it to further educate themselves so that they may embrace the level of better coding practices. There are many ways to solve a problem when dealing with programming. It is the most robust and optimal solutions that should matter most.

Good code should be immediately understandable to someone who might not be the original coder. You have said it, there are many ways to do things and one of them is using less regexp.

Regex may not be the most readable.. but again.. it boils down to speed and efficiency.

speed & efficiency is very subjective, it depends on how well the coder understands his concepts. A not well written regexp can be slow too.

And finally, parse_url would not in itself solve the OP's problem. That function merely breaks down a URL into multiple components.

consider:

its only another example of getting the last part. further processing then can be done, like split and array_unique i suggested. so do you think OP's problem can't be solved without regexp?

This in no way solves the issue at hand. You would need additional code, all the while Darkwater's pattern solves this quite nicely using one simple line.

so what if there's additional code? As long as its understandable, its fine.

nrg_alpha · October 6, 2008

Code readability has no bearing on code performance. Clean and efficiently written code (whether some can read / understand it or not) will work better than code that may be more readable yet employ multiple extra steps.

work better? in what ways?

The less processing needed to perform a task, the less prone to error, the better. In that way

anyway, i am not arguing with you on code performance. I am merely showing OP there are simpler ways to do things than spending time constructing complex regular expressions, which he seems to have a headache with, since he needed to post for help.

Well, not everyone is equally well versed in regex. That's what these forums are for. To offer help to people. For some, yes regex is a headache. To others, not so much.

The difficulty is relative to the experience of the coder in question. This doesn't mean regex is a poor solution by any stretch.

Good code should be immediately understandable to someone who might not be the original coder.

And if the non-original coder is a novice looking at code written by someone with a solid decade of high quality experience? Sorry, but your statement is incorrect. Good code should perform its task in a fast, efficient and bug-free manner. If someone else cannot read robust quality code, this is not the fault of the advanced programmer.

For example, I have seen someone present a problem that took say 15 lines of code to solve. I would come along, and offer a better solution at say 7 lines.. Then one of the 'big boys' would come and solve the problem in 3. I would look at their code and not understand what they have done. Does this make their code bad? Nope.. turns out, my knowledge / standards were not high enough. I had to spend some time reverse engineering their solution to understand.. I walk away much more knowledgeable than before. And if I can use their methodologies to solve a future problem (whether others can understand it or not), I would.

You have said it, there are many ways to do things and one of them is using less regexp.

I never said using less regex.. I said there are many solutions to a problem. My point is that not all solutions are equal (as far as efficiency is concerned). That was my point. Some solutions are better (read, less code bloat, faster executions, less prone to error, etc..) than others.

Regex may not be the most readable.. but again.. it boils down to speed and efficiency.

speed & efficiency is very subjective, it depends on how well the coder understands his concepts. A not well written regexp can be slow too.

Speed and efficiency is not subjective.. it boils down to CPU cycles and such. Perhaps you are thinking in terms of time and effort to write code? If so, then yes, I agree.. but when I mention speed and efficiency, I am talking from the standpoint of program execution. Sorry if there was any confusion there. And yes, you are right... poorly written regex can be slower.. in this case, Darkwater's case, it is not poorly written.

And finally, parse_url would not in itself solve the OP's problem. That function merely breaks down a URL into multiple components.

consider:

its only another example of getting the last part. further processing then can be done, like split and array_unique i suggested. so do you think OP's problem can't be solved without regexp?

That's my point.. it involves extra program execution. I am not suggesting that the OP's problem cannot be solved without regex. It obviously can be solved without regex to be sure. But a problem like the OP's is perfectly suited to regex.. because the task is to find a repeating pattern and replace it.

This in no way solves the issue at hand. You would need additional code, all the while Darkwater's pattern solves this quite nicely using one simple line.

so what if there's additional code? As long as its understandable, its fine.

This is where we clearly differ. You equate 'understandability' as being fine. I equate elegantly fast and efficient bug-free code as being fine.

So in the end, we can agree to disagree.

ghostdog74 · October 6, 2008

The less processing needed to perform a task, the less prone to error, the better. In that way

have you measured how many behind-the-scene "processes" the regexp engine creates when parsing regexp? if you were to go behind the scene, what its doing is similar.

Good code should perform its task in a fast, efficient and bug-free manner. If someone else cannot read robust quality code, this is not the fault of the advanced programmer.

that's part of it. Good code should also be maintainable and readable. I am sure you don't code your program in binary or numbers, do you? Same logic: too many symbols make code unreadable.

For example, I have seen someone present a problem that took say 15 lines of code to solve. I would come along, and offer a better solution at say 7 lines.. Then one of the 'big boys' would come and solve the problem in 3. I would look at their code and not understand what they have done. Does this make their code bad? Nope.. turns out, my knowledge / standards were not high enough. I had to spend some time reverse engineering their solution to understand.. I walk away much more knowledgeable than before. And if I can use their methodologies to solve a future problem (whether others can understand it or not), I would.

So it's even better if i come and present a one line solution that crams everything together?? A 15-line program, properly indented, with understandable english words, that works just as fast as regexp sure beats one that's incomprehensible and has fewer lines of code.

Speed and efficiency is not subjective.. it boils down to CPU cycles and such. Perhaps you are thinking in terms of time and effort to write code? If so, then yes, I agree.. but when I mention speed and efficiency, I am talking from the standpoint of program execution. Sorry if there was any confusion there. And yes, you are right... poorly written regex can be slower.. in this case, Darkwater's case, it is not poorly written.

yes indeed what i meant. however, have you conducted experiments to see whether regexp are indeed better in terms of speed than native string functions to solve problems?

That's my point.. it involves extra program execution. I am not suggesting that the OP's problem cannot be solved without regex. It obviously can be solved without regex to be sure. But a problem like the OP's is perfectly suited to regex.. because the task is to find a repeating pattern and replace it.

I didn't say its not suitable. Remember, i provided a non regex solution to OP, but you said its a lot of work. I say its not, because behind the regexp engine is also doing work.

This is where we clearly differ. You equate 'understandability' as being fine. I equate elegantly fast and efficient bug-free code as being fine.

So in the end, we can agree to disagree.

yes indeed. It's all up to the OP to consider and use the solutions. There's really no point continuing to argue here.

nrg_alpha · October 6, 2008

have you measured how many behind-the-scene "processes" the regexp engine creates when parsing regexp? if you were to go behind the scene, what its doing is similar.

I personally have not. But Jeffrey Friedl's book Mastering Regular Expressions gives a good look at what regex engines are doing behind the scenes. If you have not picked up and read his book, I suggest you do (if you are interested at all in regex). Obviously, for tasks such as simply finding a straightforward set of letters and replacing it, it would be wiser to use str_replace instead of preg_replace.. but in cases like the problem posed by the OP in this thread, PCRE is perfectly useful.

that's part of it. Good code should also be maintainable and readable. I am sure you don't code your program in binary or numbers, do you? Same logic: too many symbols make code unreadable.

I agree. The question here is: whose standards dictate maintainability and readability? For example, are you suggesting that Darkwater's solution is complex and not readable and maintainable? For those not familiar with regex, yes, perhaps so. But those who have a better understanding will not be lost in the slightest. I'll say it again.. good code should not be 'dumbed down' for those less knowledgeable. If the code is well written (even at advanced enough levels), this does not make it unreadable / unmaintainable to everyone. I agree that it may not be readable to everyone. But who do we cater to? The absolute beginners? The absolute coding savants? I say code to the best of your abilities.

So it's even better if i come and present a one line solution that crams everything together?? A 15-line program, properly indented, with understandable english words, that works just as fast as regexp sure beats one that's incomprehensible and has fewer lines of code.

I believe that less lines will translate to better performance in the end, yes. The more hoops the parser has to go through.. the worse it gets...

It does of course highly depend on the task at hand. In this particular case, the speed difference between your solution and Darkwater's is negligible at best (because of how small the problem is). On a much larger scale, the differences would likely be much more pronounced. For me, it's a matter of coding principle. If I can get the code nice and neat and have it work with fewer lines, so much the better. It CAN still be presentable and readable and maintainable (as in Darkwater's case). So yes, I think that less is better.

yes indeed what i meant. however, have you conducted experiments to see whether regexp are indeed better in terms of speed than native string functions to solve problems?

Again, admittedly, no, I have not. Have you?

I didn't say its not suitable. Remember, i provided a non regex solution to OP, but you said its a lot of work. I say its not, because behind the regexp engine is also doing work.

Well, all routines must do work, yes. So of course, even regex needs to 'work' for the solution. In reality, for this instance, the system (either way.. yours or Darkwater's) doesn't work hard (in the grand scheme of things.. so perhaps I mislabeled 'works hard').. Just from a programmer's standpoint, if I had to choose between multiple explodes, passing an array through array_unique, then imploding, versus a simple capture with back referencing, well, you can guess which direction I would take.

This is where we clearly differ. You equate 'understandability' as being fine. I equate elegantly fast and efficient bug-free code as being fine.

So in the end, we can agree to disagree.

yes indeed. It's all up to the OP to consider and use the solutions. There's really no point continuing to argue here.

Indeed, the OP can pick / choose. Since the OP is using preg, and since preg is perfectly suitable for this problem, and since this is a regex forum, a wise preg solution was presented.

nrg_alpha · October 6, 2008

So I decided to actually test this (you have me curious).. The first part is using your code, and second is using Darkwater's. I time everything encapsulating each solution.

So the top line outputted is yours.. the next line is his.

Here is the test code:

$start = gettimeofday();
$string = "http://example.com/nate/joe/joe/tom/";
$a = explode("//",$string) ;
$b = explode("/",$a[1]);
$removed = implode("/",array_unique($b)) ;
echo $a[0]."//$removed";
$final = gettimeofday();
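// Elapsed seconds = (end sec + usec) - (start sec + usec).
// With `=` in place of the second `+`, only the start microseconds are
// subtracted and the figure lands near the current Unix timestamp,
// which is what the sample output below shows.
$sec = ($final['sec'] + $final['usec']/1000000)-
       ($start['sec'] + $start['usec']/1000000);
printf("\r\nTime of execution: %.3f units\n", $sec);

echo '<br />'; // insert a new line between tests...

$start = gettimeofday();
$string = "http://example.com/nate/joe/joe/tom/";
$new_url = preg_replace('!/([^/]+)/\1!', '/$1', $string);
echo $new_url;
$final = gettimeofday();
$sec = ($final['sec'] + $final['usec']/1000000)-
       ($start['sec'] + $start['usec']/1000000);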
printf("\r\nTime of execution: %.3f units\n", $sec);

Sample Output:

http://example.com/nate/joe/tom/ Time of execution: 1223270662.000 units 
http://example.com/nate/joe/tom/ Time of execution: 1223270662.000 units

So I concede, it is neck and neck.

If you keep refreshing, it is pretty much tied it seems.. Sometimes, yours is faster by .001, and sometimes it's the other way around. This was tested in a clean file (no other junk in it). Just a Dreamweaver blank HTML template page with the above code in it. So as I suspected.. for tasks like this, the speed difference is negligible. I still would have gone with the preg approach personally though, as [to me] it makes more sense for the OP's problem.
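
A single run like this finishes in microseconds, well below what %.3f can show. A sketch of a more repeatable comparison, assuming microtime(true) for timing and an arbitrary loop count of 100000; the function names are made up for the example and no timings are claimed here:

```php
<?php
// The explode/array_unique approach from earlier in the thread.
function dedupe_explode(string $url): string
{
    $a = explode('//', $url);
    $b = explode('/', $a[1]);
    return $a[0] . '//' . implode('/', array_unique($b));
}

// The preg_replace approach from earlier in the thread.
function dedupe_regex(string $url): string
{
    return preg_replace('!/([^/]+)/\1!', '/$1', $url);
}

// Run $fn repeatedly and return the total elapsed time in seconds.
function time_it(callable $fn, string $url, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($url);
    }
    return microtime(true) - $start;
}

$url = 'http://example.com/nate/joe/joe/tom/';
$iterations = 100000; // arbitrary; large enough to rise above timer noise

printf("explode/array_unique: %.4f s\n", time_it('dedupe_explode', $url, $iterations));
printf("preg_replace:         %.4f s\n", time_it('dedupe_regex', $url, $iterations));
```

Results will vary by machine and PHP version, so it is worth running a few times before drawing any conclusion.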

Cheers,

NRG

DarkWater · October 6, 2008

I've never seen 'DarkWater' used so many times in a post.
