Parsing a URL with a regular expression

sangfroid · November 5, 2008

Hi,

I want to replace the third and every later occurrence of "." in a string with "/".

For example:

If I have strings like www.google.com.news and www.google.com.sports, I would like to get www.google.com/news and www.google.com/sports.

How do I do it with a regular expression?

ddrudik · November 5, 2008

This would seem to require some code. What platform are you using (C#.NET, PHP, etc.)?

nrg_alpha · November 5, 2008

Here is one possible solution (non-regex):

$str = 'www.google.com.news';
$arr = explode('.', $str);
$total = count($arr);
if($total > 3){
   $newArr = $arr[0] . '.' . $arr[1] . '.' . $arr[2];
   for($i = 3; $i < $total; $i++){
      $newArr .= '/' . $arr[$i];
   }
   echo $newArr;
}

Output:

www.google.com/news
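
For reuse, the same explode-based idea can be wrapped in a function. This is only a sketch: the name dotsToSlashes and the $keep parameter are made up for illustration, and the type hints assume PHP 7 or later. Unlike the snippet above, it returns the input unchanged when there is no third dot instead of printing nothing.

```php
<?php
// Hypothetical helper (name and $keep parameter are assumptions, not from the thread):
// keep the first $keep dot-separated labels as they are, join the rest with '/'.
function dotsToSlashes(string $host, int $keep = 3): string
{
    $labels = explode('.', $host);
    if (count($labels) <= $keep) {
        return $host; // no third dot, so there is nothing to replace
    }
    $head = implode('.', array_slice($labels, 0, $keep));
    $tail = implode('/', array_slice($labels, $keep));
    return $head . '/' . $tail;
}

echo dotsToSlashes('www.google.com.news'), "\n";
echo dotsToSlashes('www.google.com.sports.football'), "\n";
echo dotsToSlashes('www.google.com'), "\n";
```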

ddrudik · November 5, 2008

I usually answer questions in a non-platform-specific regex forum; looking back, I guess I knew you were using PHP. Sorry about the question. nrg_alpha had the answer.

ddrudik · November 5, 2008

There seems to be a (very slight) speed advantage to using this method instead:

$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;

nrg_alpha · November 5, 2008

Very nice solution, ddrudik!

I think a slight modification can squeeze even more (very slight) speed out of it:

'~^((?:[^.]+\.){2}[^.]+)(.*)~'

Negated character classes are faster than lazy quantifiers. But for menial tasks like this one, the speed difference would probably be negligible at best.

But again, nice solution.

Cheers,

NRG

EDIT: I think my pattern may be capturing the letters after the second dot as well, so it grabs more than it actually needs. To mirror your solution exactly, it could also be written as: '~^((?:[^.]+\.){2})(.*)~'
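
To see what each of the three patterns in this thread captures, a small comparison like the following may help. It is only a sketch: the array keys are labels invented here, and the patterns themselves are copied from the posts above. For this input all three should produce the same final string; they differ only in where group 1 ends.

```php
<?php
// Patterns copied from the thread; the keys are descriptive labels invented for this sketch.
$patterns = [
    'lazy'            => '~^((?:.*?\.){2})(.*)~',
    'negated'         => '~^((?:[^.]+\.){2})(.*)~',
    'negated + label' => '~^((?:[^.]+\.){2}[^.]+)(.*)~',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    $str = 'www.google.com.sports';
    if (preg_match($pattern, $str, $parts)) {
        // Group 1 is kept as is; dots in group 2 become slashes.
        $results[$name] = [
            'group1' => $parts[1],
            'group2' => $parts[2],
            'output' => $parts[1] . str_replace('.', '/', $parts[2]),
        ];
    }
}
print_r($results);
```

One difference worth keeping in mind: [^.]+ needs at least one character before each dot, so the negated-class patterns do not match an input that starts with a dot, while the lazy pattern does.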

ddrudik · November 5, 2008

My benchmark testing might be flawed, but that pattern runs slower for me:

<?php
$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:[^.]+\.){2})(.*)~',$str,$parts)){
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
echo "<br>".(microtime(true)-$time_start)."<hr>";
$time_start = microtime(true);
$str = 'www.google.com.news';
$arr = explode('.', $str);
$total = count($arr);
if($total > 3){
   $newArr = $arr[0] . '.' . $arr[1] . '.' . $arr[2];
   for($i = 3; $i < $total; $i++){
      $newArr .= '/' . $arr[$i];
   }
   echo $newArr;
}
echo "<br>".(microtime(true)-$time_start)."<hr>";
$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
echo "<br>".(microtime(true)-$time_start)."<hr>";
?>

Output:

www.google.com/sports
3.9815902709961E-5
--------------------------------------------------------------------------------
www.google.com/news
1.7881393432617E-5
--------------------------------------------------------------------------------
www.google.com/sports
1.2874603271484E-5

nrg_alpha · November 5, 2008

We have met in an earlier thread, haven't we? Nice running into you again, BTW. In that thread, we both agreed (through your code snippet) that the negated character class beat out the lazy quantifier (no backtracking involved).

I have used the following snippet (which is loosely based on your code from that thread):

$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:[^.]+\.){2})(.*)~',$str,$parts)){ // NRG
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$elapsed_time = round($time_end-$time_start,4);
echo $elapsed_time . '<br />';


$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){ // ddrudik
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$elapsed_time = round($time_end-$time_start,4);
echo $elapsed_time . '<br />';

Example output:

www.google.com/sports-1225926253.6569
www.google.com/sports-1225926253.657

Granted, the difference on a single pass is so small...

One thing I did notice is the use of $elapsed_time = round($time_end-$time_start,4), and in particular the round function. Perhaps this is throwing things off? Maybe the readings I am getting are skewed because of this? Perhaps my example is incorrect?

Cheers,

NRG

ddrudik · November 5, 2008

In that code snippet I don't see $time_end defined, so your values have no end point to measure against. We also can't round to 4 places, since our values only become nonzero in the fifth decimal place.
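
Both effects can be reproduced in isolation. The sketch below makes two assumptions: $timeEnd = null stands in for the undefined $time_end (PHP treats null as 0 in arithmetic), and the sample duration is one of the values posted earlier in this thread.

```php
<?php
$timeStart = microtime(true);
$timeEnd = null; // assumption: stands in for the undefined $time_end

// 0 minus the start timestamp: a large negative number, like the output quoted above.
$broken = round($timeEnd - $timeStart, 4);

$elapsed = 4.1007995605469E-5;     // a duration of the size reported in this thread
$roundedTo4 = round($elapsed, 4);  // rounds down to zero, since the value is below 0.00005
$roundedTo6 = round($elapsed, 6);  // keeps two significant digits

var_dump($broken < 0, $roundedTo4, $roundedTo6);
```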

Consider this code:

<?php
$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:[^.]+\.){2})(.*)~',$str,$parts)){ // NRG
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$time_end = microtime(true);
$elapsed_time = $time_end-$time_start;
echo $elapsed_time . '<br />';


$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){ // ddrudik
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$time_end = microtime(true);
$elapsed_time = $time_end-$time_start;
echo $elapsed_time . '<br />';
?>

This result:

www.google.com/sports4.1007995605469E-5
www.google.com/sports1.3828277587891E-5

The speed of the previous thread's solution must have been influenced by different factors. Maybe it's the use of capture groups in this example; I'm not sure.

nrg_alpha · November 5, 2008

Ah, good call on the lack of $time_end. That would indeed be a problem. Don't I feel foolish now... So it's settled (yeah, not sure why the discrepancy either; it nags at me...).

ddrudik · November 5, 2008

I will just assume that I have to test all alternatives to get the actual speed results for a given match pattern and source string.

nrg_alpha · November 5, 2008

I did test mine.. just not correctly...
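
For anyone who wants to test all the alternatives, a repeated-run harness may give steadier numbers than the single-pass timings above. This is a sketch under assumptions: the function names, the warm-up call, and the iteration count are all choices made here, not anything from the posts. One possible factor in the earlier results is that whichever pattern was timed first also paid any one-time start-up cost, and each timing included the echo; the harness below tries to keep both out of the measurement.

```php
<?php
// Hypothetical harness; function names and the iteration count are assumptions.
function viaExplode(string $str): string
{
    $arr = explode('.', $str);
    $total = count($arr);
    if ($total <= 3) {
        return $str;
    }
    $out = $arr[0] . '.' . $arr[1] . '.' . $arr[2];
    for ($i = 3; $i < $total; $i++) {
        $out .= '/' . $arr[$i];
    }
    return $out;
}

function viaPattern(string $pattern, string $str): string
{
    if (preg_match($pattern, $str, $parts)) {
        return $parts[1] . str_replace('.', '/', $parts[2]);
    }
    return $str;
}

// Average seconds per call over $iterations runs, after one untimed warm-up call.
function timeIt(callable $fn, int $iterations): float
{
    $fn(); // warm-up, so one-time costs are not counted
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $iterations;
}

$input = 'www.google.com.sports';
$candidates = [
    'explode' => function () use ($input) { return viaExplode($input); },
    'lazy'    => function () use ($input) { return viaPattern('~^((?:.*?\.){2})(.*)~', $input); },
    'negated' => function () use ($input) { return viaPattern('~^((?:[^.]+\.){2})(.*)~', $input); },
];

foreach ($candidates as $name => $fn) {
    printf("%-8s %.3e s per call\n", $name, timeIt($fn, 100000));
}
```

Results will still vary with the PHP version, the machine, and the input string, so the ordering from one run should not be read as general.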
