
Parsing URL, regular expression


sangfroid


Hi,

I want to replace every third and later occurrence of "." with "/" in a string.

For example, if I have strings like www.google.com.news and www.google.com.sports, I would like to get www.google.com/news and www.google.com/sports.

How do I do it with a regular expression?


Here is one possible solution (non-regex):

 

$str = 'www.google.com.news';
$arr = explode('.', $str);
$total = count($arr);
if($total > 3){
   $newArr = $arr[0] . '.' . $arr[1] . '.' . $arr[2];
   for($i = 3; $i < $total; $i++){
      $newArr .= '/' . $arr[$i];
   }
   echo $newArr;
}

 

Output:

 

www.google.com/news
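The same non-regex idea can be written more compactly with the optional limit argument of explode(). This is only a sketch: the function name slashAfterSecondDot is made up for illustration, the type hints assume PHP 7 or later, and unlike the snippet above it returns the string unchanged when there are fewer than three dots instead of printing nothing.

```php
<?php
// Hypothetical helper name, not taken from the posts above.
function slashAfterSecondDot(string $str): string
{
    // With a limit of 3, the third element holds everything after the second dot.
    $parts = explode('.', $str, 3);
    if (count($parts) === 3) {
        // Only the tail can contain the third and later dots.
        $parts[2] = str_replace('.', '/', $parts[2]);
    }
    // Rejoin the first two dots untouched.
    return implode('.', $parts);
}

echo slashAfterSecondDot('www.google.com.news'), "\n";
echo slashAfterSecondDot('www.google.com'), "\n";
```

Splitting only twice avoids the loop, because str_replace() handles however many dots remain in the tail.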


This would seem to require some code. What platform are you using (C#.NET, PHP, etc.)?

I usually answer questions in a non-platform-specific regex forum; looking back, I guess I knew you were using PHP. Sorry about the question. nrg_alpha had the answer.


There seems to be a (very slight) speed advantage to using this method instead:

$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;

 

Very nice solution, ddrudik!

I think a slight modification can squeeze even more (very slight) speed out of it:

 

'~^((?:[^.]+\.){2}[^.]+)(.*)~'

 

Negated character classes are generally faster than lazy quantifiers. But for menial tasks like this one, the speed difference would probably be negligible at best.

But again, nice solution :)

 

Cheers,

 

NRG

 

EDIT: I think my solution may be grabbing the remaining letters past the last needed dot, so it may be grabbing more than it actually needs. To mirror your solution exactly, it could also be written as: '~^((?:[^.]+\.){2})(.*)~'
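To see what each of the three patterns in this thread actually captures, a quick side-by-side may help. This is just an illustrative sketch; the array keys are labels I made up, and the input is the sample string used above.

```php
<?php
// Labels are arbitrary; the patterns are copied from the posts above.
$patterns = [
    'lazy'             => '~^((?:.*?\.){2})(.*)~',
    'negated'          => '~^((?:[^.]+\.){2}[^.]+)(.*)~',
    'negated (edited)' => '~^((?:[^.]+\.){2})(.*)~',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $str = 'www.google.com.sports';
    if (preg_match($pattern, $str, $parts)) {
        // Group 1 is kept as-is; dots in group 2 become slashes.
        $results[$label] = $parts[1] . str_replace('.', '/', $parts[2]);
        printf("%-17s [%s] [%s] => %s\n", $label, $parts[1], $parts[2], $results[$label]);
    }
}
```

The first 'negated' pattern splits the string one label later than the other two, but all three end up producing www.google.com/sports for this input. One difference worth keeping in mind: [^.]+ needs at least one character between dots, so a string with empty labels (say, two leading dots) would match the lazy pattern but not the negated ones.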


My benchmark testing might be flawed, but that pattern runs slower for me:

<?php
$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:[^.]+\.){2})(.*)~',$str,$parts)){
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
echo "<br>".(microtime(true)-$time_start)."<hr>";
$time_start = microtime(true);
$str = 'www.google.com.news';
$arr = explode('.', $str);
$total = count($arr);
if($total > 3){
   $newArr = $arr[0] . '.' . $arr[1] . '.' . $arr[2];
   for($i = 3; $i < $total; $i++){
      $newArr .= '/' . $arr[$i];
   }
   echo $newArr;
}
echo "<br>".(microtime(true)-$time_start)."<hr>";
$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
echo "<br>".(microtime(true)-$time_start)."<hr>";
?>

Output:

www.google.com/sports
3.9815902709961E-5
--------------------------------------------------------------------------------
www.google.com/news
1.7881393432617E-5
--------------------------------------------------------------------------------
www.google.com/sports
1.2874603271484E-5


We have met in an earlier thread, haven't we? Nice running into you again, BTW. In that thread, we both agreed (through your code snippet) that the negated character class beat out the lazy quantifier (no backtracking involved).

 

I have used the following snippet (which is kind of based on your code in the link above):

 

$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:[^.]+\.){2})(.*)~',$str,$parts)){ // NRG
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$elapsed_time = round($time_end-$time_start,4);
echo $elapsed_time . '<br />';


$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){ // ddrudik
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$elapsed_time = round($time_end-$time_start,4);
echo $elapsed_time . '<br />';

 

Example output:

www.google.com/sports-1225926253.6569
www.google.com/sports-1225926253.657

 

Granted, the difference on a single pass is so small...

One thing I did notice is the use of $elapsed_time = round($time_end-$time_start,4)... oh, and the use of the round function. Perhaps this is throwing things off? Maybe the readings I am getting are skewed because of this? Perhaps my example is incorrect?

 

Cheers,

 

NRG


In that code snippet I don't see $time_end defined, so your values have no ending reference point. Also, we can't round to 4 decimal places, since our values are on the order of 1E-5 and would round to zero.
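A small sketch of the rounding point, using one of the timings posted earlier in this thread as the sample value (the variable names are only for illustration):

```php
<?php
// Sample elapsed time copied from the benchmark output earlier in the thread.
$elapsed = 3.9815902709961E-5;

// Four decimal places cannot represent a value this small, so it collapses to zero.
$rounded = round($elapsed, 4);
var_dump($rounded);

// Printing with more decimal places keeps the significant digits visible.
printf("%.8f\n", $elapsed);
```

Any two timings below 0.00005 seconds would look identical after rounding to four places, so the raw difference is the more useful number here.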

 

Consider this code:

<?php
$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:[^.]+\.){2})(.*)~',$str,$parts)){ // NRG
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$time_end = microtime(true);
$elapsed_time = $time_end-$time_start;
echo $elapsed_time . '<br />';


$time_start = microtime(true);
$str='www.google.com.sports';
if(preg_match('~^((?:.*?\.){2})(.*)~',$str,$parts)){ // ddrudik
  $str=$parts[1].str_replace('.','/',$parts[2]);
}
echo $str;
$time_end = microtime(true);
$elapsed_time = $time_end-$time_start;
echo $elapsed_time . '<br />';
?>

 

This result:

www.google.com/sports4.1007995605469E-5
www.google.com/sports1.3828277587891E-5

 

The speed of the previous thread's solution must have been influenced by different factors; maybe it's the use of capture groups in this example, but I'm not sure.
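Since a single pass takes only a few microseconds, timing many iterations may give steadier numbers. The sketch below is one way to do that, not a definitive benchmark: the helper name timePattern, the labels, and the iteration count are all assumptions made for illustration, and the short warm-up run is there because whichever test runs first might otherwise also pay one-time costs such as pattern compilation.

```php
<?php
// Hypothetical helper: total seconds spent on $iterations runs of one pattern.
function timePattern(string $pattern, string $input, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        if (preg_match($pattern, $input, $parts)) {
            // Same work as the snippets above, without echo inside the timed loop.
            $result = $parts[1] . str_replace('.', '/', $parts[2]);
        }
    }
    return microtime(true) - $start;
}

$input = 'www.google.com.sports';
$iterations = 100000; // arbitrary count chosen for illustration

$patterns = [
    'negated (NRG)'  => '~^((?:[^.]+\.){2})(.*)~',
    'lazy (ddrudik)' => '~^((?:.*?\.){2})(.*)~',
];

// Warm-up pass so first-use overhead is not charged to the first pattern measured.
foreach ($patterns as $pattern) {
    timePattern($pattern, $input, 10);
}

$timings = [];
foreach ($patterns as $label => $pattern) {
    $timings[$label] = timePattern($pattern, $input, $iterations);
    printf("%s: %.6f s\n", $label, $timings[$label]);
}
```

Results will still vary between machines and PHP versions, so it is worth running this several times and swapping the order of the patterns before drawing conclusions.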
