Strip everything before first occurrence of period

AshleyS · October 17, 2008

Hello,

I've done quite a bit of searching to try to figure out how to accomplish this.

We receive strings like the following:

1. Some text with commas and periods.

20. S.A.T.S - School Exam

3523. 5 Stars.

Basically, I need to be able to strip everything up to and including the first period, leaving just the text after it. I do not have much knowledge of regular expressions, so could someone please assist?

Regards.

Orio · October 17, 2008

Can you show an example of raw data you get (in code tags!) and the output you're expecting?

Orio.

AshleyS · October 17, 2008

Thanks for your speedy reply. Here is an example of the data we get from a batch of mp3 song titles (we run a DJ system). The raw data comes first, followed by the output we expect.

8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life

3 Doors Down - Kryptonite
Aerosmith - I Don't Want To Miss A Thing
Coldplay - The Scientist
Coldplay - Trouble
FatBoy Slim - Praise You
FatBoy Slim - Right Here, Right Now
Green Day - Time your life

The index numbers can go up into the high thousands, so I cannot specify a fixed upper limit for them.

Regards.

Orio · October 17, 2008

Try this:

<?php

$data = <<<DATA
8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life
DATA;

$result = preg_replace("#^[^\s]+ (.*?)$#m", "$1", $data);

echo $result;

?>

Orio.

AshleyS · October 17, 2008

Thanks for the code, Orio. It works just as I needed.

If you have the time, would you be able to explain each segment of what is used in the preg_replace?

Regards.

Orio · October 17, 2008

I've added the 'm' modifier so each line is treated separately (so ^ matches at the start of each line and $ at the end of each line).

[^\s]+ matches everything until whitespace is met, so this way it skips the numbers and the dot. Then comes a literal space to match the space that comes after the dot. Then it captures everything until the end of the line (and because it's in parentheses, it "saves" it as $1). The whole match is replaced by $1, so you get only the song names.
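
For anyone who prefers to see the pieces side by side, here is a sketch of the same pattern written with the x (extended) modifier so that each part can carry its own comment. The ~ delimiter and the [ ] form of the literal space are choices made for this sketch (in x mode a bare space in the pattern would be ignored); the two-line sample is taken from the data above.

```php
<?php
// Orio's pattern, spread over several lines with the x modifier.
// ~ is used as the delimiter because # starts a comment in x mode.
$pattern = '~
    ^        # start of a line (m modifier)
    [^\s]+   # one or more non-whitespace characters: the index and its dot
    [ ]      # the literal space after the dot
    (.*?)    # group 1: the rest of the line
    $        # end of the line (m modifier)
~mx';

$data = "8. 3 Doors Down - Kryptonite\n207. Aerosmith - I Don't Want To Miss A Thing";

// '$1' puts back only what group 1 captured.
$result = preg_replace($pattern, '$1', $data);

echo $result, "\n";
```

Because the pattern ends with $, the lazy (.*?) still has to grow until it reaches the end of the line.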

Orio.

AshleyS · October 17, 2008

Thank you Orio, you've explained it very well and I can understand how it operates.

Many thanks.

effigy · October 17, 2008

There's no point in capturing more than you need:

echo $result = preg_replace('/^.*\.\s+/m', '', $data);

ghostdog74 · October 17, 2008

You don't need a regex to do a simple task like this. PHP comes with many string functions you can use.

<?php
$data = <<<DATA
8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life
DATA;
foreach ( explode("\n",$data) as $k=>$v ){
    $s = explode(".",$v);
    echo $s[0]."\n";
}
?>

DarkWater · October 17, 2008

Just to add to ghostdog's response, you'd want to use array_shift() to get the first element off.

<?php
$data = <<<DATA
8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life
DATA;
foreach ( explode("\n",$data) as $k=>$v ){
    $broken = explode(".",$v);
    array_shift($broken);
    // Rejoin with "." so periods inside the title survive
    $songinfo = trim(implode('.', $broken));
    echo $songinfo."\n";
}
?>

nrg_alpha · October 18, 2008

effigy wrote:
There's no point in capturing more than you need:
echo $result = preg_replace('/^.*\.\s+/m', '', $data);

Hmm.. I wonder which method is more efficient, Effigy: yours or Orio's.

Sure, Orio's solution involves a capture (not sure how 'heavy' this actually is), but when I examined your solution, Effigy, I found it interesting that you used .* in conjunction with the m modifier. If I understand this correctly, from the start of each line (as you are using the m modifier) you match everything to the end of the line, then have the regex engine backtrack character by character until it reaches (and thus matches) a dot followed by a space, and replace that...

I wonder aloud which is more work.. all that backtracking, or Orio's straightforward capturing. Looking at Orio's method, it starts capturing everything after the first space. Side note: I do wonder about the lazy quantifier in this case.. it may not be necessary?

At the outset, I have to admit, I like Orio's solution best of all in this thread (this is just my opinion, of course).

I guess what I'm getting at is that even though you use the m modifier, I am wary of .* usage, as it matches as much as it can before backtracking (which may or may not be expensive, depending on how much backtracking is involved).

Perhaps I'm misunderstanding something?

Cheers,

NRG
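
One way to settle the "which is more work" question is simply to time both patterns. The following is only a rough sketch: time_pattern() is a hypothetical helper written for this example, the run count is arbitrary, and the numbers will vary with PHP version, PCRE build and input size, so no timings are claimed here.

```php
<?php
// Hypothetical helper: run one preg_replace call $runs times and
// return the elapsed wall-clock time in seconds.
function time_pattern(string $pattern, string $replacement, string $data, int $runs = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_replace($pattern, $replacement, $data);
    }
    return microtime(true) - $start;
}

$data = "8. 3 Doors Down - Kryptonite\n1833. FatBoy Slim - Right Here, Right Now";

// The two patterns posted earlier in the thread.
$candidates = [
    'capture (Orio)'  => ['#^[^\s]+ (.*?)$#m', '$1'],
    'greedy (effigy)' => ['/^.*\.\s+/m', ''],
];

foreach ($candidates as $label => [$pattern, $replacement]) {
    printf("%-16s %.4f s\n", $label, time_pattern($pattern, $replacement, $data));
}
```

On lines as short as song titles the difference is likely to be small either way, so it is worth running this against a realistic batch before drawing conclusions.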

corbin · October 18, 2008

^.*\.\s+

^ is an anchor, meaning from the start of the line

. means any character (except a newline)

* means zero or more times

\. means literal character .

\s means space character (" " for example)

+ means 1 or more times

So all combined:

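From the start of the line, anything up to a period and then at least one whitespace character after it.

The .* is greedy, so it does run to the end of the line first and then backtracks to the last period that is followed by whitespace. That means there could be issues with a string such as: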

1. Some. Thing here

That would give back "Thing here".
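
A small sketch to check that case. The lazy variant (.*?) is an addition for comparison and was not posted in the thread; it stops at the first period followed by whitespace instead of the last.

```php
<?php
$line = "1. Some. Thing here";

// Greedy: .* takes the whole line, then gives characters back until
// "\.\s+" can match, which happens at the LAST period followed by a space.
$greedy = preg_replace('/^.*\.\s+/m', '', $line);

// Lazy: .*? grows one character at a time, so "\.\s+" matches at the
// FIRST period followed by a space.
$lazy = preg_replace('/^.*?\.\s+/m', '', $line);

echo $greedy, "\n"; // Thing here
echo $lazy, "\n";   // Some. Thing here
```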

nrg_alpha · October 18, 2008

corbin wrote: \s means space character (" " for example)

A more complete explanation, for those not aware, is that it is shorthand for a character class that encompasses many forms of whitespace (such as tabs, literal spaces, carriage returns and newlines).
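
To see this for yourself, here is a quick sketch that tests a handful of single characters against \s. The list of samples is just an illustration, not an exhaustive list of what \s covers.

```php
<?php
$samples = [
    'space'           => " ",
    'tab'             => "\t",
    'newline'         => "\n",
    'carriage return' => "\r",
    'form feed'       => "\f",
    'letter'          => "a",
];

$results = [];
foreach ($samples as $name => $char) {
    // \A and \z anchor to the very start and very end of the string,
    // so the whole one-character string must be matched by \s.
    $results[$name] = preg_match('/\A\s\z/', $char) === 1;
    printf("%-16s %s\n", $name, $results[$name] ? 'matched by \s' : 'not matched');
}
```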

corbin · October 18, 2008

"\s means space character (" " for example)"

Was meant to read "\s means a space character (" " for example)"

In case you were correcting me. I know what it means.

If that wasn't aimed at me, errr... ignore this comment.

nrg_alpha · October 18, 2008

Nope.. my comment was not directed at you. Just for those who are not aware.

I realize you used " " as an example (implying there are more versions of spaces).

DarkWater · October 19, 2008

This might be even faster:

<?php
$data = <<<DATA
8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life
DATA;
echo preg_replace('/^(?>\d+)\.\s+/m', '', $data);
?>

The non-backtracking subpattern for just digits is probably much faster.
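
For reference, a sketch comparing the atomic group with two alternatives on a two-line sample: the plain \d+ and the possessive \d++ (PCRE shorthand for the same "don't give anything back" behaviour). Since a digit can never also be the dot that follows, all three produce the same output here; whether one is measurably faster is an assumption that would need timing on real data.

```php
<?php
$data = "8. 3 Doors Down - Kryptonite\n2068. Green Day - Time your life";

// Atomic group: once \d+ has matched, the engine may not backtrack into it.
$atomic = preg_replace('/^(?>\d+)\.\s+/m', '', $data);

// Possessive quantifier: same idea, shorter notation.
$possessive = preg_replace('/^\d++\.\s+/m', '', $data);

// Plain greedy quantifier, free to backtrack (it never needs to here).
$plain = preg_replace('/^\d+\.\s+/m', '', $data);

var_dump($atomic === $possessive && $atomic === $plain);
echo $atomic, "\n";
```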

ghostdog74 · October 19, 2008

DarkWater wrote: Just to add to ghostdog's response, you'd want to use array_shift() to get the first element off.

My bad, I misread the requirement.

foreach ( explode("\n",$data) as $k=>$v ){
    $s = explode(". ",$v);
    echo end($s)."\n";
}
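
A variation on the string-function approach, as a sketch: passing a limit of 2 to explode() splits only at the first ". ", so periods later in the title are left alone. strip_index() is a hypothetical helper name made up for this example, and the second sample line is the problem string mentioned earlier in the thread.

```php
<?php
// Hypothetical helper: drop everything up to and including the first ". ".
function strip_index(string $line): string
{
    // Limit 2: at most two pieces, split at the first ". " only.
    $parts = explode('. ', $line, 2);

    // No ". " in the line at all: return it unchanged.
    return count($parts) === 2 ? $parts[1] : $line;
}

$data = "8. 3 Doors Down - Kryptonite\n1. Some. Thing here";

foreach (explode("\n", $data) as $line) {
    echo strip_index($line), "\n";
}
```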

nrg_alpha · October 19, 2008

DarkWater, I tested your snippet.. nothing displayed.

Here is my attempt:

$data = <<<DATA
8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life
DATA;

echo preg_replace("#^\d+\. #m", '', $data);

All I did here was match, from the start of each line (in multiline mode), all consecutive digits, a dot and then a space, and replace that with nothing.

No backtracking nor capturing involved.

I suppose one could also use:

echo preg_replace("#^[^.]+\. #m", '', $data);

This would also match lines where the characters before the first dot are accidentally not all digits.
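
A sketch of how the two patterns differ. The "B7." prefix is a made-up example of an index that is not purely digits.

```php
<?php
$data = "12. Coldplay - Trouble\nB7. Green Day - Time your life";

// Digits only: the second line has a letter in its prefix, so it is left alone.
$digitsOnly = preg_replace('#^\d+\. #m', '', $data);

// Anything except a dot: both prefixes are stripped.
$anyPrefix = preg_replace('#^[^.]+\. #m', '', $data);

echo $digitsOnly, "\n\n";
echo $anyPrefix, "\n";
```

One caveat with the second pattern: [^.] also matches newlines, so a line that contains no dot at all can be swallowed together with the next line's prefix. Using [^.\n]+ instead would keep each match on a single line.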

DarkWater · October 19, 2008

That's odd, it seemed to have stripped a ' or something.

<?php
$data = <<<DATA
8. 3 Doors Down - Kryptonite
207. Aerosmith - I Don't Want To Miss A Thing
1096. Coldplay - The Scientist
1097. Coldplay - Trouble
1832. FatBoy Slim - Praise You
1833. FatBoy Slim - Right Here, Right Now
2068. Green Day - Time your life
DATA;
echo preg_replace('/^(?>\d+)\.\s+/m', '', $data);
?>

Try that.

EDIT: Wth. It still stripped a '. Add a ' in right after the opening paren of preg_replace().

nrg_alpha · October 19, 2008

Oh yeah.. How did I miss that missing ' character? Must have been brain dead.

So we have a couple of solutions at our disposal in this thread. To the OP, pick your poison.
