Parsing text file - Help with preg_match_all regex

thewired · February 21, 2010

I'm trying to parse subtitle files containing text data along the lines of:

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

I have a preg_match_all with my regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Unfortunately that only grabs the first line of text after the timestamp type information. What I'd like is for the regex to match every line between the timestamp and the blank line.

Help anyone? :shrug:
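
A quick way to see why only one line comes back: without the s modifier, "." does not match a newline, so (.+) cannot run past the end of the first text line. The snippet below is only a sketch; the sample string and variable names are made up for illustration, and the CRLF line endings are assumed from the \r\n in the pattern above.

```php
<?php
// Hypothetical sample data: one block with CRLF line endings (an assumption).
$block = "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
       . "it is now better than several years\r\n"
       . "since I moved to New York\r\n\r\n";

// Without the "s" modifier, "." never matches "\n",
// so (.+) ends at the first line break after the timestamp.
preg_match('/[0-9]{3}\r\n(.+)/', $block, $m);

// trim() removes any stray carriage return left at the end of the capture.
$firstLineOnly = trim($m[1]);

echo $firstLineOnly, "\n";
```

The second line of text never reaches the capture group, which matches the behaviour described above.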

sader · February 21, 2010

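<?php
// Capture the start timestamp, the end timestamp up to the seconds, and the trailing milliseconds.
preg_match_all('/([0-9:,]+?) --> ([0-9:,]+?),(\\d+)/', $str, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++)
{
  $fullmatch = $result[0][$i]; // the whole timestamp line
  $left_ts = $result[1][$i];   // start time, e.g. 00:01:51,686
  $right_ts = $result[2][$i];  // end time without milliseconds, e.g. 00:01:53
  $magic_id = $result[3][$i];  // number after the last comma on the line, e.g. 646
}
?>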

thewired · February 21, 2010

Thanks sader, but I think you misunderstood; I should have been clearer. I don't actually care much about the timestamp. In my regex I only check that it is formatted correctly. The array I get from my current code looks like this:

Array
(
    [0] => Array
        (
            [0] => 1
00:01:51,686 --> 00:01:53,646
My dear children,
            [1] => 2
00:01:54,022 --> 00:01:58,860
it is now better than several years
            [2] => 3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
            [3] => 4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )

)

As you can see, I am not grabbing all of the subtitles. Instead of getting both lines, "it is now better than several years" and "since I moved to New York", I only get the first. Each block can contain multiple lines of subtitle text. My regex currently grabs just the first line, but I'd like to grab every line of text that exists.

thewired · February 21, 2010

To make it a bit clearer what I'm trying to accomplish, this is more or less what I'd like my end result to be:

Array
(
    [0] => Array
        (
            [0] => 1
00:01:51,686 --> 00:01:53,646
My dear children,
            [1] => 2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York
            [2] => 3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.
            [3] => 4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )

    [3] => Array
        (
            [0] => 
            [1] => since I moved to New York
            [2] => as much as I would like to.
            [3] => of Papal honours,
        )

)

sader · February 21, 2010

Maybe this regexp is what you need:

'/([0-9]+)\\r\\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/s'

thewired · February 21, 2010

That's not what I'm going for. You only grabbed the timestamp data (not important info). To sum it all up...

This is the source text I need to parse (the important info I need to grab is the subtitle text):

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

As you can see, it consists of blocks of:

Block #
Timestamp
Unknown number of lines of text (subtitles). I only showed 1 or 2 lines of text in the source, but there could be more.

Here is my original regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Basically most of the regex is fine because it grabs the first number as it should and looks for the timestamp in the right way, but the part that needs work is the end where it grabs the text.

The problem is at

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

The \r\n(.+) at the end of the regex grabs just one line of text, but I need it to grab any number of lines.

Hopefully it is clear what I am trying to accomplish ::)

sader · February 21, 2010

I came up with this one.

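// \s+ accepts any line ending between the number, the timestamp and the text.
// Note: without the s modifier, (.*) still stops at the end of the first text line.
preg_match_all('/([0-9]{1,4})\\s+[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\\s+(.*)/i', $str, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++)
{
    $id = $result[1][$i];   // block number
    $text = $result[2][$i]; // first line of subtitle text
}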
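
Since (.+) and (.*) both stop at a line break, one possible way to collect every text line up to the blank line is to repeat a "one non-empty line" group. This is only a sketch under assumptions: the sample data, the variable names and the choice of \R (PCRE's "any line break" escape) are mine, not something confirmed in the thread.

```php
<?php
// Hypothetical sample input with CRLF line endings; a real file would be read from disk.
$subs = "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
      . "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
      . "it is now better than several years\r\nsince I moved to New York\r\n\r\n";

$ts = '[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}';

// \R matches any line break (\r\n, \n or \r).
// [^\r\n]+ matches one non-empty line, so the repeated group stops at the blank line.
$pattern = '/([0-9]+)\R' . $ts . ' --> ' . $ts . '\R((?:[^\r\n]+(?:\R|\z))+)/';

preg_match_all($pattern, $subs, $matches, PREG_SET_ORDER);

$result = array();
foreach ($matches as $match) {
    // Key by block number; split the captured text into its individual lines.
    $result[(int) $match[1]] = preg_split('/\R/', trim($match[2]));
}

print_r($result);
```

Each entry in $result is then an array of text lines for that block, however many lines the block has. [^\r\n]+ is used instead of .+ because "." can match a lone carriage return, which would let a CRLF blank line slip into the capture.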

salathe · February 21, 2010

This looks like something that can be done much more simply, for example with basic string functions, assuming that the subs file follows a consistent (new line) pattern.

Is there any particular reason why you've chosen to use strict matching against the lines' values with a regular expression?

thewired · February 22, 2010

No, you're right, this could be done much more simply, and yes, the subs follow this new-line pattern consistently. Can you help me out with this?

salathe · February 24, 2010

Sorry for the delay in replying, it is so easy to "lose" threads on this site!

By something simpler, I just meant breaking up the subs text based on the line structure, like:

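$result = array();
// Blocks are separated by a blank line; each block is "number, timestamp, text".
foreach (explode("\n\n", $subs) as $item) {
    // A limit of 3 keeps all remaining lines of subtitle text together in $sub.
    list($num, $timestamp, $sub) = explode("\n", $item, 3);
    $result[$num] = $sub;
}

print_r($result);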

Which outputs:

Array
(
    [1] => My dear children,
    [2] => it is now better than several years
since I moved to New York
    [3] => and I haven't seen you
as much as I would like to.
    [4] => I hope you will come to this ceremony
of Papal honours,
)
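
One caveat: explode("\n\n", ...) only finds the blank lines when the file uses bare \n line endings, and the regexes earlier in the thread match \r\n, which suggests the real files may use CRLF. A hedged variation is to normalize the line endings first. The sample data and the malformed-block check below are my own assumptions, not part of the original snippet.

```php
<?php
// Hypothetical sample input with Windows (CRLF) line endings.
$subs = "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
      . "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
      . "it is now better than several years\r\nsince I moved to New York\r\n";

// Normalize CRLF and lone CR to LF so the splits below see one line-ending style.
$subs = str_replace(array("\r\n", "\r"), "\n", $subs);

$result = array();
// Split on runs of blank lines; trim() drops leading/trailing breaks in the file.
foreach (preg_split('/\n{2,}/', trim($subs)) as $item) {
    $parts = explode("\n", $item, 3);
    if (count($parts) < 3) {
        continue; // skip blocks that lack a number, timestamp or text
    }
    list($num, $timestamp, $sub) = $parts;
    $result[(int) $num] = $sub;
}

print_r($result);
```

As in the original snippet, multi-line subtitles stay together in one string, joined by "\n".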
