
Parsing text file - Help with preg_match_all regex


thewired


I'm trying to parse subtitle files containing text data along the lines of:

 

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

 

I have a preg_match_all with my regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

 

Unfortunately, that only grabs the first line of text after the timestamp. What I'd like is for the regex to match every line between the timestamp and the blank line.

 

Help anyone?  :shrug:


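<?php
// Captures only the two timestamps of each block, not the subtitle text.
preg_match_all('/([0-9:,]+?) --> ([0-9:,]+?),(\\d+)/', $str, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++)
{
  $fullmatch = $result[0][$i]; // whole timestamp line, e.g. "00:01:51,686 --> 00:01:53,646"
  $left_ts = $result[1][$i];   // start timestamp, e.g. "00:01:51,686"
  $right_ts = $result[2][$i];  // end timestamp without its milliseconds, e.g. "00:01:53"
  $magic_id = $result[3][$i];  // digits after the last comma, i.e. the end timestamp's milliseconds
}
?>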


Thanks sader, but I think you misunderstood. I should have been clearer. I do not actually care much about the timestamp; in my regex I only check that it is formatted correctly. The array results I get from my current code look like this:

 

Array
(
    [0] => Array
        (
            [0] => 1
00:01:51,686 --> 00:01:53,646
My dear children,
            [1] => 2
00:01:54,022 --> 00:01:58,860
it is now better than several years
            [2] => 3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
            [3] => 4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )

)

 

As you can see, I am not grabbing all of the subtitles. Instead of getting both lines, "it is now better than several years" and "since I moved to New York", I only get the first. Each block can have multiple lines of subtitle text. My regex currently grabs just the first line, but I'd like to grab every line of text that exists.


To make it a bit clearer what I'm trying to accomplish, this is more or less what I'd like my end result to be:

 

Array
(
    [0] => Array
        (
            [0] => 1
00:01:51,686 --> 00:01:53,646
My dear children,
            [1] => 2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York
            [2] => 3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.
            [3] => 4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )

    [3] => Array
        (
            [0] => 
            [1] => since I moved to New York
            [2] => as much as I would like to.
            [3] => of Papal honours,
        )

)


That's not what I'm going for. You only grabbed the timestamp data (not important info). To sum it all up...

 

This is the source text I need to parse (the important info I need to grab is in bold):

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

As you can see, it consists of blocks of:

Block #
Timestamp
Unknown number of lines of text (subtitles). (I only showed 1 or 2 lines of text in the source, but there could be more.)

 

 

Here is my original regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

 

Basically most of the regex is fine because it grabs the first number as it should and looks for the timestamp in the right way, but the part that needs work is the end where it grabs the text.

 

The problem is at

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

 

The \r\n(.+) at the end of the regex grabs just one line of text, but I need it to grab any number of lines.

 

Hopefully it is clear what I am trying to accomplish  ::)


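I came up with this one.

// Note: without the s modifier, . does not match a line feed,
// so (.*) still stops at the end of the first line of text.
preg_match_all('/([0-9]{1,4})\\s+[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\\s+(.*)/i', $str, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++)
{
    $id = $result[1][$i];   // block number
    $text = $result[2][$i]; // first line of subtitle text
}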
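
For anyone who wants to stay with a regex: the pattern in the original question stops after one line because . does not match a line feed by default. The following is only a sketch of one possible way to capture every text line up to the blank line, not a tested drop-in for real subtitle files. The sample string, the $ts and $pattern variables and the $subtitles array are made up for illustration, and it assumes each block is number, timestamp, then one or more non-empty text lines.

```php
<?php
// Hypothetical sample input; real files may use \r\n or \n line endings.
$str = "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
     . "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
     . "it is now better than several years\r\nsince I moved to New York\r\n";

// One timestamp, e.g. 00:01:51,686
$ts = '\d{2}:\d{2}:\d{2},\d{3}';

// Group 1: block number. Group 2: one or more non-empty lines, each followed
// by a line break or the end of the input. A blank line ends the group
// because [^\r\n]+ needs at least one character that is not a line break.
$pattern = '/^(\d+)\r?\n' . $ts . ' --> ' . $ts . '\r?\n((?:[^\r\n]+(?:\r?\n|\z))+)/m';

preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);

$subtitles = array();
foreach ($matches as $m) {
    // Drop the trailing line break, then split the text into separate lines.
    $subtitles[(int) $m[1]] = preg_split('/\r?\n/', trim($m[2]));
}

print_r($subtitles);
```

Each entry in $subtitles is keyed by block number and holds an array with one element per line of text, so a block with three or four lines needs no extra capture groups.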


This looks like something that can be done much more simply, for example with basic string functions, assuming that the subs file follows a (new line) pattern consistently.

 

Is there any particular reason why you've chosen to use strict matching against the lines' values with a regular expression?


No, you're right, this could be done much more simply, and yes, the subs follow this new line pattern consistently. Can you help me out with this?


Sorry for the delay in replying, it is so easy to "lose" threads on this site!

 

By something simpler, I just meant breaking up the subs text based on the line structure, like:

 

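$result = array();
// Normalise \r\n and \r line endings to \n so the splits below work,
// and trim so a trailing blank line does not produce an empty block.
$subs = trim(str_replace(array("\r\n", "\r"), "\n", $subs));
// Blocks are separated by a blank line.
foreach (explode("\n\n", $subs) as $item) {
    // Line 1 is the block number, line 2 the timestamp, the rest is the subtitle text.
    list($num, $timestamp, $sub) = explode("\n", $item, 3);
    $result[$num] = $sub;
}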

print_r($result);

 

Which outputs:

Array
(
    [1] => My dear children,
    [2] => it is now better than several years
since I moved to New York
    [3] => and I haven't seen you
as much as I would like to.
    [4] => I hope you will come to this ceremony
of Papal honours,
)
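
If the timestamps are wanted later, or the file might contain extra blank lines or incomplete blocks, the same line-based idea could be extended roughly as below. This is a sketch under those assumptions rather than a definitive parser: parse_srt_blocks() is a hypothetical helper name, and the 'start', 'end' and 'text' keys are just one possible layout.

```php
<?php
// parse_srt_blocks() is a hypothetical helper name, not a PHP built-in.
function parse_srt_blocks($subs)
{
    // Normalise line endings, then split on one or more blank lines.
    $subs = trim(str_replace(array("\r\n", "\r"), "\n", $subs));
    $blocks = array();
    foreach (preg_split('/\n{2,}/', $subs) as $item) {
        $lines = explode("\n", $item);
        // Skip blocks that lack a number, a timestamp line, or any text.
        if (count($lines) < 3 || strpos($lines[1], ' --> ') === false) {
            continue;
        }
        list($start, $end) = explode(' --> ', $lines[1], 2);
        $blocks[(int) $lines[0]] = array(
            'start' => $start,
            'end'   => $end,
            'text'  => array_slice($lines, 2), // one element per subtitle line
        );
    }
    return $blocks;
}

// Hypothetical sample input taken from the blocks quoted in this thread.
$sample = "1\n00:01:51,686 --> 00:01:53,646\nMy dear children,\n\n"
        . "2\n00:01:54,022 --> 00:01:58,860\n"
        . "it is now better than several years\nsince I moved to New York";

$blocks = parse_srt_blocks($sample);
print_r($blocks[2]['text']);
```

Keeping the text as an array of lines leaves it up to the caller whether to join them with a space, a line break, or an HTML tag.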
