thewired, posted February 21, 2010

I'm trying to parse subtitle files containing text data along the lines of:

    1
    00:01:51,686 --> 00:01:53,646
    My dear children,

    2
    00:01:54,022 --> 00:01:58,860
    it is now better than several years
    since I moved to New York

    3
    00:01:59,027 --> 00:02:03,114
    and I haven't seen you
    as much as I would like to.

    4
    00:02:03,615 --> 00:02:07,410
    I hope you will come to this ceremony
    of Papal honours,

I have a preg_match_all with my regex:

    /([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Unfortunately that only grabs the first line of text after the timestamp line. What I'd like is for the regex to match every line between the timestamp and the blank line. Help, anyone?

sader, posted February 21, 2010

    <?php
    preg_match_all('/([0-9:,]+?) --> ([0-9:,]+?),(\\d+)/', $str, $result, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($result[0]); $i++) {
        $fullmatch = $result[0][$i];
        $left_ts   = $result[1][$i];
        $right_ts  = $result[2][$i];
        $magic_id  = $result[3][$i]; // number after , at the end of line
        // $result[0][$i];
    }
    ?>

thewired (author), posted February 21, 2010

Thanks sader, but I think you misunderstood. I should have been clearer. I don't actually care much about the timestamp; in my regex I only check that it is formatted correctly. The array I get from my current code looks like this:

    Array
    (
        [0] => Array
            (
                [0] => 1
                       00:01:51,686 --> 00:01:53,646
                       My dear children,
                [1] => 2
                       00:01:54,022 --> 00:01:58,860
                       it is now better than several years
                [2] => 3
                       00:01:59,027 --> 00:02:03,114
                       and I haven't seen you
                [3] => 4
                       00:02:03,615 --> 00:02:07,410
                       I hope you will come to this ceremony
            )
        [1] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 3
                [3] => 4
            )
        [2] => Array
            (
                [0] => My dear children,
                [1] => it is now better than several years
                [2] => and I haven't seen you
                [3] => I hope you will come to this ceremony
            )
    )

As you can see, I am not grabbing all of the subtitles. Instead of getting both lines "it is now better than several years" and "since I moved to New York", I only get the first. Each block could contain multiple lines of subtitle text. My regex currently grabs only the first line, but I'd like to grab every line of text that exists in the block.

thewired (author), posted February 21, 2010

To make it a bit clearer what I'm trying to accomplish, this is more or less what I'd like my end result to be:

    Array
    (
        [0] => Array
            (
                [0] => 1
                       00:01:51,686 --> 00:01:53,646
                       My dear children,
                [1] => 2
                       00:01:54,022 --> 00:01:58,860
                       it is now better than several years
                       since I moved to New York
                [2] => 3
                       00:01:59,027 --> 00:02:03,114
                       and I haven't seen you
                       as much as I would like to.
                [3] => 4
                       00:02:03,615 --> 00:02:07,410
                       I hope you will come to this ceremony
                       of Papal honours,
            )
        [1] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 3
                [3] => 4
            )
        [2] => Array
            (
                [0] => My dear children,
                [1] => it is now better than several years
                [2] => and I haven't seen you
                [3] => I hope you will come to this ceremony
            )
        [3] => Array
            (
                [0] =>
                [1] => since I moved to New York
                [2] => as much as I would like to.
                [3] => of Papal honours,
            )
    )

sader, posted February 21, 2010

Maybe this regexp is what you need:

    '/([0-9]+)\\r\\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/s'

thewired (author), posted February 21, 2010

That's not what I'm going for. You only grabbed the timestamp data, which is not the important part. To sum it all up, this is the source text I need to parse (the subtitle text lines are the important info I need to grab):

    1
    00:01:51,686 --> 00:01:53,646
    My dear children,

    2
    00:01:54,022 --> 00:01:58,860
    it is now better than several years
    since I moved to New York

    3
    00:01:59,027 --> 00:02:03,114
    and I haven't seen you
    as much as I would like to.

    4
    00:02:03,615 --> 00:02:07,410
    I hope you will come to this ceremony
    of Papal honours,

As you can see, it consists of blocks of:

    Block #
    Timestamp
    Unknown number of lines of text (subtitles)

I only showed 1 or 2 lines of text per block in the source, but there could be more.

Here is my original regex:

    /([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Most of the regex is fine: it grabs the first number as it should and checks the timestamp in the right way. The part that needs work is the end, where it grabs the text. The \r\n(.+) at the end of the regex grabs just one line of text, but I need it to grab any number of lines. Hopefully it is clear what I am trying to accomplish.

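One way to make the regex take every text line up to the blank line is to repeat a "one non-empty line" group instead of using a single (.+). The following is only a minimal sketch, not a definitive answer: the variable names ($subs, $pattern, $blocks) and the shortened two-block sample are assumptions made for illustration. It uses \R so that both \r\n and \n line endings match, and [^\r\n]+ rather than .+ because in PCRE the dot also matches \r by default.

```php
<?php
// Shortened sample with \r\n line endings (assumed from the \r\n in the original regex).
$subs = "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
      . "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
      . "it is now better than several years\r\nsince I moved to New York\r\n";

// Group 1: block number. Group 2: one or more non-empty lines,
// each ended by a line break or by the end of the input.
$pattern = '/^(\d+)\R'
         . '\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\R'
         . '((?:[^\r\n]+(?:\R|\z))+)/m';

preg_match_all($pattern, $subs, $matches, PREG_SET_ORDER);

$blocks = array();
foreach ($matches as $m) {
    // Drop the trailing line break, then split the text into separate lines.
    $blocks[(int) $m[1]] = preg_split('/\R/', trim($m[2]));
}

print_r($blocks);
```

The repetition stops at the blank line because [^\r\n]+ needs at least one character, so an empty line cannot start another iteration. Each block ends up as an array of text lines keyed by its block number, rather than as extra capture groups.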
sader, posted February 21, 2010

I came up with this one:

    preg_match_all('/([0-9]{1,4})\\s+[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\\s+(.*)/i', $str, $result, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($result[0]); $i++) {
        $id   = $result[1][$i];
        $text = $result[2][$i];
    }

salathe, posted February 21, 2010

This looks like something that can be done much more simply, for example with basic string functions, assuming that the subs file follows the same new-line pattern consistently. Is there any particular reason why you've chosen to use strict matching against the lines' values with a regular expression?

thewired (author), posted February 22, 2010

salathe wrote:
> This looks to be something that can be done much simpler, for example with basic string functions, assuming that the subs file follows a (new line) pattern consistently. Is there any particular reason why you've chosen to use strict matching against the lines' values with a regular expression?

No, you're right, this could be done much more simply, and yes, the subs follow this new-line pattern consistently. Can you help me out with this?

salathe, posted February 24, 2010

Sorry for the delay in replying, it is so easy to "lose" threads on this site! By something simpler, I just meant breaking up the subs text based on the line structure, like:

    $result = array();
    foreach (explode("\n\n", $subs) as $item) {
        list($num, $timestamp, $sub) = explode("\n", $item, 3);
        $result[$num] = $sub;
    }
    print_r($result);

Which outputs:

    Array
    (
        [1] => My dear children,
        [2] => it is now better than several years
               since I moved to New York
        [3] => and I haven't seen you
               as much as I would like to.
        [4] => I hope you will come to this ceremony
               of Papal honours,
    )

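The explode() approach above splits on "\n" only, while the original regex in this thread expects "\r\n", so a file with Windows line endings would probably need its line endings normalised first. Here is a minimal sketch of that variant. The function name parse_subs and the shortened sample are hypothetical, chosen only for illustration, and it returns each block's text as an array of lines instead of a single string.

```php
<?php
// Hypothetical helper: returns array(block number => array of text lines).
function parse_subs($subs)
{
    // Normalise \r\n and lone \r to \n so the splits below behave the same everywhere.
    $subs = str_replace(array("\r\n", "\r"), "\n", $subs);

    $result = array();
    // Blocks are separated by one or more blank lines.
    foreach (preg_split('/\n{2,}/', trim($subs)) as $item) {
        $parts = explode("\n", $item, 3);
        if (count($parts) < 3) {
            continue; // skip incomplete blocks (no text after the timestamp)
        }
        list($num, $timestamp, $text) = $parts;
        $result[(int) $num] = explode("\n", $text);
    }
    return $result;
}

// Usage with a shortened sample that uses \r\n line endings.
$sample = "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
        . "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
        . "it is now better than several years\r\nsince I moved to New York\r\n";

$parsed = parse_subs($sample);
print_r($parsed);
```

Unlike the regex, this does not check that the timestamp line is well formed. It relies entirely on the block layout being consistent, as confirmed earlier in the thread.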