
thewired
Parsing text file - Help with preg_match_all regex
thewired replied to thewired's topic in Regex Help
No, you're right, this could be done much more simply, and yes, the subs follow this newline pattern consistently. Can you help me out with this?
-
That's not what I'm going for. You only grabbed the timestamp data (not the important info). To sum it all up...

This is the source text I need to parse (important info I need to grab in bold):

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

As you can see, it consists of blocks of:

Block #
Timestamp
Unknown number of lines of text (subtitles)

(I only showed 1 or 2 lines of text in the source, but there could be more.)

Here is my original regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Most of the regex is fine: it grabs the first number as it should and checks the timestamp in the right way. The part that needs work is the end, where it grabs the text. The problem is at the end of:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

The \r\n(.+) at the last part of the regex just grabs one line of text, but I need it to grab any number of lines. Hopefully it is clear what I am trying to accomplish.
-
To make it a bit clearer what I'm trying to accomplish, this is more or less what I'd like my end result to be:

Array
(
    [0] => Array
        (
            [0] => 1 00:01:51,686 --> 00:01:53,646 My dear children,
            [1] => 2 00:01:54,022 --> 00:01:58,860 it is now better than several years since I moved to New York
            [2] => 3 00:01:59,027 --> 00:02:03,114 and I haven't seen you as much as I would like to.
            [3] => 4 00:02:03,615 --> 00:02:07,410 I hope you will come to this ceremony of Papal honours,
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )

    [3] => Array
        (
            [0] =>
            [1] => since I moved to New York
            [2] => as much as I would like to.
            [3] => of Papal honours,
        )

)
-
Thanks sader, but I think you misunderstood. I should have been clearer. I do not actually care much about the timestamp; in my regex I only check that it is formatted correctly. The array I get from my current code looks like this:

Array
(
    [0] => Array
        (
            [0] => 1 00:01:51,686 --> 00:01:53,646 My dear children,
            [1] => 2 00:01:54,022 --> 00:01:58,860 it is now better than several years
            [2] => 3 00:01:59,027 --> 00:02:03,114 and I haven't seen you
            [3] => 4 00:02:03,615 --> 00:02:07,410 I hope you will come to this ceremony
        )

    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )

    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )

)

As you can see, I am not grabbing all of the subtitles. Instead of getting both lines "it is now better than several years" and "since I moved to New York", I only get the first. The subtitles can have multiple lines in each block. My regex currently grabs just the first line of subtitle text, but I'd like to grab all the lines of text that exist in a block.
-
I'm trying to parse subtitle files containing text data along the lines of:

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

I have a preg_match_all with my regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Unfortunately that only grabs the first line of text after the timestamp line. What I'd like is for the regex to match every line between the timestamp and the blank line. Help, anyone?
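One possible way to get every text line up to the blank line is to replace the single (.+) with a repeated "non-empty line" group. Below is a minimal sketch, written in Python rather than PHP so it runs standalone. The sample string and the names SRT, TIME, BLOCK and parse_blocks are made up for illustration, and the pattern assumes blocks are separated by an empty line. Note that it uses [^\r\n]+ instead of .+ because a plain dot also matches \r, which would let the group run straight through the blank line.

```python
import re

# Sample input with Windows line endings, mirroring the subtitle text above.
SRT = (
    "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
    "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
    "it is now better than several years\r\nsince I moved to New York\r\n\r\n"
    "3\r\n00:01:59,027 --> 00:02:03,114\r\n"
    "and I haven't seen you\r\nas much as I would like to.\r\n\r\n"
    "4\r\n00:02:03,615 --> 00:02:07,410\r\n"
    "I hope you will come to this ceremony\r\nof Papal honours,"
)

TIME = r"\d{2}:\d{2}:\d{2},\d{3}"
BLOCK = re.compile(
    r"(\d+)\r?\n"                        # group 1: block number
    + TIME + r" --> " + TIME + r"\r?\n"  # timestamp line: checked, not captured
    + r"((?:[^\r\n]+(?:\r?\n|$))+)"      # group 2: one or more non-empty lines
)


def parse_blocks(text):
    """Return a list of (block_number, [text lines]) tuples."""
    blocks = []
    for number, body in BLOCK.findall(text):
        # splitlines() drops the line endings and copes with \r\n and \n
        blocks.append((int(number), body.splitlines()))
    return blocks


blocks = parse_blocks(SRT)
for number, lines in blocks:
    print(number, lines)
```

The repeated group stops at the first empty line because [^\r\n]+ needs at least one character. It uses only syntax that PCRE also supports, so the same group should carry over to preg_match_all, but treat that as untested; there the text lines would arrive as one multi-line string per block and would still need splitting.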
-
Yeah, you're right about the funky results. For my uses, however, my code should work fine. I can see how it would break, like you said, if some tags I can't predict show up, but that shouldn't happen in my case. Anyway, thanks for all the help with my regex questions!
-
Actually, when I replace (.*?) with ([^<]*), it fixes the problem! Yay. So now my code is:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">([^<]*)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is'

Of course, the (potential) problem I see with this is that if there is a < in the link name, it won't grab the whole name. Would something along these lines be valid: ([^</]*) ? My objective with that is for it to stop only when it gets to a </ (ending HTML tag).
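On the ([^</]*) idea: a character class is a set of single characters, so [^</] means "any one character that is neither < nor /", not "anything up to the sequence </". One common way to express "stop at </" is a negative lookahead checked before each character. A small sketch in Python (the sample link and variable names are invented for illustration; the lookahead syntax is also available in PCRE):

```python
import re

# Invented link whose visible name contains both "/" and "<".
link = '<a href="link.com" title="title">1/2 < 3</a>'

# [^</]* stops at the first "/" OR "<", whichever comes first.
by_class = re.search(r'title="[^"]*">([^</]*)', link).group(1)

# (?:(?!</).)* consumes one character at a time, but only while the
# next two characters are not "</".
by_lookahead = re.search(
    r'title="[^"]*">((?:(?!</).)*)</a>', link, re.S
).group(1)

print(repr(by_class))
print(repr(by_lookahead))
```

The lookahead version keeps the whole name even when it contains a lone < or /, at the cost of being slower and harder to read than ([^<]*).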
-
Hmm, so how do you recommend I fix it? It should be something along the lines of ([^<]*) (keep matching till it hits <), right? Well, this didn't help the problem, if it's even valid... I tried:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds"><b>([^<]*)</b></td>\s*<td class="conns"><b>([^<]*)</b></td>~is'
-
I believe this is because it is grabbing the name along with all the source code that follows it. Here's an example to help make my problem clearer. The array looks like this:

[12]=> string(35) "url name 12"
[13]=> string(29) "url name 13"
[14]=> string(2077) "url name</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
</tr>
-
Thanks for your response, Crayon, it was very informative. I have now changed my regex a bit and I am having problems, which I hope you or someone else can help me fix. The regex looks like this:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^"]*)</td>\s*\s*<td class="seeds"><b>([^"]*)</b></td>\s*<td class="conns"><b>([^"]*)</b></td>~is'

And it is grabbing info from source code that looks like this:

<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>
<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>

My problem is that some of the code on the page looks like this:

<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>

That code is getting placed in a string in the array containing the link names. This is a problem. I do not want that code to be taken into consideration at all; I want my regex to completely ignore it and not take any values from those blocks of HTML. Help?
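The behaviour described here can be reproduced on a small scale. With the s flag, a lazy (.*?) is still allowed to cross tags and newlines, so when the first row cannot satisfy the rest of the pattern, the group keeps growing until a later row does. A group that cannot contain < has no such escape route, so the incomplete row is simply skipped. A sketch in Python, assuming a cut-down page with one incomplete row followed by one complete row (PAGE, HEAD, TAIL and the other names are invented for the example, and the doubled \s*\s* is written as a single \s*):

```python
import re

# One incomplete row (colspan cell) followed by one complete row.
PAGE = """
<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
</tr>
<tr>
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>
<td class="seeds"><b>grab3</b></td>
<td class="conns"><b>grab4</b></td>
"""

HEAD = r'<td class="file">\s*<a href="([^"]*)" title="([^"]*)">'
TAIL = (
    r'</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*'
    r'<td class="seeds"><b>([^<]*)</b></td>\s*'
    r'<td class="conns"><b>([^<]*)</b></td>'
)

# Same pattern twice; only the link-name group differs.
lazy = re.compile(HEAD + r'(.*?)' + TAIL, re.I | re.S)
strict = re.compile(HEAD + r'([^<]*)' + TAIL, re.I | re.S)

# Index 2 of each findall() tuple is the link-name group.
lazy_names = [m[2] for m in lazy.findall(PAGE)]
strict_names = [m[2] for m in strict.findall(PAGE)]

print(len(lazy_names[0]), repr(lazy_names[0][:12]))
print(strict_names)
```

In this sample the lazy version returns a single long "name" that starts in the incomplete row and ends at grab1, while the strict version returns only grab1.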
-
Thanks for the help, guys. I tried both DJTim's and Crayon's code, but your new code didn't work, DJTim, so I'm using Crayon's. Can someone explain to me what the difference is between ([^"]*) and (.*?) ? Also, it seems to work as is, but does it need the backslashes like DJTim's code?
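The difference shows up when the input is not quite what the pattern expects. (.*?) takes as little as possible but may still run past a closing quote if that is the only way to make the rest of the pattern match; ([^"]*) can never cross a quote, so it fails instead. A small sketch in Python, using an invented tag with an extra attribute (all names here are made up for illustration):

```python
import re

# Hypothetical tag with an extra attribute between href and title.
tag = '<a href="a.com" class="x" title="t">name</a>'

lazy = re.search(r'<a href="(.*?)" title="(.*?)">', tag)
negated = re.search(r'<a href="([^"]*)" title="([^"]*)">', tag)

# The lazy group ran past the closing quote of href to reach ' title='.
lazy_href = lazy.group(1)
print(repr(lazy_href))

# The negated class cannot cross a quote, so there is no match at all.
print(negated)

# Escaping <, >, = and " is harmless but not needed: none of them is a
# regex metacharacter.
escaped = re.search(r'\<a href\=\"([^"]*)\"', tag).group(1)
plain = re.search(r'<a href="([^"]*)"', tag).group(1)
print(escaped == plain)
```

As for the backslashes, the only one that matters in the /.../ version is the \/ in <\/a>, because / is the pattern delimiter there. With ~ as the delimiter, as in Crayon's pattern, / needs no escaping either.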
-
I tried using this for the regex, but no luck. I am assuming it is because the td class tag is on a separate line.

'/\<td class\=\"file\"\>\<a href\=\"(.*?)\" title\=\"(.*?)\"\>(.*?)\<\/a\>/i'
-
Thanks, I appreciate the response. That is pretty close to what I want; however, it will grab any link on the page, and I only want to grab links preceded by the line:

<td class="file">

A little more help?
-
I have a site I am scraping source from and want to grab the info that says "WANT". Can I get help making a regex for this? I am a regex noob. The code looks like this, with the TD on a separate line; WANT and the random.com's will be different every time.

<td class="file">
<a href="random.com" title="random.com">WANT</a>