thewired

Parsing text file - Help with preg_match_all regex

thewired replied to thewired's topic in Regex Help

No, you're right, this could be done much more simply, and yes, the subs follow this newline pattern consistently. Can you help me out with this?

February 22, 2010
9 replies
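The simpler route mentioned in this reply can be sketched without one large regex: split the file on blank lines, then read each block line by line. This is only an illustration, written in Python rather than PHP; `SAMPLE` and `parse_blocks` are hypothetical names, and CRLF line endings are assumed because the regex in the thread uses \r\n.

```python
import re

# Hypothetical two-block sample in the format quoted in the thread (CRLF assumed).
SAMPLE = (
    "1\r\n00:01:51,686 --> 00:01:53,646\r\nMy dear children,\r\n\r\n"
    "2\r\n00:01:54,022 --> 00:01:58,860\r\n"
    "it is now better than several years\r\nsince I moved to New York\r\n"
)

def parse_blocks(text):
    """Return (block number, [text lines]) for every subtitle block."""
    blocks = []
    # Two or more consecutive line breaks mark the blank line between blocks.
    for chunk in re.split(r"(?:\r?\n){2,}", text.strip()):
        lines = chunk.splitlines()
        # Expect: block number, timestamp, then at least one line of text.
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        blocks.append((int(lines[0]), lines[2:]))
    return blocks

for number, text_lines in parse_blocks(SAMPLE):
    print(number, text_lines)
```

This version does not validate the timestamp line; it only relies on the block layout staying consistent.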

Parsing text file - Help with preg_match_all regex

thewired replied to thewired's topic in Regex Help

That's not what I'm going for. You only grabbed the timestamp data (not the important info). To sum it all up, this is the source text I need to parse (important info I need to grab in bold):

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

As you can see, it consists of blocks of:

Block #
Timestamp
Unknown number of lines of text (subtitles)

(I only showed 1 or 2 lines of text in the source, but there could be more.) Here is my original regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Most of the regex is fine: it grabs the first number as it should and checks the timestamp in the right way. The part that needs work is the end, where it grabs the text. The problem is at:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

The \r\n(.+) at the end of the regex grabs just one line of text, but I need it to grab any number of lines. Hopefully it is clear what I am trying to accomplish.

February 21, 2010
9 replies
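One way to let the last group take every text line up to the blank line is to repeat a "non-empty line" sub-pattern. The sketch below uses Python's `re` as a stand-in for `preg_match_all`; `SRT_BLOCK`, `parse_srt` and `SAMPLE` are hypothetical names. It uses [^\r\n]+ rather than .+ because a dot can match the \r of a CRLF blank line, which would let the repetition run on into the next block.

```python
import re

SRT_BLOCK = re.compile(
    r"(\d+)\r?\n"                                                # block number
    r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r?\n"  # timestamp (checked, not captured)
    r"((?:[^\r\n]+(?:\r?\n|$))+)"                                # one or more non-empty text lines
)

def parse_srt(text):
    """Return (block number, [text lines]) for every block that matches."""
    return [(int(num), body.splitlines()) for num, body in SRT_BLOCK.findall(text)]

# Sample text from the thread, with CRLF line endings assumed.
SAMPLE = "\r\n".join([
    "1", "00:01:51,686 --> 00:01:53,646", "My dear children,", "",
    "2", "00:01:54,022 --> 00:01:58,860",
    "it is now better than several years", "since I moved to New York", "",
    "3", "00:01:59,027 --> 00:02:03,114",
    "and I haven't seen you", "as much as I would like to.", "",
    "4", "00:02:03,615 --> 00:02:07,410",
    "I hope you will come to this ceremony", "of Papal honours,",
])

for number, lines in parse_srt(SAMPLE):
    print(number, lines)
```

The same repeated group should carry over to a PCRE pattern, but that has not been tested against the original file.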

Parsing text file - Help with preg_match_all regex

thewired replied to thewired's topic in Regex Help

To make it a bit clearer what I'm trying to accomplish, this is more or less what I'd like my end result to be:

Array
(
    [0] => Array
        (
            [0] => 1
00:01:51,686 --> 00:01:53,646
My dear children,
            [1] => 2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York
            [2] => 3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.
            [3] => 4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,
        )
    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )
    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )
    [3] => Array
        (
            [0] =>
            [1] => since I moved to New York
            [2] => as much as I would like to.
            [3] => of Papal honours,
        )
)

February 21, 2010
9 replies

Parsing text file - Help with preg_match_all regex

thewired replied to thewired's topic in Regex Help

Thanks sader, but I think you misunderstood. I should have been clearer. I do not actually care much about the timestamp; in my regex I only check that it is formatted correctly. The array I get from my current code looks like this:

Array
(
    [0] => Array
        (
            [0] => 1
00:01:51,686 --> 00:01:53,646
My dear children,
            [1] => 2
00:01:54,022 --> 00:01:58,860
it is now better than several years
            [2] => 3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
            [3] => 4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
        )
    [1] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
            [3] => 4
        )
    [2] => Array
        (
            [0] => My dear children,
            [1] => it is now better than several years
            [2] => and I haven't seen you
            [3] => I hope you will come to this ceremony
        )
)

As you can see, I am not grabbing all of the subtitles. Instead of getting both lines "it is now better than several years" and "since I moved to New York", I only get the first. Each block can have multiple lines of subtitle text. My regex currently grabs just the first line, but I'd like to grab every line of text that exists.

February 21, 2010
9 replies

Parsing text file - Help with preg_match_all regex

thewired posted a topic in Regex Help

I'm trying to parse subtitle files containing text data along the lines of:

1
00:01:51,686 --> 00:01:53,646
My dear children,

2
00:01:54,022 --> 00:01:58,860
it is now better than several years
since I moved to New York

3
00:01:59,027 --> 00:02:03,114
and I haven't seen you
as much as I would like to.

4
00:02:03,615 --> 00:02:07,410
I hope you will come to this ceremony
of Papal honours,

I have a preg_match_all with my regex:

/([0-9]+)\r\n[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\r\n(.+)/

Unfortunately that only grabs the first line of text after the timestamp. What I'd like is for the regex to match every line between the timestamp and the blank line. Help, anyone?

February 21, 2010
9 replies

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

Yeah, you're right about the funky results. For my uses, however, my code should work fine. I can see how it would break, like you said, if some tags I can't predict show up, but that shouldn't happen in my case. Anyway, thanks for all the help with my regex questions!

March 22, 2009
17 replies

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

Actually, when I replace (.*?) with ([^<]*), it fixes the problem! Yay! So now my code is:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">([^<]*)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds">([^<]*)</td>\s*<td class="conns">([^<]*)</td>~is'

Of course, the (potential) problem I see with this is that if there is a < in the link title, it won't grab the whole title. Would something along these lines be valid: ([^</]*) ? My objective with that is for it to stop only when it gets to a </ (closing HTML tag).

March 22, 2009
17 replies
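On the ([^</]*) question: inside square brackets each character stands alone, so [^</] means "any character that is neither < nor /", not "anything up to the sequence </". A common way to stop only at the two-character sequence is a negative lookahead checked before each character. The sketch below shows the difference in Python's `re` (PCRE supports the same lookahead syntax); the HTML string is a made-up example.

```python
import re

# Made-up link whose text contains both '/' and a lone '<'.
html = '<a href="x" title="t">a/b < c</a>'

# Character class: stops at the first '<' OR the first '/'.
char_class = re.search(r'>([^</]*)', html)

# Lookahead before every character: stops only where '</' begins.
tempered = re.search(r'>((?:(?!</).)*)</a>', html, re.S)

print(repr(char_class.group(1)))
print(repr(tempered.group(1)))
```

The lookahead version is slower on large inputs and still assumes the link text contains no nested tags.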

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

Hmm, so how do you recommend I fix it? It should be something along the lines of ([^<]*) (keep matching till it hits <), right? Well, this didn't help the problem, if it's even valid... I tried:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>\s*\s*<td class="seeds">([^<]*)</td>\s*<td class="conns">([^<]*)</td>~is'

March 22, 2009
17 replies

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

I believe this is because it is grabbing the name along with all the source code that follows it. Here's an example to help make my problem more clear. The array looks like this: [12]=> string(35) "url name 12" [13]=> string(29) "url name 13" [14]=> string(2077) "url name</a> </td> <td class="crcsize">X</td> <td class="seeds" colspan="7"></td> </tr>

March 22, 2009
17 replies
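The runaway capture described in this post can be reproduced on a small page. With the s modifier, a lazy (.*?) is still allowed to keep growing across tags until the rest of the pattern finally matches in a later row, while ([^<]*) cannot cross a tag, so an incomplete row simply fails to match. A sketch in Python's `re`, where re.I | re.S stands in for the ~is modifiers; `PAGE`, `pattern`, `lazy` and `strict` are hypothetical names, and the row markup is taken from the examples in the thread.

```python
import re

# One incomplete row (colspan, no "conns" cell) followed by one complete row.
PAGE = '''<td class="file">
<a href="link.com" title="title">X</a>
</td>
<td class="crcsize">X</td>
<td class="seeds" colspan="7"></td>
<td class="file">
<a href="link.com" title="title">grab1</a>
</td>
<td class="crcsize">grab2</td>
<td class="seeds">grab3</td>
<td class="conns">grab4</td>'''

def pattern(name_group):
    """Build the row pattern with the given sub-pattern for the link name."""
    return re.compile(
        r'<td class="file">\s*<a href="([^"]*)" title="([^"]*)">'
        + name_group +
        r'</a>\s*</td>\s*<td class="crcsize">([^<]*)</td>'
        r'\s*<td class="seeds">([^<]*)</td>'
        r'\s*<td class="conns">([^<]*)</td>',
        re.I | re.S,
    )

lazy = pattern(r'(.*?)').findall(PAGE)      # name group may span tags
strict = pattern(r'([^<]*)').findall(PAGE)  # name group stops at '<'

print(lazy[0][2][:10])
print(strict)
```

In the lazy result the third group begins with the incomplete row's name and swallows the markup up to grab1; the strict result contains only the complete row.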

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

Thanks for your response, Crayon, it was very informative. I have now changed my regex a bit and am having problems, which I hope you or someone else can help me fix. The regex looks like this:

'~<td class="file">\s*<a href="([^"]*)" title="([^"]*)">(.*?)</a>\s*</td>\s*<td class="crcsize">([^"]*)</td>\s*\s*<td class="seeds">([^"]*)</td>\s*<td class="conns">([^"]*)</td>~is'

And it is grabbing info from source code that looks like this:

<td class="file"> <a href="link.com" title="title">grab1</a> </td> <td class="crcsize">grab2</td> <td class="seeds">grab3</td> <td class="conns">grab4</td>

My problem is that some of the code on the page looks like this:

<td class="file"> <a href="link.com" title="title">X</a> </td> <td class="crcsize">X</td> <td class="seeds" colspan="7"></td>

That code is getting placed in a string in the array containing the link names. This is a problem. I do not want that code to be taken into consideration at all; I want my code to completely ignore it and not take any values from those blocks of HTML. Help?

March 22, 2009
17 replies

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

Thanks for the help, guys. I tried DJTim's and Crayon's code, but your new code didn't work, DJTim, so I'm using Crayon's. Can someone explain to me what the difference is between ([^"]*) and (.*?) ? Also, it seems to work as is, but does it need the backslashes like DJTim's code?

March 22, 2009
17 replies
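The difference asked about here shows up when an attribute sits between href and title. (.*?) means "as few characters as possible, of any kind", so it will still cross a quote if that is the only way for the rest of the pattern to match; ([^"]*) can never cross a quote, so the match either stays inside one attribute value or fails. On the backslashes: characters such as <, >, = and " have no special meaning in a regex and need no escaping; the \/ in DJTim's pattern is needed only because / is that pattern's delimiter. A sketch in Python's `re` with a made-up HTML string:

```python
import re

# Made-up markup: the first link has an extra attribute between href and title.
html = (
    '<a href="one.com" class="x" title="first">A</a>'
    '<a href="two.com" title="second">B</a>'
)

# Lazy dot: grows past the closing quote of href until '" title="' follows.
lazy = re.findall(r'href="(.*?)" title="(.*?)"', html)

# Negated class: cannot cross a quote, so the first link does not match at all.
negated = re.findall(r'href="([^"]*)" title="([^"]*)"', html)

print(lazy)
print(negated)
```

Which behavior is wanted depends on the page; for scraping fixed markup the negated class tends to fail more predictably.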

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

I tried using this for the regex, but no luck. I am assuming it is because the td class tag is on a separate line. '/\<td class\=\"file\"\>\<a href\=\"(.*?)\" title\=\"(.*?)\"\>(.*?)\<\/a\>/i'

March 21, 2009
17 replies

regex code for grabbing info from link

thewired replied to thewired's topic in Regex Help

Thanks, I appreciate the response. That is pretty close to what I want; however, it will grab any link on the page, and I only want to grab links preceded by the line: <td class="file"> A little more help?

March 21, 2009
17 replies

regex code for grabbing info from link

thewired posted a topic in Regex Help

I have a site I am scraping source from and want to grab the info that says "WANT". Can I get help making a regex for this? I am a regex noob. The code looks like this, with the TD on a separate line; WANT and the random.com's will be different every time. <td class="file"> <a href="random.com" title="random.com">WANT</a>

March 21, 2009
17 replies
