Extract Sentences from Paragraphs

lococobra · September 4, 2009

Hey. Long time since I've posted here.

I'm trying to extract sentences from a paragraph. I think I've got a pretty good idea of how it should be done, my regex just isn't working as it should... Help!

Here's what I've got.

/(?<=[\.\?!]|^)[\s ]*([A-Z].*[\.\?!])[\s ]*(?=[A-Z]|$)/Us

You can break that up into 5 sections if it helps any:

(?<=[\.\?!]|^) - Positive look behind assertion, matches end sentence punctuation or the beginning of a line

[\s ]* - There might be some space here

([A-Z].*[\.\?!]) - This is the actual sentence, it starts with a capital letter and ends with punctuation

[\s ]* - There might be some space here

(?=[A-Z]|$) - Positive look ahead assertion, matches a capital character or the end of the line

Us: ungreedy / dot matches newlines (s is dotall; multiline would be m)

It's just not working very well. It seems to be capturing about half or less of the sentences and I can't figure out why. There are also some pretty obvious flaws to doing it this way... what if a sentence starts with a number, or what about names? "Mark P. Norman" would capture incorrectly. So I'm open to suggestions if there's a more reliable way to do this, but just getting a mostly functional version is more important.

.josh · September 4, 2009

You're fighting a lost cause. No point in looking for capital letters, as capital letters are used for any number of things in the middle of a sentence, not just the first word. Same with punctuation; periods can be used as abbreviations in the middle of the sentence, virtually any of the sentence ending punctuations can be used when quoting someone, etc... No point in looking for newlines either, as ideally there shouldn't even be newline characters in the paragraph except maybe at the end.

Your only hope for getting it accurate is to force a certain standard when it is being input in the first place. Like forcing the user to mark the end of a sentence with a special non-standard char(s).

lococobra · September 4, 2009

Based on that reply, I'm guessing you didn't actually look at my code. I'm not just checking for capital letters, punctuation, or the beginning/end of strings... I'm checking for a match between those things. You're right that what I have isn't foolproof by any means, but I need some kind of answer nevertheless, and I'm already very close to that answer, my regex just needs some tweaking. I didn't ask for a philosophical debate about whether it's a good idea to attempt a solution to this problem.

Here are my two options, as applied to the actual situation I'm using this for.

1. Cut off text after a certain number of characters

2. Cut off text after the end of the sentence that's closest to the number of characters I want..

AKA if I have a 400 word paragraph, and I need to cut it to 150 words, I want it to at least try to end on something that makes sense rather than just cut it off mid sentan

Which do you think makes more sense?
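For what it's worth, option 2 could be sketched roughly like this. The helper name `truncateAtSentence` is hypothetical, and so is the rule for what ends a sentence (`.`, `?` or `!` followed by whitespace or the end of the text); it picks the last sentence end that fits within the limit rather than the truly closest one, and abbreviations such as "Mark P. Norman" will still fool it.

```php
<?php
// Hypothetical helper: cut $text at the last sentence end that fits within
// $limit bytes. A "sentence end" is assumed to be . ? or ! followed by
// whitespace or the end of the text. Offsets are bytes, so this is not
// multibyte-safe.
function truncateAtSentence(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text;
    }
    $cut = 0;
    preg_match_all('/[.?!]+(?=\s|$)/', $text, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[0] as [$punct, $offset]) {
        $end = $offset + strlen($punct);   // position just after the punctuation
        if ($end > $limit) {
            break;
        }
        $cut = $end;
    }
    // No sentence end inside the limit: fall back to a plain cut (option 1).
    return $cut > 0 ? substr($text, 0, $cut) : substr($text, 0, $limit);
}

$text = 'The quick brown fox jumped over the lazy dog. What do you think? Okay that is it';
echo truncateAtSentence($text, 60), "\n";
```

With a limit of 60 this keeps only the first sentence, because the second one ends at character 64.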

.josh · September 4, 2009

Based on that reply, I'm guessing you didn't actually look at my code. I'm not just checking for capital letters, punctuation, or the beginning/end of strings... I'm checking for a match between those things.

I read your post just fine.

I'm trying to extract sentences from a paragraph.

The point was that no matter how you try to slice it, there's no way to do that with 100% accuracy, or even 50% accuracy. You are not very close to solving it as you claim, and I gave you factual, not philosophical reasons why. You just think you are because you are using canned examples in your testing.

I will agree that advising you about attempting the impossible may be a question of philosophy, but telling you that it can't be done is not a matter of philosophy, but of fact. If you wanna nonetheless keep hammering away at it then suit yourself.

Here are my two options, as applied to the actual situation I'm using this for.

1. Cut off text after a certain number of characters

2. Cut off text after the end of the sentence that's closest to the number of characters I want..

AKA if I have a 400 word paragraph, and I need to cut it to 150 words, I want it to at least try to end on something that makes sense rather than just cut it off mid sentan

Which do you think makes more sense?

You said you don't want a philosophical debate, but asking which makes more sense is a matter of philosophy. A single sentence by itself will probably make less sense than the entire paragraph, as far as conveying a thought, in the same way that half a sentence will not make sense compared to the full sentence.

The point here is that you cannot use code to automatically generate something that "makes sense" in a human way. You ask me which "makes more sense"; well, to me it makes more sense to cut off at exactly x characters and trail it with "..." or something, so that you don't have to worry about variable-length matches messing up styling.
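The fixed-length cut suggested here is only a few lines. This is a minimal sketch; the function name `truncateWithEllipsis` and the default marker are assumptions, not anything posted in the thread.

```php
<?php
// Hypothetical helper: hard cut at $limit bytes, then append a marker.
// Trailing whitespace is trimmed so the marker sits right after the last word.
function truncateWithEllipsis(string $text, int $limit, string $marker = '...'): string
{
    if (strlen($text) <= $limit) {
        return $text;   // nothing was cut, so no marker is added
    }
    return rtrim(substr($text, 0, $limit)) . $marker;
}

echo truncateWithEllipsis('The quick brown fox jumped over the lazy dog.', 19), "\n";
```

The result never exceeds the limit plus the marker length, which is what keeps the styling predictable.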

lococobra · September 4, 2009

Look... everything you're saying is completely beside the point. I came here to have someone help me out with my regular expression. That's all I want.

What is syntactically wrong with my regular expression that prevents it from doing exactly what I described it should do, which is:

(?<=[\.\?!]|^) - Positive look behind assertion, matches end sentence punctuation or the beginning of a line

[\s ]* - There might be some space here

([A-Z].*[\.\?!]) - This is the actual sentence, it starts with a capital letter and ends with punctuation

[\s ]* - There might be some space here

(?=[A-Z]|$) - Positive look ahead assertion, matches a capital character or the end of the line

Based on a sample paragraph I pulled from a blog post, my regular expression only matches approximately 50% of the sentences that it should match. I'm not bothering to take into account all the different scenarios that make this problem difficult. I just need it to match based on the regular expression that I've already come up with.

AKA: My regular expression should do this:

The quick brown fox jumped over the lazy dog. What do you think?

I don't think that's very impressive! Blah. Okay that's it

Should become...

The quick brown fox jumped over the lazy dog.
What do you think?
I don't think that's very impressive!
Blah. Okay that's it

Instead of matching how it should, my regex only gives the following matches:

The quick brown fox jumped over the lazy dog.
I don't think that's very impressive!

But hey, congratulations, you made this thread seem so convoluted and combative that I sincerely doubt anyone will actually come provide valuable insight at this point. Thanks.

.josh · September 4, 2009

No, the problem is that you are asking for a pattern to match something that can't be matched. So there's no "fixing" what you came up with or writing a pattern that will work. The end. The problem is that you can't seem to get that through your head.

.josh · September 4, 2009

I mean look at your own example:

"Blah. Okay that's it"

why is that considered one sentence to you? How would you suggest regex decide that that should be one sentence and not two? The answer is that IT CAN'T.

.josh · September 4, 2009

Look, here's a canned answer that works for your canned example. Good luck in your endeavors to make it work 100% of the time when you decide to add arbitrary paragraphs to it. If you wanna be a dumbass and not understand that what you're wanting to do is not possible, then that's your business.

$string = <<<EOF
The quick brown fox jumped over the lazy dog. What do you think?

I don't think that's very impressive! Blah. Okay that's it
EOF;

preg_match_all('~(?<=[.?!]|^).*?(?:[.?!]|$)~s',$string,$matches);
echo "<pre>";
print_r(array_map('trim',$matches[0]));

lococobra · September 5, 2009

Ok I'm just going to ignore you.

It seems like the problem with it right now is that, as in the example above... it's actually matching

The quick brown fox jumped over the lazy dog. W <-- note the W

I think that because it's detecting the first character of where the pattern should be catching the next occurrence, it's not counting that as a unique pattern match. Is there a workaround? I thought that was what the positive look ahead/behind was supposed to do, but it isn't working. I could do an odd/even sort of pattern where it grabs half the sentences one time then the other half the next time.

Also, the line breaks are screwing it up. The reason why I have alternation at the beginning and end with ^ and $ is to allow for line breaks to happen without messing up the pattern but it isn't working. Worst case scenario I could explode the entire thing line by line and check the pattern per line. Shouldn't be this difficult though. Any ideas?

EDIT: My mistake, you're right "Blah." should be considered a separate sentence.

EDIT2: Your pattern is pretty close to what I have already except it doesn't look for capital letters so it's not going to be as effective as what I'm trying to do.
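The every-other-sentence behaviour can be reproduced with the sample text from earlier in the thread. One likely cause, sketched below: the trailing `[\s ]*` consumes the whitespace after each sentence, so the next attempt starts right after a space and the lookbehind no longer sees punctuation. Moving that whitespace inside the lookahead leaves it unconsumed. The variable names and the `$variant` pattern are illustrative assumptions, not code from the thread.

```php
<?php
$text = "The quick brown fox jumped over the lazy dog. What do you think?\n\n"
      . "I don't think that's very impressive! Blah. Okay that's it";

// Original pattern: the trailing [\s ]* eats the space before the next capital,
// so the following sentence begins after whitespace and fails the lookbehind.
$original = '/(?<=[\.\?!]|^)[\s ]*([A-Z].*[\.\?!])[\s ]*(?=[A-Z]|$)/Us';
preg_match_all($original, $text, $before);

// Variant (an assumption): whitespace after the sentence is only inspected by
// the lookahead, never consumed, so the next match can start right after the
// punctuation and skip the whitespace itself.
$variant = '/(?<=[.?!]|^)\s*([A-Z].*?[.?!])(?=\s*(?:[A-Z]|$))/s';
preg_match_all($variant, $text, $after);

print_r($before[1]);   // two sentences, as reported above
print_r($after[1]);    // four sentences
```

The final "Okay that's it" is still dropped by both patterns because it has no closing punctuation. Separately, `^` and `$` only match at line breaks when the `m` modifier is set; `s` just lets `.` match newlines.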

lococobra · September 5, 2009

Got it.

$content = 'The quick brown fox jumped over the lazy dog. what do you think? Third sentance in a line.

I don\'t think that\'s very impressive! Blah. Okay that\'s it';

preg_match_all('/(?<=[.?!]|^).*?(?=([.?!]).{0,3}[A-Z]|$)/s',$content,$matches);
echo "<pre>";
$result = array();
for($i=0;$i<count($matches[0]);$i++) {
    // re-attach the punctuation that was captured inside the lookahead
    $result[] = trim($matches[0][$i]).$matches[1][$i];
}
print_r($result);

.josh · September 5, 2009

'/(?<=[.?!]|^).*?(?=([.?!])\s{0,3}[A-Z]|$)/s'

would be better, as .{0,3} will match any 0-3 characters, not just whitespace
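To see where the two versions diverge, here is a small comparison. The helper `splitWith` and the sample sentence are made up for illustration; the helper re-attaches the captured punctuation the same way the loop above does and drops the empty match that both patterns can produce at the very end of the text.

```php
<?php
// Hypothetical helper: run one of the two patterns and rebuild the sentences.
function splitWith(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    $out = [];
    foreach ($m[0] as $i => $body) {
        $sentence = trim($body) . $m[1][$i];   // re-attach captured punctuation
        if ($sentence !== '') {                // skip the empty match at end of text
            $out[] = $sentence;
        }
    }
    return $out;
}

$dot   = '/(?<=[.?!]|^).*?(?=([.?!]).{0,3}[A-Z]|$)/s';
$space = '/(?<=[.?!]|^).*?(?=([.?!])\s{0,3}[A-Z]|$)/s';
$text  = 'Install v1.0 Beta today. Then restart.';

print_r(splitWith($dot, $text));     // splits inside "v1.0" because "0 " precedes "B"
print_r(splitWith($space, $text));   // keeps "v1.0 Beta" together
```

So a capital within three characters of a period is not always a sentence start; version numbers followed by a capitalised word are one case where `\s{0,3}` is the safer choice.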

lococobra · September 5, 2009

Yeah, I'll change it to that. It just doesn't really matter. If there's a capital within the first 3 characters it's probably the first letter of the sentence anyways.

nrg_alpha · September 5, 2009

Here was my crack at it (not perfect of course..):

$text = <<<EOF
This is a test sentence where Mark P. Norman would like to correctly match some sentences! But the problem is that sentences can vary in structure, so it may prove very difficult, yes?
As a member of the N.R.A, Mark could vent his frustrations at the shooting range... because there are god knows how many curve balls a sentence can throw at you. No solution will be perfect, that's for sure!?!
EOF;

$chunks = preg_split('#[\r\n]#', $text, -1, PREG_SPLIT_NO_EMPTY);
$sentences = array();

foreach($chunks as $chunk){
    // abbreviations such as " P." or " N.R." are consumed as part of the sentence
    preg_match_all('#(?:\s[a-z]\.(?:[a-z]\.)?|.)+?[.?!]+#i', $chunk, $paragraph);
    foreach($paragraph[0] as $sentence){
        $sentences[] = ltrim($sentence);
    }
}

echo '<pre>'.print_r($sentences, true);

nrg_alpha · September 5, 2009

I suppose I could have split it with [\r\n]+, then there would have been no need to use PREG_SPLIT_NO_EMPTY (I think).
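A quick check of that hunch, using a made-up two-paragraph string: for text with no leading or trailing line break the two splits give the same pieces, but a trailing line break still produces an empty piece unless the flag is kept.

```php
<?php
$text = "First paragraph.\r\n\r\nSecond paragraph.";

// Split on every single line-break character and drop the empty pieces.
$a = preg_split('#[\r\n]#', $text, -1, PREG_SPLIT_NO_EMPTY);

// Split on runs of line-break characters, no flag.
$b = preg_split('#[\r\n]+#', $text);

var_dump($a === $b);

// With a trailing line break the run-based split leaves an empty last piece.
$c = preg_split('#[\r\n]+#', $text . "\n");
var_dump(end($c) === '');
```

Keeping PREG_SPLIT_NO_EMPTY alongside `[\r\n]+` covers both cases.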
