Extract Sentences from Paragraphs


lococobra

Hey. Long time since I've posted here.

 

I'm trying to extract sentences from a paragraph. I think I've got a pretty good idea of how it should be done, but my regex just isn't working as it should... Help!

 

Here's what I've got.

 

/(?<=[\.\?!]|^)[\s ]*([A-Z].*[\.\?!])[\s ]*(?=[A-Z]|$)/Us

 

You can break that up into 5 sections if it helps any:

 

(?<=[\.\?!]|^)  -  Positive look behind assertion, matches end sentence punctuation or the beginning of a line

[\s ]*  -  There might be some space here

([A-Z].*[\.\?!])  -  This is the actual sentence, it starts with a capital letter and ends with punctuation

[\s ]*  -  There might be some space here

(?=[A-Z]|$)  -  Positive look ahead assertion, matches a capital character or the end of the line

 

Us: U = ungreedy, s = dotall (the dot also matches newlines; multiline mode for ^ and $ would be the m modifier)
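
For reference, a minimal sketch of what these modifiers do in PHP's PCRE functions. The sample string is made up for illustration. As far as I know, U flips quantifiers to lazy, s lets the dot match newlines, and it is m (not s) that lets ^ and $ match at line breaks.

```php
<?php
$sample = "One. Two.\nThree.";

// Without m, ^ only matches at the very start of the string.
$withoutM = preg_match_all('/^[A-Z]/', $sample, $a);
// With m, ^ also matches right after each newline.
$withM = preg_match_all('/^[A-Z]/m', $sample, $b);

// Without s, the dot stops at a newline; with s it runs across it.
preg_match('/Two.*/', $sample, $noDotAll);
preg_match('/Two.*/s', $sample, $dotAll);

// U makes .* lazy, so it stops at the first period instead of the last.
preg_match('/O.*\./sU', $sample, $lazy);
preg_match('/O.*\./s', $sample, $greedy);

var_dump($withoutM, $withM, $noDotAll[0], $dotAll[0], $lazy[0], $greedy[0]);
```

So with only Us set, the ^ and $ alternatives in the pattern apply to the start and end of the whole string, not to each line.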

 

It's just not working very well. It seems to be capturing about half or less of the sentences and I can't seem to figure out why. There are also some pretty obvious flaws to doing it this way... what if a sentence starts with a number, or what about names? "Mark P. Norman" would capture incorrectly. So I'm open to suggestions if there's a more reliable way to do this, but just getting a mostly functional version is more important.
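
To make the "Mark P. Norman" concern concrete, here is a small sketch using a made-up sentence and a deliberately naive split rule (punctuation, whitespace, then a capital letter). It is not the pattern from this post, just an illustration of why a middle initial trips up any rule of that shape.

```php
<?php
// Made-up example text containing a middle initial.
$text = 'I met Mark P. Norman today. He said hello.';

// Naive rule: split on whitespace that follows . ? or ! and precedes a capital letter.
$parts = preg_split('/(?<=[.?!])\s+(?=[A-Z])/', $text);

print_r($parts);
```

The initial "P." looks exactly like a sentence end to this rule, so the name is cut in two and the text comes back as three pieces instead of two.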


You're fighting a lost cause.  No point in looking for capital letters, as capital letters are used for any number of things in the middle of a sentence, not just the first word.  Same with punctuation: periods are used in abbreviations in the middle of a sentence, virtually any sentence-ending punctuation mark can appear when quoting someone, etc...  No point in looking for newlines either, as ideally there shouldn't even be newline characters in the paragraph except maybe at the end.

 

Your only hope for getting it accurate is to force a certain standard when it is being input in the first place.  Like forcing the user to mark the end of a sentence with a special non-standard char(s).

 

 


Based on that reply, I'm guessing you didn't actually look at my code. I'm not just checking for capital letters, punctuation, or the beginning/end of strings... I'm checking for a match between those things. You're right that what I have isn't foolproof by any means, but I need some kind of answer nevertheless, and I'm already very close to that answer, my regex just needs some tweaking. I didn't ask for a philosophical debate about whether it's a good idea to attempt a solution to this problem.

 

Here are my two options, as applied to the actual situation I'm using this for.

 

1. Cut off text after a certain number of characters

2. Cut off text after the end of the sentence that's closest to the number of characters I want..

 

AKA if I have a 400 word paragraph, and I need to cut it to 150 words, I want it to at least try to end on something that makes sense rather than just cut it off mid sentan
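
A rough sketch of option 2, under some assumptions: the helper name truncateAtSentence is hypothetical, lengths are counted in bytes (so it assumes single-byte text), and a "sentence end" is simply punctuation followed by whitespace and a capital letter, or by the end of the text. It keeps the last sentence end that still fits and falls back to a plain cut when none does.

```php
<?php
// Hypothetical helper: cut $text at the last sentence end that fits within $limit bytes.
function truncateAtSentence(string $text, int $limit): string
{
    if (strlen($text) <= $limit) {
        return $text;
    }

    // Candidate ends: punctuation followed by whitespace + capital, or by the end of the text.
    preg_match_all('/[.?!]+(?=\s+[A-Z]|\s*$)/', $text, $m, PREG_OFFSET_CAPTURE);

    $cut = 0;
    foreach ($m[0] as [$punct, $offset]) {
        $end = $offset + strlen($punct);
        if ($end > $limit) {
            break; // this sentence would overshoot the limit
        }
        $cut = $end;
    }

    // No complete sentence fits: fall back to a hard cut (option 1).
    return $cut > 0 ? substr($text, 0, $cut) : substr($text, 0, $limit);
}

$text = "The quick brown fox jumped over the lazy dog. What do you think? "
      . "I don't think that's very impressive!";

echo truncateAtSentence($text, 70), "\n";
echo truncateAtSentence($text, 50), "\n";
```

It inherits every weakness of the sentence-end rule it relies on, so abbreviations and initials can still produce an early cut.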

 

Which do you think makes more sense?


Based on that reply, I'm guessing you didn't actually look at my code. I'm not just checking for capital letters, punctuation, or the beginning/end of strings... I'm checking for a match between those things.

 

I read your post just fine.

 

I'm trying to extract sentences from a paragraph.

 

The point was that no matter how you try to slice it, there's no way to do that with 100% accuracy, or even 50% accuracy.  You are not very close to solving it as you claim, and I gave you factual, not philosophical reasons why.  You just think you are because you are using canned examples in your testing.

 

I will agree that advising you about attempting the impossible may be a question of philosophy, but telling you that it can't be done is not a matter of philosophy, but of fact.  If you wanna nonetheless keep hammering away at it then suit yourself. 

 

Here are my two options, as applied to the actual situation I'm using this for.

 

1. Cut off text after a certain number of characters

2. Cut off text after the end of the sentence that's closest to the number of characters I want..

 

AKA if I have a 400 word paragraph, and I need to cut it to 150 words, I want it to at least try to end on something that makes sense rather than just cut it off mid sentan

 

Which do you think makes more sense?

 

You said you don't want a philosophical debate, but asking which makes more sense is a matter of philosophy.  A single sentence by itself will probably make less sense than the entire paragraph, as far as conveying a thought, in the same way that half a sentence will not make sense compared to the full sentence.

 

The point here is that you cannot use code to automatically generate something that "makes sense" in a human way.  You ask me which "makes more sense". Well, to me it makes more sense to cut off at exactly x characters and trail it with "..." or something, so that you don't have to worry about variable-length matches messing up styling.
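
For comparison, the fixed-length cut described here could look something like this. The function name hardTruncate and the default suffix are made up for illustration, and it counts bytes, so it assumes single-byte text.

```php
<?php
// Hypothetical helper: cut at exactly $limit bytes and append a trailing marker.
function hardTruncate(string $text, int $limit, string $suffix = '...'): string
{
    if (strlen($text) <= $limit) {
        return $text; // short enough already, nothing to cut
    }
    return substr($text, 0, $limit) . $suffix;
}

echo hardTruncate('The quick brown fox jumped over the lazy dog.', 9), "\n";
```

Every truncated result has the same length, which is the styling advantage mentioned above, at the cost of sometimes stopping mid-word.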


Look... everything you're saying is completely beside the point. I came here to have someone help me out with my regular expression. That's all I want.

 

What is syntactically wrong with my regular expression that prevents it from doing what and exactly what I described it should do, which is:

 

(?<=[\.\?!]|^)  -  Positive look behind assertion, matches end sentence punctuation or the beginning of a line

[\s ]*  -  There might be some space here

([A-Z].*[\.\?!])  -  This is the actual sentence, it starts with a capital letter and ends with punctuation

[\s ]*  -  There might be some space here

(?=[A-Z]|$)  -  Positive look ahead assertion, matches a capital character or the end of the line

 

Based on a sample paragraph I pulled from a blog post, my regular expression only matches approximately 50% of the sentences that it should match. I'm not bothering to take into account all the different scenarios that make this problem difficult. I just need it to match based on the regular expression that I've already come up with.

 

 

AKA: My regular expression should do this:

The quick brown fox jumped over the lazy dog. What do you think?

I don't think that's very impressive! Blah. Okay that's it

 

Should become...

  • The quick brown fox jumped over the lazy dog.
  • What do you think?
  • I don't think that's very impressive!
  • Blah. Okay that's it

 

Instead of matching how it should, my regex only gives the following matches:

 

  • The quick brown fox jumped over the lazy dog.
  • I don't think that's very impressive!

 

But hey, congratulations, you made this thread seem so convoluted and combative that I sincerely doubt anyone will actually come provide valuable insight at this point. Thanks.


Look, here's a canned answer that works for your canned example.  Good luck in your endeavors to make it work 100% of the time when you decide to add arbitrary paragraphs to it.  If you wanna be a dumbass and not understand that what you're wanting to do is not possible, then that's your business.

 

$string = <<<EOF
The quick brown fox jumped over the lazy dog. What do you think?

I don't think that's very impressive! Blah. Okay that's it
EOF;

preg_match_all('~(?<=[.?!]|^).*?(?:[.?!]|$)~s',$string,$matches);
echo "<pre>";
print_r(array_map('trim',$matches[0]));


Ok I'm just going to ignore you.

 

It seems like the problem with it right now is that, as in the example above... it's actually matching

 

The quick brown fox jumped over the lazy dog. W  <-- note the W

 

I think that because it's detecting the first character of where the pattern should be catching the next occurrence, it's not counting that as a unique pattern match. Is there a workaround? I thought that was what the positive look ahead/behind was supposed to do, but it isn't working. I could do an odd/even sort of pattern where it grabs half the sentences one time then the other half the next time.
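
One possible explanation, offered as a sketch rather than a definitive diagnosis: the lookahead itself consumes nothing, but the second [\s ]* sits outside it and does consume the space after each sentence. The next sentence then starts right after a space, so the lookbehind for [.?!] fails at that position and every other sentence is skipped. Moving the whitespace inside the lookahead (my variation, not a pattern from this thread) avoids that:

```php
<?php
$text = "The quick brown fox jumped over the lazy dog. What do you think?\n\n"
      . "I don't think that's very impressive! Blah. Okay that's it";

// Original pattern: the trailing [\s ]* is part of the match and eats the separator.
$original = '/(?<=[\.\?!]|^)[\s ]*([A-Z].*[\.\?!])[\s ]*(?=[A-Z]|$)/Us';

// Variation: trailing whitespace is only checked inside the lookahead, never consumed.
$adjusted = '/(?<=[.?!]|^)\s*([A-Z].*[.?!])(?=\s*(?:[A-Z]|$))/Us';

preg_match_all($original, $text, $before);
preg_match_all($adjusted, $text, $after);

print_r($before[1]); // 2 sentences
print_r($after[1]);  // 4 sentences
```

The trailing fragment "Okay that's it" is still not captured by either pattern, because both require closing punctuation.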

 

Also, the line breaks are screwing it up. The reason why I have alternation at the beginning and end with ^ and $ is to allow for line breaks to happen without messing up the pattern but it isn't working. Worst case scenario I could explode the entire thing line by line and check the pattern per line. Shouldn't be this difficult though. Any ideas?

 

EDIT: My mistake, you're right "Blah." should be considered a separate sentence.

 

EDIT2: Your pattern is pretty close to what I have already except it doesn't look for capital letters so it's not going to be as effective as what I'm trying to do.


Got it.

$content = 'The quick brown fox jumped over the lazy dog. what do you think? Third sentance in a line.

I don\'t think that\'s very impressive! Blah. Okay that\'s it';

preg_match_all('/(?<=[.?!]|^).*?(?=([.?!]).{0,3}[A-Z]|$)/s',$content,$matches);
echo "<pre>";
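$result = [];
for ($i = 0; $i < count($matches[0]); $i++) {
    // Re-attach the punctuation captured inside the lookahead (empty for the final fragment)
    $result[] = trim($matches[0][$i]) . $matches[1][$i];
}
print_r($result);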


Here was my crack at it (not perfect of course..):

 

$text = <<<EOF
This is a test sentence where Mark P. Norman would like to correctly match some sentences! But the problem is that sentences can vary in structure, so it may prove very difficult, yes?
As a member of the N.R.A, Mark could vent his frustrations at the shooting range... because there are god knows how many curve balls a sentence can throw at you. No solution will be perfect, that's for sure!?!
EOF;

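// Split into lines first so a sentence never spans a line break
$chunks = preg_split('#[\r\n]#', $text, -1, PREG_SPLIT_NO_EMPTY);

$sentences = [];
foreach ($chunks as $line) {
    // Treat single-letter abbreviations such as " P." or " N.R." as part of the sentence
    preg_match_all('#(?:\s[a-z]\.(?:[a-z]\.)?|.)+?[.?!]+#i', $line, $paragraph);
    foreach ($paragraph[0] as $sentence) {
        $sentences[] = ltrim($sentence);
    }
}

echo '<pre>'.print_r($sentences, true);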

