
Parsing Sentences/Determining Abbreviations


sug15


Hey, I want to be able to extract each sentence from a block of text. I could just use either explode() or something like:

preg_match_all('/.+?\./s', $text, $match); $sentences = $match[0];

However, the problem is that this also splits on abbreviations. For example, "The fox jumped over Mr. Smith's sheep." would be split into two sentences. Does anyone know of a simple way to get past this? There are also abbreviations for middle initials, etc. It's not so simple, though, because two sentences could be "John is bigger than I. However, I am faster." Whether or not that is correct grammar, things like that could come up as an issue.

 

I was thinking about just trying to identify abbreviations by matching any group of five letters or fewer that contains no vowel and ends in a period, and manually compiling a list of abbreviations that contain vowels or are longer than five letters and matching those as well. Then I could detect name initials, such as "John H.", by checking whether the word before a single letter followed by a period begins with a capital letter.
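The heuristic described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `looksLikeAbbreviation()` and `looksLikeInitial()` and the contents of `KNOWN_ABBREVIATIONS` are made up for the example and would need to be adapted to real text.

```php
<?php
// Hypothetical hand-compiled list of abbreviations that contain vowels
// or are longer than five letters; extend it for your own text.
const KNOWN_ABBREVIATIONS = ['etc', 'prof', 'gen', 'jan', 'inc', 'approx'];

// True if $token (a word ending in a period) looks like an abbreviation.
function looksLikeAbbreviation(string $token): bool
{
    if (substr($token, -1) !== '.') {
        return false;
    }
    $word = substr($token, 0, -1);
    if ($word === '' || !ctype_alpha($word)) {
        return false;
    }
    // Listed abbreviations are matched case-insensitively.
    if (in_array(strtolower($word), KNOWN_ABBREVIATIONS, true)) {
        return true;
    }
    // Five letters or fewer and no vowel, e.g. "Mr", "Dr", "Mrs".
    return strlen($word) <= 5 && !preg_match('/[aeiou]/i', $word);
}

// True if $token is a single capital letter plus a period and the previous
// word starts with a capital letter, as in "John H.".
function looksLikeInitial(string $previous, string $token): bool
{
    return preg_match('/^[A-Z]\.$/', $token) === 1
        && preg_match('/^[A-Z]/', $previous) === 1;
}
```

With these rules, "Mr." and "etc." are flagged as abbreviations, "sheep." is not, and "I." after "than" is not treated as an initial. One known weak spot: a real word without a, e, i, o or u at the end of a sentence (for example "sky.") would be misread as an abbreviation.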

 

The above solution could work well, but it might run into issues, and is pretty complex. Anyone have an easier way or some code I could build off of? Or anyone know of an API?

 

Thanks!


I don't think there is such a thing as an easy solution for this problem. The English language is rather complex even when everybody uses it perfectly, and coping with imperfect use is going to be more complex still. Even then, I'm assuming you only wish to cope with English. Don't forget that a period is not the only valid character for ending a sentence: a question mark or exclamation mark is just as valid, as perhaps are other characters. Personally, I think I might attempt to find the exceptions and delimit them with some kind of <ignore> tags, then use a simpler pattern for splitting (any valid terminator that's not between those tags), rather than writing one complex split pattern.
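A minimal sketch of that tag-then-split idea might look like this. The `EXCEPTIONS` list and the function name `splitSentences()` are assumptions for illustration; nothing in the thread specifies them. Wrapping an exception in `<ignore>` tags means its period is no longer directly followed by whitespace, so a simple split pattern skips it.

```php
<?php
// Hypothetical exception list; the thread does not supply one.
const EXCEPTIONS = ['Mr.', 'Mrs.', 'Dr.', 'etc.'];

function splitSentences(string $text): array
{
    // Step 1: wrap each exception in <ignore> tags so that its period
    // is followed by "</ignore>" instead of whitespace.
    $alternatives = implode('|', array_map(
        function ($e) { return preg_quote($e, '/'); },
        EXCEPTIONS
    ));
    $marked = preg_replace(
        '/\b(' . $alternatives . ')/',
        '<ignore>$1</ignore>',
        $text
    );

    // Step 2: split on whitespace that directly follows ., ! or ?.
    $parts = preg_split('/(?<=[.!?])\s+/', $marked, -1, PREG_SPLIT_NO_EMPTY);

    // Step 3: strip the tags from each sentence again.
    return array_map(function ($s) {
        return str_replace(['<ignore>', '</ignore>'], '', $s);
    }, $parts);
}
```

For the example text from the first post, this keeps "Mr. Smith's sheep." in one sentence while still splitting after "than I.". Two limitations of the sketch: a sentence that genuinely ends with a listed exception (such as "etc.") will not be split there, and any literal `<ignore>` tags already present in the input would be removed.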



Yeah, that's a good idea, I was thinking about doing something like that. Then just split on:

."
.'
!"
!'
?"
?'
.
!
?
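The delimiter list above could be expressed as a single preg_split() pattern, roughly as sketched here. The function name `splitOnTerminators()` is made up for the example, and the sketch assumes sentences are separated by whitespace.

```php
<?php
// Split wherever whitespace follows ., ! or ?, optionally with a closing
// single or double quote between the terminator and the whitespace.
function splitOnTerminators(string $text): array
{
    $pattern = '/(?:(?<=[.!?])|(?<=[.!?]["\']))\s+/';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}
```

Because the lookbehinds do not consume anything, each sentence keeps its own terminator and closing quote. On its own this still splits after abbreviations, and it also splits after a quoted exclamation that continues the same sentence (as in `"Stop!" he said.`), so it only covers the final splitting step.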

 

But does anyone have an easy way or an API to split sentences and determine valid sentence endings?

