Parsing Sentences/Determining Abbreviations

sug15 · April 21, 2010

Hey, I want to be able to extract each sentence from a block of text. I could just use either explode() or something like:

preg_match_all("/(.+)\./Uism",$text,$match); $sentences = $match[0];

However, the problem is that this will also split on abbreviations. For example, "The fox jumped over Mr. Smith's sheep." would be split into 2 sentences. Does anyone know of a simple way to get past this? There are also abbreviations for middle initials, etc. It's not so simple, though, because 2 sentences could be "John is bigger than I. However, I am faster." Regardless of whether that is correct grammar or not, things like that could come up as an issue.

I was thinking about identifying abbreviations by matching any group of 5 letters or fewer that has no vowel and ends in a period, and manually compiling a list of abbreviations that contain vowels or are longer than 5 letters and matching those too. Then I could detect name initials, such as "John H.", by checking whether the word before a single letter followed by a period begins with a capital letter.
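For reference, the heuristic described above could be sketched roughly like this. The function name looksLikeAbbreviation() and the small word list are made up for illustration, and the rules are only the ones stated in the post, so expect false positives and misses on real text.

```php
<?php
// Rough sketch of the heuristic from the post. The function name and the
// $known list are hypothetical; the list is a tiny sample, not a real dictionary.
function looksLikeAbbreviation(string $token, string $previousWord = ''): bool
{
    // Abbreviations that contain vowels or are longer than 5 letters
    // have to be listed by hand (sample entries only).
    static $known = ['etc', 'inc', 'ave', 'prof', 'gen'];

    // Only tokens that end in a period are candidates.
    if (substr($token, -1) !== '.') {
        return false;
    }

    $word = strtolower(rtrim($token, '.'));
    if ($word === '' || !ctype_alpha($word)) {
        return false;
    }

    // Single letter: treat it as an initial only if the word before it
    // starts with a capital letter ("John H." yes, "than I." no).
    if (strlen($word) === 1) {
        return $previousWord !== '' && ctype_upper($previousWord[0]);
    }

    // Rule 1: five letters or fewer with no vowel, e.g. "Mr.", "Dr.", "St."
    if (strlen($word) <= 5 && !preg_match('/[aeiou]/', $word)) {
        return true;
    }

    // Rule 2: anything else must be in the hand-made list.
    return in_array($word, $known, true);
}
```

With these rules, "Mr." and "John H." are flagged while "sheep." and "than I." are not. A name followed by a sentence-ending "I." (for example "taller than Bob and I.") would still depend entirely on the word before it, which is one of the issues mentioned.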

The above solution could work well, but it might run into issues, and is pretty complex. Anyone have an easier way or some code I could build off of? Or anyone know of an API?

Thanks!

cags · April 21, 2010

I don't think there is such a thing as an easy solution for this problem. The English language is rather complex even if everybody used it perfectly. Coping with imperfect use is going to be an even more complex task. Even then, I'm assuming you only wish to cope with English. Don't forget that a period is not the only valid character for ending a sentence: a question mark or an exclamation mark is just as valid, as are perhaps other characters. Personally, I think I might attempt to find and delimit exceptions with some kind of <ignore> tags, then use a simpler pattern for splitting (any valid char that's not between those tags), rather than writing a complex split pattern.
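A minimal sketch of that "protect the exceptions, then split simply" idea might look like the following. It swaps the period inside each known abbreviation for a placeholder character instead of wrapping it in literal <ignore> tags, which is a simplification of the suggestion. The function name splitSentences(), the sample abbreviation list, and the choice of placeholder are all assumptions for illustration.

```php
<?php
// Sketch only: splitSentences() and the abbreviation list are hypothetical.
function splitSentences(string $text): array
{
    $abbreviations = ['Mr.', 'Mrs.', 'Dr.', 'etc.']; // sample list, far from complete
    $marker = "\x1F"; // control character assumed not to occur in the input

    // 1. Protect: replace the period in each known abbreviation with the marker.
    foreach ($abbreviations as $abbr) {
        $text = str_replace($abbr, str_replace('.', $marker, $abbr), $text);
    }

    // 2. Split with a simple pattern: whitespace that follows ., ! or ?
    $parts = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // 3. Restore the protected periods in every sentence.
    return str_replace($marker, '.', $parts);
}

$sentences = splitSentences("The fox jumped over Mr. Smith's sheep. It was fast! Was it?");
// $sentences now holds three entries:
//   "The fox jumped over Mr. Smith's sheep."
//   "It was fast!"
//   "Was it?"
```

One known weakness of this approach: an abbreviation that really does end a sentence ("They sold apples, pears, etc. Then they left.") is protected too, so those two sentences stay glued together.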

sug15 · April 21, 2010

Yeah, that's a good idea, I was thinking about doing something like that. Then just split on:

."  .'  !"  !'  ?"  ?'  .  !  ?

But anyone have an easy way/API to split sentences/determine valid sentence endings?
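As a rough sketch, the endings listed above (a period, exclamation mark, or question mark, optionally followed by a closing quote) can be covered by a single preg_split() pattern. The function name splitOnEndings() is made up here, and the pattern does nothing about abbreviations, so it would still need an abbreviation check in front of it.

```php
<?php
// Sketch only: splitOnEndings() is a hypothetical name.
function splitOnEndings(string $text): array
{
    // Split on whitespace that comes right after . ! or ?
    // or right after one of those followed by a closing " or '.
    $pattern = '/(?:(?<=[.!?])|(?<=[.!?]["\']))\s+/';

    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$parts = splitOnEndings('She asked, "Are you sure?" He nodded. "Yes!"');
// $parts now holds three entries:
//   She asked, "Are you sure?"
//   He nodded.
//   "Yes!"
```

Note that a quoted exclamation in the middle of a sentence (He said "Stop!" and left.) would be split as well, since the pattern has no way to tell that the sentence continues.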

salathe · April 21, 2010

Have a human (who is familiar with the language and its particular quirks) do it.
