sug15 Posted April 21, 2010

Hey, I want to be able to extract each sentence from a block of text. I could just use explode() or something like: preg_match_all("/(.+)\./Uism",$text,$match); $sentences = $match[0];

However, the problem is that this also splits on abbreviations. For example, "The fox jumped over Mr. Smith's sheep." would be split into 2 sentences. Does anyone know of a simple way to get past this? There are also abbreviations for middle initials, etc. It's not so simple, though, because 2 sentences could be "John is bigger than I. However, I am faster." Regardless of whether that is correct grammar, things like that could come up as an issue.

I was thinking about identifying abbreviations by matching any group of 5 letters or fewer that has no vowel and ends in a period, and manually compiling a list of abbreviations that contain vowels or are longer than 5 letters and matching those as well. Then I could detect name initials, such as "John H.", by checking whether the word before a single letter followed by a period begins with a capital letter.

The above solution could work well, but it might run into issues, and it is pretty complex. Does anyone have an easier way or some code I could build off of? Or does anyone know of an API? Thanks!
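A minimal sketch of the heuristic described above, for illustration only. The function name isLikelyAbbreviation() and the contents of the manual list are assumptions, not something from the thread, and the name-initial check ("John H.") is left out.

```php
<?php
// Hypothetical helper: guesses whether a token such as "Mr." is an abbreviation.
// $knownAbbreviations stands in for the hand-compiled list; its entries are illustrative.
function isLikelyAbbreviation(string $token, array $knownAbbreviations = ['etc', 'prof', 'approx']): bool
{
    // Only tokens that end in a period are candidates.
    if (substr($token, -1) !== '.') {
        return false;
    }

    $word = strtolower(rtrim($token, '.'));
    if ($word === '' || !ctype_alpha($word)) {
        return false;
    }

    // Rule 1: five letters or fewer and no vowel, e.g. "Mr", "Dr", "St".
    if (strlen($word) <= 5 && !preg_match('/[aeiou]/', $word)) {
        return true;
    }

    // Rule 2: anything on the manual list (abbreviations with vowels or longer than 5 letters).
    return in_array($word, $knownAbbreviations, true);
}
```

Note that the vowel rule as stated also flags ordinary words such as "by." or "try." at the end of a sentence, so it would probably need an exception list of its own.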
cags Posted April 21, 2010

I don't think there is such a thing as an easy solution for this problem. The English language is rather complex even if everybody used it perfectly. Coping with imperfect use is going to be an even more complex task. Even then, I'm assuming you only wish to cope with English. Don't forget that a period is not the only valid character for ending a sentence; a question mark or exclamation mark is just as valid, as are perhaps other characters.

Personally, I think I might attempt to find and delimit exceptions with some kind of <ignore> tags, then use a simpler pattern for splitting (any valid char that's not between those tags), rather than writing a complex split pattern.
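One way the "mark the exceptions, then split with a simple pattern" idea could look in PHP. This is a sketch under assumptions: instead of literal <ignore> tags it swaps the period of each listed abbreviation for a placeholder character, and the function name, the abbreviation list, and the placeholder are all hypothetical.

```php
<?php
// Hypothetical helper: mask known abbreviations, split on simple terminators, then unmask.
function splitSentences(string $text, array $abbreviations = ['Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'etc']): array
{
    // Assumption: this control character never occurs in the input text.
    $placeholder = "\x1A";

    // Step 1: replace the period after each known abbreviation with the placeholder.
    $quoted = array_map(function ($a) {
        return preg_quote($a, '/');
    }, $abbreviations);
    $masked = preg_replace('/\b(' . implode('|', $quoted) . ')\./', '${1}' . $placeholder, $text);

    // Step 2: split on whitespace that follows ., ! or ?.
    $parts = preg_split('/(?<=[.!?])\s+/', $masked, -1, PREG_SPLIT_NO_EMPTY);

    // Step 3: restore the masked periods.
    return array_map(function ($s) use ($placeholder) {
        return str_replace($placeholder, '.', trim($s));
    }, $parts);
}

$sentences = splitSentences("The fox jumped over Mr. Smith's sheep. John is bigger than I. However, I am faster!");
// $sentences now holds three strings:
// "The fox jumped over Mr. Smith's sheep.", "John is bigger than I.", "However, I am faster!"
```

The trade-off is that a listed abbreviation that really does end a sentence (for example "etc.") will not be treated as a sentence break.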
sug15 Posted April 21, 2010 Author

Yeah, that's a good idea. I was thinking about doing something like that, and then just splitting on: ." .' !" !' ?" ?' . ! ?

But does anyone have an easy way or an API to split sentences and determine valid sentence endings?
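The terminator list above can be expressed as a single split pattern. A rough sketch, assuming straight quotes only and that a sentence break is always followed by whitespace; the function name splitOnTerminators() is hypothetical.

```php
<?php
// Hypothetical helper: splits after ., ! or ?, optionally followed by a closing " or '.
function splitOnTerminators(string $text): array
{
    // Break on whitespace that is preceded either by a bare terminator
    // or by a terminator plus one closing quote.
    $pattern = '/(?:(?<=[.!?])|(?<=[.!?]["\']))\s+/';

    return preg_split($pattern, trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$parts = splitOnTerminators('He said "Stop!" Then he left. Why?');
// $parts holds: 'He said "Stop!"', 'Then he left.', 'Why?'
```

On its own this still breaks on abbreviations such as "Mr.", so it is meant to run after the exceptions have been marked. It also splits after a quoted question or exclamation in mid-sentence, as in: She asked "Why?" and left.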
salathe Posted April 21, 2010

Have a human (who is familiar with the language and its particular quirks) do it.