
Write negated string according to POS tags


pr0no


Consider the following POS-tagged string:

 

It/PRP was/VBD not/RB okay/JJ or/CC funny/JJ and/CC I/NN will/MD never/RB buy/VB 
from/IN them/PRP ever/RB again/RB
(It was not okay or funny and I will never buy from them ever again)

 

I want to accomplish the following:

1. Check for negating adverbs (RB) against a defined array('not', 'never')
2. When there's a match, remove the adverb
3. Concatenate "not-" to the beginning of every subsequent adjective (JJ), adverb (RB), or verb (VB or VBN for past tense)
4. Remove all POS-tags (/XX)

Thus, the desired output would be:

 

It was not-okay or not-funny and I will not-buy from them not-ever not-again
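
The four steps above can be sketched as a single pass that remembers whether a negating adverb has been seen. This is only a minimal sketch: the function name `negate_tagged` and its parameters are hypothetical, and it assumes (as the desired output suggests) that negation stays in effect until the end of the string.

```php
<?php
// Hypothetical helper; the name and signature are illustrative, not from the thread.
function negate_tagged(string $tagged, array $negators = ['not', 'never']): string
{
    $negatable = ['JJ', 'RB', 'VB', 'VBN']; // tags that receive the "not-" prefix
    $negating  = false;                      // set once a negating adverb is seen
    $output    = [];

    // Split on any run of whitespace so stray line breaks do not corrupt tokens.
    foreach (preg_split('/\s+/', trim($tagged)) as $token) {
        if (strpos($token, '/') === false) {
            continue; // skip anything that is not word/TAG
        }
        [$word, $tag] = explode('/', $token, 2);

        if ($tag === 'RB' && in_array($word, $negators, true)) {
            $negating = true; // steps 1 and 2: match found, drop the adverb
            continue;
        }

        // Step 3: prefix only words that follow a negating adverb.
        // Step 4 happens implicitly because only $word is stored, never $tag.
        $output[] = ($negating && in_array($tag, $negatable, true))
            ? 'not-' . $word
            : $word;
    }

    return implode(' ', $output);
}

echo negate_tagged("It/PRP was/VBD not/RB okay/JJ or/CC funny/JJ and/CC I/NN will/MD never/RB buy/VB from/IN them/PRP ever/RB again/RB");
// prints: It was not-okay or not-funny and I will not-buy from them not-ever not-again
```

Unlike a version that prefixes every JJ/RB/VB/VBN word, this one leaves such words untouched when they appear before the first negating adverb.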

 

My first thought was to do this the way I know how: explode the string on spaces, then explode every word on "/" to get [JJ => okay], then use a switch statement to handle each word (case JJ: concatenate, etc.), but this seems very sloppy. Does anybody have a cleaner and/or more efficient way of doing this, for instance with a regex? The strings have been pre-cleaned, so they will only ever contain words (no punctuation or characters other than a-z, etc.).
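
For the regex route asked about here, one possible sketch uses `preg_replace_callback` with a closure that carries the "negation seen" flag by reference. The function name `negate_with_regex` is hypothetical, and the pattern assumes the pre-cleaned word/TAG format described above (letters only, tags in upper case, optionally ending in `$`).

```php
<?php
// Hypothetical function name; assumes input made of word/TAG tokens only.
function negate_with_regex(string $tagged): string
{
    $negating = false; // shared with the callback by reference

    $result = preg_replace_callback(
        '~([A-Za-z]+)/([A-Z$]+)~',
        function (array $m) use (&$negating): string {
            [, $word, $tag] = $m;

            if ($tag === 'RB' && in_array($word, ['not', 'never'], true)) {
                $negating = true;
                return ''; // remove the negating adverb together with its tag
            }

            $negatable = in_array($tag, ['JJ', 'RB', 'VB', 'VBN'], true);
            return ($negating && $negatable) ? 'not-' . $word : $word;
        },
        $tagged
    );

    // Collapse the gaps left by removed adverbs, plus any line breaks.
    return trim(preg_replace('/\s+/', ' ', $result));
}

echo negate_with_regex("It/PRP was/VBD not/RB okay/JJ or/CC funny/JJ");
// prints: It was not-okay or not-funny
```

Whether this is cleaner than the explode-based loop is a matter of taste; both make one pass over the tokens, so the difference in efficiency is unlikely to matter for short strings.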

 

Any tips, example code fragments, etc. would be greatly appreciated!

 

*Edit: I am aware, btw, of the very basic character of this way of treating negations, but it is good enough for what I need. There will be an error margin, but that's ok :)*

This is rough, but it works. It only requires the first part of the input string, with each word and its type identifier.

 

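// Negating adverbs that should be removed from the output.
$neg_adv = array('not', 'never');

$input = "It/PRP was/VBD not/RB okay/JJ or/CC funny/JJ and/CC I/NN will/MD never/RB buy/VB from/IN them/PRP ever/RB again/RB (It was not okay or funny and I will never buy from them ever again)";

$output = array();
foreach(explode(' ', $input) as $part)
{
    // Only handle word/TAG tokens; this skips the untagged sentence in parentheses.
    if(strpos($part, '/') !== false)
    {
        list($word, $type) = explode('/', $part);
        // Skip the negating adverbs ('not', 'never'); keep every other word.
        if($type!='RB' || !in_array($word, $neg_adv))
        {
            // Prefix adjectives, adverbs and verbs. Note: this applies to every
            // such word in the string, not only to those after a negating adverb.
            if($type=='JJ' || $type=='RB' || $type=='VB' || $type=='VBN')
            {
                $output[] = 'not-'.$word;
            }
            else
            {
                $output[] = $word;
            }
        }
    }
}

echo implode(' ', $output);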

Hey, thanks. However, it doesn't fully work as expected. Consider the input:

It/PRP was/VBD not/RB okay/JJ or/CC funny/JJ and/CC I/NN will/MD never/RB buy/VB from/IN them/PRP ever/RB again/RB

The output now is:

It was not not-okay or funny and I will not-buy from them not-ever again

However, the expected output is:

It was not not-okay or not-funny and I will not-buy from them not-ever not-again

The difference is in "not-funny" and "not-again". They are respectively a JJ and RB word, but they do not get tagged like the others. I think this is due to the second if-statement:

if($type!='RB' || !in_array($word, $neg_adv)) {
  if($type=='JJ' || $type=='RB' ...

Why do you first check if $type is not 'RB', and then check if $type *is* 'RB'? Is the first one meant to remove the negation word (not, never)? I think this is stopping "funny" and "again" from being tagged. Could you explain?
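
To illustrate the question above: the outer condition is equivalent to "not (tag is RB and word is in the list)", so it only filters out the negating adverbs themselves. Ordinary adverbs and all other words pass through and then reach the inner check. A small sketch, where the helper name `is_kept` is hypothetical:

```php
<?php
$neg_adv = array('not', 'never');

// True when a token should be kept, i.e. it is NOT a negating adverb.
function is_kept(string $word, string $type, array $neg_adv): bool
{
    return $type != 'RB' || !in_array($word, $neg_adv);
}

var_dump(is_kept('not', 'RB', $neg_adv));  // bool(false): negating adverb, dropped
var_dump(is_kept('ever', 'RB', $neg_adv)); // bool(true): ordinary adverb, kept
var_dump(is_kept('okay', 'JJ', $neg_adv)); // bool(true): not an adverb, kept
```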

Oh, never mind! It works great; for some reason, when I take live output from the database here, it produces the error described above. But it works perfectly with the string as I gave it in this post :) Thanks!

 

Yeah, there were spaces that were replaced with line-breaks. I assumed that was a copy/paste error.
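
A short sketch of why a line break in place of a space would produce the symptoms described: `explode(' ', ...)` leaves two tokens glued together, so the tag is read incorrectly. Splitting on any whitespace with `preg_split` is one way to guard against this; the sample string below is made up for illustration.

```php
<?php
// Illustrative input: a line break where a space was expected.
$input = "okay/JJ or/CC funny/JJ\nand/CC";

// explode(' ') keeps "funny/JJ\nand/CC" together as one token...
$bySpace = explode(' ', $input);
// ...so the second piece, which should be the tag, is "JJ\nand" and matches nothing.
list($word, $type) = explode('/', $bySpace[2]);

// Splitting on any run of whitespace separates the tokens correctly.
$byWhitespace = preg_split('/\s+/', trim($input));

var_dump(count($bySpace));      // int(3)
var_dump($type === "JJ\nand");  // bool(true)
var_dump(count($byWhitespace)); // int(4)
```

Replacing `explode(' ', $input)` in the loop with the `preg_split` call would make the original snippet tolerant of line breaks and repeated spaces.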

