
Removing entire phrases from a string?


binxalot


Hi, I've been reading this forum for over an hour now and I can't seem to find anything along the lines of what I'm trying to do, which is to remove specific phrases or pronouns from a string, leaving only adjectives, adverbs, etc. The problem I'm having is that when I use

 $blob = str_replace($knownWords[$b], ' ', $blob); 

(where $blob is the string of text and $knownWords is the list of words I'm looking to replace with spaces), if one of the known words is "on top of", for instance, then all letters in the string with "on", "top" or "of" get removed. So I looked into this type of string replacement:

$blob = ereg_replace('~\b'.$knownWords[$b].'\b~', " ", $blob); 

but this seems to skip all of the words and doesn't remove anything. Has anyone ever dealt with a situation like this before? I see lots of posts about finding specific words, finding text between tags, or removing special characters, but when I search the forum for removing or even finding words in a phrase I get 2 hits. Is this not the right direction to be looking in for accomplishing this?

 

Also, the reason I'm going about it this way is that I need to build up a list of adjectives from the ones left behind in large strings of phrases. I can't go about it in reverse because I can't know ahead of time which adjectives will be used.
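For anyone following along, the substring behaviour of str_replace described above can be sketched roughly like this. The sample sentence and the single word "her" are made up for illustration; they are not from the original data.

```php
<?php
// Hypothetical sample text, not from the original post.
$blob = 'herself saw her on top of the tallest tower';

// str_replace() works on raw substrings and knows nothing about words,
// so "her" is also replaced where it occurs inside "herself".
$result = str_replace('her', ' ', $blob);

echo $result, PHP_EOL;
```

The point is only that str_replace has no notion of where a word starts or ends, which is why a regex with word boundaries ends up being the better fit.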


The solution was this:

$word_escaped = preg_quote($knownWords[$b], '~'); // escape regex metacharacters and the ~ delimiter in the phrase

$pattern = '~\b' . $word_escaped . '\b~'; // match the phrase only between word boundaries

$blob = preg_replace($pattern, "", $blob, -1); // removes every match; -1 (no limit) is the default, so it can be omitted

 

Why this works I have no idea, but it does: it removes phrases from a string while leaving behind words that merely contain part of a phrase. So if you have text like "herself" but you're looking to remove the word "her", then only the standalone word "her" is removed and "herself" is left unchanged.
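Put together as a loop, the approach above might look something like the sketch below. The contents of $knownWords and $blob are invented sample data (the real list isn't shown in this thread), and replacing with a space and then collapsing whitespace is just one way to tidy the result.

```php
<?php
// Hypothetical sample data; the real $knownWords list is not shown in the thread.
$knownWords = ['on top of', 'her'];
$blob = 'She put the red hat on top of her head herself';

foreach ($knownWords as $phrase) {
    // Escape regex metacharacters and the ~ delimiter inside the phrase.
    $pattern = '~\b' . preg_quote($phrase, '~') . '\b~';
    // Replace each whole-phrase match with a single space.
    $blob = preg_replace($pattern, ' ', $blob);
}

// Collapse the runs of spaces left behind by the removals.
$blob = trim(preg_replace('~\s+~', ' ', $blob));

echo $blob, PHP_EOL; // She put the red hat head herself
```

Note that "herself" survives because there is no word boundary between "her" and "self".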


The \b matches a word boundary, that is to say the position between a 'word' character (letter, digit or underscore) and a non-word character, or the start or end of the string. The preg_quote function simply escapes the characters that have a special meaning in a regular expression, so that if your phrase had, for example, \b in it, then it would match the two literal characters \ and b rather than another word boundary.
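To make both points concrete, here is a small sketch. The phrase 'a.b' and the test strings are made-up examples, picked only because the dot is a regex metacharacter.

```php
<?php
// Hypothetical phrase containing a regex metacharacter (the dot).
$phrase = 'a.b';

$unquoted = '~\b' . $phrase . '\b~';
$quoted   = '~\b' . preg_quote($phrase, '~') . '\b~';

// Without preg_quote() the dot matches any character, so "axb" matches too.
$unquotedHitsAxb = preg_match($unquoted, 'axb');

// With preg_quote() the dot is escaped and only a literal "." matches.
$quotedHitsAxb = preg_match($quoted, 'axb');
$quotedHitsDot = preg_match($quoted, 'a.b');

// \b in action: "her" is found as a standalone word but not inside "herself".
$boundaryHitsWord   = preg_match('~\bher\b~', 'saw her today');
$boundaryHitsInside = preg_match('~\bher\b~', 'herself');

var_dump($unquotedHitsAxb, $quotedHitsAxb, $quotedHitsDot, $boundaryHitsWord, $boundaryHitsInside);
```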


The tildes in this case are delimiters. They don't have to be tildes: a delimiter can be any character that is not alphanumeric, a backslash or whitespace. As a rule of thumb, the delimiter should simply be a character that is unlikely to appear in the pattern you are attempting to match. The delimiters are in place to separate the pattern you are matching from the modifiers. In your case you don't have any modifiers, but you can have for example ...

 

$pattern = '~\b' . $word_escaped . '\b~i';

 

... to make the pattern case insensitive. You can find a list of the supported modifiers on the PCRE pages of the manual.
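As a rough illustration of the delimiter and modifier points, the sketch below runs the same word through a case-sensitive pattern, a case-insensitive one, and one that uses # instead of ~ as the delimiter. The word 'her' and the text 'Her hat' are made-up sample values.

```php
<?php
// Hypothetical sample values for illustration.
$text = 'Her hat';
$word_escaped = preg_quote('her', '~');

// No modifier: matching is case sensitive, so "Her" is left alone.
$sensitive = preg_replace('~\b' . $word_escaped . '\b~', '', $text);

// The i modifier after the closing delimiter makes the match case insensitive.
$insensitive = preg_replace('~\b' . $word_escaped . '\b~i', '', $text);

// Same pattern with # as the delimiter; pass the same character to preg_quote().
$hashDelimited = preg_replace('#\b' . preg_quote('her', '#') . '\b#i', '', $text);

var_dump($sensitive, $insensitive, $hashDelimited);
```

Whichever delimiter you pick, pass that same character as the second argument to preg_quote so that any occurrence of it inside the phrase gets escaped.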
