Substring whole words from any position

BagoZonde · February 11, 2013

I'm looking for a regex pattern that extracts whole words from a substring, even one taken from the middle of a text. Words should be delimited by a space, tab or CR.

Here's a simple example:

$string="The quick brown fox jumps over the lazy dog";
print preg_replace($pattern, '', substr($string, 7, 15));

So in this example, for the string "ck brown fox ju", I want to get:

brown fox

Of course, I'm aware of the cases where the substring starts at 0 or runs to the last character of the string, but those are easy; I just need the regex pattern.

It looks like a common case. I tried \b, \w, \s and other things myself, then searched the net, but I haven't found a solution yet.

I'd appreciate any help; I'm tired of regex for today. I've written this function by iterating with substr(), but I'm not satisfied with it. I'm looking for something more elegant, so I'd like to learn more about regex.

Thanks!
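
For the example in this post, one possible pattern is sketched below. It is not something proposed in the thread, and it assumes that both edges of the fragment are cut-off words: it removes everything up to and including the first run of whitespace, and everything from the last run of whitespace onward.

```php
<?php
$string = "The quick brown fox jumps over the lazy dog";

// Assumed pattern (not from the thread): strip a leading cut-off word with the
// whitespace after it, and the trailing whitespace plus cut-off word.
$pattern = '/^\S*\s+|\s+\S*$/';

$result = preg_replace($pattern, '', substr($string, 7, 15));
print $result; // brown fox
```

Note that a fragment which happens to start or end exactly on a word boundary still loses that edge word, because the pattern cannot see the characters outside the fragment.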

Christian F. · February 11, 2013

The regular expression is actually the easy part of this request; the hard part comes when you analyze the words against the English vocabulary. You have to figure out how a script, run by a computer with absolutely zero reading comprehension, can determine what constitutes a "valid" word in English.

Regular expressions can only tell you whether or not any collection of characters follows the structure of what constitutes a "word", not if it's actually a word or just some random data that looks like one.

Anyway, the RegExp you need is this:

'/([a-z\\pL]+)/iu'

Use that with preg_match_all() and you'll get every run of one or more letters.

BagoZonde · February 13, 2013

Thank you very much, Christian. It doesn't quite meet my requirements, but I'm on the right track thanks to you. I need to break the text into words at space, tab and CR, as I mentioned in my first post. Your pattern gives me words separated at space characters only. I understand that it takes letters only, but I want to split the text into an array at space, tab and CR, so I need a pattern that excludes exactly those. As for commas and colons: they should stay attached to the word or digit as they are. Another example will show exactly what I'm looking for:

The bed is a bundle of paradoxes: we go to it with reluctance,
yet we quit it with regret;
we make up our minds every night to leave it early,
but we make up our bodies every morning to keep it late.

Ogden Nash

I'm writing a simple text search engine. Say I searched for the word "early" and now want to show the result as an excerpt: if "early" was found at position 154, I want to take the range -100 : +100. So I only need to cut that part of the string into an array of words (with commas and other characters), then unset the first and last word (as they may not be whole words), then implode with a space character.

Using \S I can split out the words, but I don't know how to split at CR (^\n and ^\r don't work). I tried this one:
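
The steps described here (cut a range around the hit, split into words, drop the edge words, implode) could be sketched roughly as follows. The function name context_snippet() and its parameters are hypothetical, the range is measured from the end of the search word rather than exactly -100 : +100, and substr() works on bytes, so multibyte text would need the mb_* functions instead.

```php
<?php
// Hypothetical helper (not from the thread): returns the words around the
// first occurrence of $needle, dropping the possibly cut-off edge words.
function context_snippet(string $text, string $needle, int $radius = 100): ?string
{
    $pos = stripos($text, $needle);
    if ($pos === false) {
        return null; // search word not found
    }

    // Clamp the range to the bounds of the text.
    $start = max(0, $pos - $radius);
    $end   = min(strlen($text), $pos + strlen($needle) + $radius);
    $chunk = substr($text, $start, $end - $start);

    // Split on any run of whitespace (space, tab, CR, LF).
    $words = preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY);

    // Drop an edge word only when the cut did not land on the edge of the text.
    if ($start > 0) {
        array_shift($words);
    }
    if ($end < strlen($text)) {
        array_pop($words);
    }

    return implode(' ', $words);
}

$snippet = context_snippet("The quick brown fox jumps over the lazy dog", "fox", 8);
echo $snippet, PHP_EOL; // brown fox jumps
```

With a radius of 8 the raw cut is "k brown fox jumps o"; the partial words "k" and "o" are dropped before imploding.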

preg_match_all('/([\S]+)/', $string, $words);

And why is the results array doubled?

I found preg_split(), so I think it may be better to focus on that in this case:
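
A likely explanation for the doubled array, shown as a small sketch: the parentheses in the pattern form a capturing group, and preg_match_all() fills index 0 with the full matches plus one further index per group. Here the group spans the whole match, so both lists are identical. The sample string is made up for illustration.

```php
<?php
$string = "we make up\tour minds\r\nevery night";

// With a capturing group: index 0 holds the full matches, index 1 the group.
preg_match_all('/(\S+)/', $string, $withGroup);

// Without the group only index 0 is filled.
preg_match_all('/\S+/', $string, $noGroup);

var_dump(count($withGroup));                // int(2)
var_dump($withGroup[0] === $withGroup[1]);  // bool(true)
var_dump(count($noGroup));                  // int(1)
echo implode('|', $noGroup[0]), PHP_EOL;    // we|make|up|our|minds|every|night
```

The last line also shows that \S+ already separates words at the tab and at the CR/LF pair, not only at spaces.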

$words = preg_split( "/(\s|\t)/", $string);

However, \n, \r, or even \x0D or \x0D\x0A (chr(13).chr(10)) don't work for me in this OR alternation. And I'm not sure about tab either.

Thank you for your interest!

Christian F. · February 13, 2013

The regular expression I posted above does indeed work as requested:

php > $string = "The bed is a bundle of paradoxes: we go to it with reluctance,
php " yet we quit it with regret;
php " we make up our minds every night to leave it early,
php " but we make up our bodies every morning to keep it late.
php " 
php " Ogden Nash"; //"
php > preg_match_all ('/([a-z\\pL]+)/iu', $string, $matches);
php > var_dump ($matches);
array(2) {
 [0]=>
 array(44) {
   [0]=>
   string(3) "The"
   [1]=>
   string(3) "bed"
   [2]=>
   string(2) "is"
   [3]=>
   string(1) "a"
   [4]=>
   string(6) "bundle"
   [5]=>
   string(2) "of"
   [6]=>
   string(9) "paradoxes"
   [7]=>
   string(2) "we"
   [8]=>
   string(2) "go"
   [9]=>
   string(2) "to"
   [10]=>
   string(2) "it"
   [11]=>
   string(4) "with"
   [12]=>
   string(10) "reluctance"
   [13]=>
   string(3) "yet"
   [14]=>
   string(2) "we"
   [15]=>
   string(4) "quit"
   [16]=>
   string(2) "it"
   [17]=>
   string(4) "with"
   [18]=>
   string(6) "regret"
   [19]=>
   string(2) "we"
   [20]=>
   string(4) "make"
   [21]=>
   string(2) "up"
   [22]=>
   string(3) "our"
   [23]=>
   string(5) "minds"
   [24]=>
   string(5) "every"
   [25]=>
   string(5) "night"
   [26]=>
   string(2) "to"
   [27]=>
   string(5) "leave"
   [28]=>
   string(2) "it"
   [29]=>
   string(5) "early"
   [30]=>
   string(3) "but"
   [31]=>
   string(2) "we"
   [32]=>
   string(4) "make"
   [33]=>
   string(2) "up"
   [34]=>
   string(3) "our"
   [35]=>
   string(6) "bodies"
   [36]=>
   string(5) "every"
   [37]=>
   string(7) "morning"
   [38]=>
   string(2) "to"
   [39]=>
   string(4) "keep"
   [40]=>
   string(2) "it"
   [41]=>
   string(4) "late"
   [42]=>
   string(5) "Ogden"
   [43]=>
   string(4) "Nash"
 }
// Snipped repeating array.

BagoZonde · February 13, 2013

Unfortunately not, because the commas and semicolon are missing; haven't you noticed? I want to cut a range of this string into words (with semicolons, etc.), then implode them back into a string with space characters, as I want to display something like this:

...minds every night to leave it early, but we make up our bodies every morning...

For now, preg_split() works like a charm for my purposes, but I can't separate words when a CR is between them.

It's just an excerpt of the text: if I'm looking for the word "early", I want to see some of the context, the way the Google search engine does.

Jessica · February 13, 2013

Christian's code does work with commas; he has one at "reluctance,"

BagoZonde · February 13, 2013

Unfortunately, I can't see any (even for "reluctance") in his post (I ran that pattern on my server and the same output was printed). And there's no "regret;", "early,", or "late.".

Jessica · February 13, 2013

Try clicking the spoiler button. It's all there.

Christian F. · February 13, 2013

Jessica: It seems that he wants the punctuation as part of the results, not just the words.

BagoZonde: As stated, matching the words is not the same as matching the words and the punctuation. However, it is easy to remedy: just add the punctuation you want to match in a character group after the "word" character group, and make it optional.

Do take note that this will make it impossible to validate the words as proper English, unless you manually strip out the punctuation marks first. Again, contrary to what you desired according to your original post.
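
As an illustration of that suggestion, the pattern could be extended along these lines. The set of punctuation marks in the second character group is an assumption chosen for this sketch, not something specified in the thread.

```php
<?php
$string = "with reluctance,\nyet we quit it with regret;";

// Letters, optionally followed by one punctuation mark.
// The list ,.;:!? is an assumed example set; extend it as needed.
preg_match_all('/[a-z\pL]+[,.;:!?]?/iu', $string, $matches);

echo implode(' ', $matches[0]), PHP_EOL;
// with reluctance, yet we quit it with regret;
```

Each mark has to be listed explicitly, so anything outside the list (digits, quotes, hyphens) is still dropped from the results.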

BagoZonde · February 13, 2013

Hello Christian!

Yes, I want words with punctuation. The subject of this thread is a short description of the goal I described in my first post. I want to break at CR, space or tab. And I don't want to specify every character, as it's easier and safer to say when to break. So I want a "blacklist", not a "whitelist".

So, is there an easy way to break the string at CR, tab or space? I know how to break at the space character, as I mentioned in my second post in this thread, but I have no idea how to include CR or tab too.

BagoZonde · February 14, 2013

OK, it was easy. I found this pattern:

$carriage=preg_split('/(\s|\t|\r)/', $string);

Cheers!
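
A possible refinement of that pattern, as a sketch with a made-up sample string: in PCRE, \s already matches space, tab, CR and LF, so the alternation is not needed. Splitting on a single whitespace character also leaves empty entries wherever two of them are adjacent (for example a CR/LF pair), which \s+ together with PREG_SPLIT_NO_EMPTY avoids.

```php
<?php
// The pattern from the post splits "\r\n" twice and leaves an empty entry.
$naive = preg_split('/(\s|\t|\r)/', "early,\r\nbut");
var_dump($naive === ['early,', '', 'but']); // bool(true)

$string = "leave it early,\r\nbut we\tmake up  our bodies";

// One or more whitespace characters act as a single separator;
// PREG_SPLIT_NO_EMPTY drops empty entries at the start or end.
$words = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

echo implode(' ', $words), PHP_EOL;
// leave it early, but we make up our bodies
```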
