
Substring whole words from any position


BagoZonde


I'm looking for a regex pattern that extracts whole words from a substring taken from the middle of a text. Words should be split on space, tab or CR.

 

Here's a simple example:

 

$string="The quick brown fox jumps over the lazy dog";
print preg_replace($pattern, '', substr($string, 7, 15));

 

So in this example, for the string "ck brown fox ju", I want to get:

 

brown fox

 

Of course I know the slice could start at 0 or run to the last character of the string, but that case is easy. I just need the regex pattern.

 

It looks like a common case. I tried \b, \w, \s and other things myself, then searched the net more deeply, but I haven't found a solution yet.

 

I'd appreciate any help; I'm tired of regex for today. I've written this function by iterating with substr(), but I'm not satisfied with it. I'm looking for something more elegant, and I'd like to learn more about regex.

 

Thanks!

Edited by BagoZonde
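
For the "ck brown fox ju" case described above, one possible pattern removes the partial word at each end of the slice. This is only a sketch: the function name trim_partial_words() is hypothetical, and it assumes the first and last word of a slice are always treated as partial, so a slice that happens to begin or end exactly on a word boundary loses a whole word too.

```php
<?php
// Hypothetical helper (not from the thread): drops the partial word
// at the start and at the end of a slice.
function trim_partial_words(string $slice): string
{
    // ^\S*\s+  : leading run of non-whitespace plus the whitespace after it
    // \s*\S*\z : optional whitespace plus the trailing run of non-whitespace
    return preg_replace('/^\S*\s+|\s*\S*\z/', '', $slice);
}

$string = "The quick brown fox jumps over the lazy dog";
echo trim_partial_words(substr($string, 7, 15)), "\n"; // brown fox
```

Because \s matches space, tab, CR and LF alike, the same pattern should also cope with slices that contain line breaks.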
Link to comment
Share on other sites

The regular expression is actually the easy part of this request. The hard part comes when you have to check the words against the English vocabulary. You have to figure out how a script, run by a computer that has absolutely zero reading comprehension, can decide what constitutes a "valid" English word.

Regular expressions can only tell you whether or not any collection of characters follows the structure of what constitutes a "word", not if it's actually a word or just some random data that looks like one.

 

Anyway, the RegExp you need is this:

'/([a-z\\pL]+)/iu'

 

Use that with preg_match_all () and you'll get everything that consists only of one or more letters.

Link to comment
Share on other sites

Thank you very much Christian. It doesn't meet my requirements, but thanks to you I'm on the right track. I need to break words on space, tab and CR, as I mentioned in my first post. Your code gives me words divided on the space character only. I understand that it takes letters only, but I want to split the text into an array on space, tab and CR, so I need a pattern that excludes exactly those characters. As for commas and colons: they should stay attached to the word or digit as they are. Another example will show exactly what I'm looking for:

 

The bed is a bundle of paradoxes: we go to it with reluctance,
yet we quit it with regret;
we make up our minds every night to leave it early,
but we make up our bodies every morning to keep it late.

Ogden Nash

 

 

I am writing a simple text search engine. Say I was looking for the word "early". Now I want to show the result as an excerpt: if "early" was found at position 154, I want to take the range -100 : +100 around it. So I only need to cut that part of the string into an array of words (keeping commas and other characters), unset the first and last word (as they might not be whole words), then implode with a space character.

 

Using \S I can explode the words, but I don't know how to split on CR (^\n and ^\r don't work). I tried this:

 

preg_match_all('/([\S]+)/', $string, $words);

 

And why is the results array doubled?
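
On the doubled array: preg_match_all() stores the full matches at index 0 and the contents of each capture group at the following indexes. Since the parentheses in '/([\S]+)/' wrap the whole pattern, group 1 repeats the full match. A minimal sketch (the variable names are illustrative only):

```php
<?php
$string = "we quit it with regret;\nwe make up our minds";

// With a capture group: index 0 holds the full matches,
// index 1 holds what group 1 captured.
preg_match_all('/(\S+)/', $string, $withGroup);

// Without the group only index 0 exists.
preg_match_all('/\S+/', $string, $noGroup);

echo count($withGroup), "\n";              // 2
echo count($noGroup), "\n";                // 1
var_dump($withGroup[0] === $withGroup[1]); // bool(true)
```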

 

I found preg_split(), so I think it would be better to focus on that in this case:

 

$words = preg_split( "/(\s|\t)/", $string);

 

However, \n or \r, or even \x0D or \x0D\x0A (chr(13).chr(10)), don't seem to work in this OR statement. And I'm not sure about tab either.

 

Thank you for your interest!
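
A note on the preg_split() attempt: in PCRE, \s already matches space, tab, carriage return and line feed, so no separate alternatives are needed. Splitting on a single whitespace character, however, yields empty strings wherever two of them are adjacent (for example between \r and \n). A sketch that splits on runs of whitespace instead, using a made-up sample string:

```php
<?php
// Sample text with a CRLF, a tab and a run of spaces.
$string = "reluctance,\r\nyet we\tquit it   with regret;";

// \s+ treats any run of whitespace as one separator;
// PREG_SPLIT_NO_EMPTY drops empty pieces at the edges.
$words = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Punctuation stays attached to the words because only whitespace is consumed by the split.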

Link to comment
Share on other sites

The regular expression I posted above does indeed work as requested:

 

php > $string = "The bed is a bundle of paradoxes: we go to it with reluctance,
php " yet we quit it with regret;
php " we make up our minds every night to leave it early,
php " but we make up our bodies every morning to keep it late.
php " 
php " Ogden Nash"; //"
php > preg_match_all ('/([a-z\\pL]+)/iu', $string, $matches);
php > var_dump ($matches);
array(2) {
 [0]=>
 array(44) {
   [0]=>
   string(3) "The"
   [1]=>
   string(3) "bed"
   [2]=>
   string(2) "is"
   [3]=>
   string(1) "a"
   [4]=>
   string(6) "bundle"
   [5]=>
   string(2) "of"
   [6]=>
   string(9) "paradoxes"
   [7]=>
   string(2) "we"
   [8]=>
   string(2) "go"
   [9]=>
   string(2) "to"
   [10]=>
   string(2) "it"
   [11]=>
   string(4) "with"
   [12]=>
   string(10) "reluctance"
   [13]=>
   string(3) "yet"
   [14]=>
   string(2) "we"
   [15]=>
   string(4) "quit"
   [16]=>
   string(2) "it"
   [17]=>
   string(4) "with"
   [18]=>
   string(6) "regret"
   [19]=>
   string(2) "we"
   [20]=>
   string(4) "make"
   [21]=>
   string(2) "up"
   [22]=>
   string(3) "our"
   [23]=>
   string(5) "minds"
   [24]=>
   string(5) "every"
   [25]=>
   string(5) "night"
   [26]=>
   string(2) "to"
   [27]=>
   string(5) "leave"
   [28]=>
   string(2) "it"
   [29]=>
   string(5) "early"
   [30]=>
   string(3) "but"
   [31]=>
   string(2) "we"
   [32]=>
   string(4) "make"
   [33]=>
   string(2) "up"
   [34]=>
   string(3) "our"
   [35]=>
   string(6) "bodies"
   [36]=>
   string(5) "every"
   [37]=>
   string(7) "morning"
   [38]=>
   string(2) "to"
   [39]=>
   string(4) "keep"
   [40]=>
   string(2) "it"
   [41]=>
   string(4) "late"
   [42]=>
   string(5) "Ogden"
   [43]=>
   string(4) "Nash"
 }
// Snipped repeating array.

 

Edited by Christian F.
Link to comment
Share on other sites

Unfortunately not, because the commas and semicolons are missing. Haven't you noticed that? I want to cut a range of this string into words (with semicolons, etc.), then implode it back into a string with space characters, because I want to display something like this:

 

...minds every night to leave it early, but we make up our bodies every morning...

 

For now, preg_split() works like a charm for my purposes, but I can't separate words when there is a CR between them.

It's just an excerpt of the text: if I'm looking for the word "early", I want to see some of its context, the way the Google search engine does.

Edited by BagoZonde
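
The excerpt idea described in this post (cut a range around the hit, split into words, drop the edge words, implode with spaces) could be sketched as follows. The function name excerpt() and its parameters are hypothetical, and the code is byte-based, so multibyte text would likely need the mb_* string functions instead.

```php
<?php
// Hypothetical helper: returns the words around the first occurrence
// of $needle, or null when $needle is not found.
function excerpt(string $text, string $needle, int $radius = 100): ?string
{
    $pos = strpos($text, $needle);
    if ($pos === false) {
        return null;
    }

    // Clamp the window to the bounds of the text.
    $start = max(0, $pos - $radius);
    $end   = min(strlen($text), $pos + strlen($needle) + $radius);
    $slice = substr($text, $start, $end - $start);

    // Split on any run of whitespace (space, tab, CR, LF).
    $words = preg_split('/\s+/', $slice, -1, PREG_SPLIT_NO_EMPTY);

    // Drop an edge word only if the window was actually cut on that side.
    if ($start > 0) {
        array_shift($words);
    }
    if ($end < strlen($text)) {
        array_pop($words);
    }

    return implode(' ', $words);
}

$pangram = "The quick brown fox jumps over the lazy dog";
echo excerpt($pangram, 'fox', 8), "\n"; // brown fox jumps
```

A window that happens to end exactly on a word boundary still loses that whole word; checking the neighbouring character in $text would avoid this at the cost of a little more code.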
Link to comment
Share on other sites

Jessica: It seems that he wants the punctuation as part of the results, not just the words.

 

BagoZonde: As stated, matching the words is not the same as matching the words and the punctuation. However, it is easy to remedy: just add the punctuation you want to match in a character group after the "word" character group, and make it optional.

 

Do take note that this will make it impossible to validate the words as proper English unless you manually strip out the punctuation marks first. Again, this is contrary to what you asked for in your original post.

Edited by Christian F.
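
One way to read the suggestion above is to append an optional punctuation class to the letter class from the earlier pattern. The set of punctuation marks below is an assumption chosen for illustration, not something specified in the thread, and digits or apostrophes would still be dropped unless added to the first class.

```php
<?php
$string = "leave it early,\nbut we";

// Letters (ASCII or any Unicode letter), optionally followed by
// one punctuation mark from an assumed whitelist.
preg_match_all('/[a-z\pL]+[,.;:!?]?/iu', $string, $matches);

print_r($matches[0]);
```

Leaving out the capturing parentheses keeps the result in $matches[0] only.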
Link to comment
Share on other sites

Hello Christian!

Yes, I want words with punctuation. The subject of this thread is a short description of the goal I described in my first post. I want to break on CR, space or tab. And I don't want to specify every sign, as it's easier and safer to say when to break. So I want a "blacklist", not a "whitelist".

 

So, is there an easy way to break the string on CR, tab or space? I know how to break on the space character, as I mentioned in my second post in this thread. But I have no idea how to include CR or tab too.

Link to comment
Share on other sites
