Looking for a regex to find multiline text blocks that a) may or b) may not contain a keyword


Hi!

I have a longer text in which I want to distinguish between textblocks that contain a certain keyword and those that don't.

smaple woiefjeowijji oj
oiewjfoewijfoiwejfiojewf keyword owiejfioejoij
oiewjfioewjf smaple
smaple ojioewj fijo
oieiojewf keyword owiejfioejoij
oiewjfioewjf smaple
smaple woiefjeowijji fijo
oiewjfoewijfoiwejf owiejfioejoij
oiewjfioewjf smaple

1. The textblocks I want to find start with "sample" and end with "sample".
2. The textblocks I want to find then start with "sample" and end with "sample" and do contain the "keyword".

I just can't find the right regular expression. What should I use?

I'm not sure I follow exactly what you are trying to find. You gave some example input, but it would have been helpful to see what you expect to be returned. Specifically, I'm not sure what you mean by "textblock". I *think* you mean a line that starts with 'sample', followed by however many lines it takes until a line ends with 'sample'. Note, however, that the word "sample" never appears in your text; there is a word spelled "smaple", and I have no idea what that is. How you word #1 and #2 is confusing as well. Are you saying you want to find the first textblock that starts and ends with "sample", and then find the NEXT textblock the same way, except that the second one contains the keyword? Or does #2 mean the keyword is supposed to be in the first textblock?

This may work for you. Here is a function that returns the "textblocks" that begin and end with a delimiter string or, if a keyword is provided, only the "textblocks" that also contain that keyword. Because I build the regular expressions programmatically, it may be difficult to see how they are constructed. The two formats look like this: #^smaple.*?smaple#ms and #^smaple.*?keyword.*?smaple#ms

<?php

$text = "smaple woiefjeowijji oj
oiewjfoewijfoiwejfiojewf keyword owiejfioejoij
oiewjfioewjf smaple
smaple ojioewj fijo
oieiojewf keyword owiejfioejoij
oiewjfioewjf smaple
smaple woiefjeowijji fijo
oiewjfoewijfoiwejf owiejfioejoij
oiewjfioewjf smaple";

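function findTextBlocks($input, $delimiter, $keyword='')
{
    //Escape the arguments so regex metacharacters in them are matched literally
    $delimiter = preg_quote($delimiter, '#');
    if($keyword!='') { $keyword = preg_quote($keyword, '#') . ".*?"; }
    //m: ^ matches at the start of every line, s: . also matches line breaks
    $pattern = "#^{$delimiter}.*?{$keyword}{$delimiter}#ms";
    //echo $pattern; //Uncomment to inspect the generated pattern
    preg_match_all($pattern, $input, $matches);
    return $matches;
}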

$textBlocks = findTextBlocks($text, 'smaple');
echo "<pre>".print_r($textBlocks, true)."</pre>";

$textBlocksWithKeyword = findTextBlocks($text, 'smaple', 'keyword');
echo "<pre>".print_r($textBlocksWithKeyword, true)."</pre>";

?>

Output #1

Array
(
    [0] => Array
        (
            [0] => smaple woiefjeowijji oj
oiewjfoewijfoiwejfiojewf keyword owiejfioejoij
oiewjfioewjf smaple
            [1] => smaple ojioewj fijo
oieiojewf keyword owiejfioejoij
oiewjfioewjf smaple
            [2] => smaple woiefjeowijji fijo
oiewjfoewijfoiwejf owiejfioejoij
oiewjfioewjf smaple
        )
)

Output #2

Array
(
    [0] => Array
        (
            [0] => smaple woiefjeowijji oj
oiewjfoewijfoiwejfiojewf keyword owiejfioejoij
oiewjfioewjf smaple
            [1] => smaple ojioewj fijo
oieiojewf keyword owiejfioejoij
oiewjfioewjf smaple
        )
)
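
One caveat with the keyword pattern above: `.*?keyword` is free to run past a closing delimiter, so when a block without the keyword comes before a block with it, the two can come back as one match. Below is a minimal sketch of one way around that, assuming the delimiter word never appears inside a block's body. The function name `findBlocksWithKeyword` and the sample data are made up for illustration and are not part of the code above.

```php
<?php

// Hypothetical helper (not from the posts above): returns only the blocks
// that contain $keyword, without letting a match spill into the next block.
function findBlocksWithKeyword($input, $delimiter, $keyword)
{
    $d = preg_quote($delimiter, '#');
    $k = preg_quote($keyword, '#');
    // (?:(?!delimiter).)*? consumes one character at a time but refuses to
    // step onto the delimiter, so the keyword must be found inside this block.
    $pattern = "#^{$d}(?:(?!{$d}).)*?{$k}(?:(?!{$d}).)*{$d}#ms";
    preg_match_all($pattern, $input, $matches);
    return $matches[0];
}

// First block has no keyword, second block has one.
$sample = "smaple aaa\nno match here\nbbb smaple\n"
        . "smaple ccc\nhas keyword inside\nddd smaple";

$found = findBlocksWithKeyword($sample, 'smaple', 'keyword');
print_r($found);
```

With this sample, only the second block should be returned; the simpler pattern would start at the first "smaple" and run through to the end of the second block.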

I definitely was not exact enough.

Let's look at another example: the whole text is made of multiline blocks that start with "start" and end with "end".
Each textblock has 0, 1, or many occurrences of "keyword".

In the first run I want to remove all textblocks that have 0 occurrences of the "keyword".
In the second run, again on the original whole text, I want to remove all textblocks that have 1 or many occurrences of the "keyword".

start
wofj keyword
wopkefpwoekf
end
start
oidfgoj
pwefkoewfk
end
[and so on, many more times]

First run would remove the second textblock. Second run would remove the first textblock.
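
To make the two runs concrete, here is a rough sketch that only lists the keyword count of each block, which shows at a glance which blocks each run should remove. It assumes "start" and "end" each sit on a line of their own, that lines end with \n, and that blocks are not nested. The function name `countKeywordPerBlock` is invented for this example.

```php
<?php

// Hypothetical helper: splits the text into start...end blocks and
// counts how often $keyword occurs in each one.
function countKeywordPerBlock($input, $keyword)
{
    // With the m flag, ^start$ and ^end$ only match whole lines;
    // the s flag lets . run across line breaks.
    preg_match_all('#^start$.*?^end$#ms', $input, $matches);
    $counts = array();
    foreach ($matches[0] as $block) {
        $counts[] = substr_count($block, $keyword);
    }
    return $counts;
}

$example = "start\nwofj keyword\nwopkefpwoekf\nend\n"
         . "start\noidfgoj\npwefkoewfk\nend";

$counts = countKeywordPerBlock($example, 'keyword');
print_r($counts);
```

Blocks with a count of 0 are the ones the first run removes; blocks with a count of 1 or more are the ones the second run removes.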

It's pretty frustrating when someone asks for help and then they change the requirements. It would also be helpful if you provided REAL content instead of something with gibberish.

<?php

$text = "start
00000000000000000
REMOVE ON FIRST PASS
end
start
11111111111111111111
XXXXXXX keyword XXXXXXXXXXX
end
start
oidfgoj
11111111111111111111
keyword
REMOVE ON 2ND PASS
end
start
abcd keyword efg
2222222222222222222
REMOVE ON 2ND PASS
abcd keyword abcd
end
SOME OTHER TEXT
WILL NOT BE REPLACED
start
0000000000000000000
REMOVE ON FIRST PASS
end
start
keyword
oidfgoj
33333333333333333333333
abcd REMOVE ON 2ND PASS abcd
keyword fdsfs keyword
pwefkoewfk
end";

function replaceTextBlocks($input, $startDelimiter, $endDelimiter, $keyword='keyword', $withKeyword=false)
{
    //Escape the arguments so they are matched literally in the pattern
    $start = preg_quote($startDelimiter, '#');
    $end   = preg_quote($endDelimiter, '#');
    $key   = preg_quote($keyword, '#');
    if(!$withKeyword)
    {   //Remove blocks w/o the keyword: no keyword may occur before the first end delimiter
        $pattern = "#^{$start}((?!{$key}).)*?{$end}[\n\r]*#ms";
    }
    else
    {   //Remove blocks with the keyword: the keyword must occur before the first end delimiter
        $pattern = "#^{$start}((?!{$end}).)*?{$key}.*?{$end}[\n\r]*#ms";
    }
    //echo $pattern;
    $newText = preg_replace($pattern, '', $input);
    return $newText;
}

echo "Original text: <pre>{$text}</pre><br>\n";
//Both passes run on the original text, as requested
$firstPass = replaceTextBlocks($text, 'start', 'end', 'keyword');
echo "First Pass: <pre>{$firstPass}</pre><br>\n";
$secondPass = replaceTextBlocks($text, 'start', 'end', 'keyword', true);
echo "Second Pass: <pre>{$secondPass}</pre><br>\n";

?>
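
If the lookaheads feel hard to maintain, an alternative sketch is to match every block with one simple pattern and let a callback decide whether it stays. This assumes the delimiters sit on lines of their own, lines end with \n, and blocks are not nested. `removeBlocks` and its `$removeWithKeyword` flag are names invented here and are not part of the code above.

```php
<?php

// Hypothetical alternative: $removeWithKeyword = false drops the blocks
// without the keyword, true drops the blocks that contain it.
function removeBlocks($input, $start, $end, $keyword, $removeWithKeyword)
{
    $s = preg_quote($start, '#');
    $e = preg_quote($end, '#');
    // One block = a start line, anything (lazy), an end line,
    // plus the line break that follows it, if any.
    $pattern = '#^' . $s . '$.*?^' . $e . '$\n?#ms';
    return preg_replace_callback($pattern, function ($m) use ($keyword, $removeWithKeyword) {
        $hasKeyword = strpos($m[0], $keyword) !== false;
        // Return '' to delete the block, or the block itself to keep it.
        return ($hasKeyword === $removeWithKeyword) ? '' : $m[0];
    }, $input);
}

$original = "start\nwofj keyword\nend\nOTHER TEXT\nstart\noidfgoj\nend\n";

// Both runs work on the original text.
$firstRun  = removeBlocks($original, 'start', 'end', 'keyword', false);
$secondRun = removeBlocks($original, 'start', 'end', 'keyword', true);

echo $firstRun;   // keeps the keyword block and OTHER TEXT
echo $secondRun;  // keeps OTHER TEXT and the block without the keyword
```

The trade-off is that the keyword test moves out of the regex and into plain PHP, which is easier to extend (for example to a minimum number of occurrences) at the cost of one callback call per block.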

 
