Regex lookaround to prevent adding a link inside another link

webtek · July 16, 2021

I use this code to replace words with links:

      $text = preg_replace(
        '#\b(test\pL*)#ui',
        "<a href='$tag_url'>$1</a>", 
        $text,
        1
      );

But if I have a $text like this, it will try to create a link inside an existing <a href> tag, resulting in broken HTML code.

    $text = "hello this is a test with a <a href='https://google.com'>test link</a>";

How can I avoid this?

requinix · July 17, 2021

You're also going to have a problem any time you try to replace words like "href".

Use preg_split() to split the text into an array of alternating text and markup by looking for opening and closing HTML tags.

preg_split('#<\w+[^>]*>|</\w+>#'

Perform your replacements on the text-only items of the resulting array. For efficiency's sake, rather than looping through the array yourself, it would be better to split the array into two (text and markup) and pass the whole text array to preg_replace.
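A minimal sketch of that idea, not a finished solution. It assumes the split pattern is wrapped in a capturing group and used with PREG_SPLIT_DELIM_CAPTURE, so the tags are kept in the result; the sample text and the $tag_url value are made up for illustration. It does not yet treat existing links specially.

```php
<?php
// Hypothetical sample input and link target (not from the original post).
$text = "a test with <b class='test'>bold test</b>";
$tag_url = '/tag/test';

// With the delimiter captured, preg_split() returns text and markup alternately:
// even indexes hold text, odd indexes hold the tags that were matched.
$parts = preg_split('#(<\w+[^>]*>|</\w+>)#', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Pull out the text pieces while keeping their original keys.
$textParts = array_filter($parts, fn($i) => $i % 2 === 0, ARRAY_FILTER_USE_KEY);

// preg_replace() accepts an array subject and returns an array with the same keys.
$textParts = preg_replace('#\b(test\pL*)#ui', "<a href='$tag_url'>$1</a>", $textParts);

// Lay the replaced text back over the original pieces and rebuild the string.
$result = implode('', array_replace($parts, $textParts));
echo $result, "\n";
```

The class='test' attribute is left alone because it lives in a markup piece, which never reaches preg_replace.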

webtek · July 17, 2021

Nice, I got it to work using your solution, thanks!

Is there a performance difference with this solution compared to using pure regex with lookarounds?

I have thousands of tags to replace with links in a text, and the more tags there are in the text, the bigger the array from preg_split gets. I have to loop through the array for each of the thousands of tags.

requinix · July 17, 2021

Just realized that my explanation needed a bit more: simply splitting on tags gets you text vs. markup, so the step after that would be scanning the array of alternating text/markup and skipping past the text sections that are between "<a" and "</a>" elements. A little more work.
The obvious alteration to that strategy, and I expect this is what you did, is to split not on any tag but on a "<a...</a>" string, then deal with an array of alternating non-links/links. That still leaves you with the problem of accidentally replacing tags within markup, such as the "src" of an image.

An alternative to preg_split is preg_replace_callback, where you trade operating on an array or three for performing your tag replacements immediately. It still has the same text vs. markup/non-link vs. link problem.
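A rough sketch of the preg_replace_callback variant, under the assumption that one pattern matches either a whole existing link or a tag word. The $tag_url value is a placeholder. As noted, it only protects links; a tag word inside other markup (an image "src", say) would still be replaced.

```php
<?php
$text = "hello this is a test with a <a href='https://google.com'>test link</a>";
$tag_url = '/tag/test'; // hypothetical link target

// First alternative swallows an entire existing link;
// second alternative captures a tag word in group 1.
$pattern = '#<a\b[^>]*>.*?</a>|\b(test\pL*)#uis';

$result = preg_replace_callback($pattern, function (array $m) use ($tag_url) {
    // Group 1 is absent when an existing link matched: return it unchanged.
    if (!isset($m[1])) {
        return $m[0];
    }
    return "<a href='$tag_url'>{$m[1]}</a>";
}, $text);

echo $result, "\n";
```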

9 hours ago, webtek said:

Is there a performance difference with this solution compared to using pure regex with lookarounds?

There will be a performance difference, not sure how large or in what direction, but that's not the first thing to worry about. The main problem is the accuracy of what you need to do.

Consider a tag like "href". If you use a lookahead* then that won't stop you from replacing it within an <a> tag. You end up having to create something more and more complex to eliminate edge cases until you end up with something incomprehensible.
Once you've done that, performance does become a secondary concern, if only because of the sheer complexity of the regex. How many <a>s are distributed within the text? Note that every single time the engine tries to replace a tag it has to look forward until it finds an <a> or </a>, and that work will cost time.

If you go with the split approach then you can eliminate the lookaheads entirely and allow the engine to optimize around a simple \b$tag\b or \b(tag1|tag2|...)\b expression. The downside is a bit more work ahead of time, plus the additional memory for duplicating the document into string pieces inside arrays (or, with preg_replace_callback, dealing with tons of function calls). Honestly, though, PHP is not the sort of language where you should be worrying about the minutiae of optimizations; focus instead on the larger wins: readability, maintainability, and general algorithmic complexity.
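For the \b(tag1|tag2|...)\b form, something along these lines might do. It is only a sketch: the tag list, the sample text, and the /tag/ URL scheme are invented, and it is meant to run on a text-only piece, not on raw HTML.

```php
<?php
$tags = ['test', 'testing', 'regex']; // hypothetical tag list

// Escape each tag for use inside a '#'-delimited pattern, then join with '|'.
$alternatives = implode('|', array_map(fn($t) => preg_quote($t, '#'), $tags));
$pattern = '#\b(' . $alternatives . ')\b#ui';

// A text-only piece, e.g. one item from the split array.
$textOnly = "I'm testing a regex with this test";

// One pass over the text handles every tag; the URL scheme is an assumption.
$result = preg_replace_callback(
    $pattern,
    fn($m) => "<a href='/tag/" . rawurlencode(strtolower($m[1])) . "'>{$m[1]}</a>",
    $textOnly
);

echo $result, "\n";
```

That way thousands of tags cost one scan per text piece instead of one scan per tag.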

* You start by looking for a </a> as an indication of failure, but that will always match if there's any link at all later in the document, so the lookahead has to ensure that if there is another "a" tag, it is an opening tag. Or, in more precise words: it is not true that the next opening or closing "a" tag is a closing tag.
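To make the footnote concrete, here is one way such a lookahead could be written. Treat it as an illustration of the idea and its limits rather than a recommendation; the sample strings and $tag_url are made up. It skips text inside an existing link, but it still cannot tell text from markup elsewhere, such as an image's "src".

```php
<?php
$tag_url = '/tag/test'; // hypothetical link target

// Negative lookahead: reject the match if the next "<a" or "</a" that follows
// is a closing </a>, i.e. if we are currently inside a link.
$pattern = '#\b(test\pL*)(?!(?:(?!</?a\b).)*</a>)#uis';

$ok  = "a test with a <a href='https://google.com'>test link</a>";
$bad = "an image: <img src='/img/test.png'>";

$okResult  = preg_replace($pattern, "<a href='$tag_url'>$1</a>", $ok);
$badResult = preg_replace($pattern, "<a href='$tag_url'>$1</a>", $bad);

echo $okResult, "\n";  // existing link left alone
echo $badResult, "\n"; // markup gets mangled: a link lands inside the src attribute
```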

webtek · July 21, 2021

I tried various different ways but it always ends up failing: it either misses some tags or it creates a link inside another link. I'm not even sure if I'm understanding it correctly, my code is a mess...

$tags = array("test","testing");
$text = "I'm testing with this test zz ...";
foreach ( $tags as $tag ) {
      $array_text_only = preg_split('#<\w+[^>]*>|</\w+>#', $text);
      foreach ( $array_text_only as $text_only ) {

        $text_only_replace = preg_replace(
          '#\b('.$tag.'\pL*)#ui',
          "<a href=''>$1</a>", 
          $text_only,
          1
        );

        $text = str_replace($text_only,$text_only_replace,$text);

      }

}

What am I doing wrong?

requinix · July 21, 2021

Basically, my solution is that you set up a state machine to deal with the markup. After you've split on tags and non-tags, you go through the array while keeping track of where you've been.
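Sketched out, the state machine might look like the following. It assumes the split keeps the tags (capturing group plus PREG_SPLIT_DELIM_CAPTURE) and tracks a single piece of state, whether we are inside an <a> element; the sample text and $tag_url are placeholders.

```php
<?php
$text = "a test with a <a href='https://google.com'>test link</a> and <b>testing</b>";
$tag_url = '/tag/test'; // hypothetical link target

// Even indexes are text, odd indexes are tags.
$parts = preg_split('#(<\w+[^>]*>|</\w+>)#', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

$insideLink = false; // the state: are we between <a ...> and </a>?
foreach ($parts as $i => $part) {
    if ($i % 2 === 1) {
        // Markup piece: never modified, only used to update the state.
        if (preg_match('#^<a\b#i', $part)) {
            $insideLink = true;
        } elseif (preg_match('#^</a>#i', $part)) {
            $insideLink = false;
        }
    } elseif (!$insideLink) {
        // Text piece outside any link: safe to replace.
        $parts[$i] = preg_replace('#\b(test\pL*)#ui', "<a href='$tag_url'>$1</a>", $part);
    }
}

$result = implode('', $parts);
echo $result, "\n";
```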

But forget it. I fell for a classic mistake again: processing an HTML string like text. HTML should be treated like the structural thing it is.

1. Load your HTML string into a DOMDocument. That will also help clean up invalid markup, which regular expressions and string processing cannot do.
2. Loop through all the nodes. If it's a link, skip it. If it's text, do the replacement (which involves creating a new node, not simply adding a string). If it's some other element, go through its nodes recursively.
3. When you're done, dump the document back out.
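The three steps above can be sketched roughly as follows. This is not the code behind the link below; the function names linkifyNode and linkifyText, the wrapper <div>, and the /tag/test URL are all invented for illustration, and it assumes the DOM extension is available. Unlike the original preg_replace call with a limit of 1, it links every occurrence.

```php
<?php
// Hypothetical helper: walk the children of $node, skipping existing links.
function linkifyNode(DOMNode $node, string $pattern, string $url): void
{
    // Snapshot the children first; the live childNodes list changes as we edit.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement && strtolower($child->tagName) === 'a') {
            continue; // existing link: leave it alone
        }
        if ($child instanceof DOMText) {
            linkifyText($child, $pattern, $url);
        } elseif ($child->hasChildNodes()) {
            linkifyNode($child, $pattern, $url); // some other element: recurse
        }
    }
}

// Hypothetical helper: replace one text node with text and <a> nodes.
function linkifyText(DOMText $textNode, string $pattern, string $url): void
{
    // Even pieces are plain text, odd pieces are the captured tag words.
    $pieces = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($pieces) < 2) {
        return; // nothing matched
    }
    $doc = $textNode->ownerDocument;
    $fragment = $doc->createDocumentFragment();
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            $a = $doc->createElement('a');
            $a->setAttribute('href', $url);
            $a->appendChild($doc->createTextNode($piece));
            $fragment->appendChild($a);
        } elseif ($piece !== '') {
            $fragment->appendChild($doc->createTextNode($piece));
        }
    }
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

$html = "hello this is a test with a <a href='https://google.com'>test link</a>";

// 1. Load the HTML, wrapped in a <div> so there is a single element to walk.
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>');
$root = $doc->getElementsByTagName('div')->item(0);

// 2. Walk the nodes and do the replacement.
linkifyNode($root, '#\b(test\pL*)#ui', '/tag/test');

// 3. Dump the wrapper's children back out.
$result = '';
foreach ($root->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

Non-ASCII input would also need the document encoding declared before loading, which this sketch leaves out.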

Doing that isn't exactly the easiest thing when you're not familiar with this kind of work, so https://3v4l.org/sehhK

webtek · July 24, 2021

Thank you for the code, very helpful! It works perfectly using DOM. Your help is very appreciated

Edited July 24, 2021 by webtek

requinix · July 24, 2021

Silly oversight on my part:

if ($node instanceof DOMElement && $node->hasAttribute("href")) {

should be

if ($node instanceof DOMElement && $node->tagName == "a" && $node->hasAttribute("href")) {
