webtek Posted July 16, 2021

I use this code to replace words with links:

    $text = preg_replace( '#\b(test\pL*)#ui', "<a href='$tag_url'>$1</a>", $text, 1 );

But if I have a $text like this, it will try to create a link inside an existing <a href> tag, resulting in broken HTML:

    $text = "hello this is a test with a <a href='https://google.com'>test link</a>";

How can I avoid this?

requinix Posted July 17, 2021

You're also going to have a problem any time you try to replace words like "href".

Use preg_split() to split the text into an array of alternating text and markup by looking for opening and closing HTML tags.

    preg_split('#<\w+[^>]*>|</\w+>#'

Perform your replacements on the text-only items of that resulting array. For efficiency's sake, rather than loop through the array yourself, it would be better to split the array into two and give the whole text array to preg_replace.

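A minimal sketch of that split, assuming the pattern is wrapped in a capture group and passed with PREG_SPLIT_DELIM_CAPTURE so the tags stay in the result. The function name splitAndReplace, the sample URL, and the removal of the original limit of 1 are assumptions made for illustration. Note that it does not yet know whether a text item sits inside an existing link.

```php
<?php
// Hypothetical helper: replace only in the text pieces, never in the tags.
function splitAndReplace(string $text, string $pattern, string $replacement): string
{
    // With the capture group, even indexes are text and odd indexes are tags.
    $parts = preg_split('#(<\w+[^>]*>|</\w+>)#', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $textItems = [];
    $markupItems = [];
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $textItems[$i] = $part;
        } else {
            $markupItems[$i] = $part;
        }
    }

    // preg_replace accepts an array subject and keeps its keys.
    $textItems = preg_replace($pattern, $replacement, $textItems);

    // Merge the two arrays back into document order.
    $merged = $textItems + $markupItems;
    ksort($merged);
    return implode('', $merged);
}

$out = splitAndReplace(
    "a test <b>bold test</b> and <img src='test.png'> done",
    '#\b(test\pL*)#ui',
    "<a href='/tag/test'>$1</a>"
);
echo $out, "\n";
```

The "test" inside the img src attribute is left alone because it lives in a markup item, which is the point of splitting on tags first.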
webtek Posted July 17, 2021

Nice, I got it to work using your solution, thanks!

Is there a performance difference with this solution compared to using pure regex with lookarounds? I have thousands of tags to replace with links in a text, and the more tags in the text, the bigger the array from preg_split will get. I'd have to loop through the array for each of those thousands of tags.

requinix Posted July 17, 2021

Just realized that my explanation needed a bit more: simply splitting on tags gets you text vs. markup, so the step after that would be scanning the array of alternating text/markup and skipping past the text sections that are between "<a" and "</a>" elements. A little more work.

The obvious alteration to that strategy, and I expect this is what you did, is to split not on any tag but on a "<a...</a>" string, then deal with an array of alternating non-links/links. That still leaves you with the problem of accidentally replacing tags within markup, such as the "src" of an image.

An alternative to preg_split is preg_replace_callback, where you trade operating on an array or three for performing your tag replacements immediately. It still has the same text vs. markup/non-link vs. link problem.

9 hours ago, webtek said: "Is there a performance difference with this solution compared to using pure regex with lookarounds?"

There will be a performance difference, not sure how large or in what direction, but that's not the first thing to worry about.

The main problem is the accuracy of what you need to do. Consider a tag like "href". If you use a lookahead* then that won't stop you from replacing it within an <a> tag. You end up having to create something more and more complex to eliminate edge cases until you end up with something incomprehensible.

Once you've done that, performance does become a secondary concern, if only because of the sheer complexity of the regex. How many <a>s are distributed within the text? Note that every single time the engine tries to replace a tag it has to look forward until it finds an <a> or </a>, and that work will cost time.

If you go with the split approach then you can eliminate the lookaheads entirely and allow the engine to optimize around a simple \b$tag\b or \b(tag1|tag2|...)\b expression.
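A rough sketch of the split-on-links variant with one alternation for all tags. The function name linkTagsOutsideLinks, the lowercase tag-to-URL map, and the sample URLs are assumptions for illustration, and every occurrence is linked rather than only the first. As noted above, a tag word inside other markup (an img src, say) would still be replaced.

```php
<?php
// Hypothetical helper: $tagUrls maps lowercase ASCII tag words to their URLs.
function linkTagsOutsideLinks(string $html, array $tagUrls): string
{
    // Split on whole links; the capture group keeps them in the result,
    // so even indexes are non-link pieces and odd indexes are links.
    $parts = preg_split('#(<a\b[^>]*>.*?</a>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Build \b(tag1|tag2|...)\b once for all tags.
    $quoted = array_map(fn($t) => preg_quote($t, '#'), array_keys($tagUrls));
    $pattern = '#\b(' . implode('|', $quoted) . ')\b#ui';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // an existing link: leave it untouched
        }
        $parts[$i] = preg_replace_callback(
            $pattern,
            function (array $m) use ($tagUrls) {
                $url = $tagUrls[strtolower($m[1])];
                return "<a href='" . htmlspecialchars($url, ENT_QUOTES) . "'>" . $m[1] . "</a>";
            },
            $part
        );
    }
    return implode('', $parts);
}

$linked = linkTagsOutsideLinks(
    "I'm testing with this test and a <a href='https://google.com'>test link</a>",
    ['test' => '/tag/test', 'testing' => '/tag/testing']
);
echo $linked, "\n";
```

Using a callback rather than a replacement string keeps characters such as "$" in a URL from being read as backreferences.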

The downside is a bit more work ahead of time, plus the additional memory around duplicating the document into string pieces inside arrays (or with preg_replace_callback, dealing with tons of function calls).

Honestly, though, PHP is not the sort of language where you should be worrying about the minutiae of optimizations. Focus instead on the larger wins: readability, maintainability, and general algorithmic complexity.

* You start by looking for a </a> as an indication of failure, but that will always match if there's any link at all later in the document, so the lookahead has to ensure that if there is another "a" tag, it is an opening tag - or in more precise words, it is not true that the next opening or closing "a" tag is a closing tag.

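For comparison, one way the lookahead from the footnote might be written. This is a sketch under assumptions, not a recommended solution: the function name linkFirstOutsideLinks and the sample URL are made up, and the pattern has exactly the weakness described above, since a tag word inside other markup (for example an img src) still gets replaced.

```php
<?php
// Hypothetical helper: link the first occurrence of $tag that is not inside <a>...</a>.
function linkFirstOutsideLinks(string $text, string $tag, string $tagUrl): string
{
    // Negative lookahead: scanning forward, step over text and over any tag
    // that is not an opening or closing "a" tag; fail if the next "a" tag
    // found that way is a closing </a>.
    $notInsideLink = '(?![^<]*(?:<(?!/?a\b)[^<]*)*</a>)';
    $pattern = '#\b(' . preg_quote($tag, '#') . '\pL*)' . $notInsideLink . '#ui';

    return preg_replace_callback(
        $pattern,
        fn(array $m) => "<a href='" . htmlspecialchars($tagUrl, ENT_QUOTES) . "'>" . $m[1] . "</a>",
        $text,
        1 // only the first eligible occurrence, like the original preg_replace call
    );
}

$first = linkFirstOutsideLinks(
    "hello this is a test with a <a href='https://google.com'>test link</a>",
    'test',
    'https://example.com/tag'
);
$second = linkFirstOutsideLinks(
    "see <a href='x'>test one</a> then testing",
    'test',
    'https://example.com/tag'
);
echo $first, "\n", $second, "\n";
```

Every candidate match triggers a forward scan to the next "a" tag, which is the cost mentioned in the post.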
webtek Posted July 21, 2021

I tried various ways but it always ends up failing: it either misses some tags or it creates a link inside another link. I'm not even sure I'm understanding it correctly, and my code is a mess...

    $tags = array("test","testing");
    $text = "I'm testing with this test zz ...";
    foreach ( $tags as $tag ) {
        $array_text_only = preg_split('#<\w+[^>]*>|</\w+>#', $text);
        foreach ( $array_text_only as $text_only ) {
            $text_only_replace = preg_replace( '#\b('.$tag.'\pL*)#ui', "<a href=''>$1</a>", $text_only, 1 );
            $text = str_replace($text_only,$text_only_replace,$text);
        }
    }

What am I doing wrong?

requinix Posted July 21, 2021 (Solution)

Basically, my solution is that you set up a state machine to deal with the markup. After you've split on tags and non-tags, you go through the array while keeping track of where you've been.

But forget it. I fell for a classic mistake again: processing an HTML string like text. HTML should be treated like the structural thing it is.

1. Load your HTML string into a DOMDocument. That will also help clean up invalid markup, which regular expressions and string processing cannot do.
2. Loop through all the nodes. If it's a link, skip it. If it's text, do the replacement (which involves creating a new node, not simply adding a string). If it's some other element, go through its nodes recursively.
3. When you're done, dump the document back out.

Doing that isn't exactly the easiest thing when you're not familiar with this kind of work, so: https://3v4l.org/sehhK

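The three steps could look roughly like the sketch below. It is not the code behind the 3v4l link: the function names linkWordsInNode and linkWordsInHtml, the wrapping div, and the sample URL are assumptions, it links every occurrence rather than only the first, and it assumes ASCII-safe input because loadHTML needs extra care with encodings.

```php
<?php
// Hypothetical helper: walk $node, skipping links and linking matches in text nodes.
function linkWordsInNode(DOMDocument $doc, DOMNode $node, string $pattern, string $url): void
{
    // Copy the child list first, because the loop replaces nodes as it goes.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMElement && strtolower($child->tagName) === 'a') {
            continue; // an existing link: skip it and everything inside it
        }
        if ($child instanceof DOMElement) {
            linkWordsInNode($doc, $child, $pattern, $url); // some other element: recurse
            continue;
        }
        if (!$child instanceof DOMText) {
            continue; // comments and similar nodes
        }
        // Alternating pieces: even indexes are plain text, odd indexes are matches.
        $pieces = preg_split($pattern, $child->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($pieces === false || count($pieces) < 2) {
            continue; // nothing to link in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($piece === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($piece));
                $fragment->appendChild($a);
            } else {
                $fragment->appendChild($doc->createTextNode($piece));
            }
        }
        $node->replaceChild($fragment, $child); // swap the text node for the new nodes
    }
}

// Hypothetical entry point: load, rewrite, and dump the markup back out.
function linkWordsInHtml(string $html, string $tag, string $url): string
{
    // The single capture group covers the whole match, so the split loses nothing.
    $pattern = '#\b(' . preg_quote($tag, '#') . '\pL*)#ui';

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    // The wrapping <div> is an assumption so the fragment has a single root.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    linkWordsInNode($doc, $root, $pattern, $url);

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child); // serialize without the wrapper
    }
    return $out;
}

$result = linkWordsInHtml(
    "hello this is a test with a <a href='https://google.com'>test link</a>",
    'test',
    '/tag/test'
);
echo $result, "\n";
```

Because the replacement only ever sees text nodes, attribute values such as an href or src are never candidates, which removes the edge cases the regex approaches have to fight.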
webtek Posted July 24, 2021

Thank you for the code, very helpful! It works perfectly using DOM. Your help is very much appreciated.

requinix Posted July 24, 2021

Silly oversight on my part:

    if ($node instanceof DOMElement && $node->hasAttribute("href")) {

should be

    if ($node instanceof DOMElement && $node->tagName == "a" && $node->hasAttribute("href")) {