[SOLVED] Breaking up long words without interfering with URLs


Hey

 

I've been modifying an existing BBCode script for a forum I'm building, but I found that if someone posts a really long word, perhaps just as spam, it breaks the layout of the page.

 

I'm using these lines to stop that from happening:

 

$character_limit = 50;
$BBCode_Text = preg_replace('/([^\s]{'.$character_limit.'})(?=[^\s])/m', '$1 ', $BBCode_Text);

 

This works great, but my problem is that if someone puts in a URL, e.g.

 

http://www.a-very-long-website-name-maybe-a-huuuuuge-link-to-an-obscure-file.com/a-folder/some-other-folder-with-a-name-so-long-it-should-be-made-illegal/file.html

 

It gets broken up and doesn't work when it's clicked. I have the same problem for images, because their source is getting split apart. Is there any way of preventing this from happening, maybe by looking for a "www." or an ending such as ".com" or ".png" - I'm pretty stuck on this one :(
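
For anyone reading along, the problem described above can be reproduced with a minimal sketch like the one below. The URL is a made-up stand-in (an assumption, not the one from the post); the pattern is the one quoted earlier in this post.

```php
<?php
// Sketch only: shows the original pattern inserting spaces into a long URL.
$character_limit = 50;

// Hypothetical long URL, used purely for illustration.
$url = 'http://www.example.com/' . str_repeat('a-long-folder-name/', 4) . 'file.html';

// Original pattern: after every 50 non-whitespace characters that are
// followed by another non-whitespace character, insert a space.
$broken = preg_replace(
    '/([^\s]{' . $character_limit . '})(?=[^\s])/m',
    '$1 ',
    $url
);

echo $broken, "\n";                 // the URL now contains spaces
var_dump(strpos($broken, ' '));     // position of the first inserted space
```

The pattern has no notion of what a URL is, so any run of non-whitespace characters longer than the limit gets split, links included.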

 

Thank you for your time and help

~Chris

Hi,

I've added two patterns to your regex: one to match links, the other to match images. They are generic patterns; you could replace the first with something more specific, or add other image extensions to the second. Try it:

 

$BBCode_Text=preg_replace('/(https?:\/\/(www)?)\S+|\S+\.(jpe?g|gif|png)|\S{'.$character_limit.'}(?=\S)/im', "$0 ", $BBCode_Text);

Thanks v much for the response

 

Unfortunately it's not made any difference. I've placed the code at the end of all the BBCode stuff so that it doesn't interfere with for example changing tags into HTML links.

 

This is what I've put in:

 

 

And this is what has been displayed:

<a href="http://www.hastheworldbeendestroyedbythelarg ehadroncollideryet.com">http://www.hastheworldbeen destroyedbythelargehadroncollideryet.com</a>

 

It hasn't really made any difference.

Would it be better / easier to write a regex that removes the spaces from the parts between [url] or [img] tags?

I'm brushing up on my regex atm, this is quite a tricky problem to sort out because it interferes with all the tags if there are enough characters before it, causing them to be missed out of searches

You can use this to only match 50 non-whitespace characters if they are found outside BB code tags:

 

$character_limit = 50;
$BBCode_Text = preg_replace('~\S{' . $character_limit. '}(?![^\[]*?\])~', '$0 ', $BBCode_Text);

I'm utilizing a negative lookahead.

 

The pattern searches for the 50 characters, and when found, checks the following characters. If any other character than [ is found 0 or more times, immediately followed by a ], the whole thing fails to match (= the keyword is inside a BB code tag, i.e. between [ and ]).

 

Hope that solves your problem.
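
A rough sketch of how the pattern above behaves on a few inputs, in case it helps. The sample strings and the `$longUrl` variable are assumptions made up for this illustration; only the pattern comes from the post.

```php
<?php
// Sketch only: exercising the negative-lookahead pattern on sample input.
$character_limit = 50;
$pattern = '~\S{' . $character_limit . '}(?![^\[]*?\])~';

// Hypothetical 80-character URL used for illustration.
$longUrl = 'http://example.com/' . str_repeat('a', 61);

// 1) A plain long word is broken after 50 characters.
$a = preg_replace($pattern, '$0 ', str_repeat('x', 60));

// 2) A long URL *inside* the brackets of a tag is left alone, because every
//    50-character window there is followed by "...]" with no "[" in between.
//    (A space can still land right after the closing "]".)
$b = preg_replace($pattern, '$0 ', '[url=' . $longUrl . ']link[/url]');

// 3) A long URL *between* two tags is not protected: the next bracket
//    after the window is "[", not "]", so the lookahead lets it match.
$c = preg_replace($pattern, '$0 ', '[url]' . $longUrl . '[/url]');

var_dump($a === str_repeat('x', 50) . ' ' . str_repeat('x', 10));
var_dump(strpos($b, '[url=' . $longUrl . ']') !== false); // tag intact
var_dump(strpos($c, $longUrl) === false);                 // URL was split
```

Case 3 is worth keeping in mind: the lookahead only protects text that sits between [ and ], not text that sits between an opening and a closing tag.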

thebadbad, that script works brilliantly to prevent the tags from breaking up - thanks very much :D

 

The only problem I have, which I have been playing around with your script to fix, is this:

img

or

 

Just specifically for those tags, it would be great if the regex could look for either the [img] or [url] tags and not add any spaces to the things inside those tags.

Is that possible?

 

This is what I tried, but it failed miserably I'm afraid:

$BBCode_Text = preg_replace('~\S{' . $character_limit. '}(?![^\[]*?\])|~\S{' . $character_limit. '}(?![^\[](url|img)\].*?\[\/(url|img)\])~', '$0 ', $BBCode_Text);

 

The error was "an unknown modifier: / ", am I at least on the right line? :S
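
A note on the "unknown modifier" error, offered as a likely explanation rather than a confirmed diagnosis: the attempt above uses ~ as the pattern delimiter and then contains another ~ after the |. PHP treats the second ~ as the end of the pattern and tries to read whatever follows as modifiers. The sketch below uses tiny made-up patterns to show the effect.

```php
<?php
// Sketch only: an unescaped delimiter inside a pattern ends the pattern early.

// Here the second "~" closes the pattern "a|", and "b~" is then read as
// modifiers, which PHP rejects. preg_replace() returns null on such errors
// (the warning is suppressed with @ for the demo).
$bad = @preg_replace('~a|~b~', 'x', 'ab');
var_dump($bad); // NULL

// Alternatives belong inside one pair of delimiters.
$good = preg_replace('~a|b~', 'x', 'ab');
var_dump($good); // string(2) "xx"
```

So when combining two patterns with |, only the outermost pair of delimiters should remain.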

Oh, I see. That's a problem.

 

It would actually be easiest to do the replacing after you've translated the BB code to (X)HTML. 'Cause then you could just use

 

$character_limit = 50;
$html = preg_replace('~\S{' . $character_limit. '}(?![^<]*?>)~', '$0 ', $html);

to make sure that only long strings found outside tags are broken up.

 

If that's possible in your current setup.
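
A small sketch of the HTML variant above, assuming the BB code has already been translated into an anchor tag. The `$longUrl` and `$html` values are invented for this example.

```php
<?php
// Sketch only: break long strings outside HTML tags, leave attributes alone.
$character_limit = 50;

// Hypothetical 80-character URL and already-converted HTML.
$longUrl = 'http://example.com/' . str_repeat('a', 61);
$html = '<a href="' . $longUrl . '">' . $longUrl . '</a> ' . str_repeat('x', 60);

$result = preg_replace(
    '~\S{' . $character_limit . '}(?![^<]*?>)~',
    '$0 ',
    $html
);

// The href attribute is untouched, so the link still works.
var_dump(strpos($result, 'href="' . $longUrl . '"') !== false);

// The visible text (link text and the long word) no longer contains
// any run of more than 50 non-whitespace characters.
var_dump(preg_match('~\S{51,}~', strip_tags($result)) === 0);
```

One side effect to be aware of: when a 50-character window happens to end exactly on a >, the space is inserted right after that tag, which is usually harmless in rendered HTML.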

Wow, thank you so much - it works perfectly!

 

I think I understand how it works - It looks for 50 ($character_limit) non-whitespace characters and then looks back for "<tagname with properties etc>", that's actually genius, and so simple :)

 

Cheers, that's helped such a lot. Good luck in future!

~Chris

You're welcome, and thanks ;)

 

But you haven't got it entirely; the tricky part (?![^<]*?>) is a negative lookahead, meaning that if the expression inside it matches at that point, the overall pattern fails to match. It looks ahead at what follows the 50 characters, not back. And I already explained what the expression [^<]*?> (the inside of the lookahead) does:

 

... If any other character than < is found 0 or more times, immediately followed by a >, the whole thing fails to match (= the keyword is inside a HTML tag, i.e. between < and >) ...
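
To make the lookahead behaviour described above concrete, here is a deliberately tiny sketch with made-up input; it is not part of the forum code.

```php
<?php
// Sketch only: a negative lookahead inspects what comes AFTER the match
// and consumes nothing.

// Replace an "a" only when it is NOT immediately followed by "b".
$out = preg_replace('~a(?!b)~', 'X', 'ab ac');

// The first "a" is followed by "b", so the lookahead rejects it;
// the second "a" is followed by "c", so it is replaced.
var_dump($out); // string(5) "ab Xc"
```

The HTML pattern works the same way: the 50 characters are only matched if what follows them does not look like the rest of a tag.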