jodunno

limit string size problem with optional filter


Hello forum members,

 

I am battling a regular expression. I got the expression to work, but I can't figure out how to limit the string size, no matter where I place the curly braces {2,48}.

I want to use a regex to check that my variables match a formula. The formula is any word of a-zA-Z letters, optionally followed by a soft wordbreak in HTML code (<wbr>&shy;) and then more a-zA-Z letters.

So Mystringname should pass and Mystring<wbr>&shy;name should pass, but Mystring<script>alert(f**k off)</script> should fail.

I got it working, but I can't figure out how to limit the string length to 48 characters. Is anyone able to help me?

Here is my working regex (maybe someone can think of a better one?):

/^(?:^)([a-zA-Z]+)(\<wbr\>\&shy\;)?([a-zA-Z]+)?(?:$)$/

EDIT: I use PHP, so PCRE regex rules apply.

Thank you.

Edited by jodunno
forgot to mention PCRE PHP


I found a solution. Thanks anyway.

^(?:^)([a-zA-Z+]){2,26}(\<wbr\>\&shy\;)?([a-zA-Z+]){2,12}?(?:$)$

It works, I guess. The part after the soft hyphen shouldn't be more than 8-10 letters anyway. I'll recalculate, but it works.

Now I wonder how to allow a single space between the first word and the optional last word?


You don't have to do everything with a regular expression. If you want to limit the length of a string, there are easier ways.
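A minimal sketch of what such an "easier way" might look like, assuming the length limit is handled separately from the character rules. The helper name `hasAllowedLength` is hypothetical, and the 2 and 48 defaults are taken from the `{2,48}` attempt in the first post.

```php
<?php
// Hypothetical helper (not from the thread): check only the overall length
// and leave the character rules to a separate, simpler pattern.
function hasAllowedLength(string $name, int $min = 2, int $max = 48): bool
{
    // strlen() counts bytes, which equals the character count for a-zA-Z input.
    $length = strlen($name);
    return $length >= $min && $length <= $max;
}

var_dump(hasAllowedLength('Mystringname'));       // bool(true)
var_dump(hasAllowedLength('M'));                  // bool(false)
var_dump(hasAllowedLength(str_repeat('a', 49)));  // bool(false)
```

If accented letters are allowed later, `mb_strlen()` (from the mbstring extension) would probably be the better fit, since `strlen()` counts bytes rather than characters.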

What you have there is... well, I take it you've spent a lot of time throwing syntax at it until you got something that worked?

Do me a favor and state the full and exact requirements for your input string and we'll see if we can't come up with something that does it all in a cleaner way.


Hi requinix,

I have spent all day working on this problem. I could be doing a lot of other things. I'm getting sick of this problem now. So, yes, I tried playing with regex until it did what I wanted it to do. I imagine a smart regex hacker, like you, can see a very simple solution. I cannot see it. I have tried all day.

I hope that I can explain the problem clearly enough: I am using data as a hyperlink beneath an icon. The data is not user supplied, but I use wordbreaks with soft hyphens to break long words where I want them to break. I like to check my strings to verify that they are a-zA-Z only, excluding the wordbreak. Thus my problems to solve are as follows:

The entire string must be no shorter than 3 characters and no longer than 48 characters, including the wordbreak code and a single space. (I've counted, and the longest word so far is 36 characters.)

The string should be a-zA-Z only, with one space allowed between two words. So either 1 word, or 2 words with a single space between them.

The string could also be one word with a wordbreak and soft hyphen. Thus <wbr>&shy; (less than, wbr, greater than, ampersand, shy, semicolon) is allowed.
 

RequinixIsAregexPro should pass, as it is a-zA-Z only and shorter than 48 characters.

American Robin needs to pass, as it is a-zA-Z with a single space between the two words.

The German word Ruechenseitentiere is too long for me, so it should be allowed to pass as Ruckenseiten<wbr>&shy;tiere.

Nothing else should pass. Especially not PHP, HTML, or JavaScript code.

Is it possible to simplify my hard-earned amateur expression?

Thank you for taking the time to read my post and reply. In the meantime, I will keep trying to understand this problem and brush up on my regex knowledge. I am tired now, so I think that I will put this one away for today.
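For reference, the requirements listed in this post could be sketched as one pattern with a length lookahead. This is only one reading of the rules (3 to 48 characters in total, a-zA-Z letters, and at most one separator that is either a single space or `<wbr>&shy;`), and the function name `isValidIconName` is hypothetical.

```php
<?php
// Hypothetical function name; the rules are one reading of the post above.
function isValidIconName(string $name): bool
{
    // \A and \z anchor to the true start and end of the string.
    // (?=.{3,48}\z) limits the total length, separator included.
    // After the first word, allow one space OR one <wbr>&shy; plus a second word.
    $pattern = '/\A(?=.{3,48}\z)[a-zA-Z]+(?:(?: |<wbr>&shy;)[a-zA-Z]+)?\z/';
    return preg_match($pattern, $name) === 1;
}

var_dump(isValidIconName('RequinixIsAregexPro'));             // bool(true)
var_dump(isValidIconName('American Robin'));                  // bool(true)
var_dump(isValidIconName('Ruckenseiten<wbr>&shy;tiere'));     // bool(true)
var_dump(isValidIconName('Name<script>alert(1)</script>Me')); // bool(false)
```

The `\z` anchor is used instead of `$` because `$` also matches just before a trailing newline, which a strict filter probably should not accept.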


I forgot to mention that the icon names are not user supplied now, but I'd like them to be editable in the future, so I'd like to work this check into my PHP code already. I actually spent 10 hours on this today. I couldn't even get it working until just before I posted here. I really wanted to allow UTF-8 characters so that German umlauts and French accents can be included, but this is too much for me. So a-zA-Z is good, with either an optional space before an optional second word, or one word with a wordbreak and soft hyphen only.

Quote

I use wordbreaks with soft hyphens to break long words where I want them to break.

Then I'm not sure why you have this complicated scheme of checking the length of the whole string, and looking strictly at letters. Simple hyphenation is fairly simple: insert a hyphen into a word after about X characters but not within Y characters of its end.

Sounds like you're going for X=10 and Y=2, so RequinixIsAregexPro hyphenates as RequinixIs-AregexPro, Ruechenseitentiere as Ruechensei-tentiere, and Ruckenseitentiere as Ruckenseit-entiere. You can adjust your X and Y to make these examples look better, but you can't set up hyphenation rules for every single word. Especially not with German.

/\pL{10}(?=\pL{2})/u

Find 10 letters, require that there are at least two more after it, and insert a soft hyphen.
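As a rough illustration, the pattern could be applied with `preg_replace()` like this. The wrapper name `softHyphenate` is hypothetical, and inserting `<wbr>&shy;` (rather than a bare soft hyphen) is an assumption taken from the original poster's markup.

```php
<?php
// Hypothetical wrapper around the pattern from the post above.
function softHyphenate(string $text): string
{
    // $0 is the whole match (the 10 letters). The lookahead consumes nothing,
    // so the letters that follow stay in place after the inserted marker.
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace('/\pL{10}(?=\pL{2})/u', '$0<wbr>&shy;', $text) ?? $text;
}

echo softHyphenate('RequinixIsAregexPro'), "\n"; // RequinixIs<wbr>&shy;AregexPro
echo softHyphenate('Ruechenseitentiere'), "\n";  // Ruechensei<wbr>&shy;tentiere
echo softHyphenate('Robin'), "\n";               // Robin
```

Because the replacement runs over the whole string, longer words get a marker after every 10 letters, as long as at least two letters still follow.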

If you have to deal with HTML then sure it's a lot more complicated. You need to ignore <script>s and their contents, and ignore other tags but not their contents.
Basically the only way you can do that is look letter by letter. Count out 10 letters, skipping over HTML as you go.

/(\pL(?><(script).*?<\/\2>|<.*?>)*){10}(?=\pL{2})/su
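A small sketch of how this HTML-aware pattern might be used, again with a hypothetical wrapper name and the `<wbr>&shy;` marker as an assumption. It has only been traced on inputs like the ones below, so treat it as an illustration rather than a complete HTML handler.

```php
<?php
// Hypothetical wrapper around the HTML-aware pattern from the post above.
function softHyphenateHtml(string $html): string
{
    // Each of the 10 repetitions is one letter plus any tags that follow it.
    // A <script>...</script> pair is skipped as a unit, contents included.
    $pattern = '/(\pL(?><(script).*?<\/\2>|<.*?>)*){10}(?=\pL{2})/su';
    // Fall back to the input if preg_replace() reports an error with null.
    return preg_replace($pattern, '$0<wbr>&shy;', $html) ?? $html;
}

echo softHyphenateHtml('Requ<b>inixIs</b>AregexPro'), "\n";
// Requ<b>inixIs</b><wbr>&shy;AregexPro

echo softHyphenateHtml('Requ<script>var abcdefghijklmnop;</script>inixIsAregexPro'), "\n";
// Requ<script>var abcdefghijklmnop;</script>inixIs<wbr>&shy;AregexPro
```

One caveat worth testing before relying on it: when the text begins with a tag, the scan appears able to start inside that tag's contents, because nothing forces a match to begin outside of markup.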

 

 


Hi requinix, and thank you for taking the time to reply,

I understand your reply, but I wonder why you think that my expression is invalid? It is working on my XAMPP with PHP, so I assume that the expression is valid according to regex rules. I say this because it is actually doing what I want it to do, although I need a second check to see if the name falls into the non-HTML-coded case:

 

// Uncomment one $checkname at a time to test it.
$checkname = 'Name Me';
//$checkname = 'NameMe';
//$checkname = 'Rückenseiten<wbr>&shy;tiere';
//$checkname = 'Name<script>alert(\'Fuckoff\')</script>Me';
// First check: letters with an optional <wbr>&shy; between them.
if (!preg_match("/^(?:^)(\p{L}){2,24}(\<wbr\>\&shy\;)?(\p{L}){2,24}?(?:$)$/", $checkname)) {
    // Second check: letters with an optional whitespace character between them.
    if (!preg_match("/^(?:^)(\p{L}){2,24}(\s)?(\p{L}){2,24}?(?:$)$/", $checkname)) {
        echo 'checkname === 0';
    } else {
        echo 'checkname === 1';
    }
} else {
    echo 'checkname === 1';
}

Try my code and uncomment each variable to test it. I seem to have accomplished all of my goals without ignoring code. Is this really invalid?

9 minutes ago, jodunno said:

Is this really invalid?

No, it is not invalid. I didn't say it was invalid. I suggested that maybe there was a better way to do what you were trying to do.

Because

^(?:^)([a-zA-Z+]){2,26}(\<wbr\>\&shy\;)?([a-zA-Z+]){2,12}?(?:$)$

1. You have unnecessary (?:^) and (?:$) assertions
2. You put a + in the character set, where it will mean a literal plus and not repetition
3. The parentheses for grouping around the letters do nothing
4. < > & ; are not special characters and do not need to be escaped
5. The anchors and the fact that this looks for the hyphen in a string suggests you're running this regex multiple times to add multiple hyphens?

Some of those are minor flaws and some of those impact how the regex works, but mostly it just didn't feel right to me and I'm not sure that it will properly handle all the various inputs you haven't tested yet.
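To make points 1 to 4 concrete, here is a hedged side-by-side of the posted pattern and one possible cleaned-up equivalent. The cleaned pattern is only a sketch, and the sample strings are made up for illustration. One detail to note: `{2,12}?` is a lazy quantifier, not an optional one, so at least two letters are still required at the end and the shortest accepted input is four characters.

```php
<?php
// Pattern as posted, wrapped in delimiters.
$original = '/^(?:^)([a-zA-Z+]){2,26}(\<wbr\>\&shy\;)?([a-zA-Z+]){2,12}?(?:$)$/';
// Sketch of a cleaned-up version: no redundant assertions, no + in the
// character class, no needless escapes, and a non-capturing group for the marker.
$cleaned = '/^[a-zA-Z]{2,26}(?:<wbr>&shy;)?[a-zA-Z]{2,12}$/';

$samples = [
    'Mystringname',                      // letters only
    'Mystring<wbr>&shy;name',            // letters with the marker
    'Mystring<script>alert(1)</script>', // markup that must be rejected
    'My+name',                           // literal plus sign
    'Abc',                               // only three letters
];

foreach ($samples as $sample) {
    printf(
        "%-36s original=%d cleaned=%d\n",
        $sample,
        preg_match($original, $sample),
        preg_match($cleaned, $sample)
    );
}
```

The two patterns agree on every sample except `My+name`, which the original accepts because of the literal `+` in its character class.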


Yes, I am an idiot about this subject. I don't really understand it yet. Thus, I did not know that lt and gt do not need to be escaped. I also did not know that the assertions are useless. Why, may I ask? Never mind. I can read about that in my spare time. It's not your business to educate me. My apologies for asking.

Anyway, I have a lot of words that are broken where I want them to be broken, so not really 10 and 2. Plus, user-supplied names could be broken wherever a user wishes. I just want to be sure that only lt wbr gt amp shy semicolon is allowed. Nothing else is allowed between words, except a space whenever a wordbreak is not included.

I will try to rewrite my code and to understand where I go wrong.


Oh, lord. I just realized that I didn't edit the script part of my code. I apologize! I cannot believe that I left it in the code. I hope that no one here is offended. I didn't mean to leave the 'f' word in my code example. I will edit it now.

Edit: the edit button disappeared. Will a moderator please mask the 'f' word in my code? I sincerely apologize for this error.

Edited by jodunno
explanation for my f word code


Hi requinix,

I've cleaned the code now to exclude my mistakes. Sorry. I really don't know a lot about regex. I know one thing: you are very kind and helpful. Thank you! I really appreciate the interaction with you. I am mentally tired now and I really needed a coder to look at my expressions. I'm shutting XAMPP down now. I'm going to bed soon.

$checkname = 'NameMe';
//$checkname = 'Rückenseiten<wbr>&shy;tiere';
//$checkname = 'Name Me';
//$checkname = 'Name<script>alert(\'F**koff\')</script>Me';
if (!preg_match("/^\p{L}{2,32}<wbr>&shy;?\p{L}{2,16}?$/", $checkname)) {
  if (!preg_match("/^\p{L}{2,24}\s?\p{L}{2,24}?$/", $checkname)) {
       echo 'checkname === 0';
    } else {
       echo 'checkname === 1';
  }
} else {
   echo 'checkname === 1';
}

This time I have masked the f-word in the script test.
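One thing that may be worth checking in the first pattern above: without a group, the `?` after `&shy;` applies only to the semicolon, so the `<wbr>&shy` part is mandatory. A hedged sketch of the difference follows. The grouped pattern and the `/u` modifier (so that `\p{L}` works on UTF-8 letters such as ü) are suggestions, not something from the thread, and the test strings are made up.

```php
<?php
// Pattern as posted: only the ";" is optional, "<wbr>&shy" is required.
$posted = '/^\p{L}{2,32}<wbr>&shy;?\p{L}{2,16}?$/';
// Sketch: group the whole marker so it is optional as a unit, and add /u
// so \p{L} is applied to UTF-8 characters rather than single bytes.
$grouped = '/^\p{L}{2,32}(?:<wbr>&shy;)?\p{L}{2,16}$/u';

// "\u{FC}" is ü, written as an escape so the file encoding does not matter.
$umlaut = "R\u{FC}ckenseiten<wbr>&shy;tiere";

var_dump(preg_match($posted, 'NameMe'));           // int(0)
var_dump(preg_match($posted, 'Name<wbr>&shyMe'));  // int(1)
var_dump(preg_match($grouped, 'NameMe'));          // int(1)
var_dump(preg_match($grouped, 'Name<wbr>&shyMe')); // int(0)
var_dump(preg_match($grouped, $umlaut));           // int(1)
```

In the posted code, `NameMe` still ends up accepted overall, but only because the second `preg_match()` catches it.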


So today I have tried multiple combinations and they have failed, as you had predicted. I really don't want to ignore HTML code and look only for script. I really just wanted to verify letters plus an optional wordbreak with soft hyphen. Rather than spend weeks trying to learn the correct regex statement, I have a better idea: I will check each name for the wbr shy code, extract the code and store the remaining text in a temp variable, then apply htmlentities to the string, then reapply the wbr shy. It is a bit more work, but better than rattling my brain for weeks trying to get the regex correct.

 

Thank you for your time and patience.


So here is my simplified non-regex code:

$checkname = 'Ruecken<script>alert(\'F**k off!\');</script>seiten<wbr>&shy;tiere';
$wbrshy = '<wbr>&shy;';
$nameFilter = strpos($checkname, $wbrshy);
$temp = null;
if ($nameFilter !== false) {
  $temp = str_replace('<wbr>&shy;', '?', $checkname);
  $temp = htmlentities(htmlentities($temp, ENT_QUOTES), ENT_QUOTES);
  echo $temp . '<br>';
  $checkname = str_replace('?', '<wbr>&shy;', $temp);
  echo $checkname;
} else {
  echo htmlentities(htmlentities($checkname, ENT_QUOTES), ENT_QUOTES);
}

So now I can check that the string length is less than 50 (after replacing the wbr shy with the question mark placeholder, which makes 48 + 1 for the question mark = 49).

Before the string length check, I can use a simple regex with \p{L} to enforce letters only (while allowing for my question mark placeholder).

However, I don't need to apply htmlentities, because I have an error page set up. If a name does not conform, then none of the icons are displayed. You will see the error page in place of the content. Hackers are much smarter than me, so I'm not playing around with non-conforming names. Who knows what is in them? I treat it as an error and move on.

I think this solution is much easier, faster and better than a regex solution.
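A minimal sketch of the "split first, then check" plan described above, under a few assumptions: the function name `isAcceptableName` is hypothetical, the 3 to 48 limits are taken from the earlier requirements post, and a-zA-Z is used in place of `\p{L}` to keep the byte-based length check simple. It splits on the marker instead of substituting a `?` placeholder, which avoids a name that already contains a literal `?` being turned into a marker.

```php
<?php
// Hypothetical function; limits and marker are taken from earlier posts.
function isAcceptableName(string $name): bool
{
    $marker = '<wbr>&shy;';

    // Length check on the raw string, marker or space included.
    $length = strlen($name);
    if ($length < 3 || $length > 48) {
        return false;
    }

    // Split on the marker if it is present, otherwise on a single space.
    $separator = (strpos($name, $marker) !== false) ? $marker : ' ';
    $parts = explode($separator, $name);
    if (count($parts) > 2) {
        return false; // more than one separator
    }

    // Every part must be non-empty and consist of letters only.
    foreach ($parts as $part) {
        if (preg_match('/\A[a-zA-Z]+\z/', $part) !== 1) {
            return false;
        }
    }
    return true;
}

var_dump(isAcceptableName('Ruckenseiten<wbr>&shy;tiere')); // bool(true)
var_dump(isAcceptableName('American Robin'));              // bool(true)
var_dump(isAcceptableName('Ab<b>c<wbr>&shy;de'));          // bool(false)
```

Anything that fails the check can then be routed to the error page, so no escaping of rejected names is needed.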

Once again, thank you for your time, patience, and understanding. Thread closed.
