Jump to content

regex to find newlines


LuAn

Recommended Posts

Given that one of my contact forms has recently been used for spamming by injecting extra header information into the email I'm looking to improve security.

 

As I understand it, the to, from, and subject can be 'easily' checked by looking for newline characters (because there shouldn't be any).  My problem is: what should I be looking for?

 

I found this out on the web:

 

if (preg_match("/(%0A|%0D|\n+|\r+)/i", $_POST['subject'])) {
// reject this data
}

 

but I've also seen it written as:

 

if (preg_match("/(%0A|%0D|\\n+|\\r+)/i", $_POST['subject'])) {
// reject this data
}

 

Obviously the difference is with the 3rd and 4th alternatives in the search.  As I understand it, the \n and \r in the first one will get converted prior to searching (so we'll be searching for the actual newline and return characters) whereas in the second piece of code the additional forward slash will prevent conversion (so we'll actually be looking for \n and \r).  But which should we be looking for?

 

I think I am right in saying that the %0A and %0D will be looking for those actual 'terms', and not the characters they represent (otherwise there wouldn't be any point in the i at the end of the first expression), so presumably the 3rd and 4th terms should also be the representations and not the actual characters?

 

I've done a few tests by creating a form into which I can type a test expression and a piece of text to search but I am not sure that such a test is an accurate representation of the manner in which a would be spammer would attack.

 

An additional concern, because if I'm right it suggests that the person who 'created' the code above doesn't really understand it, is that I can't see any reason for the + characters.

 

Any thoughts?

 

Link to comment
Share on other sites

Without addressing your purpose, I can comment on the regular expression.  If you're just looking for a newline/return, then the "+" is unnecessary since it means "one or more."  Well, you're happy finding one, and not replacing them, so you won't need the "one or more" -- it'll work, but it's superfluous.

 

Also, you are correct in your statement that the %0A/D characters are literal.  The % has no special meaning in PCREs.  These patterns, %0A and %0D, will match the hexadecimal representations of line feed (new line) and carriage return, respectively, in case \n \r have been urlencode()'ed (like a space is converted to %20 in URLs).

 

The only thing I would change is to put the pattern in single quotes and get rid of the extra slash in front of \n and \r.

Link to comment
Share on other sites

The only thing I would change is to put the pattern in single quotes and get rid of the extra slash in front of \n and \r.

 

So you reckon we should be looking for backslash n and backslash r as opposed to the actual newline and carriage return characters?

 

The bit I can't get my head around here is how they are likely to 'appear' in the $_POST['subject'] string and which will be a problem when passed to the mail() function.

 

Given that the author of the code (presumably) didn't realise that the + was superfluous I don't trust them to have taken all of the subtleties of this issue into consideration.

Link to comment
Share on other sites

The only thing I would change is to put the pattern in single quotes and get rid of the extra slash in front of \n and \r.

 

So you reckon we should be looking for backslash n and backslash r as opposed to the actual newline and carriage return characters?

 

I'd write it like this:

preg_match('/(%0[AD]|\n|\r)/i', $_POST['subject'])

 

...encasing it in single quotes to avoid double slashing (\\n).  That's a tiny little issue, and I'm borderline OCD (not really).  The \r and \n will look for the literal newline/CR characters; that's just how they're written in the pattern.

 

I really don't know how likely this is to help you combat spam, though.  You'll have to decide what "thing" makes spambot-generated submissions different from legitimate, human-generated submissions and screen input based on that.  If it's these LF/CR characters, then that regex will find them.  Is $_POST['subject'] a single line, text input on your form?  It seems unlikely, then, that a human would be able to insert a newline in a text input as browsers often interpret pressing the enter key as a method of submitting the form, so maybe this will help you.

 

Link to comment
Share on other sites

The \r and \n will look for the literal newline/CR characters; that's just how they're written in the pattern.

 

Yeah, but what I am unsure of is how they are likely to appear in $_POST['subject'] because if I check for backslash n then and the spammer injects a newling character...

 

I really don't know how likely this is to help you combat spam, though.

 

It's not about preventing spam being sent to the form owner.  It's about preventing the form being 'hi-jacked' for sending spam in the form owners name.

 

Let's say you have a "contact me" type form that lets somebody send you a message with a subject line of their choice and supply an email address for you to reply to.  You could send it to yourself with code like this:

 

$to = "yourname@yourdomain.com";
$headers = "From: ".$_POST['from']."\nReply-To: ".$_POST['from']."\n";
mail($to, $_POST['subject'], $_POST['message'], $headers);

 

The problem is that because there's no checking, a spammer could call that page from a form of their own and, for example, arrange it such that $_POST['from'] equates to:

 

"addr1@domain1.com\nbcc: addr2@domain2.com,addr3@domain3.com,addr4@domain4.com"

 

Thus in addition to sending the message to yourname@yourdomain.com the message will also be send as bcc to three other addresses!

 

Link to comment
Share on other sites

That sounds reasonable however you've put it in single quotes and as I undertand it, they need to be double quotes so that the \\ becomes \ and the \n and \r get converted into the characters the represent.

 

Thus I believe it should be:

 

"/\\[rn]|\n|\r|%0[AD]/i"

 

 

Link to comment
Share on other sites

It'll probably work either way.  I usually prefer single quotes for regex.  \n and \r are correctly interpreted by the regex engine as the special characters CR/LF (cf. PCRE Syntax).  Using double quotes causes PHP to interpolate them before the regex sees them, but they should be interpreted as the literal characters then.  I could be wrong about that, though, but I know the single quote version will work.  Also, the double slash (\\) is interpreted by the regex as the single slash we really want; it must be escaped since it is a meta character with respect to regular expressions.

Link to comment
Share on other sites

It'll probably work either way.

 

No.  After reading what you just said a dozen times I think it has to be single quotes.

 

I think the \n and \r would be okay either way.  If we use single quotes then PHP will pass it to the regex engine as \n and \r and the regex engine will convert them, whereas if we use double quotes PHP will convert them - which is fine.

 

The \\[rn] however is a different kettle of fish because if we use double quotes it will be \[rn] by the time the regex engine sees it and the regex engine will then interpret the \ as an escape on the [ as opposed to a slash character that we want it to look for.

 

I think.

 

But I'm not sure because my brain just melted.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.