
How to filter meta tags for XSS


web_craftsman
Go to solution Solved by Jacques1,


In my CMS I want to let site moderators attach arbitrary meta information to a page. Meta keywords and description have their own fields, but everything else is entered as raw HTML, like this:

<meta name="Generator" content="SomeCMS" />
<meta name="robots" content="nofollow" />
<link rel="canonical" href="http://example.com/content/poisk-i-upravlenie-kontentom" />

This HTML will be echoed to the page.

Mostly this will contain only meta tags and link (rel=canonical) elements. I think I have to make sure there is no XSS attack in this code, so I need to filter it before saving it to the database.

HtmlPurifier and http://github.com/voku/anti-xss don't work with meta tags. What would you advise? Should I parse the text with a regexp to find meta tags, and then check every meta tag found for style attributes, on* attributes, or http-equiv="refresh" (to reject malicious meta tags)?

 

Link to comment
Share on other sites

What exactly prevents you from storing the meta elements as key/value pairs rather than raw HTML? This will drastically reduce the risk of XSS vulnerabilities.

 

In any case, allowing arbitrary meta elements is a risk, no matter how hard you try to blacklist dangerous combinations. There will always be a problem you haven't considered yet (for example: using a <meta charset> element to break HTML-escaping).

 

Link elements are even worse, because now we're talking about external resources like stylesheets (which can be used for attacks).
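A minimal sketch of the key/value idea, written in Python purely for illustration (the thread doesn't say which language the CMS uses). The function name `render_meta` and the sample data are made up. The point is that the application writes the markup itself and the stored values are only ever inserted as escaped attribute values:

```python
from html import escape


def render_meta(pairs):
    """Render stored (name, content) pairs as <meta> elements.

    Both values are HTML-escaped, so a stored value cannot close the
    attribute or the tag and inject markup of its own.
    """
    lines = []
    for name, content in pairs:
        lines.append(
            '<meta name="{}" content="{}">'.format(
                escape(name, quote=True), escape(content, quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical rows as they might come out of the database.
stored = [
    ("robots", "nofollow"),
    ('x"><script>alert(1)</script>', "payload"),
]
print(render_meta(stored))
```

The second row shows an injection attempt ending up as inert text inside the name attribute. Escaping alone does not decide which names are acceptable; that still needs a separate check.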

Edited by Jacques1
Link to comment
Share on other sites

DTD, XSD, or Relax NG schema validation could do a lot of the work, but AFAIK it wouldn't be able to validate actual attribute values (e.g., that the canonical URL has the right domain).

 

That basically leaves you with your own validation routines. As long as you approach it like a whitelist - specifically allow certain structures, disallow everything else - then this is quite possible. However it gets exponentially more difficult as you allow more complex HTML; just that example there would be fine, but I'm more concerned about what else would be possible.
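As an illustration of such a hand-written routine, here is a rough Python sketch of the canonical URL check mentioned above. `ALLOWED_HOSTS` and `is_allowed_canonical` are hypothetical names, and the host list is an assumption based on the example.com URL in the question:

```python
from urllib.parse import urlsplit

# Assumption: the hosts this site may declare as canonical.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_allowed_canonical(url):
    """Whitelist check: absolute http(s) URL on a known host, nothing else."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs (e.g. broken IPv6 brackets).
        return False
    return parts.scheme in ("http", "https") and host in ALLOWED_HOSTS


print(is_allowed_canonical("http://example.com/content/poisk-i-upravlenie-kontentom"))
print(is_allowed_canonical("javascript:alert(1)"))
print(is_allowed_canonical("http://example.com.evil.test/"))
```

It follows the whitelist approach: anything that is not explicitly an http or https URL on a listed host is rejected, including scheme-relative URLs and look-alike hosts.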

 

Unless you want to get really sophisticated with this, you should do the specific key/value meta pairs thing (and a separate entry for the canonical URL, plus whatever else). It's the safest course of action, and it doesn't require that the user understand writing HTML markup.

As a secondary option, you could let users input HTML and then scan it for particular elements to keep: load the string into DOMDocument (no regular expressions), search for <meta> and <link> tags, then extract the data into that key/value system.
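A rough sketch of that extraction step. The post names DOMDocument; this uses Python's standard-library `html.parser` instead, only so that the example is self-contained. The class name `MetaExtractor` and the sample input are hypothetical:

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Keep only <meta name|property + content> and <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.meta = []         # list of (kind, key, content) tuples
        self.canonical = None  # href of the first canonical link, if any

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        a = dict(attrs)
        if tag == "meta":
            content = a.get("content")
            for kind in ("name", "property"):
                if a.get(kind) and content is not None:
                    self.meta.append((kind, a[kind], content))
                    break
        elif tag == "link":
            rel = (a.get("rel") or "").lower()
            if rel == "canonical" and a.get("href") and self.canonical is None:
                self.canonical = a["href"]


raw = """<meta name="robots" content="nofollow" />
<meta http-equiv="refresh" content="0;url=http://evil.example/" />
<meta property="fb:app_id" content="966242223397117" />
<link rel="stylesheet" href="http://evil.example/x.css" />
<link rel="canonical" href="http://example.com/content/poisk-i-upravlenie-kontentom" />
<script>alert(1)</script>"""

parser = MetaExtractor()
parser.feed(raw)
parser.close()
print(parser.meta)
print(parser.canonical)
```

Nothing from the submitted markup is stored verbatim. Only the extracted values survive, and they still have to be validated and escaped when the page is rendered.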

 

I could write a proof of concept for the "sophisticated" approach, if someone asks for it, but I just started playing FFXIV and right now I'd rather do that.

Link to comment
Share on other sites

Avoiding raw HTML is also a matter of usability. I do know HTML, and I would still very much prefer a proper GUI with a combobox over having to manually write down tags, some of which I would first have to look up. Now imagine a layman struggling with a syntax error somewhere in a big block of markup.

Link to comment
Share on other sites

The only reason I can think of that someone would have the markup is because they're copying it from somewhere, and that's an instance where not being familiar with HTML causes the opposite problem. For example, Google Analytics gives you some

 

On that note,

1. Stuff like the generator should be automatic anyways - not manually written out by someone.

2. The robots thing should be a global- or page-level option that a user enables in some configuration area, then rendered into HTML appropriately - not manually written out by someone.

3. The canonical URL should definitely be automatic - unless you want someone to be able to say that a particular page is derivative of some other page on some other website (which would be quite suspicious).
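The three points above could be sketched like this, in Python for illustration only. `SITE_BASE`, `GENERATOR` and `head_tags` are hypothetical names, and the page path is assumed to come from the CMS's own routing rather than from user-written markup:

```python
from html import escape
from urllib.parse import quote

SITE_BASE = "http://example.com"  # assumption: configured once per site
GENERATOR = "SomeCMS"             # fixed by the CMS, never user input


def head_tags(path, noindex=False, nofollow=False):
    """Build generator, robots and canonical tags from settings."""
    tags = ['<meta name="Generator" content="{}">'.format(escape(GENERATOR))]

    # Robots directives come from per-page (or global) boolean options.
    robots = [flag for flag, on in (("noindex", noindex), ("nofollow", nofollow)) if on]
    if robots:
        tags.append('<meta name="robots" content="{}">'.format(",".join(robots)))

    # The canonical URL is derived from the site base and the page path.
    canonical = SITE_BASE + "/" + quote(path.lstrip("/"), safe="/")
    tags.append('<link rel="canonical" href="{}">'.format(escape(canonical)))
    return tags


for tag in head_tags("/content/poisk-i-upravlenie-kontentom", nofollow=True):
    print(tag)
```

With this, the three example tags from the first post are produced without anyone typing markup.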

Edited by requinix
Link to comment
Share on other sites

What exactly prevents you from storing the meta elements as key/value pairs rather than raw HTML? This will drastically reduce the risk of XSS vulnerabilities.

 

I would need to build a whole meta tag constructor for this, with features like reordering, adding, and deleting. That is a big piece of work, and there is one crucial problem:

It looks like SEO specialists like to add very specific meta tags. How could I guess what they need?

For example, from googling I found that a meta tag can have the following attributes: name, content, scheme, http-equiv.

That list does not mention the charset attribute, which is used as the meta tag's only attribute.

And after looking at some websites I soon found a meta tag like this:

<meta property="fb:app_id" content="966242223397117" />

So it looks a bit complicated to create a constructor for all cases.

Link to comment
Share on other sites

The only reason I can think of that someone would have the markup is because they're copying it from somewhere, and that's an instance where not being familiar with HTML causes the opposite problem.

When people use all kinds of WYSIWYG editors, they are working with raw HTML too.

Edited by web_craftsman
Link to comment
Share on other sites

Avoiding raw HTML is also a matter of usability. I do know HTML, and I would still very much prefer a proper GUI with a combobox over having to manually write down tags, some of which I would first have to look up. Now imagine a layman struggling with a syntax error somewhere in a big block of markup.

Website content managers are supposed to know HTML.

Link to comment
Share on other sites

  • Solution
So it looks a bit complicated to create a constructor for all cases.

 

How many cases are there in reality? You definitely don't want the admin to mess with the document encoding, so the charset attribute is out of the question. Setting arbitrary HTTP options also isn't recommended, so http-equiv is irrelevant as well.

 

That leaves you with exactly two cases: <meta name="..." content="..."> (HTML) and <meta property="..." content="...">  (RDFa). 
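To make the two cases concrete, here is a small input-side check, sketched in Python for illustration. `validate_meta`, `KEY_PATTERN` and `MAX_CONTENT_LENGTH` are hypothetical, and the allowed key characters and the length limit are assumptions, not something taken from a specification:

```python
import re

# Assumption: keys are short tokens such as "robots" or "fb:app_id".
KEY_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9:._-]{0,63}")
MAX_CONTENT_LENGTH = 1000  # assumption: arbitrary upper bound


def validate_meta(kind, key, content):
    """Accept only the name/content and property/content shapes.

    Returns a normalized (kind, key, content) tuple for storage and
    raises ValueError for anything else (charset, http-equiv, ...).
    """
    if kind not in ("name", "property"):
        raise ValueError("unsupported meta attribute: " + repr(kind))
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError("unsupported meta key: " + repr(key))
    if len(content) > MAX_CONTENT_LENGTH:
        raise ValueError("content too long")
    return (kind, key, content.strip())


print(validate_meta("property", "fb:app_id", "966242223397117"))
print(validate_meta("name", "robots", " nofollow "))
```

The content value is deliberately not filtered here beyond its length. It can hold arbitrary text as long as it is HTML-escaped when the tag is written to the page.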

Link to comment
Share on other sites
