web_craftsman

How to filter meta tags for XSS

In my CMS I want to give site moderators the ability to associate arbitrary meta information with a page. Meta keywords and description have their own fields, but everything else is entered as raw HTML, like this:

<meta name="Generator" content="SomeCMS" />
<meta name="robots" content="nofollow" />
<link rel="canonical" href="http://example.com/content/poisk-i-upravlenie-kontentom" />

This HTML will be echoed to the page.

It will mostly contain meta tags and link (rel=canonical) elements. I think I have to make sure there is no XSS attack in this code, so I need to filter it before saving it to the database.

HtmlPurifier and http://github.com/voku/anti-xss don't work with meta tags. So what would you advise? Should I parse the text with a regexp for meta tags, then check every meta tag found for style attributes, on* attributes, or http-equiv="refresh" (to reject malicious meta tags)?

 


What exactly prevents you from storing the meta elements as key/value pairs rather than raw HTML? This will drastically reduce the risk of XSS vulnerabilities.
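To illustrate the key/value idea, here is a minimal sketch in Python. The thread has no code of its own, so the language choice, the function name `render_meta_pairs`, and the sample data are all assumptions. The stored pairs are plain strings, and markup is produced only at output time with both values escaped. In PHP, `htmlspecialchars()` would play the same role.

```python
from html import escape


def render_meta_pairs(pairs):
    """Render (name, content) string pairs as <meta> elements.

    Hypothetical helper: both values are HTML-escaped, so whatever a
    moderator typed can only ever appear as attribute text.
    """
    return "\n".join(
        '<meta name="%s" content="%s" />'
        % (escape(name, quote=True), escape(content, quote=True))
        for name, content in pairs
    )


if __name__ == "__main__":
    stored = [
        ("Generator", "SomeCMS"),
        ("robots", "nofollow"),
        # An attempted injection is stored as-is and neutralised on output.
        ("x", '"><script>alert(1)</script>'),
    ]
    print(render_meta_pairs(stored))
```

Because the moderator never supplies markup, there are no stray attributes or extra elements to blacklist in the first place.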

 

In any case, allowing arbitrary meta elements is a risk, no matter how hard you try to blacklist dangerous combinations. There will always be a problem you haven't considered yet (for example: using a <meta charset> element to break HTML-escaping).

 

Link elements are even worse, because now we're talking about external resources like stylesheets (which can be used for attacks).

Edited by Jacques1


DTD, XSD, or Relax NG schema validation could do a lot of the work, but AFAIK it wouldn't be able to validate actual attribute values (e.g., that the canonical URL has the right domain).
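A value-level check like the canonical-domain one can be sketched in a few lines. This is only an illustration in Python: `ALLOWED_HOSTS` and `is_allowed_canonical` are hypothetical names, and the rule "http/https on a known host" is an assumption about what "the right domain" means for a given site.

```python
from urllib.parse import urlsplit

# Assumption: the set of hosts the site is served from.
ALLOWED_HOSTS = {"example.com"}


def is_allowed_canonical(url):
    """Return True only for http(s) URLs whose host is in ALLOWED_HOSTS."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lower-cased, without port or userinfo
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. a broken IPv6 literal).
        return False
    return parts.scheme in ("http", "https") and host in ALLOWED_HOSTS
```

Comparing the parsed host rather than searching the string matters here: `http://example.com@evil.test/` contains "example.com" but points somewhere else.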

 

That basically leaves you with your own validation routines. As long as you approach it like a whitelist - specifically allow certain structures, disallow everything else - then this is quite possible. However it gets exponentially more difficult as you allow more complex HTML; just that example there would be fine, but I'm more concerned about what else would be possible.

 

Unless you want to get really sophisticated with this, you should do the specific key/value meta pairs thing (and a separate entry for the canonical URL, plus whatever else). It's the safest course of action, and it doesn't require that the user understand writing HTML markup.

As a secondary option for the user, you could allow them to input HTML and then scan it for particular elements to keep. As in load the string into DOMDocument (no regular expressions), search for <meta> and <link> tags, then extract the data into that key/value system.
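A rough sketch of that "parse, then extract" idea, written in Python with the standard `html.parser` module instead of PHP's DOMDocument (a substitution made only for this illustration). The class name `HeadTagExtractor`, the function `extract_head_data`, and the rule "keep a tag only if it has exactly the expected attributes" are assumptions, not something stated in the thread.

```python
from html.parser import HTMLParser


class HeadTagExtractor(HTMLParser):
    """Collect whitelisted data from pasted markup and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.meta = []         # list of (name, content) pairs
        self.canonical = None  # href of the last canonical link seen

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values are unescaped.
        attrs = dict(attrs)
        if tag == "meta":
            # Keep only <meta name=... content=...> with no extra attributes.
            if set(attrs) == {"name", "content"} and attrs["name"]:
                self.meta.append((attrs["name"], attrs["content"] or ""))
        elif tag == "link":
            # Keep only <link rel="canonical" href=...> with no extra attributes.
            rel = (attrs.get("rel") or "").lower()
            if set(attrs) == {"rel", "href"} and rel == "canonical" and attrs["href"]:
                self.canonical = attrs["href"]


def extract_head_data(markup):
    parser = HeadTagExtractor()
    parser.feed(markup)
    parser.close()
    return parser.meta, parser.canonical


SAMPLE = """
<meta name="Generator" content="SomeCMS" />
<meta name="robots" content="nofollow" />
<link rel="canonical" href="http://example.com/content/poisk-i-upravlenie-kontentom" />
<meta http-equiv="refresh" content="0;url=http://evil.test/" />
<meta name="x" content="y" onclick="alert(1)" />
<script>alert(1)</script>
"""

if __name__ == "__main__":
    print(extract_head_data(SAMPLE))
```

The pasted markup is never stored or echoed. Only the extracted strings are kept, and they still need escaping on output and their own value checks (the canonical URL in particular).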

 

I could write a proof of concept for the "sophisticated" approach, if someone asks for it, but I just started playing FFXIV and right now I'd rather do that.


Avoiding raw HTML is also a matter of usability. I do know HTML, and I would still very much prefer a proper GUI with a combobox over having to manually write down tags, some of which I would first have to look up. Now imagine a layman struggling with a syntax error somewhere in a big block of markup.


The only reason I can think of that someone would have the markup is because they're copying it from somewhere, and that's an instance where not being familiar with HTML causes the opposite problem. For example, Google Analytics gives you some

 

On that note,

1. Stuff like the generator should be automatic anyways - not manually written out by someone.

2. The robots thing should be a global- or page-level option that a user enables in some configuration area, then rendered into HTML appropriately - not manually written out by someone.

3. The canonical URL should definitely be automatic - unless you want someone to be able to say that a particular page is derivative of some other page on some other website (which would be quite suspicious).
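The three points above amount to generating these tags from settings rather than from user markup. A small sketch in Python, where `PageSettings`, `render_head`, `BASE_URL`, and the two robots flags are hypothetical names chosen for the example:

```python
from dataclasses import dataclass
from html import escape
from urllib.parse import quote

GENERATOR = "SomeCMS"            # fixed by the CMS, never typed by a user
BASE_URL = "http://example.com"  # assumption: the site's own origin


@dataclass
class PageSettings:
    path: str               # e.g. "/content/some-page"
    noindex: bool = False   # page-level options a user ticks in a form
    nofollow: bool = False


def render_head(page):
    tags = ['<meta name="Generator" content="%s" />' % escape(GENERATOR)]
    directives = [
        name
        for name, enabled in (("noindex", page.noindex), ("nofollow", page.nofollow))
        if enabled
    ]
    if directives:
        tags.append('<meta name="robots" content="%s" />' % ",".join(directives))
    # The canonical URL is derived from the page's own path, not entered by hand.
    canonical = BASE_URL + quote(page.path)
    tags.append('<link rel="canonical" href="%s" />' % escape(canonical))
    return "\n".join(tags)


if __name__ == "__main__":
    print(render_head(PageSettings("/content/poisk-i-upravlenie-kontentom", nofollow=True)))
```

With this shape, the only user input is a couple of checkboxes, so there is nothing left to filter for these three tags.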

Edited by requinix


"What exactly prevents you from storing the meta elements as key/value pairs rather than raw HTML? This will drastically reduce the risk of XSS vulnerabilities."

 

I would need to create a whole meta tag constructor for this, with features like reordering, adding, and deleting. It is a big piece of work, and there is one crucial problem:

It looks like SEO specialists like to add some very specific meta tags. How could I guess what they need?

For example, googling says that a meta tag can have the following attributes: name, content, scheme, http-equiv.

It doesn't mention the charset attribute, which appears as the tag's only attribute.

And after looking at a few websites, I very soon found a meta tag like this:

<meta property="fb:app_id" content="966242223397117" />

So it looks a bit complicated to create a constructor for all cases.


"The only reason I can think of that someone would have the markup is because they're copying it from somewhere, and that's an instance where not being familiar with HTML causes the opposite problem."

When people use all kinds of WYSIWYG editors, they are working with raw HTML too.

Edited by web_craftsman


"Avoiding raw HTML is also a matter of usability. I do know HTML, and I would still very much prefer a proper GUI with a combobox over having to manually write down tags, some of which I would first have to look up. Now imagine a layman struggling with a syntax error somewhere in a big block of markup."

Website content managers are supposed to know HTML.

"So it looks a bit complicated to create a constructor for all cases."

 

How many cases are there in reality? You definitely don't want the admin to mess with the document encoding, so the charset attribute is out of the question. Setting arbitrary HTTP options also isn't recommended, so http-equiv is irrelevant as well.

 

That leaves you with exactly two cases: <meta name="..." content="..."> (HTML) and <meta property="..." content="...">  (RDFa). 
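Those two cases are small enough to validate directly. A sketch in Python, where `render_meta_entry`, `ALLOWED_KINDS`, and the key pattern are assumptions made for the example (in particular, the exact set of characters allowed in a key is a guess that happens to cover names like `robots` and `fb:app_id`):

```python
import re
from html import escape

# Only the two attribute kinds named above; charset and http-equiv are excluded.
ALLOWED_KINDS = {"name", "property"}

# Assumption: keys are short tokens of letters, digits, and a few separators.
KEY_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9:._-]*")


def render_meta_entry(kind, key, content):
    """Render one stored (kind, key, content) entry, or raise ValueError."""
    if kind not in ALLOWED_KINDS:
        raise ValueError("unsupported meta attribute: %r" % kind)
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError("unsupported meta key: %r" % key)
    # kind comes from the fixed whitelist; key and content are escaped.
    return '<meta %s="%s" content="%s" />' % (kind, escape(key), escape(content))
```

A form with a name/property dropdown, a key field, and a content field would be enough to feed this, which is a much smaller job than a constructor for every possible attribute.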
