Properly encoding and escaping for the web


When processing untrusted user input for web applications, filter the input and encode the output. That is the most widely given advice for preventing injection attacks. Yet it can be deceptively difficult to encode user input properly. Encoding depends on the type of output: a string that will be used in a JavaScript variable, for example, must be encoded differently from a string that will be used in plain HTML.

When outputting untrusted user input, encode or escape it based on the context, that is, the location of the output.
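As a rough illustration of why the context matters, the sketch below encodes the same string for two output locations using only the Python standard library. The sample string and the variable names are made up for this example.

```python
import html
from urllib.parse import quote

# Hypothetical untrusted input, used only for illustration.
untrusted = "Tom & Jerry's <b>/show</b>"

# HTML context: neutralise the characters that carry markup meaning.
as_html = html.escape(untrusted, quote=True)

# URL component context: percent-encode everything outside the unreserved set.
as_url = quote(untrusted, safe="")

print(as_html)  # Tom &amp; Jerry&#x27;s &lt;b&gt;/show&lt;/b&gt;
print(as_url)   # Tom%20%26%20Jerry%27s%20%3Cb%3E%2Fshow%3C%2Fb%3E
```

The two results are not interchangeable: the HTML-encoded form is not a valid URL component, and the URL-encoded form would still be shown literally (percent signs included) in an HTML page.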

And what's the difference between escaping and encoding?

Encoding is transforming data from one format into another format.

Escaping is a subset of encoding in which only some characters, those with a special meaning in the target context, are encoded (by using an escape character).
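A small sketch of the distinction, assuming hexadecimal as the example of a full encoding and HTML entity escaping as the example of escaping:

```python
import html

text = "a < b"

# Encoding: every byte is transformed into another representation.
encoded = text.encode("utf-8").hex()

# Escaping: only the character with special meaning is replaced.
escaped = html.escape(text)

print(encoded)  # 61203c2062
print(escaped)  # a &lt; b
```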

There are quite a number of encoding mechanisms, which makes this more difficult than it might look at first glance.

URL encoding

URL encoding is a method to encode information in a Uniform Resource Identifier. There's a set of reserved characters, which have special meaning, and a set of unreserved, or safe, characters, which can be used as they are. A reserved character that is meant as data is encoded as a percent sign '%' followed by the two hexadecimal digits of its byte value. That's why URL encoding is sometimes referred to as percent encoding.

Note

A Uniform Resource Identifier (URI) identifies an abstract or physical resource. The set of URIs includes Uniform Resource Locators (URLs) as well as Uniform Resource Names (URNs). A URL includes the 'access mechanism', the scheme, whereas a URN only names the resource. An example of a URN is urn:isbn:0451450523, and an example of a URL is https://extract.pw. Both are URIs, as all URLs and URNs are. tel:+61-123-123 is a URI, but not a URL.

The current set of reserved characters for a URI is ! * ' ( ) ; : @ & = + $ , / ? # [ ] . See RFC 3986 section 2.2. Note however that the reserved and safe character sets have changed over time.

An example of URL encoding:

&   ------->   %26
'   ------->   %27
/   ------->   %2F

The following is an example of URL encoding using Python 3, where the reserved characters are percent-encoded and the space is encoded as a plus sign:

% python3 -c "from urllib.parse import quote_plus; print(quote_plus('ape&nutz/better '))"

ape%26nutz%2Fbetter+
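The Python standard library offers more than one URL encoding function, and they treat the slash and the space differently. The comparison below is a minimal sketch using the same input string as above:

```python
from urllib.parse import quote, quote_plus, unquote_plus

raw = "ape&nutz/better "

path_style = quote(raw)            # '/' is left alone by default
strict_style = quote(raw, safe="")  # nothing besides unreserved characters is left alone
form_style = quote_plus(raw)        # like strict_style, but space becomes '+'

print(path_style)    # ape%26nutz/better%20
print(strict_style)  # ape%26nutz%2Fbetter%20
print(form_style)    # ape%26nutz%2Fbetter+

# Decoding reverses the transformation.
print(unquote_plus(form_style) == raw)  # True
```

Which variant is appropriate depends on where in the URL the value ends up: quote() suits path segments, quote_plus() suits form-style query strings.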

The following is an example of URL encoding where all characters except the access mechanism (protocol) are encoded. Adversaries can use this to obfuscate an address, because the result is still a valid address.

https://%65%78%74%72%61%63%74%2e%70%77

Modern browsers will show you the decoded URL when hovering over these kinds of obfuscated links.

Usually, when people refer to URL encoding, they mean encoding only the reserved, non-safe characters. But again, the fully encoded example above is still valid.
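The fully encoded address shown earlier can be reproduced with a few lines of Python. The helper name percent_encode_all is made up for this sketch; it is not a standard library function.

```python
from urllib.parse import unquote


def percent_encode_all(text: str) -> str:
    """Percent-encode every byte of text, including the unreserved characters."""
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))


host = "extract.pw"
obfuscated = "https://" + percent_encode_all(host)

print(obfuscated)           # https://%65%78%74%72%61%63%74%2e%70%77
print(unquote(obfuscated))  # https://extract.pw
```

Decoding with unquote() before inspecting or logging a URL is one way to see what an obfuscated link actually points to.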

Unicode encoding

Unicode is a standard for encoding, representing and handling characters in most (if not all) languages. Unicode characters are identified by code points. The best-known Unicode character encoding is UTF-8, a variable-length encoding that uses 1, 2, 3 or 4 units of 8 bits per code point (hence the name UTF-8). It is compatible with ASCII: the first 128 code points of Unicode are the ASCII characters, and UTF-8 encodes each of them as a single byte. That makes it a compact encoding for mostly-ASCII text. Other well-known Unicode encodings are UTF-16 and UTF-32. Latin-1 (ISO 8859-1) is not a Unicode encoding, although the first 256 Unicode code points match it. In UTF-8, a character can therefore consist of one or more 8-bit units. Glyphs are representations of characters on the screen.

U+2206 ∆ (Increment)

The code point is U+2206; the character is called Increment, and it is represented by the glyph ∆ on the screen.
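The relationship between a code point, its name and its encoded bytes can be inspected from Python. This is a small sketch using the standard unicodedata module and the Increment character from the example:

```python
import unicodedata

char = "\u2206"  # the Increment character

code_point = f"U+{ord(char):04X}"
name = unicodedata.name(char)
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")

print(code_point)        # U+2206
print(name)              # INCREMENT
print(utf8_bytes.hex())  # e28886 (three 8-bit units)
print(utf16_bytes.hex()) # 2206 (one 16-bit unit)

# An ASCII character needs only a single unit in UTF-8.
print(len("A".encode("utf-8")))  # 1
```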

Note

Sometimes vulnerabilities can lurk in the part where Unicode characters are decoded, for example by using different word-size encodings, or by using illegal Unicode encodings. Unicode decoding vulnerabilities warrant their own article.
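One well-known kind of illegal encoding is the "overlong" sequence, where a character is encoded with more bytes than necessary. The sketch below assumes the classic example of a slash written as the two bytes C0 AF and shows that Python's strict UTF-8 decoder rejects it. A lenient decoder that accepted it could let a slash slip past a filter that only looks for the byte 2F.

```python
# Overlong (illegal) two-byte form of "/", which is normally the single byte 0x2F.
overlong_slash = b"\xc0\xaf"

try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(rejected)                  # True
print(b"\x2f".decode("utf-8"))   # /
```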

HTML encoding

Also known as HTML entity encoding. The characters & < > " ' / should always be HTML entity encoded before they are output into HTML content, for example into HTML attribute values.

&   ------->   &amp;
'   ------->   &#x27;

When outputting untrusted user input in HTML, make sure to properly HTML encode the following characters:

&   ------->   &amp;
<   ------->   &lt;
>   ------->   &gt;
"   ------->   &quot;
'   ------->   &#x27;
/   ------->   &#x2F;

Why? Because these characters can "break out of the context": they can end the element or attribute value in which the input was placed and start new markup.
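The table above can be implemented in Python as a thin wrapper around the standard html.escape function. html.escape covers & < > " and ' but leaves the slash alone, so this sketch adds that replacement itself; the function name html_encode and the sample payload are made up for illustration.

```python
import html


def html_encode(value: str) -> str:
    """Entity-encode & < > " ' and / as listed in the table above."""
    # html.escape handles & < > " ' ; none of its entities contain a slash,
    # so replacing "/" afterwards cannot double-encode anything.
    return html.escape(value, quote=True).replace("/", "&#x2F;")


payload = '"><script>alert(1)</script>'
encoded_payload = html_encode(payload)

print(encoded_payload)
# &quot;&gt;&lt;script&gt;alert(1)&lt;&#x2F;script&gt;
```

Once encoded, the payload can no longer close the attribute or open a script element; the browser renders it as text.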

JavaScript encoding

In general, encode all non-alphanumeric characters instead of relying on backslash escape characters such as \". This prevents double encoding / decoding issues in combination with HTML encoding.

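JavaScript encoding is done by replacing the character with \x followed by the two hexadecimal digits of its value. Characters with a value above FF need the \u form followed by four hexadecimal digits.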

&   ------->   \x26
"   ------->   \x22
'   ------->   \x27
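A minimal sketch of such an encoder in Python is shown below. The function name js_encode is hypothetical, and treating only ASCII letters and digits as safe is an assumption of this sketch. Characters above FF are written in the \u form, using UTF-16 code units so that characters beyond U+FFFF become surrogate pairs.

```python
def js_encode(value: str) -> str:
    """Encode every character except ASCII letters and digits for a JavaScript string."""
    out = []
    for ch in value:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ord(ch) <= 0xFF:
            out.append(f"\\x{ord(ch):02X}")
        else:
            # Emit one \uXXXX escape per UTF-16 code unit.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "big")
                out.append(f"\\u{unit:04X}")
    return "".join(out)


print(js_encode("&\"'"))    # \x26\x22\x27
print(js_encode("a b"))     # a\x20b
print(js_encode("\u2206"))  # \u2206
```

The result is meant to be placed between the quotes of a JavaScript string literal; it does not make input safe for other JavaScript contexts, such as code or identifiers.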

Proper encoding or escaping, based on the context, is really difficult, and easier to get wrong than right. Make sure you know when to use which format.

URL encoding

Unicode encoding

HTML encoding

JavaScript encoding
