When outputting untrusted user input, encode or escape it according to the context in which the output is placed.
And what's the difference between escaping and encoding?
Encoding is transforming data from one format into another format.
Escaping is a subset of encoding, where not all characters need to be encoded. Only some characters are encoded (by using an escape character).
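To make the distinction concrete, here is a minimal sketch in Python 3. Base64 and HTML escaping are chosen purely as illustrations (they are assumptions, not mechanisms named above): the first transforms the whole value, the second only touches the one character that has special meaning.

```python
import base64
import html

data = "a&b"

# Encoding: the whole value is transformed into another format.
encoded = base64.b64encode(data.encode("utf-8")).decode("ascii")

# Escaping: only the character with special meaning ("&") is replaced,
# the letters "a" and "b" are passed through unchanged.
escaped = html.escape(data)

print(encoded)
print(escaped)
```

After encoding, none of the original characters are recognizable; after escaping, most of them still are.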
There are quite a number of encoding mechanisms, which makes this more difficult than it might look at first glance.
URL encoding is a method to encode information in a Uniform Resource Identifier. There's a set of reserved characters, which have special meaning, and a set of unreserved, or safe, characters, which are safe to use. If a reserved character is to be used as plain data, it is encoded as the percent sign '%' followed by the two hexadecimal digits of its byte value. That's why URL encoding is sometimes referred to as percent encoding.
A Uniform Resource Identifier (URI) identifies an abstract or physical resource. The set of URIs includes Uniform Resource Locators (URLs) as well as Uniform Resource Names (URNs). A URL includes the 'access mechanism', the scheme. An example of a URN is urn:isbn:0451450523, and an example of a URL is https://extract.pw. Both are URIs, as all URLs and URNs are. tel:+61-123-123 is a URI, but not a URL.
The current set of reserved characters for a URI is:
! * ' ( ) ; : @ & = + $ , / ? # [ ]
See RFC 3986 section 2.2. Note however that the reserved and safe character sets change over time.
An example of URL encoding:
& -------> %26
' -------> %27
/ -------> %2F
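The following is an example of URL encoding using Python 3. Note that quote_plus encodes every character except letters, digits and _ . - ~, so the space, which is not a reserved character, is encoded as well (as +):
% python3 -c "import urllib.parse; print(urllib.parse.quote_plus('ape&nutz/better '))"
ape%26nutz%2Fbetter+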
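Which characters get encoded depends on the function used. A minimal sketch, assuming Python 3's urllib.parse, comparing quote and quote_plus on the same input:

```python
from urllib.parse import quote, quote_plus, unquote_plus

raw = "ape&nutz/better "

# quote() treats "/" as safe by default and encodes the space as %20.
default_quote = quote(raw)

# With safe="" the "/" is percent-encoded as well.
strict_quote = quote(raw, safe="")

# quote_plus() also encodes "/", but turns the space into "+",
# which is the convention for HTML form data.
plus_quote = quote_plus(raw)

print(default_quote)
print(strict_quote)
print(plus_quote)

# Decoding restores the original string.
print(unquote_plus(plus_quote) == raw)
```

Pick the variant that matches the part of the URI being built: a path segment and a query string value do not have the same set of safe characters.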
The following is an example of URL encoding, where all characters except the access mechanism (protocol) are encoded. Adversaries can use this to obfuscate an address, while the result is still a valid address.
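A minimal sketch of how such a fully encoded address can be produced, assuming Python 3. The helper name encode_all is hypothetical, and the host extract.pw is reused from earlier in the text purely as an illustration:

```python
from urllib.parse import unquote


def encode_all(text: str) -> str:
    """Percent-encode every byte of text, including the unreserved characters."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


host = "extract.pw"

# Leave the scheme readable, encode everything after it.
obfuscated = "https://" + encode_all(host)
print(obfuscated)

# Decoding gives back the readable address.
print(unquote(obfuscated))
```

Nothing in the encoded form is readable at a glance, yet decoding it yields the original address unchanged.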
Modern browsers will show you the decoded URL when hovering over these kinds of obfuscated links.
Usually, when people refer to URI encoding, they mean encoding only the reserved, non-safe characters. But again, the above example is still valid.
Unicode is a standard for encoding, representing and handling characters in most (if not all) languages. Unicode characters are identified by code points. Best known is the UTF-8 character encoding, which is a variable-length encoding: each code point is encoded as 1, 2, 3 or 4 units of 8 bits, hence the name UTF-8. It is backwards compatible with ASCII: the first 128 code points of Unicode are the ASCII characters, and UTF-8 encodes each of them as a single byte. That makes it a compact encoding for ASCII-heavy text. Other well-known Unicode encodings are UTF-16 and UTF-32. Latin-1 (ISO 8859-1) is not a Unicode encoding, although its 256 characters correspond to the first 256 Unicode code points. A character can therefore consist of one or more 8-bit units. Glyphs are representations of characters on the screen.
U+2206 ∆ (Increment)
The code point is U+2206, the character is called Increment, and it is represented by the glyph ∆ on the screen.
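The relation between code point, name and encoded units can be inspected directly. A minimal sketch, assuming Python 3.8 or later (for bytes.hex with a separator):

```python
import unicodedata

ch = "\u2206"  # the Increment character from the example above

code_point = ord(ch)
name = unicodedata.name(ch)
utf8_units = ch.encode("utf-8")       # three 8-bit units
utf16_units = ch.encode("utf-16-be")  # one 16-bit unit (two bytes)

print(f"U+{code_point:04X}")
print(name)
print(utf8_units.hex(" "))
print(utf16_units.hex(" "))
```

The same code point takes three bytes in UTF-8 and two in UTF-16, which is why a decoder always has to know which encoding it is reading.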
Sometimes vulnerabilities can lurk in the part where Unicode characters are decoded, for example by using different word-size encodings, or by using illegal Unicode encodings. Unicode decoding vulnerabilities warrant their own article.
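One well-known kind of illegal encoding is the overlong UTF-8 sequence. As a minimal sketch, assuming Python 3's built-in strict UTF-8 decoder, the two-byte sequence 0xC0 0xAF (an overlong form of "/") is rejected instead of being decoded:

```python
# "/" is U+002F and must be encoded in UTF-8 as the single byte 0x2F.
# The two bytes 0xC0 0xAF are an overlong, illegal encoding of that code point.
overlong_slash = b"\xc0\xaf"

try:
    decoded = overlong_slash.decode("utf-8")
except UnicodeDecodeError as exc:
    # A strict decoder refuses the sequence.
    decoded = None
    print("rejected:", exc.reason)

# The byte 0x2F never appears in the input, so a filter that only looks
# for that byte would not notice anything before decoding.
print(b"/" in overlong_slash)
```

A lenient decoder that accepted the sequence would produce a "/" after any byte-level filtering had already taken place, which is where this class of vulnerability comes from.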
Also known as HTML entity encoding. The characters
& < > " ' /
should always be HTML entity encoded before outputting them into content. This is used, for example, in HTML attributes.
& -------> &amp;
' -------> &#x27;
When outputting untrusted user input in HTML, make sure to properly HTML encode the following characters:
& -------> &amp;
< -------> &lt;
> -------> &gt;
" -------> &quot;
' -------> &#x27;
/ -------> &#x2F;
Why? Because these characters can "break out of the context".
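A minimal sketch of HTML encoding these characters, assuming Python 3. The standard library function html.escape covers & < > " and ' but leaves / alone, so the hypothetical helper html_encode below adds that one replacement itself:

```python
import html


def html_encode(untrusted: str) -> str:
    """HTML entity encode the characters & < > " ' and /."""
    # html.escape replaces "&" first, so entities it produces are not
    # encoded a second time. "/" is not handled by html.escape.
    return html.escape(untrusted, quote=True).replace("/", "&#x2F;")


payload = "</p>&\"'"
print(html_encode(payload))
```

The closing tag in the payload can no longer end the surrounding element, because the browser sees the entities as text rather than markup.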
In general, encode all non-alphanumeric characters instead of using escape characters (such as a backslash in front of a quote). This prevents double encoding/decoding issues in combination with HTML encoding. A character is encoded as \x followed by its hexadecimal digits.
& -------> \x26
" -------> \x22
' -------> \x27
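A minimal sketch of this hexadecimal format, assuming Python 3 and assuming the output lands inside a JavaScript string literal (which the \x notation suggests). The helper name js_hex_encode is hypothetical, and the \uHHHH fallback for characters above U+00FF is an addition that the text does not cover:

```python
def js_hex_encode(untrusted: str) -> str:
    # Encode every non-alphanumeric character as \xHH,
    # or as \uHHHH when it does not fit in two hex digits.
    out = []
    for ch in untrusted:
        cp = ord(ch)
        if ch.isascii() and ch.isalnum():
            out.append(ch)                  # plain ASCII letter or digit
        elif cp <= 0xFF:
            out.append(f"\\x{cp:02X}")      # two hex digits
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")      # four hex digits
        else:
            # Above U+FFFF: emit a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)


print(js_hex_encode("a&b\"'"))
```

Because no quote or backslash escape shortcut appears in the output, an HTML parser that runs before the script parser has nothing to match against.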
Proper encoding or escaping, based on the context, is really difficult, and easier to get wrong than right. Make sure you know when to use which format.