# Data Sanitization and Validation With WordPress

Proper security is critical to keeping your site or that of your theme or plug-in users safe. Part of that means appropriate data validation and sanitization. In this article we are going to look at why this is important, what needs to be done, and what functions WordPress provides to help.

Since there seem to be various interpretations of what the terms 'validation', 'escaping' and 'sanitization' mean, I'll first clarify what I mean by them in this article:

• Validation – These are the checks that are run to ensure the data you have is what it should be. For instance, that an e-mail looks like an e-mail address, that a date is a date and that a number is (or is cast as) an integer
• Sanitization / Escaping – These are the filters that are applied to data to make it 'safe' in a specific context. For instance, to display HTML code in a text area it would be necessary to replace all the HTML tags by their entity equivalents

## Why Is Sanitization Important?

When data is included in some context (say in an HTML document), that data could be misinterpreted as code for that environment (for example HTML code). If that data contains malicious code, then using it without sanitizing it means that code will be executed. The code doesn't even necessarily have to be malicious to cause undesired effects. The job of sanitization is to make sure that any code in the data isn't interpreted as code, otherwise you may end up like Bobby Tables' school...

• wp_unique_filename( $dir, $filename ) – returns a unique (for directory $dir), sanitized filename (it uses sanitize_file_name).

### Data From Text Fields

When receiving data entered into a text field, you'll probably want to strip out extra white space, tabs and line breaks, as well as any tags. For this WordPress provides sanitize_text_field().

### Keys

WordPress also provides sanitize_key. This is a very generic (and occasionally useful) function. It simply ensures the returned value contains only lower-case alphanumerics, dashes, and underscores.

## Data Sanitization

Whereas validation is concerned with making sure data is valid, data sanitization is about making it safe. While some of the validation functions referred to above might be useful in making sure data is safe, in general they are not sufficient. Even 'valid' data might be unsafe in certain contexts.

## Rule No. 4: Making Data Safe Is About Context

Simply put, you cannot ask "How do I make this data safe?". Instead you should ask, "How do I make this data safe for use in X?".

To illustrate this point, suppose you have a widget with a textarea where you intend to allow the user to enter some HTML. Suppose they then enter HTML that itself contains a closing </textarea> tag. This is perfectly valid, and safe, HTML. However, when you click save, you find that the text has jumped out of the textarea: the HTML code is not safe as a value for the textarea.

What is safe to use in one context is not necessarily safe in another. Whenever you use or display data you must keep in mind what forms of sanitization need to be done in order to make using that data safe. This is why WordPress often provides several functions for the same content, each performing the necessary sanitization for a particular context. If you're using them you should be sure to use the correct one.
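As a rough sketch of what 'context' means in practice, the hypothetical helper below prints the same untrusted string in three places and uses a different WordPress escaping function for each. The helper name is made up for illustration; esc_html, esc_attr and esc_textarea are the functions covered in the sections that follow.

```php
<?php
// Hypothetical helper: prints one untrusted value in three different contexts.
function myplugin_render_in_contexts( $text ) {
    $out  = '<p>' . esc_html( $text ) . '</p>';                        // HTML body
    $out .= '<input type="text" value="' . esc_attr( $text ) . '" />'; // attribute value
    $out .= '<textarea>' . esc_textarea( $text ) . '</textarea>';      // textarea content
    return $out;
}
```

With an input such as `</textarea><b>Hi</b>`, the only literal closing textarea tag left in the output should be the one the helper writes itself.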
Sometimes though, we'll need to perform our own sanitization, often because we have custom input beyond the standard post title, permalink, content etc. that WordPress handles for us.

### Escaping HTML

When printing variables to the page we need to be mindful of how the browser will interpret them. Let's consider printing a variable $title directly into the page, unescaped.

Suppose $title = <script>alert('Injected javascript')</script>. Rather than being displayed as text, the <script> tags will be interpreted as HTML and the enclosed javascript will be injected into the page.

This form of injection (as also demonstrated in the search form example) is called cross-site scripting (XSS), and this benign example belies its severity. Injected script can essentially control the browser and 'act on behalf' of the user or steal the user's cookies. This becomes an even more serious issue if the user is logged in.

To prevent variables printed inside HTML being interpreted as HTML, WordPress provides the well known esc_html function, which replaces 'unsafe' characters such as < and > by their entity equivalents.
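A minimal sketch of esc_html in use, assuming a hypothetical helper that wraps an untrusted title in a heading (the helper name is an assumption made for illustration):

```php
<?php
// Hypothetical helper that builds a heading from an untrusted title.
function myplugin_heading( $title ) {
    // esc_html() converts characters such as < and > to entities,
    // so the browser displays the markup as text instead of running it.
    return '<h1>' . esc_html( $title ) . '</h1>';
}

echo myplugin_heading( '<script>alert(1)</script>' );
// <h1>&lt;script&gt;alert(1)&lt;/script&gt;</h1>
```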

### Escaping Attributes

Now consider printing a variable $value into the value attribute of an input field.

If $value contains double quotes then, unescaped, it can jump out of the value attribute and inject script, for example by using the onfocus attribute. To escape unsafe characters (such as single and double quotes in this case), WordPress provides the function esc_attr. Like esc_html it replaces 'unsafe' characters by their entity equivalents. In fact, at the time of writing, these functions are identical, but you should still use the one that is appropriate for the context. For this example we should therefore print esc_attr($value) rather than $value.

Both esc_html and esc_attr also come with __, _e, and _x variants:

• esc_html__('Text to translate', 'plugin-domain') / esc_attr__ – returns the escaped translated text
• esc_html_e('Text to translate', 'plugin-domain') / esc_attr_e – displays the escaped translated text
• esc_html_x('Text to translate', $context, 'plugin-domain') / esc_attr_x – translates the text according to the passed context, and then returns the escaped translation
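The following sketch combines esc_attr with one of the translation variants. The helper name, the field name and the 'plugin-domain' text domain are placeholders chosen for illustration.

```php
<?php
// Hypothetical helper: builds a text input whose value is untrusted
// and whose placeholder is a translated string.
function myplugin_text_input( $value ) {
    return sprintf(
        '<input type="text" name="myplugin_field" value="%s" placeholder="%s" />',
        esc_attr( $value ),                             // untrusted value, escaped for an attribute
        esc_attr__( 'Enter a value', 'plugin-domain' )  // translated, then escaped
    );
}
```

A value such as `" onfocus="alert(1)` then stays inside the value attribute, because its double quotes are output as `&quot;`.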

### HTML Class Names

For class names, WordPress provides sanitize_html_class – this escapes variables for use in class names, simply by restricting the returned value to alpha-numerics, hyphens and underscores. Note: It does not ensure the class name is valid (reference: http://www.w3.org/TR/CSS21/syndata.html#value-def-identifier).

In CSS, identifiers can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code.
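A short sketch of sanitize_html_class, assuming a hypothetical helper that builds a class attribute. The optional second argument is a fallback that is used when nothing survives sanitization; the fallback name here is made up.

```php
<?php
// Hypothetical helper: turns an untrusted string into a class attribute.
function myplugin_class_attr( $class ) {
    // Characters other than A-Z, a-z, 0-9, hyphen and underscore are removed.
    // The second argument is used when the result would otherwise be empty.
    return 'class="' . sanitize_html_class( $class, 'myplugin-default' ) . '"';
}

echo myplugin_class_attr( 'my class!' ); // class="myclass"
echo myplugin_class_attr( '!!!' );       // class="myplugin-default"
```

Note that a value such as `123abc` passes through unchanged even though it starts with a digit, which is the validity caveat mentioned above.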

### Escaping URLs

Let's now look at another common practice: printing variables into the href attribute.

Clearly this is vulnerable to the same form of attack as illustrated in escaping HTML and attributes. But what if $url were set to a javascript: URL that calls alert? On clicking the link, the alert function would be fired. This contains no HTML, nor any quotes that allow it to jump out of the href attribute, so esc_attr is not sufficient here. This is why context matters: esc_attr($url) would be safe in the title attribute, but not in the href attribute. This is because of the javascript protocol which, while perfectly valid, is not to be considered safe in this context. Instead you should use esc_url or esc_url_raw.

esc_url strips out various offending characters, and replaces quotes and ampersands with their entity equivalents. It then checks that the protocol being used is allowed (javascript, by default, isn't).

esc_url_raw is almost identical to esc_url, but it does not replace ampersands and single quotes with entities. Use it when the URL is to be used as a URL (for example, stored in the database or used in a redirect) rather than displayed.

In the link example above we are displaying the URL, so we use esc_url.

Although not necessary in most cases, both functions accept an optional array to specify which protocols (such as http, https, ftp, ftps, mailto, etc) you wish to allow.
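As an illustration, here is a sketch of two hypothetical link helpers: one using the default protocol list, and one passing the optional protocols array. The helper names are assumptions.

```php
<?php
// Hypothetical helper: builds a link from an untrusted URL and label.
function myplugin_link( $url, $label ) {
    return '<a href="' . esc_url( $url ) . '">' . esc_html( $label ) . '</a>';
}

// Same idea, but only http and https URLs are accepted.
function myplugin_web_link( $url, $label ) {
    return '<a href="' . esc_url( $url, array( 'http', 'https' ) ) . '">' . esc_html( $label ) . '</a>';
}

echo myplugin_link( 'javascript:alert(1)', 'Click' );        // <a href="">Click</a>
echo myplugin_web_link( 'ftp://example.com/file', 'File' );  // <a href="">File</a>
```

When the protocol is not allowed, esc_url returns an empty string, so the link is rendered harmless rather than removed.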

### Escaping JavaScript

Sometimes you'll want to print PHP variables into inline javascript on a page (usually in the head).

In fact, if you are doing this, you should almost certainly be using wp_localize_script(), which handles sanitization for you. (If anyone can think of a reason why you might need to print the variable inline instead, I would like to hear it.)

However, to make an inline variable safe, you can use the esc_js function.
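Both approaches might look roughly like this. The function names, the script handle, the file path and the JavaScript variable names are all placeholders, not part of any real plugin.

```php
<?php
// Hypothetical helper: prints an untrusted string into an inline script.
function myplugin_inline_title_script( $title ) {
    // esc_js() escapes quotes and fixes line endings so the value
    // cannot break out of the single-quoted string literal.
    return "<script>var mypluginTitle = '" . esc_js( $title ) . "';</script>";
}

// The usually preferable route: attach the data to an enqueued script.
function myplugin_enqueue_scripts() {
    wp_enqueue_script( 'myplugin-script', plugins_url( 'js/myplugin.js', __FILE__ ), array(), '1.0', true );
    wp_localize_script( 'myplugin-script', 'mypluginData', array( 'siteName' => get_bloginfo( 'name' ) ) );
}
add_action( 'wp_enqueue_scripts', 'myplugin_enqueue_scripts' );
```

With wp_localize_script, the data becomes available to the script as a JavaScript object (here `mypluginData.siteName`), without any hand-written inline script.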

### Escaping Textarea

When displaying content in a textarea, esc_html is not sufficient because it does not double encode entities. For example:
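The difference can be sketched in plain PHP with htmlspecialchars and its double-encode flag. This only approximates what esc_html and esc_textarea do internally, and the sample string is an assumption chosen for illustration.

```php
<?php
// Content the user wants to see literally in the textarea: the text "&lt;b&gt;".
$var = 'Use &lt;b&gt; for bold';

// Approximates esc_html(): existing entities are NOT encoded again.
$single = htmlspecialchars( $var, ENT_QUOTES, 'UTF-8', false );

// Approximates esc_textarea(): every & is encoded, including those in existing entities.
$double = htmlspecialchars( $var, ENT_QUOTES, 'UTF-8', true );

echo $single . "\n"; // Use &lt;b&gt; for bold          (browser shows: Use <b> for bold)
echo $double . "\n"; // Use &amp;lt;b&amp;gt; for bold  (browser shows: Use &lt;b&gt; for bold)
```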

With esc_html, entities already present in $var are left as they are, so the browser decodes them and the textarea shows the decoded characters, rather than the & of each entity also being encoded as &amp;. For this WordPress provides esc_textarea, which is almost identical to esc_html, but does double encode entities. Essentially it is little more than a wrapper for htmlspecialchars.

### Antispambot

Displaying e-mail addresses on your website leaves them prone to e-mail harvesters. One simple countermeasure is to disguise the e-mail address. WordPress provides antispambot, which encodes random parts of the e-mail address into their HTML entities (and hexadecimal equivalents if $mailto = 1). On each page load the encoding should be different and, while the returned address renders correctly in the browser, it should appear as gobbledygook to the spambots. The function accepts two arguments:

• e-mail – the address to obfuscate
• mailto – 1 or 0 (1 if using the mailto protocol in a link tag)
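A sketch of antispambot in a hypothetical helper that builds a mailto link. Validating the address with is_email first is an addition of this example, not something antispambot requires.

```php
<?php
// Hypothetical helper: returns an obfuscated mailto link,
// or an empty string when the input is not a valid address.
function myplugin_email_link( $email ) {
    if ( ! is_email( $email ) ) {
        return '';
    }
    // Second argument 1: also use hex encoding, intended for the mailto: href.
    return '<a href="mailto:' . antispambot( $email, 1 ) . '">' . antispambot( $email ) . '</a>';
}
```

Because the encoding is randomized, the exact markup differs between page loads, so it should not be compared against a fixed string.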

### Query Strings

If you wish to add (or remove) variables from a query string (this is very useful if you wish to allow users to select an order for your posts), the easiest way is to use add_query_arg and remove_query_arg. These functions take care of building the query string, but they do not escape it for output: values should be URL-encoded (for example with rawurlencode) before being passed in, and the returned URL should be passed through esc_url before it is printed.

add_query_arg accepts two arguments:

• query parameters – an associative array of parameters -> values
• url – the URL to add the parameters and their values to. If omitted, the URL of the current page is used

remove_query_arg also accepts two arguments: the first is an array of parameters to remove (or a single parameter name), and the second is the URL, as above.
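A sketch of both functions, assuming a hypothetical helper that builds a 'sort by' link. The helper name and the example.com URL are placeholders.

```php
<?php
// Hypothetical helper: link that re-sorts a listing by the given field.
function myplugin_sort_link( $base_url, $orderby ) {
    // Encode the value ourselves, then escape the finished URL for output.
    $url = add_query_arg( array( 'orderby' => rawurlencode( $orderby ) ), $base_url );
    return '<a href="' . esc_url( $url ) . '">' . esc_html__( 'Sort', 'plugin-domain' ) . '</a>';
}

// Removing a parameter again.
$clean = remove_query_arg( array( 'orderby' ), 'http://example.com/blog/?orderby=title&paged=2' );
// $clean is 'http://example.com/blog/?paged=2'
```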

## Validation & Sanitization

As previously mentioned, sanitization doesn't make much sense without a context – so it's pretty pointless to sanitize data when writing to the database. Often, you need to store data in its raw format anyway, and in any case – Rule No. 1 dictates that we should always sanitize on output.

Validation of data, on the other hand, should be done as soon as it's received and before it's written to the database. The idea is that 'invalid' data should either be auto-corrected or be flagged to the user, and only valid data should be given to the database.

That said – you may want to also perform validation when data is displayed too. In fact sometimes, 'validation' will also ensure the data is safe. But the priority here is on safety and you should avoid excessive validation that would run on every page load (the wp_kses_* functions, for instance, are very expensive to perform).
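The split between validating on input and escaping on output might look like the sketch below. The function names, meta keys and input array shape are assumptions made for illustration; is_email, absint, sanitize_text_field and the post meta functions are standard WordPress functions.

```php
<?php
// Hypothetical save routine: validate as soon as the data is received.
function myplugin_save_contact( $post_id, $input ) {
    $email = isset( $input['email'] ) ? sanitize_text_field( $input['email'] ) : '';
    if ( ! is_email( $email ) ) {
        // Flag invalid data to the user instead of storing it.
        return new WP_Error( 'invalid_email', __( 'Please enter a valid e-mail address.', 'plugin-domain' ) );
    }

    // Auto-correct: cast the age to a non-negative integer.
    $age = isset( $input['age'] ) ? absint( $input['age'] ) : 0;

    update_post_meta( $post_id, '_myplugin_email', $email );
    update_post_meta( $post_id, '_myplugin_age', $age );
    return true;
}

// Hypothetical display routine: escape for the context at output time.
function myplugin_show_contact( $post_id ) {
    $email = get_post_meta( $post_id, '_myplugin_email', true );
    return '<span class="email">' . esc_html( $email ) . '</span>';
}
```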

## Database Escaping

When using functions such as get_posts or classes such as WP_Query and WP_User_Query, WordPress takes care of the necessary sanitization in querying the database. However, when retrieving data from a custom table, or otherwise performing a direct SQL query on the database, proper sanitization is up to you. WordPress, however, provides a helpful class, the $wpdb class, that helps with escaping SQL queries.

Let's consider a basic 'SELECT' command, where $age and $firstname are variables storing an age and name that we are querying, and both are placed directly into the SQL string. We have not escaped these variables, so potentially further commands could be injected. Borrowing xkcd's example from above, a crafted name would end the intended statement early, run a second command, and delete our entire Students table.

To prevent this, we can use the $wpdb->prepare method. This accepts two parameters:

• The SQL command as a string, where strings are replaced by the placeholder %s, integers by %d and floats by %f
• An array of values for the above placeholders, in the order they appear in the query (the values can also be passed as separate arguments)

In this example:
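A sketch of such a query, assuming a hypothetical custom table named students (with the site's table prefix) that has age and firstname columns. The table, column and function names are assumptions.

```php
<?php
// Hypothetical lookup in a custom 'students' table.
function myplugin_find_students( $age, $firstname ) {
    global $wpdb;

    // Unsafe: "SELECT * FROM students WHERE age = $age AND firstname = '$firstname'"
    // A name such as  Robert'; DROP TABLE students;--  would end the statement early.

    // Safer: the placeholders are filled in and escaped by prepare().
    $sql = $wpdb->prepare(
        "SELECT * FROM {$wpdb->prefix}students WHERE age = %d AND firstname = %s",
        array( $age, $firstname )
    );

    return $wpdb->get_results( $sql );
}
```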

The escaped SQL query ($sql in this example) can then be used with one of the methods:

• $wpdb->get_row($sql)
• $wpdb->get_var($sql)
• $wpdb->get_results($sql)
• $wpdb->get_col($sql)
• $wpdb->query($sql)

### Inserting and Updating Data

For inserting or updating data, WordPress makes life even easier by providing the $wpdb->insert() and $wpdb->update() methods.

The $wpdb->insert() method accepts three arguments:

• Table name – the name of the table
• Data – array of data to insert as column->value pairs
• Formats – array of formats for the corresponding values ('%s', '%d' or '%f')
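The three arguments might be used as in the sketch below, again assuming a hypothetical students table with firstname and age columns (the table, columns and function name are assumptions).

```php
<?php
// Hypothetical insert into a custom 'students' table.
function myplugin_add_student( $firstname, $age ) {
    global $wpdb;

    $inserted = $wpdb->insert(
        $wpdb->prefix . 'students',                         // table name
        array( 'firstname' => $firstname, 'age' => $age ),  // column => value pairs
        array( '%s', '%d' )                                 // formats, in the same order
    );

    // insert() returns false on failure; on success the new row's ID
    // is available as $wpdb->insert_id.
    return false === $inserted ? false : $wpdb->insert_id;
}
```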

The $wpdb->update() method accepts five arguments:

• Table name – the name of the table
• Data – array of data to update as column->value pairs
• Where – array of data to match as column->value pairs
• Data Format – array of formats for the corresponding data values
• Where Format – array of formats for the corresponding 'where' values

Both the $wpdb->insert() and the $wpdb->update() methods perform all the necessary sanitization for writing to the database.

### Like Statements

Because the $wpdb->prepare method uses % to distinguish the placeholders, care needs to be taken when using the % wildcard in SQL LIKE statements. The Codex suggests escaping them with a second %. Alternatively you can escape the term to be searched for with like_escape and then add the wildcard % where appropriate, before including this in the query using the prepare method. (Since WordPress 4.0, like_escape has been deprecated in favour of $wpdb->esc_like().)
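A sketch of a LIKE query built this way, using $wpdb->esc_like (the replacement for like_escape since WordPress 4.0). The students table, its firstname column and the function name are assumptions.

```php
<?php
// Hypothetical search in a custom 'students' table.
function myplugin_search_students( $term ) {
    global $wpdb;

    // Escape %, _ and \ in the search term, then add our own wildcards.
    $like = '%' . $wpdb->esc_like( $term ) . '%';

    // The whole pattern is passed as a value, so prepare() sees no stray % in the query.
    $sql = $wpdb->prepare(
        "SELECT * FROM {$wpdb->prefix}students WHERE firstname LIKE %s",
        $like
    );

    return $wpdb->get_results( $sql );
}
```

Adding the wildcards to the value, rather than writing them into the query string, avoids having to double any % signs in the SQL itself.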

This isn't an exhaustive list of the functions available for validation and sanitization, but it should cover the vast majority of use cases. A lot of these (and other) functions can be found in /wp-includes/formatting.php and I'd strongly recommend digging into the core code and having a look into how WordPress core does validation and sanitization of data.