Form Validation And Why You Are Probably Doing It Wrong - Part 2
In the previous part of the article I outlined how a lot of developers are validating their forms badly, letting assumptions blinker their otherwise decent logical skills. There I used names as an example; an easy shot with obvious issues. But what about something more specific in its structure, something that has a well-defined pattern, like a domain name?
Domain names are integral to the web, and quite an important part of some forms. Imagine a competition entry form where you need to add your website or Twitter URL. That needs to be correctly validated, and it's not as simple as a lot of people seem to think.
The first thing to note is that a domain can't be more than 253 characters long, and each label (sub-domain, domain, etc.) can be no more than 63 characters. These limits apply to the ASCII form, so IDNs, which get converted to ASCII Punycode, reach them more easily than plain ASCII domains.
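As a rough illustration of those limits, here is a minimal sketch in Python (the same idea carries over to PHP or any other server-side language). The function name is my own invention, and it leans on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so treat it as a starting point rather than a complete validator.

```python
def within_dns_length_limits(domain: str) -> bool:
    """Return True if the ASCII form of `domain` fits the 253/63 limits."""
    # A single trailing dot (the DNS root) is legal and is not counted here.
    if domain.endswith("."):
        domain = domain[:-1]
    if not domain:
        return False

    ascii_labels = []
    for label in domain.split("."):
        try:
            # The built-in "idna" codec converts a label to its Punycode form.
            ascii_labels.append(label.encode("idna"))
        except UnicodeError:
            # Raised for labels the codec refuses, including over-long ones.
            return False

    # Every label must be between 1 and 63 bytes once converted to ASCII.
    if any(not 1 <= len(label) <= 63 for label in ascii_labels):
        return False
    # The total length counts the dots between the labels as well.
    return len(b".".join(ascii_labels)) <= 253
```

Note that a label of 60 non-ASCII characters looks short enough on screen, but fails this check because its Punycode form is longer than 63 bytes.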
Starting at the Top
I'm sure most of us have seen domain name validation that looks something like this:
- Starts with a
- Has a main domain name part consisting of one or more letters
- If they're really thinking, then there might be 2 or 3 of those, e.g.
- Ends in a known domain name, like
The thing is, that's just not going to cut the mustard these days. Let's consider just the TLD. The list of possible ones is huge, ranging from the traditional .co.uk, right the way through to .recipes (yes, this latter one exists and, ironically, is operated by a company called Donuts!). You can see a fuller list at https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains.
If you scan down the list a little to the section on internationalized country code top-level domains (https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Internationalized_country_code_top-level_domains) you'll see that TLDs aren't restricted to pure ASCII. They can be represented in ASCII using a method called Punycode (https://en.wikipedia.org/wiki/Punycode), which allows older technologies to access the domains.
Beyond these are some of the more recent TLD additions, which are more commercial in nature, sometimes with a whole top-level domain belonging to a single company.
Take a minute to consider that the complete range of TLDs now runs to well over a thousand; something that is not feasible to place into a regular expression (or at least not one that any human ever wants to maintain!).
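One alternative to a giant regular expression is to check the last label against a list that is loaded as data. The sketch below is in Python and uses a tiny hand-made sample; the helper names are hypothetical. IANA publishes a plain-text TLD list (https://data.iana.org/TLD/tlds-alpha-by-domain.txt), and the format assumed here (one entry per line, with `#` comment lines) is modelled on it.

```python
def parse_tld_list(text: str) -> set[str]:
    """Parse one-TLD-per-line text, skipping blank lines and '#' comments."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.add(line.lower())
    return tlds


def has_known_tld(domain: str, tlds: set[str]) -> bool:
    """Check the right-most label of `domain` against the loaded set."""
    last_label = domain.rstrip(".").rpartition(".")[2]
    return last_label.lower() in tlds


# A tiny hand-made sample; a real list would be loaded from a file or URL
# and refreshed regularly, because new TLDs keep being added.
SAMPLE = """# sample data, not the real list
COM
UK
RECIPES
XN--P1AI
"""
KNOWN_TLDS = parse_tld_list(SAMPLE)
```

With this sample, `has_known_tld("cake.recipes", KNOWN_TLDS)` passes and `has_known_tld("cake.notatld", KNOWN_TLDS)` does not. Updating the list is then a data change rather than a code change.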
The Middle Ground
So, with the huge crazy ball of fun that is the TLD, what awaits us inside the rest of the domain name? Well, this is a bit more of the same; each part of the domain is separated by periods. The difference here is that, unlike the TLD, the rest of the domain name can contain almost any other character. This is why the following domains are completely valid:
Using a regular expression for these is also a pretty gargantuan task given the sheer range of characters that are allowed. In fact, it's probably easier to allow everything except for a few specific characters, as false positives are probably better than false negatives when it comes to checking the validity of a domain.
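A permissive check along those lines might look something like the Python sketch below. The set of rejected characters is purely my own assumption for illustration, as is the function name; adjust both to suit what your form actually needs to keep out.

```python
# Characters rejected outright. This deny-list is an illustrative assumption,
# not something taken from a specification.
FORBIDDEN = set(' \t\r\n/\\@:?#"<>')


def looks_like_domain(value: str) -> bool:
    """Permissive check: accept anything that is not obviously broken."""
    value = value.strip()
    if not value or any(ch in FORBIDDEN for ch in value):
        return False
    # Ignore a single trailing dot, which stands for the DNS root.
    if value.endswith("."):
        value = value[:-1]
    labels = value.split(".")
    # Require at least two labels, none of them empty.
    return len(labels) >= 2 and all(labels)
```

This deliberately lets through plenty of things that will never resolve, which is the trade-off described above: a few false positives in exchange for not turning away a real visitor.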
The Beginning of the End
The last part of the domain (as they are read right-to-left) is the first sub-domain. For the majority of websites this defaults to www, but sometimes others are used (typically for websites that want to create separate parts, like a shop, for example), and on a few occasions it is not used at all.
You've Got Mail
So now you've seen how complicated domain names are, what about an email address? That's easy, right?
- A name
- Starts with a letter
- Has one or more letters, numbers, or hyphens after it
- There might be more than one name separated by a period
- An @ sign
- A domain (using the same domain checking logic as seen in the previous section)
If only it were that simple. Just the name part is incredibly complicated.
What's in a Name?
The local name part of an email is more than just a few letters and numbers before the @ symbol. In fact, the full specification allows it to even contain more @ symbols! The basics are:
- Any character that is valid for a domain name label (see above) can be used in the local name, including
- Almost any other character can be used if quoted within double quotes when that quoted string is a whole label of the local part. The following are valid (examples from Wikipedia):
The specification does advise that email addresses like these are best not used, because systems that handle email couldn't be bothered to implement the specification in the first place and will break when encountering such addresses. That being said, there are still systems out there that can handle them correctly, which means that an email address using these formatting features could exist in the wild. You'd be remiss to ignore them just because it's less work.
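Given all of that, a forgiving check is often the pragmatic choice. The Python sketch below is one way to do it, with hypothetical function names. It splits on the last @ (because a quoted local part may contain one too), leaves the local part alone, and only assumes that the domain is made of dot-separated labels. That last assumption is mine, and it will turn away addresses at dotless hosts or IP literals.

```python
def split_email(address: str):
    """Split on the LAST '@', since a quoted local part may contain '@' too."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        return None
    return local, domain


def looks_like_email(address: str) -> bool:
    """Permissive check: something, then '@', then a dotted domain."""
    parts = split_email(address.strip())
    if parts is None:
        return False
    _local, domain = parts
    # The local part is deliberately not inspected any further.
    labels = domain.split(".")
    return len(labels) >= 2 and all(labels)
```

The only way to be sure an address really works is still to send a message to it and have the recipient confirm.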
There are many other types of input which are fairly standard in online forms and which are, thankfully, easier to validate than domains and email addresses, but are still prone to being validated in the wrong way.
The next most abused form of validation is that which deals with numbers. In regular parlance we refer to many things as numbers when they are actually not:
- Phone numbers
- in many places these start with leading zeros, and that fact alone makes them unsuitable for being represented as numbers. Also, phone numbers can (and often do) contain other symbols like + to identify country codes, brackets to group digits for area codes, and even spaces to make them more readable. Even the length of a phone number is not something that can easily be counted on. Short 5-digit numbers are often used for SMS services, and the length of regular phone numbers varies a lot from country to country.
- Zip Codes (the American kind)
- Again, numbers are not a suitable representation for these, as they can start with a leading zero (for places like Massachusetts or Puerto Rico), which means that treating them as pure numbers can break them.
- Bank Account Numbers
- Another thing that is a string of digits and spaces, and can have a leading zero.
The main problem that can occur with validation of these 'numbers' is the validation method. The easiest way to validate a number in form input is to see if it can be cast as a number with intval() (which is faster and more efficient than using string parsing on the input). However, given what we've seen, some inputs are not numbers despite their name, and are actually strings which look a bit like numbers. Inputs like these should be treated as text.
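To make the point concrete, here is a small Python sketch of treating these inputs as text. The helper names, and the set of formatting characters stripped from phone numbers, are my own assumptions.

```python
import re

# Formatting characters people commonly type into phone fields (an assumption).
PHONE_FORMATTING = re.compile(r"[\s().-]")
# Five digits, optionally followed by the ZIP+4 extension.
US_ZIP = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def normalise_phone(raw: str):
    """Strip formatting; return the digits (with any leading '+') or None."""
    cleaned = PHONE_FORMATTING.sub("", raw)
    digits = cleaned[1:] if cleaned.startswith("+") else cleaned
    # isascii() guards against non-ASCII digits that str.isdigit() accepts.
    if digits.isascii() and digits.isdigit():
        return cleaned
    return None


def is_us_zip(raw: str) -> bool:
    """Accept 5-digit ZIP codes and the ZIP+4 form, always as text."""
    return US_ZIP.fullmatch(raw.strip()) is not None


# Casting to a number silently drops the leading zero.
as_number = int("02134")  # 2134, which is no longer a valid ZIP code
```

The result of `normalise_phone()` is still a string, so a leading zero or a + survives all the way to storage.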
It's A Date
We all know what a date looks like, right? It's the day, then the month, then the year. But is that month a number or a month name? Do numbers have leading zeros if they're single digits? Is the year 2 or 4 digits long? Are month names 3 letters or the full name? Oh wait, the website is American so swap the month and the day round. Is the date in the form one single input, or is it 3 individual form fields that you need to validate?
A lot of questions for something so simple that we were able to write on our school work when we were 5.
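Whatever the answers are for your form, the safest approach I know of is to pick one format, state it clearly next to the field, and parse against exactly that. A minimal Python sketch, with a hypothetical helper name:

```python
from datetime import date, datetime


def parse_date(raw: str, fmt: str):
    """Parse `raw` against one explicit format; return a date or None."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        # Wrong shape, or an impossible date such as 31 February.
        return None


# The same input means two different days depending on the agreed format.
day_first = parse_date("03/04/2021", "%d/%m/%Y")    # 3 April 2021
month_first = parse_date("03/04/2021", "%m/%d/%Y")  # 4 March 2021
```

Both calls succeed, which is exactly the problem: the input alone can't tell you which reading was intended, so the form has to.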
File uploads are a less common element of forms compared to some of the things already mentioned, but are arguably one of the most important to get right given the security risk they pose. Most of the time I see file validation as a simple check on the file extension. This is a naive approach and doesn't take into account that a file's contents may not match the extension it has (and sometimes a file doesn't even need an extension).
One method is to check the file's contents to determine the type, which can be done with various server-side tools, like the built-in mime_content_type() in PHP.
If you don't have access to one of those tools then you can attempt to verify based on the files you want to accept:
- you can attempt to generate a thumbnail with something like GD or ImageMagick. If they are unable to read in the file as a valid image, then there's a chance that it actually isn't.
- modern document types like .xlsx are actually zip files, so you can attempt to open them and read in a list of their contents. For older formats, there are other server-side plugins and modules you can use to read in things like the document metadata to determine if it appears to be valid or not.
- Other types
- generally, the first few bytes of a file identify its type. JPEG images in the common JFIF format contain the letters 'JFIF' within the first few bytes, and Windows executables start with 'MZ'. You can open a few different files on your own machine to see what they contain. Doing this on the server is not a bad last resort if you have no better tools.
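The checks in the list above can be sketched in a few lines of Python using only the standard library. The function names are hypothetical and the signature table is a small sample, not an exhaustive list. The zip check looks for `[Content_Types].xml`, the manifest that Office Open XML packages such as .docx and .xlsx carry.

```python
import io
import zipfile

# Leading bytes ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"MZ", "windows-executable"),
]


def sniff_type(data: bytes):
    """Return a label for the first matching signature, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def looks_like_ooxml(data: bytes) -> bool:
    """True if `data` is a readable zip that contains [Content_Types].xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

A matching signature only tells you what a file claims to be at its very start, so treat this as one layer of checking rather than proof that the rest of the file is harmless.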
A more advanced validation method you can perform for files on the server is to run an anti-virus check on them. Bear in mind, though, that this is an incredibly slow process (compared to the typical validation already discussed in the article) but essential if you're expecting to process a multitude of file types that could be virus containers. These would range from the obvious, .dmg, etc, right the way through to the ones that are less so; .jar. I won't go into how to set that up here, as it's a bit out of scope for this article, but consider this route if it is right for your needs.
Form validation is hard, and easy, and everything in between. The one good thing is that the basic elements of security and validation don't change much. New aspects are added over time as technology improves, but the fundamentals remain stalwart. Consider each part of form data as its own entity, and think about exactly what that entity really is. If you're unsure, it is never a bad idea to do some research. It takes minutes to discover that you can't treat a phone number as a real number, but it could take weeks to rectify this mistake if you've validated and stored them as such!