Ashley Sheridan​.co.uk

Form Validation And Why You Are Probably Doing It Wrong - Part 1

Forms are an integral part of websites, whether it be a simple contact form on your personal blog, an address form for a shopping delivery, or a complex tax return form.

If you're the one in charge of that form, you will be validating the user input (if you're not, you're an idiot), but have you ever stopped to think about exactly what it is you're validating?

Validate Twice, Process Once

There's an old carpentry proverb, “measure twice and cut once”, which basically boils down to double-checking things before you do something that can't be undone. The same is true for forms and processing the user input (although for slightly different reasons). There are two ways you can do validation: front-end through JavaScript or HTML5, and back-end on the server, through PHP, .NET, or Node.js. Don't be fooled, they actually perform very different tasks.

Front-end validation is not, I repeat, not, a security mechanism. It will do nothing to protect you whatsoever. The only real purpose it serves is to provide more real-time feedback to your visitor/user that something is not right. They get an immediate indication they typed in something wrong in the email field, instead of having to submit the form and wait for the page to reload with an error (which can be a long time on a slow connection.)

The back-end validation is the real stalwart hero here. It can perform much more thorough validation (like checking that the user details form being submitted really belongs to the user who's logged in, or running a virus scan on an uploaded file) and it can't be turned off or disabled like the front-end.
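As a rough illustration of a check that only the back end can make, here is a minimal Java sketch. The class name, the `validate` method and its parameters are all hypothetical, made up for this example; a real application would take the logged-in user ID from its own session handling rather than from a plain argument.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical server-side check; every name here is invented for illustration.
class ServerSideCheck {

    // loggedInUserId comes from the server's session, the other two values
    // come from the submitted form and are fully under the client's control.
    static List<String> validate(String loggedInUserId, String submittedOwnerId, String submittedEmail) {
        List<String> errors = new ArrayList<>();
        // Ownership check: the browser cannot be trusted to send the right user ID.
        if (loggedInUserId == null || !loggedInUserId.equals(submittedOwnerId)) {
            errors.add("You can only edit your own details.");
        }
        // Re-check the field on the server even if the front end already did.
        if (submittedEmail == null || submittedEmail.trim().isEmpty()) {
            errors.add("Please enter an email address.");
        }
        return errors; // an empty list means the submission may be processed
    }

    public static void main(String[] args) {
        System.out.println(validate("42", "42", "someone@example.com"));
        System.out.println(validate("42", "7", ""));
    }
}
```

The point of the sketch is only that these checks run regardless of what the browser did or did not validate.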

So what's the issue? You've already set up your validation at both ends, so that's it, right?

Bad Assumptions - What's in a Name

Validation is hard, which is probably why so many people do it wrong. Usually this stems from developers being too sure about exactly what the data is and how it is structured.

Usually the first thing I encounter on validation that isn't working correctly is that it won't accept so-called "special" characters (I'm not a big fan of this moniker, but for the article it will do).

Think about the last time you wrote some validation for a forename field in a form. I'll make a few guesses which are probably accurate for the majority of you:

  • You used regular expressions.
  • Your regex looked like '^[a-z]+$' because names are just the letters a-z, right?
  • If you were really putting the extra effort in, your regex was more like '^[a-z \'\-]+$' because you knew names could have hyphens and apostrophes in them.

If you've used this sort of thing on your form, then congratulations, you've successfully blocked the majority of the world from using it.

The thing about names is, they can pretty much be anything. All of the following are valid:

  • Talula Does The Hula From Hawaii (this was a real name given to a girl in New Zealand until a judge ruled for a rename)
  • 艾什莉
  • Caroll-Smith-Jones O'Leary
  • Renée
  • Zoë (quite popular in the UK)
  • Queen Elizabeth 2nd (I'm taking liberties here slightly, but some names do contain actual numbers)

Names don't follow the rules you probably thought they did. Essentially, a name can be almost anything, and the naming conventions in one country won't necessarily hold true for another. The main thing to remember is that names are not just the letters a to z; they can and do have all sorts of other things in them.
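To make that concrete, here is a small Java sketch that runs the "extra effort" pattern from earlier against the names listed above. The class and method names are hypothetical, and I've assumed the pattern is applied case-insensitively, which the original regex doesn't state. The non-ASCII names are written as Unicode escapes purely so the source file stays plain ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the ASCII-only one discussed above.
class AsciiNameCheck {

    // Letters a-z, space, apostrophe and hyphen only (case-insensitive is an assumption).
    static final Pattern ASCII_NAME =
            Pattern.compile("^[a-z '\\-]+$", Pattern.CASE_INSENSITIVE);

    static boolean isAccepted(String name) {
        return ASCII_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "Talula Does The Hula From Hawaii",
            "\u827e\u4ec0\u8389",            // the Chinese name from the list
            "Caroll-Smith-Jones O'Leary",
            "Ren\u00e9e",                    // Renee with e-acute
            "Zo\u00eb",                      // Zoe with e-diaeresis
            "Queen Elizabeth 2nd"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isAccepted(name) ? "accepted" : "rejected"));
        }
    }
}
```

Only the two names made up entirely of unaccented Latin letters, spaces, hyphens and apostrophes get through; everything else on the list is turned away.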

I realise that many of you may be thinking: “well, this form is only for a small competition open for a few months and only to people in the UK, and they all have regular names, right?” This is actually very blinkered thinking. As you can see in the above list, some quite popular names in the UK contain letters with diacritics (although many people may just type 'ë' as 'e' because it's easier on their keyboard).

So, what's the solution?

Winning the Name Game

The answer is actually very simple, and you don't have to completely throw out your regular expressions (because we all love them and they're so easy to read, right?!). You just convert them to Unicode regexes.

So, the old one would look something like:

^[a-z \'\-]+$

Simply swap out the 'a-z' part for '\p{L}':

^[\p{L} \'\-]+$

What does that do? Quite simply, \p{L} matches any letter, from any script, in any case (for those languages that have upper and lower-case letters).
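Here is a minimal sketch of that swap in Java, one of the languages that supports these properties natively. The class and method names are hypothetical, and I've assumed a space belongs in the character class so that multi-part names can pass. Non-ASCII names are again written as Unicode escapes to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing \p{L} in place of a-z.
class UnicodeNameCheck {

    // Any letter from any script, plus space, apostrophe and hyphen.
    static final Pattern UNICODE_NAME = Pattern.compile("^[\\p{L} '\\-]+$");

    static boolean isAccepted(String name) {
        return UNICODE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "\u827e\u4ec0\u8389",            // the Chinese name from the list
            "Caroll-Smith-Jones O'Leary",
            "Ren\u00e9e",                    // Renee with e-acute
            "Zo\u00eb",                      // Zoe with e-diaeresis
            "Queen Elizabeth 2nd"            // contains a digit, which is not a letter
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isAccepted(name) ? "accepted" : "rejected"));
        }
    }
}
```

With this pattern the accented and Chinese names are accepted. "Queen Elizabeth 2nd" is still rejected because of the digit; if you want to allow numbers in names, adding \p{N} to the character class would cover that.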

These extensions let you do a lot more than just match letters, though. If you really do have a genuine need to only allow letters from a specific language's script, then you can do things like:

\p{Greek}
\p{Egyptian_Hieroglyphs}

Although, in practice this last method is probably less useful than you might first assume.
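One thing to watch is that the spelling of script properties varies between regex engines. The \p{Greek} form above follows the PHP documentation linked below, whereas Java's documented spelling is \p{IsGreek} (or \p{script=Greek}). A small sketch, with hypothetical class and method names:

```java
import java.util.regex.Pattern;

// Hypothetical demo class: restricts input to characters in the Greek script.
class ScriptCheck {

    // Java writes script properties with an "Is" prefix.
    static final Pattern GREEK_ONLY = Pattern.compile("^\\p{IsGreek}+$");

    static boolean isGreek(String text) {
        return GREEK_ONLY.matcher(text).matches();
    }

    public static void main(String[] args) {
        // U+0396 U+03C9 U+03AE spells the name Zoe in Greek letters.
        System.out.println(isGreek("\u0396\u03c9\u03ae"));
        // The same name in Latin letters belongs to a different script.
        System.out.println(isGreek("Zoe"));
    }
}
```

Note that spaces, hyphens and apostrophes are not part of the Greek script, so a pattern like this rejects any name that contains them, which is one reason script matching tends to be less useful than it first appears.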

A full list of supported escape sequences can be found at http://php.net/manual/en/regexp.reference.unicode.php, but the most useful are:

Property  Matches
\p{L}     Any letter
\p{Ll}    Lower-case letter
\p{Lu}    Upper-case letter
\p{Lt}    Title-case letter (e.g. Dž which is actually a single character but may be entered as two)
\p{N}     Number (e.g. 0-9, ①-⑨, ⑴-⒇, etc.)
\p{P}     Punctuation
\p{Sc}    Currency symbol (e.g. $฿€₹£)
\p{Zs}    Space separator
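To get a feel for what these properties match, here is a small Java sketch that tests a few sample characters against them. The `has` helper and the class name are hypothetical, written only for this demonstration; the characters are given as Unicode escapes, with the code point named in each comment.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for trying out individual Unicode properties.
class CategoryCheck {

    // True if every character in text has the given property, e.g. has("Sc", "$").
    static boolean has(String property, String text) {
        return Pattern.matches("\\p{" + property + "}+", text);
    }

    public static void main(String[] args) {
        System.out.println(has("Sc", "$\u20ac\u00a3")); // dollar, euro (U+20AC), pound (U+00A3)
        System.out.println(has("N", "7\u2460"));        // digit 7 and circled digit one (U+2460)
        System.out.println(has("Lt", "\u01c5"));        // the single title-case character U+01C5
        System.out.println(has("Lu", "\u01c5"));        // title-case is a separate category from upper-case
        System.out.println(has("Zs", " "));             // ordinary space
        System.out.println(has("P", "!?"));             // punctuation
    }
}
```

Every line prints true except the \p{Lu} check, since the title-case character has its own category.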

Supporting the Team

There are some caveats to using these Unicode regular expressions. Currently, they are supported by the following languages out of the box:

  • .NET
  • Java
  • Ruby
  • Perl
  • PHP

Unfortunately, the language where this feature would arguably be most useful has historically not supported it: JavaScript. Native \p{…} escapes only arrived with ES2018, and only in regexes that carry the u flag. For engines without that support there is a library which adds it, available at http://xregexp.com/plugins/. This lets you use these expressions both in the browser and in Node.js on the server. However, be prepared for a performance hit, as the library's implementation is not native to the language.

In the next part I'll go over validating other data types, like emails, URLs, and numbers.
