My users seem to have a problem entering email addresses. Not too often, say 1 in 50, but it is an issue. Forcing them to enter the address twice would help them catch some typos and would prevent some invalid emails, but I suspect those who make the most errors would defeat the duplicate entry with a simple copy-and-paste. I need a way to give them better feedback without extra typing work.

I find the official spec for email addresses to be too liberal. There are complete email parsers out there but I find they approve too much when compared to real world situations. They will allow things like embedded comments within an email address and formats I’ve never seen in practice.

My goal is to balance completeness with practical real world usage. Here is my JavaScript mask.

^[-_\.+'\da-z]+@([-\da-z]{1,63}\.){1,4}[a-z]{2}[-\da-z]{0,61}$

The recipient part (before the “@”) is any combination of letters, numbers, and the symbols dash, underscore, period, apostrophe, and plus (note the dash comes first so it doesn’t need to be escaped; otherwise it would be treated as a range operator). The official email spec allows all sorts of unusual structures like colons and paired double quotes, but these are not seen in practice, at least in my experience. Some special characters like the tilde used to be more common, but the world has moved on. I think it is better to validate the majority of addresses well than to support a couple of legacy ones.

Conversely, there are cases where special characters need to creep back into the general mask. The plus “+” is an excellent example: Google popularized letting Gmail users add an arbitrary extension to their single account, so you could use myaddress+site1@gmail.com and myaddress+site2@gmail.com, all going to myaddress@gmail.com. I love this for automatically sorting into folders and tracking who is selling my email address. Clearly others love this too, and it needs to be supported.

After the “@” comes the domain name. This is quite straightforward, following the domain name rules. Each dot-separated part can be up to 63 characters and contain only letters, numbers, and dashes. I should note there are some additional domain rules I’m not explicitly covering, such as not allowing two dashes in a row or starting with a dash.
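If those extra rules ever matter, they could be checked outside the mask. Here is a minimal sketch of the idea; the function name is made up for illustration, and the exemption for parts starting with “xn--” is my own assumption so that encoded international names (which contain two dashes by design) are not rejected.

```javascript
// Hypothetical follow-up check for the two domain rules the mask skips.
// Assumes the address has already passed the mask, so there is exactly one "@".
function domainLabelsLookSane(address) {
  const domain = address.slice(address.lastIndexOf("@") + 1);
  return domain.split(".").every((label) =>
    // Rule 1: a part may not start with a dash.
    !label.startsWith("-") &&
    // Rule 2: no two dashes in a row, except in "xn--" encoded names (assumption).
    (label.startsWith("xn--") || !label.includes("--"))
  );
}

console.log(domainLabelsLookSane("someone@good-name.com")); // true
console.log(domainLabelsLookSane("someone@-bad.com"));      // false
console.log(domainLabelsLookSane("someone@my--site.com"));  // false
```

This runs as a second pass, so the mask itself stays readable.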

Finally, the TLD is added. Unfortunately the .co.uk construct requires two levels, so there is an optional suffix country code. Technically second level domains are not restricted to country codes, but in practice they are. An address can even be 5 levels deep, such as teachername@school.district.k12.state.us, so there can be up to three secondary level domains (plus the domain, plus the TLD). Technically it can be deeper, but this is the longest I’ve seen in practice.

Also, although the longest TLDs in English are “museum” and “travel”, weighing in at 6 characters, there are much more complex structures out there. The worst one is the word “Singapore” as encoded from Tamil script, which comes out as “XN--CLCHC0EA0B2G2A9GCD”. This not only makes the practical TLD length 22 characters, it adds digits and dashes. Although I don’t run into them in my work, I want to support all TLDs, so I allow the full 63 characters. I do enforce that there are at least 2 characters and that the first two are alpha.

I’m also limiting my mask to the DNS public internet. This means no IP addresses and no internal domain names without a suffix TLD. So I allow neither myemail@10.0.0.1 nor myemail@internalmailserver to pass the test.
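To make the rules above concrete, here is a small sketch that runs the mask against the example addresses from this post. The helper name is just for illustration, and the input is assumed to be lowercased already, since the mask only matches a-z.

```javascript
// The mask from above, unchanged. No "i" flag: input is assumed lowercase.
const EMAIL_MASK = /^[-_\.+'\da-z]+@([-\da-z]{1,63}\.){1,4}[a-z]{2}[-\da-z]{0,61}$/;

// Hypothetical helper name, for illustration only.
function passesMask(address) {
  return EMAIL_MASK.test(address);
}

console.log(passesMask("myaddress+site1@gmail.com"));                // true
console.log(passesMask("teachername@school.district.k12.state.us")); // true (5 levels)
console.log(passesMask("someone@example.xn--clchc0ea0b2g2a9gcd"));   // true (encoded TLD)
console.log(passesMask("myemail@10.0.0.1"));                         // false (IP address)
console.log(passesMask("myemail@internalmailserver"));               // false (no TLD)
```

The IP address fails because the last part must begin with two letters, and the internal server name fails because at least one dot is required after the “@”.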

Before performing the regex check, your code needs to trim whitespace from the user input, zap any trailing semicolon or comma, and lowercase the string, since the mask only matches a-z.
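That cleanup step might look something like the sketch below. The function name is made up for illustration, and swallowing any whitespace that sits just before the trailing semicolon or comma is my own small addition.

```javascript
// Hypothetical cleanup helper: run this on the raw input before testing the mask.
function normalizeEmailInput(raw) {
  return raw
    .trim()                      // drop leading and trailing whitespace
    .replace(/\s*[;,]$/, "")     // zap one trailing semicolon or comma
    .toLowerCase();              // the mask only matches a-z
}

console.log(normalizeEmailInput("  MyEmail@Example.COM; ")); // "myemail@example.com"
console.log(normalizeEmailInput("someone@example.com ,"));   // "someone@example.com"
```

Trailing semicolons and commas tend to show up when an address is copied out of a list of recipients, which is why they are worth removing rather than rejecting.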

This has reduced the errors to primarily “fat finger” items. There is more that I can do to help with this. In particular, I could maintain a list of TLDs, and then either make the RegEx really ugly or do a lookup outside of the RegEx mask to ensure the TLD is valid. I could go further and check that the domain name itself is valid by doing a whois call. I could do a fancy matching of the user’s first and last name to the email address, looking for partial matches with incorrect sub-strings (although this would result in a user confirmation rather than a rejection). I’ve not reached the threshold of needing these techniques, but even if I do, my mask will stay the same; it seems to be holding up well.
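As a rough idea of the lookup approach, here is a sketch that checks the TLD against a list outside of the mask. The function name is hypothetical and the list is a tiny illustrative subset; a real one would need to be loaded from a maintained source such as the TLD list IANA publishes.

```javascript
// Illustrative subset only, not a complete TLD list.
const KNOWN_TLDS = new Set(["com", "org", "net", "edu", "uk", "us", "museum", "travel"]);

// Hypothetical helper: assumes the address is lowercased and already passed the mask.
function hasKnownTld(address) {
  const tld = address.slice(address.lastIndexOf(".") + 1);
  return KNOWN_TLDS.has(tld);
}

console.log(hasKnownTld("someone@example.com")); // true
console.log(hasKnownTld("someone@example.con")); // false, a likely typo for ".com"
```

Keeping the list in a set rather than in the RegEx means it can be updated without touching the mask.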