Articles | Addressing Form Field Validation
Introduction
As most of us are aware, using a form on a website is
an effective way to gather information from a visitor. Information can be
requested, mailing lists can be subscribed to, and comments and feedback can be
submitted. For those of you who have not yet implemented a form on your website
and are wondering about the HTML syntax in doing so, I will first delve into
such details. If you are already familiar with forms, you may skip this section
and move onto the next.
A form on a website is embodied inside the <FORM>
tag, so we will look at this one first and in some detail. Like most HTML tags,
the <FORM> tag takes a number of attributes related to the form
itself. It possesses the following syntax:
<FORM attribute1=".." attribute2=".."> ... </FORM>
where the attributes can be one (or more) of the
following:
- ACTION=".." - this attribute
specifies the location of the CGI script or email address that the
information will be sent to once the form is submitted. If you want your
form to do anything useful, this is a required field.
- METHOD=".." - this attribute
specifies how the form information is sent. Possible values are GET
and POST. This attribute is not required, and defaults to GET
if left out. With GET, the form information is appended to the URL
of the CGI script, while with POST, the form information is sent as
an encoded string in the body of the HTTP request.
- TARGET=".." - this attribute
specifies the frame in which the returned results will be loaded into - a
useful feature if you have a thank-you message that your CGI script displays
and you don't want to load it in the same window your form is located.
- NAME=".." - for JavaScript, this
is probably the most important attribute for a form. Since JavaScript stores
items that appear on your website in arrays, giving your form a name makes
it easier to work with.
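To make the difference between GET and POST a little more concrete, here is a minimal sketch of how form fields end up as one encoded string. The function name buildQuery and the sample field values are my own assumptions for illustration; a real browser does this encoding for you when the form is submitted.

```javascript
// Sketch only: approximates the "name=value&name=value" string a browser
// builds from a form. buildQuery is a hypothetical helper, not a built-in.
function buildQuery(fields) {
  var pairs = [];
  for (var name in fields) {
    if (Object.prototype.hasOwnProperty.call(fields, name)) {
      // Form encoding writes spaces as "+", so convert the "%20" that
      // encodeURIComponent() produces.
      var key = encodeURIComponent(name).replace(/%20/g, "+");
      var value = encodeURIComponent(fields[name]).replace(/%20/g, "+");
      pairs.push(key + "=" + value);
    }
  }
  return pairs.join("&");
}

var query = buildQuery({ username: "Jane Doe", address: "jane@example.com" });
// query is "username=Jane+Doe&address=jane%40example.com"
// GET:  the string is appended to the ACTION URL after a "?".
// POST: the same string travels in the body of the request instead.
```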
Inside the <form> tag, you place all the
elements that you want to use in the form itself. The first element is an <input>
tag, which takes attributes that define how it appears on the page. It is an
open-ended tag, which means that it does not have a corresponding </input>
tag to close it. These <input> tags can be of type text,
checkbox, image, password, radio, submit,
and reset. There also exists a <textarea> tag which
creates a large area for entering multi-lined information. The <select>
tag is used to create a drop-down list of items.
Defining each of these items is not the purpose of this
article, but if you would like to know more, please visit the World Wide Web
Consortium at http://www.w3.org
for more information. For now, suffice it to say that qualifying each component
of your form with a name=".." attribute is required if you
wish to work with either a CGI script or a JavaScript function.
The simple form used in this example contains two text
fields, one for a name, and the other for an email address. One button has been
added, by which the form is submitted to the server.
<form name="form_name" onSubmit="return isReady(this)" action="">
<table cellpadding=0 cellspacing=5 border=0><tr>
<td align="left">Your Name:</td><td align="left">
<input type="text" name="username"></td>
</tr><tr>
<td align="left">Your Email Address:</td><td align="left">
<input type="text" name="address"></td>
</tr><tr>
<td align="left" colspan=2><input type="submit" value="Submit"></td>
</tr></table>
</form>
Note that the form has been given a name, and that the form object itself is
passed to the isReady() function upon submission through the this
keyword. In JavaScript, this refers to the current object, which inside the
onSubmit handler is the form.
The dangers of CGI
As we have seen, coupling interactivity via forms and
programs or scripts on a server through the Common Gateway Interface (CGI) is an
effective way to obtain information from individuals visiting your website.
However, there are risks associated with running a CGI script from the web.
Poorly written scripts that accept malformed information from an unknowing or
malicious user could be made to do things that could bring your server to its
knees.
For example, imagine operating a website that contains
a field that allows a user to enter the name of a directory on the server.
Certainly not the smartest idea, but such sites are out there. If someone were to put
the following in as the directory they wanted listed, bad things could happen:
web_directory ; /bin/rm *
Quite possibly, the command to list the directory would
be carried out normally, and then the second command (/bin/rm *) would be
carried out and erase the files in the script's working directory.
There are several ways to prevent this sort of thing
from happening, and some are better than others, depending on the situation.
First and foremost, the script itself could be written to verify that the form
submitted to it does not contain any malicious code. Upon detecting such an
attempt, the script could refuse to process the entry and store the submitter's
IP address in a file for future reference. Or, more simply, the script could
display an alternate page telling the visitor that their input was not
accepted.
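As a rough sketch of the kind of check such a script could make, the function below only accepts directory names built from a short list of allowed characters, rather than trying to list every dangerous one. It is written in JavaScript to match the rest of this article; the name isSafeDirectoryName and the choice of allowed characters are assumptions of mine, and a real CGI script would do the same thing in its own language.

```javascript
// Sketch only: allow-list check for a directory name.
// Anything outside letters, digits, "_" and "-" is refused.
function isSafeDirectoryName(name) {
  var allowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
                "abcdefghijklmnopqrstuvwxyz" +
                "0123456789_-";
  if (!name) return false;
  for (var i = 0; i < name.length; i++) {
    if (allowed.indexOf(name.charAt(i)) == -1) return false;
  }
  return true;
}

var plain = isSafeDirectoryName("web_directory");             // true
var attack = isSafeDirectoryName("web_directory ; /bin/rm *"); // false
```

Listing what is allowed tends to be safer than listing what is forbidden, since a forgotten character is refused by default.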
While this is a very good method to use when validating
form field input, it does have its disadvantages. One of the biggest is the
overhead involved with parsing input on the server. A busy server that parses
all of its requests could be slowed considerably, resulting in a website that
appears sluggish. Here is where JavaScript comes to the rescue!
By passing the contents of the form to a JavaScript
function before submission, the contents can be validated before being sent to
the server, which reduces server overhead.
Beware, however, that a poorly written script can still
accept requests that do not come from the form. It is possible that a malicious
user from a completely different domain could run your script directly and feed
it bad information. Fortunately, there are several ways around this. One of the
easiest is to make your CGI script examine the HTTP_REFERER and REMOTE_HOST
environment variables that the server sets for every request. These variables
contain the URL of the requesting document and the host name of the machine
making the request respectively, and could be checked to ensure that the request
was submitted from your own form. If the request is not allowed, the remote host
name could be logged in a file and refused access to the script. Keep in mind
that the referring URL is supplied by the client and can be forged, so treat
this check as a first filter rather than a guarantee.
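The referrer check described above might look something like the sketch below. It is written in JavaScript for consistency with the rest of the article and relies on the URL class found in modern JavaScript environments; the function name isAllowedReferer and the host names are hypothetical. A CGI script would read the value from the HTTP_REFERER environment variable.

```javascript
// Sketch only: compare the host part of the referring URL with our own host.
function isAllowedReferer(referer, allowedHost) {
  if (!referer) return false;          // no referrer supplied at all
  try {
    return new URL(referer).hostname === allowedHost;
  } catch (e) {
    return false;                      // value was not a parseable URL
  }
}

var fromOurForm = isAllowedReferer("http://www.example.com/form.html", "www.example.com"); // true
var fromElsewhere = isAllowedReferer("http://other.example.net/fake.html", "www.example.com"); // false
```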
It is also important to ensure that some form of error
checking still takes place, even if the request is a legitimate one. A visitor
using a browser that does not support JavaScript could still conceivably submit
malformed code.
Using JavaScript 1.0 Validation
JavaScript 1.0 offered a way to check and see if a
field contained certain characters using the indexOf() method. If a
character was found, the position of the character was returned as a number. For
example:
var a = "This is my field's contents";
var b = a.indexOf("my"); // b now contains 8.
As you can see, b now contains the position
(starting from 0) that the pattern "my" was located at. If
the pattern searched for was not found, the indexOf() method returns
-1.
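Here is a short sketch of both outcomes, a hit and a miss, together with the usual way of turning the result into a yes/no answer. The helper name contains is my own and is not part of JavaScript 1.0.

```javascript
var a = "This is my field's contents";

var hit = a.indexOf("field");   // 11: position of the "f", counting from 0
var miss = a.indexOf("email");  // -1: the pattern does not occur

// Hypothetical helper: true when pattern occurs anywhere in text.
function contains(text, pattern) {
  return text.indexOf(pattern) != -1;
}
```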
But what if you wanted to check for several characters
all at once? What if you wanted to make sure that an email address only
contained numbers, letters, an "at sign", and a period? By using indexOf(),
you would be required to write several lines of code, each using indexOf()
to look for ALL the characters you didn't want to find. If an illegal character
is found, an alert box could be flashed asking the user to re-enter their
information. The following functions use JavaScript 1.0 functionality to examine
either a text field containing regular text or a text field containing an email
address. By passing the contents of the form to the isReady() function
using the onSubmit event handler, the information is validated before
being sent to the server. If the function returns true (i.e. everything checks
out), the form is submitted to the location given in its ACTION attribute.
Note that these functions can be used independently of
a form. These methods can be used anywhere, as long as an appropriate string
value is passed as an argument.
<script language="JavaScript"><!--
// Returns false if the string is empty or contains a character
// that is not allowed in an email address.
function isEmail(string) {
  if (!string) return false;
  var iChars = "*|,\":<>[]{}`\';()&$#%";
  for (var i = 0; i < string.length; i++) {
    if (iChars.indexOf(string.charAt(i)) != -1)
      return false;
  }
  return true;
}
// Same check for ordinary text; here the "@" is illegal as well.
function isProper(string) {
  if (!string) return false;
  var iChars = "*|,\":<>[]{}`\';()@&$#%";
  for (var i = 0; i < string.length; i++) {
    if (iChars.indexOf(string.charAt(i)) != -1)
      return false;
  }
  return true;
}
// Called from onSubmit; returning false cancels the submission.
function isReady(form) {
  if (isEmail(form.address.value) == false) {
    alert("Please enter a valid email address.");
    form.address.focus();
    return false;
  }
  if (isProper(form.username.value) == false) {
    alert("Please enter a valid username.");
    form.username.focus();
    return false;
  }
  return true;
}
//--></script>
Although this method works fine if you want to ensure
that certain characters are not present in the field, it falls short when trying
to ensure that certain patterns ARE present. What if you only wanted to allow
email addresses from a certain domain, while not allowing others? What if only word-word@word-word.word
email addresses were allowed? These things would be incredibly difficult, if not
impossible, to do with indexOf() and JavaScript 1.0.
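To see the shortcoming in action, the sketch below repeats the character check from the JavaScript 1.0 isEmail() function under a different name (passesDenyList, my own label) so that it can run on its own. Anything that avoids the listed characters is accepted, whether or not it looks like an email address.

```javascript
// Sketch only: same logic as the JavaScript 1.0 isEmail() above.
function passesDenyList(string) {
  if (!string) return false;
  var iChars = "*|,\":<>[]{}`';()&$#%";
  for (var i = 0; i < string.length; i++) {
    if (iChars.indexOf(string.charAt(i)) != -1) return false;
  }
  return true;
}

var realAddress = passesDenyList("someone@somewhere.com");    // true
var notAnAddress = passesDenyList("not an address");          // true as well!
var illegalChars = passesDenyList("some(one)@somewhere.com"); // false
```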
Using JavaScript 1.2 and Regular
Expressions
JavaScript 1.2 shows the way through the power of
regular expressions. These expressions, which offer the same functionality as
regular expressions taken from Perl, a very popular scripting language, add the
ability to parse form field input in ways that were simply not possible before.
The examples below, which only work in Netscape Navigator 4.0x and Internet
Explorer 4, illuminate the power associated with these new additions.
First off, what is a regular expression? Put simply, a
regular expression is a pattern, written with ordinary and special characters,
that describes the text a programmer wants to match.
Before we get into using regular expressions to parse
text, it is important that you understand a bit about how regular expressions
work and what special characters do what. There is just too much to get into
here, but here are a few that come up often:
. matches any single character (except a line break).
? matches zero or one of the preceding character.
+ matches one or more of the preceding character.
* matches zero or more of the preceding character.
^ matches the absolute beginning of the string.
$ matches the absolute end of the string.
\w+ matches a whole word (one or more "word" characters).
\w matches a "word" character (alphanumerics and the "_" character).
\s+ matches whitespace; \W+ matches a run of non-"word" characters.
x|y matches one or the other of x or y.
[0-9] matches ONE digit, ranging from 0 to 9.
[A-Za-z] matches any letter, uppercase or lowercase.
Parentheses can be used to group characters together.
(this)+ matches at least one occurrence of "this".
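A quick way to get a feel for these characters is to try each one against a sample string. The sketch below uses the test() method of a regular expression, which simply reports whether the pattern matches; the sample strings are my own. Each pattern is anchored with ^ and $ so that the whole string has to match.

```javascript
// Each entry: [pattern, sample text, whether we expect a match].
var checks = [
  [/^b.t$/,       "bat",       true],   // . stands for any one character
  [/^colou?r$/,   "color",     true],   // ? allows zero or one "u"
  [/^colou?r$/,   "colour",    true],
  [/^go+d$/,      "gd",        false],  // + needs at least one "o"
  [/^go*d$/,      "gd",        true],   // * is happy with zero
  [/^\w+$/,       "form_1",    true],   // word characters include "_"
  [/^\w+$/,       "two words", false],  // a space is not a word character
  [/^(cat|dog)$/, "dog",       true],   // one alternative or the other
  [/^[0-9]$/,     "7",         true],   // exactly one digit
  [/^[0-9]$/,     "42",        false],
  [/^(this)+$/,   "thisthis",  true]    // the group repeated twice
];

var failures = 0;
for (var i = 0; i < checks.length; i++) {
  var matched = checks[i][0].test(checks[i][1]);
  if (matched != checks[i][2]) failures++;
}
// failures stays at 0 when every pattern behaves as annotated.
```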
If you wish to search for one of the special
characters, you must first escape it with a backslash (\).
\. matches a period.
\? matches a question mark.
\[ matches a left square bracket.
\| matches a "pipe" character.
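The difference the backslash makes is easy to miss, so here is a small sketch with made-up sample strings. Without the backslash the period stands for any character; with it, only a real period will do.

```javascript
var anyCharacter = /^a.c$/;    // "." is special here
var literalPeriod = /^a\.c$/;  // "\." is an ordinary period

var loose = anyCharacter.test("abc");       // true
var strict = literalPeriod.test("abc");     // false
var strictDot = literalPeriod.test("a.c");  // true
```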
In addition to these, modifiers can be added after the
regular expression to control how it searches through the string. Some of the
more useful ones include these:
/somematch/g - global (matches all instances).
/somematch/i - ignore case.
/somematch/gi - you can combine them, too.
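The effect of each modifier shows up clearly when a pattern is used to replace text. The sketch below uses the replace() method, which is covered a little further down; the sample string is my own.

```javascript
var text = "perl Perl perl";

var caseSensitive = text.replace(/PERL/, "X");  // "perl Perl perl" (no match)
var ignoreCase = text.replace(/PERL/i, "X");    // "X Perl perl"
var global = text.replace(/perl/g, "X");        // "X Perl X"
var both = text.replace(/perl/gi, "X");         // "X X X"
```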
JavaScript 1.2 contains a number of new constructors
and methods that allow a programmer to parse a string of text using regular
expressions. The first thing you must do before you can begin parsing a string
is to determine exactly what your regular expression will be. There are two ways
to do this. The first is to specify it by hand using normal syntax, and the
second is to use the new RegExp() constructor. The following two
statements are equivalent:
pattern = /:+/; // matches one or more colons
pattern = new RegExp(":+"); // same thing.
There is one important thing to notice here. With
the first method, you must delimit your expression using slashes; a slash marks
the beginning and the end of a regular expression. With the second, the pattern
is an ordinary string, so no slashes are used and any backslash in the pattern
must be doubled. You may also place the regular expression directly into a
method call without first assigning it to a variable or using the RegExp()
constructor, which is what I do in the examples below.
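One catch with the RegExp() constructor deserves a short sketch: because the pattern is passed as a string, each backslash has to be written twice. The pattern and sample file name below are my own choices for illustration.

```javascript
var literal = /^\w+\.\w+$/;                 // word, period, word
var built = new RegExp("^\\w+\\.\\w+$");    // the same pattern as a string

// With single backslashes the string quietly loses them: "\w" becomes "w"
// and "\." becomes ".", which is a different pattern altogether.
var forgotten = new RegExp("^\w+\.\w+$");

var literalResult = literal.test("index.html");     // true
var builtResult = built.test("index.html");         // true
var forgottenResult = forgotten.test("index.html"); // false
```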
The replace() method allows a programmer to
replace a found match with another string. It takes two arguments, one being the
regular expression you want searched for, and the other being the replacement
text you want substituted. For example:
var t = "javascript is great";
var s = t.replace(/javascript/, "JavaScript");
// fixes the capitalization.
The variable s now contains "JavaScript
is great". The next method is the search() method. This method
searches the source string and returns the location of the first match if the
pattern is found, otherwise -1. It effectively duplicates the functionality of
JavaScript 1.0's indexOf() method. Example:
var s = "Let's use Regular Expressions";
var found = s.search(/use/); // found now contains 6.
If the search string is not located, the function
returns -1. This method is the one that will enable us to parse a field's
contents to make sure that people aren't submitting information that could
damage our server. Before we do that, however, let's take a look at the next
method provided for regular expressions, the split() method. The split()
method is actually present in older versions of JavaScript but has been updated
for JavaScript 1.2 to accommodate regular expressions. It searches through a
string and "breaks apart" the string and stores each part in an array.
The example below uses a pattern that looks for a colon and stores each part in
the array a.
var s = "Webmaster:this:is:great:don't:you:think";
var a = s.split(/:/);
In this case, a becomes the array containing
["Webmaster", "this", "is",
"great", "don't", "you", "think"]. In
common CGI applications, this same technique is used to separate a comma
delimited text file that perhaps serves as a database containing user
information.
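As a sketch of that comma-delimited case, the pattern below splits on a comma together with any whitespace around it, something a plain string separator cannot do. The record contents are invented for the example.

```javascript
var record = "jdoe, Jane Doe ,jane@example.com";

// Split on a comma plus any spaces on either side of it.
var fields = record.split(/\s*,\s*/);
// fields is ["jdoe", "Jane Doe", "jane@example.com"]
```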
The match() method searches a string in a
different way. It returns an array consisting of all the matches found in the
string that match the regular expression. If no matches are found, it returns
null.
var s = "Thank you, there, for thinking about me.";
// matches "th" followed by word characters, globally, ignoring case.
var a = s.match(/th\w+/gi);
a is an array that now contains
["Thank", "there", "thinking"].
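Because match() returns null rather than an empty array when nothing is found, it pays to check the result before using it. The helper below, countMatches, is a name of my own choosing and only a sketch of that check.

```javascript
// Sketch only: number of matches, treating null as zero.
function countMatches(text, pattern) {
  var found = text.match(pattern);
  return found ? found.length : 0;
}

var three = countMatches("Thank you, there, for thinking about me.", /th\w+/gi); // 3
var none = countMatches("No such words here.", /th\w+/gi);                       // 0

// The pattern is not tied to the start of a word, so it also
// finds "th" in the middle of one.
var inside = "something".match(/th\w+/gi);  // ["thing"]
```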
Now, finally, we get to do some useful things with
regular expressions. The following functions check a form consisting of a
username and an email address. The first rejects the username if it is not
made up entirely of letters, numbers, or underscores, optionally followed by a
single space and a second such word. The second rejects the email address if
it contains more than just alphanumerics, underscores, an "at" sign, periods,
or hyphens in the right places.
Since regular expressions are only a part of JavaScript
1.2, we must determine the browser being used and plan accordingly. Since
older browsers ignore script blocks marked as JavaScript 1.2, we can simply use
the language="JavaScript1.2" qualifier to refine our parsing functions. Older
browsers will simply skip over this code.
<SCRIPT language="JavaScript1.2"><!--
function isEmail(string) {
  // The pattern must stay on one line; search() returns -1 if it is not found.
  if (string.search(/^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/) != -1)
    return true;
  else
    return false;
}
function isProper(string) {
  if (string.search(/^\w+( \w+)?$/) != -1)
    return true;
  else
    return false;
}
//--></SCRIPT>
Ok. Let's stop and examine the regular expressions used
in the functions above. First, let's look at the isProper() function
since it is simpler. The Regular Expression used is /^\w+( \w+)?$/.
- The first / is the leftmost delimiter for
the regular expression. No surprises there.
- The ^ (caret) symbol represents the
absolute beginning of the string. This is important, since if it were left
out the match would return true if the search() method found the
pattern ANYWHERE in the string. Malicious users could then include illegal
characters before a valid name and get away with it.
- The \w+ which is next indicates that we
want to match one or more alphanumeric characters, including the
underscore. The \w represents the character, and the +
symbol means at least one. No magic there.
- The following part of the regular expression is
special, since it has to be treated together. Let's break it down a bit,
however. First, notice the whole picture. What we are doing is using a ?
symbol after a group in parentheses, which means match one or none of the
whole group. So, what happens is the regular expression looks for one or none
of a space followed by at least one legal word character. This represents the
optional last name of the user. Please take a moment to understand that.
- The last $ symbol represents the end of the
string. This makes sure that no characters can appear after our matched
string in the regular expression, thus eliminating the possibility of
someone sending bad data after a valid username.
- The final / is the rightmost delimiter for
the regular expression. Again, no surprises there.
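To check the walkthrough, the sketch below runs the same pattern against a handful of sample names of my own choosing and records the verdict for each.

```javascript
var properName = /^\w+( \w+)?$/;

var samples = ["Jane", "Jane Doe", "Jane Q Doe", "Jane; /bin/rm *", " Jane", ""];
var verdicts = [];
for (var i = 0; i < samples.length; i++) {
  verdicts.push(properName.test(samples[i]));
}
// verdicts is [true, true, false, false, false, false]
```

Note that "Jane Q Doe" is refused: the optional group allows only one extra word, so a pattern like this would need adjusting before it could accept middle names.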
Ok. Shall we move on to the isEmail() function?
The Regular Expression is /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/.
- The email regular expression begins with a /,
again representing the leftmost delimiter.
- Once again, we have a ^ symbol,
representing the absolute beginning of the string for the same reason as
before.
- The following \w+ matches one or more
alphanumeric characters.
- The next chunk is where this gets interesting. I
will try to break it down into manageable pieces, so please bear with me.
The part we will look at is ((-\w+)|(\.\w+))*
- First, note that the whole thing is surrounded
by ()* which means that we want to match zero or more of them.
Inside the parentheses, we have (-\w+)|(\.\w+) which means to
match EITHER -\w+ OR \.\w+ so let's take a look at each
of them in turn. The first one matches a hyphen followed immediately
by a set of word characters. The second matches a period followed
immediately by a set of word characters. Remember that a period
by itself is a special character, so we must escape it by placing a
backslash in front of it. In essence, what this inside bit does is allow
the part of the email address before the "at" sign to contain
hyphenated or dot-separated sections.
- After this match comes an @ sign. It is
escaped with a backslash here, which is harmless, although @ has no special
meaning in a JavaScript regular expression.
- Immediately following the "at" sign is
[A-Za-z0-9]+ which matches a set of alphanumeric characters
(excluding any _ characters, which we would have got if we had
just used \w).
- After this, we have another interesting bit ((\.|-)[A-Za-z0-9]+)*.
Let's go through it.
- Again, note that we are matching zero or more occurrences of a
group using the * sign. Since parentheses are used, the entire
group is taken into consideration. Let's look inside at the (\.|-)[A-Za-z0-9]+
pattern. Inside the parentheses, we have \.|- which implies that we
will match either a period or a hyphen. Since this pattern is followed by a [A-Za-z0-9]+,
the match only works if the period or hyphen is followed by a set of
alphanumeric characters. This effectively represents an email address that
contains a (possible) set of .word or -word sections. Because the *
is used, the pattern works if they are present and also if they aren't.
- The last \.[A-Za-z0-9]+ pattern matches a
period followed by a set of alphanumerics. Because it is the last part of
the regular expression, it represents the final part of the email address,
which is the top level domain. Because [A-Za-z0-9]+ does not match
non-alphanumerics, this pattern will not match email addresses that do not
contain some sort of "real-looking" domain.
- The final $ symbol ensures that the pattern
is anchored to the end of the string, for the same reasons as in the previous
example.
- The final / is the rightmost delimiter
for the regular expression.
This pattern allows for email addresses like the
following. With this particular regular expression, the bare minimum that a
person could enter as an email address is x@x.x, where x is any alphanumeric
character:
someone@somewhere.com
someone.somebody@somewhere.com
someone.sombody@somewhere.where.com
some-one@somewhere.com
some-one.somewhere@wherever.com
some-one.somewhere@where-ever.com
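Just as instructive is what the pattern turns away. The sketch below runs the same regular expression over a few addresses that should pass and a few that should not; the rejected samples are my own and are only meant to exercise each part of the pattern.

```javascript
var emailPattern = /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/;

var accepted = [
  "x@x.x",
  "someone@somewhere.com",
  "some-one.somewhere@where-ever.com"
];
var rejected = [
  "someone@somewhere",                // no period after the "at" sign
  "someone@@somewhere.com",           // two "at" signs
  "some one@somewhere.com",           // space in the name
  ".someone@somewhere.com",           // starts with a period
  "someone@some_where.com",           // underscore in the domain
  "someone@somewhere.com; /bin/rm *"  // trailing junk caught by $
];

// Count how many samples behave differently from what we expect.
var surprises = 0;
for (var i = 0; i < accepted.length; i++) {
  if (!emailPattern.test(accepted[i])) surprises++;
}
for (var j = 0; j < rejected.length; j++) {
  if (emailPattern.test(rejected[j])) surprises++;
}
// surprises stays at 0.
```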
Working Example
Why not try the example
out, which works in Netscape Navigator 2, 3 and 4, as well as Internet Explorer
3 and 4.
Source Code
You can view the source
code of the working example.
Further Information
If you are interested in learning more about JavaScript
1.2, feel free to examine these sources of information on the web:
What's new in JavaScript 1.2: http://developer.netscape.com/library/documentation/communicator/jsguide/js1_2.htm
JavaScript 1.2 Reference: http://developer.netscape.com/library/documentation/communicator/jsref/index.htm
For a good introduction to regular expressions, please
check out: ftp://ftp.ou.edu/mirrors/CPAN/doc/manual/html/pod/perlre.html
In addition, you might want to check out Tom
Christiansen's page on Regular Expressions in Perl 5, which can be found at: http://www.perl.com/CPAN-local/doc/FMTEYEWTK/regexps.html
The FMTEYEWTK stands for "Far More Than Everything You Ever Wanted To
Know".