CMPS 335 Advanced Web Publishing
Perl and CGI Programming
Regular Expressions
One of the greatest strengths of Perl is its powerful text
manipulation capabilities. String handling and pattern matching are
built into the language through regular expressions.
Regular expressions can be used to describe patterns, from the
simplest to the most complex. You can use a regular expression to
search a string for any pattern and, optionally, replace each match with
other text. The term "regular" in regular expressions is somewhat
misleading: for all their extraordinary capabilities,
regular expressions usually look strange, complex, and intimidating.
Elements of Regular Expressions
A regular expression is a string of characters consisting
of meta-characters, regular characters, and pattern modifiers.
Building blocks of regular expressions are atoms. An
atom is any single item in a pattern and it can be any of the following:
- A single character or digit
- A meta character
- An octal or hexadecimal code (such as \074, \xFF)
- A regular expression enclosed in parentheses
- A list of atoms enclosed in square brackets
- A back reference to a previous pattern match
Meta-Characters
A meta-character is a character with special meaning used to
build a regular expression. Meta-characters allow you to
match just the character or characters you want. Some of the
meta-characters are listed below:
Meta-Characters and Their Functions
--------------------------------------------------------------
\ (escape) Do not interpret the following meta-character
| Match either of the alternatives
( ) Match or create a single expression or atom
[ ] Match one of the enclosed atoms (atom alternatives)
{m,} Match at least m times
{m} Match exactly m times
{m,n} Match at least m and at most n times
* Match an atom zero or more times, same as {0,}
+ Match an atom one or more times, same as {1,}
? Match an atom zero or one time, same as {0,1}
^ Match an atom at the beginning of a string
$ Match an atom at the end of a string
. Match any single character
\d Match a digit (equivalent to [0-9])
\D Match a non-digit
\f Form feed
\n New line
\r Carriage return
\t Tab
\w Match a word character (an alphanumeric character or
the underscore; equivalent to [a-zA-Z0-9_])
\W Match a non-word character (equivalent to [^a-zA-Z0-9_])
\A Alternative to meta-character ^
\Z Alternative to meta-character $
\s Match a white space character (space, \f, \n, \r, \t)
\S Match a non-white-space character
\xnn Match the character having hexadecimal value nn
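The table can be tried out with a short script. The following is only a sketch: the test strings are made up for illustration, and the qr// operator (not covered in this section) is used simply to store each pattern in a list.

```perl
use strict;
use warnings;

# Each entry pairs a made-up test string with a pattern built from the table.
my @checks = (
    [ 'cat',        qr/^c.t$/               ],  # ^ and $ anchor; . matches any one character
    [ 'colour',     qr/^colou?r$/           ],  # ? makes the preceding atom optional
    [ '2024-01-15', qr/^\d{4}-\d{2}-\d{2}$/ ],  # {m} repeats an atom exactly m times
    [ 'a+b',        qr/^a\+b$/              ],  # \ escapes the + meta-character
    [ 'gray',       qr/^gr[ae]y$/           ],  # [ ] matches one of the enclosed characters
    [ 'dog',        qr/^(cat|dog)$/         ],  # | chooses between alternatives
);

my $matched = 0;
for my $check (@checks) {
    my ($text, $pattern) = @$check;
    if ($text =~ $pattern) {
        $matched++;
        print "$text: match\n";
    }
    else {
        print "$text: no match\n";
    }
}
print "$matched of ", scalar(@checks), " strings matched\n";
```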
Regular characters match themselves. The exceptions are the special
characters + ? . * ^ $ @ ( ) [ ] { } | \ which act as meta-characters
(most are listed in the table above; @ is special because Perl
interpolates array names in patterns). To match one of these special
characters literally, precede it with the escape character (\). The
escape character changes the meaning of the character following it.
The same rule applies inside double-quoted strings. Examples:
print "Testing the escape character \n";
print "Testing the escape character \\\n";
print "Testing the escape character \\n";
Output
Testing the escape character
Testing the escape character \
Testing the escape character \n
Pattern Modifiers
A pattern modifier is appended to a regular expression after the
trailing delimiter and it affects the action of the entire regular
expression. Some of the commonly used pattern modifiers are listed
below:
Modifier Meaning
---------------------------------------------------------
g Instead of matching only the first occurrence of
the pattern, the pattern is matched repeatedly
i Ignore character case when matching the pattern
e Evaluate the replacement side of s/// as a Perl expression
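The effect of each modifier is easiest to see side by side. This is a minimal sketch using the substitute function described later in this section; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "Web design, web hosting, WEB publishing";

# Without g, only the first occurrence is replaced (i ignores case).
(my $first_only = $text) =~ s/web/Net/i;

# With g and i together, every occurrence is replaced regardless of case.
(my $all = $text) =~ s/web/Net/gi;

# With e, the replacement is run as Perl code; here each number is doubled.
(my $doubled = "3 apples and 10 pears") =~ s/(\d+)/$1 * 2/ge;

print "$first_only\n";
print "$all\n";
print "$doubled\n";
```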
The $_ Variable
The $_ variable is Perl's default variable. Many functions and
operators, including the pattern-match and substitute functions, read
from $_ when no other variable is named, and some loops assign each
item to it in turn.
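A short loop shows $_ at work. In this sketch (the list contents are made up), the loop variable, the pattern match, and print all use $_ without naming it.

```perl
use strict;
use warnings;

my @lines = ("alpha 1", "beta", "gamma 3");
my @with_digit;

for (@lines) {              # each element is placed in $_ in turn
    next unless /\d/;       # same as: $_ =~ /\d/
    push @with_digit, $_;
    print;                  # print with no arguments prints $_
    print "\n";
}
```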
Back Reference Variables $1, $2, $3,...
Back reference variables are special variables created during a
successful regular expression pattern match. A back reference
variable is created when an atom in a regular expression is surrounded by a
pair of parentheses. It contains the string matched by
the pattern. Back reference variables are named $1 to $n, where n is
equal to the number of the parentheses pairs used in the regular expression.
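As an illustration, the sketch below pulls three fields out of a date string with three pairs of parentheses. The date format and variable names are assumptions made for the example. It also uses \1, the form a back reference takes inside the pattern itself.

```perl
use strict;
use warnings;

my $date = "2024-01-15";
my ($year, $month, $day);

# Three pairs of parentheses create $1, $2 and $3 on a successful match.
if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
    ($year, $month, $day) = ($1, $2, $3);
    print "year=$year month=$month day=$day\n";
}

# Inside the pattern, \1 refers back to whatever the first group matched,
# so this finds a word that is immediately repeated.
my $repeated = "this is is a test" =~ /\b(\w+) \1\b/ ? $1 : '';
print "repeated word: $repeated\n";
```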
Perl has three primary functions that use regular
expressions: split, substitute (s///), and pattern-match
(m//). These functions provide powerful string manipulation
capabilities.
The split Function
The syntax of the split function can be any of the following:
- split(/PATTERN/, $string, $limit)
- split(/PATTERN/, $string)
- split(/PATTERN/)
- split
The split function splits $string at each match of PATTERN and returns
the resulting list of substrings. If $limit is given and is positive,
the string is split into at most $limit fields. If $string is not
specified, the $_ variable is used. If PATTERN is not specified, the
function splits at white space after any leading white space has been
removed.
Examples
(code)
$buffer = "name1=David&name2=Mary&name3=Kenneth";
@pairs = split(/&/, $buffer);
($key,$value) = split(/=/, $pairs[0]);
print "$key -- $value\n";
$_ = "ABC abc";
($name,$value) = split;
print "$name $value\n";
(Output)
name1 -- David
ABC abc
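The $limit argument and a multi-character PATTERN are not covered by the examples above. The following sketch, with made-up input strings, shows both.

```perl
use strict;
use warnings;

my $record = "root:x:0:0:System Administrator:/root:/bin/sh";

# No limit: split at every colon.
my @all_fields = split(/:/, $record);

# Limit of 3: at most three fields; the last one keeps the rest unsplit.
my @three = split(/:/, $record, 3);

print scalar(@all_fields), " fields\n";
print "third of three: $three[2]\n";

# The pattern can also absorb optional white space around the separator.
my @names = split(/\s*,\s*/, "David , Mary,Kenneth");
print join("|", @names), "\n";
```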
The m// Function
The syntax of the pattern-match function can be any of the
following:
- $string =~ m/PATTERN/;
- m/PATTERN/;
- /PATTERN/;
The symbol =~ in the first form is called the binding operator. It
associates the string on the left side of the operator with the regular
expression match. The match function returns true if $string
contains PATTERN. PATTERN can be an atom or a single expression.
If $string is omitted, as in the second form, PATTERN is tested against
the default input variable $_. The pattern-match function can be used
without the preceding m, as shown in the third form, and is frequently
used this way in conditional expressions. If you do use m, you can
choose any pair of delimiters you want in place of the slashes.
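Choosing a different delimiter matters most when the pattern itself contains slashes. The sketch below uses a made-up path and file names to compare the two spellings and to show the bare /PATTERN/ form in a condition.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# With slashes as delimiters, every / in the pattern must be escaped.
my $with_slashes = $path =~ /^\/usr\/local\// ? 1 : 0;

# With m and a different delimiter pair, the same pattern is easier to read.
my $with_braces = $path =~ m{^/usr/local/} ? 1 : 0;

# A bare /PATTERN/ tests $_ and is typical inside a condition.
my $html_files = 0;
for ("index.html", "logo.gif") {
    $html_files++ if /\.html$/;
}

print "$with_slashes $with_braces $html_files\n";
```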
Examples
$string1 =~ /\d/;
# This statement searches for a single digit in $string1.
$total =~ /(\d*\.\d\d)/;
# This statement searches for zero or more digits (\d*), followed
# by a period (\.), followed by two digits (\d\d).
$string2 =~ /(.*)/;
# This greedy match matches the entire $string2 (up to the first
# newline, because . does not match a newline by default).
$text1 =~ /\(800\) \d\d\d-\d\d\d\d/;
$text1 =~ /\(800\) \d{3}-\d{4}/;
# Either one of the above will match any phone number like (800) 549-5314
# The following code makes a very basic check of an email address:
# it only tests whether the address contains an @ sign.
$email = 'jhu@selu.edu';
if ($email =~ /.*\@.*/) {
print "The email address has an @ sign";
}
else {
print "The email address does not have an @ sign";
}
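A slightly stricter check can be built from the same meta-characters. This is only a sketch, not a complete validator: the helper name looks_like_email is hypothetical, and the pattern merely requires one @ sign with non-empty text on both sides and a period somewhere in the part after it.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the address has the rough shape local@domain.tld.
# [^\s\@]+ means one or more characters that are neither white space nor @.
sub looks_like_email {
    my ($address) = @_;
    return $address =~ /^[^\s\@]+\@[^\s\@]+\.[^\s\@]+$/ ? 1 : 0;
}

for my $candidate ('jhu@selu.edu', 'no-at-sign.example', 'two@@signs.edu') {
    printf "%-20s %s\n", $candidate,
        looks_like_email($candidate) ? "accepted" : "rejected";
}
```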
The Substitute Function (s///)
Unlike the translate function, the substitute function operates on
regular expressions and provides powerful search and replace operations.
The substitute function searches either a bound string or the default
variable $_ for a pattern and performs a replacement operation as
described in the following forms:
- $string =~ s/PATTERN/REPLACEMENT/;
- s/PATTERN/REPLACEMENT/;
The first occurrence of PATTERN is replaced with REPLACEMENT.
You can replace every occurrence of PATTERN by adding the g
modifier. You can also ignore character case by adding the i
modifier.
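REPLACEMENT may also contain back reference variables, which lets a substitution rearrange the text it matched. A minimal sketch with made-up strings:

```perl
use strict;
use warnings;

# $1 and $2 hold the two captured words, so the replacement swaps them.
(my $swapped = "Doe, Jane") =~ s/^(\w+),\s*(\w+)$/$2 $1/;

# With g, every run of white space is collapsed to a single space.
(my $tidy = "too   many    spaces") =~ s/\s+/ /g;

print "$swapped\n";
print "$tidy\n";
```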
Example 1. The following code searches and replaces a
pattern at the start of a string with a new pattern.
$string1 = "Advanced Web Publishing";
$string2 = "Web Publishing";
$string1 =~ s/^Web/Desktop/; # $string1 remains unchanged
$string2 =~ s/^Web/Desktop/; # $string2 becomes "Desktop Publishing"
Example 2. The following form-decoding statement used in
the Parse_Form script removes any possible server-side include
statements from incoming data as a security precaution. Note that the
replacement is empty (// with no space between the two slashes), so the
matched text is simply deleted.
$value =~ s/<!--(.|\n)*-->//g;
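Because * is greedy, the pattern in Example 2 removes everything from the first <!-- to the last -->, including any ordinary text in between. The sketch below contrasts it with a non-greedy variant. The *? quantifier and the s modifier (which lets . match a newline) are standard Perl but are not covered in this section, and the input string is made up.

```perl
use strict;
use warnings;

my $value = 'keep <!--#exec cmd="ls"--> this <!--#echo var="x"--> too';

# Greedy: one match runs from the first <!-- to the last -->.
(my $greedy = $value) =~ s/<!--(.|\n)*-->//g;

# Non-greedy: each comment is matched and removed separately.
(my $lazy = $value) =~ s/<!--.*?-->//gs;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

For a security filter the greedy form errs on the side of removing too much, which may be the intent of the original statement.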
Example 3. The following code removes all HTML tags from a
string:
$text = "<br><b>Advanced <br>Web <br>Publishing</b>";
$text =~ s/<[^>]+>//g;
print "$text \n";
# "Advanced Web Publishing" is printed
The pack Function
The syntax of the pack function is:
pack("TEMPLATE", inputList);
The pack function takes a template and a list of values, and returns a
string in which each value has been converted to the format given by
the corresponding template code. A number after a code is a repeat
count (for A, the field width). The pack function is a little like the
tr function, but it can convert much more than single characters. Two
of the template codes and their meanings are shown in the following table:
Character
Code Meaning
------------------------------------------------
A A text string, padded with spaces to the given
width (or truncated if it is longer)
C An unsigned character value (a number from 0 to 255
converted to the character with that code)
Pack template examples
$packed = pack("A10","CMPS");
print "$packed (A10)\n";
$packed = pack("A5","CMPS");
print "$packed (A5)\n";
$packed = pack("A","CMPS");
print "$packed (A)\n";
$packed = pack("C3",67,66,83);
print "$packed (C3)\n";
$packed = pack("C2",67,66,83);
print "$packed (C2)\n";
$packed = pack("C",67,66,83);
print "$packed (C)\n";
(Output)
CMPS       (A10)
CMPS  (A5)
C (A)
CBS (C3)
CB (C2)
C (C)
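Two further uses follow from the table. This sketch assumes only the A and C codes described above, plus the * repeat count, which means "use all remaining values". The strings and variable names are made up.

```perl
use strict;
use warnings;

# Turn a word into its character codes with ord, then rebuild it with pack.
my @codes   = map { ord($_) } split(//, "Perl");
my $rebuilt = pack("C*", @codes);

# Fixed-width fields: A8 pads "name" to 8 characters, A5 pads "CMPS" to 5.
my $row = pack("A8 A5", "name", "CMPS") . "|";

print "@codes\n";
print "$rebuilt\n";
print "$row\n";
```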
More Regular Expression Examples
Example 1. The substitute statement in the following
code matches a two-digit hexadecimal number preceded by a percent sign
and converts it to the character with that ASCII value.
The pair of parentheses around the hexadecimal part of the pattern puts the
matched text in the back-reference variable $1. The hex function
converts the hexadecimal string to a decimal number, and the pack function
converts that number to the character it represents.
$string1 = "There are %35 books";
$string1 =~ s/%([\da-fA-F][\da-fA-F])/pack("C", hex($1))/e;
print "$string1 \n";
# "There are 5 books" is printed
Example 2. Consider the following important
form-decoding statement used in the Parse_Form script:
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
This statement matches a two-digit hex number prefixed by a % sign.
Since the two-digit code is surrounded by parentheses, the hexadecimal
digits are passed on to the replacement side of the substitution via the
variable $1. The "e" at the end is a modifier that tells
Perl to evaluate the replacement side as an expression. The pack
function converts the numeric value of the hex code (hex($1)) back to a
character. The "g" modifier tells Perl to make this change globally across
the entire string.
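Putting the pieces of this section together, form decoding might look like the sketch below. It is not the Parse_Form script itself: the url_decode helper, the %form hash, and the sample query string are all hypothetical. It also relies on the form-encoding convention that a space is sent as a + sign, which is undone here with the translate function before the hex codes are decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: undo form encoding on one key or value.
sub url_decode {
    my ($value) = @_;
    $value =~ tr/+/ /;    # form encoding sends each space as a + sign
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    return $value;
}

# Made-up query string for illustration.
my $buffer = "name=David+Lee&city=Baton%20Rouge&note=50%25";

my %form;
for my $pair (split(/&/, $buffer)) {
    my ($key, $value) = split(/=/, $pair, 2);   # limit 2 keeps any = in the value
    $form{ url_decode($key) } = url_decode($value);
}

print "$_ -- $form{$_}\n" for sort keys %form;
```

The + signs are translated before the %xx codes are decoded so that an encoded plus sign (%2B) survives as a literal +.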
Differences Between Substitute Function
and Translate Function
Unlike the substitute function, which uses regular expressions to
construct patterns, the translate function (tr///) does not perform a
regular expression match. The translate function is for general-purpose
character-by-character translation: each character in the first list is
replaced by the character at the same position in the second list.
The following code demonstrates the differences between these two
functions.
#!/usr/bin/perl
$str1 = "Advanced Web Publishing";
$str1 =~ s/Web/Desktop/;
print "$str1 \n";
$str2 = "Advanced Web Publishing";
$str2 =~ tr/Web/Desktop/;
print "$str2 \n";
$str3 = "Advanced Web Publishing";
$str3 =~ s/./a/;
print "$str3 \n";
$str4 = "Advanced Web Publishing";
$str4 =~ tr/./a/;
print "$str4 \n";
$str5 = "Advanced Web Publishing";
$str5 =~ s/./a/g;
print "$str5 \n";
# Replacing s/./a/g with tr/./a/g results in syntax error
(Output)
Advanced Desktop Publishing
Advanced Des Puslishing
advanced Web Publishing
Advanced Web Publishing
aaaaaaaaaaaaaaaaaaaaaaa
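The translate function is more often used with character ranges, or simply to count characters. A minimal sketch, with variable names chosen for illustration:

```perl
use strict;
use warnings;

my $title = "Advanced Web Publishing";

# Ranges pair up position by position: a becomes A, b becomes B, and so on.
(my $upper = $title) =~ tr/a-z/A-Z/;

# With an empty second list the string is left unchanged, and tr simply
# returns the number of characters it found.
my $vowels = ($title =~ tr/aeiouAEIOU//);

# A period has no special meaning in tr; it stands for itself.
(my $dashed = "a.b.c") =~ tr/./-/;

print "$upper\n";
print "$vowels vowels\n";
print "$dashed\n";
```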
References
- Elizabeth Castro, PERL and CGI, Chapter 11.
- Eric Herrmann, Mastering Perl 5, Chapter 16.
- Jacqueline Hamilton, CGI Programming 101, Chapter 14.