CMPS 335 Advanced Web Publishing Perl and CGI Programming


Regular Expressions

One of the greatest strengths of Perl is its powerful text manipulation capabilities.  For manipulating strings and matching patterns, Perl is among the easiest and most powerful languages.  Regular expressions can describe patterns from the simplest to the most complex.  You can use a regular expression to search a string for a pattern and, optionally, replace what it matches with other text.  The term "regular" in regular expressions is somewhat misleading: despite their extraordinary capabilities, regular expressions often look strange, complex, and intimidating.

Elements of Regular Expressions

A regular expression is a string of characters consisting of meta-characters, regular characters, and pattern modifiers.  The building blocks of regular expressions are atoms.  An atom is any single item in a pattern, such as a meta-character or a regular character.
Meta-Characters

A meta-character is a character with special meaning used to build a regular expression.  Meta-characters allow you to match just the character or characters you want.  Some of the meta-characters are listed below:
    Meta-Characters and Their Functions
   --------------------------------------------------------------
    \ (escape)  Do not interpret the following meta-character
    |       Match either of the alternatives
    ( )     Match or create a single expression or atom
    [ ]     Match one of the enclosed atoms (atom alternatives)
    {m,}    Match at least m times
    {m}     Match exactly m times
    {m,n}   Match at least m and at most n times 
    *       Match an atom zero or more times, same as {0,}
    +       Match an atom one or more times, same as {1,}
    ?       Match an atom zero or one time, same as {0,1}
    ^       Match an atom at the beginning of a string
    $       Match an atom at the end of a string
    .       Match any single character
    \d      Match a digit (equivalent to [0-9])
    \D      Match a non-digit 
    \f      Form feed
    \n      New line
    \r      Carriage return
    \t      Tab
    \w      Match a word character (an alphanumeric character or
            the underscore; equivalent to [a-zA-Z0-9_])
    \W      Match a non-word character (anything other than an
            alphanumeric character or the underscore)
    \A      Match at the beginning of a string (alternative to ^)
    \Z      Match at the end of a string (alternative to $)
    \s      Match a whitespace character (space, \f, \n, \r, \t)
    \S      Match a non-whitespace character
    \xnn    Match the character having hexadecimal value nn
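As a minimal sketch of how several of the meta-characters in the table combine, the following code filters a list of sample strings.  The strings and variable names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration.
my @samples = ("2024-01-15", "phone: 555-1234", "cat", "caat", "ct");

# ^ and $ anchor the match; \d{4} means exactly four digits.
my @dates = grep { /^\d{4}-\d{2}-\d{2}$/ } @samples;

# a+ requires one or more "a"; a* also accepts zero, so "ct" matches too.
my @one_or_more  = grep { /^ca+t$/ } @samples;
my @zero_or_more = grep { /^ca*t$/ } @samples;

# | separates alternatives; ( ) groups them into a single atom.
my @grouped = grep { /^(cat|ct)$/ } @samples;

print "dates:        @dates\n";          # 2024-01-15
print "a+ matches:   @one_or_more\n";    # cat caat
print "a* matches:   @zero_or_more\n";   # cat caat ct
print "alternatives: @grouped\n";        # cat ct
```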
Regular characters match themselves, except for the special characters +?.*^$@()[]|\ used as meta-characters listed in the above table.  To match one of these special characters, you have to precede it with the escape character (\).  The escape character changes the meaning of the character following it.  Examples:
 
   print "Testing the escape character \n";
   print "Testing the escape character \\\n";
   print "Testing the escape character \\n";

   Output
   Testing the escape character
   Testing the escape character \
   Testing the escape character \n 
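The same escape character applies inside a pattern.  The sketch below, which uses hypothetical price strings, shows the difference between an unescaped and an escaped period and dollar sign.  It also uses Perl's built-in quotemeta function, which is not covered above, to escape a string automatically.

```perl
use strict;
use warnings;

my $price = 'Total: $4.50';   # single quotes: no variable interpolation

# Unescaped, "." matches any character, so "4x50" also matches.
my $loose = ('Total: $4x50' =~ /4.50/)  ? 1 : 0;
# Escaped, "\." matches only a literal period.
my $exact = ('Total: $4x50' =~ /4\.50/) ? 1 : 0;
# "\$" matches a literal dollar sign instead of anchoring at the end.
my $dollar = ($price =~ /\$4\.50/) ? 1 : 0;

# quotemeta puts a backslash before every non-word character.
my $escaped = quotemeta('4.50');

print "$loose $exact $dollar $escaped\n";   # 1 0 1 4\.50
```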

Pattern Modifiers

A pattern modifier is appended to a regular expression after the trailing delimiter and it affects the action of the entire regular expression.  Some of the commonly used pattern modifiers are listed below:
    Modifier         Meaning
   ---------------------------------------------------------
    g     Instead of matching only the first occurrence of 
          the pattern, the pattern is matched repeatedly  
    i     Ignore character case when matching the pattern
    e     Evaluate the replacement (right-hand side) of a
          substitution as a Perl expression
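
A short sketch of the three modifiers in use follows.  The sample strings are hypothetical, and the idiom (my $copy = $original) =~ s/.../.../ is used only so that each result is kept in its own variable.

```perl
use strict;
use warnings;

my $text = "Perl perl PERL";   # hypothetical sample text

# No modifier: only the first case-sensitive match is replaced.
(my $plain = $text) =~ s/perl/X/;     # "Perl X PERL"

# i: case is ignored, but still only the first match is replaced.
(my $icase = $text) =~ s/perl/X/i;    # "X perl PERL"

# g and i together: every occurrence is replaced regardless of case.
(my $all = $text) =~ s/perl/X/gi;     # "X X X"

# e: the replacement is evaluated as Perl code; each number is doubled.
(my $doubled = "3 apples and 10 pears") =~ s/(\d+)/$1 * 2/ge;

print "$plain\n$icase\n$all\n$doubled\n";
```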

The $_ Variable

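The $_ variable is Perl's default variable.  Constructs such as foreach loops and while loops that read from a filehandle place the current item in it, and many of Perl's functions and operators use it as their default input when no argument is given.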
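
As an illustration, the sketch below relies on $_ in three places: as the implicit loop variable, as the implicit target of a pattern match, and as the implicit argument of length.  The word list is hypothetical.

```perl
use strict;
use warnings;

my @words = ("apple", "Banana", "cherry");   # hypothetical sample data
my @capitalized;

# foreach with no loop variable places each element in $_.
foreach (@words) {
    # A bare /PATTERN/ is tested against $_.
    push @capitalized, $_ if /^[A-Z]/;
}

# length with no argument operates on $_, which map sets for each element.
my @lengths = map { length } @words;

print "@capitalized\n";   # Banana
print "@lengths\n";       # 5 6 6
```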

Back Reference Variables $1, $2, $3,...

Back reference variables are special variables set during a successful regular expression pattern match.  A back reference variable is created when part of a regular expression is surrounded by a pair of parentheses.  It contains the portion of the string matched by the pattern inside those parentheses.  Back reference variables are named $1 to $n, where n is the number of parenthesis pairs used in the regular expression, counted by opening parenthesis from left to right.
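
A minimal sketch, assuming a hypothetical date string in year-month-day form, shows three pairs of parentheses filling $1, $2, and $3:

```perl
use strict;
use warnings;

my $date = "2024-01-15";   # hypothetical input
my ($year, $month, $day);

# Three pairs of parentheses set $1, $2, and $3 on a successful match.
if ($date =~ /^(\d{4})-(\d{2})-(\d{2})$/) {
    ($year, $month, $day) = ($1, $2, $3);
}

print "$month/$day/$year\n";   # 01/15/2024
```

Testing the match in an if statement before reading $1 matters, because the variables keep their old values when a match fails.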

Perl has three primary functions that use regular expressions: split, substitute (s///), and pattern-match (m//).  These functions provide powerful string manipulation capabilities.

The split Function

The syntax of the split function can be any of the following:
  1. split(/PATTERN/, $string, $limit)
  2. split(/PATTERN/, $string)
  3. split(/PATTERN/)
  4. split
The split function splits $string at each match of PATTERN and returns the resulting list of strings.  If $limit is specified and positive, the string is split into at most $limit fields.  If $string is not specified, the $_ variable is used.  If PATTERN is not specified, the function splits on whitespace after skipping any leading whitespace.

Examples
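The following sketch exercises three of the forms listed above.  The colon-separated record and the sentence are hypothetical sample data.

```perl
use strict;
use warnings;

my $record = "alice:x:1001:/home/alice";   # hypothetical colon-separated record

# Form 2: split on every colon.
my @all = split(/:/, $record);

# Form 1: a limit of 2 stops after the first split.
my @first = split(/:/, $record, 2);

# Form 4: no arguments, so split works on $_ and splits on whitespace,
# skipping the leading whitespace.
my @words;
for ("   the quick  brown fox") {
    @words = split;
}

print scalar(@all), " fields\n";   # 4 fields
print "$first[1]\n";               # x:1001:/home/alice
print join("|", @words), "\n";     # the|quick|brown|fox
```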
The m// Function

The syntax of the pattern-match function can be any of the following:
  1. $string =~ m/PATTERN/;
  2. m/PATTERN/;
  3. /PATTERN/;
The symbol =~ in (1) is called the binding operator.  It associates the string on the left side of the operator with the regular expression match.  The match function returns true if $string contains a match for PATTERN.  PATTERN can be an atom or a single expression.  If $string is omitted as in (2), PATTERN is tested against the default input variable $_.  The pattern-match function can be used without the preceding m as shown in (3), a form frequently used in conditional expressions.  If you use m, you can choose any pair of delimiters you want in place of the slashes.

Examples
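The sketch below uses a bound match, a match with alternative delimiters, and a bare match against $_.  The URL and file names are hypothetical.

```perl
use strict;
use warnings;

my $url = "http://www.example.com/cgi-bin/test.pl";   # hypothetical URL

# Form 1: bind a string to the match with =~.
my $is_http = ($url =~ m/^http:/) ? 1 : 0;

# With m, braces as delimiters avoid escaping the slashes in the pattern.
my $is_cgi = ($url =~ m{/cgi-bin/}) ? 1 : 0;

# Form 3: a bare /PATTERN/ in a condition is tested against $_.
my $count = 0;
foreach ("test.pl", "index.html", "form.pl") {
    $count++ if /\.pl$/;
}

print "$is_http $is_cgi $count\n";   # 1 1 2
```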
The Substitute Function (s///)

Unlike the translate function, the substitute function operates on regular expressions and provides powerful search and replace operations.  The substitute function searches either a bound string or the default variable $_ for a pattern and performs a replacement operation as described in the following forms:
  1. $string =~ s/PATTERN/REPLACEMENT/;
  2. s/PATTERN/REPLACEMENT/;

The first occurrence of PATTERN is replaced with REPLACEMENT.  You can replace every occurrence of PATTERN by adding the g modifier.  You can also ignore character case by adding the i modifier.

Example 1.  The following code searches and replaces a pattern at the start of a string with a new pattern.
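
A minimal sketch of such a substitution, assuming a hypothetical string in which the word to replace appears twice:

```perl
use strict;
use warnings;

my $line = "cat and cat";   # hypothetical sample string

# ^ anchors the pattern, so only a "cat" at the start is replaced.
(my $changed = $line) =~ s/^cat/dog/;

# Even with g, the anchored pattern can match only once, at the start.
(my $global = $line) =~ s/^cat/dog/g;

print "$changed\n";   # dog and cat
print "$global\n";    # dog and cat
```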
Example 2.  The following form-decoding statement used in the Parse_Form script removes any possible server-side include statements from incoming data as a security precaution.  Note that the replacement is empty (the two slashes // have nothing between them), so the matched text is simply deleted.
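
The statement below is a commonly seen form of this kind of filter.  It is offered as an assumption about what such a statement looks like, not as the actual code of the Parse_Form script, and the input value is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical form value containing a server-side include directive.
my $value = 'Hello <!--#exec cmd="/bin/ls" -->world';

# (.|\n)* matches any characters, including newlines, between the comment
# delimiters.  The replacement is empty, so the match is deleted.
# Because * is greedy, everything from the first <!-- to the last -->
# is removed.
$value =~ s/<!--(.|\n)*-->//g;

print "$value\n";   # Hello world
```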
Example 3.  The following code removes all HTML tags from a string:
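
A minimal sketch, assuming a hypothetical HTML fragment.  The pattern is deliberately simple and would be fooled by a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment.
my $html = '<p>Click <a href="index.html">here</a> to <b>continue</b>.</p>';

# [^>]* matches everything up to the next ">", so each tag is matched
# separately; the empty replacement deletes it, and g repeats the process.
(my $plain = $html) =~ s/<[^>]*>//g;

print "$plain\n";   # Click here to continue.
```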

The pack Function

The syntax of the pack function is:
    pack("TEMPLATE", inputList);

The pack function takes a template made up of character codes and a list of values, and packs the values into a string in the format the template specifies.  A number after a code gives a width or repeat count.  Like the tr function, pack converts data from one form to another, but it can handle much more than single characters.  Two of the character codes and their meanings are shown in the following table:
 
     Character
      Code     Meaning
     ------------------------------------------------
       A       Text string, padded with spaces to the
               specified width
       C       Unsigned character value (a number converted
               to the character with that code)

   Pack template examples
   $packed = pack("A10","CMPS");
   print "$packed (A10)\n";
   $packed = pack("A5","CMPS");
   print "$packed (A5)\n";
   $packed = pack("A","CMPS");
   print "$packed (A)\n";
   $packed = pack("C3",67,66,83);
   print "$packed (C3)\n";
   $packed = pack("C2",67,66,83);
   print "$packed (C2)\n";
   $packed = pack("C",67,66,83);
   print "$packed (C)\n";

   (Output)
   CMPS       (A10)  
   CMPS  (A5)
   C (A)
   CBS (C3)
   CB (C2)
   C (C)      

More Regular Expression Examples

Example 1.  The substitute statement in the following code matches a two-digit hexadecimal number identified by a preceding percent sign and converts it to the character with that ASCII value.  The pair of parentheses around the hexadecimal part of the pattern puts the matched text in the back-reference variable $1.  The hex function converts the hexadecimal value to a decimal value, and the pack function converts that decimal ASCII number to the character it represents.
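
To make the individual steps visible, the sketch below performs the conversion one stage at a time on a single hypothetical escape sequence.  The variable names are illustrative only.

```perl
use strict;
use warnings;

my $escape = "%41";   # hypothetical escape sequence

# The parentheses capture the two hex digits; in list context the
# match returns the captured text.
my ($hex_digits) = $escape =~ /%([a-fA-F0-9][a-fA-F0-9])/;

my $decimal = hex($hex_digits);       # hexadecimal string to number
my $char    = pack("C", $decimal);    # number to character

print "$hex_digits $decimal $char\n";   # 41 65 A
```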
Example 2.  Consider the following important form-decoding statement used in the Parse_Form script:

$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
This statement matches a two-digit hex number prefixed by a % sign.  Since the two-character code is surrounded by parentheses, the hexadecimal number is passed on to the replacement half of the substitution via the variable $1.  The "e" at the end is a modifier that tells Perl to evaluate the right-hand side as an expression.  The pack function converts the numeric value of the hex code (hex($1)) back to a character.  The "g" modifier tells Perl to make this change globally across the entire string.
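
Applied to a hypothetical encoded value, the statement behaves as follows:

```perl
use strict;
use warnings;

my $value = "Hello%2C%20World%21";   # hypothetical URL-encoded form value

# %2C, %20, and %21 are replaced by a comma, a space, and an
# exclamation mark respectively.
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;

print "$value\n";   # Hello, World!
```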


Differences Between Substitute Function and Translate Function

Unlike the substitute function, which uses regular expressions to construct patterns, the translate function does not perform regular expression matching.  The translate function performs general-purpose character-by-character translation.  The following code demonstrates the differences between these two functions.
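
A minimal sketch of the contrast, assuming a hypothetical string that contains periods.  The period is an ordinary character to tr but a meta-character to s.

```perl
use strict;
use warnings;

my $text = "a.b.c";   # hypothetical sample string

# tr maps characters one for one; "." is just a period here.
(my $translated = $text) =~ tr/./-/;

# s treats "." as a meta-character that matches any single character.
(my $substituted = $text) =~ s/./-/g;

# tr returns the number of characters it matched; with an empty
# replacement list the string is left unchanged, so this only counts.
my $periods = ($text =~ tr/.//);

print "$translated\n";    # a-b-c
print "$substituted\n";   # -----
print "$periods\n";       # 2
```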