
Object-Oriented CGI Scripting with PERL

by Immo Hüneke

Introduction

We hear an awful lot about Java these days, so it is easy to forget that much of the World Wide Web, and possibly an even greater proportion of intranets, runs on Perl. This article sheds some light on why Perl is such a good match for the Web, and gives some examples of its use to provide intranet and Internet application functions. The Perl model of object-oriented programming is essential to its success in this role.

What is Perl

Brief History

This language is essentially the creation of one man, Larry Wall. First released into the public domain in 1987, it has been gaining in popularity ever since. In the give-and-take tradition of the Internet, a large army of unpaid collaborators has contributed ideas, improvements and extensions, as well as extensive documentation. Originally designed as the UNIX system manager’s Swiss Army knife, Perl really came into its own when people discovered how simple it was to build Web applications using Perl as the scripting language.

Perl is a semi-compiled language (each script is compiled to an internal form when it is started, and that form is then interpreted) optimised for scanning arbitrary text files, extracting information from them and printing reports based on that information. The language is easy to use, efficient and fairly complete, but not elegant in the Smalltalk mould. Hence the name: Practical Extraction and Report Language. It combines some of the best features of sed, awk, sh and csh. Its expression syntax quite closely resembles that of C.

Advantages of Perl

Unlike most Unix utilities, Perl does not arbitrarily limit the size of your data: for example, if you have enough memory, Perl can read in your whole file as a single string. Recursion is of unlimited depth. Arrays, and the hash tables used by associative arrays, grow as necessary to prevent overflow or degraded performance. Native integers and floating-point numbers have the range and precision of the underlying machine, but the Math::BigInt and Math::BigFloat library modules provide arbitrary-precision arithmetic where it is needed. The great strength of Perl lies in its sophisticated pattern matching techniques that can scan or change large amounts of data very quickly. Although optimised for scanning text, Perl can also deal with binary data, and can make dbm files look like associative arrays (where dbm is available, and it can simulate it where it isn’t!).

Setuid Perl scripts are safer than C programs through a dataflow tracing mechanism (“tainting”), which prevents many stupid security holes. This is particularly important for Common Gateway Interface (CGI) scripts running under a Web server. The CGI sets up environment variables and standard i/o channels from an incoming HTTP request and invokes the requested script with the remainder of the URL as argument values. Any values received by the script via environment variables or command-line arguments are therefore “untrusted”. With tainting turned on, Perl ensures that, for example, other programs and scripts cannot be invoked via the system call if the PATH variable was inherited and not explicitly set by the Perl script.
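The following is a minimal sketch of the two habits that tainting enforces: setting PATH explicitly, and "untainting" a value by capturing it in a match expression. The helper name clean_script_name and the set of permitted characters are assumptions made for illustration; they are not taken from the wrapper described in this article.

```perl
use strict;
use warnings;

# Under taint mode (perl -T), values from %ENV, @ARGV and input are tainted.
# PATH must be set explicitly before any external program can be run.
$ENV{'PATH'} = '/bin:/usr/bin';
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

# Hypothetical helper: accept only a plain script name and return the
# captured text. Text captured by a successful match is not tainted.
sub clean_script_name {
    my ($raw) = @_;
    return undef unless defined $raw;
    return $raw =~ /^([A-Za-z0-9_][A-Za-z0-9_.-]*)$/ ? $1 : undef;
}

my $name = clean_script_name ('feedback');           # accepted
my $bad  = clean_script_name ('feedback;rm -rf /');  # rejected: undef
```

The match expression should be as restrictive as the application allows; a pattern such as (.*) would untaint the value without checking anything.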

Because Perl has been ported to a large number of platforms, including virtually every flavour of UNIX, Windows NT, MacOS, OS/2 and even DOS, many scripts require only minimal adaptation to port them to a different operating system. Another advantage of Perl is the large and growing class library of re-usable “modules”, which in favourable cases can turn a multi-man-month programming task into the work of a few hours. More about the Perl module library anon.

Object Oriented Programming in Perl

Perl has an object-oriented model similar to Java or Smalltalk (with automatic garbage collection). In Perl, classes are defined using the package keyword and conventionally instantiated using new. But underlying this familiar-looking syntax is a more powerful and generic mechanism.

There are only three basic data types in Perl: scalars (denoted by a leading $), arrays (denoted by a leading @) and hashes (denoted by a leading %). An array is an ordered list of scalars, while a hash is an associative array of scalars. A scalar holds a single value, whether text, a number or packed binary data; conversion between string and numeric form occurs on the fly as expressions are evaluated. Perl doesn’t support pointers, but it does have references (which are stored in scalars like anything else). An object is simply a referenced item that has been marked as belonging to a class, and a class is a package with methods. A method is a subroutine that expects an object reference (or a package name, in the case of a class method) as its first argument, and Perl supports a method invocation syntax that makes this parameter implicit. Unlike C++, Perl has no private or protected members: it gives you as much programming power as possible and expects you to use it responsibly.

A constructor method is any method that returns an object reference. Conventionally, this is called new, so that it can be invoked as in this example:

 

$item = new WebItem ($path);

Here is part of the constructor function in the package WebItem (from the source file WebItem.pm). Note how the argument values are extracted from the array @_ and the use of the bless keyword to specify the class of the memory location referenced by $self.

 

sub new
{
    my ($class, $filename) = @_;
...
    my $self         = [];
    $self->[$path]   = $filename;
...
    return bless ($self, $class);
}

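Run-time binding is supported, as in C++ or Java: a method invoked through the returned reference is looked up at run time, first in the package $class and then in any superclasses named in that package’s @ISA array. Because the constructor blesses the reference into $class rather than into a hard-coded package name, subclasses can inherit it unchanged.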

Perl doesn’t support the concept of user-defined types, e.g. record structures, so member variables of a class need to be stored somewhere. A frequently used pattern is to create an associative array (a hash, using the names of the members as subscripts). In the above example, however, $self is created as a reference to an empty array. The member variables are identified by predefined constants, such as

 

$path = 0;
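To make the pattern concrete, here is a small self-contained sketch of an array-based class with index constants, an accessor and a subclass that overrides one method. The package names WebItemSketch and HtmlItemSketch and the describe method are invented for this illustration and are not part of the real WebItem package.

```perl
use strict;
use warnings;

package WebItemSketch;

# Index constants for the member slots (the "$path = 0" idea).
my ($path, $title) = (0, 1);

sub new {
    my ($class, $filename) = @_;
    my $self = [];
    $self->[$path]  = $filename;
    $self->[$title] = undef;
    return bless ($self, $class);   # $class may be a subclass name
}

sub pathname { my ($self) = @_; return $self->[$path]; }
sub describe { my ($self) = @_; return 'item ' . $self->pathname; }

package HtmlItemSketch;
our @ISA = ('WebItemSketch');       # inherit new and pathname

# Override: found at run time by searching the object's own class first.
sub describe { my ($self) = @_; return 'HTML page ' . $self->pathname; }

package main;

my $item = HtmlItemSketch->new ('/docs/index.html');
print ref ($item), ': ', $item->describe, "\n";
```

An array-based object is compact and quick to index, at the cost of keeping the index constants in step with the members; the hash-based pattern trades a little speed for self-describing member names.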

By providing a hierarchical package naming system and isolating the namespaces between packages, Perl makes it easy to create reusable modules. Some modules are described later in this article.

Secure CGI

Normal convention on Web servers is to place CGI scripts in a directory named cgi-bin, which is normally directly beneath the document root. A typical URL for a script is thus http://www.logica.com/cgi-bin/feedback. The Web server is configured to recognise the cgi-bin directory as the signal to pass the file designated by the path to the CGI instead of returning it to the browser. The CGI invokes the script or program using a system call – so any shell metacharacters embedded in the URL will be interpreted by the shell without any safeguards.

To prevent a hacker from executing arbitrary commands on the Web server, I use a “CGI wrapper” script, which allows me to implement additional safeguards. All scripts are kept together in a directory named scripts, which is not under the server’s document root, so the server will not show the scripts to anyone. A so-called URL mapping is triggered by the /cgi-bin/ prefix of URLs to execute the wrapper, which lives in the scripts directory and is called cgi-bin to make clear its function. The remainder of the URL is passed by the CGI to the wrapper as PATH_INFO and QUERY_STRING environment variables.

The environment is made available to Perl scripts in the hash %ENV. So to extract the environment variable X, use $ENV{'X'}. For example,

 

if ($ENV{'QUERY_STRING'} =~ /^advanced/i) {...}

NB: under Windows NT, most Web servers will only execute binaries and batch files, not Perl scripts (UNIX supports the convention that the first line of a script may name the script interpreter). To run a batch file, the CGI executes command.com, thereby circumventing the wrapper. To avoid this, create a small C program and build it as, for example, cgi-bin.exe:

 

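#include <process.h>   /* declares execlpe */
#include <stdio.h>     /* defines NULL */

int main (int argc, const char *argv[], const char *envp[])
{
    /* Backslashes must be doubled inside C string literals */
    execlpe ("c:\\utils\\perl.exe",
             "perl",
             "d:\\www\\scripts\\cgi-bin",
             NULL,
             envp);
    // ... print HTML error page here if exec failed
    return 1;          /* only reached if the exec failed */
}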

The PATH_INFO environment variable contains the URL, minus the leading substring containing the script name. The wrapper extracts the leading pathname component of the remainder (using a match expression to “untaint” the string), which it treats as the name of the script to be executed. The PATH_INFO is adjusted accordingly before the script is actually run by means of the statement

 

eval {require $script};
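As a rough sketch of the extraction step, the function below splits PATH_INFO into the name of the script and the remainder of the path. The function name split_path_info and the decision to allow only letters, digits, underscores and hyphens in a script name are assumptions for illustration.

```perl
use strict;
use warnings;

# Split PATH_INFO into the script name and the remaining path.
# Returns an empty list if the leading component looks unsafe.
sub split_path_info {
    my ($path_info) = @_;
    return () unless defined $path_info;
    # First component: letters, digits, underscore and hyphen only.
    if ($path_info =~ m{^/([A-Za-z0-9_-]+)(/.*)?$}) {
        my ($script, $rest) = ($1, defined $2 ? $2 : '');
        return ($script, $rest);
    }
    return ();
}

my ($script, $rest) = split_path_info ('/feedback/forms/contact');
# $script is now 'feedback' and $rest is '/forms/contact'
```

Because the script name comes out of a capture group, it is untainted and can be handed to require; the remainder would be written back to $ENV{'PATH_INFO'} before the script runs.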

However, before doing so, the wrapper performs a number of other useful functions, several of them security-related. First of all, it consults a small database that contains characteristics of each script. Unless the script is in the database the server will refuse to run it. Scripts within the database are characterised as internal or external, Perl or binary, plain-text or HTML (or in certain cases, “noheaders”). “Internal” scripts are prohibited from running unless the wrapper determines that the user is an administrator or a developer (how this is determined depends on circumstances; usually, for example, the server manager runs on a special port number with restricted access, and this scheme could be extended to cover secure scripts accessible only via SSL). If the script is permitted to run, the wrapper produces the appropriate HTTP header and parses the body of the request (typically the fields of a form), to save the script having to do this.
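The permission check can be pictured along the following lines. This is only a sketch: the hash %script_db stands in for the real script database, whose format is not described here, and the attribute names and the may_run function are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the script database.
my %script_db = (
    'feedback' => { scope => 'external', type => 'perl', output => 'html'  },
    'rebuild'  => { scope => 'internal', type => 'perl', output => 'plain' },
);

# Decide whether a script may run; returns its attributes or undef.
sub may_run {
    my ($script, $is_admin) = @_;
    my $attr = $script_db{$script};
    return undef unless defined $attr;    # not registered: refuse
    return undef if $attr->{scope} eq 'internal' && !$is_admin;
    return $attr;
}

my $attr = may_run ('feedback', 0);
print "feedback produces $attr->{output}\n" if defined $attr;
```

Refusing anything that is not registered means that a file dropped into the scripts directory by mistake cannot be executed through the Web server.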

The standard output of a CGI script is piped by the Web server directly back to the client. CGI scripts therefore usually output HTML pages for display in the browser. In Perl, there are several ways to do this. The most obvious, a series of print statements, is also the least efficient and the most difficult to maintain. A better method is to use a format statement to define a page template and to generate the page with a corresponding write statement once all the field values have been computed.

The method I normally use is to hold the page template in a separate file, which has the advantage that a standard HTML editor can be used to create and maintain it. Field placeholders are indicated by HTML comments of the form <!--afield--> and an associative array (a “hash”) holds the corresponding values. When values are substituted for placeholders, any that are undefined simply remain as comments and will not be displayed by the browser. This scheme is simple but powerful, and can be extended to embrace nested and recursive macro definitions, parameterised macros and more.
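A minimal version of the substitution step might look like this. The function name expand_template is an assumption, and the sketch handles only simple placeholders, not the nested or parameterised macros mentioned above.

```perl
use strict;
use warnings;

# Replace <!--name--> placeholders with values from the hash %$macro.
# Placeholders with no defined value are left in place as comments.
sub expand_template {
    my ($template, $macro) = @_;
    $template =~ s{<!--(\w+)-->}
                  {defined $macro->{$1} ? $macro->{$1} : '<!--' . $1 . '-->'}ge;
    return $template;
}

my %macro = ('title' => 'Feedback', 'user' => 'immo');
print expand_template ("<H1><!--title--></H1>\n<P>Hello <!--user--><!--extra--></P>\n",
                       \%macro);
```

Here the title and user placeholders are replaced, while the extra placeholder has no value and therefore stays in the page as an invisible comment.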

Newcomers to CGI programming are usually puzzled when their script produces no output or just a “Server Error”. What has normally happened is that Perl has reported an error (e.g. failure to compile due to a syntax error) and the Web server has been unable to find a valid HTTP header at the start of the resulting output. By trapping errors with the eval statement shown above, the wrapper is able to generate intelligible error reports to the user, log the problem in a separate script-error log and even email the Web site owner to notify her of the problem.
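The trapping technique can be sketched as follows. The function run_trapped and the layout of the error page are assumptions for illustration; the real wrapper also logs the error and sends email, which is omitted here.

```perl
use strict;
use warnings;

# Run a piece of code and turn any die into a complete HTML error page.
# Returns an empty string if the code ran without error.
sub run_trapped {
    my ($code) = @_;
    my $ok = eval { $code->(); 1 };
    return '' if $ok;
    my $msg = $@ || 'unknown error';
    chomp $msg;
    $msg =~ s/&/&amp;/g;          # escape characters special to HTML
    $msg =~ s/</&lt;/g;
    $msg =~ s/>/&gt;/g;
    return "Content-type: text/html\n\n"
         . "<HTML><BODY><H1>Script error</H1><PRE>$msg</PRE></BODY></HTML>\n";
}

print run_trapped (sub { die "template <main> not found\n" });
```

An eval block around a require statement traps compilation errors in the required file in the same way, because the file is compiled at the moment the require is executed.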

To summarise, the CGI wrapper isn’t an object in the Perl language, but uses an object oriented approach – it forms a framework with strictly defined external and internal APIs, within which other large-grained objects (scripts) can perform useful functions.

Useful Classes for Web Servers

This section briefly describes some reusable utility classes that I have written for Web projects.

Meta Data Extractor

We have already met the WebItem class in some of the preceding examples. The WebItem class is designed to provide a convenient interface to any object that may be stored on a Web server. To create an instance of this class, pass it the pathname of the file on disk where the corresponding object is stored. The WebItem can then return information about the object, such as its content-type and size, and if available, document management info such as author and title.

To incorporate the package in a program, the following syntax is used:

 

use WebItem;

# ... put pathname of some file into $path

my $item = new WebItem ($path);

The meta data of the file is then extracted by invoking member functions of the WebItem. The following statements print out all the meta-information:

 

print $item->pathname, "\n",
      $item->title,    "\n",
      $item->author,   "\n",
      $item->publisher,"\n",
      $item->size,     "\n",
      $item->mimetype, "\n",
      $item->encoding, "\n",
      $item->category, "\n";
my ($cday, $cmonth, $cyear) = $item->creatdate;
my ($mday, $mmonth, $myear) = $item->moddate;
print join ('/', $cday, $cmonth + 1, $cyear),"\n",
      join ('/', $mday, $mmonth + 1, $myear),"\n";

The meta-data is gathered by the WebItem class during execution of the constructor function. It makes use of the stat system call to obtain size and timestamp information, and if the file is an HTML page, uses pattern matching to find any META tags containing useful information. I chose to ignore efficiency in the implementation of this class in the interest of simplicity. It could be optimised to filter information out of the file only when required. In Perl, it is easy to tell whether a variable has been initialised, using the if (defined ($x)) syntax.

The member functions that return values are very simple. Here is an example:

 

sub title
{
    my ($self) = @_;
    return $self->[$title];
}
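The optimisation mentioned above, extracting information only when it is first requested, could be sketched as follows. The package name LazyItemSketch is invented, and the pattern used to find the author META tag is a simplification that assumes the NAME attribute comes before CONTENT and that both values are in double quotes.

```perl
use strict;
use warnings;

package LazyItemSketch;

my ($path, $size, $author) = (0, 1, 2);   # member slot indices

sub new {
    my ($class, $filename) = @_;
    my $self = [];
    $self->[$path] = $filename;           # the other slots stay undefined
    return bless ($self, $class);
}

# Call stat only on first use, then reuse the stored value.
sub size {
    my ($self) = @_;
    $self->[$size] = (stat $self->[$path])[7]
        unless defined ($self->[$size]);
    return $self->[$size];
}

# Scan the file for an author META tag only on first use.
sub author {
    my ($self) = @_;
    unless (defined ($self->[$author])) {
        $self->[$author] = '';
        if (open (my $fh, '<', $self->[$path])) {
            local $/;                     # read the whole file at once
            my $html = <$fh>;
            close $fh;
            $self->[$author] = $1
                if defined ($html)
                && $html =~ m{<meta\s+name\s*=\s*"author"\s+content\s*=\s*"([^"]*)"}is;
        }
    }
    return $self->[$author];
}

package main;
```

Each accessor tests its slot with defined, so the cost of stat or of reading the file is paid at most once per object and not at all for members that are never requested.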

Configurable Error Reporter

In one project recently I needed to co-ordinate the diagnostic output of a number of different scripts, each of which performed a processing step in a Web publishing system. Moreover, another program needed to be able to parse the output in the case that the scripts were being run under machine control. A second requirement was to place all message texts in a file, so that it would be easy to change them into a different language or alter a particular condition from e.g. a warning to a serious error (in other words, to set policy regarding pass and fail under specific conditions).

A file format for the message file was defined and a package (MsgFile) created to encapsulate it, which all scripts could then use. This guaranteed consistent formatting of the warning and error messages. Here is how it is used:

 

use MsgFile;

my $msgpath  = '/usr/www/config/example.cfg';
my $me       = 'Testing';
my $subsys   = 99;
my $messages = new MsgFile ($msgpath, $subsys);

Assume that the file example.cfg contains some definitions as shown below:

 

# Level 2 means warning
= illegal_entity        2 001
the following entity is not valid within the current character set: %1%
(suggested alternative: %2%)
# Level 3 means error
= no_such_file          3 002
referenced file %1% not found

Now simulate the occurrence and reporting of an error:

 

my $file1  = 'example1/nonesuch.txt';
my $file2  = 'example2/nonesuch.dat';
my $symb1  = 'no_such_file';
my ($severity, $error_code, $error_msg) = $messages->defn ($symb1, $file1);
print "$severity $error_code, $file2 line 999: $error_msg\n";

This results in the following output:

 

ERROR 00993002, example2/nonesuch.dat line 999: referenced file example1/nonesuch.txt not found

The same result can be produced using the recommended formatting function:

 

my $diag = $messages->format ($symb1, $file2, 999, $file1);

print $diag;
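For readers curious how such a package might work internally, here is a rough sketch of parsing the message file and filling in the %1%, %2% placeholders. It is not the MsgFile source: the function names are invented, and the layout of the composite error code (four-digit subsystem, one-digit level, three-digit message number) is inferred from the sample output above.

```perl
use strict;
use warnings;

my %level_name = (2 => 'WARNING', 3 => 'ERROR');   # as in the sample file

# Parse message definitions: a line "= symbol level number" starts an
# entry, "#" lines are comments, and other lines are message text.
sub parse_messages {
    my (@lines) = @_;
    my (%defs, $current);
    foreach my $line (@lines) {
        chomp $line;
        next if $line =~ /^#/ || $line =~ /^\s*$/;
        if ($line =~ /^=\s*(\w+)\s+(\d+)\s+(\d+)\s*$/) {
            $current = $1;
            $defs{$current} = { level => $2, number => $3, text => '' };
        }
        elsif (defined $current) {
            $defs{$current}{text} .= "\n" if $defs{$current}{text} ne '';
            $defs{$current}{text} .= $line;
        }
    }
    return \%defs;
}

# Return (severity, code, text) with placeholders replaced by @args.
sub define_message {
    my ($defs, $subsys, $symbol, @args) = @_;
    my $d = $defs->{$symbol} or return ();
    my $text = $d->{text};
    $text =~ s/%(\d+)%/defined $args[$1 - 1] ? $args[$1 - 1] : '%' . $1 . '%'/ge;
    my $code = sprintf ('%04d%d%03d', $subsys, $d->{level}, $d->{number});
    return ($level_name{$d->{level}}, $code, $text);
}
```

Keeping the level in the file rather than in the scripts is what allows a condition to be promoted from warning to error by editing a single line.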

Another way to format the error message is to encode it as a META tag for inclusion in an HTML header (using the page template approach described earlier):

 

$macro{'error_meta'} = $messages->meta ($me, $diag);

The diagnostic message may be more useful if a piece of the offending file is shown.

 

my $linenum = ''; # The next procedure will set it to a number
my $file3   = 'example3/nonesuch.htm';
# The code is quoted: a bare 00992001 would be read as an (invalid) octal number
my $symb2   = $messages->symbol ('00992001');
print $messages->listing (undef, 4321, \$linenum, $file3);
print $messages->format ($symb2, $file3, $linenum, '&UUML', '&Uuml;');

This might result in the following:

 

<P>This is the line with a bad &UUML; entity in it.</P>

                               ^

WARNING 00992001, example3/nonesuch.htm line 132: the following entity is not valid within the current character set: &UUML

(suggested alternative: &Uuml;)

CGI Simulator

Perl contains a useful debugger, which allows Perl scripts to be listed, breakpointed, single-stepped and so on. Whenever the script is halted in the debugger, arbitrary Perl expressions can be evaluated – which means that the developer can inspect and change variable values, invoke subroutines with arbitrary parameters etc.

However, the debugger cannot be used under the CGI, as the standard input and output are tied via pipes to the Web server. So I have created a simple script that sets up the environment variables as if the script had been invoked from the CGI, goes into debug mode and then invokes the CGI wrapper script. The trick is to start the script off with the line

 

#!/usr/local/bin/perl -dw

which invokes the debugger. The initialisation statements are all contained in a BEGIN block, which is evaluated during compilation and therefore not subject to debugging. The first statement after the BEGIN block is an eval that runs the real CGI wrapper. The debugger halts the script on encountering this statement. It is now possible to input manually any further initialisation statements necessary (e.g. to include needed library packages) before single-stepping into the CGI wrapper.
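The overall shape of such a simulator script is sketched below. The environment values and the wrapper location are placeholders chosen for illustration, and the -dw interpreter line is left out so that the fragment can also be run outside the debugger.

```perl
use strict;
use warnings;

# In the real simulator the first line would be: #!/usr/local/bin/perl -dw
my $wrapper;
BEGIN {
    # Runs at compile time, before the debugger takes control.
    $ENV{'REQUEST_METHOD'} = 'GET';
    $ENV{'SERVER_NAME'}    = 'localhost';
    $ENV{'SERVER_PORT'}    = '80';
    $ENV{'SCRIPT_NAME'}    = '/cgi-bin';
    $ENV{'PATH_INFO'}      = '/feedback';
    $ENV{'QUERY_STRING'}   = 'advanced';
    $wrapper = '/usr/www/scripts/cgi-bin';   # hypothetical location
}

# Under -d the debugger stops here; use "s" to step into the wrapper.
eval { require $wrapper };
my $error = $@;    # empty if the wrapper ran cleanly
```

To simulate a form submitted with the POST method, the script would also need to set CONTENT_LENGTH and arrange for the encoded form fields to be available on standard input.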

On another occasion it was necessary to execute some of the Netscape server manager CGI scripts within another script (CGI or command-line). Again, the easiest way to do this was to emulate the Netscape server manager CGI. I created the package NetAdmin, which could invoke any script under Netscape’s bin directory. Setting up the environment proved tricky, requiring a combination of inherited environment variables, registry entries and so on. A new instance of NetAdmin is created for each script; the script name and query string are passed as parameters to the constructor function. HTML form fields can be individually set using access methods – when the script is finally invoked (using NetAdmin::doit) these are concatenated using the usual HTTP encoding and piped to the script’s standard input.

Perl Packages

This section describes a few of the standard packages that I have found to be useful.

Net::Domain

Along with a number of other packages in the Net hierarchy, this provides access to standard IP functionality. Net::Domain offers the methods hostname, domainname and fqdn, which allow a script to find out on which machine / network it is running. Beneath the covers, this package is pretty intelligent – it knows half a dozen ways of discovering the hostname, for example. Once found, the information is cached for a fast response to later invocations.

Net::SMTP

The SMTP, FTP etc. packages provide basic APIs to the important ARPA services. For example, the following simple script fragment connects to a mail server, submits a message body in an envelope containing one “from” address and multiple “to” addresses, and disconnects. (The message subject will be in the header of the body).

 

use Net::SMTP;
...
    $smtp = Net::SMTP->new ($mailhost);   # the semicolon is required here
    die "Can't send!" unless (defined $smtp);
    $smtp->debug ($debug); # Turn on tracing, if desired

    $smtp->mail ($from) if $smtp->ok;
    $smtp->to (@to_addrs) if $smtp->ok;
    $smtp->data ($buf."\n") if $smtp->ok;
    $smtp->quit () if $smtp->ok;

CGI

This package exists to support a number of CGI functions, including the parsing of HTML form fields and query strings, and the dynamic generation of HTTP and HTML. There are actually two versions of CGI, one of which contains everything in one package and the other of which subdivides the functionality into a number of separate pick-and-choose packages.

The author of both packages, Lincoln Stein, has recently been unable to keep the larger and more extensive modular version up to date with all the bug fixes made to the single-module version. The single-module version is also a lot faster to load, so it is currently the recommended one. However, I used the modular version in the examples below.

 

#!/usr/local/bin/perl -T
...
# Import the CGI modules
use CGI::Base qw(StatusHdr ContentTypeHdr SendHeaders html_escape);
...
    # Send any headers the script may require
    if ($script_attribute{'plain_text'})
    {
        SendHeaders (ContentTypeHdr ('text/plain'));
    }
    else
    {
        SendHeaders (ContentTypeHdr ('text/html'))
            unless ($script_attribute{'noheaders'});
    }

Later on, once we have established that an HTTP request needs to be parsed, we pull in the CGI::Request package. I had to make a patched version of CGI::MultiPartBuffer (a package included in CGI::BasePlus) to ensure that a timeout would occur if a multipart form (HTTP upload) was interrupted. Moreover, I used a subclass of CGI::Request with additional capabilities for generating dynamic HTML: CGI::Form. In practice, very few scripts use this facility, so it would have been sufficient and less expensive in performance terms to use CGI::Request.

 

    # Read CGI info - use an eval in case the client request is bad
    require CGI::Form; # Form is a Request subclass supporting HTML output
    require 'mpbufpatch.pl';  # Bug-fix CGI::MultipartBuffer package
    eval {$req = new CGI::Form;};
    $error_msg = $@;          # Will be blank if no error occurred
    unless (defined ($req))
    {
        # error handling code here
    }
    else
    {
        # Eval the script in case it fails
        eval {require $script};
        $error_msg  = $@;
        if ($error_msg ne '')
        {
            # error handling here
        }
    }

Watch out – the useful html_escape function in CGI::Base returns a list. If you assign the result to a scalar, you’ll be surprised to find that the result appears to be just a number: the count of list elements. This can be prevented by turning the scalar into a temporary list, to which the result can be assigned (second example below):

 

@error_report = html_escape (@error_report);

($error_msg) = html_escape ($error_msg);
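The effect is easy to reproduce with any subroutine that returns an array. In the sketch below, escape_all is a hypothetical stand-in written for this illustration; it is not the html_escape function from CGI::Base.

```perl
use strict;
use warnings;

# Hypothetical stand-in that returns its results as an array.
sub escape_all {
    my @out = @_;
    foreach (@out) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }
    return @out;
}

my $count     = escape_all ('a < b');   # scalar context: the element count
my ($escaped) = escape_all ('a < b');   # list context: the first element

print "$count\n";     # prints 1
print "$escaped\n";   # prints a &lt; b
```

The parentheses around $escaped are what put the assignment into list context.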

Other useful functions of CGI::Request return all HTTP request parameters and HTML form fields. These are used by scripts under the CGI wrapper. For example:

 

my $remote_user = $req->cgi->var ('REMOTE_USER');
...
if (defined $req->param ('template'))
{
    $template_name = $req->param ('template');
}

File::Find

Many scripts require some sort of list of files under some sub-tree of the directory hierarchy on the server. The package File::Find traverses a directory tree and invokes a callback subroutine of your choosing for each item found. You can choose whether each directory is visited before its contents (using find) or after them (using finddepth); sometimes this is important. Your callback routine can be used to filter (e.g. by matching each item against a wildcard expression to build a list of matching files) or to execute some action against each file. You can also combine these possibilities into one callback.
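As a small sketch of the filtering use, the following function collects every plain file whose name matches a pattern. The function name matching_files and the sample pattern are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Find;

# Collect every plain file under $root whose name matches $pattern.
sub matching_files {
    my ($root, $pattern) = @_;
    my @found;
    find (sub {
              # $_ is the name within the current directory,
              # $File::Find::name is the pathname starting from $root
              push @found, $File::Find::name if -f $_ && $_ =~ $pattern;
          }, $root);
    @found = sort @found;
    return @found;
}

my @pages = matching_files ('.', qr/\.html?$/i);
print "$_\n" foreach @pages;
```

The callback is an anonymous subroutine, so it can add directly to the @found array of the enclosing function without any global variables.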

The following example was used to force all pathnames to lowercase within a particular directory subtree of a Web server. I used finddepth so that the contents of each directory would be renamed before the directory itself; otherwise the renaming of a leaf could fail because its parent directory no longer existed under its old name. The subroutine convert is the callback:

 

sub convert
{
    my $file     = $File::Find::name; # extract full pathname of current object
    my $new_file = $file;
    $new_file =~ s:/([^/]+)$:/\L$1:; # convert final component to lowercase
    return if ($new_file eq $file);
    print "rename $file to $new_file\n";     # display what we’re about to do
    rename ($file, $new_file)                # then do it; no shell is involved,
        or warn "cannot rename $file: $!\n"; # so odd characters in names are safe
}

Note the substitution (s) expression that converts a pathname to lowercase. Because I wished to match the pathname separator in the match expression, I used the colon (:) as a delimiter instead of the more usual slash (/). The expression to be matched begins with a slash, followed by the parenthesised character-class expression [^/]+, which means “one or more occurrences of any character except a slash”. The expression ends with the symbol $, which stands for “end of line” or “end of string” and anchors the matched substring at the right-hand end of the original string. If matched, the substring matching the first expression in parentheses is assigned to $1 – so on the right-hand side, the substitution consists of a slash followed by the lowercase equivalent of $1 (i.e. \L$1).

The script can be invoked with or without a root directory argument – the default root is the current working directory:

 

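use File::Find;

$dir = $ARGV[0];
unless (defined ($dir) && -d $dir)
{
    chomp ($dir = `pwd`); # strip trailing newline output by pwd command
}
# the '&' identifies convert as a subroutine; no_chdir keeps the working
# directory fixed, so the pathnames stay valid even if $dir is relative
finddepth ({ wanted => \&convert, no_chdir => 1 }, $dir);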

Socket

The socket library allows Perl scripts to use TCP connections efficiently. This article is not a course in socket programming, so I’ll assume a basic familiarity with TCP and ARPA services and just present an example. This is from a watchdog process that checks whether a particular service on the watchdog’s own server is currently up and running.

 

use Socket;
...
    $proto  = getprotobyname ('tcp');
    socket (SOCK, PF_INET, SOCK_STREAM, $proto); # creates a socket "SOCK"

    $remote = 'localhost';
    $iaddr  = inet_aton ($remote);
    $paddr  = sockaddr_in ($service_port, $iaddr);
    if (connect (SOCK, $paddr)) # "SOCK" is the filehandle
    {
        # connect worked
        print SOCK "keep_alive:yes\n";
        close SOCK;
    }

LWP

LWP stands for libwww-perl. It provides a simplified API to the World Wide Web. The original libwww is a collection of components that can be used to build Web clients, servers and more specialised applications. LWP concentrates on making it easy to build HTTP client functionality into Perl scripts, but does contain some modules that have more general communications utility. The modules can be used separately or together.

LWP encourages an object oriented style of communication, but if desired, it can also be used through a very simple procedural interface. It currently supports http, gopher, ftp, news, file and mailto resources.

In terms of http support, LWP implements

          both basic and digest authentication schemes

          transparent redirect handling

          access through proxy servers

          content negotiation (for both clients and server-side CGI).

Much more is provided; e.g. a URL manipulator supporting both absolute and relative URLs, a robots.txt parser, a framework for robots, a framework for a mirror server, an (experimental) HTML parser and formatter, and an interface to Tk. Simple command-line utilities implemented in LWP are available, e.g. lwp-request and GET, which are useful in testing and debugging a server, incorporating in shell scripts, and so on.

However, LWP does have shortcomings arising out of its implementation in Perl: it doesn’t (yet) support multi-threading, so robots and the like are likely to be slow. Moreover, under NT it has been found that it runs out of filehandles eventually if a script keeps on creating new connections. This causes a Perl exception, which unceremoniously terminates the program.

To give a flavour of HTTP programming using LWP, a short example will be presented. The basic classes involved are HTTP::Request, HTTP::Response, and LWP::UserAgent. The script creates a UserAgent to handle the communication (including recovery from errors), and then gives it a series of Request objects to process. Response objects are returned by the UserAgent (note that HTTP::Requests can invoke Web, FTP, gopher and file services).

 

use LWP::UserAgent;
$ua = new LWP::UserAgent;     # create our user agent
$ua->agent ("MyProgram/0.1"); # identify ourselves in access logs

# Create a request - pretend it came from a form
$req = new HTTP::Request('POST', 'http://www.perl.com/cgi-bin/BugGlimpse');
$req->content_type ('application/x-www-form-urlencoded');
$req->content ('match=www&errors=0');

$res = $ua->request ($req);  # Submit the request

# Evaluate the outcome
if ($res->is_success)
{
    print $res->content;
}
else
{
    # handle the error
}

By the way, the URI::URL package provides utility functions to format HTML form contents correctly, so there is no need to write match=www&errors=0 by hand unless you want to. See the LWP cookbook for more examples.

You can specify authentication parameters using the credentials method. If you require more specialised functionality, simply subclass the UserAgent class and modify to suit. LWP is well documented, so it is easy to use in a similar style to the above for ftp, gopher, news, mailto and file requests (in either direction).

Obtaining Further Information

If you have Perl installed on your system, you probably have lots of Perl documentation on-line in man pages and/or as HTML. Look for the perl-extra-libs directory.

To find out the best server from which to download the latest versions of Perl and its libraries, look at http://www.perl.com/perl/CPAN/CPAN.html.

The latest version of LWP can be found at http://www.sn.no/libwww-perl/.

Help and pointers to further reading can be found in the Perl FAQ, which is regularly posted to comp.lang.perl and to news.answers. You can also find an HTML version of it on the Perl language home page.

There are a number of recommended books (reviewed on the above home page):

Programming Perl (2nd Edition) by Larry Wall, Tom Christiansen and Randal Schwartz, O’Reilly & Associates £29.95.

Teach Yourself Perl 5 in 21 Days (2nd Edition) by David Till, Sams Publishing £37.50 (includes a UNIX and Windows NT/95 CD-ROM with Perl 4, Perl 5, all the libraries and the example programs from the book).

Cross-Platform Perl for UNIX and Windows NT by Eric F. Johnson, M&T Books (a division of MIS:Press) £31.99 from http://www.bookshop.co.uk/ (this book, which also includes a CD-ROM, devotes about 100 pages to CGI programming).



Created Thu Oct 23 18:43:26 2003

Copyright ObjectValue Ltd.
