A beginner's guide to CGI scripting

This web page assumes that the reader is fluent in Hypertext Markup Language (HTML), with a passing knowledge of JavaScript, and some acquaintance with Perl. In particular, familiarity with HTML forms is assumed. If you don't understand Perl at all, find Robert's Perl tutorial using e.g. google, slog through it a few times, and get hold of Perl for your system and write a few simple programs. Then read on..
TABLE OF CONTENTS
Terminology
How a Web browser works
HTTP/1.0
What is a CGI script?
      1. Necessary HTML
      2. Data coming into a script
      3. The script responds..
Installing a CGI script
Running it
Security Holes
NOT FOR THE FAINT-HEARTED..
      Special characters
      CGI from JavaScript
      CGI.pm
      MIME
      chmod
      Server-side includes
      Tricks
      Error scripts
      RFC 822
      HTTP/1.0
      HTTP/1.1
      Servers
References

First, some terminology

As with all fairly technical subjects, there are lots of words and abbreviations. Here we will examine just a few of the important ones..

Running CGI scripts on a computer will always potentially compromise the security of that computer!

A. How a browser normally works!

When we learnt HTML, we didn't realise that we were floating on several layers of abstraction. For example, consider the following:

  1. Okay, we're connected to the Internet via our Internet Service Provider. We start up our browser (say, Netscape, or Opera), and want to fetch a web-page. What is the request that the browser sends to fetch a web page at say, http://www.anaesthetist.com/index.htm ? Well, here it is..
    GET /index.htm HTTP/1.0
    Accept: www/source
    Accept: text/html
    Accept: image/gif
    User-Agent:  Netscape/2.0  libwww/2.14
    From:  foo@foobar.com 

    Not what you're used to, is it? The above message is all behind-the-scenes stuff. First, note that after all the blurb is a blank line. Yes, I promise you - it's there. This is very important. How important, we'll find out later!

    Next, see how this is an example of a GET request. As with all such messages, the format is very precisely structured - it sticks rigidly to rules originally defined in something called RFC 822. The request is quite specific. It's asking for HTTP (hypertext transfer protocol) version 1.0, and will only Accept certain types of data in reply. It's pretty easy to determine what types are being referred to - html pages, and gif images. These data types are called MIME types (MIME stands for Multipurpose Internet Mail Extensions - we discuss it in a little more detail far below).

    At this point ..     pause     .. take a deep breath. Although all of the above looks like garbage, look once more at the message, read each line carefully, and see how a little order starts to appear out of the chaos!

    {As an aside, note the From: line, which was a polite little line found in early requests, now totally obsolete due to the enthusiastic activities of spammers}.

  2. Here's what might come back from the server that provides the web page..
    HTTP/1.0 200 OK
    Date: Wednesday, 02-Feb-94 23:04:12 GMT
    Server: NCSA/1.1
    MIME-version: 1.0
    Last-modified: Monday, 15-Nov-93 23:33:16 GMT
    Content-type: text/html
    Content-length: 456

    <HTML> .. the web page goes here.. </HTML>
    

    Note again the header section followed by a blank line. On the first line, "HTTP/version" is followed by a code (200) that indicates "everything went fine". The corresponding code for "not found" is the dreaded 404. A lot of blurb (date, server, MIME version etc) follows. There certainly is a lot of stuff here, isn't there? The MIME type (text/html) and length, and so on. After our mandatory little blank line comes the body of the message - an html page.

    When you actually get around to writing a CGI script, you'll find that the above can lazily be abbreviated to just..

    Content-type: text/html

    <HTML> .. the web page goes here.. </HTML>
    

    The trick is that your CGI script writes such a response, which is then parsed by the server on which your script lives. The final product - the plethora of information that you saw above - is then sent to the user agent.
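To make the "header section, blank line, body" layout concrete, here is a minimal Perl sketch that pulls such a message apart. The routine name split_message is made up for this illustration (it is not part of Perl or of any server), and the sketch assumes the whole message is already sitting in one string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a raw HTTP-style message into a hash of headers and a body string.
# The first blank line marks the end of the header section.
# (split_message is a hypothetical helper written for this example.)
sub split_message {
    my ($raw) = @_;
    # Accept either CR+LF or bare LF line endings.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;
    my %headers;
    foreach my $line (split /\r?\n/, $head) {
        # Header lines look like "Name: value"; anything else
        # (such as the "HTTP/1.0 200 OK" status line) is skipped.
        if ($line =~ /^([^:\s]+):\s*(.*)$/) {
            $headers{lc $1} = $2;
        }
    }
    return (\%headers, $body);
}

my $response = "Content-type: text/html\n\n<HTML> .. the web page goes here.. </HTML>\n";
my ($headers, $page) = split_message($response);
print "type: $headers->{'content-type'}\n";
print "body: $page";
```

Leave out the blank line and everything lands in the header section with an empty body, which is a fair picture of why a server chokes on a script that forgets it.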

Soon we will consider a slight variation on the above, where instead of requesting a web page, the browser asks the server to run a CGI script! But before we do so, a brief comment on the very first line of the message..

B. What is HTTP/1.0 ?

HTTP/1.0 is the basis of the World Wide Web. There is no 'formal' definition or 'standard' (although HTTP/1.0 accounts for about three quarters of Internet traffic). The closest we can come is something called RFC 1945, which was written by Tim Berners-Lee and his colleagues in May 1996, long, long after he started up the World Wide Web. He described HTTP as:

".. an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). A feature of HTTP is the typing of data representation, allowing systems to be built independently of the data being transferred."

Basically, HTTP is a simple protocol for transferring information. HTTP makes it easy to transfer not only HTML documents, but a vast variety of other data. Not only can we retrieve documents, we can also for example search for information, and talk to a variety of programs on computers across the 'Net. HTTP is very similar to the format used for e-mail (RFC 822) and MIME.

The beauty of HTTP/1.0 is that it makes it easy for us to 'talk to' other Internet-based protocols - there's a host of these, including SMTP, NNTP, FTP, Gopher, and WAIS. HTTP/1.0 is an excellent negotiator between these protocols.

Recently the Internet Engineering Task Force (IETF), working with the World Wide Web Consortium (W3C), has defined a new standard called HTTP/1.1, which will probably eventually replace 1.0. Most servers still use 1.0. We discuss HTTP/1.1 below, as well as taking a brief peek at some features of HTTP/1.0. (Don't look at these now).

C. What is a CGI script?

A CGI script is really a program. The program lives on a computer connected to the Internet. The "script" (actually a fully-fledged program) is usually written in the elegant and satisfying language Perl, although a host of other languages may be used (TCL, BASIC [Yuk!], C, Ada, .. and so on). CGI scripts need not necessarily be interpreted scripts; in fact, they can be compiled programs (for example, C is always compiled).

Consider someone (let's call him Fred) who is browsing web pages on the internet. Fred comes across a web page which contains a link that refers to a CGI script. Fred clicks on the link. What happens next?

When Fred clicks on the link, certain parameters are passed by his browser across the Internet, and eventually the CGI script that lives on the server starts running. What happens next depends very much on the nature of the script, but commonly:

  1. The script reads the parameters that were passed to it;
  2. The script writes a new web page back to Fred, so that he can view it.

Let's explore how this happens. First, a VERY SIMPLISTIC example of what the browser request might look like:

GET /cgi-local/myscript.pl HTTP/1.0
Accept: www/source
Accept: text/html
Accept: image/gif
User-Agent:  Netscape/2.0  libwww/2.14 

Note the similarity to our previous, conventional web-page example, including (you guessed it) the blank line at the end. The difference is that, instead of requesting a web page, the GET asks for a script called "myscript.pl" found in the directory "/cgi-local/". What will the response look like? Well, that depends on the script, but commonly, the response will be almost identical to our previous response..

HTTP/1.0 200 OK
Date: Wednesday, 02-Feb-94 23:04:12 GMT
Server: NCSA/1.1
MIME-version: 1.0
Last-modified: Monday, 15-Nov-93 23:33:16 GMT
Content-type: text/html
Content-length: 456

<HTML> .. the web page goes here.. </HTML>

The only difference is that the script dynamically writes the response, rather than fetching a static page stored somewhere on the server. You can imagine the immense power (and potential for cock-ups) inherent in such a set-up! There are several important aspects to CGI scripting that you have to be aware of. We will look at:

  1. HTML components;
  2. Data that come into the script;
  3. How the script responds.

C1. HTML components

There are several ways that a CGI script might be invoked. One is simply including a reference to the script in an HTML anchor. Another is a form. We previously learnt about HTML forms in our JavaScript tutorial. Important components are:

When we discussed forms in our JavaScript tutorial, we referred to two methods that might be used, GET and POST. We also mentioned that POST is much more general and powerful than GET. Let's look at a POST that might result from a form..

POST /cgi-local/posthandler HTTP/1.0
Accept: www/source
Accept: text/html
Accept: video/mpeg
Accept: image/jpeg
Accept: image/gif
{..snippety..more Accepts may go here..}
User-Agent:  Netscape/2.0  libwww/2.14
Content-type: application/x-www-form-urlencoded
Content-length: 184

     &walrus=Large
     &honey=sweet
     &silliness=on
     &myURL=Fred%20Foobar%20argh@foobar.com 

Note the one special line in the above, apart from the blank line! Can you guess which one? Yep.. (no, not the walrus)..

Content-type: application/x-www-form-urlencoded 

This line tells us something very interesting - that all fancy characters in the 'body' section following the blank line will be encoded. A special character is written as % followed by its two-digit hexadecimal code (for example, "=" becomes "%3D"), and a space character is written either as "%20" or, more commonly in form data, as "+".

Now look more carefully at the body of the POST. It's made up of pairs with the general format:

  name=value

The pairs are joined by ampersands (&), and within each pair the value follows an equals sign. (The listing above puts each pair on its own line for readability; what is actually sent is one unbroken string.) Neither the = nor the & separators are encoded, but any other occurrence of either within the actual name or value will be encoded (as %3D and %26 respectively).
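Here is a rough Perl sketch of the encoding just described, going in the browser's direction (plain text in, encoded body out). The names form_encode and form_body are invented for the example, and the sketch assumes plain single-byte text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode one name or value for application/x-www-form-urlencoded.
# (form_encode and form_body are hypothetical helpers, not library calls.)
sub form_encode {
    my ($text) = @_;
    # Encode everything except letters, digits, space and a few safe marks.
    # Assumes single-byte characters, so every code fits in two hex digits.
    $text =~ s/([^A-Za-z0-9 _.~-])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;    # spaces conventionally travel as '+'
    return $text;
}

# Join a flat list (name, value, name, value, ..) into one body string.
sub form_body {
    my (@items) = @_;
    my @encoded;
    while (@items) {
        my ($name, $value) = splice(@items, 0, 2);
        push @encoded, form_encode($name) . '=' . form_encode($value);
    }
    return join('&', @encoded);
}

print form_body(walrus => 'Large', honey => 'sweet', note => 'a=b & c'), "\n";
# prints: walrus=Large&honey=sweet&note=a%3Db+%26+c
```

The = and & inside the last value are encoded, while the ones that separate names from values and pairs from each other are left alone.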

The cute thing about POST is that a separate data stream is opened, and the data (here name=value pairs) are put onto that separate stream. The stream becomes the standard input of the CGI script! (For future reference, we'll here note that when you use the POST method, then the environment variables CONTENT_LENGTH and CONTENT_TYPE are both set up appropriately).

Always use POST rather than GET, unless you have no choice

With this under our belt, let's look briefly at the same information, passed as a GET..

GET /cgi-bin/posthandler.pl?&walrus=Large&honey=sweet&silliness=on&myURL=Fred%20Foobar%20argh@foobar.com HTTP/1.0
Accept: www/source
Accept: text/html
Accept: video/mpeg
Accept: image/jpeg
Accept: image/gif
{..snippety..more Accepts may go here..}
User-Agent:  Netscape/2.0  libwww/2.14 

Note the restrictive format - all the name=value parameters are plonked after a question mark that follows the path and name of the Perl script "posthandler.pl". Despite this limitation, the GET method is useful where we simply wish to submit a stock query to a script, for we can embed the query in a link, thus:

<a href="http://www.anaesthetist.com/cgi-local/shazam.pl?foo=123&bar=nibbles">
click here for confusion</a>

.. but be careful. There's often a (poorly-defined) limitation on the amount of data you can pass with a GET, so don't be surprised if long data are ruthlessly truncated!

C2. Data coming in to the script

CGI stands for Common Gateway Interface. The Interface is the glue that binds the "client" (in our example, Fred) and the actual script that does the work. Information comes in to the Interface, and is then relayed to the script. There is a particularly exacting format that is used to pass this information. What happens is that the script can read in parameters passed to it by the Interface. The script "sees" these parameters as things called environment variables. The names of environment variables are always written in UPPER CASE. There are many environment variables, but perhaps the two most important ones are:

  1. QUERY_STRING
  2. PATH_INFO

The QUERY_STRING environment variable

We already know that if we create a form and then call up a CGI script using POST (the better way) then it's easy for the script to see the incoming data - it simply reads the standard input. Things are more convoluted if we use GET. GET can be invoked..

  1. As part of the HTML reference to the CGI script, for example:
            <a href="http://www.foo.com/bar/foobar.pl?thisIsSomeInfo">
    

  2. From an HTML form. The method used with the form must be GET. (This explains the "?rubbish" that you often see in the location bar of your browser when you use a search engine).

  3. From an ISINDEX document (fine-print stuff that we've deferred until later. You'll probably never use it).

Note that the string that the script sees in QUERY_STRING is mutilated during the process of transfer. In particular:

The PATH_INFO environment variable

This allows extra information to be transmitted from a web-page to a CGI script. The information is put in after the path to the script, for example: "http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere".

You can combine such path information with a query-string.. "http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere?Whatmeworry".
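A small Perl sketch of how a script might pick up both pieces. The routine name request_extras is made up, and because there's no web server in the way here, a hand-built hash stands in for %ENV. In a live script you would pass \%ENV instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the extra path and the query string that the server handed over.
# $env is a reference to a hash such as \%ENV.
# (request_extras is a hypothetical helper written for this example.)
sub request_extras {
    my ($env) = @_;
    my $path  = defined $env->{PATH_INFO}    ? $env->{PATH_INFO}    : '';
    my $query = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
    return ($path, $query);
}

# Simulate what the server would set up for the URL
#   /cgi-local/Fred.pl/PathINfogoeshere?Whatmeworry
my %fake_env = (
    PATH_INFO    => '/PathINfogoeshere',
    QUERY_STRING => 'Whatmeworry',
);
my ($path, $query) = request_extras(\%fake_env);
print "PATH_INFO    = $path\n";
print "QUERY_STRING = $query\n";
```

Note that PATH_INFO keeps its leading slash, while QUERY_STRING arrives without the question mark.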

Other environment variables

There are several other env variables that are routinely set for all requests:

  1. SERVER_SOFTWARE - the name and version of the information server software that runs the gateway. The format is "name/version", for example "NCSA/1.0" or "Apache/1.3.3";
  2. SERVER_NAME - the name of the server (one of IP address, hostname, or DNS alias), for example "stupid.chiron";
  3. GATEWAY_INTERFACE - the version of CGI that is running, in the format "CGI/revision", for example "CGI/1.1".

There are other env variables whose presence depends on the demand being made of the gateway:

A further complication is environment variables that begin with HTTP_ :

None of the environment variables can reliably identify a user!

Note that with server side includes, several other environment variables may be made available, including DOCUMENT_NAME, DOCUMENT_URI, QUERY_STRING_UNESCAPED, DATE_LOCAL, DATE_GMT, and LAST_MODIFIED.

Actually examining the environment variables

Okay, enough mucking around. We're soon going to look at Perl coding in some detail, but here's just a flavour - the Perl code to actually get hold of the environment variables:

 #!/usr/local/bin/perl
 # The header, followed by the blank line that ends it
 print "Content-type: text/html\n\n";
 print "\n";
 # Go through the environment variables in alphabetical order
 foreach $key (sort keys(%ENV))
  { print "$key = $ENV{$key} <br>";
  }

The above little Perl script will actually provide a list of environment variables as an HTML page (admittedly, without the usual head and body tags etc). The first line simply says where to find Perl on the system. See how Perl automagically keeps the environment variables in the %ENV associative array, and we go through each using the foreach instruction. The print statements preceding the foreach loop are explored in detail below. Note the usual Perl method of accessing each element of an associative array - because the element is a scalar variable we say $ENV and not %ENV, but we put curly brackets {braces} afterwards thus: $ENV{whatever}.

C3. How the CGI script talks back..

TALKING BACK
Basics
Other content types
Options other than "content-type"
Extract data from GET
Reading a POST

If you know a bit of Perl, then you can more-or-less work out how to respond to a GET or POST from a browser. You want to generate something along the lines of our lazy little example above..

Content-type: text/html

<HTML> .. the web page goes here.. </HTML>

.. remembering that the HTTPD will fill in all the details to make up an acceptable HTTP/1.0 header. Well, let's try some Perl:

 #!/usr/local/bin/perl
        print "Content-type: text/html\n\n" ;
        print "<html><head><title>A little page</title></head>" ;
        print "<body>This is a little page</body></html>" ;

Note the two newlines (\n\n) after the Content-type statement - these provide our magical blank line without which the header simply won't work. Also see how we simply print to standard output! (stdout - if you were to run the script on your machine without any web in the way, you would see the response on your console). Incredibly straightforward. As usual, there are wrinkles. There's a fancy way in Perl (isn't there always?) of quoting large sections of text. Here it is..

print "Content-type: text/html\n\n" ;
print <<"END OF HTML";
<html>
  <head>
    <title>A little page</title>
  </head>
  <body>
    This is a little page
  </body>
</html>
END OF HTML

In the above, we've used the text string "END OF HTML" to delineate (surprise, surprise) the end of the html text we wish to print. Cute, is it not? This Perl trick is called "here-document quoting". But NOTE that the line END OF HTML must be alone and on its own, starting in the very first column - any other character on the line (even a leading or trailing space) will screw things up horribly. Use with respect!

Other Content-types

There's no reason why your CGI script has to return a Content-type of text/html. For example, you can generate images on the fly (if you know how) of type image/gif. Something along the lines of..

 <IMG SRC="/cgi-local/foo.pl?aargh=on&image=bar.gif" ALT="You fruitcake">

Well, presumably you'd want to put in a width and height if you knew them, now, wouldn't you? This is the sort of stuff that generates those ugly little web-page counters that you so self-righteously despise!

Other options apart from Content-type

Apart from Content-type header lines, there are several others that you can put in, but there are two that can stand on their own, replacing Content-type: these are Location and Status.

Location is quite sneaky. If a URL is specified, the client is redirected to that URL; if a path is specified, the server fetches that document as if the client had requested it. (You can even submit a "?" directive after the relevant file name.)

Status sends something called a status-line to the client. A status-line is an HTTP/1.0 message that combines a three-digit code (nnn) and a string that explains the reason for the code.
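As a hedged illustration, here is how a script might write these two headers. The routine names redirect_header and status_header are invented for the example. The blank line after the header is still required.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a header section that sends the client off to another URL.
# (redirect_header and status_header are hypothetical helpers.)
sub redirect_header {
    my ($url) = @_;
    return "Location: $url\n\n";
}

# Build a header section carrying an explicit status-line, for a reply
# that also has an HTML body (hence the Content-type line).
sub status_header {
    my ($code, $reason) = @_;
    return "Status: $code $reason\n" .
           "Content-type: text/html\n\n";
}

# A script would normally print just one of these and then stop.
print status_header(404, 'Not Found');
print "<html><body>No such page</body></html>\n";
```

To redirect instead, the script would simply say print redirect_header("http://www.anaesthetist.com/index.htm"); and send no body at all.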

Pulling data out of a GET statement

You know how to get the data - they are stored in the QUERY_STRING environment variable. So we simply say:

 if ( $ENV{'REQUEST_METHOD'} eq "GET" )
    { $inputString = $ENV{'QUERY_STRING'};
    };

Massaging our data

Well, not that simple, because we still have to decode various special characters, and pull out the name=value pairs. There are several steps:

  1. Change '+' signs to spaces:
          $inputString =~ s/\+/ /g;
     

  2. Create an array of name=value pairs, and fill it from the data..
         my (@Pairs);
         @Pairs = split ( /&/, $inputString );
     

  3. Create an associative array, and fill this from the name=value pairs..
        my  (%NameValues);          # our associative array
        my  ($name, $value);        # local variables
        foreach my $onepair (@Pairs)
          {  ($name, $value) = split ( /=/, $onepair, 2);   # split at the first '=' only
             # FIX UP: here we should fix up the names and values..
             $NameValues{$name} = $value;   # set value in associative array
          };
     

  4. Okay, we skipped over a bit in the above foreach loop, because the names and values need a little massaging! We need to substitute values like %xx (ie hexadecimal encoding) with the relevant character. So in place of the FIX UP comment line in the above, we might put:
       $name =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
     
    .. and a similar line for $value.

  5. Ooops, nearly forgot! Some more lines to add before we simply plonk the $value into $NameValues{$name}. What about the case where the same $name is submitted several times? Well, okay, you can simply overwrite the first few occurrences (as the above code will do), or you can check for the previous values and, say, concatenate all the values. Here's our check (we'll leave you to decide what to do next)..
      if (  defined ($NameValues{$name})  )
        { # do what you want to here with the multiple values..
          # for example, separate them by colons, or whatever.
        } else
        { $NameValues{$name} = $value;
        };
    

  6. At this stage, you may want to find out how to cut out server-side includes from data submitted to your script. A security issue!
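Pulling steps 1 to 5 together, a complete routine might look something like the sketch below. The names unescape and parse_pairs are made up for the example, and joining repeated values with a colon is just one possible policy (the one hinted at in step 5).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode %xx escapes in one name or value (step 4).
# (unescape and parse_pairs are hypothetical helpers for this example.)
sub unescape {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C", hex($1))/ge;
    return $text;
}

# Turn "a=1&b=2" into a hash; repeated names are joined with colons.
sub parse_pairs {
    my ($input) = @_;
    my %name_values;
    $input =~ s/\+/ /g;                             # step 1: '+' means space
    foreach my $pair (split /&/, $input) {          # step 2: one pair at a time
        next if $pair eq '';                        # skip empties, e.g. from a leading '&'
        my ($name, $value) = split /=/, $pair, 2;   # step 3: name and value
        $value = '' unless defined $value;
        $name  = unescape($name);                   # step 4: decode %xx
        $value = unescape($value);
        if (defined $name_values{$name}) {          # step 5: repeated name
            $name_values{$name} .= ":$value";
        } else {
            $name_values{$name} = $value;
        }
    }
    return %name_values;
}

my %form = parse_pairs('&walrus=Large&honey=sweet&myURL=Fred%20Foobar&honey=sticky');
print "$_ = $form{$_}\n" foreach sort keys %form;
```

Note the order: the + signs are turned into spaces before the %xx decoding, so that a genuine plus sign (sent as %2B) survives the trip.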

Reading the body of a POST statement

The general way that you read the body of a POST statement is simply by querying the standard input, stdin. This is the same as reading the keyboard (console) when your Perl program is interacting with you. In other words you say something along the lines of:

 if ( $ENV{'REQUEST_METHOD'} eq "POST" )
    { read(STDIN,$inputString,$ENV{'CONTENT_LENGTH'});
    };

.. which will read the whole body of the POST from the standard input. Note that for form data this body is one unbroken string of encoded name=value pairs, with no newline at the end, so there is no line terminator to wait for. See how we get the length of the data from the CONTENT_LENGTH environment variable. There is NO obligation for CGI to put some sort of "End Of File" character on the data, so CONTENT_LENGTH is rather important.

After we've read in our $inputString, we massage it into shape (and name/value pairs) as we did for the GET statement above.

We use the more complex Perl read statement rather than something like:

  undef $/;                            #force read to end of file!
  $inputString = <STDIN>;
  $/ = "\n";                           #re-establish usual delimiter!

because we have no EOF character, and therefore need to state the length.
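The GET and POST cases can be wrapped up in one routine, roughly as sketched below. The name raw_form_data is invented for the example. So that it can be tried without a web server, the routine takes its environment and input handle as arguments. In a real script you would pass \%ENV and \*STDIN.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch the raw, still-encoded form data for either request method.
# $env is a reference to a hash such as \%ENV; $in is a filehandle such as \*STDIN.
# (raw_form_data is a hypothetical helper written for this example.)
sub raw_form_data {
    my ($env, $in) = @_;
    my $method = $env->{REQUEST_METHOD} || 'GET';
    if ($method eq 'POST') {
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $data = '';
        # read() may hand back fewer bytes than asked for, so keep going
        # until we have exactly CONTENT_LENGTH bytes.
        while (length($data) < $length) {
            my $got = read($in, $data, $length - length($data), length($data));
            last unless $got;    # stop on error (undef) or end of input (0)
        }
        return $data;
    }
    return defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
}

# Try it out: an in-memory filehandle stands in for the standard input.
my $body = 'walrus=Large&honey=sweet';
open(my $fake_stdin, '<', \$body) or die "cannot open in-memory handle: $!";
my %fake_env = (REQUEST_METHOD => 'POST', CONTENT_LENGTH => length($body));
print raw_form_data(\%fake_env, $fake_stdin), "\n";
```

Whichever branch is taken, the string that comes back is ready for the same massaging as before.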

D. Actually installing a Perl script

There are several tricks to getting a Perl script to run on your server. Here are a few pointers:

  1. Make sure that you're allowed to .. contact your admin, grovel, grovel.

  2. Put the correct path to Perl as the first line of your script. Something like:
     #!/usr/bin/perl 

    (Don't leave out the #, or put a space between the # and the !). "which perl" will generally tell you on a UNIX system where perl resides.

  3. Put the script in the correct directory. Something like /cgi-bin or /cgi-local, but again, contact admin, grovel, grovel..

  4. Note that some simple things may screw you around. For example, if you create a file in MS-DOS edit, and upload it to the web as a binary, Perl won't run it on UNIX systems because the default "end of line" character for MS-DOS/Windows is CR+LF, while on UNIX Perls, it's just LF. There are several solutions - the easiest is just to make sure that when you FTP upload the file to UNIX, your transfer is in ASCII mode - then the translation will automatically occur. Another solution is to find one of the translation programs on the Web, for example "fixcrlf.exe".

  5. Make sure that the file has the correct suffix - on many systems, you have to rename the file from ".pl" to ".cgi" in order to get it to work!

  6. Make sure that the directory and/or script itself has the correct permissions. There are three numbers that you have to get right - this magic triad is "755". The correct UNIX command is..
        chmod 755 filename 

    Where filename is the name of the directory or file whose permission you're changing. You'll find that friendly software like WS-FTP will show the permissions in a slightly different way.

Useful UNIX commands apart from chmod are mv, used to move or rename a file, mkdir for creating a directory, cp to copy a file, pwd to print the name of the current directory, and ~ as shorthand for the long cumbersome path to your home directory!
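If the CR+LF problem from point 4 above has already bitten you, Perl itself can do the translation. Here's a minimal sketch, with dos_to_unix being a made-up name for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert MS-DOS line endings (CR+LF) to UNIX ones (LF) in a string.
# (dos_to_unix is a hypothetical helper written for this example.)
sub dos_to_unix {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

my $dos_script = "#!/usr/bin/perl\r\nprint \"hi\\n\";\r\n";
my $fixed = dos_to_unix($dos_script);
# tr/// with an empty replacement just counts the matching characters.
print "carriage returns left: ", ($fixed =~ tr/\r//), "\n";
```

From the UNIX command line, the same substitution can be applied to a file in place with something like perl -pi -e 's/\r\n/\n/' yourscript.pl (where yourscript.pl stands for your own file).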

E. .. and running it

Okay. Create a web-page with a reference to the script, or simply type in the URL thus:

  http://www.anaesthetist.com/cgi-local/foobar.pl 

And see whether you get back the dynamic web-page you expected (or a confusing error)! If all hell broke loose (ie. you got an internal server error, code 500) then look carefully on your server for the file error.log that will tell you what went wrong (yep, you guessed it, try the grovel, grovel routine). Chances are, you didn't set 755 permissions for both the directory and script file, but you may have the wrong #!pathtoperl, or some other error in your script. (Check the script from the command line if you can, although this is often unhelpful).

If the script file wasn't even found, then you may have forgotten that UNIX is cAse SENsitIvE, or have the wrong name, or the wrong suffix (e.g. ".pl" instead of ".cgi", or vice versa). Here's an example to play around with:

A simple 'Hello world' example

Click on this test link. The URL is "http://www.anaesthetist.com/cgi-local/hi.pl"

Here's the source code for hi.pl..

#!/usr/bin/perl
#=======================================================================#
#                       A SIMPLE CGI HELLO-WORLD!                       #
#=======================================================================#
print "Content-type: text/html\n\n" ;
print <<"END OF HTML";
 <html>
  <head>
   <title>A little page!</title>
  </head>
  <body>
    <div align="center">
     <h1>Hi!</h1>
    </div>
    <hr>
    This is a little test page.
    <p><a href="http://www.anaesthetist.com/mnm/cgi/index.htm#after">Back to tutorial</a>
    <p><hr>
  </body>
 </html>
END OF HTML
#=======================================================================#
#                               That's it!                              #
#=======================================================================#

Writing to files from a Perl script

Hmm. It's general practice for scripts to be run as 'nobody'. This means that if your CGI is to write to a file on the server, the directory needs to be "world-writable" (a particularly bad idea) or owned by 'nobody'. [You may wish to research this further].

Accessing Databases

This depends on your database. There's a lot on the 'Net about the common databases - including mySQL and PostgreSQL. For example, check out resourceindex.com. It's also good to know some ODBC.

F. Security Holes

If you write Perl scripts that others can execute, you ARE opening security holes! Sounds pretty brutal, but largely true. Beware the following (at the very least):

If you don't understand the above, you probably shouldn't be allowing others
to use your scripts by publishing them on the 'Net! Even if you do,
you're probably still going to get burned from time to time.

G. A miscellany

Special characters

The following are all special characters in HTTP/1.0:

  <   >   (   )   @   ,   ;   :
  \   "   /   [   ]
  ?   =   {   }

.. as well as the space character, and horizontal tab. Also don't forget the special role of LF (with or without a preceding CR)!

Calling CGI from JavaScript

This is actually fairly straightforward. In your JavaScript simply say something along the lines of..

  window.location.href = "http://www.example1.com/cgi-bin/fred.pl?hereismyquery";
.. and you're away.

Using CGI.pm

There's a readily available Perl library module called CGI.pm. If it's on your system, you're lucky, and can just say

    use CGI;
 
at the start of your Perl script. If it's not, try (grovel, grovel) or installing it in a local directory and then saying something along the lines of..
    use lib '/path/of/yourlocaldir';
    use CGI;
 

Using CGI.pm you can handle parsing of CGI queries and form generation with just a few simple calls. Check it out in Lincoln Stein's vast CGI.pm documentation. John Callender has written an excellent introduction to this Perl module. This includes how to send an email from Perl, and is incidentally a rather good introduction to Perl! (Or get the source of that other excellent form to email script, FormMail).

A note on MIME

MIME stands for "Multipurpose Internet Mail Extensions". MIME allows for transfer and recognition of a host of different data types. There are hundreds of types. MIME types were originally defined in RFC 1341. For some improvements, see the now obsolete RFC 1521 and 1522, and the more recent five-part RFC 2045 to 2049. Mailcap files (which handle media types) are in RFC 1524. All MIME types should be registered with IANA. Current types include:

For further reading try a long list of MIME types with the relevant RFCs, etc. Here's another good reference. And here's a no-nonsense note. The MIME FAQ is here.

A note on chmod

Here's a direct snitch from the CGI FAQ (go read it - it's great)..

For a proper overview, "man chmod ".  Some modes that may be useful
in a typical CGI context are:
* CGI programs, 0755
* data files to be readable by CGI, 0644
* directories for data used by CGI, 0755
* data files to be writable by CGI, 0666 (data has absolutely no security)
* directories for data used by CGI with write access, 0777 (no security)
* CGI programs to run setuid, 4755
* data files for setuid CGI programs, 0600 or 0644
* directories for data used by setuid CGI programs, 0700 or 0755
* For a typical backend server process, 4750 

What you're doing is setting bit flags in the permissions of files and directories. Scary stuff! The 'behind-the-scenes' information is that 755 (for example) is an octal number. The "7" refers to the permissions for the owner of the file, and the subsequent two 5's to permissions for "group" and "other" respectively. Each octal digit is made up of three bits, the first referring to read permissions, the second to write, and the last to execute. So "5" allows read and execute, but not write, and so on. (read = 4, write = 2, execute = 1, 4+2+1 = 7).

Okay, the leftmost digit (usually assumed to be a zero) is a little different, in that its first bit selects the "set user ID", the second the "set group ID", and the last the "save text image" (sticky) attribute. So the "4" in "4755" means "set user ID", for example.

Note that if you're being anally retentive, it's probably best to specify the leading zero, as this ensures that the number is seen as octal on almost any system!
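To see the bit arithmetic at work, here's a small Perl sketch that turns the three permission digits into the familiar "rwx" notation. The name mode_to_string is made up for the example, and only the low nine bits are looked at (the setuid/setgid/sticky digit is ignored).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Render the low nine permission bits of a mode as "rwxr-xr-x" style text.
# (mode_to_string is a hypothetical helper written for this example.)
sub mode_to_string {
    my ($mode) = @_;
    my @letters = qw(r w x);
    my $text = '';
    # Walk the bits from owner-read (0400) down to other-execute (0001).
    foreach my $bit (reverse 0 .. 8) {
        my $letter = $letters[(8 - $bit) % 3];
        $text .= ($mode & (1 << $bit)) ? $letter : '-';
    }
    return $text;
}

printf "%04o = %s\n", $_, mode_to_string($_) foreach (0755, 0644, 0666);
# prints:
# 0755 = rwxr-xr-x
# 0644 = rw-r--r--
# 0666 = rw-rw-rw-
```

In Perl the leading zero is not just pedantry: 0755 is an octal literal, whereas 755 is the decimal number seven hundred and fifty-five, which is a quite different set of bits. If the mode arrives as a string, oct("755") does the conversion.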

Server-side includes

Note that we particularly don't want some sneaky hacker to insert "server-side includes" into our data. So we must at some stage go through all data strings and rip out all potential server-side includes, thus:

  $datum =~ s/<!--.*?-->//gs;    # non-greedy, so text between two separate comments survives

The general format of a server side include is:

      <!--#command tag1="value1" tag2="value2" -->
for example..
      <!--#exec cgi="/cgi-bin/hits.pl"-->

In other words, a 'standard' HTML comment containing a directive to include the output from an executable in the web-page. Not only do SSIs provide a security hazard, but they will also slow down an overworked server even further, because documents that are provided have to be parsed by the server to see whether it should insert an 'include'. Note that setting up SSI is also quite a business as you have to decide which directories are safe to use, and tell the server what file type is to be parsed and turned into an HTML document. Internally the server uses the MIME-type text/x-server-parsed-html to identify SSI documents (they often have the suffix .shtml).

Commands include exec, config, include, echo, fsize, and flastmod. With exec, the cmd tag executes the string provided (using /bin/sh), and the cgi tag runs a script and inserts its output, whatever it is! For more on NCSA server side includes, try.. this note
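Here is a slightly more defensive sketch of the stripping step. The name strip_ssi is made up, and removing an unterminated "<!--" together with everything after it is an assumption about what you'd want, not a rule from any standard.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove anything that looks like an HTML comment, which covers SSI directives.
# (strip_ssi is a hypothetical helper written for this example.)
sub strip_ssi {
    my ($datum) = @_;
    # Non-greedy match; /s lets '.' match newlines as well.
    $datum =~ s/<!--.*?-->//gs;
    # Assumption: an unterminated "<!--" is dropped along with the rest
    # of the string, rather than being left behind.
    $datum =~ s/<!--.*//s;
    return $datum;
}

my $input = 'Hello <!--#exec cgi="/cgi-bin/hits.pl"--> world <!-- ok --> !';
print strip_ssi($input), "\n";
```

With a greedy match in the first substitution, the word "world" in this example would vanish along with the two comments on either side of it.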

Trickery

There are yet more wrinkles..

Error Scripts

These may be used to handle the case where a script 'crashes and burns'. They have extra environment variables, including:

A topic beyond the scope of this document at present!


A summary of RFC 822


Miscellaneous Notes on HTTP 1.0

The following are snippets from RFC 1945. The extended BNF format used is similar to that of RFC 822, and other features are similar to MIME (particularly the obsolete RFC 1521). Appendix C of RFC 1945 lists how HTTP/1.0 differs from MIME.

  1. Formatting rules

  2. Here are the formatting rules for a URI:
       URIs in HTTP can be represented in absolute form or relative to some
       known base URI [9], depending upon the context of their use. The two
       forms are differentiated by the fact that absolute URIs always begin
       with a scheme name followed by a colon.
           URI            = ( absoluteURI | relativeURI ) [ "#" fragment ]
           absoluteURI    = scheme ":" *( uchar | reserved )
           relativeURI    = net_path | abs_path | rel_path
           net_path       = "//" net_loc [ abs_path ]
           abs_path       = "/" rel_path
           rel_path       = [ path ] [ ";" params ] [ "?" query ]
           path           = fsegment *( "/" segment )
           fsegment       = 1*pchar
           segment        = *pchar
           params         = param *( ";" param )
           param          = *( pchar | "/" )
           scheme         = 1*( ALPHA | DIGIT | "+" | "-" | "." )
           net_loc        = *( pchar | ";" | "?" )
           query          = *( uchar | reserved )
           fragment       = *( uchar | reserved )
           pchar          = uchar | ":" | "@" | "&" | "=" | "+"
           uchar          = unreserved | escape
           unreserved     = ALPHA | DIGIT | safe | extra | national
           escape         = "%" HEX HEX
           reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
           extra          = "!" | "*" | "'" | "(" | ")" | ","
           safe           = "$" | "-" | "_" | "."
           unsafe         = CTL | SP | <"> | "#" | "%" | "<" | ">"
           national       = <any OCTET excluding ALPHA, DIGIT,
                            reserved, extra, safe, and unsafe>
       For definitive information on URL syntax and semantics, see RFC 1738
       [4] and RFC 1808 [9]. The BNF above includes national characters not
       allowed in valid URLs as specified by RFC 1738, since HTTP servers
       are not restricted in the set of unreserved characters allowed to
       represent the rel_path part of addresses, and HTTP proxies may
       receive requests for URIs not defined by RFC 1738.
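
The escape rule in the grammar above ("%" HEX HEX) can be sketched in Perl roughly as follows. The helper names url_escape and url_unescape are assumptions for this example, and the escaping side is deliberately more conservative than the grammar: it leaves only ALPHA, DIGIT and the 'safe' characters untouched.

```perl
use strict;
use warnings;

# url_unescape and url_escape are hypothetical helper names.
# Both assume $text is a string of single bytes.

# Turn each "%" HEX HEX escape back into the octet it stands for.
sub url_unescape {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Escape every octet that is not ALPHA, DIGIT, or one of the "safe"
# characters ($ - _ .) from the grammar.
sub url_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\$\-_.])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

print url_escape("a b&c=d"), "\n";
print url_unescape("a%20b%26c%3Dd"), "\n";
```

Escaping more than strictly necessary is harmless here, since url_unescape restores any escaped octet regardless of whether it needed escaping.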
    

  3. HTTP messages
  4. Let's look at a definition of HTTP messages:

       HTTP messages consist of requests from client to server and responses
       from server to client.
           HTTP-message   = Simple-Request           ; HTTP/0.9 messages
                          | Simple-Response
                          | Full-Request             ; HTTP/1.0 messages
                          | Full-Response
       Full-Request and Full-Response use the generic message format of RFC
       822 [7] for transferring entities. Both messages may include optional
       header fields (also known as "headers") and an entity body. The
       entity body is separated from the headers by a null line (i.e., a
       line with nothing preceding the CRLF).
           Full-Request   = Request-Line             ; Section 5.1
                            *( General-Header        ; Section 4.3
                             | Request-Header        ; Section 5.2
                             | Entity-Header )       ; Section 7.1
                            CRLF
                            [ Entity-Body ]          ; Section 7.2
           Full-Response  = Status-Line              ; Section 6.1
                            *( General-Header        ; Section 4.3
                             | Response-Header       ; Section 6.2
                             | Entity-Header )       ; Section 7.1
                            CRLF
                            [ Entity-Body ]          ; Section 7.2
       Simple-Request and Simple-Response do not allow the use of any header
       information and are limited to a single request method (GET).
           Simple-Request  = "GET" SP Request-URI CRLF
           Simple-Response = [ Entity-Body ]
       Use of the Simple-Request format is discouraged because it prevents
       the server from identifying the media type of the returned entity.
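
To make the Full-Response layout concrete, here is a minimal sketch that splits a response into status line, headers and entity body at the null line. The helper name parse_full_response and the sample response are assumptions for the example; the sketch expects the whole response in one string and ignores folded (continued) header lines.

```perl
use strict;
use warnings;

# parse_full_response is a hypothetical helper name.
sub parse_full_response {
    my ($raw) = @_;

    # Headers and entity body are separated by the first null line.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;

    my ($status_line, @header_lines) = split /\r?\n/, $head;

    my %headers;
    for my $line (@header_lines) {
        # Each header is "Name: value"; names are case-insensitive,
        # so they are stored in lower case.
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $headers{lc $name} = $value;
    }
    return ($status_line, \%headers, $body);
}

my $raw = "HTTP/1.0 200 OK\r\n"
        . "Content-Type: text/html\r\n"
        . "Content-Length: 13\r\n"
        . "\r\n"
        . "<p>Hello</p>\n";

my ($status, $headers, $body) = parse_full_response($raw);
print "$status\n";
print "type: $headers->{'content-type'}\n";
print "body: $body";
```

The same split applies to a Full-Request, with a Request-Line in place of the Status-Line.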
    

  5. A list of status codes
  6. RFC 1945 lists these status codes:

           Status-Code    = "200"   ; OK
                          | "201"   ; Created
                          | "202"   ; Accepted
                          | "204"   ; No Content
                          | "301"   ; Moved Permanently
                          | "302"   ; Moved Temporarily
                          | "304"   ; Not Modified
                          | "400"   ; Bad Request
                          | "401"   ; Unauthorized
                          | "403"   ; Forbidden
                          | "404"   ; Not Found
                          | "500"   ; Internal Server Error
                          | "501"   ; Not Implemented
                          | "502"   ; Bad Gateway
                          | "503"   ; Service Unavailable 
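
A CGI script can report one of these codes by sending a Status header before its other headers. The following is a minimal sketch under that assumption; the helper name status_header, the fallback phrase 'Unknown' and the sample page are all made up for the example.

```perl
use strict;
use warnings;

# Reason phrases for the HTTP/1.0 status codes listed above.
my %reason = (
    200 => 'OK',
    201 => 'Created',
    202 => 'Accepted',
    204 => 'No Content',
    301 => 'Moved Permanently',
    302 => 'Moved Temporarily',
    304 => 'Not Modified',
    400 => 'Bad Request',
    401 => 'Unauthorized',
    403 => 'Forbidden',
    404 => 'Not Found',
    500 => 'Internal Server Error',
    501 => 'Not Implemented',
    502 => 'Bad Gateway',
    503 => 'Service Unavailable',
);

# status_header is a hypothetical helper name. Codes that are not in
# the table get the placeholder phrase 'Unknown'.
sub status_header {
    my ($code) = @_;
    my $phrase = exists $reason{$code} ? $reason{$code} : 'Unknown';
    return "Status: $code $phrase\n";
}

# A script reporting a missing page might print:
print status_header(404);
print "Content-Type: text/html\n\n";
print "<p>Sorry, no such page.</p>\n";
```

Note the blank line after the last header: it is the same null line that separates headers from the entity body in the message format.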

  7. Differences from MIME
  8. There are just a few differences from the (now obsolete) RFC 1521 MIME definition:


HTTP 1.1

The RFC standard for HTTP/1.1 is RFC 2616 from June 1999. We're up to ~170 pages in this standard, which differs quite a lot from 1.0. Krishnamurthy, Mogul and Kristol have reviewed the differences between v 1.0 and 1.1. In summary, these are:


Different Servers

You should research your server - a good internet search term is simply "HTTPD" with the type of server, for example Apache (by far the commonest at the time of writing, with over 50% of all HTTPDs). Note that the CERN/W3C and NCSA HTTPDs are no longer being developed, so if you use them and there's a bug, you may be stuck!


H. References

Check these out..

  1. The CGI FAQ - read this FIRST. It's good.

  2. James Marshall's tutorial (a superb introduction, with practical examples, also available Auf deutsch, En español, Em português, & In het Nederlands).

  3. Ottega - pretty good, with some examples.

  4. Good stuff with a comprehensive Perl tutorial.

  5. For C and CGI CSCENE has a darn good page by Brent York (with a useful example of a web-page counter written in C).

  6. CTDP CGI scripting manual on one of those irritating Tripod pages, but well done.

  7. A note on environment variables in Perl, at about.com (good).

  8. Lies - good! (John Callender)

  9. KU introduction (an "instantaneous introduction")

  10. CGI 101 - a comprehensive tutorial also available in PDF (first 6 chapters, anyway), and as a book.

  11. NCSA CGI stuff. Here's their overview.

    You can get all the RFCs at http://www.ietf.org. Here are a few:

  12. RFC 822 (46 pages)! This supplanted RFC #733. The "Standard for the Format of ARPA Internet Text Messages", dated 1982-8-13.

  13. Here's RFC 1945 (59 pages)

  14. RFC 931

  15. W3C notes on HTTP/1.1 and here is RFC 2616 in HTML! Here's the text version (175 pages).

  16. You may wish to read an interesting note on differences between HTTP/1.0 and /1.1 by Krishnamurthy and colleagues.