A beginner's guide to CGI scripting

This web page assumes that the reader is fluent in Hypertext Markup Language (HTML), with a passing knowledge of JavaScript, and some acquaintance with Perl. In particular, familiarity with HTML forms is assumed. If you don't understand Perl at all, find Robert's Perl tutorial using e.g. google, slog through it a few times, and get hold of Perl for your system and write a few simple programs. Then read on..

TABLE OF CONTENTS

Terminology
How a Web browser works
HTTP/1.0
What is a CGI script?
1. Necessary HTML
2. Data coming into a script
3. The script responds..
Installing a CGI script
Running it
Security Holes
NOT FOR THE FAINT-HEARTED..

Special characters
CGI from JavaScript
CGI.pm
MIME
chmod
Server-side includes
Tricks
Error scripts
RFC 822
HTTP/1.0
HTTP/1.1
Servers

References

First, some terminology

As with all fairly technical subjects, there are lots of words and abbrev's. Here we will examine just a few of the important ones..

HTTP is the Hypertext Transfer Protocol. We'll learn quite a lot about how the Web works as we go through the following page. HTTP is about moving messages around the Internet.
A "user agent" is simply software that is used to browse the Web. Another name for such browser software is "the client";
The web server that provides web pages is simply "the server", although you will come across the confusing abbreviation HTTPD (the d is for daemon - regard the daemon as a sort of agent, not somebody with horns and a tail!);
CR and LF. These are abbreviations for 'carriage return' (hexadecimal 0D) and 'line feed' (hex 0A). Unfortunately, different systems use either 0A on its own or the two combined to indicate the end of one line and the beginning of the next. You can create the characters in Perl as follows:
```
   $LF = pack("c", 10);         # create a line feed
   $CR = pack("c", 13);         # and a carriage return.
```
RFC. This stands for "Request for Comments", and is always followed by a number, which identifies one of a multitude of Internet standards (and related documents). Unfortunately, there is a proliferation of RFCs related to HTTP, so we'll refer to them again and again!
URI. A Uniform Resource Identifier - either a name (URN) or location (URL). A URI is simply a formatted string which identifies a resource on a network. There are mildly complex rules that define the format of a URI.
A proxy is an 'agent' that forwards a message towards a server. The proxy may reformat the request on the way.
The word gateway. The central idea behind CGI scripting is that you can hook your web browser up to a gateway into, for example, a database. The gateway will translate the message into something that the underlying software can understand. The flipside of this advantage is that you're 'letting the whole world run a program on your system'!

Running CGI scripts on a computer will always potentially compromise the security of that computer!

A. How a browser normally works!

When we learnt HTML, we didn't realise that we were floating on several layers of abstraction. For example, consider the following:

Okay, we're connected to the Internet via our Internet Service Provider. We start up our browser (say, Netscape, or Opera), and want to fetch a web-page. What is the request that the browser sends to fetch a web page at say, http://www.anaesthetist.com/index.htm ? Well, here it is..
```
GET /index.htm HTTP/1.0
Accept: www/source
Accept: text/html
Accept: image/gif
User-Agent:  Netscape/2.0  libwww/2.14
From:  foo@foobar.com 
```
Not what you're used to, is it? The above message is all behind-the-scenes stuff. First, note that after all the blurb is a blank line. Yes, I promise you - it's there. This is very important. How important, we'll find out later!

Next, see how this is an example of a GET request. As with all such messages, the format is very precisely structured - it sticks rigidly to rules originally defined in something called RFC 822. The request is quite specific. It's asking for HTTP (hypertext transfer protocol) version 1.0, and will only Accept certain types of data in reply. It's pretty easy to determine what types are being referred to - html pages, and gif images. These data types are called MIME types (MIME stands for Multipurpose Internet Mail Extensions - we discuss it in a little more detail far below).

At this point .. pause .. take a deep breath. Although all of the above looks like garbage, look once more at the message, read each line carefully, and see how a little order starts to appear out of the chaos!

{As an aside, note the From: line, which was a polite little line found in early requests, now totally obsolete due to the enthusiastic activities of spammers}.
Here's what might come back from the server that provides the web page..
```
HTTP/1.0 200 OK
Date: Wednesday, 02-Feb-94 23:04:12 GMT
Server: NCSA/1.1
MIME-version: 1.0
Last-modified: Monday, 15-Nov-93 23:33:16 GMT
Content-type: text/html
Content-length: 456

     <HTML> .. the web page goes here.. </HTML>
```
Note again the header section followed by a blank line. On the first line, "HTTP/version" is followed by a code (200) that indicates "everything went fine". The corresponding code for "not found" is the dreaded 404. A lot of blurb (date, server, MIME version etc) follows. There certainly is a lot of stuff here, isn't there? The MIME type (text/html) and length, and so on. After our mandatory little blank line comes the body of the message - an html page.

When you actually get around to writing a CGI script, you'll find that the above can lazily be abbreviated to just..
```
Content-type: text/html

     <HTML> .. the web page goes here.. </HTML>
```
The trick is that your CGI script writes such a response, which is then parsed by the server on which your script lives. The final product - the plethora of information that you saw above - is then sent to the user agent.
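To see why that blank line matters so much, here is a minimal sketch of how a program on the receiving end might carve a response into status code, headers and body. The helper name split_response is made up for this illustration, and real servers and browsers are a good deal fussier than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a raw HTTP/1.0 response into status code, headers and body.
# split_response is a hypothetical helper, not part of any library.
sub split_response {
    my ($raw) = @_;

    # The first blank line (CRLF CRLF, or LF LF from lenient senders)
    # separates the header section from the body.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;

    my @lines       = split /\r?\n/, $head;
    my $status_line = shift @lines;
    my ($version, $code, $reason) =
        $status_line =~ m{^HTTP/(\d+\.\d+) (\d{3}) ?(.*)$};

    my %headers;
    foreach my $line (@lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{ lc $name } = $value;    # header names are case-insensitive
    }
    return ($code, \%headers, $body);
}

my $raw = "HTTP/1.0 200 OK\r\n"
        . "Content-type: text/html\r\n"
        . "Content-length: 13\r\n"
        . "\r\n"
        . "<HTML></HTML>";

my ($code, $headers, $body) = split_response($raw);
print "status: $code\n";
print "type:   $headers->{'content-type'}\n";
print "body:   $body\n";
```

Leave out the blank line and everything after the status line would be treated as header, which is exactly the sort of confusion the blank line exists to prevent.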

Soon we will consider a slight variation on the above, where instead of requesting a web page, the browser asks the server to run a CGI script! But before we do so, a brief comment on the very first line of the message..

B. What is HTTP/1.0 ?

HTTP/1.0 is the basis of the World Wide Web. There is no 'formal' definition or 'standard' (although HTTP/1.0 accounts for about three quarters of Internet traffic). The closest we can come is something called RFC 1945, which was written by Tim Berners-Lee and his colleagues in May 1996, long, long after he started up the World Wide Web. He described HTTP as:

".. an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). A feature of HTTP is the typing of data representation, allowing systems to be built independently of the data being transferred."

Basically, HTTP is a simple protocol for transferring information. HTTP makes it easy to transfer not only HTML documents, but a vast variety of other data. Not only can we retrieve documents, we can also for example search for information, and talk to a variety of programs on computers across the 'Net. HTTP is very similar to the format used for e-mail (RFC 822) and MIME.

The beauty of HTTP/1.0 is that it makes it easy for us to 'talk to' other Internet-based protocols - there's a host of these, including SMTP, NNTP, FTP, Gopher, and WAIS. HTTP/1.0 is an excellent negotiator between these protocols.

Recently the Internet Engineering Task Force (IETF), working with the World Wide Web Consortium (W3C), has defined a new standard called HTTP/1.1, which will probably eventually replace 1.0. Most servers still use 1.0. We discuss HTTP/1.1 below, as well as taking a brief peek at some features of HTTP/1.0. (Don't look at these now).

C. What is a CGI script?

A CGI script is really a program. The program lives on a computer connected to the Internet. The "script" (actually a fully-fledged program) is usually written in the elegant and satisfying language Perl, although a host of other languages may be used (TCL, BASIC [Yuk!], C, Ada, .. and so on). CGI scripts need not necessarily be interpreted scripts; in fact, they can be compiled programs (for example, C is always compiled).

Consider someone (let's call him Fred) who is browsing web pages on the internet. Fred comes across a web page which contains a link that refers to a CGI script. Fred clicks on the link. What happens next?

When Fred clicks on the link, certain parameters are passed by his browser across the Internet, and eventually the CGI script that lives on the server starts running. What happens next depends very much on the nature of the script, but commonly:

The script reads the parameters that were passed to it;
The script writes a new web page back to Fred, so that he can view it.

Let's explore how this happens. First, a VERY SIMPLISTIC example of what the browser request might look like:

GET /cgi-local/myscript.pl HTTP/1.0
Accept: www/source
Accept: text/html
Accept: image/gif
User-Agent:  Netscape/2.0  libwww/2.14

Note the similarity to our previous, conventional web-page example, including (you guessed it) the blank line at the end. The difference is that, instead of requesting a web page, the GET asks for a script called "myscript.pl" found in the directory "/cgi-local/". What will the response look like? Well, that depends on the script, but commonly, the response will be almost identical to our previous response..

HTTP/1.0 200 OK
Date: Wednesday, 02-Feb-94 23:04:12 GMT
Server: NCSA/1.1
MIME-version: 1.0
Last-modified: Monday, 15-Nov-93 23:33:16 GMT
Content-type: text/html
Content-length: 456

     <HTML> .. the web page goes here.. </HTML>

The only difference is that the script dynamically writes the response, rather than fetching a static page stored somewhere on the server. You can imagine the immense power (and potential for cock-ups) inherent in such a set-up! There are several important aspects to CGI scripting that you have to be aware of. We will look at:

HTML components;
Data that come into the script;
How the script responds.

C1. HTML components

There are several ways that a CGI script might be invoked. One is simply including a reference to the script in an HTML anchor. Another is a form. We previously learnt about HTML forms in our JavaScript tutorial. Important components are:

the <FORM> tag itself, which tells the browser what method is being used to transmit information (GET or POST), and an action attribute, which says which CGI script to use to actually process the data coming from the form!
the <INPUT> tags which say what types of input you're dealing with;
<SELECT> which defines a list to select from (and includes the <OPTION> tag);
<TEXTAREA> for big chunks of text.

When we discussed forms in our JavaScript tutorial, we referred to two methods that might be used, GET and POST. We also mentioned that POST is much more general and powerful than GET. Let's look at a POST that might result from a form..

POST /cgi-local/posthandler HTTP/1.0
Accept: www/source
Accept: text/html
Accept: video/mpeg
Accept: image/jpeg
Accept: image/gif
{..snippety..more Accepts may go here..}
User-Agent:  Netscape/2.0  libwww/2.14
Content-type: application/x-www-form-urlencoded
Content-length: 184

     &walrus=Large
     &honey=sweet
     &silliness=on
     &myURL=Fred%20Foobar%20argh@foobar.com

Note the one special line in the above, apart from the blank line! Can you guess which one? Yep.. (no, not the walrus)..

Content-type: application/x-www-form-urlencoded

This line tells us something very interesting - that all fancy characters in the 'body' section following the blank line will be encoded: space characters are sent as "+" (or as "%20"), and other special characters as % followed by the relevant two-digit hexadecimal code.

Now look more carefully at the body of the POST. It's made up of lines with the general format:

  &name=value

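See how every pair is introduced by an ampersand (&), and then the value follows an equals sign. (We've shown one pair per line to keep things readable; on the wire the body is a single unbroken string, the ampersand is really a separator between pairs, and the very first one isn't needed.) Neither the = nor the & are encoded, but any other occurrence of either within the actual name or value will be encoded (as %3D and %26 respectively).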

The cute thing about POST is that a separate data stream is opened, and the data (here name=value pairs) are put onto that separate stream. The stream becomes the standard input of the CGI script! (For future reference, we'll here note that when you use the POST method, then the environment variables CONTENT_LENGTH and CONTENT_TYPE are both set up appropriately).

Always use POST rather than GET, unless you have no choice

With this under our belt, let's look briefly at the same information, passed as a GET..

GET /cgi-bin/posthandler.pl?&walrus=Large&honey=sweet&silliness=on&myURL=Fred%20Foobar%20argh@foobar.com HTTP/1.0
Accept: www/source
Accept: text/html
Accept: video/mpeg
Accept: image/jpeg
Accept: image/gif
{..snippety..more Accepts may go here..}
User-Agent:  Netscape/2.0  libwww/2.14

Note the restrictive format - all the name=value parameters are plonked after a question mark that follows the path and name of the Perl script "posthandler.pl". Despite this limitation, the GET method is useful where we simply wish to submit a stock query to a script, for we can embed the query in a link, thus:

<a href="http://www.anaesthetist.com/cgi-local/shazam.pl?foo=123&bar=nibbles">
click here for confusion</a>

.. but be careful. There's often a (poorly-defined) limitation on the amount of data you can pass with a GET, so don't be surprised if long data are ruthlessly truncated!
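If you build such links from a script rather than typing them by hand, the names and values have to be encoded first. Here is a minimal sketch, assuming plain single-byte text; url_encode and build_query are names invented for this example, and the shazam.pl URL is simply the one from the link above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode everything except letters, digits and - _ . ~
# (assumes single-byte characters). url_encode and build_query are
# hypothetical helper names.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Takes a flat list: name, value, name, value ..
sub build_query {
    my (@pairs) = @_;
    my @parts;
    while (@pairs) {
        my ($name, $value) = splice(@pairs, 0, 2);
        push @parts, url_encode($name) . '=' . url_encode($value);
    }
    return join('&', @parts);    # '&' goes between pairs only
}

my $query = build_query( foo => 123, bar => 'nibbles & bits' );
print "http://www.anaesthetist.com/cgi-local/shazam.pl?$query\n";
```

Because the ampersand inside 'nibbles & bits' is encoded as %26, the script at the other end can still tell where one pair stops and the next begins.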

C2. Data coming in to the script

CGI stands for Common Gateway Interface. The Interface is the glue that binds the "client" (in our example, Fred) and the actual script that does the work. Information comes in to the Interface, and is then relayed to the script. There is a particularly exacting format that is used to pass this information. What happens is that the script can read in parameters passed to it by the Interface. The script "sees" these parameters as things called environment variables. The names of environment variables are always written in UPPER CASE. There are many environment variables, but perhaps the two most important ones are:

QUERY_STRING
PATH_INFO

The QUERY_STRING environment variable

We already know that if we create a form and then call up a CGI script using POST (the better way) then it's easy for the script to see the incoming data - it simply reads the standard input. Things are more convoluted if we use GET. GET can be invoked..

As part of the HTML reference to the CGI script, for example:

        <a href="http://www.foo.com/bar/foobar.pl?thisIsSomeInfo">

From an HTML form. The method used with the form must be GET. (This explains the "?rubbish" that you often see in the location bar of your browser when you use a search engine).
From an ISINDEX document (fine-print stuff that we've deferred until later. You'll probably never use it).

Note that the string that the script sees in QUERY_STRING is mutilated during the process of transfer. In particular:

space characters are converted to + signs;
Certain special characters are encoded as a percentage sign followed by a 2-digit hexadecimal code, as we learnt above.

The PATH_INFO environment variable

This allows extra information to be transmitted from a web-page to a CGI script. The information is put in after the path to the script, for example: "http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere";

You can combine such path information with a query-string.. "http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere?Whatmeworry";
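Here is a small sketch of how a script might pick up both pieces. The helper request_extras and the %fake_env hash are inventions for this example (a real script would simply hand over \%ENV), and the values are those the script would typically see for the second URL above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# request_extras is a hypothetical helper. It takes a reference to an
# environment hash (normally \%ENV) so it can be tried without a server.
sub request_extras {
    my ($env) = @_;
    my $path  = defined $env->{PATH_INFO}    ? $env->{PATH_INFO}    : '';
    my $query = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';

    # Break "/a/b/c" into its non-empty segments.
    my @segments = grep { length } split m{/}, $path;
    return (\@segments, $query);
}

# Stand-in for what the server would put in %ENV for
# /cgi-local/Fred.pl/PathINfogoeshere?Whatmeworry
my %fake_env = (
    PATH_INFO    => '/PathINfogoeshere',
    QUERY_STRING => 'Whatmeworry',
);

my ($segments, $query) = request_extras(\%fake_env);
print "path segments: @$segments\n";
print "query string:  $query\n";
```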

Other environment variables

There are several other env variables that are routinely set for all requests:

SERVER_SOFTWARE - the name and version of the information server software that runs the gateway. The format is "name/version", for example "NCSA/1.0" or "Apache/1.3.3";
SERVER_NAME - the name of the server (one of IP address, hostname, or DNS alias), for example "stupid.chiron";
GATEWAY_INTERFACE - the version of CGI that is running, in the format "CGI/revision", for example "CGI/1.1";

There are other env variables whose presence depends on the demand being made of the gateway:

REQUEST_METHOD - we are particularly interested in the HTTP methods GET and POST, but there are others (for example, HEAD, which is identical to GET except that the server never returns a body in its response!);
SERVER_PROTOCOL - the name and revision of the "information protocol" used by the request ("protocol/revision"), for example "HTTP/1.0";
SERVER_PORT - the port number to which the request was sent - for HTTP, the port is usually 80.
PATH_TRANSLATED - this is another incarnation of PATH_INFO, but has been translated by the server from a virtual address to a physical one, where appropriate; In our above example of PATH_INFO, the translated path might be: "/usr/bin/cgi-local/PathINfogoeshere";
SCRIPT_NAME - the (virtual) path of the script itself! This is used for URLs that reference themselves. For example "/cgi-local/foobar.cgi";
REMOTE_HOST - who made the request? (If this is null, then look at REMOTE_ADDR, below); Note that on some servers, DNS lookup can be turned on or off. This variable will only be populated if DNS lookup is turned on. Do NOT rely on this to definitively identify someone!
REMOTE_ADDR - the IP address of the requestor (of the remote host);
AUTH_TYPE - only of relevance where the server can authenticate the user. Protocol-specific;
REMOTE_USER - authenticated user name, in the case where AUTH_TYPE is relevant;
REMOTE_IDENT - set to the remote user name only in the specific instance where "RFC 931 identification" is supported! This is pretty darn useless for actually identifying someone, as it can easily be faked by those with nefarious intent.
CONTENT_TYPE. In some circumstances, information may be attached. Here the type of datum attached is specified (For example, with HTTP POST, and PUT); CONTENT_LENGTH similarly specifies the length of such content.

A further complication is environment variables that begin with HTTP_ :

HTTP_ACCEPT - the content is determined by the "Accept: .." lines in the header, if they exist. The value is just a list of mime types, separated by commas (One can even say "*/*");
HTTP_USER_AGENT - the browser used by the client ("software/version library/version", or anything the client wishes to spew forth - Internet Explorer will sometimes identify itself as Mozilla, ie. Netscape?!);
HTTP_REFERER only has a value if the script was invoked from within an HTML document, in which case the value is the URL of the document. Note the spelling [or lack thereof]. Some servers will also (or even only) accept HTTP_REFERRER.
HTTP_FROM is obsolete (and usually now left blank) but formerly contained the user's email - spam has put paid to this!
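HTTP_HOST - the host name that the client asked for, taken from the "Host: .." line in the header, if there is one;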
HTTP_ACCEPT_LANGUAGE - for example, "en" for English.
HTTP_ACCEPT_ENCODING - client-supported encoding, for example, "gzip";
HTTP_CONNECTION - type of connection established between server and client, for example, "Keep-Alive";

None of the environment variables can reliably identify a user!

Note that with server side includes, several other environment variables may be made available, including DOCUMENT_NAME, DOCUMENT_URI, QUERY_STRING_UNESCAPED, DATE_LOCAL, DATE_GMT, and LAST_MODIFIED.

Actually examining the environment variables

Okay, enough mucking around. We're soon going to look at Perl coding in some detail, but here's just a flavour - the Perl code to actually get hold of the environment variables:

 #!/usr/local/bin/perl
 print "Content-type: text/html\n\n";
 print "\n";
 foreach $key (sort keys(%ENV))
  { print "$key = $ENV{$key} <br>";
  }

The above little Perl script will actually provide a list of environment variables as an HTML page (admittedly, without the usual head and body tags etc). The first line simply says where to find Perl on the system. See how Perl automagically keeps the environment variables in the %ENV associative array, and we go through each using the foreach instruction. The print statements preceding the foreach loop are explored in detail below. Note the usual Perl method of accessing each associative variable - because it's a variable we say $ENV and not %ENV, but we put curly brackets {braces} afterwards thus: $ENV{whatever}.

C3. How the CGI script talks back..

TALKING BACK

Basics
Other content types
Options other than "content-type"
Extract data from GET
Reading a POST

If you know a bit of Perl, then you can more-or-less work out how to respond to a GET or POST from a browser. You want to generate something along the lines of our lazy little example above..

Content-type: text/html

     <HTML> .. the web page goes here.. </HTML>

.. remembering that the HTTPD will fill in all the details to make up an acceptable HTTP/1.0 header. Well, let's try some Perl:

 #!/usr/local/bin/perl
        print "Content-type: text/html\n\n" ;
        print "<html><head><title>A little page</title></head>" ;
        print "<body>This is a little page</body></html>" ;

Note the two newlines (\n\n) after the Content-type statement - the first ends the header line and the second provides our magical blank line, without which the header simply won't work. Also see how we simply print to standard output! (stdout - if you were to run the script on your machine without any web in the way, you would see the response on your console). Incredibly straightforward. As usual, there are wrinkles. There's a fancy way in Perl (isn't there always?) of quoting large sections of text. Here it is..

        print "Content-type: text/html\n\n" ;
        print <<"END OF HTML";
        <html>
          <head>
            <title>A little page</title>
          </head>
         <body>
            This is a little page
         </body>
        </html> 
        END OF HTML

In the above, we've used the text string "END OF HTML" to delineate (surprise, surprise) the end of the html text we wish to print. Cute, is it not? This cute Perl trick is called "here-document quoting". But NOTE that the line END OF HTML must be alone and on its own - any character on the line apart from "END OF HTML" (even a space) will screw things up horribly. (The listing above is indented only for display: in a real script the terminator must start in the very first column.) Use with respect!

Pulling data out of a GET statement

You know how to get the data - they are stored in the QUERY_STRING environment variable. So we simply say:

 if ( $ENV{'REQUEST_METHOD'} eq "GET" )
    { $inputString = $ENV{'QUERY_STRING'};
    };

Massaging our data

Well, not that simple, because we still have to decode various special characters, and pull out the name=value pairs. There are several steps:

Change '+' signs to spaces:
```
      $inputString =~ s/\+/ /g;
```

Create an array of name=value pairs, and fill it from the data..

     my (@Pairs);
     @Pairs = split ( /&/, $inputString );

Create an associative array, and fill this from the name=value pairs..

    my  (%NameValues);          # our associative array
    my  ($name, $value);        # local variables
    foreach $onepair (@Pairs)
      {  ($name, $value) = split ( /=/, $onepair);
         # FIX UP: here we should fix up the names and values..
         $NameValues{$name} = $value;   # set value in associative array
      };

Okay, we skipped over a bit in the above foreach loop, because the names and values need a little massaging! We need to substitute values like %xx (ie hexadecimal encoding) with the relevant character. So in place of the FIX UP comment line in the above, we might put:
```
   $name =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
```
.. and a similar line for $value.
Ooops, nearly forgot! Some more lines to add before we simply plonk the $value into $NameValues{$name}. What about the case where the same $name is submitted several times? Well, okay, you can simply overwrite the first few occurrences (as the above code will do), or you can check for the previous values and, say, concatenate all the values. Here's our check (we'll leave you to decide what to do next)..
```
  if (  defined ($NameValues{$name})  )
    { # do what you want to here with the multiple values..
      # for example, separate them by colons, or whatever.
    } else
    { $NameValues{$name} = $value;
    };
```
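Pulling all of the massaging steps together, here is one possible sketch of a complete parser. The name parse_query is made up, and joining repeated values with a colon is just one policy among many; pick whatever suits your script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_query is a hypothetical name. Repeated names are joined with ':'
# purely as an illustration of one possible policy.
sub parse_query {
    my ($input) = @_;
    my %name_values;

    foreach my $pair (split /&/, $input) {
        next unless length $pair;              # tolerate a leading or doubled '&'
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;     # a bare "name" with no "=value"

        foreach ($name, $value) {
            tr/+/ /;                                       # '+' means space
            s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;     # %xx -> character
        }

        if (defined $name_values{$name}) {
            $name_values{$name} .= ":$value";
        } else {
            $name_values{$name} = $value;
        }
    }
    return %name_values;
}

my %form = parse_query('&walrus=Large&honey=sweet&honey=sticky&myURL=Fred%20Foobar');
foreach my $name (sort keys %form) {
    print "$name = $form{$name}\n";
}
```

Note that the pairs are split apart before anything is decoded, so an encoded %26 or %3D inside a value can't be mistaken for a separator.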
At this stage, you may want to find out how to cut out server-side includes from data submitted to your script. A security issue!

Reading the body of a POST statement

The general way that you read the body of a POST statement is simply by querying the standard input, stdin. This is the same as reading the keyboard (console) when your Perl program is interacting with you. In other words you say something along the lines of:

 if ( $ENV{'REQUEST_METHOD'} eq "POST" )
    { read(STDIN,$inputString,$ENV{'CONTENT_LENGTH'});
    };

.. which will read the whole body of the POST from the standard input. Note that the body arrives as one chunk of CONTENT_LENGTH bytes, and it's up to you to carve it up. See how we get the length of the data from the CONTENT_LENGTH environment variable. There is NO obligation for CGI to put some sort of "End Of File" character on the data, so CONTENT_LENGTH is rather important.

After we've read in our $inputString, we massage it into shape (and name/value pairs) as we did for the GET statement above.

We use the more complex Perl read statement rather than something like:

  undef $/;                            #force read to end of file!
  $inputString = <STDIN>;
  $/ = "\n";                           #re-establish usual delimiter!

because we have no EOF character, and therefore need to state the length.
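A slightly more defensive version of the read might look like the sketch below. The helper name read_post_body and the 64K ceiling are assumptions made for this example, not anything laid down by CGI; the ceiling is there because CONTENT_LENGTH is supplied by the client and shouldn't be trusted blindly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MAX_BODY = 64 * 1024;    # arbitrary illustrative limit, in bytes

# read_post_body is a hypothetical helper. It takes a filehandle so that
# it can be tried out without a web server; a CGI script would pass \*STDIN
# and $ENV{'CONTENT_LENGTH'}.
sub read_post_body {
    my ($fh, $length) = @_;
    return '' unless defined $length && $length =~ /^\d+$/;
    die "POST body too large\n" if $length > $MAX_BODY;

    my $body = '';
    # read() may hand back fewer bytes than asked for, so loop until done.
    while (length($body) < $length) {
        my $got = read($fh, $body, $length - length($body), length($body));
        last unless $got;        # 0 = end of input, undef = error
    }
    return $body;
}

# Try it out with an in-memory filehandle standing in for STDIN.
my $fake = 'walrus=Large&honey=sweet';
open(my $in, '<', \$fake) or die "cannot open in-memory handle: $!";
print read_post_body($in, length $fake), "\n";
```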

D. Actually installing a Perl script

There are several tricks to getting a Perl script to run on your server. Here are a few pointers:

Make sure that you're allowed to .. contact your admin, grovel, grovel.
Put the correct path to Perl as the first line of your script. Something like:
```
 #!/usr/bin/perl 
```
(Don't leave out the #, or put a space between the # and the !). "which perl" will generally tell you on a UNIX system where perl resides.
Put the script in the correct directory. Something like /cgi-bin or /cgi-local, but again, contact admin, grovel, grovel..
Note that some simple things may screw you around. For example, if you create a file in MS-DOS edit, and upload it to the web as a binary, Perl won't run it on UNIX systems because the default "end of line" character for MS-DOS/Windows is CR+LF, while on UNIX Perls, it's just LF. There are several solutions - the easiest is just to make sure that when you FTP upload the file to UNIX, your transfer is in ASCII mode - then the translation will automatically occur. Another solution is to find one of the translation programs on the Web, for example "fixcrlf.exe".
Make sure that the file has the correct suffix - on many systems, you have to rename the file from ".pl" to ".cgi" in order to get it to work!
Make sure that the directory and/or script itself has the correct permissions. There are three numbers that you have to get right - this magic triad is "755". The correct UNIX command is..
```
    chmod 755 filename 
```
Where filename is the name of the directory or file whose permission you're changing. You'll find that friendly software like WS-FTP will show the permissions in a slightly different way.
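Two of the pointers above can be sketched in Perl itself: stripping MS-DOS carriage returns from a script's text, and spelling out what a permission triad like 755 actually grants. Both helper names are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Both helpers are hypothetical names used only for this illustration.

# Turn MS-DOS line endings (CR+LF) into UNIX ones (LF).
sub strip_cr {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

# Spell out a permission triad such as 0755 as "rwxr-xr-x":
# one octal digit each for owner, group and everybody else.
sub mode_string {
    my ($mode) = @_;
    my @bits = ('---', '--x', '-w-', '-wx', 'r--', 'r-x', 'rw-', 'rwx');
    return join '', map { $bits[ ($mode >> $_) & 7 ] } (6, 3, 0);
}

print mode_string(0755), "\n";                   # rwxr-xr-x
print length(strip_cr("print 1;\r\n")), "\n";    # 9
```

So 755 means the owner may read, write and execute, while group and world may only read and execute, which is what the web server needs in order to run the script.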

Useful UNIX commands apart from chmod are mv, used to move or rename a file, mkdir for creating a directory, cp to copy a file, pwd to print the current directory, and ~ as a replacement for the long cumbersome path to your home directory!

E. .. and running it

Okay. Create a web-page with a reference to the script, or simply type in the URL thus:

  http://www.anaesthetist.com/cgi-local/foobar.pl

And see whether you get back the dynamic web-page you expected (or a confusing error)! If all hell broke loose (ie. you got an internal server error, code 500) then look carefully on your server for the file error.log that will tell you what went wrong (yep, you guessed it, try the grovel, grovel routine). Chances are, you didn't set 755 permissions for both the directory and script file, but you may have the wrong #!pathtoperl, or some other error in your script. (Check the script from the command line if you can, although this is often unhelpful).

If the script file wasn't even found, then you may have forgotten that UNIX is cAse SENsitIvE, or have the wrong name, or the wrong suffix (e.g. ".pl" instead of ".cgi", or vice versa). Here's an example to play around with:

A simple 'Hello world' example

Click on this test link. The URL is "http://www.anaesthetist.com/cgi-local/hi.pl"

Here's the source code for hi.pl..

    #!/usr/bin/perl
    #=======================================================================#
    #                      A SIMPLE CGI HELLO-WORLD!                        #
    #=======================================================================#
    print "Content-type: text/html\n\n" ;
    print <<"END OF HTML";
    <html>
     <head>
      <title>A little page!</title>
     </head>
     <body>
      <div align="center">
       <h1>Hi!</h1>
      </div>
      <hr>
      This is a little test page.
      <p><a href="http://www.anaesthetist.com/mnm/cgi/index.htm#after">Back to tutorial</a>
      <p><hr>
     </body>
    </html>
    END OF HTML
    #=======================================================================#
    #                             That's it!                                #
    #=======================================================================#

Writing to files from a Perl script

Hmm. It's general practice for scripts to be run as 'nobody'. This means that if your CGI is to write to a file on the server, the directory needs to be "world-writable" (a particularly bad idea) or owned by 'nobody'. [You may wish to research this further].

Accessing Databases

This depends on your database. There's a lot on the 'Net about the common databases - including MySQL and PostgreSQL. For example, check out resourceindex.com. It's also good to know some ODBC.

F. Security Holes

If you write Perl scripts that others can execute, you ARE opening security holes! Sounds pretty brutal, but largely true. Beware the following (at the very least):

Use of the Perl eval command.
Special characters that have meaning to a Bourne shell.
popen and system.
Server-side includes. These must be TURNED OFF, or you will die a horrible death, heh. Anyway, you should trim them out of your data - see our note on SSIs.
Data that bite! Certain characters such as `backticks` have special meaning in Perl, and submitting data with |pipes| to something like an (eugh) Access database will result in Visual Basic being invoked on the stuff between the pipes. Hackers can use this to accomplish evil. Other changes may simply sow confusion - if you don't watch out for it, something like
```
 $surname = "O'Connor";
 &SubmitSQL ("INSERT INTO NAMETABLE (Surname) VALUES ('$surname')" );
```
will cause a headache. You first have to duplicate the single quote in the surname!

If you don't understand the above, you probably shouldn't be allowing others
to use your scripts by publishing them on the 'Net! Even if you do,
you're probably still going to get burned from time to time.
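As for the O'Connor headache in the list above, doubling the single quote can be sketched as follows. The helper name sql_quote is invented for this example, and it deals with the single quote only. If the DBI module is available to you, its placeholders (or its quote method) are a sturdier choice than any hand-rolled helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# sql_quote is a hypothetical helper: it doubles every single quote and
# wraps the result in single quotes, ready to drop into an SQL statement.
# It does nothing about any other troublesome characters.
sub sql_quote {
    my ($text) = @_;
    $text =~ s/'/''/g;
    return "'$text'";
}

my $surname = "O'Connor";
my $sql = "INSERT INTO NAMETABLE (Surname) VALUES (" . sql_quote($surname) . ")";
print "$sql\n";
```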

G. A miscellany

Special characters

The following are all special characters in HTTP/1.0:

 <   > (   )   @   ,   ;   :
    \   "   /   [   ]  
  ?   =   {   }

.. as well as the space character, and horizontal tab. Also don't forget the special roles of LF ± CR!

Calling CGI from JavaScript

This is actually fairly straightforward. In your JavaScript simply say something along the lines of..

  window.location.href = "http://www.example1.com/cgi-bin/fred.pl?hereismyquery";

.. and you're away.

Using CGI.pm

There's a readily available Perl library module called CGI.pm. If it's on your system, you're lucky, and can just say

    use CGI;

at the start of your Perl script. If it's not, try (grovel, grovel) or installing it in a local directory and then saying something along the lines of..

    use lib '/path/of/yourlocaldir';
    use CGI;

Using CGI.pm you can handle parsing of CGI queries and form generation with just a few simple calls. Check it out in Lincoln Stein's vast CGI.pm documentation. John Callender has written an excellent introduction to this Perl module. This includes how to send an email from Perl, and is incidentally a rather good introduction to Perl! (Or get the source of that other excellent form to email script, FormMail).
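
To give a feel for those "few simple calls", here is a small sketch. It assumes CGI.pm is installed (recent Perl distributions no longer bundle it), and the field names name and colour are invented for the example. The query string is handed to the constructor only so that the sketch runs outside a web server; in a real script you would simply say CGI->new.

```perl
use strict;
use warnings;
use CGI;    # assumption: CGI.pm is installed on this system

# For illustration the query string is passed in directly; under a web
# server CGI->new with no arguments reads the real request instead.
my $q = CGI->new('name=Fred+Bloggs&colour=blue');

# param() decodes the "+" and %HH escapes for us.
my $name   = $q->param('name');
my $colour = $q->param('colour');

print $q->header('text/html');
print "<p>Hello, ", $q->escapeHTML($name),
      "; you chose ", $q->escapeHTML($colour), ".</p>\n";
```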

A note on MIME

MIME stands for "Multipurpose Internet Mail Extensions". MIME allows for transfer and recognition of a host of different data types. There are hundreds of types. MIME types were originally defined in RFC 1341. For some improvements, see the now obsolete RFC 1521 and 1522, and the more recent five-part RFC 2045 to 2049. Mailcap files (which handle media types) are in RFC 1524. All MIME types should be registered with IANA. Current types include:

text (text/plain, text/html, text/richtext, text/sgml, text/css, text/x-speech, text/x-uuencode, text/xml, ..)
audio (audio/aiff, audio/x-aiff, audio/basic, audio/midi, audio/mpeg, audio/x-realaudio, audio/voc, audio/wav, ..)
image (image/bmp, image/gif, image/jpeg, image/png, image/tiff, ..)
video (For example, video/mpeg, video/quicktime, video/x-msvideo, video/avi)
multipart (e.g. multipart/encrypted = RFC1847, multipart/form-data = RFC1867, multipart/mixed = RFC 1521, multipart/x-mixed-replace {Netscape server side stuff}, multipart/parallel = RFC 1521 also, multipart/voice-message = RFC 1911, multipart/x-gzip, multipart/x-zip, and so on..)
application (a host, for example application/java, application/msword, application/postscript, application/mspowerpoint, application/x-sh, application/streamingmedia, application/x-shockwave-flash, application/x-tcl, application/excel, application/xml, ..)
music, chemical (e.g. chemical/x-pdb), etc.

For further reading try a long list of MIME types with the relevant RFCs, etc. Here's another good reference. And here's a no-nonsense note. The MIME FAQ is here.

A note on chmod

Here's a direct snitch from the CGI FAQ (go read it - it's great)..

For a proper overview, "man chmod".  Some modes that may be useful
in a typical CGI context are:
* CGI programs, 0755
* data files to be readable by CGI, 0644
* directories for data used by CGI, 0755
* data files to be writable by CGI, 0666 (data has absolutely no security)
* directories for data used by CGI with write access, 0777 (no security)
* CGI programs to run setuid, 4755
* data files for setuid CGI programs, 0600 or 0644
* directories for data used by setuid CGI programs, 0700 or 0755
* For a typical backend server process, 4750

What you're doing is setting bit flags in the permissions of files and directories. Scary stuff! The 'behind-the-scenes' information is that 755 (for example) is an octal number. The "7" gives the permissions for the file's owner, and the subsequent two 5's the permissions for "group" and "other" respectively. Each octal digit is made up of three bits, the first referring to read permission, the second to write, and the last to execute (read = 4, write = 2, execute = 1, and 4+2+1 = 7). So "5" (4+1) allows read and execute, but not write, and so on.

Okay, the leftmost digit (usually assumed to be a zero) is a little different, in that its first bit selects the "set user ID", the second the "set group ID", and the last the "save text image" (sticky) attribute. So the "4" in "4755" means "set user ID", for example.

Note that if you're being anally retentive, it's probably best to specify the leading zero, as this ensures that the number is seen as octal on almost any system!
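
The bit arithmetic above can be sketched in a few lines of Perl. The helper mode_to_string is a made-up name for this illustration; it decodes only the three permission digits and ignores the leftmost (setuid/setgid/sticky) digit.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a numeric mode such as 0755 into "rwxr-xr-x".
sub mode_to_string {
    my ($mode) = @_;
    my $out = '';
    # Walk the owner, group and other digits from left to right.
    for my $shift (6, 3, 0) {
        my $bits = ($mode >> $shift) & 7;
        $out .= ($bits & 4) ? 'r' : '-';
        $out .= ($bits & 2) ? 'w' : '-';
        $out .= ($bits & 1) ? 'x' : '-';
    }
    return $out;
}

printf "%04o => %s\n", $_, mode_to_string($_) for (0755, 0644, 0600);

# The leading 0 matters in Perl source: 755 would be decimal, 0755 is octal.
# A mode that arrives as text (say, from a config file) needs oct():
my $mode = oct('0755');
print "oct('0755') is the same number as 0755\n" if $mode == 0755;

# The leftmost digit: 4 means "set user ID".
print "setuid bit is set in 4755\n" if 04755 & 04000;
```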

Server-side includes

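Note that we particularly don't want some sneaky hacker to insert "server-side includes" into our data. So we must at some stage go through all data strings and rip out all potential server-side includes, thus (the ? stops the match from swallowing everything between the first <!-- and the last -->, and the s modifier lets . match newlines):

  $datum =~ s/<!--.*?-->//gs;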

The general format of a server side include is:

      <!--#command tag1="value1" tag2="value2" -->

for example..

      <!--#exec cgi="/cgi-bin/hits.pl"-->

In other words, a 'standard' HTML comment containing a directive to include the output from an executable in the web-page. Not only do SSIs provide a security hazard, but they will also slow down an overworked server even further, because documents that are provided have to be parsed by the server to see whether it should insert an 'include'. Note that setting up SSI is also quite a business as you have to decide which directories are safe to use, and tell the server what file type is to be parsed and turned into an HTML document. Internally the server uses the MIME-type text/x-server-parsed-html to identify SSI documents (they often have the suffix .shtml).

Commands include exec, config, include, echo, fsize, and flastmod. With exec, the cmd tag executes the string provided (using /bin/sh), and the cgi tag runs a script and inserts its output, whatever it is! For more on NCSA server side includes, try.. this note
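
Here is a sketch of stripping comments (and so SSI directives) out of a data string. The helper name strip_comments is invented for the example. The match is non-greedy, so each comment is removed on its own, and the substitution is repeated because removing one comment can splice the pieces of another together. It does not deal with an opening <!-- that is never closed, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every HTML comment, SSI directives included.
sub strip_comments {
    my ($datum) = @_;
    # .*? is non-greedy; /s lets . match newlines inside a comment.
    # Repeat until nothing changes, in case a removal creates a new comment.
    1 while $datum =~ s/<!--.*?-->//gs;
    return $datum;
}

my $input = qq{Hello <!--#exec cmd="ls"--> kind <!-- a\ncomment --> world};
print strip_comments($input), "\n";
```

Turning every < in submitted data into &lt; before it ever reaches a page is a more robust alternative, since nothing that looks like a comment then survives at all.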

Trickery

There are yet more wrinkles..

There is a way of preventing the server from parsing the output of your script (ie. you can talk directly to the client). The trick is to name the script beginning with "nph-", which stands for "non-parsed headers". The catch - your script must return a valid HTTP/1.0 response to the client, in all its gory detail.
ISINDEX documents.. An ISINDEX is an element that can be inserted into either the head or body of a webpage. <ISINDEX> is deprecated in HTML 4.0, and there's really no good reason to use it, as FORMs are more flexible. When you use this silly tag, then the browser should provide a single line into which users can input text, which is then submitted to a CGI script. Note that this will only work reliably if a BASE URL is specified for the document, along the lines of:
```
  <BASE HREF="http://www.foo.com/cgi-bin/scriptname">
  <ISINDEX PROMPT="Enter some simple words..">
```
Note that with ISINDEX queries, command line arguments may be used. (See below). One way of determining whether the command line may safely be used is to first check QUERY_STRING for an equals sign ( = ) that is NOT encoded. If such a nasty little character is present, then do NOT use the command line, as all ISINDEX queries should encode equals signs!
There is another trick that you might wish to use. This is an internal Perl thing - normally in Perl one can read "command line" arguments by referring to the first argument as
```
    $ARGV[0]
```
and so on. The sneaky thing is that QUERY_STRING values are also put into $ARGV[0], $ARGV[1] and so on, provided that the data were not submitted from a form. How is $ARGV[0] distinguished from $ARGV[1] and so on? Easy! Spaces (now translated to plus signs) are the delimiter. So, if your web page anchor is:
```
        <a href="http://www.foo.com/bar/foobar.pl?alpha+beta+gamma">
```
then the script will see $ARGV[0] as the string "alpha", $ARGV[1] will be "beta", and so on.
One trick - if you want a CGI program that simply echoes back the content of a POST form (so that you can test form submission) try post-query , which is available as ACTION="http://hoohoo.ncsa.uiuc.edu/cgi-bin/post-query". There's a similar script called simply query that does the same for forms that use the GET method (Available as ACTION="http://hoohoo.ncsa.uiuc.edu/cgi-bin/query").
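
The equals-sign check and the plus-sign splitting described in the ISINDEX items above can be sketched as follows. The helper name isindex_words is invented for the example, and the fallback query string is there only so the sketch runs outside a web server.

```perl
use strict;
use warnings;

# Hypothetical helper: return the list of ISINDEX-style keywords in a
# query string, or an empty list if it looks like form data instead.
sub isindex_words {
    my ($query) = @_;
    return () unless defined $query && length $query;
    return () if $query =~ /=/;                        # unencoded "=": form data
    my @words = split /\+/, $query;                    # "+" stands for a space
    s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge for @words;    # undo %HH escapes
    return @words;
}

# Assumption: when QUERY_STRING is not set, use a sample query instead.
my $query = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : 'alpha+beta+gamma';
print "$_\n" for isindex_words($query);
```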

Error Scripts

These may be used to handle the case where a script 'crashes and burns'. They have extra environment variables, including:

REDIRECT_REQUEST (the request as it was sent)
REDIRECT_URL (URL that caused a problem)
REDIRECT_STATUS (status number and message that would have been sent)

A topic beyond the scope of this document at present!

A summary of RFC 822

This document has application to all sorts of internet messages, including e-mail and HTTP, as well as requests for CGI scripts;
An 'extended' and rather informal Backus Naur format (BNF) is used.
Messages consist of tightly-structured headers, and unformatted 'body' text.
Long lines can be "folded" by (a) inserting a CRLF combination, and (b) immediately after this, inserting at least one 'linear white-space' character (LWSP, a tab or space); You 'unfold' by removing all CRLFs followed immediately by LWSP;
An unfolded 'field' in a header is composed of:
1. field name made up of ASCII characters 33 to 126 decimal, excluding colon;
2. A colon (:)
3. a field body (any ASCII characters, apart from CR or LF)
Some field bodies are further structured ; others are just ASCII text;
Structured field bodies may be structured to contain one or more of:
1. individual special characters e.g. @ . ,
2. "quoted strings"
3. domain-literals - [text within square brackets]
4. (comments) - (enclosed in parentheses)
5. atoms (basically, a word)
Display of structured field data should NOT allow LWSP between words separated by @ or .
Characters can be quoted by preceding them with a backslash \ but ONLY within a quoted string, domain-literal, or comment.
The document distinguishes between "dtext" and "ctext", the former any character apart from [square brackets], backslash and CR; the latter any character apart from (parentheses), backslash and CR. A domain-literal is made up of dtext within square brackets, a comment of ctext within parentheses.
Special characters are ( ) < > @ , ; : \ " . [ ]
With structured text, contiguous words in a phrase are assumed to be separated by just ONE space;
< angle brackets > are used to indicate the presence of a "machine-usable reference" (e.g. a mailbox)
Field names are generally case-independent;
Backspace characters are regarded as 'overstriking' preceding text (but not past the start of a line) !
The body must occur after the header, but individual header lines need NOT be in any particular order!
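
The folding and field rules above can be sketched in Perl. The helper name parse_header is invented for the example; it unfolds a header block and splits each field at its first colon, and makes no attempt at the structured field bodies.

```perl
use strict;
use warnings;

# Hypothetical helper: unfold a header block and split it into
# [name, body] pairs.
sub parse_header {
    my ($text) = @_;
    $text =~ s/\015\012(?=[ \t])//g;     # unfold: drop a CRLF that precedes LWSP
    my @fields;
    for my $line (split /\015\012/, $text) {
        last if $line eq '';             # a null line ends the header
        # Field name: ASCII 33..126 except colon (58), then ":", then the body.
        if ($line =~ /^([\x21-\x39\x3B-\x7E]+)[ \t]*:[ \t]*(.*)$/) {
            push @fields, [ $1, $2 ];
        }
    }
    return @fields;
}

my $CRLF   = "\015\012";
my $header = "Subject: a long" . $CRLF . " folded line" . $CRLF
           . "From: fred\@example.com" . $CRLF;
printf "%s => %s\n", @$_ for parse_header($header);
```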

The canonical format for messages will now be shown. In the following note that items in square brackets are optional, that (parentheses) here group elements into a single item, that "a / b" means a or b, that *something means any number of copies of something including none at all, that 1*something means one or more copies of something, and that #something means a comma-separated list of somethings (1#something insists on at least one). Here goes (in BNF)..

     message     =  fields *( CRLF *text )       ; Everything after
                                                 ;  first null line
                                                 ;  is message body
     fields      =    dates                      ; Creation time,
                      source                     ;  author id & one
                    1*destination                ;  address required
                     *optional-field             ;  others optional
     source      = [  trace ]                    ; net traversals
                      originator                 ; original mail
                   [  resent ]                   ; forwarded
     trace       =    return                     ; path to sender
                    1*received                   ; receipt tags
     return      =  "Return-path" ":" route-addr ; return address
     received    =  "Received"    ":"            ; one per relay
                       ["from" domain]           ; sending host
                       ["by"   domain]           ; receiving host
                       ["via"  atom]             ; physical path
                      *("with" atom)             ; link/mail protocol
                       ["id"   msg-id]           ; receiver msg id
                       ["for"  addr-spec]        ; initial form
                        ";"    date-time         ; time received
     originator  =   authentic                   ; authenticated addr
                   [ "Reply-To"   ":" 1#address] )
     authentic   =   "From"       ":"   mailbox  ; Single author
                 / ( "Sender"     ":"   mailbox  ; Actual submittor
                     "From"       ":" 1#mailbox) ; Multiple authors
                                                 ;  or not sender
     resent      =   resent-authentic
                   [ "Resent-Reply-To"  ":" 1#address] )
     resent-authentic =
                 =   "Resent-From"      ":"   mailbox
                 / ( "Resent-Sender"    ":"   mailbox
                     "Resent-From"      ":" 1#mailbox  )
     dates       =   orig-date                   ; Original
                   [ resent-date ]               ; Forwarded
     orig-date   =  "Date"        ":"   date-time
     resent-date =  "Resent-Date" ":"   date-time
     destination =  "To"          ":" 1#address  ; Primary
                 /  "Resent-To"   ":" 1#address
                 /  "cc"          ":" 1#address  ; Secondary
                 /  "Resent-cc"   ":" 1#address
                 /  "bcc"         ":"  #address  ; Blind carbon
                 /  "Resent-bcc"  ":"  #address
     optional-field =
                 /  "Message-ID"        ":"   msg-id
                 /  "Resent-Message-ID" ":"   msg-id
                 /  "In-Reply-To"       ":"  *(phrase / msg-id)
                 /  "References"        ":"  *(phrase / msg-id)
                 /  "Keywords"          ":"  #phrase
                 /  "Subject"           ":"  *text
                 /  "Comments"          ":"  *text
                 /  "Encrypted"         ":" 1#2word
                 /  extension-field              ; To be defined
                 /  user-defined-field           ; May be pre-empted
     msg-id      =  "<" addr-spec ">"            ; Unique message id
     extension-field =
                   <Any field which is defined in a document
                    published as a formal extension to this
                    specification; none will have names beginning
                    with the string "X-">
     user-defined-field =
                   <Any field which has not been defined
                    in this specification or published as an
                    extension to this specification; names for
                    such fields must be unique and may be
                    pre-empted by published extensions>

Each of the above fields is detailed in the specification.
trace fields provide an audit-trail of message handling;
"Reply-To" is provided by the originator, while "Return-Path" should allow a trace back to the originator!
The "Received" field is useful in tracing transport problems.
Combinations of Originator fields are intentionally limited, as detailed in the standard;
Destination ("receiver") fields will look surprisingly familiar to anyone using email!
The generator of the message-id must guarantee that it's unique!
Extension field names must NEVER begin with "X-" as this sequence is reserved for user-defined fields!
There are numerous detailed standard definitions of dates (non-Y2K compliant!) and times, time zone (eg. "UT"), addresses;
Of arcane interest is that 'source-routing' (along a pre-specified path) is possible, but strongly discouraged;
For any domain, "Postmaster@domain" is required to be valid;
Appendix A3 gives examples of complete headers (email-ish!)

Miscellaneous Notes on HTTP 1.0

The following are snippets from RFC 1945. The extended BNF format used is similar to that of RFC 822, and other features are similar to MIME (particularly the obsolete RFC 1521). Appendix C of RFC 1945 lists how HTTP/1.0 differs from MIME.

Formatting rules

   URIs in HTTP can be represented in absolute form or relative to some
   known base URI [9], depending upon the context of their use. The two
   forms are differentiated by the fact that absolute URIs always begin
   with a scheme name followed by a colon.
       URI            = ( absoluteURI | relativeURI ) [ "#" fragment ]
       absoluteURI    = scheme ":" *( uchar | reserved )
       relativeURI    = net_path | abs_path | rel_path
       net_path       = "//" net_loc [ abs_path ]
       abs_path       = "/" rel_path
       rel_path       = [ path ] [ ";" params ] [ "?" query ]
       path           = fsegment *( "/" segment )
       fsegment       = 1*pchar
       segment        = *pchar
       params         = param *( ";" param )
       param          = *( pchar | "/" )
       scheme         = 1*( ALPHA | DIGIT | "+" | "-" | "." )
       net_loc        = *( pchar | ";" | "?" )
       query          = *( uchar | reserved )
       fragment       = *( uchar | reserved )
       pchar          = uchar | ":" | "@" | "&" | "=" | "+"
       uchar          = unreserved | escape
       unreserved     = ALPHA | DIGIT | safe | extra | national
       escape         = "%" HEX HEX
       reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
       extra          = "!" | "*" | "'" | "(" | ")" | ","
       safe           = "$" | "-" | "_" | "."
       unsafe         = CTL | SP | <"> | "#" | "%" | "<" | ">"
       national       = <any OCTET excluding ALPHA, DIGIT,
                        reserved, extra, safe, and unsafe>
   For definitive information on URL syntax and semantics, see RFC 1738
   [4] and RFC 1808 [9]. The BNF above includes national characters not
   allowed in valid URLs as specified by RFC 1738, since HTTP servers
   are not restricted in the set of unreserved characters allowed to
   represent the rel_path part of addresses, and HTTP proxies may
   receive requests for URIs not defined by RFC 1738.
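
The escape rule ("%" HEX HEX) is the one you meet most often in CGI work, so here is a sketch of it in both directions. The helper names url_escape and url_unescape are invented for the example. The sketch works on bytes, and escapes everything outside ALPHA, DIGIT, safe and extra as listed in the BNF above.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every byte that is not ALPHA, DIGIT,
# safe ($ - _ .) or extra (! * ' ( ) ,).
sub url_escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\$\-_.!*'(),])/sprintf("%%%02X", ord($1))/ge;
    return $s;
}

# Hypothetical helper: turn each %HH back into the byte it stands for.
sub url_unescape {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $raw     = 'a b&c=d/e';
my $escaped = url_escape($raw);
print "$escaped\n";
print url_unescape($escaped), "\n";
```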

HTTP messages

Let's look at a definition of HTTP messages:

   HTTP messages consist of requests from client to server and responses
   from server to client.
       HTTP-message   = Simple-Request           ; HTTP/0.9 messages
                      | Simple-Response
                      | Full-Request             ; HTTP/1.0 messages
                      | Full-Response
   Full-Request and Full-Response use the generic message format of RFC
   822 [7] for transferring entities. Both messages may include optional
   header fields (also known as "headers") and an entity body. The
   entity body is separated from the headers by a null line (i.e., a
   line with nothing preceding the CRLF).
       Full-Request   = Request-Line             ; Section 5.1
                        *( General-Header        ; Section 4.3
                         | Request-Header        ; Section 5.2
                         | Entity-Header )       ; Section 7.1
                        CRLF
                        [ Entity-Body ]          ; Section 7.2
       Full-Response  = Status-Line              ; Section 6.1
                        *( General-Header        ; Section 4.3
                         | Response-Header       ; Section 6.2
                         | Entity-Header )       ; Section 7.1
                        CRLF
                        [ Entity-Body ]          ; Section 7.2
   Simple-Request and Simple-Response do not allow the use of any header
   information and are limited to a single request method (GET).
       Simple-Request  = "GET" SP Request-URI CRLF
       Simple-Response = [ Entity-Body ]
   Use of the Simple-Request format is discouraged because it prevents
   the server from identifying the media type of the returned entity.
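
As a rough illustration of the Full-Response layout (and of what an "nph-" script has to produce for itself), here is a sketch that assembles one by hand. The helper name full_response is invented for the example, and Content-Length is computed with length, which assumes the body is a string of bytes.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";

# Hypothetical helper: assemble a Full-Response from its parts.
sub full_response {
    my ($code, $reason, $body, %headers) = @_;
    my $msg = "HTTP/1.0 $code $reason" . $CRLF;            # Status-Line
    $headers{'Content-Length'} = length($body);            # assumes bytes
    for my $name (sort keys %headers) {                    # header fields
        $msg .= $name . ': ' . $headers{$name} . $CRLF;
    }
    $msg .= $CRLF;                                         # the null line
    $msg .= $body;                                         # Entity-Body
    return $msg;
}

binmode STDOUT;   # keep CRLF intact on systems that translate line endings
print full_response(200, 'OK', "<p>Hello</p>\n", 'Content-Type' => 'text/html');
```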

A list of status codes

       Status-Code    = "200"   ; OK
                      | "201"   ; Created
                      | "202"   ; Accepted
                      | "204"   ; No Content
                      | "301"   ; Moved Permanently
                      | "302"   ; Moved Temporarily
                      | "304"   ; Not Modified
                      | "400"   ; Bad Request
                      | "401"   ; Unauthorized
                      | "403"   ; Forbidden
                      | "404"   ; Not Found
                      | "500"   ; Internal Server Error
                      | "501"   ; Not Implemented
                      | "502"   ; Bad Gateway
                      | "503"   ; Service Unavailable

Differences from MIME

There are just a few differences from the (now obsolete) RFC 1521 MIME definition:

"Bare" CR and LF are allowed within text (MIME only allows line breaks of CRLF);
HTTP/1.0 allows only a restricted set of date formats;
The Content-Encoding header is not represented in RFC 1521;
conversely, HTTP/1.0 doesn't use the MIME Content-Transfer-Encoding header field;
HTTP/1.0 gives meaning to header fields in multipart "body-parts", while RFC 1521 generally ignores these.

HTTP 1.1

The RFC standard for HTTP/1.1 is RFC 2616 from June 1999. We're up to ~170 pages in this standard, which differs quite a lot from 1.0. Krishnamurthy, Mogul and Kristol have reviewed the differences between v 1.0 and 1.1. In summary, these are:

Because of 'back-compatibility' issues, the standard is complex and far from uniform. HTTP will as usual ignore headers it doesn't understand!
A Via header is supported so one can see who along the chain between client and server is using 1.0 and who is using 1.1.
An OPTIONS method is introduced that (at least in theory) allows a client to ask the server what it can do, without actually requesting a 'resource'.
the Upgrade header is for the future - to allow a switch to an alternate means of communication.
In 1.0, caching was simple - the Expires header told a cache holding a copy of a 'common' (often requested) response until when it could keep using that copy without asking for a fresh one. There was another frill: a cache could ask the originator for a copy of a page (or whatever) only If-Modified-Since a given time. The originating server would then respond with a newer copy (code 200), or code 304 which meant 'not modified, OK to use the original'. (The problem here was one of clock synchronisation). There was even the ability to disable all caching with the header Pragma: no-cache.

1.1 cranks the above up a few notches - a cache entry is either fresh or stale. Stale entries should (normally) be re-validated when requested. A new idea is the ETag header, which is shorthand for 'entity tag'. Rather than insisting on comparison of timestamps, the new standard makes sure that if two entity tags are the same, then the associated responses must be identical. There are also extra conditional requests..
1. If-None-Match - several entity tags are presented!
2. If-Unmodified-Since
3. If-Match.
There's also a Cache-Control header, with lots of cache control directives including relative expiration times (max-age directive), private, no-store and no-transform.
Even more cute is the Vary header which forces the cache to examine not only the requested URL (a la 1.0) but also select request header fields to make sure that sending a cached response is appropriate!
A new (and beautiful) attribute of 1.1 is that you can request part of a resource. Great for speed and bandwidth, also you can download chunks of (say) a large file! A good idea. The relevant response code is 206 (Partial Content). One can even send multiple ranges in one message, and there's a MIME type for this.. multipart/byteranges.
It's now possible to first send just the headers, and then transmit the body of the request (which may be biiiig) only if the server is happy to accept it. The client announces this with an Expect: 100-continue request header. The unhappy server answers with an error status (a 401, say, or 417 Expectation Failed) to tell the client to go to hell; the happy server says 100 (Continue). But watch for bugs in this clever trick!
Compression is provided for (either 'transfer-codings', which are hop-by-hop ie happen along the line of communication, or 'content-codings', which happen at one end, with decompression at the other end). Note that 1.0 already has a Content-Encoding header but that 1.1 adds Transfer-Encoding for hop by hop coding. The Accept-Encoding header is spruced up (enough?), and the TE header allows the client to say which transfer encoding is kosher.
A layer deeper than HTTP is the transport protocol used to actually get the messages across - this is usually TCP. Unfortunately, setting up a new TCP connection for each request is clumsy and costly. Even worse is that each image in a web-page is retrieved by a distinct HTTP request (with its own little TCP overhead)!
1.1 introduces two solutions to this clumsiness - persistent connections and pipelining. Implementation is complex - the Connection header allows a message that is about to be forwarded to have hop-by-hop headers clipped out of it! Persistent connections are now a default (!) but can be turned off at will using a Connection: close header. Pipelining allows multiple requests to be sent across the same TCP connection, without waiting for an answer to the initial request!
A server generating a dynamic response often doesn't know the length of the response in advance, so cannot send a Content-Length header. In 1.1 we now have chunked transfer-coding (set up by specifying Transfer-Encoding: chunked), so manageable chunks of data can be sent. A great idea!
A Trailer header is convenient when used with chunking - deferred headers (that depend, for example, on the whole body) can be listed in this header and then sent after the chunks! There are several technicalities.
In 1.1 the Content-Length header refers only to the actual message length (the "entity length") and cannot be used to talk about temporary length changes brought about by, for example, compression between hops in the chain!
To prevent problems that occur when several hosts are bound to the same IP address (there is a multiplicity of virtually hosted dot coms around, for example), a Host header provides more detail about the actual 'vanity' name of the dot com or whatever whose page is being requested at a given IP. (A port may also be specified). {This partially addresses a problem that will probably only be sorted out when IPv6 comes of age}.
There are now many more status codes than the sixteen found with 1.0: 24 new ones, including 409 (Conflict) and 410 (Gone)! There's also a Warning header. All the codes are listed starting on page 39 of RFC 2616.
Several security issues are addressed in 1.1. "Basic authentication" sends the password practically in the clear, so it is supplemented by "Digest authentication", in which the client first computes an (MD5) checksum of several values including a one-time value provided by the server ("nonce"). Nonces can also become stale! Proxy authentication is also addressed.
The Content-MD5 header is refined.
"Content negotiation" is carefully specified, but still confusing and complex.

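To make the chunked transfer-coding mentioned above a little more concrete, here is a sketch of the encoding side. The helper name chunk_body is invented for the example. Each chunk is its size in hexadecimal, a CRLF, the data, and another CRLF; a zero-length chunk ends the body. Trailers and chunk extensions are left out.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";

# Hypothetical helper: encode a body using chunked transfer-coding,
# cutting it into pieces of at most $size bytes.
sub chunk_body {
    my ($body, $size) = @_;
    die "chunk size must be positive\n" unless $size > 0;
    my $out = '';
    for (my $pos = 0; $pos < length($body); $pos += $size) {
        my $piece = substr($body, $pos, $size);
        $out .= sprintf("%x", length($piece)) . $CRLF . $piece . $CRLF;
    }
    return $out . '0' . $CRLF . $CRLF;   # zero-length chunk, then the final CRLF
}

binmode STDOUT;   # keep CRLF intact on systems that translate line endings
print chunk_body('Hello, world', 5);
```
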
Different Servers

You should research your server - a good internet search term is simply "HTTPD" with the type of server, for example Apache (by far the commonest, over 50% of all HTTPDs). Note that the CERN/W3C and NCSA HTTPDs are no longer being developed, so if you use them and there's a bug, you may be stuck!

H. References

Check these out..

The CGI FAQ - read this FIRST. It's good.
James Marshall's tutorial (a superb introduction, with practical examples, also available Auf deutsch, En español, Em português, & In het Nederlands)
Ottega - pretty good, with some examples.
Good stuff with a comprehensive Perl tutorial.
For C and CGI CSCENE has a darn good page by Brent York (with a useful example of a web-page counter written in C).
CTDP CGI scripting manual on one of those irritating Tripod pages, but well done.
A note on environment variables in Perl, at about.com (good);
Lies - good! (John Callender)
KU introduction (an "instantaneous introduction")
CGI 101 - a comprehensive tutorial also available in PDF (first 6 chapters, anyway), and as a book.
NCSA CGI stuff Here's their overview

You can get all the RFCs at http://www.ietf.org. Here are a few:
RFC 822 (46 pages)! This supplanted RFC #733. The "Standard for the Format of ARPA Internet Text Messages", dated 1982-8-13.
Here's RFC 1945 (59 pages)
RFC 931
W3C notes on HTTP/1.1 and here is RFC 2616 in HTML! Here's the text version (175 pages).
You may wish to read an interesting note on differences between HTTP/1.0 and /1.1 by Krishnamurthy and colleagues.