This web page assumes that the reader is fluent in Hypertext Markup Language (HTML), with a passing knowledge of JavaScript and some acquaintance with Perl. In particular, familiarity with HTML forms is assumed. If you don't understand Perl at all, find Robert's Perl tutorial using e.g. Google, slog through it a few times, get hold of Perl for your system, and write a few simple programs. Then read on.
$LF = pack("c", 10);   # create a line feed
$CR = pack("c", 13);   # and a carriage return.
Running CGI scripts on a computer will always potentially compromise the security of that computer!
When we learnt HTML, we didn't realise that we were floating on several layers of abstraction. For example, consider the following:
GET /index.htm HTTP/1.0
Accept: www/source
Accept: text/html
Accept: image/gif
User-Agent: Netscape/2.0 libwww/2.14
From: foo@foobar.com

Not what you're used to, is it? The above message is all behind-the-scenes stuff. First, note that after all the blurb is a blank line . Yes, I promise you - it's there. This is very important. How important, we'll find out later!
Next, see how this is an example of a GET request. As with all such messages, the format is very precisely structured - it sticks rigidly to rules originally defined in something called RFC 822. The request is quite specific. It's asking for HTTP (hypertext transfer protocol) version 1.0, and will only Accept certain types of data in reply. It's pretty easy to determine what types are being referred to - html pages, and gif images. These data types are called MIME types (MIME stands for Multipurpose Internet Mail Extensions - we discuss it in a little more detail far below).
At this point .. pause .. take a deep breath. Although all of the above looks like garbage, look once more at the message, read each line carefully, and see how a little order starts to appear out of the chaos!
{As an aside, note the From: line, which was a polite little line found in early requests, now totally obsolete due to the enthusiastic activities of spammers}.
HTTP/1.0 200 OK
Date: Wednesday, 02-Feb-94 23:04:12 GMT
Server: NCSA/1.1
MIME-version: 1.0
Last-modified: Monday, 15-Nov-93 23:33:16 GMT
Content-type: text/html
Content-length: 456

<HTML> .. the web page goes here.. </HTML>
Note again the header section followed by a blank line . On the first line, "HTTP/version" is followed by a code (200) that indicates "everything went fine". The corresponding code for "not found" is the dreaded 404. A lot of blurb (date, server, MIME version etc) follows. There certainly is a lot of stuff here, isn't there? The MIME type (text/html) and length, and so on. After our mandatory little blank line, the body of the message - an html page.
When you actually get around to writing a CGI script, you'll find that the above can lazily be abbreviated to just..
Content-type: text/html

<HTML> .. the web page goes here.. </HTML>
The trick is that your CGI script writes such a response, which is then parsed by the server on which your script lives. The final product - the plethora of information that you saw above - is then sent to the user agent.
Soon we will consider a slight variation on the above, where instead of requesting a web page, the browser asks the server to run a CGI script ! But before we do so, a brief comment on the very first line of the message..
HTTP/1.0 is the basis of the World Wide Web. There is no 'formal' definition or 'standard' (although HTTP/1.0 accounts for about three quarters of Internet traffic). The closest we can come is something called RFC 1945, which was written by Tim Berners-Lee and his colleagues in May 1996, long, long after he started up the World Wide Web. He described HTTP as:
".. an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). A feature of HTTP is the typing of data representation, allowing systems to be built independently of the data being transferred." |
Basically, HTTP is a simple protocol for transferring information. HTTP makes it easy to transfer not only HTML documents, but a vast variety of other data. Not only can we retrieve documents, we can also for example search for information, and talk to a variety of programs on computers across the 'Net. HTTP is very similar to the format used for e-mail (RFC 822) and MIME.
The beauty of HTTP/1.0 is that it makes it easy for us to 'talk to' other Internet-based protocols - there's a host of these, including SMTP, NNTP, FTP, Gopher, and WAIS. HTTP/1.0 is an excellent negotiator between these protocols.
Recently the World Wide Web Consortium (W3C) has defined a new standard called HTTP/1.1, which will probably eventually replace 1.0. Most servers still use 1.0. We discuss HTTP/1.1 below, as well as taking a brief peek at some features of HTTP/1.0. (Don't look at these now).
A CGI script is really a program. The program lives on a computer connected to the Internet. The "script" (actually a fully-fledged program) is usually written in the elegant and satisfying language Perl, although a host of other languages may be used (TCL, BASIC [Yuk!], C, Ada, .. and so on). CGI scripts need not necessarily be interpreted scripts; in fact, they can be compiled programs (for example, C is always compiled).
Consider someone (let's call him Fred) who is browsing web pages on the internet. Fred comes across a web page which contains a link that refers to a CGI script. Fred clicks on the link. What happens next?
When Fred clicks on the link, certain parameters are passed by his browser across the Internet, and eventually the CGI script that lives on the server starts running. What happens next depends very much on the nature of the script, but commonly:
Let's explore how this happens. First, a VERY SIMPLISTIC example of what the browser request might look like:
GET /cgi-local/myscript.pl HTTP/1.0
Accept: www/source
Accept: text/html
Accept: image/gif
User-Agent: Netscape/2.0 libwww/2.14

Note the similarity to our previous, conventional web-page example, including (you guessed it) the blank line at the end. The difference is that, instead of requesting a web page, the GET asks for a script called "myscript.pl" found in the directory "/cgi-local/". What will the response look like? Well, that depends on the script, but commonly, the response will be almost identical to our previous response..
HTTP/1.0 200 OK
Date: Wednesday, 02-Feb-94 23:04:12 GMT
Server: NCSA/1.1
MIME-version: 1.0
Last-modified: Monday, 15-Nov-93 23:33:16 GMT
Content-type: text/html
Content-length: 456

<HTML> .. the web page goes here.. </HTML>
The only difference is that the script dynamically writes the response, rather than fetching a static page stored somewhere on the server. You can imagine the immense power (and potential for cock-ups) inherent in such a set-up! There are several important aspects to CGI scripting that you have to be aware of. We will look at:
There are several ways that a CGI script might be invoked. One is simply including a reference to the script in an HTML anchor. Another is a form. We previously learnt about HTML forms in our JavaScript tutorial. Important components are:
When we discussed forms in our JavaScript tutorial, we referred to two methods that might be used, GET and POST. We also mentioned that POST is much more general and powerful than GET. Let's look at a POST that might result from a form..
POST /cgi-local/posthandler HTTP/1.0
Accept: www/source
Accept: text/html
Accept: video/mpeg
Accept: image/jpeg
Accept: image/gif
{..snippety..more Accepts may go here..}
User-Agent: Netscape/2.0 libwww/2.14
Content-type: application/x-www-form-urlencoded
Content-length: 184

&walrus=Large
&honey=sweet
&silliness=on
&myURL=Fred%20Foobar%20argh@foobar.com
Note the one special line in the above, apart from the blank line ! Can you guess which one? Yep.. (no, not the walrus)..
Content-type: application/x-www-form-urlencoded
This line tells us something very interesting - that all fancy characters in the 'body' section following the blank line will be encoded. For example, a space character may be encoded as "%20", in other words, % followed by the relevant hexadecimal code. (Browsers submitting forms usually send a space as a plus sign, "+", instead; a script has to cope with both.)
Now look more carefully at the body of the POST. It's made up of lines with the general format:
&name=value
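See how every name begins with an ampersand (&), and then the value follows an equals sign. (In practice a browser usually sends the pairs as one unbroken line, with the ampersand acting purely as a separator between one pair and the next.) Neither the = nor the & is encoded, but any other occurrence of either within the actual name or value will be encoded (as %3D and %26 respectively).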
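The encoding rules just described can be sketched in a few lines of Perl. This is only an illustration, not what any particular browser does: the sub names url_encode and encode_pairs are invented here, spaces are written as %20 rather than "+", and the input is assumed to be plain bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode one name or value: every character that is not a letter,
# digit, '-', '_' or '.' becomes % followed by two hex digits.
# (Hypothetical helper; assumes single-byte characters.)
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Join name=value pairs with ampersands. Takes a flat list
# (name, value, name, value, ...) so that the order is preserved.
sub encode_pairs {
    my @fields = @_;
    my @pairs;
    while (@fields) {
        my ($name, $value) = splice(@fields, 0, 2);
        push @pairs, url_encode($name) . "=" . url_encode($value);
    }
    return join("&", @pairs);
}

print encode_pairs("walrus", "Large", "sum", "1+1=2 & more"), "\n";
```

Because the = and & inside the second value are encoded, the script at the other end can still split the string safely on the unencoded ones.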
The cute thing about POST is that a separate data stream is opened, and the data (here name=value pairs) are put onto that separate stream. The stream becomes the standard input of the CGI script! (For future reference, we'll here note that when you use the POST method, then the environment variables CONTENT_LENGTH and CONTENT_TYPE are both set up appropriately).
Always use POST rather than GET, unless you have no choice
With this under our belt, let's look briefly at the same information, passed as a GET..
GET /cgi-bin/posthandler.pl?&walrus=Large&honey=sweet&silliness=on&myURL=Fred%20Foobar%20argh@foobar.com HTTP/1.0
Accept: www/source
Accept: text/html
Accept: video/mpeg
Accept: image/jpeg
Accept: image/gif
{..snippety..more Accepts may go here..}
User-Agent: Netscape/2.0 libwww/2.14

Note the restrictive format - all the name=value parameters are plonked after a question mark that follows the path and name of the Perl script "posthandler.pl". Despite this limitation, the GET method is useful where we simply wish to submit a stock query to a script, for we can embed the query in a link, thus:
<a href="http://www.anaesthetist.com/cgi-local/shazam.pl?foo=123&bar=nibbles"> click here for confusion</a>
.. but be careful. There's often a (poorly-defined) limitation on the amount of data you can pass with a GET, so don't be surprised if long data are ruthlessly truncated!
CGI stands for Common Gateway Interface . The Interface is the glue that binds the "client" (in our example, Fred) and the actual script that does the work. Information comes in to the Interface, and is then relayed to the script. There is a particularly exacting format that is used to pass this information. What happens is that the script can read in parameters passed to it by the Interface. The script "sees" these parameters as things called environment variables . The names of environment variables are always written in UPPER CASE. There are many environment variables, but perhaps the two most important ones are:
We already know that if we create a form and then call up a CGI script using POST (the better way) then it's easy for the script to see the incoming data - it simply reads the standard input . Things are more convoluted if we use GET. GET can be invoked..
<a href="http://www.foo.com/bar/foobar.pl?thisIsSomeInfo">
Note that the string that the script sees in QUERY_STRING is mutilated during the process of transfer. In particular:
This allows extra information to be transmitted from a web-page to a CGI script. The information is put in after the path to the script, for example: "http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere";
You can combine such path information with a query-string.. "http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere?Whatmeworry";
There are several other env variables that are routinely set for all requests:
There are other env variables whose presence depends on the demand being made of the gateway:
A further complication is environment variables that begin with HTTP_ :
None of the environment variables can reliably identify a user!
Note that with server side includes, several other environment variables may be made available, including DOCUMENT_NAME, DOCUMENT_URI, QUERY_STRING_UNESCAPED, DATE_LOCAL, DATE_GMT, and LAST_MODIFIED.
Okay, enough mucking around. We're soon going to look at Perl coding in some detail, but here's just a flavour - the Perl code to actually get hold of the environment variables:
#!/usr/local/bin/perl
print "Content-type: text/html\n\n";
print "\n";
foreach $key (sort keys(%ENV)) {
    print "$key = $ENV{$key} <br>";
}

The above little Perl script will actually provide a list of environment variables as an HTML page (admittedly, without the usual head and body tags etc). The first line simply says where to find Perl on the system. See how Perl automagically keeps the environment variables in the %ENV associative array, and we go through each using the foreach instruction. The print statements preceding the foreach loop are explored in detail below. Note the usual Perl method of accessing each associative variable - because it's a variable we say $ENV and not %ENV, but we put curly brackets {braces} afterwards thus: $ENV{whatever}.
If you know a bit of Perl, then you can more-or-less work out how to respond to a GET or POST from a browser. You want to generate something along the lines of our lazy little example above..
Content-type: text/html

<HTML> .. the web page goes here.. </HTML>
.. remembering that the HTTPD will fill in all the details to make up an acceptable HTTP/1.0 header. Well, let's try some Perl:
#!/usr/local/bin/perl
print "Content-type: text/html\n\n" ;
print "<html><head><title>A little page</title></head>" ;
print "<body>This is a little page</body></html>" ;
Note the two newlines (\n\n) after the Content-type statement - the first ends the header line and the second provides our magical blank line, without which the header simply won't work. Also see how we simply print to standard output! (stdout - if you were to run the script on your machine without any web in the way, you would see the response on your console). Incredibly straightforward. As usual, there are wrinkles. There's a fancy way in Perl (isn't there always?) of quoting large sections of text. Here it is..
print "Content-type: text/html\n\n" ;
print <<"END OF HTML";
<html>
<head>
<title>A little page</title>
</head>
<body>
This is a little page
</body>
</html>
END OF HTML
In the above, we've used the text string "END OF HTML" to delineate (surprise, surprise) the end of the html text we wish to print. Cute, is it not? This cute Perl trick is called "here-document quoting". But NOTE that the line END OF HTML must be alone and on its own - any character on the line apart from "END OF HTML" (even a space) will screw things up horribly. Use with respect!
There's no reason why your CGI script has to return a Content-type of text/html. For example, you can generate images on the fly (if you know how) of type image/gif. Something along the lines of..
<IMG SRC="/cgi-local/foo.pl?aargh=on&image=bar.gif" ALT="You fruitcake">
Well, presumably you'd want to put in a width and height if you knew them, now, wouldn't you? This is the sort of stuff that generates those ugly little web-page counters that you so self-righteously despise!
Apart from Content-type header lines, there are several others that you can put in, but there are two that can stand on their own, replacing Content-type: - Location and Status .
Location is quite sneaky (redirects to another URL, if a URL is specified , or fetches a document as if the client had requested it, if a path is specified . You can even submit a "?" directive after the relevant file name).
Status sends something called a status-line to the client. A status-line is an HTTP/1.0 message that combines a three-digit code (nnn) and a string that explains the reason for the code.
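As a rough sketch of how a script might use these two headers on their own: the sub names, the "/old" rule and the example.com URL below are all invented for illustration, and exactly how the server treats Location and Status is best checked against your own HTTPD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a header that sends the client off to another URL.
# Note the blank line (\n\n) that ends the header section.
sub location_header {
    my ($url) = @_;
    return "Location: $url\n\n";
}

# Build a Status header (three-digit code plus reason) and a tiny body.
sub status_response {
    my ($code, $reason) = @_;
    return "Status: $code $reason\n"
         . "Content-type: text/plain\n\n"
         . "$code $reason\n";
}

# Hypothetical rule: anything asking for /old is redirected,
# everything else is told that it was not found.
my $path = $ENV{'PATH_INFO'} || '';
if ($path eq '/old') {
    print location_header("http://www.example.com/new.html");
} else {
    print status_response(404, "Not Found");
}
```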
You know how to get the data - they are stored in the QUERY_STRING environment variable. So we simply say:
if ( $ENV{'REQUEST_METHOD'} eq "GET" ) { $inputString = $ENV{'QUERY_STRING'}; };
Well, not that simple, because we still have to decode various special characters, and pull out the name=value pairs. There are several steps:
$inputString =~ s/\+/ /g;
my (@Pairs); @Pairs = split ( /&/, $inputString );
my (%NameValues);        # our associative array
my ($name, $value);      # local variables
foreach $onepair (@Pairs) {
    ($name, $value) = split ( /=/, $onepair);
    # FIX UP: here we should fix up the names and values..
    $NameValues{$name} = $value;    # set value in associative array
};
$name =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;

.. and a similar line for $value. {check the above code}.
if ( defined ($NameValues{$name}) ) {
    # do what you want to here with the multiple values..
    # for example, separate them by colons, or whatever.
} else {
    $NameValues{$name} = $value;
};
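Putting the steps above together gives something like the following sketch. The sub names url_decode and parse_query are made up for this example, and joining repeated names with a colon is just one of the possible choices mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode one URL-encoded name or value.
sub url_decode {
    my ($text) = @_;
    $text =~ s/\+/ /g;                                   # plus signs back to spaces
    $text =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;  # %XX back to a byte
    return $text;
}

# Turn "a=1&b=2" into a hash; repeated names are joined with colons.
sub parse_query {
    my ($input) = @_;
    my %name_values;
    foreach my $pair (split /&/, $input) {
        next if $pair eq '';                  # tolerate a leading or doubled &
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;    # "name" with no "=value"
        $name  = url_decode($name);
        $value = url_decode($value);
        if (defined $name_values{$name}) {
            $name_values{$name} .= ":$value";
        } else {
            $name_values{$name} = $value;
        }
    }
    return %name_values;
}

my %form = parse_query("&walrus=Large&honey=sweet&myURL=Fred%20Foobar&honey=sticky");
print "$_ = $form{$_}\n" foreach sort keys %form;
```

Decoding happens only after the string has been split on & and =, so an encoded %26 or %3D inside a value cannot be mistaken for a separator.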
The general way that you read the body of a POST statement is simply by querying the standard input, stdin. This is the same as reading the keyboard (console) when your Perl program is interacting with you. In other words you say something along the lines of:
if ( $ENV{'REQUEST_METHOD'} eq "POST" ) { read(STDIN,$inputString,$ENV{'CONTENT_LENGTH'}); };
.. which will read the whole body of the POST from the standard input. Note that this body is composed of multiple lines (separated by CR and/or LF characters). See how we get the length of the data from the CONTENT_LENGTH environment variable. There is NO obligation for CGI to put some sort of "End Of File" character on the data, so CONTENT_LENGTH is rather important.
After we've read in our $inputString, we massage it into shape (and name/value pairs) as we did for the GET statement above.
We use the more complex Perl read statement rather than something like:
undef $/;                  # force read to end of file!
$inputString = <STDIN>;
$/ = "\n";                 # re-establish usual delimiter!

because we have no EOF character, and therefore need to state the length.
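A slightly more defensive version of the length-based read might look like the sketch below. The sub name read_body is invented here; the loop is there because a single read can, in principle, hand back fewer bytes than were asked for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read exactly $length bytes from a filehandle (fewer if input runs out).
sub read_body {
    my ($fh, $length) = @_;
    my $body = '';
    while (length($body) < $length) {
        # Append to $body, starting at its current end.
        my $got = read($fh, $body, $length - length($body), length($body));
        last unless $got;       # undef on error, 0 when nothing is left
    }
    return $body;
}

if (($ENV{'REQUEST_METHOD'} || '') eq 'POST') {
    my $input_string = read_body(\*STDIN, $ENV{'CONTENT_LENGTH'} || 0);
    print "Content-type: text/plain\n\n";
    print "Read ", length($input_string), " bytes\n";
}
```

Passing the filehandle in as a parameter also makes the sub easy to try out from the command line with a file or an in-memory string instead of a real POST.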
There are several tricks to getting a Perl script to run on your server. Here are a few pointers:
#!/usr/bin/perl
(Don't leave out the #, or put a space between the # and the !). "which perl" will generally tell you on a UNIX system where perl resides.
chmod 755 filename
Where filename is the name of the directory or file whose permission you're changing. You'll find that friendly software like WS-FTP will show the permissions in a slightly different way.
Useful UNIX commands apart from chmod are mv, used to move or rename a file, mkdir for creating a directory, cp to copy a file, pwd to print the name of the current directory, and ~ as a replacement for the long cumbersome path to your home directory!
Okay. Create a web-page with a reference to the script, or simply type in the URL thus:
http://www.anaesthetist.com/cgi-local/foobar.pl
And see whether you get back the dynamic web-page you expected (or a confusing error)! If all hell broke loose (ie. you got an internal server error, code 500) then look carefully on your server for the file error.log that will tell you what went wrong (yep, you guessed it, try the grovel, grovel routine). Chances are, you didn't set 755 permissions for both the directory and script file, but you may have the wrong #!pathtoperl, or some other error in your script. (Check the script from the command line if you can, although this is often unhelpful).
If the script file wasn't even found, then you may have forgotten that UNIX is cAse SENsitIvE, or have the wrong name, or the wrong suffix (e.g. ".pl" instead of ".cgi", or vice versa). Here's an example to play around with:
A simple 'Hello world' example

Click on this test link. The URL is "http://www.anaesthetist.com/cgi-local/hi.pl" Here's the source code for hi.pl..

#!/usr/bin/perl
#=======================================================================#
#                    A SIMPLE CGI HELLO-WORLD!                          #
#=======================================================================#
print "Content-type: text/html\n\n" ;
print <<"END OF HTML";
<html>
<head>
<title>A little page!</title>
</head>
<body>
<div align="center">
<h1>Hi!</h1>
</div>
<hr>
This is a little test page.
<p><a href="http://www.anaesthetist.com/mnm/cgi/index.htm#after">Back to tutorial</a>
<p><hr>
</body>
</html>
END OF HTML
#=======================================================================#
#                          That's it!                                   #
#=======================================================================#
Hmm. It's general practice for scripts to be run as 'nobody'. This means that if your CGI is to write to a file on the server, the directory needs to be "world-writable" (a particularly bad idea) or owned by 'nobody'. [You may wish to research this further].
This depends on your database. There's a lot on the 'Net about the common databases - including mySQL and PostgreSQL. For example, check out resourceindex.com. It's also good to know some ODBC.
If you write Perl scripts that others can execute, you ARE opening security holes! Sounds pretty brutal, but largely true. Beware the following (at the very least):
$surname = "O'Connor";
&SubmitSQL ("INSERT INTO NAMETABLE (Surname) VALUES ('$surname')" );

will cause a headache. You first have to duplicate the single quote in the surname!
If you don't understand the above, you probably shouldn't be allowing others to use your scripts by publishing them on the 'Net! Even if you do, you're probably still going to get burned from time to time.
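The quote-doubling mentioned for the O'Connor case can be sketched like this. The sub name sql_quote is invented, and doubling single quotes is only the minimum: some databases treat other characters (backslashes, for instance) specially too, so if your database layer offers placeholders or its own quoting routine, that is usually the safer route.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Double every single quote and wrap the result in single quotes,
# so that the value can be dropped into an SQL statement.
sub sql_quote {
    my ($value) = @_;
    $value =~ s/'/''/g;
    return "'" . $value . "'";
}

my $surname = "O'Connor";
my $sql = "INSERT INTO NAMETABLE (Surname) VALUES (" . sql_quote($surname) . ")";
print "$sql\n";
```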
The following are all special characters in HTTP/1.0:
< > ( ) @ , ; : \ " / [ ] ? = { }

.. as well as the space character, and horizontal tab. Also don't forget the special roles of LF ± CR!
This is actually fairly straightforward. In your JavaScript simply say something along the lines of..
window.location.href = "http://www.example1.com/cgi-bin/fred.pl?hereismyquery";

.. and you're away.
There's a readily available Perl library module called CGI.pm. If it's on your system, you're lucky, and can just say
use CGI;

at the start of your Perl script. If it's not, try (grovel, grovel) or installing it in a local directory and then saying something along the lines of..
use lib '/path/of/yourlocaldir'; use CGI;
Using CGI.pm you can handle parsing of CGI queries and form generation
with just a few simple calls. Check it out in Lincoln Stein's vast CGI.pm documentation.
John Callender has written an excellent introduction to this Perl module. This
includes how to send an email from Perl, and is incidentally a rather good
introduction to Perl! (Or get the source of that other
excellent form to email script, FormMail).
A note on MIME
MIME stands for "Multipurpose Internet Mail Extensions". MIME allows
for transfer and recognition of a host of different data types.
There are hundreds of types. MIME types were originally defined in RFC 1341.
For some improvements, see the now obsolete RFC 1521 and 1522, and the more recent
five-part RFC 2045 to 2049. Mailcap files (which handle media
types) are in RFC 1524. All MIME types should be registered with IANA.
Current types include:
For further reading try a long list of MIME types with the relevant RFCs, etc. Here's another good reference. And here's a no-nonsense note. The MIME FAQ is here.
Here's a direct snitch from the CGI FAQ (go read it - it's great)..
For a proper overview, "man chmod". Some modes that may be useful in a typical CGI context are:

* CGI programs, 0755
* data files to be readable by CGI, 0644
* directories for data used by CGI, 0755
* data files to be writable by CGI, 0666 (data has absolutely no security)
* directories for data used by CGI with write access, 0777 (no security)
* CGI programs to run setuid, 4755
* data files for setuid CGI programs, 0600 or 0644
* directories for data used by setuid CGI programs, 0700 or 0755
* For a typical backend server process, 4750

What you're doing is setting bit flags in the permissions of files and directories. Scary stuff! The 'behind-the-scenes' information is that 755 (for example) is an octal number. The "7" refers to "remote file permissions" for the owner, and the subsequent two 5's to permissions for "group" and "other" respectively. Each octal digit is made up of three bits, the first referring to read permissions, the second to write, and the last to execute. So "5" allows read and execute, but not write, and so on. (read = 4, write = 2, execute = 1, 4+2+1 = 7).
Okay, the leftmost digit (usually assumed to be a zero) is a little different, in that its first bit selects the "set user ID", its second the "set group ID", and its last the "save text image" (sticky) attribute. So the "4" in "4755" means "set user ID", for example.
Note that if you're being anally retentive, it's probably best to specify the leading zero, as this ensures that the number is seen as octal on almost any system!
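If the bit arithmetic is hard to picture, a few lines of Perl can spell it out. This is only a sketch: the sub name mode_to_string is invented, and it deliberately ignores the leftmost (setuid/setgid/sticky) digit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a numeric mode such as 0755 into the familiar "rwxr-xr-x".
sub mode_to_string {
    my ($mode) = @_;
    my $text = '';
    foreach my $shift (6, 3, 0) {             # owner, group, other
        my $digit = ($mode >> $shift) & 7;    # one octal digit = three bits
        $text .= ($digit & 4) ? 'r' : '-';    # read    = 4
        $text .= ($digit & 2) ? 'w' : '-';    # write   = 2
        $text .= ($digit & 1) ? 'x' : '-';    # execute = 1
    }
    return $text;
}

# The leading zero makes Perl read these literals as octal.
printf "%04o = %s\n", $_, mode_to_string($_) foreach (0755, 0644, 0600);
```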
Note that we particularly don't want some sneaky hacker to insert "server-side includes" into our data. So we must at some stage go through all data strings and rip out all potential server-side includes, thus:
$datum =~ s/<!--(.|\n)*-->//g;
The general format of a server side include is:
<!--#command tag1="value1" tag2="value2" -->

for example..
<!--#exec cgi="/cgi-bin/hits.pl"-->
In other words, a 'standard' HTML comment containing a directive to include the output from an executable in the web-page. Not only do SSIs provide a security hazard, but they will also slow down an overworked server even further, because documents that are provided have to be parsed by the server to see whether it should insert an 'include'. Note that setting up SSI is also quite a business as you have to decide which directories are safe to use, and tell the server what file type is to be parsed and turned into an HTML document. Internally the server uses the MIME-type text/x-server-parsed-html to identify SSI documents (they often have the suffix .shtml).
Commands include exec, config, include, echo, fsize, and flastmod . With exec the cmd tag executes the string provided (using /bin/sh), and cgi runs a script and inserts its output, whatever it is! For more on NCSA server side includes, try.. this note
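One way to strip comments (and so any server-side include) from incoming data is sketched below. The sub name strip_comments is invented. It uses a non-greedy match so that ordinary text sitting between two comments survives, repeats until nothing more matches in case removing one comment stitches another together, and finally drops any comment that was opened but never closed. Treat it as a starting point rather than a guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove every HTML comment, and with it any server-side include.
sub strip_comments {
    my ($datum) = @_;
    # ".*?" is non-greedy; the /s flag lets "." match newlines too.
    1 while $datum =~ s/<!--.*?-->//gs;
    $datum =~ s/<!--.*//s;        # an unterminated comment goes as well
    return $datum;
}

print strip_comments('Hello<!--#exec cgi="/cgi-bin/hits.pl"--> world'), "\n";
```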
There are yet more wrinkles..
<BASE HREF="http://www.foo.com/cgi-bin/scriptname"> <ISINDEX PROMPT="Enter some simple words..">
Note that with ISINDEX queries, command line arguments may be used. (See below). One way of determining whether the command line may safely be used is to first check QUERY_STRING for an equals sign ( = ) that is NOT encoded. If such a nasty little character is present, then do NOT use the command line, as all ISINDEX queries should encode equals signs!
argv[1]

and so on. The sneaky thing is that QUERY_STRING values are also put into argv[1], argv[2] and so on, provided that the data were not submitted from a form. How is argv[1] distinguished from argv[2] and so on? Easy! Spaces (now translated to plus signs) are the delimiter. So, if your web page anchor is:
<a href="http://www.foo.com/bar/foobar.pl?alpha beta gamma?">
then the script will see argv[1] as the string "alpha", argv[2] will be "beta", and so on.
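The check for an unencoded equals sign, and the splitting on plus signs, might be sketched as follows. The sub name isindex_args is invented; it works from QUERY_STRING rather than from the real command line, which is an assumption made here to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the ISINDEX-style words in a query string, or an empty list
# when an unencoded "=" shows that the data came from a form instead.
sub isindex_args {
    my ($query) = @_;
    return () if !defined $query || $query =~ /=/;
    my @words = split /\+/, $query;           # plus signs are the delimiter
    foreach my $word (@words) {               # $word aliases the array element
        $word =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
    }
    return @words;
}

my @args = isindex_args($ENV{'QUERY_STRING'});
print "Content-type: text/plain\n\n";
print "argument: $_\n" foreach @args;
```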
These may be used to handle the case where a script 'crashes and burns'. They have extra environment variables, including:
A topic beyond the scope of this document at present!
message     =  fields *( CRLF *text )       ; Everything after
                                            ;  first null line
                                            ;  is message body

fields      =    dates                      ; Creation time,
                 source                     ;  author id & one
               1*destination                ;  address required
                *optional-field             ;  others optional

source      = [  trace ]                    ; net traversals
                 originator                 ; original mail
              [  resent ]                   ; forwarded

trace       =    return                     ; path to sender
               1*received                   ; receipt tags

return      =  "Return-path" ":" route-addr ; return address

received    =  "Received"    ":"            ; one per relay
                  ["from" domain]           ; sending host
                  ["by"   domain]           ; receiving host
                  ["via"  atom]             ; physical path
                 *("with" atom)             ; link/mail protocol
                  ["id"   msg-id]           ; receiver msg id
                  ["for"  addr-spec]        ; initial form
                   ";"    date-time         ; time received

originator  =   authentic                   ; authenticated addr
              [ "Reply-To"   ":" 1#address] )

authentic   =   "From"       ":"   mailbox  ; Single author
            / ( "Sender"     ":"   mailbox  ; Actual submittor
                "From"       ":" 1#mailbox) ; Multiple authors
                                            ;  or not sender

resent      =   resent-authentic
              [ "Resent-Reply-To"  ":" 1#address] )

resent-authentic =
            =   "Resent-From"      ":"   mailbox
            / ( "Resent-Sender"    ":"   mailbox
                "Resent-From"      ":" 1#mailbox  )

dates       =   orig-date                   ; Original
              [ resent-date ]               ; Forwarded

orig-date   =  "Date"        ":"   date-time

resent-date =  "Resent-Date" ":"   date-time

destination =  "To"          ":" 1#address  ; Primary
            /  "Resent-To"   ":" 1#address
            /  "cc"          ":" 1#address  ; Secondary
            /  "Resent-cc"   ":" 1#address
            /  "bcc"         ":"  #address  ; Blind carbon
            /  "Resent-bcc"  ":"  #address

optional-field =
            /  "Message-ID"        ":"   msg-id
            /  "Resent-Message-ID" ":"   msg-id
            /  "In-Reply-To"       ":"  *(phrase / msg-id)
            /  "References"        ":"  *(phrase / msg-id)
            /  "Keywords"          ":"  #phrase
            /  "Subject"           ":"  *text
            /  "Comments"          ":"  *text
            /  "Encrypted"         ":" 1#2word
            /  extension-field              ; To be defined
            /  user-defined-field           ; May be pre-empted

msg-id      =  "<" addr-spec ">"            ; Unique message id

extension-field =
              <Any field which is defined in a document
               published as a formal extension to this
               specification; none will have names beginning
               with the string "X-">

user-defined-field =
              <Any field which has not been defined
               in this specification or published as an
               extension to this specification; names for
               such fields must be unique and may be
               pre-empted by published extensions>
The following are snippets from RFC 1945. The extended BNF format used is similar to that of RFC 822, and other features are similar to MIME (particularly the obsolete RFC 1521). Appendix C of RFC 1945 lists how HTTP/1.0 differs from MIME.
URIs in HTTP can be represented in absolute form or relative to some known base URI [9], depending upon the context of their use. The two forms are differentiated by the fact that absolute URIs always begin with a scheme name followed by a colon.

    URI            = ( absoluteURI | relativeURI ) [ "#" fragment ]
    absoluteURI    = scheme ":" *( uchar | reserved )
    relativeURI    = net_path | abs_path | rel_path
    net_path       = "//" net_loc [ abs_path ]
    abs_path       = "/" rel_path
    rel_path       = [ path ] [ ";" params ] [ "?" query ]
    path           = fsegment *( "/" segment )
    fsegment       = 1*pchar
    segment        = *pchar
    params         = param *( ";" param )
    param          = *( pchar | "/" )
    scheme         = 1*( ALPHA | DIGIT | "+" | "-" | "." )
    net_loc        = *( pchar | ";" | "?" )
    query          = *( uchar | reserved )
    fragment       = *( uchar | reserved )
    pchar          = uchar | ":" | "@" | "&" | "=" | "+"
    uchar          = unreserved | escape
    unreserved     = ALPHA | DIGIT | safe | extra | national
    escape         = "%" HEX HEX
    reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
    extra          = "!" | "*" | "'" | "(" | ")" | ","
    safe           = "$" | "-" | "_" | "."
    unsafe         = CTL | SP | <"> | "#" | "%" | "<" | ">"
    national       =

For definitive information on URL syntax and semantics, see RFC 1738 [4] and RFC 1808 [9]. The BNF above includes national characters not allowed in valid URLs as specified by RFC 1738, since HTTP servers are not restricted in the set of unreserved characters allowed to represent the rel_path part of addresses, and HTTP proxies may receive requests for URIs not defined by RFC 1738.
Let's look at a definition of HTTP messages:
HTTP messages consist of requests from client to server and responses from server to client.

    HTTP-message   = Simple-Request            ; HTTP/0.9 messages
                   | Simple-Response
                   | Full-Request              ; HTTP/1.0 messages
                   | Full-Response

Full-Request and Full-Response use the generic message format of RFC 822 [7] for transferring entities. Both messages may include optional header fields (also known as "headers") and an entity body. The entity body is separated from the headers by a null line (i.e., a line with nothing preceding the CRLF).

    Full-Request   = Request-Line              ; Section 5.1
                     *( General-Header         ; Section 4.3
                      | Request-Header         ; Section 5.2
                      | Entity-Header )        ; Section 7.1
                     CRLF
                     [ Entity-Body ]           ; Section 7.2

    Full-Response  = Status-Line               ; Section 6.1
                     *( General-Header         ; Section 4.3
                      | Response-Header        ; Section 6.2
                      | Entity-Header )        ; Section 7.1
                     CRLF
                     [ Entity-Body ]           ; Section 7.2

Simple-Request and Simple-Response do not allow the use of any header information and are limited to a single request method (GET).

    Simple-Request  = "GET" SP Request-URI CRLF
    Simple-Response = [ Entity-Body ]

Use of the Simple-Request format is discouraged because it prevents the server from identifying the media type of the returned entity.
    Status-Code    = "200"   ; OK
                   | "201"   ; Created
                   | "202"   ; Accepted
                   | "204"   ; No Content
                   | "301"   ; Moved Permanently
                   | "302"   ; Moved Temporarily
                   | "304"   ; Not Modified
                   | "400"   ; Bad Request
                   | "401"   ; Unauthorized
                   | "403"   ; Forbidden
                   | "404"   ; Not Found
                   | "500"   ; Internal Server Error
                   | "501"   ; Not Implemented
                   | "502"   ; Bad Gateway
                   | "503"   ; Service Unavailable
There are just a few differences from the (now obsolete) RFC 1521 MIME definition:
The RFC standard for HTTP/1.1 is RFC 2616 from June 1999. We're up to ~170 pages in this standard, which differs quite a lot from 1.0. Krishnamurthy, Mogul and Kristol have reviewed the differences between v 1.0 and 1.1. In summary, these are:
1.1 cranks the above up a few notches - a cache entry is either fresh or stale. Stale entries should (normally) be re-validated when requested. A new idea is the ETag header, which is shorthand for 'entity tag'. Rather than insisting on comparison of timestamps, the new standard makes sure that if two entity tags are the same, then the associated responses must be identical. There are also extra conditional requests..
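To make the entity tag idea a little more concrete, here is a sketch of a script that answers a conditional request. Several things are assumptions made for illustration: the sub name respond, the choice of an MD5 digest of the body as the tag, and the simplification that the client sends back exactly one tag (a real If-None-Match header may carry a list, or "*"). The client's header reaches the script as the environment variable HTTP_IF_NONE_MATCH.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);    # ships with Perl

# Build a response for $body, or a bodiless 304 if the tag the client
# already holds ($if_none_match) matches the tag of the current body.
sub respond {
    my ($body, $if_none_match) = @_;
    my $etag = '"' . md5_hex($body) . '"';     # assumed tagging scheme
    if (defined $if_none_match && $if_none_match eq $etag) {
        return "Status: 304 Not Modified\nETag: $etag\n\n";
    }
    return "ETag: $etag\nContent-type: text/html\n\n$body";
}

print respond("<html><body>Hello</body></html>", $ENV{'HTTP_IF_NONE_MATCH'});
```

Same tag, same response: if the body changes, so does its digest, and the client gets the full page again.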
You should research your server - a good internet search term is simply "HTTPD" with the type of server, for example Apache (by far the commonest, over 50% of all HTTPDs). Note that the CERN/W3C and NCSA HTTPDs are no longer being developed, so if you use them and there's a bug, you may be stuck!
Check these out..
You can get all the RFCs at http://www.ietf.org. Here are a few:
Date of First Publication: 2002-4-4 | Date of Last Update: 2006/10/24