Normal Forms
There is an easy mnemonic that describes our usual goal of "third
normal form". Here it is:
Data should depend on the key,
the whole key, and nothing but the key! ... Rule #3.
What does this mean?
First Normal Form
If a relation contains no repeating groups, then it is in
first normal form. Conversely, if we find that for a single primary
key, we have two data items that we need to 'squeeze' into one
column, then the data cannot be in first normal form. Let's say
that we've decided to slap together a database containing information
about patients and the (medicinal) drugs they are taking. So we
create a table with the following columns, including the primary key
"Patient ID Number":
Patient ID Number |
Patient Name |
Drug |
Trade Name |
Formulation |
Size |
Dose |
Frequency |
Side Effect |
Let's try filling in some information..
Patient ID Number |
Patient Name |
Drug |
Trade Name |
Formulation |
Size |
Dose |
Frequency |
Side Effect |
6009315011076 |
Soap,Joe |
Carbimazole Carbamazepine |
Neomercazole Tegretol |
Tab Caps |
10mg 200mg |
30mg 200mg |
OD TDS |
Agranulocytosis nil |
6003325011074 |
Green,Anne |
Carbimazole |
Neomercazole |
Tab |
10mg |
10mg |
OD |
nil |
Woops. At the intersection of some rows and columns, we've had
to fill in two different data items. It's also obvious that there
is considerable redundancy in the table - the drug Carbimazole (with
all the associated information) is represented twice.
Fixing to First Normal Form
It should be clear that we can fix things by simply introducing
a "composite primary key" - we can group the drug and the patient
ID together as a composite primary key, so now each row has a unique
primary key, and at no intersection of row and column do we have
to put in two data items:
Patient ID Number |
Drug |
Patient Name |
Trade Name |
Formulation |
Size |
Dose |
Frequency |
Side Effect |
6009315011076 |
Carbimazole |
Soap, Joe |
Neomercazole |
Tab |
10mg |
30mg | OD |
Agranulocytosis |
6009315011076 |
Carbamazepine |
Soap, Joe |
Tegretol |
Caps |
200mg |
200mg | TDS |
nil |
6003325011074 |
Carbimazole |
Green,Anne |
Neomercazole |
Tab |
10mg |
10mg | OD |
nil |
Equally clearly, this is a clumsy fix, and really hasn't changed much.
We still have immense data redundancy. We need to go further to..
Second Normal Form
Note that, although the above table is now in 'first normal form', nothing
much has changed. Look at the table carefully, and you'll see that
there are what we call partial dependencies. In other words,
the columns
Trade Name,
Formulation, and
Size
depend only on the Drug component of the key, and not on
the other part of the key, Patient ID Number. Likewise, the
Patient Name
depends only on the Patient ID Number, and is not dependent
at all on the Drug column! To get rid of this problem, we
create two new tables, a Patient Table:
Patient
|
Patient ID Number |
Patient Name |
6009315011076 |
Soap, Joe |
6003325011074 |
Green,Anne |
and a Drug Table:
Drug
|
Drug |
Trade Name |
Formulation |
Size |
Carbimazole |
Neomercazole |
Tab |
10mg |
Carbamazepine |
Tegretol |
Caps |
200mg |
Finally, we relate the two in a diminished form of our original table, thus:
PatientRx
|
Patient ID Number |
Drug |
Dose |
Frequency |
Side Effect |
6009315011076 |
Carbimazole |
30mg | OD |
Agranulocytosis |
6009315011076 |
Carbamazepine |
200mg | TDS |
nil |
6003325011074 |
Carbimazole |
10mg | OD |
nil |
Third Normal Form
.. but still we're not finished. Look at the last table carefully.
There's an intimate relationship between the the dose and the frequency of
giving a drug. Here's where knowledge of your subject comes in. Many people would
be happy to stop at this point. There is no absolutely fixed rule
that says you have to give your carbamazepine at a dose of 200mg TDS - in
fact, if you're religiously doing so, you're probably mistreating a
lot of your epileptics. But let's say the whole point of our drug
survey is to relate different drug regimens and associated side effect!
In this context, we have a strong association between the dose and
frequency of a drug. We call such a relationship between two non-key
attributes a transitive dependency.
Third normal form is about
removing transitive dependencies. How could we do this?
Well, we could label each dose + frequency as a "regimen",
and proceed to break up the PatientRx table into two, thus:
PatientRx
|
Patient ID Number |
Drug |
Regimen |
Side Effect |
6009315011076 |
Carbimazole |
R1 |
Agranulocytosis |
6009315011076 |
Carbamazepine |
R3 |
nil |
6003325011074 |
Carbimazole |
R2 |
nil |
with a new DrugRegimen table:
DrugRegimen
|
Regimen |
Dose |
Frequency |
R1 |
30mg | OD |
R2 |
10mg | OD |
R3 |
200mg | TDS |
Common Sense Intervenes?
A moment's consideration will suggest that we could have done things
several different ways!
- Firstly, what was to prevent us from lumping the drug, dose and
frequency together in the DrugRegimen table? In many ways, this
makes more sense. One's never going to give carbamazepine in a dose
of 10mg daily, or (heaven forbid!) carbimazole 200mg TDS, so it's silly
to have these sort of redundant options enshrined within our database!
You can see the importance of knowing your subject, and not just blindly
putting tables together. If we move drug name into the DrugRegimen
table..
PatientRx
|
Patient ID Number |
Regimen |
Side Effect |
6009315011076 |
R1 |
Agranulocytosis |
6009315011076 |
R3 |
nil |
6003325011074 |
R2 |
nil |
with a new DrugRegimen table:
DrugRegimen
|
Regimen |
Drug |
Dose |
Frequency |
R1 |
Carbimazole |
30mg | OD |
R2 |
Carbimazole |
10mg | OD |
R3 |
Carbamazepine |
200mg | TDS |
then we can immediately see that:
- The "Drug" column disappears from the PatientRx table;
- "Regimen" now becomes part of the composite primary key
within the same table;
- In our first incarnation of PatientRx, where we had
"Drug", "Dose" and "Frequency" as columns, it's probably more
correct to regard "Dose" and "Frequency" as being dependent
on "Drug" - a partial dependency that violates second
normal form, rather than stressing their transitive dependency on one another
(violating third normal form)! Nevertheless, one could still make
a fairly strong case (based on their transitive dependency) for further dividing the DrugRegimen
table into two tables, for example:
DrugRegimen
|
Regimen |
Drug |
Dosing |
R1 |
Carbimazole |
D1 |
R2 |
Carbimazole |
D2 |
R3 |
Carbamazepine |
D3 |
|
DrugDosing
|
Dosing |
Dose |
Frequency |
D1 |
30mg | OD |
D2 |
10mg | OD |
D3 |
200mg | TDS |
|
Personally, I wouldn't bother to do this last step, unless I was dealing
with a whole lot of drugs that had similar dosing schedules. You can
argue it both ways! This shows how, although there are a lot of hard
rules in database design, a lot of what you do depends on an in-depth
appreciation of the subjects you are dealing with, and what you want
to get out of the database!
- Secondly, one could argue that in the PatientRx table,
"Side Effect depends only on Regimen, so the table violates
second normal form". This is very much a personal opinion, but my
contention would be that such an assertion is completely false, as
there is an intimate relationship between the patient (and his/her
unique metabolism), and the likelihood of a side effect occurring.
Tricky, isn't it?
- Thirdly, one should note that there is likely to be more
than one side effect, in certain cases. So, just when we thought
that we were home and dry, we find that our PatientRx table isn't
even in first normal form, all because we left an 's' off the
title of this column, and didn't think hard enough!
A Summary of Normalisation
We can see the usefulness of our mnemonic
Rule #3 - A datum should depend
on a key, individual data items should depend on the whole key
(not just part of the key, which would be a partial dependency), and
nothing but the key (to avoid transitive dependencies).
What are the merits of such extensive normalisation (besides giving
us a lot of little tables??). The merit of first normal form should
be intuitively obvious, but what about second and third? Think about
our example of first normal form. We could not insert a new drug
into our database (say aspirin) until at least one of our patients
was on aspirin. This insertion anomaly, and a host of other
consequences besides the
obvious redundancy, make second normal form a good idea. Likewise
(after we've accepted that defining a drug regimen is a good idea,
and decided that a regimen constitutes a particular dose and frequency),
we cannot add a new drug regimen until we've normalised
our tables to third normal form. Note that a lot depends on how
we are actually using the data!
There are other 'higher' forms of normalisation, for even third
normal form doesn't guarantee that all 'anomalies' have been removed.
Generally, we don't often go beyond 3NF (third normal form), but here
are a few 'higher' normal forms, just to whet your appetite!
- 'Boyce-Codd Normal form' - When there are alternate keys (there
is more than one candidate key), a table may be in 3NF and anomalies
may still result. This occurs when there is a composite primary key,
and there are two equally valid candidates to make up part of this
composite primary key. If there is an attribute (one or more columns)
on which any other attribute is fully dependent, and this
attribute is NOT itself a candidate key, then the table is not in
Boyce-Codd Normal form (BCNF). We fix this by breaking the table up into
two tables, both in BCNF.
(There is a special name for an attribute
on which some other attribute is fully dependent - we call it a
determinant.)
- 'Fourth Normal Form' is even more tricky. Here the central
concept is one of multivalued dependency - there are
three attributes in a relation (let's call them A, B and C), and
although for each A there is a well-defined set of values for
both B and C, nevertheless values in B and C are independent.
- We won't even talk about "Fifth normal form" (join dependencies),
and "Domain-Key normal form".
Date of First Draft: 2001/11/11 |
Date of Last Update: 2001/-/- |
Web page author:
jo@anaesthetist.com
|