Part 1 of Database Internals

Database Normalization: 1NF through 5NF Explained

16 min read

Database normalization is the process of organizing tables to reduce redundancy and prevent data anomalies. Each “normal form” is a progressively stricter rule about how data should be structured. For each form, we’ll look at a table that violates it, explain exactly why, then show the corrected design with real data.


First Normal Form (1NF)

The Rule: Every column holds atomic (indivisible) values. No repeating groups. Every row is uniquely identifiable by a primary key. Column data types are consistent.

The Violating Table — Insured Customer Family Data:

(Imagine the columns repeating: Child_1, Age_1, Child_2, Age_2, ... — the duplicate column names below render as a single header in markdown but represent the unbounded repeating-group violation.)

Customer CodeCustomer SurnameChildAgeChildAge ···
DENT04DentonBilly3Olivia10
PARK31ParkJustin7
PARK31ParkPhoebe5
KELL17KellerAlice4IvyNewborn ···

This table violates 1NF in four separate ways:

  1. Row order is meaningful — the header says “from most-recently-entered (top) to least-recently-entered (bottom)”. Data meaning should never depend on row order.
  2. Inconsistent data types — the Age column holds integers in some rows (“10”) and strings in others (“Newborn”). A column must have one consistent type.
  3. No primary keyCustomer Code is not unique: PARK31 appears twice. Uniqueness is not being enforced.
  4. Repeating groupsChild and Age columns repeat indefinitely across the row. The number of children is unbounded, so there’s no way to design a fixed schema.

The Fix — Two tables, both in 1NF (and well-normalized for this schema — we’ll see what 5NF actually requires further down):

Customer_CodeCustomer_Surname
DENT04Denton
PARK31Park
KELL17Keller
Customer_CodeChild_Seq_NoChild_NameChild_AgeDate_Recorded
PARK311Phoebe52026-01-27
PARK312Justin72026-01-28
DENT041Billy32026-02-01
DENT042Olivia102026-02-01
KELL171Ivy02026-01-19
KELL172Alice42026-01-19
KELL173Jonah72026-01-19
KELL174Billy92026-01-19
  • Primary key of Customer = Customer_Code
  • Primary key of Customer_Dependent_Child = { Customer_Code, Child_Seq_No }
  • Customer_Code on the child table is a foreign key to the Customer table
  • Date_Recorded replaces row ordering as the way to track recency
  • In a real database, Child_Age would be replaced with Child_Birth_Date — age changes, birth dates don’t

A Quick Note on Functional Dependencies

Before 2NF, here’s the notation we’ll use throughout. A functional dependency X → Y means: “given X, you can determine Y uniquely.” If you know an Employee_Code, you know exactly one Employee_Surname. So Employee_Code → Employee_Surname.

The normal forms are progressively stricter rules about which functional dependencies are allowed in a single relation. The mnemonic for 1NF/2NF/3NF is:

Every non-key column depends on the key, the whole key, and nothing but the key.

  • 1NF — every value is atomic; the table has a key.
  • 2NF — non-key columns depend on the whole key (no partial dependencies on part of a composite key).
  • 3NF — non-key columns depend on nothing but the key (no transitive dependencies through other non-key columns).

BCNF, 4NF, and 5NF generalize this further to handle edge cases. We’ll get to those.


Second Normal Form (2NF)

The Rule: Must be in 1NF. Every non-key column must depend on the entire primary key — not just a part of it. This only applies when the primary key is composite (made up of more than one column).

The Violating Table — Employee Working Days:

Employee_CodeWorking_DayEmployee_Surname
WREN02TUESDAYWren
WREN02WEDNESDAYWren
WREN02THURSDAYWren
MILL01MONDAYMiller
MILL01TUESDAYMiller

The primary key is { Employee_Code, Working_Day }. But Employee_Surname only depends on Employee_Code — it tells us about the employee, not about which day they work. This is a partial dependency: a non-key column depends on only part of the composite key.

The result is obvious in the data: “Wren” is repeated three times and “Miller” twice. If the employee changes their surname, every row must be updated — miss one and the data becomes inconsistent.

The Fix:

Employee_CodeEmployee_Surname
WREN02Wren
MILL01Miller
Employee_CodeWorking_Day
WREN02TUESDAY
WREN02WEDNESDAY
WREN02THURSDAY
MILL01MONDAY
MILL01TUESDAY
  • Primary key of Employee = Employee_Code
  • Primary key of Employee_Working_Days = { Employee_Code, Working_Day }
  • Employee_Code on Employee_Working_Days is a foreign key to Employee
  • Surname is stored once — update it in one place, it’s consistent everywhere

Third Normal Form (3NF)

The Rule: Must be in 2NF. No non-key column should depend on another non-key column. Every non-key attribute must depend on the key, the whole key, and nothing but the key.

The Violating Table — Product:

Product_IDProduct_NameProduct_SubcategoryProduct_Category
103400-Piece Autumn Foliage JigsawJIGSAWSPUZZLES
104250-Piece Magic Bookshop JigsawJIGSAWSPUZZLES
105Standard Rubik’s CubeMECHANICAL-PUZZLESPUZZLES
106Kendrick KoalaCUDDLY-TOYSTOYS
107Adventurer Walkie Talkie SetELECTRONIC-TOYSTOYS

The primary key is Product_ID. Product_Category depends on Product_ID only transitively — through Product_Subcategory. The real determinant is Product_Subcategory, and Product_Subcategory is not a candidate key of this table. This is a transitive dependency: Product_ID → Product_Subcategory → Product_Category.

The redundancy is visible: “PUZZLES” appears three times and “TOYS” twice. If we rename the “PUZZLES” category, we have to update multiple rows. Miss one, the data becomes contradictory.

The Fix:

Product_IDProduct_NameProduct_Subcategory
103400-Piece Autumn Foliage JigsawJIGSAWS
104250-Piece Magic Bookshop JigsawJIGSAWS
105Standard Rubik’s CubeMECHANICAL-PUZZLES
106Kendrick KoalaCUDDLY-TOYS
107Adventurer Walkie Talkie SetELECTRONIC-TOYS
Product_SubcategoryProduct_Category
JIGSAWSPUZZLES
MECHANICAL-PUZZLESPUZZLES
CUDDLY-TOYSTOYS
ELECTRONIC-TOYSTOYS
  • Primary key of Product = Product_ID
  • Primary key of Product_Subcategory = Product_Subcategory
  • Product_Subcategory on Product is a foreign key to Product_Subcategory table
  • Renaming “PUZZLES” to “BRAIN-GAMES” now requires updating exactly one row

3NF is the practical target for most production databases. It eliminates the most common redundancy and anomaly problems without over-fragmenting the schema.


Boyce-Codd Normal Form (BCNF)

The Rule: Must be in 3NF. For every functional dependency X → Y, X must be a candidate key. Every determinant must be a superkey.

BCNF is a stricter variant of 3NF that catches edge cases involving tables with multiple overlapping candidate keys.

The Violating Table — Course Scheduling:

TeacherSubjectRoom
Dr. LeeDatabasesRoom 101
Dr. LeeAlgorithmsRoom 202
Dr. KimDatabasesRoom 303

Business rules:

  • Each teacher-room pairing has exactly one subject (so {Teacher, Room} → Subject).
  • Each room hosts only one subject (so Room → Subject).

Candidate keys: {Teacher, Room}. Note that {Teacher, Subject} is not a candidate key — multiple rooms can host the same teacher/subject pair.

But Room → Subject is a valid functional dependency — and Room alone is not a candidate key. This violates BCNF. The consequence: if Room 101 is reassigned to “Algorithms”, you might update one row but leave the teacher’s schedule inconsistent.

The Fix:

RoomSubject
Room 101Databases
Room 202Algorithms
Room 303Databases
TeacherRoom
Dr. LeeRoom 101
Dr. LeeRoom 202
Dr. KimRoom 303

Every determinant is now a candidate key. The Room → Subject dependency lives in its own table where Room is the primary key.

BCNF decomposition is not always dependency-preserving — splitting Room → Subject into its own table loses the ability to enforce {Teacher, Room} → Subject with a simple foreign key. This is the classic BCNF tradeoff: stronger normalization, weaker constraint enforcement.


Fourth Normal Form (4NF)

The Rule: Must be in BCNF. For every non-trivial multi-valued dependency X →→ Y, X must be a superkey of the relation.

A multi-valued dependency exists when one attribute independently determines a set of values for two other attributes in the same table — and those two sets are unrelated to each other.

The Violating Table — Course Assignments:

CourseTextbookTeacher
DatabasesSQL MasteryDr. Lee
DatabasesSQL MasteryDr. Kim
DatabasesDB InternalsDr. Lee
DatabasesDB InternalsDr. Kim

A course has multiple textbooks AND multiple teachers — independently. Any teacher can use any textbook. The table is forced to store every possible (Textbook, Teacher) combination. Adding a new teacher requires adding one row per textbook. Adding a textbook requires adding one row per teacher. Miss one and the data is inconsistent.

The Fix:

CourseTextbook
DatabasesSQL Mastery
DatabasesDB Internals
CourseTeacher
DatabasesDr. Lee
DatabasesDr. Kim

Each independent multi-valued fact lives in its own table. Adding Dr. Park as a teacher is now one row — not one row per textbook.


Fifth Normal Form (5NF)

5NF is rarely the deciding factor in practical schema design — most production databases stop at 3NF or BCNF. We cover it here because the underlying idea (avoiding fabricated tuples on rejoin) is genuinely useful even when you’re not formally proving 5NF compliance.

The Rule: Must be in 4NF. The table cannot be decomposed further without losing information. Eliminates join dependencies that are not implied by candidate keys.

5NF applies when the only lossless way to decompose a table is into three or more smaller tables. Any two-table split introduces spurious (incorrect) rows when rejoined.

The Violating Table — Supplier / Product / Project:

SupplierProductProject
TechCorpLaptopApollo
TechCorpLaptopHermes
TechCorpMonitorApollo
GadgetCoKeyboardApollo

The constraint that makes this 5NF-relevant: the table is only valid if the following business rule holds — if Supplier S sells Product P, AND Product P is used in Project J, AND Supplier S is engaged on Project J, then (S, P, J) must appear in the table. In other words, the relation is the join of three binary relations: SuppliersProducts, ProductsProjects, and SuppliersProjects. Without that constraint, the table is just a normal ternary relationship and no decomposition would lose information.

The rule: a (Supplier, Product, Project) row is only valid if all three pairwise relationships independently hold — the supplier sells that product, the product is used in that project, and the supplier is engaged on that project. No two-way split can enforce this correctly without creating spurious combinations.

The Fix — three pairwise tables:

SupplierProduct
TechCorpLaptop
TechCorpMonitor
GadgetCoKeyboard
ProductProject
LaptopApollo
LaptopHermes
MonitorApollo
KeyboardApollo
SupplierProject
TechCorpApollo
TechCorpHermes
GadgetCoApollo

The original table is reconstructed by joining all three. Any pair-wise decomposition would introduce rows that were never true.

Why three tables, not two? If we decomposed into just Supplier_Product (S, P) and Product_Project (P, J), rejoining on P would manufacture rows that never appeared in the original table. For example, if Supplier A sells the Keyboard and Project Apollo uses Keyboards, the join produces (A, Keyboard, Apollo) — even if Supplier A is not actually engaged on Apollo. The third table Supplier_Project (S, J) is exactly the constraint that prevents this fabrication: only (S, P, J) triples that survive joining all three tables are valid.


The Progression at a Glance

3NF / BCNF is the right target for most production systems. Go further only when you have genuinely independent multi-valued facts (4NF) or complex three-way join dependencies (5NF).


When to Denormalize

Normalization is the right starting point — not always the ending point. Intentional redundancy is valid when:

  • Read performance is critical — heavily joined queries across many normalized tables are expensive. A materialized view or denormalized reporting table can dramatically speed up reads.
  • You’re building a data warehouse — OLAP workloads favor wide flat tables (star/snowflake schemas) over normalized OLTP schemas.
  • The data rarely changes — if a field updates once a year, the update anomaly risk is low.
  • A query is too hot to afford joins — on high-throughput endpoints, a cache field with an explicit refresh strategy can be worth the complexity.

Normalize first, denormalize deliberately. Start clean, measure where bottlenecks actually are, and introduce redundancy with intention — not by accident.


Pitfalls and Common Mistakes

The rules above are clear, but applying them in practice is where teams get into trouble. The recurring patterns worth watching for:

  • Confusing 3NF with BCNF. They sound interchangeable but aren’t. A relation can be in 3NF and still violate BCNF if a non-key attribute determines another candidate-key attribute. Check every functional dependency — not just the ones from the primary key.
  • Treating 4NF and 5NF as required. They apply only when independent multi-valued facts (4NF) or three-way join dependencies (5NF) actually exist in your data. Most production schemas don’t hit either situation. Don’t decompose chasing a textbook target you don’t have.
  • Over-normalizing read-mostly schemas. Splitting tables to satisfy 3NF only pays off if updates happen. A table where rows are written once and never modified rarely benefits from being split into three with foreign keys — you’re paying join cost for no anomaly-prevention upside.
  • Mistaking “atomic” for “small.” 1NF demands no repeating groups, not “no complex values.” A JSONB document, a PostGIS geometry, or an array column is 1NF-compliant if treated as a single indivisible value. Don’t split data that the application always reads and writes as a unit.
  • Dropping foreign keys “for performance.” Foreign keys cost very little on modern Postgres and are the only mechanism that guarantees the referential integrity normalization assumes. Disabling them turns “guaranteed valid” into “valid as long as no bug ever ships.”
  • Storing derived values that violate 3NF. A row holding unit_price, quantity, and total_price creates a transitive dependency (total_price is derivable from the other two). Either compute it on read or accept the redundancy with an explicit refresh strategy and write the trigger / view that keeps it consistent.
  • Adding nullable “extension” columns instead of normalizing. Adding secondary_email, tertiary_email, home_phone, work_phone to a users table — most of them null for most users — is a repeating group disguised as separate columns. The right move is a user_contact_methods child table.
  • Premature normalization before knowing the access patterns. Splitting a domain into eight tables before you know how the application queries it produces a schema optimized for nothing. Sketch the read paths first; let them inform the table boundaries.
  • Forgetting that BCNF decomposition isn’t always dependency-preserving. Decomposing to satisfy BCNF can scatter a multi-attribute functional dependency across two tables in a way that makes it un-enforceable with a simple FK. Decide explicitly whether you’d rather lose the BCNF guarantee or the dependency-preservation guarantee — there’s no fully correct answer.
  • Using surrogate keys to “skip” normalization. Adding an id BIGSERIAL to a table doesn’t make it normalized — it just gives the existing duplicates a new column. Normalize against the real candidate keys; add surrogate IDs as an implementation detail for foreign-key ergonomics.
  • Ignoring functional dependencies in legacy schemas. When refactoring a denormalized table, list every X → Y you can identify in the data before designing the new shape. The dependencies tell you where the tables want to split — guessing produces a schema that fights you at the next migration.