Database Normalization: 1NF through 5NF Explained

Database normalization is the process of organizing tables to reduce redundancy and prevent data anomalies. Each “normal form” is a progressively stricter rule about how data should be structured. We’ll work through every form using a single running example — a university registration system with students, courses, instructors, and departments — so the entities carry their meaning from 1NF all the way to 5NF.

First Normal Form (1NF)

The Rule: Every column holds atomic (indivisible) values. No repeating groups. Every row is uniquely identifiable. Column data types are consistent.

The Violating Table — Student Course Records:

(The Course and Grade columns repeat across the row: Course_1, Grade_1, Course_2, Grade_2, and so on. The header below shows the first two pairs and stands for an unbounded repeating group.)

Student_ID	Student_Name	Course	Grade	Course	Grade ···
24680	Denton	CS101	Pending	MATH200	Withdrew
67890	Wren	CS101	A
12345	Park	CS101	A	MATH200	B
12345	Park	CS210	A

This table violates 1NF in four separate ways:

Row order is meaningful: the rows are meant to be read from most recently entered (top) to least recently entered (bottom), so recency is encoded only by position. Data meaning should never depend on row order.
Inconsistent data types — the Grade column holds letter grades (“A”, “B”) in some rows and status strings (“Pending”, “Withdrew”) in others. A column must have one consistent type.
No primary key — Student_ID is not unique: 12345 appears twice. Uniqueness is not being enforced.
Repeating groups — Course and Grade columns repeat indefinitely across the row. The number of courses a student takes is unbounded, so no fixed schema works.

The Fix — Two tables, both in 1NF:

Student_ID	Student_Name
12345	Park
67890	Wren
24680	Denton

Student_ID	Course_ID	Grade	Status	Enrollment_Date
12345	CS101	A	Completed	2025-09-02
12345	MATH200	B	Completed	2025-09-02
12345	CS210	A	Completed	2026-01-15
67890	CS101	A	Completed	2025-09-02
24680	MATH200	NULL	Withdrew	2025-09-02
24680	CS101	NULL	Pending	2026-01-15

Primary key of Student = Student_ID
Primary key of Enrollment = { Student_ID, Course_ID }
Student_ID on Enrollment is a foreign key to Student
Enrollment_Date replaces row ordering as the way to track recency
The mixed-type problem is resolved: Grade is a letter grade or NULL; lifecycle information lives in Status
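As a rough sketch, the two tables above could be declared as follows. SQLite (through Python's built-in sqlite3 module) is used only because it runs anywhere; the lowercase table names and the A to F grade set are assumptions for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE student (
    student_id   INTEGER PRIMARY KEY,
    student_name TEXT NOT NULL
);
CREATE TABLE enrollment (
    student_id      INTEGER NOT NULL REFERENCES student(student_id),
    course_id       TEXT    NOT NULL,
    -- Assumed grade set; a NULL grade passes the CHECK.
    grade           TEXT    CHECK (grade IN ('A', 'B', 'C', 'D', 'F')),
    status          TEXT    NOT NULL
                    CHECK (status IN ('Completed', 'Pending', 'Withdrew')),
    enrollment_date TEXT    NOT NULL,  -- ISO-8601 text sorts chronologically
    PRIMARY KEY (student_id, course_id)
);
""")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(12345, "Park"), (67890, "Wren"), (24680, "Denton")])
INSERT_ENROLLMENT = "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)"
conn.executemany(INSERT_ENROLLMENT, [
    (12345, "CS101", "A", "Completed", "2025-09-02"),
    (12345, "MATH200", "B", "Completed", "2025-09-02"),
    (12345, "CS210", "A", "Completed", "2026-01-15"),
    (67890, "CS101", "A", "Completed", "2025-09-02"),
    (24680, "MATH200", None, "Withdrew", "2025-09-02"),
    (24680, "CS101", None, "Pending", "2026-01-15"),
])

def rejected(params):
    """Return True if the database refuses the enrollment row."""
    try:
        conn.execute(INSERT_ENROLLMENT, params)
    except sqlite3.IntegrityError:
        return True
    return False

# Each of the original 1NF problems is now caught by a constraint.
duplicate_rejected = rejected((12345, "CS101", "B", "Completed", "2026-01-15"))
mixed_type_rejected = rejected((67890, "MATH200", "Pending", "Pending", "2026-01-15"))
orphan_rejected = rejected((99999, "CS101", None, "Pending", "2026-01-15"))

# Recency comes from a column, not from row position.
latest = conn.execute(
    "SELECT course_id FROM enrollment WHERE student_id = 12345 "
    "ORDER BY enrollment_date DESC LIMIT 1"
).fetchone()[0]
print(duplicate_rejected, mixed_type_rejected, orphan_rejected, latest)
```

The three rejected inserts correspond to the duplicate key, the status string in the Grade column, and an enrollment for a student who does not exist.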

Key Terminology: Keys and Functional Dependencies

Before 2NF and beyond we need precise language for what’s allowed in a table. Most of the difficulty people have with the higher normal forms is fuzziness on these definitions.

Functional dependency (X → Y): “Given X, you can determine Y uniquely.” If you know a Student_ID, you know exactly one Student_Name. So Student_ID → Student_Name. X is called the determinant, Y the dependent.

Superkey: Any set of columns whose values uniquely identify each row. {Student_ID} is a superkey of Student. So is {Student_ID, Student_Name} — adding extra columns to a unique set leaves it unique. Superkeys are not required to be minimal.

Candidate key: A minimal superkey — remove any column from it and it stops being unique. A table can have several candidate keys. In Enrollment, {Student_ID, Course_ID} is the candidate key (neither column alone identifies a row).

Primary key: The candidate key chosen as the canonical identifier. The other candidate keys still exist; they’re just not “the” key.

Prime attribute: A column that is part of some candidate key.

Non-prime attribute: A column that is part of no candidate key — a pure data column.

The mnemonic for 1NF/2NF/3NF — “every non-key column depends on the key, the whole key, and nothing but the key” — maps to these terms directly:

1NF — every value is atomic; the table has a key.
2NF — non-prime columns depend on the whole candidate key (no partial dependency on part of a composite key).
3NF — non-prime columns depend on nothing but candidate keys (no transitive dependency through other non-prime columns).

BCNF, 4NF, and 5NF tighten these rules further to handle subtler dependency shapes.
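The definitions above can be tried out against sample data with a few lines of Python. This is a minimal sketch: the helper names (fd_holds, is_superkey, candidate_keys) are hypothetical, and a data sample can only refute a dependency or a key, never prove it. Real keys come from business rules, not from whatever rows happen to be stored today.

```python
from itertools import combinations

# Sample rows copied from the Enrollment table above (Status and date omitted).
COLUMNS = ("Student_ID", "Course_ID", "Grade")
ENROLLMENT = [
    dict(zip(COLUMNS, values)) for values in [
        (12345, "CS101", "A"),
        (12345, "MATH200", "B"),
        (12345, "CS210", "A"),
        (67890, "CS101", "A"),
        (24680, "MATH200", None),
        (24680, "CS101", None),
    ]
]

def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on lhs but differ on rhs (lhs -> rhs)."""
    seen = {}
    for row in rows:
        x = tuple(row[c] for c in lhs)
        y = tuple(row[c] for c in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True

def is_superkey(rows, cols):
    """True if the values of cols are distinct across all rows."""
    keys = [tuple(row[c] for c in cols) for row in rows]
    return len(set(keys)) == len(keys)

def candidate_keys(rows, columns):
    """Minimal superkeys of the sample, found smallest first."""
    found = []
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            if any(set(key) <= set(cols) for key in found):
                continue  # contains a smaller key, so it is not minimal
            if is_superkey(rows, cols):
                found.append(cols)
    return found

print(candidate_keys(ENROLLMENT, COLUMNS))
print(fd_holds(ENROLLMENT, ["Student_ID", "Course_ID"], ["Grade"]))
print(fd_holds(ENROLLMENT, ["Student_ID"], ["Grade"]))
```

On this sample the only candidate key found is the pair (Student_ID, Course_ID), and Grade is determined by the pair but not by Student_ID alone.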

Second Normal Form (2NF)

The Rule: Must be in 1NF. Every non-prime column must depend on the entire candidate key, not just part of it. A violation is only possible when a candidate key is composite; if every candidate key is a single column, a 1NF table is automatically in 2NF.

The Violating Table — Enrollment with Student Name embedded:

Suppose a developer added Student_Name directly to the enrollment table “to avoid joining on every read”:

Student_ID	Course_ID	Grade	Student_Name
12345	CS101	A	Park
12345	CS210	A	Park
12345	MATH200	B	Park
67890	CS101	A	Wren
24680	CS101	NULL	Denton

The candidate key is {Student_ID, Course_ID}. Check each non-prime column:

Grade depends on the whole key — a student gets a different grade in each course. ✓
Student_Name depends only on Student_ID — the course is irrelevant. ✗

Student_Name is a partial dependency on the composite key: a non-prime column determined by part of the key, not all of it.

The cost shows up in the data: “Park” appears three times. If the student changes their surname, every enrollment row must update — miss one and the data becomes inconsistent.
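The update anomaly is easy to reproduce. The sketch below loads the violating table into an in-memory SQLite database, applies a rename that touches only one of Park's three rows, and then looks for students with more than one name. The table name enrollment_wide and the new surname "Park-Lee" are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollment_wide (
        student_id   INTEGER NOT NULL,
        course_id    TEXT    NOT NULL,
        grade        TEXT,
        student_name TEXT    NOT NULL,
        PRIMARY KEY (student_id, course_id)
    )
""")
conn.executemany("INSERT INTO enrollment_wide VALUES (?, ?, ?, ?)", [
    (12345, "CS101", "A", "Park"),
    (12345, "CS210", "A", "Park"),
    (12345, "MATH200", "B", "Park"),
    (67890, "CS101", "A", "Wren"),
    (24680, "CS101", None, "Denton"),
])

# A buggy rename: the WHERE clause matches one row instead of all three.
conn.execute(
    "UPDATE enrollment_wide SET student_name = 'Park-Lee' "
    "WHERE student_id = 12345 AND course_id = 'CS101'"
)

# Student_ID -> Student_Name is supposed to hold. Any student with more
# than one distinct name means the redundant copies have drifted apart.
conflicts = conn.execute("""
    SELECT student_id, COUNT(DISTINCT student_name)
    FROM enrollment_wide
    GROUP BY student_id
    HAVING COUNT(DISTINCT student_name) > 1
""").fetchall()
print(conflicts)  # [(12345, 2)]
```

Nothing in the schema stopped the partial update; the primary key only guards the (student, course) pair, not the name.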

The Fix — keep Student_Name where it really lives:

Student_ID	Student_Name
12345	Park
67890	Wren
24680	Denton

Student_ID	Course_ID	Grade
12345	CS101	A
12345	CS210	A
12345	MATH200	B
67890	CS101	A
24680	CS101	NULL

Student_Name lives once. Enrollment joins back to Student when names are needed.

Third Normal Form (3NF)

The Rule: Must be in 2NF. No non-prime column should depend on another non-prime column. Stated formally: for every non-trivial functional dependency X → A, either X is a superkey, or A is a prime attribute.

In plain terms: each non-prime column must be determined by a candidate key — never by another non-prime column.

The Violating Table — Course Catalog:

Course_ID	Course_Name	Department	Department_Building
CS101	Intro to Programming	Computer Science	Turing Hall
CS210	Data Structures	Computer Science	Turing Hall
MATH200	Linear Algebra	Mathematics	Euler Hall
MATH210	Real Analysis	Mathematics	Euler Hall
PHYS101	Mechanics	Physics	Faraday Hall

The candidate key is Course_ID. The dependencies that hold:

Course_ID → Course_Name — direct ✓
Course_ID → Department — direct ✓
Department → Department_Building — each department occupies one building (this is the problem)
Course_ID → Department_Building — holds, but only transitively through Department

Department and Department_Building are both non-prime. The chain non-prime → non-prime is a transitive dependency, and 3NF forbids it. (Formal check: in the FD Department → Department_Building, Department isn’t a superkey and Department_Building isn’t prime — both 3NF escape clauses fail.)

The redundancy is visible: “Computer Science / Turing Hall” appears twice; “Mathematics / Euler Hall” appears twice. If Mathematics relocates to Gauss Hall, every row must update — miss one and the data contradicts itself.

The Fix — Split out the transitive dependency:

Course_ID	Course_Name	Department
CS101	Intro to Programming	Computer Science
CS210	Data Structures	Computer Science
MATH200	Linear Algebra	Mathematics
MATH210	Real Analysis	Mathematics
PHYS101	Mechanics	Physics

Department	Department_Building
Computer Science	Turing Hall
Mathematics	Euler Hall
Physics	Faraday Hall

Primary key of Course = Course_ID
Primary key of Department = Department
Course.Department is a foreign key to Department
Relocating a department to a different building now requires updating exactly one row
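A minimal sketch of the decomposed schema, again using in-memory SQLite for convenience (the lowercase table and column names are assumptions). It moves Mathematics to Gauss Hall with a single-row update and then confirms that every Mathematics course sees the new building through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE department (
    department          TEXT PRIMARY KEY,
    department_building TEXT NOT NULL
);
CREATE TABLE course (
    course_id   TEXT PRIMARY KEY,
    course_name TEXT NOT NULL,
    department  TEXT NOT NULL REFERENCES department(department)
);
INSERT INTO department VALUES
    ('Computer Science', 'Turing Hall'),
    ('Mathematics', 'Euler Hall'),
    ('Physics', 'Faraday Hall');
INSERT INTO course VALUES
    ('CS101', 'Intro to Programming', 'Computer Science'),
    ('CS210', 'Data Structures', 'Computer Science'),
    ('MATH200', 'Linear Algebra', 'Mathematics'),
    ('MATH210', 'Real Analysis', 'Mathematics'),
    ('PHYS101', 'Mechanics', 'Physics');
""")

# The building is stored once per department, so the move is one row.
cursor = conn.execute(
    "UPDATE department SET department_building = 'Gauss Hall' "
    "WHERE department = 'Mathematics'"
)
rows_touched = cursor.rowcount

math_buildings = conn.execute("""
    SELECT c.course_id, d.department_building
    FROM course AS c
    JOIN department AS d ON d.department = c.department
    WHERE c.department = 'Mathematics'
    ORDER BY c.course_id
""").fetchall()
print(rows_touched, math_buildings)
```

Both Mathematics courses report Gauss Hall after the update, and there is no second copy of the building name that could be missed.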

3NF is the practical target for most production databases. It eliminates the common redundancy and anomaly problems without over-fragmenting the schema.

Boyce-Codd Normal Form (BCNF)

The Rule: Must be in 3NF. For every non-trivial functional dependency X → Y, X must be a superkey of the relation.

BCNF closes a loophole in 3NF. 3NF allows X → A when A is prime (part of some candidate key), even if X itself isn’t a superkey. BCNF disallows that — every determinant must be a superkey, no exceptions.

This matters only when a table has multiple overlapping candidate keys — that’s where the loophole opens.

The Violating Table — Advisor Assignments:

Each student is assigned an advisor for each course they take. Two business rules govern the table:

Each (student, course) pair has exactly one advisor. So {Student_ID, Course_ID} → Advisor.
Each advisor specializes in exactly one course. So Advisor → Course_ID. (Multiple advisors can share a course; each advisor handles only one.)

Student_ID	Course_ID	Advisor
12345	CS101	Dr. Lee
12345	MATH200	Dr. Kim
67890	CS101	Dr. Lee
24680	CS101	Dr. Park

Candidate-key analysis (worth walking through, since this is where BCNF earns its keep):

{Student_ID, Course_ID} is unique per row (rule 1 forces one advisor per pair, so the pair itself can’t repeat) → candidate key.
{Student_ID, Advisor} is also unique: given a student and an advisor, the course is determined by rule 2 (Advisor → Course_ID), so the row is determined → candidate key.
Course_ID alone is not unique (CS101 appears three times).
Advisor alone is not unique (Dr. Lee appears twice).

So Student_ID, Course_ID, and Advisor are all prime attributes (each is part of some candidate key). There are no non-prime attributes at all — which means the table is trivially in 3NF: there’s no “non-prime depending on non-prime” to violate.

But the table is not in BCNF. The dependency Advisor → Course_ID has Advisor on the left, and Advisor alone is not a superkey of this table. BCNF demands every determinant be a superkey. This one isn’t.
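The candidate-key analysis above can also be derived mechanically from the two business rules with the standard attribute-closure algorithm. The following is a sketch rather than a full normalization tool: it checks only the listed dependencies, and the function names are hypothetical.

```python
from itertools import combinations

ATTRS = frozenset({"Student_ID", "Course_ID", "Advisor"})
FDS = [
    (frozenset({"Student_ID", "Course_ID"}), frozenset({"Advisor"})),  # rule 1
    (frozenset({"Advisor"}), frozenset({"Course_ID"})),                # rule 2
]

def closure(attrs, fds):
    """Every attribute determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(attrs, fds):
    """Minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            combo = frozenset(combo)
            if any(key <= combo for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == attrs:
                keys.append(combo)
    return keys

keys = candidate_keys(ATTRS, FDS)
prime = frozenset().union(*keys)

def is_superkey(lhs):
    return closure(lhs, FDS) == ATTRS

# 3NF: each determinant is a superkey, or the dependent attributes are prime.
in_3nf = all(is_superkey(lhs) or (rhs - lhs) <= prime for lhs, rhs in FDS)
# BCNF: each determinant is a superkey, with no escape clause.
in_bcnf = all(is_superkey(lhs) for lhs, rhs in FDS)

print([sorted(key) for key in keys], in_3nf, in_bcnf)
```

The script finds the same two candidate keys as the walkthrough, reports the relation as 3NF, and flags Advisor → Course_ID as the BCNF violation.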

The cost is real. If Dr. Lee changes specialty from CS101 to CS210, both rows mentioning Dr. Lee must update together. Miss one and the data implies Dr. Lee specializes in two courses simultaneously — contradicting rule 2.

The Fix — Decompose so every determinant is a candidate key:

Advisor	Course_ID
Dr. Lee	CS101
Dr. Kim	MATH200
Dr. Park	CS101

Student_ID	Advisor
12345	Dr. Lee
12345	Dr. Kim
67890	Dr. Lee
24680	Dr. Park

Call the first table Advisor_Specialty and the second Student_Advisor. In Advisor_Specialty, Advisor is the primary key, so Advisor → Course_ID is now a key-to-attribute dependency, exactly what BCNF requires. In Student_Advisor, the pair {Student_ID, Advisor} is the candidate key.

The catch — BCNF decomposition isn’t always dependency-preserving. We’ve lost the ability to enforce {Student_ID, Course_ID} → Advisor with a single constraint. To check that no student has two advisors for the same course, we’d have to join Student_Advisor with Advisor_Specialty and verify uniqueness of {Student_ID, Course_ID} — a runtime check, not a schema-level guarantee. This is the classic BCNF trade-off: stricter normalization, weaker constraint enforcement.
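One way to write that runtime check, sketched with in-memory SQLite (lowercase table names assumed). The query joins the two tables and reports any (student, course) pair with more than one advisor. The extra row inserted at the end is hypothetical: it satisfies both tables' keys and the foreign key, yet breaks rule 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE advisor_specialty (
    advisor   TEXT PRIMARY KEY,  -- enforces Advisor -> Course_ID
    course_id TEXT NOT NULL
);
CREATE TABLE student_advisor (
    student_id INTEGER NOT NULL,
    advisor    TEXT    NOT NULL REFERENCES advisor_specialty(advisor),
    PRIMARY KEY (student_id, advisor)
);
INSERT INTO advisor_specialty VALUES
    ('Dr. Lee', 'CS101'), ('Dr. Kim', 'MATH200'), ('Dr. Park', 'CS101');
INSERT INTO student_advisor VALUES
    (12345, 'Dr. Lee'), (12345, 'Dr. Kim'), (67890, 'Dr. Lee'), (24680, 'Dr. Park');
""")

# Violations of {Student_ID, Course_ID} -> Advisor, visible only after a join.
CHECK_SQL = """
    SELECT sa.student_id, sp.course_id, COUNT(*) AS advisors
    FROM student_advisor AS sa
    JOIN advisor_specialty AS sp ON sp.advisor = sa.advisor
    GROUP BY sa.student_id, sp.course_id
    HAVING COUNT(*) > 1
"""
before = conn.execute(CHECK_SQL).fetchall()

# Accepted by every declared constraint, but student 12345 now has
# two advisors (Dr. Lee and Dr. Park) for CS101.
conn.execute("INSERT INTO student_advisor VALUES (12345, 'Dr. Park')")
after = conn.execute(CHECK_SQL).fetchall()
print(before, after)  # [] [(12345, 'CS101', 2)]
```

In practice a check like this would run in a trigger, a scheduled audit, or application code, since no single-table constraint can express it.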

Fourth Normal Form (4NF)

The Rule: Must be in BCNF. For every non-trivial multi-valued dependency X →→ Y, X must be a superkey.

A multi-valued dependency X →→ Y means: given a value of X, the set of Y values associated with it is fixed and independent of any other non-X column. This is different from a functional dependency — an FD determines a single Y per X, an MVD determines a set.

A single MVD on a key is fine — it’s just “this key has many associated values” (a course has many enrolled students). The 4NF violation appears when a table has two independent MVDs sharing the same determinant — forcing every combination of the two sets to be materialized as separate rows.

The Violating Table — Course Resources:

Each course has a set of approved textbooks and a set of qualified instructors. Any approved textbook can be paired with any qualified instructor — the two sets are independent.

Course_ID	Textbook	Instructor
CS101	Programming Basics	Dr. Lee
CS101	Programming Basics	Dr. Kim
CS101	Intro to CS	Dr. Lee
CS101	Intro to CS	Dr. Kim

Two MVDs hold simultaneously:

Course_ID →→ Textbook — CS101’s textbook set is {Programming Basics, Intro to CS}, independent of who teaches.
Course_ID →→ Instructor — CS101’s instructor set is {Dr. Lee, Dr. Kim}, independent of textbook.

The candidate key of this table is the full triple {Course_ID, Textbook, Instructor} — Course_ID alone is not a superkey. So neither MVD is “on a superkey”, and 4NF is violated.
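To make the definition concrete, here is a small sketch of an MVD test over sample rows. It relies on an equivalent formulation: X →→ Y holds when, within each group of rows sharing an X value, the rows are exactly the cross product of that group's Y values and its remaining-column values. The function name mvd_holds is hypothetical, and as with functional dependencies, sample data can only refute an MVD.

```python
from itertools import product

COURSE_RESOURCES = [
    {"Course_ID": "CS101", "Textbook": "Programming Basics", "Instructor": "Dr. Lee"},
    {"Course_ID": "CS101", "Textbook": "Programming Basics", "Instructor": "Dr. Kim"},
    {"Course_ID": "CS101", "Textbook": "Intro to CS", "Instructor": "Dr. Lee"},
    {"Course_ID": "CS101", "Textbook": "Intro to CS", "Instructor": "Dr. Kim"},
]

def mvd_holds(rows, x, y):
    """True if x ->> y holds in rows (z is every column not in x or y)."""
    if not rows:
        return True
    z = [c for c in rows[0] if c not in x and c not in y]
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in x)
        y_val = tuple(row[c] for c in y)
        z_val = tuple(row[c] for c in z)
        ys, zs, pairs = groups.setdefault(key, (set(), set(), set()))
        ys.add(y_val)
        zs.add(z_val)
        pairs.add((y_val, z_val))
    # Every (y, z) combination must be present within each x-group.
    return all(pairs == set(product(ys, zs)) for ys, zs, pairs in groups.values())

full = mvd_holds(COURSE_RESOURCES, ["Course_ID"], ["Textbook"])
missing_one = mvd_holds(COURSE_RESOURCES[:-1], ["Course_ID"], ["Textbook"])
print(full, missing_one)  # True False
```

Dropping a single row breaks the MVD, which is exactly the "miss any combination" failure the combined table is exposed to.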

The cost is combinatorial. The table must store every (Textbook, Instructor) combination: 2 × 2 = 4 rows. Adding Dr. Park as a third instructor requires two new rows, one per textbook. Adding a third textbook after that requires three new rows, one per instructor. Miss any combination and the relation contradicts the independence rule.

The Fix — One table per independent multi-valued fact:

Course_ID	Textbook
CS101	Programming Basics
CS101	Intro to CS

Course_ID	Instructor
CS101	Dr. Lee
CS101	Dr. Kim

Each MVD now lives on the candidate key of its own table. Adding Dr. Park costs one row, not two. The independence between textbooks and instructors is encoded in the schema itself.
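The row arithmetic can be checked with plain Python sets standing in for the two tables. This is only an illustration of the counting argument; the variable and function names are made up.

```python
# The two 4NF tables, one fact per row.
course_textbook = {
    ("CS101", "Programming Basics"),
    ("CS101", "Intro to CS"),
}
course_instructor = {
    ("CS101", "Dr. Lee"),
    ("CS101", "Dr. Kim"),
}

def join_on_course(textbooks, instructors):
    """Natural join on Course_ID: rebuilds the combined three-column table."""
    return {
        (course, textbook, instructor)
        for (course, textbook) in textbooks
        for (other_course, instructor) in instructors
        if course == other_course
    }

combined_before = join_on_course(course_textbook, course_instructor)

# Adding Dr. Park is a single new row in the decomposed schema...
course_instructor.add(("CS101", "Dr. Park"))
combined_after = join_on_course(course_textbook, course_instructor)

# ...but corresponds to one new row per textbook in the combined table.
print(len(combined_before), len(combined_after))  # 4 6
```

The combined table is still available as a join whenever a query needs it; it simply no longer has to be stored and kept complete by hand.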

Fifth Normal Form (5NF)

5NF is rarely the deciding factor in practical schema design — most production systems stop at 3NF or BCNF. We cover it here because the underlying idea (avoiding fabricated tuples when tables are rejoined) is genuinely useful even when you’re not formally proving 5NF compliance.

Textbook reference: the classic 5NF example is sometimes called the supplier–part–project (or supplier–product–project) problem — a ternary relation between suppliers, parts, and projects governed by three independent pairwise facts. The instructor–course–semester variant used below has exactly the same shape; the lesson transfers either way.

The Rule: Must be in 4NF. Every join dependency is implied by the candidate keys — the table cannot be decomposed losslessly into smaller pieces beyond what its keys already imply.

A join dependency *[R1, R2, ..., Rn] says the relation equals the natural join of its projections onto R1, R2, ..., Rn. The relation can be stitched back together from those pieces with nothing lost or invented.

5NF matters when the only lossless decomposition is into three or more pieces — any two-table split introduces spurious (incorrect) rows on rejoin.

The Violating Table — Instructor / Course / Semester:

The university records teaching assignments in a single table. Three pairwise rules govern it:

An instructor is qualified to teach a set of courses.
A course is offered in a set of semesters.
An instructor is employed in a set of semesters.

The constraint that makes this 5NF-relevant: the table contains (Instructor, Course, Semester) exactly when all three pairwise facts hold — the instructor is qualified for that course, the course is offered that semester, and the instructor is employed that semester. Without this constraint this would be an arbitrary ternary relation and no decomposition would be lossless.

Instructor	Course	Semester
Dr. Lee	CS101	Fall 2025
Dr. Lee	CS101	Spring 2026
Dr. Lee	CS210	Fall 2025
Dr. Kim	CS101	Fall 2025

The pairwise facts underlying these rows:

Qualified: Lee↔CS101, Lee↔CS210, Kim↔CS101
Offered: CS101↔Fall 2025, CS101↔Spring 2026, CS210↔Fall 2025
Employed: Lee↔Fall 2025, Lee↔Spring 2026, Kim↔Fall 2025

Notice Kim is not employed in Spring 2026 — so (Kim, CS101, Spring 2026) is correctly absent from the table even though Kim is qualified for CS101 and CS101 is offered that semester.

The Fix — three pairwise tables:

Instructor	Course
Dr. Lee	CS101
Dr. Lee	CS210
Dr. Kim	CS101

Course	Semester
CS101	Fall 2025
CS101	Spring 2026
CS210	Fall 2025

Instructor	Semester
Dr. Lee	Fall 2025
Dr. Lee	Spring 2026
Dr. Kim	Fall 2025

Joining all three on their shared columns reconstructs the original.

Why three tables, not two? Suppose we kept only Qualified_To_Teach(Instructor, Course) and Course_Offered(Course, Semester). Joining these on Course gives:

Instructor	Course	Semester
Dr. Lee	CS101	Fall 2025
Dr. Lee	CS101	Spring 2026
Dr. Lee	CS210	Fall 2025
Dr. Kim	CS101	Fall 2025
Dr. Kim	CS101	Spring 2026 ← fabricated

The last row appears because Kim is qualified for CS101 and CS101 is offered in Spring 2026 — but Kim isn’t employed that semester. The two-table decomposition is lossy: it manufactures rows on rejoin.

The third table Instructor_Employed(Instructor, Semester) is precisely the constraint that filters this out: only (I, C, S) triples that survive joining all three tables are valid.
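The lossy and lossless joins can be reproduced with Python sets. The sketch below projects the original table onto its three column pairs, joins two of them, and then filters through the third; the variable names are chosen for the example.

```python
original = {
    ("Dr. Lee", "CS101", "Fall 2025"),
    ("Dr. Lee", "CS101", "Spring 2026"),
    ("Dr. Lee", "CS210", "Fall 2025"),
    ("Dr. Kim", "CS101", "Fall 2025"),
}

# The three pairwise projections (the tables in the fix above).
qualified = {(i, c) for i, c, s in original}
offered = {(c, s) for i, c, s in original}
employed = {(i, s) for i, c, s in original}

# Two-table join on Course: Qualified joined with Offered.
two_way = {
    (i, c, s)
    for i, c in qualified
    for c2, s in offered
    if c == c2
}

# Three-table join: keep only triples whose (Instructor, Semester)
# pair also appears in Employed.
three_way = {(i, c, s) for i, c, s in two_way if (i, s) in employed}

spurious = two_way - original
print(spurious)               # {('Dr. Kim', 'CS101', 'Spring 2026')}
print(three_way == original)  # True
```

The two-way join produces five rows, one of them fabricated; adding the third projection brings the result back to the original four.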

The Progression at a Glance

3NF / BCNF is the right target for most production systems. Go further only when you have genuinely independent multi-valued facts (4NF) or complex three-way join dependencies (5NF).

When to Denormalize

Normalization is the right starting point — not always the ending point. Intentional redundancy is valid when:

Read performance is critical — heavily joined queries across many normalized tables are expensive. A materialized view or denormalized reporting table can dramatically speed up reads.
You’re building a data warehouse — OLAP workloads favor wide flat tables (star/snowflake schemas) over normalized OLTP schemas.
The data rarely changes — if a field updates once a year, the update anomaly risk is low.
A query is too hot to afford joins — on high-throughput endpoints, a cache field with an explicit refresh strategy can be worth the complexity.

Normalize first, denormalize deliberately. Start clean, measure where bottlenecks actually are, and introduce redundancy with intention — not by accident.
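As one possible shape for "redundancy with intention", the sketch below keeps the normalized tables as the source of truth and rebuilds a wide read table on demand. SQLite has no materialized views, so a plain table plus a refresh function stands in for one; the enrollment_report table, the refresh_report function, and the surname change are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_id   INTEGER PRIMARY KEY,
    student_name TEXT NOT NULL
);
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  TEXT    NOT NULL,
    grade      TEXT,
    PRIMARY KEY (student_id, course_id)
);
-- Denormalized read table: student_name is deliberately duplicated here.
CREATE TABLE enrollment_report (
    student_id   INTEGER NOT NULL,
    student_name TEXT    NOT NULL,
    course_id    TEXT    NOT NULL,
    grade        TEXT,
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (12345, 'Park'), (67890, 'Wren');
INSERT INTO enrollment VALUES (12345, 'CS101', 'A'), (67890, 'CS101', 'A');
""")

def refresh_report(conn):
    """Rebuild the read table from the normalized source tables."""
    with conn:  # commit on success, roll back if either statement fails
        conn.execute("DELETE FROM enrollment_report")
        conn.execute("""
            INSERT INTO enrollment_report
            SELECT e.student_id, s.student_name, e.course_id, e.grade
            FROM enrollment AS e
            JOIN student AS s ON s.student_id = e.student_id
        """)

NAME_SQL = "SELECT student_name FROM enrollment_report WHERE student_id = 12345"

refresh_report(conn)
conn.execute("UPDATE student SET student_name = 'Park-Lee' WHERE student_id = 12345")
stale = conn.execute(NAME_SQL).fetchone()[0]   # report not refreshed yet
refresh_report(conn)
fresh = conn.execute(NAME_SQL).fetchone()[0]
print(stale, fresh)  # Park Park-Lee
```

The window in which the report is stale is the price of the faster read, which is why the refresh strategy needs to be an explicit decision rather than an afterthought.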

Pitfalls and Common Mistakes

The rules above are clear, but applying them in practice is where teams get into trouble. The recurring patterns worth watching for:

Confusing 3NF with BCNF. They sound interchangeable but aren’t. A relation can be in 3NF and still violate BCNF when a non-superkey determines an attribute that happens to be prime (part of some candidate key). Check every functional dependency — not just the ones whose left side is the primary key.
Treating 4NF and 5NF as required. They apply only when independent multi-valued facts (4NF) or three-way join dependencies (5NF) actually exist in your data. Most production schemas don’t hit either situation. Don’t decompose chasing a textbook target you don’t have.
Over-normalizing read-mostly schemas. Splitting tables to satisfy 3NF only pays off if updates happen. A table where rows are written once and never modified rarely benefits from being split into three with foreign keys — you’re paying join cost for no anomaly-prevention upside.
Mistaking “atomic” for “small.” 1NF demands no repeating groups, not “no complex values.” A JSONB document, a PostGIS geometry, or an array column is 1NF-compliant if treated as a single indivisible value. Don’t split data that the application always reads and writes as a unit.
Dropping foreign keys “for performance.” Foreign keys cost very little on modern Postgres and are the standard declarative mechanism for guaranteeing the referential integrity that normalization assumes. Disabling them turns “guaranteed valid” into “valid as long as no bug ever ships.”
Storing derived values that violate 3NF. A row holding unit_price, quantity, and total_price creates a transitive dependency (total_price is derivable from the other two). Either compute it on read or accept the redundancy with an explicit refresh strategy and write the trigger / view that keeps it consistent.
Adding nullable “extension” columns instead of normalizing. Adding secondary_email, tertiary_email, home_phone, work_phone to a users table — most of them null for most users — is a repeating group disguised as separate columns. The right move is a user_contact_methods child table.
Premature normalization before knowing the access patterns. Splitting a domain into eight tables before you know how the application queries it produces a schema optimized for nothing. Sketch the read paths first; let them inform the table boundaries.
Forgetting that BCNF decomposition isn’t always dependency-preserving. Decomposing to satisfy BCNF can scatter a multi-attribute functional dependency across two tables in a way that makes it un-enforceable with a simple FK. Decide explicitly whether you’d rather lose the BCNF guarantee or the dependency-preservation guarantee — there’s no fully correct answer.
Using surrogate keys to “skip” normalization. Adding an id BIGSERIAL to a table doesn’t make it normalized — it just gives the existing duplicates a new column. Normalize against the real candidate keys; add surrogate IDs as an implementation detail for foreign-key ergonomics.
Ignoring functional dependencies in legacy schemas. When refactoring a denormalized table, list every X → Y you can identify in the data before designing the new shape. The dependencies tell you where the tables want to split — guessing produces a schema that fights you at the next migration.
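For the derived-value pitfall in the list above, the "compute it on read" option can be as small as a view. This is a sketch in SQLite with made-up names (order_line, order_line_with_total), and it stores prices as integer cents to sidestep floating-point rounding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_line (
    order_line_id    INTEGER PRIMARY KEY,
    unit_price_cents INTEGER NOT NULL,
    quantity         INTEGER NOT NULL
);
-- total_price is never stored, so it can never disagree with its inputs.
CREATE VIEW order_line_with_total AS
    SELECT order_line_id,
           unit_price_cents,
           quantity,
           unit_price_cents * quantity AS total_price_cents
    FROM order_line;
INSERT INTO order_line VALUES (1, 250, 4);
""")

TOTAL_SQL = (
    "SELECT total_price_cents FROM order_line_with_total "
    "WHERE order_line_id = 1"
)
total_before = conn.execute(TOTAL_SQL).fetchone()[0]

# Changing an input needs no second write to keep the total in step.
conn.execute("UPDATE order_line SET quantity = 5 WHERE order_line_id = 1")
total_after = conn.execute(TOTAL_SQL).fetchone()[0]
print(total_before, total_after)  # 1000 1250
```

If the total must be stored for performance, the same expression belongs in whatever trigger or refresh job keeps the stored copy consistent.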

FAQ

What is 5NF in DBMS?

Fifth normal form is the level at which a table cannot be losslessly decomposed into smaller tables beyond what its candidate keys already imply. It only matters for relations with three-way join dependencies — tables where every two-table split would invent spurious rows on rejoin. The textbook example is the supplier–part–project problem; the instructor–course–semester walkthrough above has the same structure.

What’s the difference between 1NF, 2NF, 3NF, 4NF, and 5NF?

Each form tightens the rule on which columns may determine which:

1NF — atomic values, no repeating groups, every row uniquely identifiable.
2NF — every non-key column depends on the whole composite key, not just part of it.
3NF — no transitive dependencies through other non-key columns.
BCNF — every determinant must be a superkey (closes the 3NF loophole).
4NF — no independent multi-valued dependencies on a non-superkey.
5NF — no join dependencies beyond what candidate keys already imply.

3NF (or BCNF) is the practical target for most production schemas.

What’s the difference between 3NF and BCNF?

3NF allows a non-superkey to determine a column if that column is prime (part of some candidate key). BCNF disallows that escape clause — every determinant must be a superkey, no exceptions. The two diverge only when a relation has overlapping candidate keys; for most tables they’re equivalent.

When should you stop normalizing?

Almost always at 3NF or BCNF. Go to 4NF only when your data contains genuinely independent multi-valued facts; go to 5NF only when there are three-way join dependencies (rare in practice). Pushing past necessity makes the schema harder to query without preventing any real anomaly.

Is a JSON or JSONB column 1NF-compliant?

Yes, if the application treats the JSON value as a single indivisible unit. 1NF forbids repeating groups, not complex values. A JSONB document, a PostGIS geometry, or an array column is fine as long as you don’t model fields-inside-the-document as if they were independent columns elsewhere in the schema.

What is denormalization?

The deliberate introduction of redundancy to improve read performance — a materialized view, a cached aggregate field, a wide reporting table. It’s not the opposite of normalization; it’s a controlled departure from it. Always normalize first, then denormalize specific hot paths with intent.
