🏗️ Lesson 16: Database Design & Normalization
You've learned to create tables, define relationships, write JOINs, optimize with indexes, and simplify with views. But how do you design the tables correctly in the first place? A poorly designed schema leads to duplicated data, update headaches, and mysterious bugs. Normalization is the systematic process of organizing tables so that data is stored once, in the right place, with clear relationships. It's the difference between a database that works and one that works well.
🎯 Learning Objectives
By the end of this lesson, you will be able to:
- Identify the problems caused by poorly designed tables
- Apply First Normal Form (1NF) to eliminate repeating groups
- Apply Second Normal Form (2NF) to remove partial dependencies
- Apply Third Normal Form (3NF) to remove transitive dependencies
- Read and create Entity-Relationship (ER) diagrams
- Design a normalized schema from requirements
- Know when denormalization is the right trade-off
Estimated Time: 50 minutes
Prerequisites: Views (Lesson 15)
📑 In This Lesson
Why Design Matters
Imagine a spreadsheet where one column is "Customer Name," another is "Customer Email," another is "Customer Phone" — and these same values are repeated on every single order row. Change a customer's email? You'd have to update it in 50 different rows. Miss one, and now you have two conflicting emails for the same person.
That's what happens in an unnormalized database. Normalization fixes these problems by organizing data into well-structured tables where each fact is stored exactly once.
→ Duplicated data everywhere"] end subgraph good["✅ Normalized"] G1["customers"] G2["orders"] G3["books"] G1 --- G2 G2 --- G3 end bad -->|"Normalization"| good
The Problem: A Badly Designed Table
Let's start with a table that stores everything about orders in one flat structure — the kind of thing you might see in a spreadsheet export:
The "Everything" Table
| order_id | order_date | cust_name | cust_email | cust_city | book_titles | book_prices | quantities |
|---|---|---|---|---|---|---|---|
| 1 | 2026-01-15 | Alice Johnson | alice@example.com | Las Vegas | Dune, 1984 | 14.99, 9.99 | 1, 2 |
| 2 | 2026-02-20 | Alice Johnson | alice@example.com | Las Vegas | Pride and Prejudice | 7.99 | 1 |
| 3 | 2026-03-10 | Bob Smith | bob@example.com | Denver | Dune | 14.99 | 1 |
At first glance it looks fine. But look closer — there are serious problems hiding in this design.
Update, Insert, & Delete Anomalies
The "everything" table suffers from three classic problems:
1. Update Anomaly
Alice's email appears in both rows. If she changes her email, you must update every single row containing her data. Miss one? Now you have two different emails for the same customer. Which is correct?
⚠️ The Real Danger
With 1,000 orders from Alice, you'd need to update 1,000 rows — and they must all succeed. One failure creates an inconsistency that's hard to detect and harder to fix.
2. Insert Anomaly
You want to add a new customer (Carol) who hasn't placed an order yet. But every row requires an order_id and order_date. You can't add a customer without faking an order — or leaving required fields NULL.
3. Delete Anomaly
If you delete Bob's only order (order 3), you lose all information about Bob — his name, email, and city are gone. Deleting an order shouldn't erase a customer.
Change email in 50 rows?"] A --> C["Insert Anomaly
Can't add customer without order"] A --> D["Delete Anomaly
Deleting order erases customer"] B --> E["Solution: Normalization"] C --> E D --> E
These problems exist because different facts are mixed together in one table. Customer data, order data, and book data should live in separate tables. That's what normalization achieves.
First Normal Form (1NF)
📜 1NF Rule
A table is in First Normal Form if:
- Every column contains only atomic (single, indivisible) values
- There are no repeating groups or arrays in a single cell
- Each row is unique (has a primary key)
The Violation
Look at our "everything" table — the book_titles, book_prices, and quantities columns contain comma-separated lists:
| order_id | book_titles | book_prices | quantities |
|---|---|---|---|
| 1 | Dune, 1984 | 14.99, 9.99 | 1, 2 |
"Dune, 1984" in one cell is not atomic — it's two values crammed together. You can't easily query "find all orders containing 1984" without messy string parsing.
The Fix: One Fact Per Cell
Split each item into its own row:
| order_id | order_date | cust_name | cust_email | cust_city | book_title | book_price | quantity |
|---|---|---|---|---|---|---|---|
| 1 | 2026-01-15 | Alice Johnson | alice@example.com | Las Vegas | Dune | 14.99 | 1 |
| 1 | 2026-01-15 | Alice Johnson | alice@example.com | Las Vegas | 1984 | 9.99 | 2 |
| 2 | 2026-02-20 | Alice Johnson | alice@example.com | Las Vegas | Pride and Prejudice | 7.99 | 1 |
| 3 | 2026-03-10 | Bob Smith | bob@example.com | Denver | Dune | 14.99 | 1 |
Now every cell has exactly one value. The primary key would be (order_id, book_title) — a composite key identifying each unique order-item combination.
✅ 1NF Achieved
No more comma-separated values. Each cell is atomic, each row is unique. But we still have massive redundancy — Alice's name, email, and city are repeated on every row. That's what 2NF addresses.
Functional Dependencies
Before tackling 2NF and 3NF, you need to understand functional dependencies — they're the reasoning tool behind normalization.
A functional dependency means: if you know the value of column A, you can determine the value of column B. We write this as:
A → B (A determines B)
Examples from Our Table
| Dependency | Meaning |
|---|---|
order_id → order_date | Knowing the order ID tells you the date |
order_id → cust_name, cust_email, cust_city | Knowing the order ID tells you which customer |
book_title → book_price | Knowing the book tells you its price |
(order_id, book_title) → quantity | Knowing the order AND book tells you the quantity |
The composite primary key is (order_id, book_title). Notice that some columns depend on the whole key (quantity), while others depend on only part of it (order_date depends only on order_id). That partial dependency is exactly what 2NF eliminates.
Second Normal Form (2NF)
📜 2NF Rule
A table is in Second Normal Form if:
- It's already in 1NF
- Every non-key column depends on the entire primary key, not just part of it
2NF only applies to tables with composite primary keys. A table with a single-column PK is automatically in 2NF if it's in 1NF.
The Violation
Our 1NF table has primary key (order_id, book_title). But several columns depend on only part of that key:
order_date,cust_name,cust_email,cust_citydepend onorder_idalonebook_pricedepends onbook_titlealone
Only quantity depends on the full composite key (order_id + book_title).
The Fix: Split Into Separate Tables
Move each group of columns to its own table, keyed by the column they actually depend on:
orders — columns that depend on order_id
| order_id | order_date | cust_name | cust_email | cust_city |
|---|---|---|---|---|
| 1 | 2026-01-15 | Alice Johnson | alice@example.com | Las Vegas |
| 2 | 2026-02-20 | Alice Johnson | alice@example.com | Las Vegas |
| 3 | 2026-03-10 | Bob Smith | bob@example.com | Denver |
books — columns that depend on book_title
| title | price |
|---|---|
| Dune | 14.99 |
| 1984 | 9.99 |
| Pride and Prejudice | 7.99 |
order_items — columns that depend on the full key (order_id, book_title)
| order_id | book_title | quantity |
|---|---|---|
| 1 | Dune | 1 |
| 1 | 1984 | 2 |
| 2 | Pride and Prejudice | 1 |
| 3 | Dune | 1 |
✅ 2NF Achieved
No partial dependencies — every non-key column depends on the entire primary key in its table. Book prices are stored once (in books), not repeated per order. But notice: the orders table still has Alice's name, email, and city repeated across two rows. That's a transitive dependency — and 3NF will fix it.
Third Normal Form (3NF)
📜 3NF Rule
A table is in Third Normal Form if:
- It's already in 2NF
- No non-key column depends on another non-key column (no transitive dependencies)
Or informally: "Every non-key column must depend on the key, the whole key, and nothing but the key."
The Violation
In our orders table:
order_id → cust_name → cust_email, cust_city
The customer's email and city don't depend on the order — they depend on the customer. If Alice changes her email, you still have to update every order row that mentions her. That's a transitive dependency: order_id determines cust_name, and cust_name determines cust_email.
The Fix: Extract the Customer
Move customer data into its own table and reference it with a foreign key:
customers
| id | name | city | |
|---|---|---|---|
| 1 | Alice Johnson | alice@example.com | Las Vegas |
| 2 | Bob Smith | bob@example.com | Denver |
orders — now just stores the order itself and a reference to the customer
| id | order_date | customer_id |
|---|---|---|
| 1 | 2026-01-15 | 1 |
| 2 | 2026-02-20 | 1 |
| 3 | 2026-03-10 | 2 |
✅ 3NF Achieved
Alice's email is stored exactly once — in the customers table. Change it there and it's updated everywhere. No update anomalies. No transitive dependencies. Every non-key column depends on the key, the whole key, and nothing but the key.
This final schema is exactly the bookstore design we've been using throughout this course — customers, orders, order_items, books, authors. Now you know why it's structured that way.
Normalization at a Glance
| Normal Form | Rule | What It Eliminates | How to Fix |
|---|---|---|---|
| 1NF | Atomic values, no repeating groups | Comma-separated values, arrays in cells | One value per cell, one row per item |
| 2NF | No partial dependencies (on part of composite key) | Columns that depend on only part of the PK | Move them to a table keyed by what they depend on |
| 3NF | No transitive dependencies (non-key → non-key) | Columns that depend on another non-key column | Extract into a new table with its own PK |
💡 The Informal Mnemonic
Every non-key column must depend on:
- The key (1NF — data is properly atomic and has a key)
- The whole key (2NF — not just part of a composite key)
- And nothing but the key (3NF — not on some other non-key column)
"So help me Codd." (A nod to Edgar F. Codd, who invented relational database theory.)
Entity-Relationship Diagrams
An ER diagram is a visual blueprint of your database. It shows tables (entities), their columns (attributes), and the relationships between them. ER diagrams are the standard tool for planning and communicating database designs.
Reading ER Diagrams
The relationship lines tell you how tables connect:
| Symbol | Meaning | Example |
|---|---|---|
||--|| | One-to-one | User has one profile |
||--o{ | One-to-many | Customer places many orders |
}o--o{ | Many-to-many | Students enroll in courses (via junction table) |
Our Bookstore ER Diagram
This diagram tells the whole story at a glance: an author writes many books, a customer places many orders, each order contains many items, and each item refers to one book.
💡 Design Before You Code
Always draw an ER diagram before writing CREATE TABLE statements. It's much easier to spot missing relationships, redundant data, or awkward structures in a diagram than in SQL code. Tools like dbdiagram.io, draw.io, or even pen and paper work great.
Design Walkthrough: From Requirements to Tables
Let's design a database from scratch. Here are the requirements for a music streaming service:
📋 Requirements
- Users have a name, email, and signup date
- Artists have a name and genre
- Albums have a title, release year, and belong to one artist
- Songs have a title, duration, and belong to one album
- Users can create playlists with a name and description
- Playlists contain songs (in a specific order)
Step 1: Identify Entities
Nouns in the requirements become tables: users, artists, albums, songs, playlists.
Step 2: Identify Relationships
- Artist → Albums: one-to-many (one artist has many albums)
- Album → Songs: one-to-many (one album has many songs)
- User → Playlists: one-to-many (one user creates many playlists)
- Playlist ↔ Songs: many-to-many (a playlist has many songs, a song can be in many playlists) → needs a junction table
Step 3: Draw the ER Diagram
Step 4: Write the SQL
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(150) NOT NULL UNIQUE,
signup_date DATE NOT NULL DEFAULT (CURRENT_DATE)
);
CREATE TABLE artists (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
genre VARCHAR(50)
);
CREATE TABLE albums (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(200) NOT NULL,
release_year YEAR,
artist_id INT NOT NULL,
FOREIGN KEY (artist_id) REFERENCES artists(id)
);
CREATE TABLE songs (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(200) NOT NULL,
duration_seconds INT NOT NULL,
album_id INT NOT NULL,
FOREIGN KEY (album_id) REFERENCES albums(id)
);
CREATE TABLE playlists (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
description TEXT,
user_id INT NOT NULL,
FOREIGN KEY (user_id) REFERENCES users(id)
);
CREATE TABLE playlist_songs (
id INT AUTO_INCREMENT PRIMARY KEY,
playlist_id INT NOT NULL,
song_id INT NOT NULL,
position INT NOT NULL,
FOREIGN KEY (playlist_id) REFERENCES playlists(id) ON DELETE CASCADE,
FOREIGN KEY (song_id) REFERENCES songs(id),
UNIQUE (playlist_id, position)
);
Step 5: Verify Normalization
- 1NF — all values are atomic, every table has a primary key ✅
- 2NF — no composite PKs with partial dependencies (we used auto-increment IDs) ✅
- 3NF — no non-key column depends on another non-key column (artist genre depends on artist ID, not on album; user email depends on user ID, not on playlist) ✅
✅ Clean, Normalized Design
Each fact is stored once: an artist's genre lives in the artists table, a song's duration lives in the songs table. No redundancy, no anomalies. You can add users without playlists, delete playlists without losing songs, and update an artist's genre in one place.
When to Denormalize
Normalization is the default — but sometimes you intentionally denormalize (add controlled redundancy) for performance or simplicity.
Why Denormalize?
- Performance: A dashboard query that JOINs 8 tables on every page load might be too slow. Storing a pre-computed
total_spentcolumn on the customer row eliminates the JOINs. - Read-heavy systems: If you read data 10,000 times for every 1 write, optimizing for reads (even at the cost of write complexity) makes sense.
- Reporting tables: Analytics and warehousing often use "flat" denormalized tables because analysts need fast, simple queries — not normalized schemas.
Common Denormalization Strategies
| Strategy | Example | Trade-Off |
|---|---|---|
| Cached computed column | total_spent DECIMAL on customers |
Must update on every order change |
| Duplicated column | Store customer_name on orders |
Must sync when name changes |
| Summary/reporting table | monthly_sales table refreshed nightly |
Data is slightly stale |
⚠️ The Golden Rule
Normalize first, denormalize only when you have a proven performance problem. Don't skip normalization because you think it might be slow. Measure first, then denormalize the specific bottleneck. Premature denormalization creates the exact anomalies normalization was designed to prevent.
Exercises
🏋️ Exercise 1: Identify Normal Form Violations
Look at this table and identify which normal form(s) it violates:
| student_id | student_name | courses | advisor_name | advisor_email |
|---|---|---|---|---|
| 1 | Alice | Math, Physics, CS | Dr. Lee | lee@uni.edu |
| 2 | Bob | CS, History | Dr. Lee | lee@uni.edu |
| 3 | Carol | Math | Dr. Patel | patel@uni.edu |
✅ Solution
1NF violation: The courses column contains comma-separated values (not atomic).
3NF violation: advisor_email depends on advisor_name (a non-key column), not on student_id. Dr. Lee's email is duplicated — a transitive dependency.
Fix: Create separate tables for students, courses, advisors, and a junction table for student-course enrollment.
🏋️ Exercise 2: Normalize to 3NF
Take this unnormalized table and normalize it to 3NF. Show the resulting tables with columns and primary/foreign keys.
| invoice_id | invoice_date | vendor_name | vendor_city | items | amounts |
|---|---|---|---|---|---|
| 101 | 2026-03-01 | Acme Corp | Phoenix | Widgets, Bolts | 500, 120 |
| 102 | 2026-03-05 | Acme Corp | Phoenix | Widgets | 250 |
| 103 | 2026-03-10 | Beta LLC | Tucson | Gears, Bolts, Springs | 300, 80, 150 |
✅ Solution
-- vendors (extracted to remove transitive dependency)
CREATE TABLE vendors (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
city VARCHAR(100)
);
-- invoices (references vendor)
CREATE TABLE invoices (
id INT AUTO_INCREMENT PRIMARY KEY,
invoice_date DATE NOT NULL,
vendor_id INT NOT NULL,
FOREIGN KEY (vendor_id) REFERENCES vendors(id)
);
-- invoice_items (atomic values, full key dependency)
CREATE TABLE invoice_items (
id INT AUTO_INCREMENT PRIMARY KEY,
invoice_id INT NOT NULL,
item_name VARCHAR(100) NOT NULL,
amount DECIMAL(10,2) NOT NULL,
FOREIGN KEY (invoice_id) REFERENCES invoices(id)
);
1NF: Comma-separated items/amounts split into individual rows in invoice_items.
2NF: With single-column auto-increment PKs, partial dependencies are eliminated.
3NF: Vendor name and city moved to vendors table — no transitive dependencies.
🏋️ Exercise 3: Design from Requirements
Design a normalized database for a recipe website with these requirements:
- Users have a name, email, and join date
- Recipes have a title, description, prep time, and cook time
- Each recipe is submitted by one user
- Recipes have ingredients with quantities and units (e.g., "2 cups flour")
- Recipes belong to one or more categories (e.g., "Dessert," "Vegetarian")
- Users can save (bookmark) recipes
Draw the ER diagram (or write it as text) and write the CREATE TABLE statements.
✅ Solution
CREATE TABLE users (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(150) NOT NULL UNIQUE,
join_date DATE NOT NULL DEFAULT (CURRENT_DATE)
);
CREATE TABLE categories (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE recipes (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(200) NOT NULL,
description TEXT,
prep_time_minutes INT,
cook_time_minutes INT,
user_id INT NOT NULL,
FOREIGN KEY (user_id) REFERENCES users(id)
);
CREATE TABLE recipe_ingredients (
id INT AUTO_INCREMENT PRIMARY KEY,
recipe_id INT NOT NULL,
ingredient_name VARCHAR(100) NOT NULL,
quantity DECIMAL(8,2),
unit VARCHAR(30),
FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE
);
-- Junction table: recipes ↔ categories (M:M)
CREATE TABLE recipe_categories (
recipe_id INT NOT NULL,
category_id INT NOT NULL,
PRIMARY KEY (recipe_id, category_id),
FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE,
FOREIGN KEY (category_id) REFERENCES categories(id)
);
-- Junction table: users ↔ recipes bookmarks (M:M)
CREATE TABLE bookmarks (
user_id INT NOT NULL,
recipe_id INT NOT NULL,
saved_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, recipe_id),
FOREIGN KEY (user_id) REFERENCES users(id),
FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE
);
Key relationships: users → recipes (1:M), recipes ↔ categories (M:M via junction), users ↔ recipes bookmarks (M:M via junction), recipes → ingredients (1:M).
🎯 Quick Quiz
Question 1: What does "atomic" mean in the context of 1NF?
Question 2: When does 2NF apply?
Question 3: What is a transitive dependency?
Question 4: When is denormalization appropriate?
Summary
🎉 Key Takeaways
- Poorly designed tables cause update, insert, and delete anomalies
- 1NF: Every cell has one atomic value — no comma-separated lists or arrays
- 2NF: Every non-key column depends on the entire primary key (no partial dependencies)
- 3NF: No non-key column depends on another non-key column (no transitive dependencies)
- The mnemonic: "the key, the whole key, and nothing but the key"
- Functional dependencies (A → B) are the reasoning tool for normalization
- ER diagrams visualize tables, columns, and relationships — draw them before coding
- Design process: identify entities → identify relationships → draw ER diagram → write SQL → verify normalization
- Denormalization is a deliberate trade-off for performance — normalize first, measure, then denormalize specific bottlenecks
- 3NF is sufficient for most applications — higher normal forms (BCNF, 4NF, 5NF) exist but are rarely needed
Quick Reference
| Concept | Rule |
|---|---|
| 1NF | Atomic values, unique rows, no repeating groups |
| 2NF | 1NF + no partial dependencies on composite keys |
| 3NF | 2NF + no transitive dependencies (non-key → non-key) |
| Functional dependency | A → B means knowing A determines B |
| Junction table | Resolves many-to-many relationships with FKs to both sides |
| Denormalization | Controlled redundancy for proven read-performance needs |
📚 Additional Resources
- MySQL Docs — Using Databases
- Wikipedia — Database Normalization
- Wikipedia — Entity-Relationship Model
🚀 What's Next?
That completes Module 5: Database Design! You now know how to make queries fast (indexes), simplify complex queries (views), and design clean schemas from scratch (normalization). In Module 6: Advanced Features, you'll learn to build stored procedures and functions — reusable blocks of SQL logic that live inside the database and can be called by name.
🎉 Module 5 Complete!
You've mastered the design side of MySQL — indexes for speed, views for simplicity, and normalization for correctness. Your databases won't just work — they'll be fast, clean, and maintainable.