Database Design & Normalization

Why Design Matters

Imagine a spreadsheet where one column is "Customer Name," another is "Customer Email," another is "Customer Phone" — and these same values are repeated on every single order row. Change a customer's email? You'd have to update it in 50 different rows. Miss one, and now you have two conflicting emails for the same person.

That's what happens in an unnormalized database. Normalization fixes these problems by organizing data into well-structured tables where each fact is stored exactly once.

graph LR subgraph bad["❌ Unnormalized"] B["Everything in one giant table
→ Duplicated data everywhere"] end subgraph good["✅ Normalized"] G1["customers"] G2["orders"] G3["books"] G1 --- G2 G2 --- G3 end bad -->|"Normalization"| good

The Problem: A Badly Designed Table

Let's start with a table that stores everything about orders in one flat structure — the kind of thing you might see in a spreadsheet export:

The "Everything" Table

order_id	order_date	cust_name	cust_email	cust_city	book_titles	book_prices	quantities
1	2026-01-15	Alice Johnson	alice@example.com	Las Vegas	Dune, 1984	14.99, 9.99	1, 2
2	2026-02-20	Alice Johnson	alice@example.com	Las Vegas	Pride and Prejudice	7.99	1
3	2026-03-10	Bob Smith	bob@example.com	Denver	Dune	14.99	1

At first glance it looks fine. But look closer — there are serious problems hiding in this design.

Update, Insert, & Delete Anomalies

The "everything" table suffers from three classic problems:

1. Update Anomaly

Alice's email appears in both rows. If she changes her email, you must update every single row containing her data. Miss one? Now you have two different emails for the same customer. Which is correct?

⚠️ The Real Danger

With 1,000 orders from Alice, you'd need to update 1,000 rows — and they must all succeed. One failure creates an inconsistency that's hard to detect and harder to fix.

2. Insert Anomaly

You want to add a new customer (Carol) who hasn't placed an order yet. But every row requires an order_id and order_date. You can't add a customer without faking an order — or leaving required fields NULL.

3. Delete Anomaly

If you delete Bob's only order (order 3), you lose all information about Bob — his name, email, and city are gone. Deleting an order shouldn't erase a customer.

graph TD A["One Big Table"] --> B["Update Anomaly
Change email in 50 rows?"] A --> C["Insert Anomaly
Can't add customer without order"] A --> D["Delete Anomaly
Deleting order erases customer"] B --> E["Solution: Normalization"] C --> E D --> E

These problems exist because different facts are mixed together in one table. Customer data, order data, and book data should live in separate tables. That's what normalization achieves.

First Normal Form (1NF)

📜 1NF Rule

A table is in First Normal Form if:

Every column contains only atomic (single, indivisible) values
There are no repeating groups or arrays in a single cell
Each row is unique (has a primary key)

The Violation

Look at our "everything" table — the book_titles, book_prices, and quantities columns contain comma-separated lists:

order_id	book_titles	book_prices	quantities
1	Dune, 1984	14.99, 9.99	1, 2

"Dune, 1984" in one cell is not atomic — it's two values crammed together. You can't easily query "find all orders containing 1984" without messy string parsing.

The Fix: One Fact Per Cell

Split each item into its own row:

order_id	order_date	cust_name	cust_email	cust_city	book_title	book_price	quantity
1	2026-01-15	Alice Johnson	alice@example.com	Las Vegas	Dune	14.99	1
1	2026-01-15	Alice Johnson	alice@example.com	Las Vegas	1984	9.99	2
2	2026-02-20	Alice Johnson	alice@example.com	Las Vegas	Pride and Prejudice	7.99	1
3	2026-03-10	Bob Smith	bob@example.com	Denver	Dune	14.99	1

Now every cell has exactly one value. The primary key would be (order_id, book_title) — a composite key identifying each unique order-item combination.

✅ 1NF Achieved

No more comma-separated values. Each cell is atomic, each row is unique. But we still have massive redundancy — Alice's name, email, and city are repeated on every row. That's what 2NF addresses.

Functional Dependencies

Before tackling 2NF and 3NF, you need to understand functional dependencies — they're the reasoning tool behind normalization.

A functional dependency means: if you know the value of column A, you can determine the value of column B. We write this as:

A → B    (A determines B)

Examples from Our Table

Dependency	Meaning
`order_id → order_date`	Knowing the order ID tells you the date
`order_id → cust_name, cust_email, cust_city`	Knowing the order ID tells you which customer
`book_title → book_price`	Knowing the book tells you its price
`(order_id, book_title) → quantity`	Knowing the order AND book tells you the quantity

The composite primary key is (order_id, book_title). Notice that some columns depend on the whole key (quantity), while others depend on only part of it (order_date depends only on order_id). That partial dependency is exactly what 2NF eliminates.

Second Normal Form (2NF)

📜 2NF Rule

A table is in Second Normal Form if:

It's already in 1NF
Every non-key column depends on the entire primary key, not just part of it

2NF only applies to tables with composite primary keys. A table with a single-column PK is automatically in 2NF if it's in 1NF.

The Violation

Our 1NF table has primary key (order_id, book_title). But several columns depend on only part of that key:

order_date, cust_name, cust_email, cust_city depend on order_id alone
book_price depends on book_title alone

Only quantity depends on the full composite key (order_id + book_title).

The Fix: Split Into Separate Tables

Move each group of columns to its own table, keyed by the column they actually depend on:

erDiagram orders { int order_id PK date order_date varchar cust_name varchar cust_email varchar cust_city } books { varchar title PK decimal price } order_items { int order_id FK varchar book_title FK int quantity } orders ||--o{ order_items : "has" books ||--o{ order_items : "in"

orders — columns that depend on order_id

order_id	order_date	cust_name	cust_email	cust_city
1	2026-01-15	Alice Johnson	alice@example.com	Las Vegas
2	2026-02-20	Alice Johnson	alice@example.com	Las Vegas
3	2026-03-10	Bob Smith	bob@example.com	Denver

books — columns that depend on book_title

title	price
Dune	14.99
1984	9.99
Pride and Prejudice	7.99

order_items — columns that depend on the full key (order_id, book_title)

order_id	book_title	quantity
1	Dune	1
1	1984	2
2	Pride and Prejudice	1
3	Dune	1

✅ 2NF Achieved

No partial dependencies — every non-key column depends on the entire primary key in its table. Book prices are stored once (in books), not repeated per order. But notice: the orders table still has Alice's name, email, and city repeated across two rows. That's a transitive dependency — and 3NF will fix it.

Third Normal Form (3NF)

📜 3NF Rule

A table is in Third Normal Form if:

It's already in 2NF
No non-key column depends on another non-key column (no transitive dependencies)

Or informally: "Every non-key column must depend on the key, the whole key, and nothing but the key."

The Violation

In our orders table:

order_id → cust_name → cust_email, cust_city

The customer's email and city don't depend on the order — they depend on the customer. If Alice changes her email, you still have to update every order row that mentions her. That's a transitive dependency: order_id determines cust_name, and cust_name determines cust_email.

The Fix: Extract the Customer

Move customer data into its own table and reference it with a foreign key:

erDiagram customers { int id PK varchar name varchar email varchar city } orders { int id PK date order_date int customer_id FK } books { int id PK varchar title decimal price } order_items { int id PK int order_id FK int book_id FK int quantity } customers ||--o{ orders : "places" orders ||--o{ order_items : "contains" books ||--o{ order_items : "included in"

customers

id	name	email	city
1	Alice Johnson	alice@example.com	Las Vegas
2	Bob Smith	bob@example.com	Denver

orders — now just stores the order itself and a reference to the customer

id	order_date	customer_id
1	2026-01-15	1
2	2026-02-20	1
3	2026-03-10	2

✅ 3NF Achieved

Alice's email is stored exactly once — in the customers table. Change it there and it's updated everywhere. No update anomalies. No transitive dependencies. Every non-key column depends on the key, the whole key, and nothing but the key.

This final schema is exactly the bookstore design we've been using throughout this course — customers, orders, order_items, books, authors. Now you know why it's structured that way.

Normalization at a Glance

Normal Form	Rule	What It Eliminates	How to Fix
1NF	Atomic values, no repeating groups	Comma-separated values, arrays in cells	One value per cell, one row per item
2NF	No partial dependencies (on part of composite key)	Columns that depend on only part of the PK	Move them to a table keyed by what they depend on
3NF	No transitive dependencies (non-key → non-key)	Columns that depend on another non-key column	Extract into a new table with its own PK

💡 The Informal Mnemonic

Every non-key column must depend on:

The key (1NF — data is properly atomic and has a key)
The whole key (2NF — not just part of a composite key)
And nothing but the key (3NF — not on some other non-key column)

"So help me Codd." (A nod to Edgar F. Codd, who invented relational database theory.)

Entity-Relationship Diagrams

An ER diagram is a visual blueprint of your database. It shows tables (entities), their columns (attributes), and the relationships between them. ER diagrams are the standard tool for planning and communicating database designs.

Reading ER Diagrams

The relationship lines tell you how tables connect:

Symbol	Meaning	Example
`\|\|--\|\|`	One-to-one	User has one profile
`\|\|--o{`	One-to-many	Customer places many orders
`}o--o{`	Many-to-many	Students enroll in courses (via junction table)

Our Bookstore ER Diagram

erDiagram customers { int id PK varchar first_name varchar last_name varchar email varchar city } orders { int id PK date order_date int customer_id FK } order_items { int id PK int order_id FK int book_id FK int quantity } books { int id PK varchar title decimal price int author_id FK } authors { int id PK varchar name } customers ||--o{ orders : "places" orders ||--o{ order_items : "contains" books ||--o{ order_items : "listed in" authors ||--o{ books : "writes"

This diagram tells the whole story at a glance: an author writes many books, a customer places many orders, each order contains many items, and each item refers to one book.

💡 Design Before You Code

Always draw an ER diagram before writing CREATE TABLE statements. It's much easier to spot missing relationships, redundant data, or awkward structures in a diagram than in SQL code. Tools like dbdiagram.io, draw.io, or even pen and paper work great.

Design Walkthrough: From Requirements to Tables

Let's design a database from scratch. Here are the requirements for a music streaming service:

📋 Requirements

Users have a name, email, and signup date
Artists have a name and genre
Albums have a title, release year, and belong to one artist
Songs have a title, duration, and belong to one album
Users can create playlists with a name and description
Playlists contain songs (in a specific order)

Step 1: Identify Entities

Nouns in the requirements become tables: users, artists, albums, songs, playlists.

Step 2: Identify Relationships

Artist → Albums: one-to-many (one artist has many albums)
Album → Songs: one-to-many (one album has many songs)
User → Playlists: one-to-many (one user creates many playlists)
Playlist ↔ Songs: many-to-many (a playlist has many songs, a song can be in many playlists) → needs a junction table

Step 3: Draw the ER Diagram

erDiagram users { int id PK varchar name varchar email date signup_date } artists { int id PK varchar name varchar genre } albums { int id PK varchar title int release_year int artist_id FK } songs { int id PK varchar title int duration_seconds int album_id FK } playlists { int id PK varchar name varchar description int user_id FK } playlist_songs { int id PK int playlist_id FK int song_id FK int position } artists ||--o{ albums : "releases" albums ||--o{ songs : "contains" users ||--o{ playlists : "creates" playlists ||--o{ playlist_songs : "includes" songs ||--o{ playlist_songs : "appears in"

Step 4: Write the SQL


CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(150) NOT NULL UNIQUE,
    signup_date DATE NOT NULL DEFAULT (CURRENT_DATE)
);

CREATE TABLE artists (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    genre VARCHAR(50)
);

CREATE TABLE albums (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    release_year YEAR,
    artist_id INT NOT NULL,
    FOREIGN KEY (artist_id) REFERENCES artists(id)
);

CREATE TABLE songs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    duration_seconds INT NOT NULL,
    album_id INT NOT NULL,
    FOREIGN KEY (album_id) REFERENCES albums(id)
);

CREATE TABLE playlists (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    description TEXT,
    user_id INT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

CREATE TABLE playlist_songs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    playlist_id INT NOT NULL,
    song_id INT NOT NULL,
    position INT NOT NULL,
    FOREIGN KEY (playlist_id) REFERENCES playlists(id) ON DELETE CASCADE,
    FOREIGN KEY (song_id) REFERENCES songs(id),
    UNIQUE (playlist_id, position)
);

Step 5: Verify Normalization

1NF — all values are atomic, every table has a primary key ✅
2NF — no composite PKs with partial dependencies (we used auto-increment IDs) ✅
3NF — no non-key column depends on another non-key column (artist genre depends on artist ID, not on album; user email depends on user ID, not on playlist) ✅

✅ Clean, Normalized Design

Each fact is stored once: an artist's genre lives in the artists table, a song's duration lives in the songs table. No redundancy, no anomalies. You can add users without playlists, delete playlists without losing songs, and update an artist's genre in one place.

When to Denormalize

Normalization is the default — but sometimes you intentionally denormalize (add controlled redundancy) for performance or simplicity.

Why Denormalize?

Performance: A dashboard query that JOINs 8 tables on every page load might be too slow. Storing a pre-computed total_spent column on the customer row eliminates the JOINs.
Read-heavy systems: If you read data 10,000 times for every 1 write, optimizing for reads (even at the cost of write complexity) makes sense.
Reporting tables: Analytics and warehousing often use "flat" denormalized tables because analysts need fast, simple queries — not normalized schemas.

Common Denormalization Strategies

Strategy	Example	Trade-Off
Cached computed column	`total_spent DECIMAL` on customers	Must update on every order change
Duplicated column	Store `customer_name` on orders	Must sync when name changes
Summary/reporting table	`monthly_sales` table refreshed nightly	Data is slightly stale

⚠️ The Golden Rule

Normalize first, denormalize only when you have a proven performance problem. Don't skip normalization because you think it might be slow. Measure first, then denormalize the specific bottleneck. Premature denormalization creates the exact anomalies normalization was designed to prevent.

Exercises

🏋️ Exercise 1: Identify Normal Form Violations

Look at this table and identify which normal form(s) it violates:

student_id	student_name	courses	advisor_name	advisor_email
1	Alice	Math, Physics, CS	Dr. Lee	lee@uni.edu
2	Bob	CS, History	Dr. Lee	lee@uni.edu
3	Carol	Math	Dr. Patel	patel@uni.edu

✅ Solution

1NF violation: The courses column contains comma-separated values (not atomic).

3NF violation: advisor_email depends on advisor_name (a non-key column), not on student_id. Dr. Lee's email is duplicated — a transitive dependency.

Fix: Create separate tables for students, courses, advisors, and a junction table for student-course enrollment.

🏋️ Exercise 2: Normalize to 3NF

Take this unnormalized table and normalize it to 3NF. Show the resulting tables with columns and primary/foreign keys.

invoice_id	invoice_date	vendor_name	vendor_city	items	amounts
101	2026-03-01	Acme Corp	Phoenix	Widgets, Bolts	500, 120
102	2026-03-05	Acme Corp	Phoenix	Widgets	250
103	2026-03-10	Beta LLC	Tucson	Gears, Bolts, Springs	300, 80, 150

✅ Solution


-- vendors (extracted to remove transitive dependency)
CREATE TABLE vendors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    city VARCHAR(100)
);

-- invoices (references vendor)
CREATE TABLE invoices (
    id INT AUTO_INCREMENT PRIMARY KEY,
    invoice_date DATE NOT NULL,
    vendor_id INT NOT NULL,
    FOREIGN KEY (vendor_id) REFERENCES vendors(id)
);

-- invoice_items (atomic values, full key dependency)
CREATE TABLE invoice_items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    invoice_id INT NOT NULL,
    item_name VARCHAR(100) NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    FOREIGN KEY (invoice_id) REFERENCES invoices(id)
);

1NF: Comma-separated items/amounts split into individual rows in invoice_items.

2NF: With single-column auto-increment PKs, partial dependencies are eliminated.

3NF: Vendor name and city moved to vendors table — no transitive dependencies.

🏋️ Exercise 3: Design from Requirements

Design a normalized database for a recipe website with these requirements:

Users have a name, email, and join date
Recipes have a title, description, prep time, and cook time
Each recipe is submitted by one user
Recipes have ingredients with quantities and units (e.g., "2 cups flour")
Recipes belong to one or more categories (e.g., "Dessert," "Vegetarian")
Users can save (bookmark) recipes

Draw the ER diagram (or write it as text) and write the CREATE TABLE statements.

✅ Solution


CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(150) NOT NULL UNIQUE,
    join_date DATE NOT NULL DEFAULT (CURRENT_DATE)
);

CREATE TABLE categories (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE recipes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    description TEXT,
    prep_time_minutes INT,
    cook_time_minutes INT,
    user_id INT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

CREATE TABLE recipe_ingredients (
    id INT AUTO_INCREMENT PRIMARY KEY,
    recipe_id INT NOT NULL,
    ingredient_name VARCHAR(100) NOT NULL,
    quantity DECIMAL(8,2),
    unit VARCHAR(30),
    FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE
);

-- Junction table: recipes ↔ categories (M:M)
CREATE TABLE recipe_categories (
    recipe_id INT NOT NULL,
    category_id INT NOT NULL,
    PRIMARY KEY (recipe_id, category_id),
    FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE,
    FOREIGN KEY (category_id) REFERENCES categories(id)
);

-- Junction table: users ↔ recipes bookmarks (M:M)
CREATE TABLE bookmarks (
    user_id INT NOT NULL,
    recipe_id INT NOT NULL,
    saved_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, recipe_id),
    FOREIGN KEY (user_id) REFERENCES users(id),
    FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE
);

Key relationships: users → recipes (1:M), recipes ↔ categories (M:M via junction), users ↔ recipes bookmarks (M:M via junction), recipes → ingredients (1:M).

🎯 Quick Quiz

Question 1: What does "atomic" mean in the context of 1NF?

Question 2: When does 2NF apply?

Question 3: What is a transitive dependency?

Question 4: When is denormalization appropriate?

Summary

🎉 Key Takeaways

Poorly designed tables cause update, insert, and delete anomalies
1NF: Every cell has one atomic value — no comma-separated lists or arrays
2NF: Every non-key column depends on the entire primary key (no partial dependencies)
3NF: No non-key column depends on another non-key column (no transitive dependencies)
The mnemonic: "the key, the whole key, and nothing but the key"
Functional dependencies (A → B) are the reasoning tool for normalization
ER diagrams visualize tables, columns, and relationships — draw them before coding
Design process: identify entities → identify relationships → draw ER diagram → write SQL → verify normalization
Denormalization is a deliberate trade-off for performance — normalize first, measure, then denormalize specific bottlenecks
3NF is sufficient for most applications — higher normal forms (BCNF, 4NF, 5NF) exist but are rarely needed

Quick Reference

Concept	Rule
1NF	Atomic values, unique rows, no repeating groups
2NF	1NF + no partial dependencies on composite keys
3NF	2NF + no transitive dependencies (non-key → non-key)
Functional dependency	A → B means knowing A determines B
Junction table	Resolves many-to-many relationships with FKs to both sides
Denormalization	Controlled redundancy for proven read-performance needs

📚 Additional Resources

🚀 What's Next?

That completes Module 5: Database Design! You now know how to make queries fast (indexes), simplify complex queries (views), and design clean schemas from scratch (normalization). In Module 6: Advanced Features, you'll learn to build stored procedures and functions — reusable blocks of SQL logic that live inside the database and can be called by name.

🎉 Module 5 Complete!

You've mastered the design side of MySQL — indexes for speed, views for simplicity, and normalization for correctness. Your databases won't just work — they'll be fast, clean, and maintainable.

📑 In This Lesson

Why Design Matters

The Problem: A Badly Designed Table

The "Everything" Table

Update, Insert, & Delete Anomalies

1. Update Anomaly

⚠️ The Real Danger

2. Insert Anomaly

3. Delete Anomaly

First Normal Form (1NF)

📜 1NF Rule

The Violation

The Fix: One Fact Per Cell

✅ 1NF Achieved

Functional Dependencies

Examples from Our Table

Second Normal Form (2NF)

📜 2NF Rule

The Violation

The Fix: Split Into Separate Tables

✅ 2NF Achieved

Third Normal Form (3NF)

📜 3NF Rule

The Violation

The Fix: Extract the Customer

✅ 3NF Achieved

Normalization at a Glance

💡 The Informal Mnemonic

Entity-Relationship Diagrams

Reading ER Diagrams

Our Bookstore ER Diagram

💡 Design Before You Code

Design Walkthrough: From Requirements to Tables

📋 Requirements

Step 1: Identify Entities

Step 2: Identify Relationships

Step 3: Draw the ER Diagram

Step 4: Write the SQL

Step 5: Verify Normalization

✅ Clean, Normalized Design

When to Denormalize

Why Denormalize?

Common Denormalization Strategies

⚠️ The Golden Rule

Exercises

🏋️ Exercise 1: Identify Normal Form Violations

🏋️ Exercise 2: Normalize to 3NF

🏋️ Exercise 3: Design from Requirements

🎯 Quick Quiz

Summary

🎉 Key Takeaways

Quick Reference

📚 Additional Resources

🚀 What's Next?

🎉 Module 5 Complete!