Skip to main content

🏗️ Lesson 16: Database Design & Normalization

You've learned to create tables, define relationships, write JOINs, optimize with indexes, and simplify with views. But how do you design the tables correctly in the first place? A poorly designed schema leads to duplicated data, update headaches, and mysterious bugs. Normalization is the systematic process of organizing tables so that data is stored once, in the right place, with clear relationships. It's the difference between a database that works and one that works well.

🎯 Learning Objectives

By the end of this lesson, you will be able to:

  • Identify the problems caused by poorly designed tables
  • Apply First Normal Form (1NF) to eliminate repeating groups
  • Apply Second Normal Form (2NF) to remove partial dependencies
  • Apply Third Normal Form (3NF) to remove transitive dependencies
  • Read and create Entity-Relationship (ER) diagrams
  • Design a normalized schema from requirements
  • Know when denormalization is the right trade-off

Estimated Time: 50 minutes

Prerequisites: Views (Lesson 15)

📑 In This Lesson

Why Design Matters

Imagine a spreadsheet where one column is "Customer Name," another is "Customer Email," another is "Customer Phone" — and these same values are repeated on every single order row. Change a customer's email? You'd have to update it in 50 different rows. Miss one, and now you have two conflicting emails for the same person.

That's what happens in an unnormalized database. Normalization fixes these problems by organizing data into well-structured tables where each fact is stored exactly once.

graph LR subgraph bad["❌ Unnormalized"] B["Everything in one giant table
→ Duplicated data everywhere"] end subgraph good["✅ Normalized"] G1["customers"] G2["orders"] G3["books"] G1 --- G2 G2 --- G3 end bad -->|"Normalization"| good

The Problem: A Badly Designed Table

Let's start with a table that stores everything about orders in one flat structure — the kind of thing you might see in a spreadsheet export:

The "Everything" Table

order_id order_date cust_name cust_email cust_city book_titles book_prices quantities
1 2026-01-15 Alice Johnson alice@example.com Las Vegas Dune, 1984 14.99, 9.99 1, 2
2 2026-02-20 Alice Johnson alice@example.com Las Vegas Pride and Prejudice 7.99 1
3 2026-03-10 Bob Smith bob@example.com Denver Dune 14.99 1

At first glance it looks fine. But look closer — there are serious problems hiding in this design.

Update, Insert, & Delete Anomalies

The "everything" table suffers from three classic problems:

1. Update Anomaly

Alice's email appears in both rows. If she changes her email, you must update every single row containing her data. Miss one? Now you have two different emails for the same customer. Which is correct?

⚠️ The Real Danger

With 1,000 orders from Alice, you'd need to update 1,000 rows — and they must all succeed. One failure creates an inconsistency that's hard to detect and harder to fix.

2. Insert Anomaly

You want to add a new customer (Carol) who hasn't placed an order yet. But every row requires an order_id and order_date. You can't add a customer without faking an order — or leaving required fields NULL.

3. Delete Anomaly

If you delete Bob's only order (order 3), you lose all information about Bob — his name, email, and city are gone. Deleting an order shouldn't erase a customer.

graph TD A["One Big Table"] --> B["Update Anomaly
Change email in 50 rows?"] A --> C["Insert Anomaly
Can't add customer without order"] A --> D["Delete Anomaly
Deleting order erases customer"] B --> E["Solution: Normalization"] C --> E D --> E

These problems exist because different facts are mixed together in one table. Customer data, order data, and book data should live in separate tables. That's what normalization achieves.

First Normal Form (1NF)

📜 1NF Rule

A table is in First Normal Form if:

  • Every column contains only atomic (single, indivisible) values
  • There are no repeating groups or arrays in a single cell
  • Each row is unique (has a primary key)

The Violation

Look at our "everything" table — the book_titles, book_prices, and quantities columns contain comma-separated lists:

order_idbook_titlesbook_pricesquantities
1Dune, 198414.99, 9.991, 2

"Dune, 1984" in one cell is not atomic — it's two values crammed together. You can't easily query "find all orders containing 1984" without messy string parsing.

The Fix: One Fact Per Cell

Split each item into its own row:

order_idorder_datecust_namecust_emailcust_citybook_titlebook_pricequantity
12026-01-15Alice Johnsonalice@example.comLas VegasDune14.991
12026-01-15Alice Johnsonalice@example.comLas Vegas19849.992
22026-02-20Alice Johnsonalice@example.comLas VegasPride and Prejudice7.991
32026-03-10Bob Smithbob@example.comDenverDune14.991

Now every cell has exactly one value. The primary key would be (order_id, book_title) — a composite key identifying each unique order-item combination.

✅ 1NF Achieved

No more comma-separated values. Each cell is atomic, each row is unique. But we still have massive redundancy — Alice's name, email, and city are repeated on every row. That's what 2NF addresses.

Functional Dependencies

Before tackling 2NF and 3NF, you need to understand functional dependencies — they're the reasoning tool behind normalization.

A functional dependency means: if you know the value of column A, you can determine the value of column B. We write this as:

A → B    (A determines B)

Examples from Our Table

DependencyMeaning
order_id → order_dateKnowing the order ID tells you the date
order_id → cust_name, cust_email, cust_cityKnowing the order ID tells you which customer
book_title → book_priceKnowing the book tells you its price
(order_id, book_title) → quantityKnowing the order AND book tells you the quantity

The composite primary key is (order_id, book_title). Notice that some columns depend on the whole key (quantity), while others depend on only part of it (order_date depends only on order_id). That partial dependency is exactly what 2NF eliminates.

Second Normal Form (2NF)

📜 2NF Rule

A table is in Second Normal Form if:

  • It's already in 1NF
  • Every non-key column depends on the entire primary key, not just part of it

2NF only applies to tables with composite primary keys. A table with a single-column PK is automatically in 2NF if it's in 1NF.

The Violation

Our 1NF table has primary key (order_id, book_title). But several columns depend on only part of that key:

  • order_date, cust_name, cust_email, cust_city depend on order_id alone
  • book_price depends on book_title alone

Only quantity depends on the full composite key (order_id + book_title).

The Fix: Split Into Separate Tables

Move each group of columns to its own table, keyed by the column they actually depend on:

erDiagram orders { int order_id PK date order_date varchar cust_name varchar cust_email varchar cust_city } books { varchar title PK decimal price } order_items { int order_id FK varchar book_title FK int quantity } orders ||--o{ order_items : "has" books ||--o{ order_items : "in"

orders — columns that depend on order_id

order_idorder_datecust_namecust_emailcust_city
12026-01-15Alice Johnsonalice@example.comLas Vegas
22026-02-20Alice Johnsonalice@example.comLas Vegas
32026-03-10Bob Smithbob@example.comDenver

books — columns that depend on book_title

titleprice
Dune14.99
19849.99
Pride and Prejudice7.99

order_items — columns that depend on the full key (order_id, book_title)

order_idbook_titlequantity
1Dune1
119842
2Pride and Prejudice1
3Dune1

✅ 2NF Achieved

No partial dependencies — every non-key column depends on the entire primary key in its table. Book prices are stored once (in books), not repeated per order. But notice: the orders table still has Alice's name, email, and city repeated across two rows. That's a transitive dependency — and 3NF will fix it.

Third Normal Form (3NF)

📜 3NF Rule

A table is in Third Normal Form if:

  • It's already in 2NF
  • No non-key column depends on another non-key column (no transitive dependencies)

Or informally: "Every non-key column must depend on the key, the whole key, and nothing but the key."

The Violation

In our orders table:

order_id → cust_name → cust_email, cust_city

The customer's email and city don't depend on the order — they depend on the customer. If Alice changes her email, you still have to update every order row that mentions her. That's a transitive dependency: order_id determines cust_name, and cust_name determines cust_email.

The Fix: Extract the Customer

Move customer data into its own table and reference it with a foreign key:

erDiagram customers { int id PK varchar name varchar email varchar city } orders { int id PK date order_date int customer_id FK } books { int id PK varchar title decimal price } order_items { int id PK int order_id FK int book_id FK int quantity } customers ||--o{ orders : "places" orders ||--o{ order_items : "contains" books ||--o{ order_items : "included in"

customers

idnameemailcity
1Alice Johnsonalice@example.comLas Vegas
2Bob Smithbob@example.comDenver

orders — now just stores the order itself and a reference to the customer

idorder_datecustomer_id
12026-01-151
22026-02-201
32026-03-102

✅ 3NF Achieved

Alice's email is stored exactly once — in the customers table. Change it there and it's updated everywhere. No update anomalies. No transitive dependencies. Every non-key column depends on the key, the whole key, and nothing but the key.

This final schema is exactly the bookstore design we've been using throughout this course — customers, orders, order_items, books, authors. Now you know why it's structured that way.

Normalization at a Glance

Normal Form Rule What It Eliminates How to Fix
1NF Atomic values, no repeating groups Comma-separated values, arrays in cells One value per cell, one row per item
2NF No partial dependencies (on part of composite key) Columns that depend on only part of the PK Move them to a table keyed by what they depend on
3NF No transitive dependencies (non-key → non-key) Columns that depend on another non-key column Extract into a new table with its own PK
graph LR U["Unnormalized"] -->|"Atomic values"| N1["1NF"] N1 -->|"Remove partial deps"| N2["2NF"] N2 -->|"Remove transitive deps"| N3["3NF"]

💡 The Informal Mnemonic

Every non-key column must depend on:

  • The key (1NF — data is properly atomic and has a key)
  • The whole key (2NF — not just part of a composite key)
  • And nothing but the key (3NF — not on some other non-key column)

"So help me Codd." (A nod to Edgar F. Codd, who invented relational database theory.)

Entity-Relationship Diagrams

An ER diagram is a visual blueprint of your database. It shows tables (entities), their columns (attributes), and the relationships between them. ER diagrams are the standard tool for planning and communicating database designs.

Reading ER Diagrams

The relationship lines tell you how tables connect:

SymbolMeaningExample
||--||One-to-oneUser has one profile
||--o{One-to-manyCustomer places many orders
}o--o{Many-to-manyStudents enroll in courses (via junction table)

Our Bookstore ER Diagram

erDiagram customers { int id PK varchar first_name varchar last_name varchar email varchar city } orders { int id PK date order_date int customer_id FK } order_items { int id PK int order_id FK int book_id FK int quantity } books { int id PK varchar title decimal price int author_id FK } authors { int id PK varchar name } customers ||--o{ orders : "places" orders ||--o{ order_items : "contains" books ||--o{ order_items : "listed in" authors ||--o{ books : "writes"

This diagram tells the whole story at a glance: an author writes many books, a customer places many orders, each order contains many items, and each item refers to one book.

💡 Design Before You Code

Always draw an ER diagram before writing CREATE TABLE statements. It's much easier to spot missing relationships, redundant data, or awkward structures in a diagram than in SQL code. Tools like dbdiagram.io, draw.io, or even pen and paper work great.

Design Walkthrough: From Requirements to Tables

Let's design a database from scratch. Here are the requirements for a music streaming service:

📋 Requirements

  • Users have a name, email, and signup date
  • Artists have a name and genre
  • Albums have a title, release year, and belong to one artist
  • Songs have a title, duration, and belong to one album
  • Users can create playlists with a name and description
  • Playlists contain songs (in a specific order)

Step 1: Identify Entities

Nouns in the requirements become tables: users, artists, albums, songs, playlists.

Step 2: Identify Relationships

  • Artist → Albums: one-to-many (one artist has many albums)
  • Album → Songs: one-to-many (one album has many songs)
  • User → Playlists: one-to-many (one user creates many playlists)
  • Playlist ↔ Songs: many-to-many (a playlist has many songs, a song can be in many playlists) → needs a junction table

Step 3: Draw the ER Diagram

erDiagram users { int id PK varchar name varchar email date signup_date } artists { int id PK varchar name varchar genre } albums { int id PK varchar title int release_year int artist_id FK } songs { int id PK varchar title int duration_seconds int album_id FK } playlists { int id PK varchar name varchar description int user_id FK } playlist_songs { int id PK int playlist_id FK int song_id FK int position } artists ||--o{ albums : "releases" albums ||--o{ songs : "contains" users ||--o{ playlists : "creates" playlists ||--o{ playlist_songs : "includes" songs ||--o{ playlist_songs : "appears in"

Step 4: Write the SQL


CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(150) NOT NULL UNIQUE,
    signup_date DATE NOT NULL DEFAULT (CURRENT_DATE)
);

CREATE TABLE artists (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    genre VARCHAR(50)
);

CREATE TABLE albums (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    release_year YEAR,
    artist_id INT NOT NULL,
    FOREIGN KEY (artist_id) REFERENCES artists(id)
);

CREATE TABLE songs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    duration_seconds INT NOT NULL,
    album_id INT NOT NULL,
    FOREIGN KEY (album_id) REFERENCES albums(id)
);

CREATE TABLE playlists (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    description TEXT,
    user_id INT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

CREATE TABLE playlist_songs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    playlist_id INT NOT NULL,
    song_id INT NOT NULL,
    position INT NOT NULL,
    FOREIGN KEY (playlist_id) REFERENCES playlists(id) ON DELETE CASCADE,
    FOREIGN KEY (song_id) REFERENCES songs(id),
    UNIQUE (playlist_id, position)
);
                

Step 5: Verify Normalization

  • 1NF — all values are atomic, every table has a primary key ✅
  • 2NF — no composite PKs with partial dependencies (we used auto-increment IDs) ✅
  • 3NF — no non-key column depends on another non-key column (artist genre depends on artist ID, not on album; user email depends on user ID, not on playlist) ✅

✅ Clean, Normalized Design

Each fact is stored once: an artist's genre lives in the artists table, a song's duration lives in the songs table. No redundancy, no anomalies. You can add users without playlists, delete playlists without losing songs, and update an artist's genre in one place.

When to Denormalize

Normalization is the default — but sometimes you intentionally denormalize (add controlled redundancy) for performance or simplicity.

Why Denormalize?

  • Performance: A dashboard query that JOINs 8 tables on every page load might be too slow. Storing a pre-computed total_spent column on the customer row eliminates the JOINs.
  • Read-heavy systems: If you read data 10,000 times for every 1 write, optimizing for reads (even at the cost of write complexity) makes sense.
  • Reporting tables: Analytics and warehousing often use "flat" denormalized tables because analysts need fast, simple queries — not normalized schemas.

Common Denormalization Strategies

StrategyExampleTrade-Off
Cached computed column total_spent DECIMAL on customers Must update on every order change
Duplicated column Store customer_name on orders Must sync when name changes
Summary/reporting table monthly_sales table refreshed nightly Data is slightly stale

⚠️ The Golden Rule

Normalize first, denormalize only when you have a proven performance problem. Don't skip normalization because you think it might be slow. Measure first, then denormalize the specific bottleneck. Premature denormalization creates the exact anomalies normalization was designed to prevent.

Exercises

🏋️ Exercise 1: Identify Normal Form Violations

Look at this table and identify which normal form(s) it violates:

student_idstudent_namecoursesadvisor_nameadvisor_email
1AliceMath, Physics, CSDr. Leelee@uni.edu
2BobCS, HistoryDr. Leelee@uni.edu
3CarolMathDr. Patelpatel@uni.edu
✅ Solution

1NF violation: The courses column contains comma-separated values (not atomic).

3NF violation: advisor_email depends on advisor_name (a non-key column), not on student_id. Dr. Lee's email is duplicated — a transitive dependency.

Fix: Create separate tables for students, courses, advisors, and a junction table for student-course enrollment.

🏋️ Exercise 2: Normalize to 3NF

Take this unnormalized table and normalize it to 3NF. Show the resulting tables with columns and primary/foreign keys.

invoice_idinvoice_datevendor_namevendor_cityitemsamounts
1012026-03-01Acme CorpPhoenixWidgets, Bolts500, 120
1022026-03-05Acme CorpPhoenixWidgets250
1032026-03-10Beta LLCTucsonGears, Bolts, Springs300, 80, 150
✅ Solution

-- vendors (extracted to remove transitive dependency)
CREATE TABLE vendors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    city VARCHAR(100)
);

-- invoices (references vendor)
CREATE TABLE invoices (
    id INT AUTO_INCREMENT PRIMARY KEY,
    invoice_date DATE NOT NULL,
    vendor_id INT NOT NULL,
    FOREIGN KEY (vendor_id) REFERENCES vendors(id)
);

-- invoice_items (atomic values, full key dependency)
CREATE TABLE invoice_items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    invoice_id INT NOT NULL,
    item_name VARCHAR(100) NOT NULL,
    amount DECIMAL(10,2) NOT NULL,
    FOREIGN KEY (invoice_id) REFERENCES invoices(id)
);
                        

1NF: Comma-separated items/amounts split into individual rows in invoice_items.

2NF: With single-column auto-increment PKs, partial dependencies are eliminated.

3NF: Vendor name and city moved to vendors table — no transitive dependencies.

🏋️ Exercise 3: Design from Requirements

Design a normalized database for a recipe website with these requirements:

  • Users have a name, email, and join date
  • Recipes have a title, description, prep time, and cook time
  • Each recipe is submitted by one user
  • Recipes have ingredients with quantities and units (e.g., "2 cups flour")
  • Recipes belong to one or more categories (e.g., "Dessert," "Vegetarian")
  • Users can save (bookmark) recipes

Draw the ER diagram (or write it as text) and write the CREATE TABLE statements.

✅ Solution

CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(150) NOT NULL UNIQUE,
    join_date DATE NOT NULL DEFAULT (CURRENT_DATE)
);

CREATE TABLE categories (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE recipes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    description TEXT,
    prep_time_minutes INT,
    cook_time_minutes INT,
    user_id INT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id)
);

CREATE TABLE recipe_ingredients (
    id INT AUTO_INCREMENT PRIMARY KEY,
    recipe_id INT NOT NULL,
    ingredient_name VARCHAR(100) NOT NULL,
    quantity DECIMAL(8,2),
    unit VARCHAR(30),
    FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE
);

-- Junction table: recipes ↔ categories (M:M)
CREATE TABLE recipe_categories (
    recipe_id INT NOT NULL,
    category_id INT NOT NULL,
    PRIMARY KEY (recipe_id, category_id),
    FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE,
    FOREIGN KEY (category_id) REFERENCES categories(id)
);

-- Junction table: users ↔ recipes bookmarks (M:M)
CREATE TABLE bookmarks (
    user_id INT NOT NULL,
    recipe_id INT NOT NULL,
    saved_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, recipe_id),
    FOREIGN KEY (user_id) REFERENCES users(id),
    FOREIGN KEY (recipe_id) REFERENCES recipes(id) ON DELETE CASCADE
);
                        

Key relationships: users → recipes (1:M), recipes ↔ categories (M:M via junction), users ↔ recipes bookmarks (M:M via junction), recipes → ingredients (1:M).

🎯 Quick Quiz

Question 1: What does "atomic" mean in the context of 1NF?

Question 2: When does 2NF apply?

Question 3: What is a transitive dependency?

Question 4: When is denormalization appropriate?

Summary

🎉 Key Takeaways

  • Poorly designed tables cause update, insert, and delete anomalies
  • 1NF: Every cell has one atomic value — no comma-separated lists or arrays
  • 2NF: Every non-key column depends on the entire primary key (no partial dependencies)
  • 3NF: No non-key column depends on another non-key column (no transitive dependencies)
  • The mnemonic: "the key, the whole key, and nothing but the key"
  • Functional dependencies (A → B) are the reasoning tool for normalization
  • ER diagrams visualize tables, columns, and relationships — draw them before coding
  • Design process: identify entities → identify relationships → draw ER diagram → write SQL → verify normalization
  • Denormalization is a deliberate trade-off for performance — normalize first, measure, then denormalize specific bottlenecks
  • 3NF is sufficient for most applications — higher normal forms (BCNF, 4NF, 5NF) exist but are rarely needed

Quick Reference

Concept Rule
1NF Atomic values, unique rows, no repeating groups
2NF 1NF + no partial dependencies on composite keys
3NF 2NF + no transitive dependencies (non-key → non-key)
Functional dependency A → B means knowing A determines B
Junction table Resolves many-to-many relationships with FKs to both sides
Denormalization Controlled redundancy for proven read-performance needs

📚 Additional Resources

🚀 What's Next?

That completes Module 5: Database Design! You now know how to make queries fast (indexes), simplify complex queries (views), and design clean schemas from scratch (normalization). In Module 6: Advanced Features, you'll learn to build stored procedures and functions — reusable blocks of SQL logic that live inside the database and can be called by name.

🎉 Module 5 Complete!

You've mastered the design side of MySQL — indexes for speed, views for simplicity, and normalization for correctness. Your databases won't just work — they'll be fast, clean, and maintainable.