Introduction
When building data pipelines and analytics systems, your SQL table design is the foundation everything else is built on. A single wrong data type or a missing constraint can lead to:
- 🐌 Slow queries that take minutes instead of seconds
- 💸 Inflated cloud costs from wasted storage
- 🔥 Data quality issues that break your dashboards
- ⚠️ Pipeline failures at 3 AM
In this guide, we'll explore the 4 most common SQL table design pitfalls that trip up even experienced data engineers—and how to avoid them.
1. Using VARCHAR/NVARCHAR for Numeric Columns
The Problem
Storing numbers as text (VARCHAR, NVARCHAR, TEXT) instead of proper numeric types (INT, BIGINT, DECIMAL) is one of the most common mistakes.
Operating Model for Data Quality
Process trumps technology. Adopt a lightweight governance model:
- Define Data Owners: Assign accountable stakeholders for every critical dataset.
- Establish SLAs: Agree on freshness, accuracy, and availability targets per data product.
- Automate Alerts: Route failures to the owning squad via Slack/Jira with triage playbooks.
- Run Post-Incident Reviews: Track root causes, remediation, and preventive actions.
Data Quality Metrics Dashboard
| Metric | Definition | Target | Alert Threshold |
|---|---|---|---|
| Freshness Lag | Minutes since dataset last updated | < 15 minutes | > 30 minutes |
| Null Rate | Percentage of required fields that are null | < 0.5% | > 2% |
| Duplicate Key Count | Rows violating primary/unique constraints | 0 | > 0 |
| Distribution Drift | KS test score vs rolling 30-day baseline | < 0.1 | > 0.2 |
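To make the first two metrics concrete, here is a minimal SQL sketch (PostgreSQL-style syntax; the `orders` table and its `updated_at` and `customer_email` columns are illustrative placeholders):

```sql
-- Freshness lag: minutes since the most recent row was updated
SELECT
    EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(updated_at))) / 60 AS freshness_lag_minutes
FROM orders;

-- Null rate: percentage of rows missing a required field
SELECT
    100.0 * COUNT(*) FILTER (WHERE customer_email IS NULL)
          / NULLIF(COUNT(*), 0) AS null_rate_pct
FROM orders;
```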
Tool Stack Comparison
- Lightweight (Startup): dbt tests + Great Expectations + Metabase alerts.
- Growth Stage: dbt + Soda + Monte Carlo for lineage-aware observability.
- Enterprise: Collibra for governance, Bigeye for automated rule discovery, ServiceNow for workflow.
Embedding Quality in the SDLC
- Shifting Left: Add data test suites to CI pipelines so issues are caught before deployment (a SQL assertion sketch follows this list).
- Quality Scorecards: Review data health metrics in sprint ceremonies alongside velocity and defect trends.
- Test Data Factories: Generate synthetic datasets to exercise edge cases and privacy-safe QA environments.
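As a sketch of what shifting left can look like in practice, a CI data test can be an assertion query that must return zero rows; the tables and accepted values below are assumptions borrowed from the examples later in this article:

```sql
-- Assertion: fails (returns rows) if any order references a missing user
SELECT o.order_id
FROM orders o
LEFT JOIN users u ON u.user_id = o.user_id
WHERE u.user_id IS NULL;

-- Assertion: fails if a status value falls outside the accepted set
SELECT order_id, status
FROM orders
WHERE status NOT IN ('pending', 'completed', 'cancelled');
```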
Change Management Toolkit
- Incident Simulations: Run quarterly game days to rehearse responses to data quality regressions.
- Communities of Practice: Facilitate cross-team guilds where analysts and engineers share testing patterns.
- Executive Visibility: Publish trust dashboards with freshness, accuracy, and incident recovery times.
Roadmap for Continuous Improvement
- Quarter 1: Implement foundational tests (nulls, uniqueness, accepted values) on critical tables.
- Quarter 2: Instrument freshness/volume monitors and publish trust scores alongside dashboards.
- Quarter 3: Launch a data incident program with SLAs, runbooks, and ownership.
- Quarter 4: Introduce predictive quality scores leveraging anomaly detection and machine learning.
Example
```sql
-- ❌ BAD: Storing age as VARCHAR
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100),
    age     VARCHAR(10),  -- This should be INT!
    salary  VARCHAR(20)   -- This should be DECIMAL!
);

-- ✅ GOOD: Using proper numeric types
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name    VARCHAR(100),
    age     INT,
    salary  DECIMAL(12, 2)
);
```
Impact on Your System
| Impact Area | What Happens | Example |
|---|---|---|
| Query Performance | No index optimization; full table scans | WHERE age > 30 can't use indexes efficiently |
| Sorting Issues | Lexicographic sorting instead of numeric | "100" < "20" < "3" (wrong order!) |
| Type Casting | Constant CAST() operations slow queries | CAST(salary AS DECIMAL) in every query |
| Storage Cost | 2-10x more memory and disk space | VARCHAR(10) uses 12+ bytes vs INT using 4 bytes |
Real-World Performance Impact
```sql
-- With VARCHAR: ~850ms
SELECT COUNT(*) FROM orders WHERE order_total > '1000';

-- With DECIMAL: ~45ms (19x faster!)
SELECT COUNT(*) FROM orders WHERE order_total > 1000;
```
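If you inherit a table like this, one way to repair it in place is to find the values that will not cast cleanly before converting the column. The sketch below assumes PostgreSQL and a VARCHAR `order_total` column; adapt the pattern and target type to your schema:

```sql
-- Step 1: find values that would fail a cast to DECIMAL (non-negative amounts only)
SELECT order_id, order_total
FROM orders
WHERE order_total !~ '^[0-9]+(\.[0-9]+)?$';

-- Step 2: once offending rows are fixed, convert the column type in place
ALTER TABLE orders
    ALTER COLUMN order_total TYPE DECIMAL(12, 2)
    USING order_total::DECIMAL(12, 2);
```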
2. Inconsistent Data Types for the Same Logical Field
The Problem
Using different data types for the same logical entity across tables creates a cascading nightmare for joins, queries, and data quality.
Example
```sql
-- ❌ BAD: Inconsistent UserID types
CREATE TABLE users (
    user_id INT PRIMARY KEY,   -- INT here
    email   VARCHAR(255)
);

CREATE TABLE orders (
    order_id   BIGINT PRIMARY KEY,
    user_id    VARCHAR(50),    -- VARCHAR here!
    order_date DATE
);

-- ✅ GOOD: Consistent UserID type everywhere
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    email   VARCHAR(255)
);

CREATE TABLE orders (
    order_id   BIGINT PRIMARY KEY,
    user_id    BIGINT,         -- Consistent!
    order_date DATE
);
```
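You can catch this pitfall in an existing database by querying the information schema for column names declared with more than one data type. This sketch assumes PostgreSQL and the `public` schema:

```sql
-- Column names that appear with more than one declared data type
SELECT column_name,
       COUNT(DISTINCT data_type) AS distinct_types,
       STRING_AGG(DISTINCT data_type, ', ') AS types_in_use
FROM information_schema.columns
WHERE table_schema = 'public'
GROUP BY column_name
HAVING COUNT(DISTINCT data_type) > 1
ORDER BY column_name;
```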
3. Using Overly Large Data Types
The Problem
"Just use BIGINT and VARCHAR(255) everywhere to be safe" seems convenient, but it wastes massive amounts of resources.
Storage Impact Calculation
For 1 million rows (worst case, assuming every value fills the declared width; for variable-length columns the waste shows up mostly in memory grants, sort buffers, and caches sized from the declared length):
| Column | Bad Type | Good Type | Waste |
|---|---|---|---|
| product_id | BIGINT (8 bytes) | INT (4 bytes) | 4 MB |
| sku | VARCHAR(255) | VARCHAR(20) | 235 MB |
| name | VARCHAR(255) | VARCHAR(100) | 155 MB |
| TOTAL | - | - | 394 MB |
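Before shrinking a column, measure what is actually stored. A quick profile like the sketch below (PostgreSQL syntax; the `products` table and columns are illustrative) shows how far the declared width is from reality:

```sql
-- Compare the longest stored value with the declared width
SELECT
    MAX(LENGTH(sku))  AS max_sku_length,   -- declared VARCHAR(255)
    MAX(LENGTH(name)) AS max_name_length   -- declared VARCHAR(255)
FROM products;
```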
4. Not Defining Constraints
The Problem
Skipping NOT NULL, CHECK, UNIQUE, and FOREIGN KEY constraints is like driving without seatbelts—you'll be fine until you're not.
Example
```sql
-- ❌ BAD: No constraints
CREATE TABLE orders (
    order_id    INT,
    user_id     INT,
    email       VARCHAR(255),
    order_total DECIMAL(10,2),
    status      VARCHAR(20),
    created_at  TIMESTAMP
);

-- ✅ GOOD: Proper constraints
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    user_id     INT NOT NULL,
    email       VARCHAR(255) NOT NULL,
    order_total DECIMAL(10,2) NOT NULL CHECK (order_total >= 0),
    status      VARCHAR(20) NOT NULL CHECK (status IN ('pending', 'completed', 'cancelled')),
    created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);
```
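Constraints can also be retrofitted onto an existing table. The sketch below assumes PostgreSQL and the unconstrained `orders` table from the BAD example; each statement will fail if existing rows violate the rule, which is exactly the feedback you want before trusting the data:

```sql
-- Retrofit constraints onto an existing, unconstrained table
ALTER TABLE orders ADD PRIMARY KEY (order_id);
ALTER TABLE orders ALTER COLUMN user_id SET NOT NULL;
ALTER TABLE orders
    ADD CONSTRAINT chk_order_total CHECK (order_total >= 0);
ALTER TABLE orders
    ADD CONSTRAINT chk_status CHECK (status IN ('pending', 'completed', 'cancelled'));
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_user FOREIGN KEY (user_id) REFERENCES users (user_id);
```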
Best Practices Summary
✅ DO These
- Choose Correct Data Types
  - Use `INT` for 99% of IDs
  - Use `DECIMAL` for money, not `FLOAT`
  - Use `DATE`/`TIMESTAMP` for dates, not strings
  - Use `BOOLEAN` for true/false, not `INT`
- Maintain Consistency - Use the same types for the same logical entities across all tables
- Right-Size Your Columns - Analyze actual data ranges and allow 20-30% room for growth
- Use Constraints Everywhere - `NOT NULL`, `CHECK`, `FOREIGN KEY`, `UNIQUE`
❌ DON'T Do These
- ❌ Store numbers as VARCHAR
- ❌ Use different types for same entity
- ❌ Default to BIGINT/VARCHAR(255) everywhere
- ❌ Skip constraints "for flexibility"
- ❌ Ignore data type warnings
Case Study: Repairing a Legacy Order Warehouse
One of our customers ran a commerce platform that had grown for seven years without a dedicated data engineer. Every table was created in the application ORM with default settings (VARCHAR(255) everywhere, no constraints, mixed collations). Analytical queries were taking 17 minutes on average and Looker dashboards timed out daily.
- Discovery: 96% of numeric columns were stored as strings, 40% of foreign-key relationships had broken references, and the largest table consumed 1.8 TB on disk.
- Intervention: We staged a mirror schema, rewrote the DDL with precise data types, and introduced `CHECK`/`FOREIGN KEY` constraints plus archival tables for history.
- Outcome: Query latency dropped to 38 seconds (27x faster), storage footprint shrank by 62%, and the finance team finally trusted end-of-month metrics.
Migration Checklist
- Profile every column with row counts, min/max, and distinct values.
- Define canonical types per domain entity (e.g., `order_id`, `customer_id`).
- Create a staging schema that mirrors production but with corrected definitions.
- Backfill data using `INSERT ... SELECT` with the necessary casts, and log every row that fails (see the sketch after this checklist).
- Swap with zero downtime via views or rename operations, and monitor post-cutover metrics.
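A minimal sketch of the backfill step, assuming PostgreSQL, a corrected `staging.orders` table, a `legacy.orders` source, and a `staging.orders_rejects` quarantine table with the legacy structure (all names are illustrative):

```sql
-- Load rows whose numeric fields cast cleanly into the corrected schema
INSERT INTO staging.orders (order_id, user_id, order_total, order_date)
SELECT order_id,
       user_id::BIGINT,
       order_total::DECIMAL(12, 2),
       order_date
FROM legacy.orders
WHERE user_id     ~ '^[0-9]+$'
  AND order_total ~ '^[0-9]+(\.[0-9]+)?$';

-- Quarantine everything else (IS NOT TRUE also catches NULLs) so no failed row is lost
INSERT INTO staging.orders_rejects
SELECT *
FROM legacy.orders
WHERE (user_id ~ '^[0-9]+$' AND order_total ~ '^[0-9]+(\.[0-9]+)?$') IS NOT TRUE;
```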
Tooling Recommendations
Schema quality is easier when you automate guardrails:
- Data Modeling: Dbdiagram or SQLDBM to socialize table designs with product teams.
- Static Analysis: sqlfluff and schemachange to lint DDL before it reaches production.
- Observability: Metaplane or Monte Carlo to alert when column types drift or constraints start failing.
- Workflow Orchestration: dbt for code-as-config migrations and Airflow for phased rollouts.
Schema Evolution Playbook
Production schemas change constantly as the business evolves. Reduce risk by engineering evolution into your lifecycle:
- Version Contracts: Maintain schema versions in the catalog and expose compatibility notes with every release.
- Shadow Deployments: Populate new columns in parallel using views before swapping consumers (see the view-swap sketch after this list).
- Contract Testing: Gate pull requests with automated schema diff checks to detect breaking changes early.
- Rollback Ready: Store reversible migration scripts and automate point-in-time restores during releases.
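As a sketch of the shadow-deployment idea, consumers read through a view so the cutover is a single transactional metadata change. Table and view names are illustrative, and the syntax assumes PostgreSQL:

```sql
-- Build the corrected table alongside the legacy one
CREATE TABLE orders_v2 (
    order_id    BIGINT PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    order_total DECIMAL(12, 2) NOT NULL CHECK (order_total >= 0)
);

-- Consumers query the view, not the table
CREATE VIEW orders_current AS
SELECT order_id, user_id, order_total FROM orders;

-- After backfilling and validating orders_v2, swap the view in one transaction
BEGIN;
DROP VIEW orders_current;
CREATE VIEW orders_current AS
SELECT order_id, user_id, order_total FROM orders_v2;
COMMIT;
```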
Checklist Before Shipping a New Table
- ✅ Business glossary entry with owner, refresh cadence, and downstream consumers
- ✅ Primary keys, foreign keys, and indexes validated against workload expectations
- ✅ Access controls and masking policies reviewed with the security team
- ✅ Load/perf tests executed at projected three-year scale
- ✅ Monitoring dashboards and alert thresholds configured in production
Conclusion
SQL table design is a critical skill for data engineers. These four pitfalls—wrong data types, inconsistent types, oversized columns, and missing constraints—can:
- 📉 Slow down queries by 10-100x
- 💰 Increase cloud costs by 30-50%
- 🐛 Create data quality nightmares
- ⏰ Cause production incidents
Your action items:
- Audit your existing schemas today
- Create a data dictionary for your team
- Add constraints to critical tables
- Review this checklist for every new table
Need Help Optimizing Your Data Infrastructure?
At Eficsy, we specialize in building robust, scalable data pipelines and optimizing database performance. Our data engineering experts can:
- ✅ Audit your existing schemas
- ✅ Design performant data architectures
- ✅ Build reliable ETL/ELT pipelines
- ✅ Reduce cloud infrastructure costs
Contact us for a free consultation: Get Started →