The Hidden Cost of Bad Data
Imagine your CEO making a strategic decision based on a dashboard that shows 20% growth when, in reality, a bug that duplicated records masked a 10% decline. Data trust is hard to gain and easy to lose.
The 6 Dimensions of Data Quality
1. Accuracy
Does the data reflect reality? (e.g., Is the customer's age actually 25?)
2. Completeness
Is all the required data present? (e.g., No missing timestamps)
3. Consistency
Does data match across systems? (e.g., The CRM and the ERP report the same $100 balance for one customer)
4. Timeliness
Is the data available when needed? (e.g., Real-time availability vs. a delayed nightly batch)
5. Uniqueness
Are there duplicates? (e.g., Same order ID appearing twice)
6. Validity
Does data follow the rules? (e.g., An email address must contain an @ symbol)
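None of these dimensions requires heavy tooling to check. As a rough illustration, the sketch below spot-checks uniqueness, completeness, validity, and timeliness on a toy pandas DataFrame; the column names and the one-day freshness threshold are assumptions made for the example.

# Spot-check several quality dimensions on a toy dataset with pandas.
# Column names and thresholds are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],                       # note the duplicate
    "email": ["a@example.com", None, "not-an-email", "d@example.com"],
    "total_amount": [50.0, -10.0, 20.0, 35.0],      # note the negative value
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02",
                                  "2024-01-02", "2024-01-03"]),
})

dup_ids = orders["order_id"].duplicated().sum()                # uniqueness
missing_emails = orders["email"].isna().sum()                  # completeness
invalid_emails = (~orders["email"].fillna("")                  # validity
                  .str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+")).sum()
negative_amounts = (orders["total_amount"] < 0).sum()          # validity
is_stale = (pd.Timestamp.now() - orders["updated_at"].max()    # timeliness
            > pd.Timedelta(days=1))

print(dup_ids, missing_emails, invalid_emails, negative_amounts, is_stale)

Ad hoc scripts like this are fine for exploration, but they drift out of date quickly, which is why the next section moves these checks into the pipeline itself.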
Implementing Automated Tests
Don't rely on hope. Use tools like Great Expectations or dbt tests to enforce quality.
Example: dbt Schema Tests
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: total_amount
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"
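Running dbt test executes every assertion above, and any failing rows fail the build. Note that expression_is_true comes from the dbt_utils package, which must be declared in packages.yml and installed with dbt deps.

If you would rather express the same checks in Python, Great Expectations covers them too. Here is a minimal sketch using the classic from_pandas interface; the library's API has changed significantly across major versions, so treat the exact calls as illustrative, and the tiny inline DataFrame is a stand-in for your real orders table.

# Mirror the dbt tests above with Great Expectations' classic pandas API.
# NOTE: the API differs across GE versions; this reflects the pre-1.0 style.
import great_expectations as ge
import pandas as pd

orders = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["placed", "shipped", "completed"],
    "total_amount": [50.0, 20.0, 35.0],
}))

orders.expect_column_values_to_be_unique("order_id")       # unique
orders.expect_column_values_to_not_be_null("order_id")     # not_null
orders.expect_column_values_to_be_in_set(                  # accepted_values
    "status", ["placed", "shipped", "completed", "returned"])
result = orders.expect_column_values_to_be_between(        # total_amount >= 0
    "total_amount", min_value=0)
print(result.success)  # True if every value passed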
Data Observability: The Next Frontier
Testing catches known unknowns. But what about unknown unknowns, such as the volume of orders suddenly dropping by 50%?
This is where Data Observability tools come in. They monitor signals such as the following (a minimal volume check is sketched after the list):
- Volume: Row counts over time.
- Freshness: Time since last update.
- Schema Changes: Did a column disappear?
- Distribution: Did the average order value jump from $50 to $5000?
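The underlying idea is simple even if the vendor tooling is sophisticated. The function below is a minimal sketch of a volume monitor: it compares each day's row count to a trailing average and alerts on sharp drops. The window size, the 50% threshold, and the daily counts are all illustrative assumptions.

# Minimal volume monitor: alert when a day's row count falls far below
# the trailing average. Window, threshold, and data are illustrative.
from statistics import mean

def volume_alerts(daily_counts, window=7, max_drop=0.5):
    """Yield (day, count, baseline) when count < (1 - max_drop) * baseline."""
    for i in range(window, len(daily_counts)):
        baseline = mean(daily_counts[i - window:i])
        if daily_counts[i] < (1 - max_drop) * baseline:
            yield i, daily_counts[i], baseline

# Hypothetical daily row counts; the final day drops by roughly 50%.
counts = [1000, 1020, 990, 1010, 1005, 995, 1000, 1010, 1015, 500]
for day, count, baseline in volume_alerts(counts):
    print(f"day {day}: {count} rows vs baseline {baseline:.0f} -> ALERT")

The same pattern extends to the other signals: track the maximum updated_at timestamp for freshness, diff information-schema snapshots for schema changes, and watch rolling means for distribution shifts.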
Conclusion
Data quality is not a one-time fix; it is a continuous process. Start small by adding unique and not_null tests to your primary keys, then expand to business-logic tests, and finally layer in observability.