Most organizations store data indefinitely by default. This creates compounding costs, increased compliance risk, and governance complexity. A well-designed retention policy defines how long data is kept, when it transitions between storage tiers, and when it is permanently deleted — balancing regulatory requirements, analytical utility, and infrastructure costs.
Regulatory Retention Requirements
| Regulation / Standard | Data Type | Minimum Retention | Maximum Retention | Notes |
|---|---|---|---|---|
| GDPR (EU) | Personal data | None specified | No longer than necessary for the purpose | Storage limitation principle (Art. 5(1)(e)): retention must be justified |
| CCPA/CPRA (California) | Consumer data | None specified | Reasonable for disclosed purpose | Must disclose retention period in privacy policy |
| SOX (US) | Financial records | 7 years | No maximum | Audit trail for publicly traded companies |
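| HIPAA (US) | Compliance documentation (policies, procedures, authorizations) | 6 years | No maximum | From date of creation or date last in effect, whichever is later; retention of the medical records themselves is set by state law |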
| PCI-DSS | Cardholder data | As needed for business | Minimize storage | Must not store sensitive auth data post-authorization |
| Basel III / MiFID II | Financial transactions | 5-7 years | No maximum | Transaction records, communications |
| Tax regulations (varies) | Tax records | 3-10 years (varies by country) | No maximum | France: 6 years; US: 3 years in general, up to 7 in specific cases; Germany: up to 10 years |
| Employment law (varies) | Employee records | 1-7 years post-employment | No maximum | Varies significantly by jurisdiction |
| LGPD (Brazil) | Personal data | None specified | Until purpose fulfilled | Similar to GDPR data minimization |
Data Lifecycle Diagram
┌─────────┐    ┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────┐
│ Creation│───▶│  Active  │───▶│   Warm    │───▶│   Cold   │───▶│ Deletion │
│         │    │ Storage  │    │  Archive  │    │ Archive  │    │ / Purge  │
└─────────┘    └──────────┘    └───────────┘    └──────────┘    └──────────┘
                    │                │               │               │
                ┌───▼───┐        ┌───▼───┐      ┌────▼───┐      ┌────▼────┐
                │  Hot  │        │ Infre-│      │ Glacier│      │ Crypto- │
                │storage│        │ quent │      │ / deep │      │ graphic │
                │ (SSD) │        │ access│      │ archive│      │ erasure │
                └───────┘        └───────┘      └────────┘      └─────────┘
Timeline:  Day 0          3-6 months      1-3 years        3-7 years       End of life
Cost/GB:   $$$            $$              $                ¢               $0
Access:    Milliseconds   Seconds         Minutes-Hours    N/A             N/A
Cost Impact Matrix: Retain vs. Archive vs. Delete
| Data Volume | Hot Storage (S3 Standard) | Warm Archive (S3 IA) | Cold Archive (S3 Glacier) | Delete | Annual Savings (Delete vs. Hot) |
|---|---|---|---|---|---|
| 1 TB | $276/year | $150/year | $48/year | $0/year | $276 |
| 10 TB | $2,760/year | $1,500/year | $480/year | $0/year | $2,760 |
| 100 TB | $27,600/year | $15,000/year | $4,800/year | $0/year | $27,600 |
| 1 PB | $276,000/year | $150,000/year | $48,000/year | $0/year | $276,000 |
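The figures in the matrix can be reproduced with a few lines of arithmetic. The sketch below assumes per-GB monthly prices back-derived from the table ($0.023, $0.0125, and $0.004) and decimal terabytes (1 TB = 1000 GB); the function and constant names are illustrative. Actual prices vary by region and change over time, and the calculation covers storage only, not request, retrieval, or egress charges.

```python
from decimal import Decimal

# Assumed prices in USD per GB-month, back-derived from the table above
# (e.g. $276/year for 1 TB of hot storage). Check current pricing before use.
PRICE_PER_GB_MONTH = {
    "hot": Decimal("0.023"),    # S3 Standard
    "warm": Decimal("0.0125"),  # S3 Standard-IA
    "cold": Decimal("0.004"),   # S3 Glacier
}

GB_PER_TB = 1000  # decimal terabytes, matching the table's arithmetic


def annual_storage_cost(volume_tb, tier):
    """Storage-only cost per year; ignores requests, retrieval, and egress."""
    gb = Decimal(str(volume_tb)) * GB_PER_TB
    return gb * PRICE_PER_GB_MONTH[tier] * 12


def annual_savings(volume_tb, from_tier, to_tier):
    """Yearly saving from keeping a volume in `to_tier` instead of `from_tier`."""
    return annual_storage_cost(volume_tb, from_tier) - annual_storage_cost(volume_tb, to_tier)


if __name__ == "__main__":
    for tb in (1, 10, 100, 1000):
        costs = {tier: annual_storage_cost(tb, tier) for tier in PRICE_PER_GB_MONTH}
        print(f"{tb} TB:", {tier: f"${cost:,.0f}" for tier, cost in costs.items()})
```

Using `Decimal` keeps the results exact, so the output matches the table without floating-point rounding noise. The same helper shows that archiving is often most of the win: moving 100 TB from hot to cold saves $22,800 of the $27,600 that deletion would save.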
Hidden costs of over-retention:
- Compute costs for scanning larger datasets (queries slow down)
- Catalog and metadata management overhead
- Compliance risk (more data = more PII = more breach exposure)
- Engineering time for managing bloated datasets
Retention Policy Template
| Field | Description | Example |
|---|---|---|
| Dataset name | Identifier of the dataset or table | orders.fact_transactions |
| Data classification | Sensitivity level | Confidential / PII |
| Owner | Team or individual responsible | Finance Data Team |
| Legal basis | Regulation or business justification | SOX 7-year requirement |
| Active retention | Duration in hot storage | 12 months |
| Warm archive | Age window spent in the infrequent-access tier | Months 12-36 |
| Cold archive | Age window spent in deep archive | Months 36-84 |
| Deletion trigger | When and how data is purged | Automated after 84 months, crypto-erase |
| PII handling | How personal data is managed during lifecycle | Pseudonymized at 6 months, deleted at 36 months |
| Review cadence | How often the policy is reviewed | Annually |
| Exceptions | Any approved deviations | Legal hold for active litigation |
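One way to make the template machine-readable is to store each policy as a small record that validates its own stage boundaries. The sketch below is an assumption about how that could look, not a prescribed schema: the class name, field names, and the choice to express boundaries as ages in months are all illustrative, and a legal hold is modeled simply as a flag that blocks deletion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionPolicy:
    """A subset of the template fields; all names here are illustrative."""
    dataset: str
    classification: str
    owner: str
    legal_basis: str
    active_months: int        # age at which data leaves hot storage
    warm_until_months: int    # age at which data leaves the warm tier
    delete_after_months: int  # age at which data is purged
    legal_hold: bool = False  # approved exception: blocks deletion

    def __post_init__(self):
        # Boundaries must be non-decreasing, or the lifecycle is ambiguous.
        if not (0 < self.active_months <= self.warm_until_months <= self.delete_after_months):
            raise ValueError(f"inconsistent stage boundaries for {self.dataset}")

    def stage_for_age(self, age_months):
        """Return the lifecycle stage for data of the given age in months."""
        if age_months < self.active_months:
            return "active"
        if age_months < self.warm_until_months:
            return "warm"
        if age_months < self.delete_after_months or self.legal_hold:
            return "cold"
        return "delete"


# The example column of the template, expressed as a record.
policy = RetentionPolicy(
    dataset="orders.fact_transactions",
    classification="Confidential / PII",
    owner="Finance Data Team",
    legal_basis="SOX 7-year requirement",
    active_months=12,
    warm_until_months=36,
    delete_after_months=84,
)
print(policy.stage_for_age(40))  # → cold
```

Validating at construction time means a malformed policy fails when it is loaded rather than when a deletion job runs against it.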
Retention Strategy by Data Category
| Data Category | Active | Warm | Cold | Delete | Rationale |
|---|---|---|---|---|---|
| Transactional data | 1 year | 1-3 years | 3-7 years | After 7 years | SOX / tax compliance |
| User behavior / clickstream | 3 months | 3-12 months | — | After 12 months | Analytics utility drops quickly |
| PII / customer profiles | As needed | — | — | On request or purpose fulfilled | GDPR data minimization |
| ML training data | 6 months | 6-24 months | — | After retraining | Model lineage, reproducibility |
| Logs / observability | 30 days | 30-90 days | — | After 90 days | Cost, low long-term value |
| Financial reports | 2 years | 2-5 years | 5-10 years | After 10 years | Regulatory, audit |
| Backups | 30 days | — | 30-90 days | After 90 days | DR only, not long-term storage |
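The per-category deadlines above lend themselves to a simple audit: compare the age of each dataset's oldest record with the deletion deadline for its category. The following is a minimal sketch under stated assumptions: the category keys, the `(name, category, oldest_record_date)` inventory shape, and the approximation of a year as 365 days are all hypothetical choices. Categories whose deletion is event-driven (PII, ML training data) are left out because an age threshold does not describe them.

```python
from datetime import date, timedelta

# Deletion deadlines in days, transcribed from the table above with one year
# approximated as 365 days. Event-driven categories are deliberately omitted.
DELETE_AFTER_DAYS = {
    "transactional": 7 * 365,
    "clickstream": 365,
    "logs": 90,
    "financial_reports": 10 * 365,
    "backups": 90,
}


def overdue_datasets(inventory, today):
    """Return names of datasets whose oldest record is past its deadline.

    `inventory` is an iterable of (name, category, oldest_record_date) tuples.
    Datasets in categories without an age-based deadline are skipped.
    """
    overdue = []
    for name, category, oldest in inventory:
        limit = DELETE_AFTER_DAYS.get(category)
        if limit is not None and today - oldest > timedelta(days=limit):
            overdue.append(name)
    return overdue


inventory = [
    ("app_logs", "logs", date(2024, 1, 1)),
    ("web_events", "clickstream", date(2024, 1, 1)),
    ("customer_profiles", "pii", date(2015, 1, 1)),  # event-driven, skipped
]
print(overdue_datasets(inventory, today=date(2024, 6, 1)))  # → ['app_logs']
```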
Implementation Roadmap
| Phase | Timeline | Deliverables |
|---|---|---|
| 1. Inventory | Month 1-2 | Catalog all datasets, classify by sensitivity and regulation |
| 2. Policy design | Month 3-4 | Define retention periods per category, get legal/compliance sign-off |
| 3. Automation | Month 5-7 | Implement lifecycle rules (S3 lifecycle, BigQuery expiration, DB TTLs) |
| 4. Monitoring | Month 8-9 | Dashboard for retention compliance, cost tracking by tier |
| 5. Enforcement | Month 10-12 | Automated alerts for policy violations, quarterly reviews |
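For the automation phase, an S3 lifecycle rule is one concrete form a retention policy can take. The sketch below builds the rule as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the helper name, rule ID, prefix, and bucket name are hypothetical, and the day counts are taken from the transactional-data row of the strategy table with a year approximated as 365 days.

```python
def s3_lifecycle_rule(rule_id, prefix, warm_after_days, cold_after_days, delete_after_days):
    """Build one rule for an S3 lifecycle configuration as a plain dict."""
    # S3 requires objects to spend at least 30 days in Standard before
    # they can transition to Standard-IA.
    if warm_after_days < 30:
        raise ValueError("STANDARD_IA transition needs at least 30 days")
    if not warm_after_days < cold_after_days < delete_after_days:
        raise ValueError("stage boundaries must be strictly increasing")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": cold_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": delete_after_days},
    }


# Transactional data: 1 year active, warm until year 3, deleted after year 7.
rule = s3_lifecycle_rule("transactions", "transactions/", 365, 3 * 365, 7 * 365)
config = {"Rules": [rule]}

# Applying the configuration needs boto3 and AWS credentials, so the call is
# shown commented out; "example-bucket" is a placeholder name.
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=config
# )
```

On versioned buckets, `Expiration` acts only on current object versions, so a complete rule would also need a noncurrent-version expiration to actually remove the data.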
Common Pitfalls
- "Keep everything forever" — the default that creates unbounded cost and risk
- No legal review — retention periods set by engineering without compliance input
- Backup blind spot — data deleted from primary but lingering in backups for months
- No PII differentiation — applying the same retention to PII and non-PII data
- Manual deletion — relying on humans instead of automated lifecycle policies
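The backup blind spot is easy to quantify. As a rough sketch, assuming backups are taken at least daily and each one is kept for a fixed number of days, data deleted from the primary store remains recoverable until the last backup that contains it expires. The function name is illustrative.

```python
from datetime import date, timedelta


def effective_erasure_date(primary_deleted_on, backup_retention_days):
    """Date by which the last backup containing the data has expired.

    Assumes at least one backup per day, each kept for exactly
    `backup_retention_days`, so a backup taken just before the primary
    deletion is the last one to hold the data.
    """
    return primary_deleted_on + timedelta(days=backup_retention_days)


print(effective_erasure_date(date(2025, 1, 15), 90))  # → 2025-04-15
```

With a 90-day backup window, a deletion request honored on 15 January is not fully effective until mid-April, which is worth stating explicitly in the policy's deletion trigger.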