Data Retention Policies: Balancing Compliance, Cost, and Utility

#data-governance #compliance #finops #data-strategy

Most organizations store data indefinitely by default. This creates compounding costs, increased compliance risk, and governance complexity. A well-designed retention policy defines how long data is kept, when it transitions between storage tiers, and when it is permanently deleted — balancing regulatory requirements, analytical utility, and infrastructure costs.

Regulation Retention Requirements

| Regulation / Standard | Data Type | Minimum Retention | Maximum Retention | Notes |
|---|---|---|---|---|
| GDPR (EU) | Personal data | None specified | As short as necessary for purpose | Data minimization principle; retention must be justified |
| CCPA/CPRA (California) | Consumer data | None specified | Reasonable for disclosed purpose | Must disclose retention period in privacy policy |
| SOX (US) | Financial records | 7 years | No maximum | Audit trail for publicly traded companies |
| HIPAA (US) | Health records | 6 years | No maximum (state laws vary) | From date of creation or last effective date |
| PCI-DSS | Cardholder data | As needed for business | Minimize storage | Must not store sensitive auth data post-authorization |
| Basel III / MiFID II | Financial transactions | 5-7 years | No maximum | Transaction records, communications |
| Tax regulations (varies) | Tax records | 3-10 years (varies by country) | No maximum | France: 6 years, US: 7 years, Germany: 10 years |
| Employment law (varies) | Employee records | 1-7 years post-employment | No maximum | Varies significantly by jurisdiction |
| LGPD (Brazil) | Personal data | None specified | Until purpose fulfilled | Similar to GDPR data minimization |

Data Lifecycle Diagram

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Creation │───▶│  Active  │───▶│   Warm   │───▶│   Cold   │───▶│ Deletion │
│          │    │ Storage  │    │ Archive  │    │ Archive  │    │ / Purge  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
                     │               │               │               │
                ┌────▼─────┐    ┌────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
                │ Hot      │    │ Infre-   │    │ Glacier  │    │ Crypto-  │
                │ storage  │    │ quent    │    │ / deep   │    │ graphic  │
                │ (SSD)    │    │ access   │    │ archive  │    │ erasure  │
                └──────────┘    └──────────┘    └──────────┘    └──────────┘

Timeline:   Day 0         3-6 months      1-3 years       3-7 years      End of life
Cost/GB:    $$$           $$              $               ¢               $0
Access:     Milliseconds  Seconds         Minutes-Hours   N/A             N/A

Cost Impact Matrix: Retain vs. Archive vs. Delete

| Data Volume | Hot Storage (S3 Standard) | Warm Archive (S3 IA) | Cold Archive (S3 Glacier) | Delete | Annual Savings (Delete vs. Hot) |
|---|---|---|---|---|---|
| 1 TB | $276/year | $150/year | $48/year | $0/year | $276 |
| 10 TB | $2,760/year | $1,500/year | $480/year | $0/year | $2,760 |
| 100 TB | $27,600/year | $15,000/year | $4,800/year | $0/year | $27,600 |
| 1 PB | $276,000/year | $150,000/year | $48,000/year | $0/year | $276,000 |
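
The figures in the matrix follow from per-GB monthly prices. As a rough sketch, the Python below reproduces them. The prices are assumptions back-calculated from the table (for example, $276 per TB-year works out to $0.023 per GB-month), not quoted from a current price list, and the function names are illustrative. The calculation ignores retrieval, request, and minimum-storage-duration charges, which apply to the infrequent-access and archive tiers.

```python
from typing import Optional

# Assumed per-GB monthly prices in USD, back-calculated from the table above.
PRICE_PER_GB_MONTH = {
    "hot": 0.023,    # S3 Standard
    "warm": 0.0125,  # S3 Infrequent Access
    "cold": 0.004,   # S3 Glacier
}

GB_PER_TB = 1000  # the table uses decimal units (1 PB = 1000 TB)


def annual_cost(terabytes: float, tier: str) -> float:
    """Annual storage cost in USD for a volume held in one tier all year."""
    monthly = terabytes * GB_PER_TB * PRICE_PER_GB_MONTH[tier]
    return round(monthly * 12, 2)


def annual_savings(terabytes: float, from_tier: str, to_tier: Optional[str]) -> float:
    """Annual savings from moving data to another tier; to_tier=None means deletion."""
    target = 0.0 if to_tier is None else annual_cost(terabytes, to_tier)
    return round(annual_cost(terabytes, from_tier) - target, 2)


if __name__ == "__main__":
    for tb in (1, 10, 100, 1000):
        print(
            f"{tb} TB: hot={annual_cost(tb, 'hot')}, "
            f"warm={annual_cost(tb, 'warm')}, "
            f"cold={annual_cost(tb, 'cold')}, "
            f"delete saves {annual_savings(tb, 'hot', None)}"
        )
```

The same helper can compare tiering against deletion: `annual_savings(tb, "hot", "cold")` shows how much of the saving is available without giving up the data.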

Hidden costs of over-retention:

  • Compute costs for scanning larger datasets (queries slow down)
  • Catalog and metadata management overhead
  • Compliance risk (more data = more PII = more breach exposure)
  • Engineering time for managing bloated datasets

Retention Policy Template

| Field | Description | Example |
|---|---|---|
| Dataset name | Identifier of the dataset or table | orders.fact_transactions |
| Data classification | Sensitivity level | Confidential / PII |
| Owner | Team or individual responsible | Finance Data Team |
| Legal basis | Regulation or business justification | SOX 7-year requirement |
| Active retention | Duration in hot storage | 12 months |
| Warm archive | Duration in infrequent-access tier | 12-36 months |
| Cold archive | Duration in deep archive | 36-84 months |
| Deletion trigger | When and how data is purged | Automated after 84 months, crypto-erase |
| PII handling | How personal data is managed during lifecycle | Pseudonymized at 6 months, deleted at 36 months |
| Review cadence | How often the policy is reviewed | Annually |
| Exceptions | Any approved deviations | Legal hold for active litigation |
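
A template like this is easier to enforce when it is also stored as a machine-readable record. The sketch below is one possible encoding in Python, not a prescribed schema: the class name, field names, and the `stage_for_age` helper are all hypothetical, and the boundaries use the example values from the template (12, 36, and 84 months).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical record mirroring the template fields above."""
    dataset: str
    classification: str
    owner: str
    legal_basis: str
    active_months: int        # age at which data leaves hot storage
    warm_until_months: int    # age at which data leaves the warm archive
    cold_until_months: int    # age at which the deletion trigger fires
    pseudonymize_at_months: Optional[int] = None
    legal_hold: bool = False  # approved exception that blocks deletion

    def __post_init__(self) -> None:
        # Reject policies whose tier boundaries are out of order.
        if not (0 < self.active_months <= self.warm_until_months <= self.cold_until_months):
            raise ValueError("tier boundaries must be positive and non-decreasing")

    def stage_for_age(self, age_months: int) -> str:
        """Return the lifecycle stage for data of the given age in months."""
        if age_months < self.active_months:
            return "active"
        if age_months < self.warm_until_months:
            return "warm"
        if age_months < self.cold_until_months:
            return "cold"
        return "held" if self.legal_hold else "delete"


if __name__ == "__main__":
    policy = RetentionPolicy(
        dataset="orders.fact_transactions",
        classification="Confidential / PII",
        owner="Finance Data Team",
        legal_basis="SOX 7-year requirement",
        active_months=12,
        warm_until_months=36,
        cold_until_months=84,
        pseudonymize_at_months=6,
    )
    for age in (3, 12, 40, 84):
        print(age, policy.stage_for_age(age))
```

Validating the boundaries at construction time catches policies where, for instance, the warm window ends before the active window does.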

Retention Strategy by Data Category

| Data Category | Active | Warm | Cold | Delete | Rationale |
|---|---|---|---|---|---|
| Transactional data | 1 year | 1-3 years | 3-7 years | After 7 years | SOX / tax compliance |
| User behavior / clickstream | 3 months | 3-12 months | - | After 12 months | Analytics utility drops quickly |
| PII / customer profiles | As needed | - | - | On request or purpose fulfilled | GDPR data minimization |
| ML training data | 6 months | 6-24 months | - | After retraining | Model lineage, reproducibility |
| Logs / observability | 30 days | 30-90 days | - | After 90 days | Cost, low long-term value |
| Financial reports | 2 years | 2-5 years | 5-10 years | After 10 years | Regulatory, audit |
| Backups | 30 days | 30-90 days | - | After 90 days | DR only, not long-term storage |

Implementation Roadmap

| Phase | Timeline | Deliverables |
|---|---|---|
| 1. Inventory | Months 1-2 | Catalog all datasets, classify by sensitivity and regulation |
| 2. Policy design | Months 3-4 | Define retention periods per category, get legal/compliance sign-off |
| 3. Automation | Months 5-7 | Implement lifecycle rules (S3 lifecycle, BigQuery expiration, DB TTLs) |
| 4. Monitoring | Months 8-9 | Dashboard for retention compliance, cost tracking by tier |
| 5. Enforcement | Months 10-12 | Automated alerts for policy violations, quarterly reviews |
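
For the automation phase, an S3 lifecycle rule is one way to express a retention schedule. The sketch below builds a rule in the dictionary shape that S3 lifecycle configurations use (`ID`, `Filter`, `Status`, `Transitions`, `Expiration`). The helper name, the prefix, and the day counts are illustrative assumptions; the 30-day checks are a conservative reading of S3's minimum-age constraints for transitions and should be verified against the current AWS documentation.

```python
import json


def s3_lifecycle_rule(rule_id: str, prefix: str, warm_after_days: int,
                      cold_after_days: int, delete_after_days: int) -> dict:
    """Build one lifecycle rule: hot -> infrequent access -> Glacier -> expire."""
    if warm_after_days < 30:
        raise ValueError("transition to STANDARD_IA is assumed to need at least 30 days")
    if cold_after_days < warm_after_days + 30:
        raise ValueError("cold transition is assumed to need 30 days in the warm tier")
    if delete_after_days <= cold_after_days:
        raise ValueError("expiration must come after the cold transition")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": cold_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": delete_after_days},
    }


if __name__ == "__main__":
    # Transactional data: 1 year active, warm until year 3, deleted after 7 years.
    rule = s3_lifecycle_rule(
        rule_id="transactional-retention",
        prefix="orders/fact_transactions/",  # hypothetical key prefix
        warm_after_days=365,
        cold_after_days=3 * 365,
        delete_after_days=7 * 365,
    )
    config = {"Rules": [rule]}
    print(json.dumps(config, indent=2))
    # With boto3 this could be applied along the lines of:
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-bucket", LifecycleConfiguration=config)
    # where "my-bucket" is a placeholder.
```

On versioned buckets, expiring the current version does not remove noncurrent versions, so a separate noncurrent-version expiration rule is typically needed for data to be fully deleted.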

Common Pitfalls

  • "Keep everything forever" — the default that creates unbounded cost and risk
  • No legal review — retention periods set by engineering without compliance input
  • Backup blind spot — data deleted from primary but lingering in backups for months
  • No PII differentiation — applying the same retention to PII and non-PII data
  • Manual deletion — relying on humans instead of automated lifecycle policies
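
Several of these pitfalls can be caught by a scheduled audit over the dataset inventory. The following is a minimal sketch, assuming the inventory can report each dataset's oldest record and assigned deletion age; `DatasetStatus` and `find_violations` are hypothetical names, and a real check would read from the data catalog rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DatasetStatus:
    name: str
    oldest_record: date
    delete_after_days: Optional[int]  # None means no retention policy is assigned
    legal_hold: bool = False


def find_violations(inventory: List[DatasetStatus], today: date) -> List[str]:
    """Flag datasets with no policy or with data past its deletion deadline."""
    findings = []
    for ds in inventory:
        if ds.delete_after_days is None:
            findings.append(f"{ds.name}: no retention policy assigned")
            continue
        if ds.legal_hold:
            continue  # approved exception: retention is suspended
        deadline = ds.oldest_record + timedelta(days=ds.delete_after_days)
        if today > deadline:
            overdue = (today - deadline).days
            findings.append(
                f"{ds.name}: oldest record is {overdue} days past its deletion deadline"
            )
    return findings


if __name__ == "__main__":
    inventory = [
        DatasetStatus("logs.app_events", date(2024, 9, 1), 90),
        DatasetStatus("clickstream.raw", date(2024, 6, 1), 365),
        DatasetStatus("scratch.exports", date(2021, 1, 1), None),
    ]
    for finding in find_violations(inventory, today=date(2025, 1, 1)):
        print(finding)
```

Running the same check against backup locations, not just primary storage, addresses the backup blind spot.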

Resources