October 2, 2025

Jim Gallagher's Enterprise AI Pipeline - Part 2: Data Foundation

Solving the $3.1 Trillion Data Quality Problem

JetStor CEO Jim Gallagher reveals how poor data foundations sink AI initiatives. Learn to conquer data gravity, choose the right architecture, and build AI-ready data systems that actually work.

Before you can run AI, you need to walk with data. Most companies are trying to sprint on a foundation of quicksand - data scattered across silos, formats that don't talk, and governance that strangles innovation. The winners aren't the ones with the most data; they're the ones whose data is ready to work.

The $3.1 Trillion Problem Nobody Talks About

IBM estimates that poor data quality costs the US economy $3.1 trillion annually. But here's the kicker - that number was calculated before AI made data quality existentially important.
In the AI era, bad data doesn't just mean bad reports. It means:

  • Models that discriminate because of biased training data
  • Production failures that cost millions per hour
  • Compliance violations that trigger regulatory hell
  • Competitive disadvantage that becomes permanent


Yet most companies are sitting on a data foundation that would make a Jenga tower look stable.

Data Gravity: The Physics That's Eating Your Budget

Data gravity is simple physics: the bigger your data gets, the harder it becomes to move.

But in AI, this isn't just inconvenient - it's catastrophic to your economics.

The True Cost of Data Movement

Let's do the math on a real scenario:

Scenario: Training a large language model on 100TB of text data

  • Option 1: Move data to cloud GPUs
    • Transfer time (1Gbps): 11.5 days
    • Transfer time (10Gbps): 27 hours
    • AWS transfer cost: ~$9,000 in egress fees (ingress is "free"; pulling your data back out is where they get you)
    • Productivity loss: 2 weeks of data scientist time = $8,000
    • Total cost: $17,000 before you train a single parameter
  • Option 2: Move compute to data (on-premises GPU cluster)
    • Transfer time: 0
    • Transfer cost: $0
    • Infrastructure cost: Amortized over hundreds of training runs
    • Total cost: Approaches zero per run
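
To make the data-gravity math concrete, here's a rough sketch you can rerun with your own link speed and egress rate. The ~80% effective link utilization and $0.09/GB egress figure are illustrative assumptions, not provider quotes.

python
# Back-of-the-envelope data-gravity math. The ~80% link efficiency and ~$0.09/GB
# egress rate are illustrative assumptions, not quotes - plug in your own numbers.

def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Days to push data_tb terabytes through a link_gbps pipe at the given efficiency."""
    seconds = (data_tb * 1e12 * 8) / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400

def egress_cost_usd(data_tb: float, usd_per_gb: float = 0.09) -> float:
    """Rough bill for pulling data_tb back out of the cloud."""
    return data_tb * 1_000 * usd_per_gb

print(f"100TB over 1Gbps : {transfer_days(100, 1):.1f} days")         # ~11.6 days
print(f"100TB over 10Gbps: {transfer_days(100, 10) * 24:.0f} hours")  # ~28 hours
print(f"Egress for 100TB : ${egress_cost_usd(100):,.0f}")             # ~$9,000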

The Modern Data Topology: Edge to Core to Cloud (and Back)

Here's how smart companies are organizing their data geography:

EDGE (IoT, Sensors, Stores)
├── Real-time inference
├── Data filtering/reduction
└── Critical decisions only
    ↓
CORE (On-Premises/Colo)
├── Training/retraining
├── Batch inference
├── Data lake/warehouse
└── Compliance/sovereignty
    ↓
CLOUD (AWS/Azure/GCP)
├── Burst compute
├── Archive storage
├── Disaster recovery
└── Global distribution

The Anti-Pattern: Everything in the cloud

  • Monthly AWS bill: $50K-500K
  • Egress charges: "Surprise! Here's another $100K"
  • Latency: "Why does inference take 2 seconds?"
  • Sovereignty: "Wait, we can't store German data in US regions?"

Structured vs. Unstructured: The 80/20 Rule Flipped

Traditional IT was built for structured data - neat rows and columns in databases. AI lives on unstructured data - images, video, text, sensor streams.

The Reality Check:

  • 80% of enterprise data is unstructured
  • 90% of AI value comes from unstructured data
  • 95% of storage infrastructure was designed for structured data

See the mismatch?

Storage Requirements by Data Type

Each entry lists typical volume, access pattern, performance needs, and suggested storage type:

  • Training images: 10-100TB; sequential, read-heavy; 10-50GB/s throughput; NVMe/SSD
  • Video streams: 100TB-1PB; sequential, write-heavy; sustained 5-20GB/s; SSD/HDD hybrid
  • Sensor data: 1-10TB/day; random, write-heavy; high IOPS (100K+); NVMe
  • Text/documents: 1-50TB; random, read-heavy; moderate (1-5GB/s); SSD
  • Model checkpoints: 100GB-5TB; sequential, write-heavy; burst 10GB/s+; NVMe
  • Feature stores: 10GB-1TB; random, read-heavy; low latency (<1ms); NVMe/memory
Values are indicative; tune to workload.

The Mistake Everyone Makes: One storage tier for everything. Like using a Ferrari for grocery runs and a minivan for racing.

The Lake, The Warehouse, and The Lakehouse: A Decision Framework

Stop letting vendors convince you their architecture is the only way. Here's how to actually decide:

Data Lake: When It Makes Sense

Choose if:

  • Unstructured data dominates (>70%)
  • Schema changes frequently
  • Data scientists need raw data access
  • Cost per TB matters more than query speed

Avoid if:

  • Need consistent sub-second queries
  • Strict governance requirements
  • Limited data engineering resources

Real cost: $50-200/TB/year on-premises, $23-100/TB/month in cloud

Data Warehouse: When It's Worth It

Choose if:

  • Structured data with stable schemas
  • Business intelligence is critical
  • Need ACID compliance
  • Query performance trumps flexibility

Avoid if:

  • Dealing with images, video, or sensor data
  • Rapid prototyping/experimentation needed
  • Budget constrained

Real cost: $500-2000/TB/year, plus licensing

Lakehouse: The Hybrid Hope

Choose if:

  • Need both BI and AI workloads
  • Have mature data engineering team
  • Want single source of truth
  • Delta Lake/Iceberg/Hudi ecosystem fits

Avoid if:

  • Team lacks Spark/distributed computing skills
  • Need maximum performance for specific workloads
  • Still figuring out data strategy

The Pragmatic Approach: Graduated Architecture

Instead of picking one, smart companies graduate their data:

HOT DATA (Last 7 days)
├── NVMe storage
├── Immediate access
├── Full performance
└── Cost: $1000/TB/year

WARM DATA (7-90 days)
├── SSD storage
├── Minutes to access
├── Good performance
└── Cost: $200/TB/year

COLD DATA (90+ days)
├── HDD/Object storage
├── Hours to access
├── Adequate performance
└── Cost: $50/TB/year

The key insight: 90% of AI training uses data from the last 30 days. Why pay hot storage prices for cold data?
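
Here's a quick sketch of what graduation does to the annual bill, using the per-tier prices from the diagram above; the dataset names, sizes, and ages are hypothetical examples, not a recommendation.

python
# Age-based tier assignment using the prices from the diagram above.
# Dataset names, sizes, and ages below are hypothetical.

TIERS = [  # (max_age_days, tier, $/TB/year)
    (7,    "hot  (NVMe)",       1000),
    (90,   "warm (SSD)",         200),
    (None, "cold (HDD/object)",   50),
]

def tier_for(age_days: int):
    for max_age, name, price in TIERS:
        if max_age is None or age_days <= max_age:
            return name, price

datasets = [("clickstream", 40, 3), ("store_video", 300, 45), ("sales_history", 400, 200)]
total = 0
for name, size_tb, age_days in datasets:   # (name, size in TB, age in days)
    tier, price = tier_for(age_days)
    cost = size_tb * price
    total += cost
    print(f"{name:14s} {size_tb:4d}TB, {age_days:3d} days old -> {tier:18s} ${cost:>9,}/yr")
print(f"estimated annual storage cost: ${total:,}")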

Compliance Without Paralysis: The Practical Guide

Governance usually comes in two flavors: ignored entirely or so restrictive nothing gets done. Here's the middle path:

The Minimum Viable Governance Stack

Data Classification: Not everything needs Fort Knox

  • Level 1: Public data - No restrictions
  • Level 2: Internal - Basic access controls
  • Level 3: Confidential - Encryption + audit logs
  • Level 4: Regulated - Full compliance stack

Access Controls: Simple rules that actually get followed

python
# Bad: Everyone needs VP approval for everything
# Good: Risk-based automation keyed to the classification levels above
def grant_access(user, hours=None, reviewer=None): ...    # stub - wire to your IAM system
def require_approval(user): ...                           # stub - route to an approval queue

def request_access(user, data_classification: int):
    if data_classification <= 2 and user.team == "data_science":
        grant_access(user, hours=24)                  # low-risk: auto-grant, time-boxed
    elif data_classification == 3 and user.clearance:
        grant_access(user, reviewer=user.manager)     # confidential: grant, manager reviews
    else:
        require_approval(user)                        # everything else escalates

Audit Requirements: Log what matters

  • Who accessed what data
  • What models trained on what datasets
  • Where data moved between systems
  • When regulatory data was processed

The 80/20 Rule: 80% of compliance comes from 20% of the effort. Focus on:

  1. Data lineage (know where data came from)
  2. Access logging (know who touched it)
  3. Encryption at rest and in transit
  4. Regular backups with tested restore
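
For items 1 and 2, here's a minimal sketch of the kind of audit record worth emitting on every data touch; the field names and identifiers are illustrative, not a standard schema.

python
# Minimal audit record covering lineage and access logging (items 1 and 2 above).
# Field names and identifiers are illustrative; ship this to whatever log pipeline you already run.
import json
from datetime import datetime, timezone

def audit_event(actor: str, action: str, dataset: str, classification: int, detail: str = "") -> str:
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                     # who touched it
        "action": action,                   # read / train / move / delete
        "dataset": dataset,                 # what was touched
        "classification": classification,   # 1-4, per the levels above
        "detail": detail,                   # e.g. training run ID or destination system
    })

# Example: a training job consumed a Level 3 dataset
print(audit_event("svc-trainer-01", "train", "customer_behavior_v7", 3, detail="run=2025-10-02-a"))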

Technical Deep Dive: Protocols and Performance

The Protocol Wars: Who Wins for AI?

NFS (Network File System):

  • ✅ Universal compatibility
  • ✅ Simple management
  • ❌ Performance ceiling ~2-3GB/s per client
  • ❌ Cache coherency issues at scale
  • Verdict: Fine for small teams, breaks at production scale

SMB/CIFS:

  • ✅ Windows ecosystem integration
  • ❌ Even worse performance than NFS
  • ❌ Not native to Linux/GPU ecosystems
  • Verdict: Just... don't

S3 (Object Storage):

  • ✅ Infinite scale
  • ✅ Perfect for data lakes
  • ✅ Cost effective for cold data
  • ❌ Not POSIX compliant (breaks many AI tools)
  • ❌ High latency for small files
  • Verdict: Great for archives, painful for active training

Parallel File Systems (Lustre/GPFS/BeeGFS):

  • ✅ Massive throughput (100GB/s+)
  • ✅ Scales to thousands of clients
  • ✅ POSIX compliant
  • ❌ Complex management
  • ❌ Expensive licensing (for some)
  • Verdict: The gold standard for serious AI workloads

The Hybrid Approach: Use S3 for cold storage, parallel file system for hot data, with intelligent tiering between them.
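
A minimal sketch of that tiering loop follows, assuming boto3 is installed and credentials are configured; the mount point, bucket name, and 90-day threshold are hypothetical placeholders.

python
# Sketch of hot-to-cold tiering: keep active data on the parallel file system,
# push anything untouched for 90+ days to object storage.
# Mount point, bucket, and threshold are hypothetical; assumes boto3 + configured credentials.
import os
import time
import boto3

HOT_PATH = "/mnt/lustre/datasets"      # parallel file system mount (hypothetical)
BUCKET = "example-cold-archive"        # hypothetical S3 bucket
COLD_AFTER_DAYS = 90

s3 = boto3.client("s3")
cutoff = time.time() - COLD_AFTER_DAYS * 86_400

for root, _dirs, files in os.walk(HOT_PATH):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:       # untouched for 90+ days
            key = os.path.relpath(path, HOT_PATH)
            s3.upload_file(path, BUCKET, key)     # archive to the cold tier
            os.remove(path)                       # free up hot capacity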

Building for Real Performance

Stop optimizing for vendor benchmarks. Here's what actually matters:

For Training:

  • Sequential read throughput: 10GB/s minimum
  • Checkpoint write speed: 5GB/s burst
  • Metadata operations: 10K+ ops/sec
  • Concurrent clients: 50-500 nodes

For Inference:

  • Random read IOPS: 100K+
  • Latency: <5ms P99
  • Cache hit ratio: >80%
  • Concurrent requests: 1000+
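
If it helps, here's a small sketch that encodes those targets and checks benchmark results against them; the "measured" numbers are placeholders for your own figures.

python
# Sanity-check a storage system against the training/inference targets above.
# The "measured" numbers are placeholders - substitute your own benchmark results.

TRAINING_TARGETS  = {"seq_read_gbps": 10, "ckpt_write_gbps": 5, "metadata_ops_sec": 10_000}
INFERENCE_TARGETS = {"rand_read_iops": 100_000, "p99_latency_ms": 5, "cache_hit_pct": 80}
LOWER_IS_BETTER = {"p99_latency_ms"}

def check(measured: dict, targets: dict) -> None:
    for metric, target in targets.items():
        value = measured[metric]
        ok = value <= target if metric in LOWER_IS_BETTER else value >= target
        print(f"{metric:18s} target {target:>9,} measured {value:>9,} {'OK' if ok else 'MISS'}")

check({"seq_read_gbps": 12, "ckpt_write_gbps": 4, "metadata_ops_sec": 15_000}, TRAINING_TARGETS)
check({"rand_read_iops": 140_000, "p99_latency_ms": 3, "cache_hit_pct": 85}, INFERENCE_TARGETS)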

The Performance Stack That Works:

Application Layer
    ↓
Caching Layer (Redis/Memcached)
    ↓
Parallel File System (Lustre/BeeGFS)
    ↓
Block Storage Layer
    ↓
Mixed Media:
- NVMe: Active datasets
- SSD: Recent data
- HDD: Archives

Migration Reality: Moving Without Bleeding

Every vendor promises "seamless migration." Here's what actually happens:

The Hidden Costs of Data Migration

Scenario: Migrating 500TB from legacy SAN to modern infrastructure

  • Vendor quote: $50K for migration services
  • Reality:
    • Downtime: 2 weekends × $100K revenue loss = $200K
    • Team time: 6 engineers × 2 weeks = $60K
    • Parallel running costs: 2 months × $30K = $60K
    • Unexpected reformatting: $25K
    • Actual cost: $395K
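
For planning, here are the same line items as a sketch you can rerun with your own figures; the numbers are the ones from this scenario, not general-purpose estimates.

python
# The migration line items above as a rerunnable sketch.
# Figures are from the 500TB scenario in the text - substitute your own.

line_items = {
    "vendor migration services":                50_000,
    "downtime (2 weekends x $100K revenue)":   200_000,
    "team time (6 engineers x 2 weeks)":        60_000,
    "parallel running (2 months x $30K)":       60_000,
    "unexpected reformatting":                  25_000,
}

for item, cost in line_items.items():
    print(f"{item:42s} ${cost:>8,}")
print(f"{'actual cost':42s} ${sum(line_items.values()):>8,}")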

The Migration Strategy That Actually Works

Phase 1: Parallel Running (Month 1-2)

  • New infrastructure alongside old
  • Mirror critical datasets
  • Test with non-production workloads
  • Cost: 2x infrastructure, but no downtime

Phase 2: Graduated Cutover (Month 2-3)

  • Move development first
  • Then staging/test
  • Production last
  • Rollback plan for each stage

Phase 3: Validation (Month 3-4)

  • Performance benchmarking
  • Data integrity checks
  • User acceptance testing
  • Keep old system warm

Phase 4: Decommission (Month 4+)

  • Archive historical data
  • Document lessons learned
  • Celebrate (seriously, migration is hard)

The Build vs. Buy Decision Nobody Gets Right

When Building Makes Sense

  • ✅ You have a team of storage experts
  • ✅ Your needs are truly unique
  • ✅ You have 18 months to get it right
  • ✅ TCO over 5 years beats commercial solutions

Reality check: This describes <5% of companies

When Buying Makes Sense

  • ✅ You need it working in 90 days
  • ✅ You want someone to blame (support)
  • ✅ Your team should focus on AI, not storage
  • ✅ TCO includes operational overhead

Reality check: This is 95% of companies

The Third Option: Ecosystem Approach

Don't build everything, don't buy from one vendor:

  • Storage OS from vendor A (with support)
  • Drives from vendor B (best price/performance)
  • Networking from vendor C (already have it)
  • Software layer from vendor D (best features)

Advantages:

  • No vendor lock-in
  • Best-of-breed everything
  • Competitive pricing
  • Flexibility to change

Requirements:

  • Vendor who plays well with others
  • Standards-based architecture
  • Strong integration support

Case Study: How a Retailer Fixed Their Foundation

The Situation:

  • 50TB of sales data in Oracle
  • 200TB of customer behavior in Hadoop
  • 500TB of video from stores in cold storage
  • 10TB of new data daily
  • AI initiative stalled for 8 months

The Problem:

  • Data scientists spending 80% of their time on data wrangling
  • 24-hour delay to access video data
  • $50K/month in cloud egress charges
  • Can't iterate fast enough to compete

The Solution:

  • Deployed parallel file system for hot data (last 30 days)
  • Automated tiering to object storage (30+ days)
  • Built data catalog with Apache Atlas
  • Implemented graduated access controls

The Results:

  • Data access time: 24 hours → 5 minutes
  • Model iteration time: 2 weeks → 2 days
  • Monthly cloud costs: $50K → $12K
  • Time to production: 8 months → 6 weeks

The Lesson: They didn't need more data or better models. They needed data that was ready to work.

Your Data Foundation Checklist

Before moving to the next section, ensure you have answers to:

Data Strategy

  • Where does your data live today?
  • What are your data gravity costs?
  • Have you mapped data flow from edge to core to cloud?
  • Do you know your hot/warm/cold data ratios?

Architecture Decisions

  • Lake, warehouse, or lakehouse?
  • Which protocols match your workloads?
  • What are your real performance requirements?
  • How will you handle compliance?

Migration Planning

  • What's your migration budget (real, not vendor quotes)?
  • Do you have a rollback plan?
  • Who owns the migration project?
  • What's your acceptable downtime?

Build vs Buy

  • Do you have storage expertise in-house?
  • What's your real TCO (including operations)?
  • Can you afford vendor lock-in?
  • Do you need ecosystem flexibility?

If you're not confident in at least 12 of these 16, stop building AI and fix your foundation first.

The Bottom Line

Your data foundation determines your AI ceiling. You can have the best models, the smartest data scientists, and the fastest GPUs - but if your data isn't organized, accessible, and performant, you're building on quicksand.

The winners in AI won't be the companies with the most data. They'll be the companies whose data is ready to work. And that starts with storage infrastructure that's designed for AI workloads, not retrofitted from the database era.
