October 2, 2025

Jim Gallagher's Enterprise AI Pipeline - Part 2: Data Foundation

Solving the $3.1 Trillion Data Quality Problem

JetStor CEO Jim Gallagher reveals how poor data foundations sink AI initiatives. Learn to conquer data gravity, choose the right architecture, and build AI-ready data systems that actually work.

Before you can run AI, you need to walk with data. Most companies are trying to sprint on a foundation of quicksand - data scattered across silos, formats that don't talk, and governance that strangles innovation. The winners aren't the ones with the most data; they're the ones whose data is ready to work.

The $3.1 Trillion Problem Nobody Talks About

IBM estimates that poor data quality costs the US economy $3.1 trillion annually. But here's the kicker - that number was calculated before AI made data quality existentially important.
In the AI era, bad data doesn't just mean bad reports. It means:

  • Models that discriminate because of biased training data
  • Production failures that cost millions per hour
  • Compliance violations that trigger regulatory hell
  • Competitive disadvantage that becomes permanent


Yet most companies are sitting on a data foundation that would make a Jenga tower look stable.

Data Gravity: The Physics That's Eating Your Budget

Data gravity is simple physics: the bigger your data gets, the harder it becomes to move.

But in AI, this isn't just inconvenient - it's catastrophic to your economics.

The True Cost of Data Movement

Let's do the math on a real scenario:

Scenario: Training a large language model on 100TB of text data

  • Option 1: Move data to cloud GPUs
    • Transfer time (1Gbps): 11.5 days
    • Transfer time (10Gbps): 27 hours
    • AWS transfer cost: ~$9,000 in egress fees (ingress is "free"; pulling your data back out is where they get you)
    • Productivity loss: 2 weeks of data scientist time = $8,000
    • Total cost: $17,000 before you train a single parameter
  • Option 2: Move compute to data (on-premises GPU cluster)
    • Transfer time: 0
    • Transfer cost: $0
    • Infrastructure cost: Amortized over hundreds of training runs
    • Total cost: Approaches zero per run
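
To make the data-gravity math concrete, here's a rough sketch you can rerun with your own link speed and egress rate. The ~80% effective link utilization and $0.09/GB egress figure are illustrative assumptions, not provider quotes.

python
# Back-of-the-envelope data-gravity math. The ~80% link efficiency and ~$0.09/GB
# egress rate are illustrative assumptions, not quotes - plug in your own numbers.

def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Days to push data_tb terabytes through a link_gbps pipe at the given efficiency."""
    seconds = (data_tb * 1e12 * 8) / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400

def egress_cost_usd(data_tb: float, usd_per_gb: float = 0.09) -> float:
    """Rough bill for pulling data_tb back out of the cloud."""
    return data_tb * 1_000 * usd_per_gb

print(f"100TB over 1Gbps : {transfer_days(100, 1):.1f} days")         # ~11.6 days
print(f"100TB over 10Gbps: {transfer_days(100, 10) * 24:.0f} hours")  # ~28 hours
print(f"Egress for 100TB : ${egress_cost_usd(100):,.0f}")             # ~$9,000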

The Modern Data Topology: Edge to Core to Cloud (and Back)

Here's how smart companies are organizing their data geography:

EDGE (IoT, Sensors, Stores)
├── Real-time inference
├── Data filtering/reduction
└── Critical decisions only
    ↓
CORE (On-Premises/Colo)
├── Training/retraining
├── Batch inference
├── Data lake/warehouse
└── Compliance/sovereignty
    ↓
CLOUD (AWS/Azure/GCP)
├── Burst compute
├── Archive storage
├── Disaster recovery
└── Global distribution

The Anti-Pattern: Everything in the cloud

  • Monthly AWS bill: $50K-500K
  • Egress charges: "Surprise! Here's another $100K"
  • Latency: "Why does inference take 2 seconds?"
  • Sovereignty: "Wait, we can't store German data in US regions?"

Structured vs. Unstructured: The 80/20 Rule Flipped

Traditional IT was built for structured data - neat rows and columns in databases. AI lives on unstructured data - images, video, text, sensor streams.

The Reality Check:

  • 80% of enterprise data is unstructured
  • 90% of AI value comes from unstructured data
  • 95% of storage infrastructure was designed for structured data

See the mismatch?

Storage Requirements by Data Type

Each entry lists typical volume, access pattern, performance needs, and suggested storage type:

  • Training images: 10-100TB; sequential, read-heavy; 10-50GB/s throughput; NVMe/SSD
  • Video streams: 100TB-1PB; sequential, write-heavy; sustained 5-20GB/s; SSD/HDD hybrid
  • Sensor data: 1-10TB/day; random, write-heavy; high IOPS (100K+); NVMe
  • Text/documents: 1-50TB; random, read-heavy; moderate (1-5GB/s); SSD
  • Model checkpoints: 100GB-5TB; sequential, write-heavy; burst 10GB/s+; NVMe
  • Feature stores: 10GB-1TB; random, read-heavy; low latency (<1ms); NVMe/memory
Values are indicative; tune to workload.

The Mistake Everyone Makes: One storage tier for everything. Like using a Ferrari for grocery runs and a minivan for racing.

The Lake, The Warehouse, and The Lakehouse: A Decision Framework

Stop letting vendors convince you their architecture is the only way. Here's how to actually decide:

Data Lake: When It Makes Sense

Choose if:

  • Unstructured data dominates (>70%)
  • Schema changes frequently
  • Data scientists need raw data access
  • Cost per TB matters more than query speed

Avoid if:

  • Need consistent sub-second queries
  • Strict governance requirements
  • Limited data engineering resources

Real cost: $50-200/TB/year on-premises, $23-100/TB/month in cloud

Data Warehouse: When It's Worth It

Choose if:

  • Structured data with stable schemas
  • Business intelligence is critical
  • Need ACID compliance
  • Query performance trumps flexibility

Avoid if:

  • Dealing with images, video, or sensor data
  • Rapid prototyping/experimentation needed
  • Budget constrained

Real cost: $500-2000/TB/year, plus licensing

Lakehouse: The Hybrid Hope

Choose if:

  • Need both BI and AI workloads
  • Have mature data engineering team
  • Want single source of truth
  • Delta Lake/Iceberg/Hudi ecosystem fits

Avoid if:

  • Team lacks Spark/distributed computing skills
  • Need maximum performance for specific workloads
  • Still figuring out data strategy

The Pragmatic Approach: Graduated Architecture

Instead of picking one, smart companies graduate their data:

HOT DATA (Last 7 days)
├── NVMe storage
├── Immediate access
├── Full performance
└── Cost: $1000/TB/year

WARM DATA (7-90 days)
├── SSD storage
├── Minutes to access
├── Good performance
└── Cost: $200/TB/year

COLD DATA (90+ days)
├── HDD/Object storage
├── Hours to access
├── Adequate performance
└── Cost: $50/TB/year

The key insight: 90% of AI training uses data from the last 30 days. Why pay hot storage prices for cold data?
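
Here's a quick sketch of what graduation does to the annual bill, using the per-tier prices from the diagram above; the dataset names, sizes, and ages are hypothetical examples, not a recommendation.

python
# Age-based tier assignment using the prices from the diagram above.
# Dataset names, sizes, and ages below are hypothetical.

TIERS = [  # (max_age_days, tier, $/TB/year)
    (7,    "hot  (NVMe)",       1000),
    (90,   "warm (SSD)",         200),
    (None, "cold (HDD/object)",   50),
]

def tier_for(age_days: int):
    for max_age, name, price in TIERS:
        if max_age is None or age_days <= max_age:
            return name, price

datasets = [("clickstream", 40, 3), ("store_video", 300, 45), ("sales_history", 400, 200)]
total = 0
for name, size_tb, age_days in datasets:   # (name, size in TB, age in days)
    tier, price = tier_for(age_days)
    cost = size_tb * price
    total += cost
    print(f"{name:14s} {size_tb:4d}TB, {age_days:3d} days old -> {tier:18s} ${cost:>9,}/yr")
print(f"estimated annual storage cost: ${total:,}")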

Compliance Without Paralysis: The Practical Guide

Governance usually comes in two flavors: ignored entirely or so restrictive nothing gets done. Here's the middle path:

The Minimum Viable Governance Stack

Data Classification: Not everything needs Fort Knox

  • Level 1: Public data - No restrictions
  • Level 2: Internal - Basic access controls
  • Level 3: Confidential - Encryption + audit logs
  • Level 4: Regulated - Full compliance stack

Access Controls: Simple rules that actually get followed

python
# Bad: Everyone needs VP approval for everything
# Good: Risk-based automation keyed to the classification levels above
def grant_access(user, hours=None, reviewer=None): ...    # stub - wire to your IAM system
def require_approval(user): ...                           # stub - route to an approval queue

def request_access(user, data_classification: int):
    if data_classification <= 2 and user.team == "data_science":
        grant_access(user, hours=24)                  # low-risk: auto-grant, time-boxed
    elif data_classification == 3 and user.clearance:
        grant_access(user, reviewer=user.manager)     # confidential: grant, manager reviews
    else:
        require_approval(user)                        # everything else escalates

Audit Requirements: Log what matters

  • Who accessed what data
  • What models trained on what datasets
  • Where data moved between systems
  • When regulatory data was processed

The 80/20 Rule: 80% of compliance comes from 20% of the effort. Focus on:

  1. Data lineage (know where data came from)
  2. Access logging (know who touched it)
  3. Encryption at rest and in transit
  4. Regular backups with tested restore
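
For items 1 and 2, here's a minimal sketch of the kind of audit record worth emitting on every data touch; the field names and identifiers are illustrative, not a standard schema.

python
# Minimal audit record covering lineage and access logging (items 1 and 2 above).
# Field names and identifiers are illustrative; ship this to whatever log pipeline you already run.
import json
from datetime import datetime, timezone

def audit_event(actor: str, action: str, dataset: str, classification: int, detail: str = "") -> str:
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                     # who touched it
        "action": action,                   # read / train / move / delete
        "dataset": dataset,                 # what was touched
        "classification": classification,   # 1-4, per the levels above
        "detail": detail,                   # e.g. training run ID or destination system
    })

# Example: a training job consumed a Level 3 dataset
print(audit_event("svc-trainer-01", "train", "customer_behavior_v7", 3, detail="run=2025-10-02-a"))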

Technical Deep Dive: Protocols and Performance

The Protocol Wars: Who Wins for AI?

NFS (Network File System):

  • ✅ Universal compatibility
  • ✅ Simple management
  • ❌ Performance ceiling ~2-3GB/s per client
  • ❌ Cache coherency issues at scale
  • Verdict: Fine for small teams, breaks at production scale

SMB/CIFS:

  • ✅ Windows ecosystem integration
  • ❌ Even worse performance than NFS
  • ❌ Not native to Linux/GPU ecosystems
  • Verdict: Just... don't

S3 (Object Storage):

  • ✅ Infinite scale
  • ✅ Perfect for data lakes
  • ✅ Cost effective for cold data
  • ❌ Not POSIX compliant (breaks many AI tools)
  • ❌ High latency for small files
  • Verdict: Great for archives, painful for active training

Parallel File Systems (Lustre/GPFS/BeeGFS):

  • ✅ Massive throughput (100GB/s+)
  • ✅ Scales to thousands of clients
  • ✅ POSIX compliant
  • ❌ Complex management
  • ❌ Expensive licensing (for some)
  • Verdict: The gold standard for serious AI workloads

The Hybrid Approach: Use S3 for cold storage, parallel file system for hot data, with intelligent tiering between them.
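
A minimal sketch of that tiering loop follows, assuming boto3 is installed and credentials are configured; the mount point, bucket name, and 90-day threshold are hypothetical placeholders.

python
# Sketch of hot-to-cold tiering: keep active data on the parallel file system,
# push anything untouched for 90+ days to object storage.
# Mount point, bucket, and threshold are hypothetical; assumes boto3 + configured credentials.
import os
import time
import boto3

HOT_PATH = "/mnt/lustre/datasets"      # parallel file system mount (hypothetical)
BUCKET = "example-cold-archive"        # hypothetical S3 bucket
COLD_AFTER_DAYS = 90

s3 = boto3.client("s3")
cutoff = time.time() - COLD_AFTER_DAYS * 86_400

for root, _dirs, files in os.walk(HOT_PATH):
    for name in files:
        path = os.path.join(root, name)
        if os.path.getmtime(path) < cutoff:       # untouched for 90+ days
            key = os.path.relpath(path, HOT_PATH)
            s3.upload_file(path, BUCKET, key)     # archive to the cold tier
            os.remove(path)                       # free up hot capacity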

Building for Real Performance

Stop optimizing for vendor benchmarks. Here's what actually matters:

For Training:

  • Sequential read throughput: 10GB/s minimum
  • Checkpoint write speed: 5GB/s burst
  • Metadata operations: 10K+ ops/sec
  • Concurrent clients: 50-500 nodes

For Inference:

  • Random read IOPS: 100K+
  • Latency: <5ms P99
  • Cache hit ratio: >80%
  • Concurrent requests: 1000+
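
If it helps, here's a small sketch that encodes those targets and checks benchmark results against them; the "measured" numbers are placeholders for your own figures.

python
# Sanity-check a storage system against the training/inference targets above.
# The "measured" numbers are placeholders - substitute your own benchmark results.

TRAINING_TARGETS  = {"seq_read_gbps": 10, "ckpt_write_gbps": 5, "metadata_ops_sec": 10_000}
INFERENCE_TARGETS = {"rand_read_iops": 100_000, "p99_latency_ms": 5, "cache_hit_pct": 80}
LOWER_IS_BETTER = {"p99_latency_ms"}

def check(measured: dict, targets: dict) -> None:
    for metric, target in targets.items():
        value = measured[metric]
        ok = value <= target if metric in LOWER_IS_BETTER else value >= target
        print(f"{metric:18s} target {target:>9,} measured {value:>9,} {'OK' if ok else 'MISS'}")

check({"seq_read_gbps": 12, "ckpt_write_gbps": 4, "metadata_ops_sec": 15_000}, TRAINING_TARGETS)
check({"rand_read_iops": 140_000, "p99_latency_ms": 3, "cache_hit_pct": 85}, INFERENCE_TARGETS)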

The Performance Stack That Works:

Application Layer
    ↓
Caching Layer (Redis/Memcached)
    ↓
Parallel File System (Lustre/BeeGFS)
    ↓
Block Storage Layer
    ↓
Mixed Media:
- NVMe: Active datasets
- SSD: Recent data
- HDD: Archives

Migration Reality: Moving Without Bleeding

Every vendor promises "seamless migration." Here's what actually happens:

The Hidden Costs of Data Migration

Scenario: Migrating 500TB from legacy SAN to modern infrastructure

  • Vendor quote: $50K for migration services
  • Reality:
    • Downtime: 2 weekends × $100K revenue loss = $200K
    • Team time: 6 engineers × 2 weeks = $60K
    • Parallel running costs: 2 months × $30K = $60K
    • Unexpected reformatting: $25K
    • Actual cost: $395K
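
For planning, here are the same line items as a sketch you can rerun with your own figures; the numbers are the ones from this scenario, not general-purpose estimates.

python
# The migration line items above as a rerunnable sketch.
# Figures are from the 500TB scenario in the text - substitute your own.

line_items = {
    "vendor migration services":                50_000,
    "downtime (2 weekends x $100K revenue)":   200_000,
    "team time (6 engineers x 2 weeks)":        60_000,
    "parallel running (2 months x $30K)":       60_000,
    "unexpected reformatting":                  25_000,
}

for item, cost in line_items.items():
    print(f"{item:42s} ${cost:>8,}")
print(f"{'actual cost':42s} ${sum(line_items.values()):>8,}")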

The Migration Strategy That Actually Works

Phase 1: Parallel Running (Month 1-2)

  • New infrastructure alongside old
  • Mirror critical datasets
  • Test with non-production workloads
  • Cost: 2x infrastructure, but no downtime

Phase 2: Graduated Cutover (Month 2-3)

  • Move development first
  • Then staging/test
  • Production last
  • Rollback plan for each stage

Phase 3: Validation (Month 3-4)

  • Performance benchmarking
  • Data integrity checks
  • User acceptance testing
  • Keep old system warm

Phase 4: Decommission (Month 4+)

  • Archive historical data
  • Document lessons learned
  • Celebrate (seriously, migration is hard)

The Build vs. Buy Decision Nobody Gets Right

When Building Makes Sense

  • ✅ You have a team of storage experts
  • ✅ Your needs are truly unique
  • ✅ You have 18 months to get it right
  • ✅ TCO over 5 years beats commercial solutions

Reality check: This describes <5% of companies

When Buying Makes Sense

  • ✅ You need it working in 90 days
  • ✅ You want someone to blame (support)
  • ✅ Your team should focus on AI, not storage
  • ✅ TCO includes operational overhead

Reality check: This is 95% of companies

The Third Option: Ecosystem Approach

Don't build everything, don't buy from one vendor:

  • Storage OS from vendor A (with support)
  • Drives from vendor B (best price/performance)
  • Networking from vendor C (already have it)
  • Software layer from vendor D (best features)

Advantages:

  • No vendor lock-in
  • Best-of-breed everything
  • Competitive pricing
  • Flexibility to change

Requirements:

  • Vendor who plays well with others
  • Standards-based architecture
  • Strong integration support

Case Study: How a Retailer Fixed Their Foundation

The Situation:

  • 50TB of sales data in Oracle
  • 200TB of customer behavior in Hadoop
  • 500TB of video from stores in cold storage
  • 10TB of new data daily
  • AI initiative stalled for 8 months

The Problem:

  • Data scientists spending 80% of their time on data wrangling
  • 24-hour delay to access video data
  • $50K/month in cloud egress charges
  • Can't iterate fast enough to compete

The Solution:

  • Deployed parallel file system for hot data (last 30 days)
  • Automated tiering to object storage (30+ days)
  • Built data catalog with Apache Atlas
  • Implemented graduated access controls

The Results:

  • Data access time: 24 hours → 5 minutes
  • Model iteration time: 2 weeks → 2 days
  • Monthly cloud costs: $50K → $12K
  • Time to production: 8 months → 6 weeks

The Lesson: They didn't need more data or better models. They needed data that was ready to work.

Your Data Foundation Checklist

Before moving to the next section, ensure you have answers to:

Data Strategy

  • Where does your data live today?
  • What are your data gravity costs?
  • Have you mapped data flow from edge to core to cloud?
  • Do you know your hot/warm/cold data ratios?

Architecture Decisions

  • Lake, warehouse, or lakehouse?
  • Which protocols match your workloads?
  • What are your real performance requirements?
  • How will you handle compliance?

Migration Planning

  • What's your migration budget (real, not vendor quotes)?
  • Do you have a rollback plan?
  • Who owns the migration project?
  • What's your acceptable downtime?

Build vs Buy

  • Do you have storage expertise in-house?
  • What's your real TCO (including operations)?
  • Can you afford vendor lock-in?
  • Do you need ecosystem flexibility?

If you're not confident in at least 12 of these 16, stop building AI and fix your foundation first.

The Bottom Line

Your data foundation determines your AI ceiling. You can have the best models, the smartest data scientists, and the fastest GPUs - but if your data isn't organized, accessible, and performant, you're building on quicksand.

The winners in AI won't be the companies with the most data. They'll be the companies whose data is ready to work. And that starts with storage infrastructure that's designed for AI workloads, not retrofitted from the database era.
