How OpenAI Runs Spark: A Case Study in Hybrid Data Infrastructure

At the recent Open Lakehouse + AI Summit, OpenAI's data platform team gave a detailed account of how they run Spark internally. It's a revealing look at the operational reality of serving over a thousand internal customers across model training, analytics, safety research, and finance.
Their setup is representative of large-scale data platforms. They run both Databricks and a self-hosted "OpenAI Spark" on Kubernetes, unified through a shared Unity Catalog. Users switch between engines by changing a single configuration parameter. This hybrid pattern has become the norm for organizations processing data at serious volume, and OpenAI's experience illuminates why.
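The talk doesn't show the actual parameter name, but the engine-switching idea can be sketched in a few lines of Python. Everything here - the `engine` key, the backend names - is an illustrative assumption, not OpenAI's real configuration:

```python
# Hypothetical sketch: routing a job to one of two Spark backends based on a
# single configuration value. Key and backend names are illustrative, not
# OpenAI's actual config.

def select_backend(job_conf: dict) -> str:
    """Pick a submission backend from a single 'engine' config key."""
    engine = job_conf.get("engine", "databricks")  # default is assumed here
    backends = {
        "databricks": "databricks-jobs-api",       # managed service
        "self-hosted": "spark-on-k8s-gateway",     # OpenAI Spark on Kubernetes
    }
    if engine not in backends:
        raise ValueError(f"unknown engine: {engine}")
    return backends[engine]

print(select_backend({"engine": "self-hosted"}))  # spark-on-k8s-gateway
```

Because both backends read from the same Unity Catalog, flipping that one value changes where the job runs without changing what data it sees.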
The Hybrid Reality
Three forces push enterprises toward running their own Spark alongside managed services. First, data security requirements often mandate that sensitive workloads stay within controlled infrastructure - no amount of compliance certifications fully satisfies some internal security teams. Second, the economics shift at scale: organizations processing petabytes daily often find that self-hosted deployments dramatically reduce costs for predictable, high-volume workloads. Third, operating your own stack means you can debug it. Full source code access and the ability to implement workload-specific optimizations matter when you're troubleshooting production incidents.
Building the Infrastructure Layer
The OpenAI team's account of scaling self-hosted Spark follows a familiar trajectory. Initial deployment is straightforward - Spark on Kubernetes, Airflow integration, jobs start flowing - and then usage grows.
Kubernetes control plane limits surface first - API servers buckling under listing operations from thousands of concurrent jobs. The response is multiple clusters, which immediately creates routing problems. Static routing (annotating jobs with target clusters) proves operationally painful. The solution is a gateway service that handles dynamic routing, access control, quota tracking, and auto-tuning based on historical patterns. This is infrastructure that managed services provide invisibly, and that self-hosted deployments must build explicitly.
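The gateway's routing decision can be sketched as follows - a minimal, assumed policy (least-loaded cluster with headroom, plus a per-team quota check), standing in for whatever OpenAI's gateway actually does with historical patterns:

```python
# Hedged sketch of a gateway's dynamic routing decision: pick the target
# Kubernetes cluster by checking quota and current load rather than a static
# annotation. All names and policies here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    running_jobs: int
    capacity: int

def route_job(team: str, clusters: list[Cluster],
              quotas: dict[str, int], usage: dict[str, int]) -> str:
    """Return the least-loaded cluster with headroom, enforcing team quota."""
    if usage.get(team, 0) >= quotas.get(team, 0):
        raise RuntimeError(f"quota exceeded for team {team}")
    candidates = [c for c in clusters if c.running_jobs < c.capacity]
    if not candidates:
        raise RuntimeError("no cluster has free capacity")
    # Least-loaded-first is one plausible policy; a real gateway would layer
    # auto-tuning from historical job patterns on top of this.
    best = min(candidates, key=lambda c: c.running_jobs / c.capacity)
    usage[team] = usage.get(team, 0) + 1  # track consumption against quota
    return best.name
```

Static annotations disappear from job definitions entirely; the gateway owns placement, which is what makes adding or draining clusters invisible to users.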
Catalog integration across both managed and self-hosted environments requires careful coordination: permission verification, scoped credentials, distribution to executors. These are solved problems, but solving them yourself takes engineering time.
Performance at Petabyte Scale
OpenAI's talk gets more interesting when it turns to optimizations that don't appear in Spark documentation. Their CDC ingestion example is illustrative: at petabyte scale, Spark's default merge operation breaks down because mixed event types require outer joins that can't be broadcast. Their solution - splitting merges into separate operations for updates/deletes versus inserts - is the kind of pattern that emerges only from production experience.
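A toy version of the split makes the insight concrete. Plain Python dicts stand in for Spark DataFrames here; this is a sketch of the pattern, not OpenAI's implementation:

```python
# Toy illustration of merge-splitting: inserts need no join against the
# target table, so they can be applied as a plain append, while updates and
# deletes join on the key - a join small enough to broadcast once inserts
# are removed from the batch. Plain dicts stand in for DataFrames.

def split_merge(target: dict, cdc_events: list[dict]) -> dict:
    """Apply CDC events in two passes: updates/deletes, then inserts."""
    upserts = [e for e in cdc_events if e["op"] in ("update", "delete")]
    inserts = [e for e in cdc_events if e["op"] == "insert"]

    # Pass 1: updates/deletes -- a keyed inner join against the target,
    # which avoids the outer join a mixed-event merge would force.
    for e in upserts:
        if e["op"] == "delete":
            target.pop(e["key"], None)
        elif e["key"] in target:
            target[e["key"]] = e["row"]

    # Pass 2: inserts -- no join at all, just an append.
    for e in inserts:
        target[e["key"]] = e["row"]
    return target
```

The payoff at petabyte scale is that neither pass requires the non-broadcastable outer join that a single mixed merge does.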
Cloud storage API limits create another class of problems. Transaction-per-second caps become bottlenecks when scanning tables with extensive metadata. The optimizations are straightforward once you know to look: listing only from the last known commit, caching metadata, eliminating redundant status checks.
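The listing optimizations can be sketched with a fake object store standing in for cloud storage. The class and method names are illustrative assumptions, not Spark internals; the point is simply that each listing call costs a storage API request, so you only list past the last known commit and cache what you've already read:

```python
# Sketch of two of the optimizations: incremental listing from the last
# known commit, and metadata caching. A dict plays the role of the object
# store; api_calls counts simulated storage requests.

class MetadataReader:
    def __init__(self, store: dict[str, bytes]):
        self.store = store
        self.api_calls = 0
        self.last_seen_commit = -1
        self.cache: dict[int, bytes] = {}

    def _list(self, after: int) -> list[int]:
        self.api_calls += 1  # each listing is one storage API request
        commits = [int(k.split("/")[1]) for k in self.store if k.startswith("log/")]
        return sorted(c for c in commits if c > after)

    def read_new_commits(self) -> list[bytes]:
        """List only past the last known commit; serve repeats from cache."""
        out = []
        for c in self._list(after=self.last_seen_commit):
            if c not in self.cache:
                self.cache[c] = self.store[f"log/{c}"]
            out.append(self.cache[c])
            self.last_seen_commit = c
        return out
```

With naive full listing, every scan re-reads the entire log; here, a table with a long history costs one request per refresh regardless of how many commits already exist.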
The most impactful optimization they described involves recognizing what data doesn't need to be read at all. Merge operations that update rows based on key matching don't need to scan target table columns if the CDC payload already contains the necessary data. This column pruning yielded 98% reductions in data scanned for some of their workloads.
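The pruning logic reduces to a superset check. This sketch is ours, not OpenAI's code - but it captures the insight: when the CDC payload carries every column the merge will write, the target scan only needs the join key:

```python
# Toy version of the column-pruning insight behind the 98% scan reduction:
# a key-matched overwrite merge only needs target columns the payload
# cannot supply. Function and column names are illustrative assumptions.

def columns_to_scan(target_columns: list[str],
                    payload_columns: set[str], key: str) -> list[str]:
    """Columns the target scan must read for a key-matched overwrite merge."""
    if payload_columns >= set(target_columns):
        return [key]  # payload supplies every value; only the key is needed
    # Otherwise, read the key plus any column the payload can't supply.
    return [key] + [c for c in target_columns
                    if c not in payload_columns and c != key]

# A wide table where the payload is complete collapses to a key-only scan:
print(columns_to_scan(["id", "a", "b"], {"id", "a", "b"}, key="id"))  # ['id']
```

On a wide table stored in a columnar format, skipping every non-key column is where reductions on the order of the 98% figure come from.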
The Architectural Ceiling
Even with these optimizations, OpenAI's team acknowledged limitations that configuration tuning can't address. PySpark's split between the Python process and the JVM adds serialization overhead and makes failures harder to trace across the language boundary. JSON processing remains expensive. These are consequences of Spark's JVM-based architecture, and the industry is responding.
Remote shuffle services decouple shuffle data from executor lifecycles. Native acceleration engines process data in columnar format with SIMD instructions. This is the problem Flarion addresses directly - accelerating Spark workloads natively without requiring pipeline changes, targeting exactly the architectural constraints OpenAI describes. Organizations facing similar ceilings can evaluate whether native acceleration closes the gap before committing to the engineering investment of building their own optimization layer.
OpenAI's scale is unusual, but its challenges are not. Hybrid deployments, control plane scaling, storage API limits, the performance ceiling of JVM-based processing - these are what enterprises running Spark at scale consistently encounter. Their solutions represent current best practice. The question for most organizations is not whether they'll face these problems, but when - and whether they'll be ready.
