Blog

Storage Format Lock-in: The Constraint Limiting Query Performance

•

read time

•

November 17, 2025

Modern distributed processing engines achieve remarkable scale and reliability, yet they share a fundamental architectural constraint: rigid storage format requirements. This constraint forces organizations to accept significant performance penalties, sometimes 10-100x slower queries than technically possible, simply because their processing engine cannot adapt to optimized storage formats.

Why Parquet Falls Short for Selective Analytics

Parquet excels as a universal columnar format, but its design prioritizes compatibility and compression over query performance. Understanding its limitations requires examining how analytical queries actually execute.

The Row Group Problem

Parquet organizes data into row groups, typically 100,000 to 1 million rows each. This coarse granularity creates a fundamental problem: if even a single row in a row group matches your query filter, you must read ALL requested columns for the ENTIRE row group.

Consider a query filtering for a specific customer ID in a billion-row table. That customer's 100 transactions might be scattered across 100 different row groups. Parquet must read 100 row groups × 100,000 rows × all requested columns, processing 10 million rows to return 100.

Single-Phase Execution

Parquet readers process queries in a single phase:
‍

1. Read all requested columns for relevant row groups

2. Decompress the data

3. Apply filters

4. Return matching rows

This means reading and decompressing massive amounts of data that will immediately be discarded. There's no mechanism to filter first, then read only what's needed.

Limited Predicate Pushdown

While Parquet stores min/max statistics per row group and optional Bloom filters, these only help skip entire row groups. For analytical queries where some matching rows exist in many row groups, this provides minimal benefit. You still read entire 100,000-row chunks to extract perhaps 10 rows from each.

How Query-Optimized Formats Solve These Problems

Formats like ClickHouse's MergeTree take a fundamentally different approach:

Granular Storage

Instead of 100,000-row groups, data is organized into 8,192-row granules. This 12x finer granularity means reading much less unnecessary data when matches are sparse. Finding those same 100 customer transactions requires reading ~12x less data just from granularity alone.

Two-Phase Execution with PREWHERE

Query-optimized formats implement two-phase execution:

1. Phase 1: Read ONLY filter columns for candidate granules

2. Phase 2: For rows that pass filters, read the remaining requested columns

This seemingly simple change has a profound impact. Instead of reading 20 columns for millions of rows, you read 1-2 filter columns first, identify the 100 matching rows, then read the other 18 columns for just those 100 rows.

Sparse Indexing

A sparse primary index stores one entry per granule (every 8,192 rows), creating a tiny index that fits entirely in memory even for billion-row tables. Binary search on this index instantly identifies which granules to read, eliminating 99%+ of data before any I/O occurs.

Lazy Materialization

Column reads are deferred until absolutely necessary. If a query has multiple filters, the engine applies them progressively, reading additional columns only for rows that survive each filter. This minimizes decompression work and memory bandwidth usage.

The Performance Gap: A First-Principles Calculation

Let's calculate the actual performance difference using a realistic analytical query on data stored in S3:

Scenario:

100 million rows, 100 columns
Query with 0.01% selectivity returning 10,000 rows
Projects 20 columns
All data in S3 (network I/O is the same for both formats)

Parquet Execution:

Data organized in 100,000-row groups
10,000 matching rows spread across ~100 row groups
Must read all 20 requested columns for entire row groups containing any match
Compressed data transferred from S3: 160 MB (with 10:1 compression)
Data decompressed and processed: 1.6 GB

Query-Optimized Format (like ClickHouse MergeTree):

Data organized in 8,192-row granules with sparse index
Two-phase execution: read filter columns first, identify matches, then read other columns
Compressed data transferred from S3: 46 MB (with 3.5:1 compression)
Data decompressed and processed: 162 MB

Despite storing data 2.9x larger on S3 due to less aggressive compression, the query-optimized format transfers 3.5x less data over the network and processes 10x less decompressed data. Processing 1.6 GB vs 162 MB of decompressed data is the difference between 100ms and 10ms query time

Why Processing Engines Don't Support Alternative Formats

While Spark and Ray technically could support arbitrary storage formats through extension APIs, in practice they don't. The ecosystem has converged on Parquet, ORC, and Avro, leaving significant performance opportunities unexplored.

The Spark Reality

Spark provides a DataSource V2 API for custom formats, yet virtually no production deployments use them. The practical barriers:

Format implementations must handle Spark's complex internal row representation
Custom formats cannot leverage Spark's vectorized execution optimizations
The Catalyst optimizer cannot reason about custom format capabilities
Maintaining custom format readers requires deep Spark internals knowledge

Ray's Format Limitations

Ray delegates data loading to Pandas or PyArrow:

python

@ray.remote

def process_partition(file_path):

# Limited to formats Pandas supports

df = pd.read_parquet(file_path)

return df.groupby('device_id').mean()

Adding optimized formats would require writing C++ extensions, ensuring serialization compatibility, and maintaining format-specific optimizations Ray doesn't understand.

The Architectural Impedance Mismatch

Even when custom format support exists, fundamental architectural assumptions prevent leveraging format-specific optimizations. Standard engines read all requested columns before filtering and lack the execution primitives for two-phase execution. The granularity mismatch between coarse row groups and fine granules cannot be bridged through format plugins.

Real-World Impact on S3-Based Data Lakes

Production systems using S3 data lakes demonstrate concrete impact:

E-commerce Recommendation Engine

Dataset in S3: 500M products × 800 attributes
Parquet in Spark: 8.2 seconds average latency
Possible with optimized format: 100ms
Impact: Real-time recommendations impossible

Financial Risk Analytics

Dataset in S3: 10 years of trades, 2,000 columns
Parquet in Spark: 45 seconds per portfolio
Possible with optimized format: 500ms
Impact: Risk managers wait minutes for updates that could take seconds

IoT Monitoring

Dataset in S3: 100,000 devices × 1,000 metrics
Parquet in Ray: 12 seconds per device group
Possible with optimized format: 50ms
Impact: Alert latency prevents real-time response

The Compression-Performance Trade-off

Parquet achieves superior compression, typically 2-3x better than query-optimized formats. A 100GB dataset might compress to 10GB in Parquet versus 30GB in an optimized format, translating to $2.30/month versus $6.90/month on S3.

However, compute costs dominate this equation. When queries run 10-100x faster with optimized formats, the required infrastructure shrinks proportionally. A workload requiring a 16-node Spark cluster at $8,000/month might run on 2 nodes at $1,000/month with format optimization. The $7,000/month compute savings overwhelms the $4.60/month additional S3 storage cost by a factor of 1,500x.

Performance Patterns by Query Type

Different query patterns show varying sensitivity to format selection:

Highly Selective Queries (<0.1% of rows returned)

Performance difference: 20-100x
Parquet's coarse row groups create massive read amplification

Wide Table Projections (many columns, few rows)

Performance difference: 10-50x
Single-phase execution forces reading all columns upfront

Aggregation Queries

Performance difference: 5-20x
Processing unnecessary rows dominates computation

Full Table Scans

Performance difference: 0.7-1.4x (Parquet often faster)
Parquet's superior compression provides advantage when reading everything

The Hidden Costs of Format Lock-in

Beyond direct performance impact, format rigidity creates cascading inefficiencies:

Infrastructure Over-provisioning: Organizations scale clusters to compensate for format inefficiency. A workload naturally requiring 2 nodes might run on 20 nodes to achieve acceptable latency.

Architectural Workarounds: Teams implement pre-aggregation pipelines, materialized views, and caching layers, each adding operational complexity without addressing the root cause.

Opportunity Costs: When queries take 30 seconds instead of 300ms, entire categories of applications become infeasible.

The Path Forward

Parquet remains excellent for data interchange, archival storage, and full-scan workloads. The opportunity lies in enabling processing engines to leverage format diversity based on workload requirements.

The ideal architecture would support understanding the profile of the workloads, adjusting compaction accordingly, and using the best data format for the job. Current processing engines cannot implement this strategy due to their format rigidity. They lack the flexibility to transparently read different formats while applying format-specific optimizations.

This limitation is beginning to change. Next-generation execution engines recognize that format flexibility is essential for modern analytical workloads. By decoupling query execution from storage assumptions, systems like Flarion can deliver the performance that has always been technically possible but practically unreachable.

Organizations no longer need to accept 10x performance penalties as the price of distributed processing. The ability to match storage formats to query patterns represents the difference between queries that frustrate users and analytics that drive real-time decisions.

‍

Resources

Blog

This is some text inside of a div block.

Why the World Needs Flarion

Unlocking Data's Full Potential

read time

November 27, 2024

Did you know that data-driven organizations spend up to 40% of their IT budgets on data processing alone?

As organizations scale their data processing capabilities, two critical challenges emerge: the mounting costs of processing big data and the pressing need for faster performance. Today, we're sharing our journey and explaining why Flarion is transforming how organizations leverage their data assets while staying competitive in an increasingly data-driven world.

The Journey to Better Data Processing

Through years of experience across diverse industries, Flarion’s co-founders witnessed the universal struggle of escalating data processing costs and performance bottlenecks.

During his years building data processing systems for mass-scale consumer applications and autonomous vehicles, Ran experienced firsthand how organizations struggled with the growing costs and performance demands of expanding datasets. In consumer applications, better insights can help create great experiences for hundreds of millions of people, but the high computational costs and processing limitations often make this prohibitively expensive. In autonomous vehicles, data processing at scale allows us to understand and tackle the toughest "long tail" challenges, but technical limitations can make this slow and cost-inefficient.

Through his extensive work with enterprises across various industries, Udi observed a consistent pattern: organizations were hitting both a performance and cost ceiling in their data processing capabilities. Despite significant investments in infrastructure and talent, companies found themselves constrained by processing limitations that held back their ability to launch new features or products while managing escalating infrastructure costs.

The Evolution of Data Processing Needs

The landscape of data processing has evolved dramatically. What started as simple analytics has transformed into complex data pipelines processing hundreds of terabytes daily. These diverse challenges underline the pressing need for solutions that address both speed and cost at scale.

In automotive, processing speed directly impacts vehicle safety and performance, while processing costs affect vehicle affordability and market competitiveness. In financial services, faster data processing enables real-time decision-making and better risk assessment, but the infrastructure costs of high-frequency trading and real-time analytics can quickly erode profit margins. For e-commerce companies, efficient data processing means better customer recommendations and inventory management, yet the cost of processing massive customer datasets across global markets can be prohibitive. Almost every industry relies heavily on efficient data processing and analytics, making both speed and cost optimization critical factors in maintaining competitive advantage.

A New Approach to Performance

Traditional approaches to improving data processing often involve extensive code changes, specialized expertise, or specific deployment requirements. For enterprises with massive legacy codebases, these solutions are often impractical or impossible to implement, creating additional complexity without solving the fundamental challenges of performance and cost efficiency.

We built Flarion with a different vision: what if organizations could dramatically improve their data processing performance without changing their code or disrupting existing workflows? With new Spark, Hadoop and Ray execution engines, we've created a solution that delivers up to 3x performance improvement while maintaining robust reliability and full compatibility. Most importantly, Flarion can be implemented in just 5 minutes, requiring minimal effort from organizations looking to modernize their data stack.

Enabling Innovation Through Efficiency

The impact of accelerated data processing extends far beyond just faster completion times. When organizations can process their data more efficiently and cost-effectively, they can explore new use cases, launch innovative features, and focus on extracting value from their data rather than managing infrastructure costs.

For AI and machine learning applications, efficient data processing is becoming increasingly crucial. The ability to process large datasets quickly and reliably can mean the difference between a successful model deployment and a missed opportunity. With Flarion, organizations can focus on innovation rather than infrastructure optimization, all while maintaining their existing codebase and operations.

The Future of Data Processing

As we enter an era where data drives competitive advantage, organizations need solutions that enable them to process more data, faster and more cost-effectively. The future of data processing isn't just about handling today's workloads - it's about being ready for tomorrow's challenges while managing costs sustainably.

With Flarion, organizations are not just keeping pace—they’re leading the charge into a data-driven future. Our solution enables organizations to unlock the full potential of their data assets, whether they're running data processing in the cloud or on-premises. By delivering significant performance improvements through advanced optimization techniques, we're helping organizations process their data more efficiently while reducing their infrastructure costs. Most importantly, we're doing this in a way that respects the reality of enterprise systems - with a solution that can be implemented in minutes, not months.

The future of data processing should empower organizations to focus on innovation and value creation without being held back by legacy infrastructure or rising costs.

At Flarion, we're making that future a reality.

Featured Blog

Blog

AI Agents Are the New Users of Data Infrastructure

read time

March 30, 2026

The consumer of data infrastructure is changing. For two decades, the primary user of platforms like Spark, Ray, and Trino has been a human engineer - someone who writes queries, tunes configurations, debugs failures, and builds intuition about their cluster over months of operation. Now, AI agents are increasingly the ones writing SQL, submitting Spark jobs, orchestrating ETL pipelines, and running analytical workloads. Enterprises are increasingly adopting agentic data engineering workflows. At Flarion, we see a few key dynamics at play.

‍

Tuning Will Converge - Then What?

Configuration tuning is getting automated. Agents can already follow runbooks, detect common failure patterns, and iterate on Spark configurations across runs. Auto-tuning systems are getting smarter. Over time, the combination of agents, automated tooling, and better platform defaults will squeeze most of the available performance out of configuration space. Executor sizing, partition counts, memory fractions, shuffle parameters - these will increasingly be handled by software rather than tribal knowledge.

Once configurations are well-tuned, the remaining performance gap is bounded by the execution engine itself. Spark's JVM-based, row-oriented architecture has fundamental overhead - garbage collection pauses, serialization costs, per-row function call overhead - that no amount of configuration tuning can eliminate. You can tune a Spark job perfectly and still leave 30-60% of potential performance on the table because of the engine's underlying execution model.

Flarion operates below this ceiling. Its native execution engine, built in Rust on Apache Arrow and DataFusion, replaces Spark's physical operators with vectorized execution. The gains come from a fundamentally different execution model. This means Flarion's acceleration stacks on top of whatever tuning an agent or auto-tuner has already done. Even the most optimized jobs get meaningfully faster.

For agents, this is the critical property. After exhausting configuration changes, an agent has two levers: continue building workload-specific tuning rules (low marginal benefit), or enable Flarion in a single configuration line and get 30-60% improvement immediately. In practice, it will do both - but the plugin delivers value on day one while tuning accumulates value over months.

‍

Separation of Concerns

There's also a trust dimension. Even as agents become capable of modifying code and infrastructure, most organizations won't want them to. Separation of concerns applies to agents the same way it applies to engineering teams: the system writing analytical queries should not also be the system reconfiguring the cluster, rewriting pipeline code, or modifying deployment infrastructure. Giving an agent broad permissions to change Spark configurations, refactor ETL logic, or resize clusters introduces risk that most platform teams won't accept.

This means agents will operate within boundaries. They'll submit workloads, not redesign them. They'll use tools, not modify the toolchain. The performance improvements available to an agent are limited to what can be achieved without crossing those boundaries.

Flarion fits precisely within these boundaries - a single configuration change that delivers material performance gains without touching application code, cluster settings, or deployment infrastructure. The agent stays in its lane. The acceleration happens underneath.

‍

Agents Are Measurement-Driven

Agents evaluate tools the way engineers wish they could: purely on observed results. An agent running a Spark job with Flarion enabled will observe that the job completed 40% faster at the same resource cost. It will observe this consistently across runs. Same input, same output, less time, less money. The signal is clean and unambiguous.

This matters because agents will increasingly be the ones deciding which tools to use within their permitted boundaries. They'll A/B test configurations, track cost-per-query over time, and converge on whatever delivers the best results. Tools that provide clear, repeatable value survive this selection process. Tools that require persuasion or organizational buy-in to demonstrate value don't - because agents aren't amenable to persuasion.

Flarion's 30-60% acceleration on real production workloads, validated across companies processing billions of records daily, is exactly the kind of signal that agents optimize for.

‍

The Zero-Config Threshold

The combination of bounded permissions and measurement-driven evaluation creates clear selection pressure on infrastructure. The platforms that agents will adopt are the ones that cross what might be called the zero-config threshold: the point where a tool can be activated and deliver value without requiring expertise.

Consider what an agent requires of data infrastructure. First, activation must be trivial - a single parameter, a plugin, a flag. Something that's easily testable and verifiable. Second, failure modes must be graceful. If something isn't supported, the system should fall back silently rather than throw an error the agent must handle. Agents are poor at diagnosing infrastructure-specific failures; they need systems that degrade predictably rather than fail unexpectedly. Third, the cost model must be transparent. In cloud environments, wall-clock time is cost. An agent optimizing for efficiency needs tools where faster execution directly equals lower spend, without requiring hardware-specific provisioning decisions.

There's a broader principle at work here: infrastructure designed to be usable by an agent will also be easier for a human. Every property that makes a tool agent-friendly - trivial activation, graceful fallback, transparent cost - also makes it friendlier to the human engineer who doesn't have time to read a tuning guide. Building for agents raises the floor for everyone.

Flarion crosses this threshold by design. Its native execution engine intercepts Spark's physical execution plan and replaces supported operators with vectorized native execution. Unsupported operations fall back transparently to Spark. The agent never sees a Flarion-specific error. It never needs to know which operations are accelerated and which aren't. The entire acceleration layer is invisible to the caller, which is precisely what makes it usable by an agent.

‍

Why Battle-Tested Ecosystems Win

Agents will prefer established ecosystems for the same reasons enterprises do: proven reliability at scale, broad connector support, extensive documentation that language models can reason about, and operational patterns that are well-understood. Spark processes petabytes daily across thousands of organizations. Ray orchestrates ML workloads at companies pushing the boundaries of model training. These platforms have accumulated years of production hardening that no new system can replicate quickly.

Making these ecosystems perform better without requiring expertise is where the real leverage lies. Flarion takes this approach across engines. Today it accelerates Spark workloads across every major deployment - open-source Spark, EMR, Dataproc, Databricks, and Spark on Kubernetes. The same Rust-based execution engine extends to Ray Data pipelines and Trino. An agent building a multi-engine analytical workflow gets consistent acceleration everywhere, through the same mechanism: enable the plugin, get faster results. No engine-specific optimization logic. No architectural trade-offs to evaluate.

This also means no new attack surface. Flarion runs as an in-process plugin inside the existing environment. No data leaves the perimeter. An agent can enable acceleration without triggering security reviews or compliance concerns - a friction reducer that matters enormously for enterprise adoption of agentic workflows.

‍

Where This Goes

The logical endpoint of this trend is outcome-oriented infrastructure - systems where agents submit workloads with constraints like "as cheap as possible, under 20 minutes" and the platform figures out the rest. The infrastructure handles resource allocation, configuration tuning, hardware routing, and failure recovery autonomously.

Flarion is building toward this future. The vision is autonomous execution where workloads are submitted with SLA targets and the system handles everything else - auto-tuning, auto-scaling, auto-recovery. The interface an agent actually wants: declare the outcome, let the infrastructure deliver it.

The building blocks are here today. A native execution engine that eliminates JVM overhead. Vectorized processing that leverages modern hardware. Transparent fallback that guarantees compatibility. Cross-engine support that works wherever the workload runs.

The agents are already here. The question is which platforms are ready for them.

‍

Featured Blog

Blog

How OpenAI Runs Spark: A Case Study in Hybrid Data Infrastructure

read time

February 2, 2026

At the recent Open Lakehouse + AI Summit, OpenAI's data platform team gave a detailed account of how they run Spark internally. It's a revealing look at the operational reality of serving over a thousand internal customers across model training, analytics, safety research, and finance.

Their setup is representative of large-scale data platforms. They run both Databricks and a self-hosted "OpenAI Spark" on Kubernetes, unified through a shared Unity Catalog. Users switch between engines by changing a single configuration parameter. This hybrid pattern has become the norm for organizations processing data at serious volume, and OpenAI's experience illuminates why.

The Hybrid Reality

Three forces push enterprises toward running their own Spark alongside managed services. First, data security requirements often mandate that sensitive workloads stay within controlled infrastructure - no amount of compliance certifications fully satisfies some internal security teams. Second, the economics shift at scale: organizations processing petabytes daily often find that self-hosted deployments dramatically reduce costs for predictable, high-volume workloads. Third, operating your own stack means you can debug it. Full source code access and the ability to implement workload-specific optimizations matter when you're troubleshooting production incidents.

Building the Infrastructure Layer

The OpenAI team's account of scaling self-hosted Spark follows a familiar trajectory. Initial deployment is straightforward - Spark on Kubernetes, Airflow integration, jobs start flowing - and then usage grows.

Kubernetes control plane limits surface first - API servers buckling under listing operations from thousands of concurrent jobs. The response is multiple clusters, which immediately creates routing problems. Static routing (annotating jobs with target clusters) proves operationally painful. The solution is a gateway service that handles dynamic routing, access control, quota tracking, and auto-tuning based on historical patterns. This is infrastructure that managed services provide invisibly, and that self-hosted deployments must build explicitly.

Catalog integration across both managed and self-hosted environments requires careful coordination: permission verification, scoped credentials, distribution to executors. These are solved problems, but solving them yourself takes engineering time.

Performance at Petabyte Scale

OpenAI's talk gets more interesting when it turns to optimizations that don't appear in Spark documentation. Their CDC ingestion example is illustrative: at petabyte scale, Spark's default merge operation breaks down because mixed event types require outer joins that can't be broadcast. Their solution - splitting merges into separate operations for updates/deletes versus inserts - is the kind of pattern that emerges only from production experience.

Cloud storage API limits create another class of problems. Transaction-per-second caps become bottlenecks when scanning tables with extensive metadata. The optimizations are straightforward once you know to look: listing only from the last known commit, caching metadata, eliminating redundant status checks.

The most impactful optimization they described involves recognizing what data doesn't need to be read at all. Merge operations that update rows based on key matching don't need to scan target table columns if the CDC payload already contains the necessary data. This column pruning yielded 98% reductions in data scanned for some of their workloads.

The Architectural Ceiling

Even with these optimizations, OpenAI's team acknowledged limitations that configuration tuning can't address. PySpark's architecture creates both performance overhead and debugging complexity. JSON processing remains expensive. These are consequences of Spark's JVM-based architecture, and the industry is responding.

Remote shuffle services decouple shuffle data from executor lifecycles. Native acceleration engines process data in columnar format with SIMD instructions. This is the problem Flarion addresses directly - accelerating Spark workloads natively without requiring pipeline changes, targeting exactly the architectural constraints OpenAI describes. Organizations facing similar ceilings can evaluate whether native acceleration closes the gap before committing to the engineering investment of building their own optimization layer.

OpenAI's scale is unique, but its challenges are common. Hybrid deployments, control plane scaling, storage API limits, the performance ceiling of JVM-based processing - these are what enterprises running Spark at scale consistently encounter. Their solutions represent current best practice. The question for most organizations is when they'll face these problems, and whether they'll be ready.

‍

Thank you

You have been subscribed.

Oops! Something went wrong while submitting the form.

Storage Format Lock-in: The Constraint Limiting Query Performance

Why Parquet Falls Short for Selective Analytics

The Row Group Problem

Single-Phase Execution

Limited Predicate Pushdown

How Query-Optimized Formats Solve These Problems

Granular Storage

Two-Phase Execution with PREWHERE

Sparse Indexing

Lazy Materialization

The Performance Gap: A First-Principles Calculation

Why Processing Engines Don't Support Alternative Formats

The Architectural Impedance Mismatch

Real-World Impact on S3-Based Data Lakes

The Compression-Performance Trade-off

Performance Patterns by Query Type

The Hidden Costs of Format Lock-in

The Path Forward

Related Posts

The Journey to Better Data Processing

The Evolution of Data Processing Needs

A New Approach to Performance

Enabling Innovation Through Efficiency

The Future of Data Processing

Tuning Will Converge - Then What?

Separation of Concerns

Agents Are Measurement-Driven

The Zero-Config Threshold

Why Battle-Tested Ecosystems Win

Where This Goes

The Hybrid Reality

Building the Infrastructure Layer

Performance at Petabyte Scale

The Architectural Ceiling

Subscribe today

Thank you