The Decade That Defined Modern Data Infrastructure
Between 2012 and 2023, the data infrastructure market produced some of the most significant venture returns in technology history. Three companies in particular stand out not merely for the scale of their outcomes, but for what they reveal about the structural dynamics that create category-defining data infrastructure businesses. Snowflake went public in September 2020 in the largest software IPO to that point, pricing at a valuation exceeding $33 billion—its shares more than doubled on the first day of trading—on revenues that had been essentially zero five years earlier. Databricks, still private as of 2023, closed a Series I at a $43 billion valuation, making it one of the most valuable private technology companies in the world. dbt Labs, which builds the open-source transformation layer that has become a standard component of the modern data stack, raised a $222 million Series D in 2022 at a valuation that reflected its commanding position at the center of how data teams work.
These are not flukes. They are the product of specific structural conditions that serious investors in data infrastructure have learned to identify and back early. At DataInx Ventures, our Seed investment thesis in data infrastructure is built directly on the patterns these companies established. This piece examines what those patterns are, why they produced the outcomes they did, and how they inform what we look for in the next generation of founders building at the foundation of the data stack.
Snowflake: The Architecture Arbitrage Playbook
Snowflake was founded in 2012 by three data warehouse veterans who had spent their careers building and operating the systems that enterprises relied on for analytics. Their core insight was not that enterprises needed better analytics. It was that the dominant data warehouse architecture—built around tightly coupled compute and storage, designed for on-premise deployment, and requiring significant upfront capacity planning—was structurally incompatible with the cloud era.
The founders observed that the economics of cloud storage and compute were diverging. Storage costs were falling toward near-zero. Compute could be provisioned and released in seconds. Yet the dominant data warehouse vendors, including Teradata, IBM Netezza, and the earlier iterations of Oracle Exadata, were designed around assumptions that were artifacts of an on-premise world: that storage and compute had to be tightly coupled because network latency between them was prohibitive, and that capacity had to be provisioned ahead of time because spinning up new resources took days or weeks.
Snowflake's architecture separated storage from compute completely. Data lived in Amazon S3, cost-effectively, at virtually unlimited scale. Compute clusters, called virtual warehouses, could be spun up in seconds, scaled to any size, and paused when not in use. Multiple independent compute clusters could read the same data simultaneously, eliminating the contention problems that plagued shared-resource architectures. The result was a product that was cheaper, faster, and more flexible than anything incumbents could offer—not because Snowflake was marginally better, but because the underlying architectural economics were categorically superior.
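The economics of that separation can be made concrete with a toy cost model. Everything below is an illustrative sketch with assumed prices and workload figures, not Snowflake's (or any vendor's) actual pricing; the point is only that a bursty analytics workload pays for peak compute around the clock under a coupled architecture, and only for the hours it actually uses under a decoupled one.

```python
# Toy comparison of coupled vs. decoupled warehouse economics.
# All prices and workload figures are illustrative assumptions,
# not Snowflake's (or any vendor's) actual pricing.

STORAGE_PRICE_TB_MONTH = 23.0   # commodity object storage, $/TB-month
COMPUTE_PRICE_NODE_HOUR = 2.0   # $/node-hour
HOURS_PER_MONTH = 730

def coupled_monthly_cost(data_tb, peak_nodes):
    """Coupled architecture: compute is provisioned for peak load and
    runs around the clock; storage rides along with the cluster."""
    return (peak_nodes * COMPUTE_PRICE_NODE_HOUR * HOURS_PER_MONTH
            + data_tb * STORAGE_PRICE_TB_MONTH)

def decoupled_monthly_cost(data_tb, peak_nodes, busy_hours):
    """Decoupled architecture: storage is paid for continuously, but
    compute is billed only for the hours queries actually run."""
    return (peak_nodes * COMPUTE_PRICE_NODE_HOUR * busy_hours
            + data_tb * STORAGE_PRICE_TB_MONTH)

# A bursty analytics workload: 100 TB of data, a 16-node peak,
# but only about 80 busy compute-hours per month.
coupled = coupled_monthly_cost(100, 16)
decoupled = decoupled_monthly_cost(100, 16, busy_hours=80)
print(f"coupled:   ${coupled:,.0f}/month")    # $25,660/month
print(f"decoupled: ${decoupled:,.0f}/month")  # $4,860/month
```

The arithmetic is deliberately crude, but it captures why the gap was categorical rather than marginal: the coupled bill scales with provisioned peak, the decoupled bill with actual use.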
The Snowflake story teaches several things about what creates durable advantage in data infrastructure. First, genuine architectural innovation—not feature differentiation—is the basis for category creation. Snowflake did not win by having a better query optimizer than Teradata. It won by making Teradata's entire architectural model economically obsolete. Second, timing relative to platform transitions matters enormously. The cloud storage cost curve made Snowflake's architecture viable in 2012 in a way it could not have been in 2005. Founders who identify the point at which a platform transition makes a new architectural model superior to the incumbent model are positioned to build category-defining companies.
Third, the enterprise willingness to pay for data infrastructure is genuinely extraordinary once trust is established. Snowflake's average contract values and expansion rates were among the highest ever seen at IPO in enterprise software. The reason is structural: data is becoming a core operating input for nearly every enterprise function, and the infrastructure that makes data reliably accessible is not discretionary spending. When Snowflake demonstrated that it was production-reliable, the spend followed at scale.
Databricks: The Open-Source Land-and-Expand Model
Databricks has a different origin story that illuminates a different set of structural dynamics. The company was founded in 2013 by the creators of Apache Spark, the open-source distributed computing framework that had already become the standard for large-scale data processing at companies including Facebook, Netflix, and Airbnb. The founding team spun out of UC Berkeley's AMPLab with both the technical credibility to define the category and the community distribution advantage that open-source provenance provides.
The core insight behind Databricks was that the machine learning and data engineering communities needed a unified platform that could handle the full lifecycle of data work—from large-scale data preparation through model training through production deployment—without requiring engineers to cobble together half a dozen separate tools that were difficult to integrate and expensive to operate at scale. Apache Spark provided the computational foundation. Databricks built the managed platform, the collaborative notebooks, the MLflow experiment tracking, and the Delta Lake storage layer that turned Spark from a powerful but operationally demanding framework into a production-ready platform that data teams could actually use.
What Databricks demonstrates most clearly is the power of open-source distribution as a go-to-market strategy for infrastructure companies. By maintaining Apache Spark as an open-source project and ensuring that Databricks remained the most capable way to run it, the company acquired distribution at essentially no cost to itself. Data engineers learned Spark in university courses, used it at their current jobs, and carried that knowledge with them when they moved to new companies. Every new hire who knew Spark was a potential champion for the Databricks managed platform. The community flywheel was self-reinforcing and structurally difficult for closed-source competitors to replicate.
The most defensible position in data infrastructure is not the best product at launch. It is the platform that becomes the default workflow for the most influential data practitioners in the market.
The other dimension of the Databricks story that merits careful attention is the company's evolution from a Spark managed service into a unified analytics and AI platform. The Delta Lake storage format, introduced in 2019, transformed Databricks from a compute platform into a data platform by adding ACID transactions, schema enforcement, and time-travel capabilities to the data lake. Unity Catalog, introduced in 2021, added enterprise governance capabilities. The pattern—start with a specific, technically superior product that earns the trust of data engineers, then systematically expand the platform scope—is one we see repeatedly in the most successful data infrastructure companies.
By 2023, Databricks was generating over a billion dollars in annual recurring revenue and growing at rates that most public software companies could not match. The $43 billion valuation it commanded reflected not just its current revenue position but the breadth of the platform it had assembled and the structural difficulty of displacing it once an enterprise had committed its data workflows to Delta Lake.
dbt Labs: Community-Led Category Creation
The dbt Labs story is in some ways the most instructive for the Seed-stage investment thesis, because it demonstrates how a relatively focused open-source tool can define and own a category that sits at a critical intersection of the data stack, even when that category did not previously exist as a named market.
dbt—which stands for data build tool—was originally a side project created by Fishtown Analytics, the data consultancy that eventually became dbt Labs. The tool addressed a specific, painful problem that data analysts faced constantly: the process of transforming raw data in a data warehouse into clean, analytics-ready models was managed through a combination of SQL scripts, spreadsheets, and institutional knowledge, with no version control, no testing framework, and no documentation standard. Every data team reinvented this process independently, and the results were generally fragile, opaque, and difficult to maintain.
dbt introduced a disciplined workflow for data transformation: models defined in SQL files, managed with version control, tested against data quality assertions, and documented through a lightweight metadata layer. It was not technically complex. But it was precisely the right tool for the moment when data analysts were beginning to operate more like software engineers—deploying code, managing dependencies, and taking ownership of production data pipelines—and needed the tooling to match that operating model.
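The workflow dbt introduced can be sketched in miniature. The following is a hedged illustration, not dbt itself: the model name, test format, and runner are invented for this example, and real dbt adds Jinja templating, a dependency graph, documentation, and much more. Python's built-in sqlite3 stands in for the warehouse.

```python
# A dbt-inspired miniature: transformation models defined as SQL,
# materialized in order, then checked with data-quality assertions.
# Model and test names are invented for illustration; real dbt adds
# Jinja templating, a dependency graph, docs, and far more.
import sqlite3

MODELS = {
    # staging model: deduplicate raw orders and drop null amounts
    "stg_orders": """
        CREATE TABLE stg_orders AS
        SELECT DISTINCT id, customer, amount
        FROM raw_orders
        WHERE amount IS NOT NULL
    """,
}

TESTS = [
    # (description, query that must return zero rows to pass)
    ("stg_orders.id is unique",
     "SELECT id FROM stg_orders GROUP BY id HAVING COUNT(*) > 1"),
    ("stg_orders.amount is not null",
     "SELECT * FROM stg_orders WHERE amount IS NULL"),
]

def run(conn):
    """Materialize every model, then return the failing test names."""
    for ddl in MODELS.values():
        conn.execute(ddl)
    return [desc for desc, check in TESTS
            if conn.execute(check).fetchall()]  # any rows == failure

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, "acme", 9.5), (1, "acme", 9.5), (2, "globex", None)])
failures = run(conn)
print(failures or "all tests passed")
```

Because the models and tests are plain text, they can live in version control, run in CI, and serve as documentation, which is precisely the software-engineering discipline dbt brought to analytics work.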
The growth of dbt was driven almost entirely by community adoption. The dbt Slack community grew to tens of thousands of members. The dbt documentation became the standard reference for how analytics engineering should work. Job postings for data roles began specifying dbt proficiency as a requirement. By the time dbt Labs raised its $222 million Series D in 2022, the tool had become so central to the modern data stack that Snowflake, Databricks, and every major cloud data warehouse had certified integrations with it. The Series D valued the company at over $4 billion, a remarkable outcome for a business built around a relatively simple open-source tool.
What dbt Labs demonstrates is that category definition is itself a form of competitive moat. By naming and standardizing the analytics engineering discipline, dbt Labs made itself synonymous with that discipline. Engineers who learned data transformation learned it through dbt. Enterprises that standardized on dbt created internal tooling and workflows that depended on it. The switching cost was not the cost of replacing the tool; it was the cost of retraining teams and rebuilding workflows that had been organized around the dbt mental model for years.
Pattern Recognition: What These Companies Share
Across Snowflake, Databricks, and dbt Labs, several structural patterns recur that inform DataInx's approach to evaluating Seed investments in data infrastructure.
Architectural Leverage at the Right Moment
All three companies built at a moment when a platform transition—the shift to cloud-native architecture, the rise of large-scale distributed data processing, the maturation of cloud data warehouses—made a new architectural model categorically superior to what existed before. They did not win by being incrementally better. They won by making the incumbent approach economically or technically obsolete. Identifying these platform transition moments early, before the market has coalesced around a dominant player, is the primary task for a Seed investor in data infrastructure.
Developer-Led Distribution
None of these companies built their initial user base through enterprise sales motions. They built it by creating products that individual engineers found genuinely useful and adopted independently. The enterprise sales infrastructure came later, built on top of a distribution base that was already wide and growing. This pattern—sometimes called product-led growth, sometimes developer-led growth—is structurally important because it creates a distribution advantage that is very difficult for sales-led competitors to replicate. A company cannot buy the kind of community trust that dbt Labs built over five years of consistent engagement with the analytics engineering community.
Infrastructure-Level Switching Costs
Once an enterprise committed its data workflows to Snowflake's architecture, its production pipelines to Databricks and Delta Lake, or its transformation logic to dbt, the cost of switching was not primarily the cost of the new software. It was the cost of rebuilding institutional knowledge, retraining engineering teams, migrating years of accumulated data models, and renegotiating the integrations that adjacent systems had built against the incumbent platform. These switching costs compound over time. They are what turns a technically superior product into a business with durable pricing power and revenue predictability.
Expansion Revenue That Reflects Business Growth
All three companies were built on pricing models that naturally expanded as customer businesses grew. Snowflake's consumption-based pricing meant that as enterprises processed more data, their Snowflake spend grew automatically. Databricks' consumption-based compute pricing expanded as data teams scaled their workloads. dbt Labs' enterprise tier expanded as the number of data models, environments, and users grew. This alignment between customer success and vendor revenue is one of the most powerful structural features a data infrastructure business can have, and it consistently produces net revenue retention figures well above 100 percent at the companies that get it right.
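The arithmetic behind those retention figures is worth making explicit. A minimal sketch with illustrative numbers (not any vendor's actual figures): even when one customer in a cohort churns outright and another contracts, consumption-driven expansion among the rest can keep net revenue retention comfortably above 100 percent.

```python
# Back-of-the-envelope net revenue retention (NRR) under
# consumption-based pricing. All figures are illustrative.

def net_revenue_retention(cohort_start, cohort_end):
    """NRR = this year's revenue from last year's customer cohort,
    divided by that cohort's revenue a year ago. Expansion,
    contraction, and churn are all baked into the ratio."""
    return sum(cohort_end) / sum(cohort_start)

# A five-customer cohort, $k ARR last year vs. this year:
# three expand as their data volumes grow, one contracts, one churns.
start = [100, 80, 120, 60, 40]
end   = [150, 110, 160, 45, 0]

print(f"NRR: {net_revenue_retention(start, end):.0%}")  # NRR: 116%
```

This is why consumption pricing is structurally powerful: the vendor's growth rate inherits the customers' growth rate before a single new logo is signed.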
The Next Generation: Where the Opportunity Sits
The success of Snowflake, Databricks, and dbt Labs did not close the data infrastructure opportunity. It opened it. These platforms have become foundational infrastructure for a generation of enterprise data work, and every foundational platform creates a surface area for new companies to build on top of or adjacent to it.
At DataInx, we see particularly compelling Seed opportunities in several areas that represent the natural evolution of the trends these companies established. First, real-time data infrastructure: Snowflake and Databricks were architected primarily for batch and interactive analytics. The emergence of streaming-native architectures, driven by companies such as Redpanda—which raised $100 million in 2023 for its Kafka-compatible streaming platform—suggests that the next layer of data infrastructure investment will center on making real-time data as accessible and reliable as batch data has become.
Second, the data layer for AI-native applications: the training and inference workflows of large language models and multimodal AI systems require data infrastructure capabilities that the current generation of tools was not designed to support. Vector databases, feature stores, and data pipelines optimized for model training at scale represent genuine architectural needs that incumbent platforms address only partially. Turso, which raised a Seed round in 2023 for its edge-native SQLite-compatible database, exemplifies the kind of purpose-built infrastructure that AI-native application patterns demand.
Third, the operational layer that makes the modern data stack governable at enterprise scale: as enterprises have assembled data stacks from multiple best-of-breed components, the cost and complexity of operating, monitoring, and governing those stacks has grown significantly. The companies that build the observability, lineage, cost management, and governance capabilities that tie together heterogeneous data environments are building at a structural intersection that did not exist a decade ago.
Fourth, data infrastructure for regulated industries: financial services, healthcare, and life sciences companies face data governance and compliance requirements that general-purpose platforms address inadequately. Vultr, which raised $333 million in 2022 to expand its cloud infrastructure footprint, reflects the broader trend toward purpose-built infrastructure that can accommodate the specific requirements of regulated enterprise environments. The data infrastructure layer within these environments represents a market that is both large and structurally underserved by current solutions.
DataInx's Investment Framework
DataInx focuses exclusively on Seed Round investments in data infrastructure. Our conviction is that the pattern-recognition principles established by Snowflake, Databricks, and dbt Labs provide a reproducible framework for identifying the companies most likely to define the next generation of the data stack.
We look for founding teams that have built data infrastructure at scale, ideally as practitioners who experienced the problem they are solving before deciding to build a product around it. We pay careful attention to whether the architectural approach represents a genuine discontinuity from what exists or merely an incremental improvement. We evaluate the distribution strategy with the same rigor we apply to the product, because in data infrastructure, the companies that win community trust early typically maintain category leadership far into the growth phase.
We also look closely at the expansion economics embedded in the pricing model. The businesses we want to back are ones where customer success and vendor revenue grow in the same direction—where the best outcome for the customer is also the best outcome for the company. This alignment is not guaranteed by any particular pricing structure; it requires deliberate architectural and product decisions that the founding team must make early.
Finally, we are particularly attentive to the timing question. The right architectural insight at the wrong moment—before the underlying platform transition it depends on has matured—is as expensive as the wrong architectural insight. Snowflake was possible in 2012 because cloud object storage costs had reached the threshold where separated storage and compute made economic sense. It would not have been possible in 2005 for reasons that had nothing to do with the quality of the architectural insight. Part of what we try to do is evaluate not just whether an idea is technically correct but whether the market infrastructure required to realize it is present today.
Conclusion: The Infrastructure Era Is Not Over
The data infrastructure market is often described as mature by analysts who observe the dominant positions of Snowflake and Databricks and conclude that the category formation opportunity has passed. We think this misreads the structure of the market significantly. The platforms that have achieved dominance in the current generation are built on architectural assumptions—batch-oriented processing, centralized data warehouses, GPU-centric training workflows—that are already under pressure from the next generation of AI-native application patterns.
The enterprises that are building seriously for the next decade of data operations are already discovering that their current infrastructure stack does not elegantly accommodate real-time inference requirements, edge data patterns, or the governance demands that AI regulation will impose. These gaps are not marginal. They are the kind of architectural discontinuities that have historically created the conditions for new category leaders to emerge.
Snowflake, Databricks, and dbt Labs were all Seed-stage companies at one point, built by founders who had identified a genuine architectural mismatch between the needs of data practitioners and the capabilities of the incumbent tooling. We believe those conditions are present again today, and we are actively seeking the founders who are building the data infrastructure platforms that will define the next generation of the stack. If you are building in this space, we would be glad to hear from you.