Cloud Infrastructure: The Modern Data Stack
As a college sophomore in 2011, I interned at Credit Suisse as an analyst on the private banking data warehouse team in Singapore. My job was to build data pipelines in Talend, with the end goal of determining which private banking products were getting the most traction by region. Needless to say, it was a cumbersome process. There were several databases with different schemas, and data had to be manipulated before loading into the data warehouse due to compute constraints. Fast forward a decade, and I am pleasantly surprised by how far we have come, with Snowflake and its ecosystem making the entire process highly self-serve.
The recent public release of ChatGPT has reinforced the notion that data and AI are the megatrends of the 2020s and will lead to the largest wave of innovation since mobile and cloud. Historically, many enterprises have not leveraged data deeply to drive operational insights. In most cases, they have been limited by the technology infrastructure underpinning their data, which has been suboptimal for their purposes. That is changing now, driven by a confluence of cloud infrastructure innovations that started in 2010 but only became ready for adoption at scale in the last 3–4 years. I believe that enterprises will have to re-architect their data stack to benefit from these megatrends or risk being left behind. An entire ecosystem of data platforms, tools, and applications, referred to as the “Modern Data Stack”, is developing and, in my view, currently forms one of the most attractive opportunity sets in software.
The Modern Data Stack
Historically, enterprises have run data warehouses on-premises in their data centers for use cases such as reporting and analytics, as represented in the “Old Architecture” diagram below. While valuable, these use cases are reactive rather than proactive: they allow enterprises to understand their operations at a macro level but do not drive decision-making at a micro level.
The root cause of this underutilization of data has been the technology infrastructure. Given the high upfront cost of setting up an on-premises data center, there was only so much computing power available. The ability to crunch large amounts of data was limited, and the latency and cost of querying were high.
Starting in 2010 with the launch of Google’s BigQuery, the following innovations in cloud infrastructure ushered in a new age of data:
- Cloud-native massively parallel processing (“MPP”) analytical (OLAP) databases that delivered (i) querying performance 10–1000x better than existing relational (OLTP) databases, (ii) elasticity of compute, providing as much computing power on demand as required, and (iii) no requirement for upfront investment (a brief sketch of this elasticity follows this list). Net-net, the power of compute went WAY UP and the cost of compute went WAY DOWN.
- Snowflake’s separation of the compute and storage layers, which allowed for a simplified data stack architecture, blurring the lines between data lake and data warehouse and making it easier for data analysts to self-serve (details below).
- Support for SQL, the language known to all data analysts, by Snowflake and its partners such as dbt, allowing analysts to self-serve easily without knowledge of Python, Hadoop, or Spark. The ecosystem coalesced around ubiquitous SQL.
- An entire ecosystem that developed around cloud-native data warehouses and lowered the friction for enterprises migrating from a legacy on-premises data stack (see the history of the modern data stack below).
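To make the elasticity point concrete, here is a minimal sketch of resizing compute on demand in a cloud warehouse such as Snowflake, via its Python connector. The warehouse name, credentials, and sizing parameters below are hypothetical placeholders, not a reference implementation.

```python
# Minimal sketch: on-demand, elastic compute in a cloud data warehouse (Snowflake).
# Names and credentials below are placeholders; parameters are illustrative assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYST_USER",   # placeholder credentials
    password="***",
    account="my_account",  # hypothetical account identifier
)
cur = conn.cursor()

# Create a small virtual warehouse that suspends itself when idle,
# so compute (and cost) is only incurred while queries actually run.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS ANALYTICS_WH
    WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")

# Scale the same warehouse up for a heavy workload, then back down,
# without moving any data: compute is decoupled from storage.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")
# ... run the expensive analytical queries here ...
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL'")

cur.close()
conn.close()
```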
These innovations have had a profound impact on the way enterprises can leverage their data:
- New Use Cases such as AI/ML Are Now Feasible: With cloud-native MPP OLAP databases, querying speeds have improved dramatically. Furthermore, enterprises are no longer limited by the size of their data center; they have very large computing power for data analysis at their disposal, which can be scaled up or down. No upfront investment is required (though a minimum annual commitment to cloud providers typically is). All of this has made previously infeasible use cases feasible, chief among them AI/ML, which requires large amounts of compute to train models on proprietary data.
- Much Easier to Self-serve: Previously, given the compute constraints in the data center, data engineers would create complex data pipelines just to work around those constraints (such as pre-aggregating data into OLAP cubes with custom Python scripts, a task that could take days). With modern, powerful cloud OLAP data warehouses, there has been a transition from ETL to ELT, which enables business analysts to self-serve without relying as much on a data engineer. Business analysts can load data from sources into the data warehouse with Fivetran, transform it within the warehouse using SQL with dbt, and query it directly from Snowflake or Databricks using SQL (see the ELT sketch after this list). From a business standpoint, data analysis moves much faster and at lower cost.
- Simplified Data Architecture (see “Modern Architecture” below): Enterprises often maintain two tiers of data storage: a data lake and a data warehouse. The data lake holds raw structured and unstructured data stored cheaply, which is then served up to a data warehouse for running specific queries. However, with Snowflake’s separation of compute and storage, enterprises can now potentially dump all their data into a Snowflake data warehouse / Databricks Lakehouse and pay only for the compute, so a two-tier architecture is no longer strictly necessary (though it is still often preferred because storing data in a data lake is cheaper than storing it in Snowflake; this may change). This simplified architecture can allow for even easier self-service, lower Total Cost of Ownership (“TCO”), and unified data governance.
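As a rough illustration of the ELT pattern described above, the sketch below assumes raw data has already been landed in the warehouse by an ingestion tool such as Fivetran, and then runs a dbt-style SQL transformation directly inside Snowflake. All database, schema, table, and column names are hypothetical.

```python
# Minimal ELT sketch: raw data is assumed to already be loaded into the warehouse
# (e.g., by Fivetran); the transformation then runs inside the warehouse as SQL.
# All schema, table, and column names below are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYST_USER", password="***", account="my_account",
    warehouse="ANALYTICS_WH", database="ANALYTICS", schema="MARTS",
)
cur = conn.cursor()

# Transform step (the "T" of ELT): aggregate raw, already-loaded data into a mart table.
# In practice this SELECT would typically live in a dbt model and be run via `dbt run`.
cur.execute("""
    CREATE OR REPLACE TABLE MARTS.PRODUCT_TRACTION_BY_REGION AS
    SELECT region, product_name, COUNT(*) AS num_orders, SUM(amount) AS total_amount
    FROM RAW_DB.CRM.ORDERS
    GROUP BY region, product_name
""")

# Analysts can then self-serve with plain SQL, or point a BI tool at the mart.
cur.execute("""
    SELECT * FROM MARTS.PRODUCT_TRACTION_BY_REGION
    ORDER BY total_amount DESC LIMIT 10
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```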
To enable this new paradigm of data architecture, an entire ecosystem of data platforms and tools (referred to as the “Modern Data Stack”, see diagram below) has developed in recent years. The key components of the stack (with key vendors in parentheses) are listed below, followed by a brief sketch of how they fit together:
- Data platforms for querying and processing (Snowflake, Databricks, Google’s BigQuery)
- Data ingestion & transformation tools (Fivetran, dbt)
- Data pipeline orchestration (Airflow, Prefect)
- Data observability & testing (Monte Carlo)
- Data governance and catalog (Alation, Collibra, Atlan)
- BI / Visualization (Looker, Mode)
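To show how these layers typically connect, here is a minimal sketch of a daily pipeline on the orchestration layer (Airflow) that triggers ingestion and then runs dbt transformations and tests. The ingestion helper script is a hypothetical placeholder; only dbt’s `run` and `test` commands are standard CLI invocations.

```python
# Minimal Airflow sketch of a daily modern-data-stack pipeline:
# ingest (e.g., Fivetran) -> transform (dbt) -> test.
# The ingestion command is a hypothetical placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="modern_data_stack_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Placeholder for triggering an ingestion tool's sync (Fivetran, Airbyte, etc.).
    trigger_ingestion = BashOperator(
        task_id="trigger_ingestion",
        bash_command="python trigger_fivetran_sync.py",  # hypothetical helper script
    )

    # Run SQL transformations inside the warehouse via dbt.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",
    )

    # Validate the transformed models before dashboards pick them up.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt",
    )

    trigger_ingestion >> dbt_run >> dbt_test
```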
Size of the Prize (aka TAM)
Enterprise digital transformation has been a secular trend for the past decade, with the COVID-19 pandemic further accelerating it. As a result, lower-hanging workloads such as CRM and HCM have largely moved to the cloud: CRM had ‘22E cloud penetration of 83% and HCM had ‘22E cloud penetration of 77%. More complex workloads, including data warehouses, are still primarily on-premises; cloud penetration for data management in ‘22E is 35%. Furthermore, as data becomes a differentiator that lets enterprises serve more personalized content and make data-driven decisions, demand for cloud data management solutions is expected to grow rapidly.
Snowflake, using Gartner’s research, estimates its 2026 TAM at $248 billion versus its current revenue run-rate of over $2 billion. Databricks, which has historically specialized in data science but is increasingly encroaching on Snowflake’s data warehouse territory, has a current revenue run-rate of over $1 billion. As a cross-check, IDC expects the FY26 TAM for DBMS, BI & Analytics to be $195 billion. In a nutshell, the TAMs in this category are large and highly underpenetrated.
Investment Opportunity Set
While the basic plumbing for the modern data stack is in place, we are still in the early innings of the ecosystem. There are several players in each category, with on-premise legacy vendors racing to adapt (e.g., Talend, Informatica) and cloud-native start-ups trying to displace existing vendors or take a piece of newly created sub-segments (such as data governance). As investors, there is an opportunity to 1) pick winners among data platforms that will likely offer some version of an end-to-end solution and 2) invest in best-of-breed tools that are deeply embedded in their enterprise customers’ workflows. Some of the key players in each segment are presented in the market map below.
Furthermore, there is an opportunity to create and/or invest in the emerging tools and applications on the modern data stack such as:
- Real-time streaming analytics tools — Typically, analytics has been done on batches of data periodically loaded into an OLAP data warehouse. The plumbing for analytics on real-time streaming data is not well developed and is currently built outside of the modern data stack, not on top of it.
- Reverse ETL tools — Typically, operational tools such as Salesforce provide data that makes its way into the data warehouse via ETL/ELT pipelines, where it yields insights. These insights can then be automatically fed back into the operational tools via reverse ETL pipelines: for example, feeding insights on customers back into Salesforce so account managers can drive cross-selling (see the sketch after this list).
- AI / ML tools and applications — AI/ML adoption is becoming data-centric (improving the data rather than incrementally improving models), and the tools needed include those for data labeling, monitoring/observability, ingestion, orchestration, governance, and validation & auditing (known in aggregate as MLOps). Furthermore, AI/ML applications are in their infancy, and we will see an explosion of companies leveraging AI/ML to solve real-world problems.
- Data sharing tools and data marketplaces — Companies that want to share data with each other currently do so via APIs, resulting in data integration overhead. Companies using modern cloud data platforms, however, can potentially give counterparties access to their data directly on the platform (easy if both parties use the same platform, such as Snowflake; cross-platform sharing, such as between Snowflake and Databricks, requires a third-party tool). Emerging cybersecurity paradigms such as zero-trust architecture make it easier to share data securely. Furthermore, data sharing platforms and tools may also enable frictionless data marketplaces, allowing sellers to better monetize their data and buyers to layer third-party proprietary data into their AI/ML models.
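As a sketch of the reverse ETL idea above, the snippet below reads a computed insight from the warehouse and writes it back onto Salesforce records via the Salesforce REST API. The mart table, the custom field name, and the token handling are hypothetical assumptions, not a production pattern.

```python
# Minimal reverse ETL sketch: read an insight computed in the warehouse and push it
# back into an operational tool (Salesforce) so account managers can act on it.
# Table names, the custom field, and token handling are hypothetical assumptions.
import requests
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYST_USER", password="***", account="my_account",
    warehouse="ANALYTICS_WH", database="ANALYTICS", schema="MARTS",
)
cur = conn.cursor()

# Hypothetical mart holding a cross-sell score per Salesforce account.
cur.execute(
    "SELECT salesforce_account_id, cross_sell_score FROM MARTS.ACCOUNT_CROSS_SELL_SCORES"
)
scores = cur.fetchall()
cur.close()
conn.close()

SALESFORCE_INSTANCE = "https://my-company.my.salesforce.com"  # placeholder
ACCESS_TOKEN = "***"  # in practice obtained via OAuth

# Update each Account record via the Salesforce REST sobjects API;
# Cross_Sell_Score__c is a hypothetical custom field.
for account_id, score in scores:
    resp = requests.patch(
        f"{SALESFORCE_INSTANCE}/services/data/v57.0/sobjects/Account/{account_id}",
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={"Cross_Sell_Score__c": score},
    )
    resp.raise_for_status()
```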
Asia Start-up Landscape
As I have discussed previously, Asia is increasingly seeing globally competitive SaaS companies driven by the following macro factors:
- All of SaaS, especially developer-led infrastructure SaaS, is gravitating towards a product-led growth (PLG) distribution model. This allows Asian founders to target potential customers in the U.S. remotely and allows U.S. developers to discover companies solving their problems on popular developer platforms such as Product Hunt.
- Asia has a deep bench of engineering/product talent such as engineers/product managers trained at global R&D centers in India and talent from homegrown internet firms in China.
- Funding is increasingly democratized, and booming start-up ecosystems in Bangalore and Shenzhen are making it easier to discover founder talent in the region.
Some of the most interesting Asian start-ups in the modern data stack space that I am tracking are in the table below. If your start-up’s name is missing from the list, please drop me a message!
Conclusion
In summary, the modern data stack will underpin some of the major innovations we will see in data and AI in the coming decade. With only one listed cloud-native company (Snowflake), the ecosystem is still in the early stages of growth and will provide numerous opportunities for founders and private investors.
If you have any comments or questions, please feel free to drop me a message. If you are building in the space, I would love to hear from you.
Subscribe to my Newsletter!
https://gjuneja1.substack.com/