Meet Bauplan: making all software engineers data engineers

Bauplan is a new serverless data platform that makes complex data pipelines accessible to all software engineers—not just specialists. Built on Python and object storage, it lets devs build, query, and version data with ease using familiar tools.

At this point, it is essentially a truism that every product in the world is becoming AI and data-enabled. The rise of foundation models has democratized machine learning to such an extent that basically every software engineer in the world can now implement and build AI-powered features, even without a formal background in data science or ML.

Yet, AI products of even moderate complexity all still require substantial data engineering work, and unlike working with language models, data engineering is far from being democratized.

The second you start thinking about enrichment pipelines, retrieval & embedding pipelines, feature engineering, offline dataset curation for evaluation or fine-tuning, or pre- and post-processing of LLM calls, you are back in the world of “big data”. But very few engineers have the experience needed to manage & orchestrate Spark, Ray, Kubernetes, or EMR. As a result, the data platform ends up becoming the critical bottleneck for many companies.

At the same time, a number of market & technology trends are reshaping the data infrastructure landscape:

Python is becoming the lingua franca of operational data & AI workloads
Table formats like Iceberg are shifting the center of gravity for data storage to object storage
The memory capacity of compute is growing exponentially. When coupled with embedded libraries like DuckDB & DataFusion, scale-up architectures are much more feasible than before, and scale-out distributed systems are losing relevance

Viewed through this lens, a data platform of the future should be Python-centric, built on object storage, and centered around a scale-up, serverless architecture that allows anyone, not just specialized data engineers, to build data pipelines.

This is precisely what Bauplan is building, and why we are excited to announce our lead investment in their $7.3M seed round.

Bauplan was started by Ciro, Jacopo, and Mattia, a trio of fun, brilliant serial entrepreneurs with extensive backgrounds in machine learning and data infrastructure. They most recently led the AI team at Coveo after it acquired their startup Tooso, an early leader in AI search and recommendation systems. Ciro ended up being part of the management team during Coveo’s IPO process, Mattia led customer operations, and Jacopo was the public face of much of Coveo’s technical work including giving the 2022 keynote for NVIDIA RecSys.

Under their leadership, Coveo became one of the most cited companies in the world in search & recommendation systems, trailing only companies like Meta & Google. Coveo won best industry paper at NAACL 2021, ran the 2021 SIGIR eCom Data Challenge, and their eCommerce embedding model, FashionCLIP, is to this day one of the most downloaded and widely used embedding models on Hugging Face.

While many founding teams in data/AI have strong research backgrounds, very few bring such an applied lens to their research. What stands out the most about Ciro, Jacopo, and Mattia is their focus on usability and real-world workloads. Many of their side projects over the years have centered on how to create simple, “reasonable scale”, applied data and AI systems - e.g. Recs at a Reasonable Scale, You Don’t Need a Bigger Boat. These projects in many ways laid the foundation for some of the ideas Bauplan is built around.

While at Coveo, it became pretty clear to them that pre-trained models were lowering the bar to using ML and that a lot of the special skills needed by ML teams were becoming somewhat esoteric. At the same time, data infrastructure remained very fragmented and most of their time was spent around DataOps and MLOps.

So, the trio embarked on a multi-year journey to explore how they could rethink data pipelines and move them to something that felt dramatically simpler, more accessible, and closer to software engineering - analogous to the shift DevOps made from click-ops to infrastructure as code.

Their early assumption was that you could build a data platform with existing serverless paradigms like Lambda and OpenWhisk on top of cloud object storage, partially thanks to the rise of embedded OLAP systems like DuckDB. But, as they dug in, they started to realize that existing serverless paradigms do not work at all for data workloads - memory limits, compute timeouts, restrictions on communication between functions, and more all make it basically impossible to do real data work in this space.

The team increasingly recognized that this challenge was also an immense opportunity - making complex data workloads serverless would require rethinking what a serverless runtime looks like and how it operates. And so, after more research into areas like virtualization, query planning, and declarative function execution, they ended up with a complete re-invention of what it looks like to build data & AI applications on top of object storage.

The result of all this technology work is something that, to users, feels delightfully simple. A freshly-minted CS grad would feel at home in Bauplan. You write pure SQL & Python - no frameworks, no infrastructure, no configuration is required - and serverless functions are immediately executed in the cloud for you. Your workflow feels local and seamless, and there is an immediate feedback loop since the runtime boots and runs essentially instantly. The platform is fully programmable, with natively integrated versioning, branching, and environment management. Notably, it is also on the order of 10-100x more cost-effective than alternatives in many cases.

This simplicity reflects a deeper product design philosophy - Bauplan aims to introduce as few concepts into the platform as possible. Not everyone knows what a container is, but everyone knows what a package is. Not everyone knows what Parquet or Iceberg are, but everyone knows what a table and a branch are. Few people know about Kubernetes or Spark, but everyone knows what a Python function is. Maintaining expressiveness while keeping the mental model of the platform so simple is extremely challenging, and indeed Bauplan spent almost a year simply refining their core abstraction to achieve it.
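To make the "tables and Python functions" mental model a bit more concrete, here is a minimal sketch in plain Python. It is not Bauplan's actual API: the names `raw_events`, `spend_per_user`, `top_spender`, and `run_pipeline` are all hypothetical, and the toy runner simply assumes that each function produces a table and that a function's parameter names identify the upstream tables it depends on.

```python
import inspect
from typing import Callable, Dict, List

Row = dict
Table = List[Row]


def raw_events() -> Table:
    # Hypothetical source table; a real platform would read it from object storage.
    return [
        {"user": "a", "amount": 10.0},
        {"user": "b", "amount": 5.0},
        {"user": "a", "amount": 2.5},
    ]


def spend_per_user(raw_events: Table) -> Table:
    # Depends on `raw_events` purely by naming it as a parameter.
    totals: Dict[str, float] = {}
    for row in raw_events:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return [{"user": u, "total": t} for u, t in sorted(totals.items())]


def top_spender(spend_per_user: Table) -> Table:
    # Depends on `spend_per_user`; returns a one-row table.
    return [max(spend_per_user, key=lambda r: r["total"])]


def run_pipeline(steps: List[Callable[..., Table]], target: str) -> Table:
    # Toy runner (an assumption for illustration): resolve each function's
    # inputs by parameter name, computing every table at most once.
    registry = {fn.__name__: fn for fn in steps}
    cache: Dict[str, Table] = {}

    def materialize(name: str) -> Table:
        if name not in cache:
            fn = registry[name]
            deps = inspect.signature(fn).parameters
            cache[name] = fn(**{dep: materialize(dep) for dep in deps})
        return cache[name]

    return materialize(target)


result = run_pipeline([raw_events, spend_per_user, top_spender], "top_spender")
print(result)
```

The point of the sketch is only that the author of the pipeline never mentions containers, clusters, or file formats: the dependency graph falls out of ordinary function signatures, and everything else is the runner's job.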

Bauplan is now in production with a number of customers, including one of the largest media companies in Europe, which is pushing terabytes of data through the platform per week. What has been interesting to see among early customers is that Bauplan is often initially pulled in by a data science team, but then quickly adopted by many software engineers outside of the data organization who are now empowered to build data products on their own. In the future, all software engineers will be data engineers, and Bauplan is laying the foundation for this.

It's rare to work with a team for whom the deeper you explore a space, the more complex the challenge becomes, and the more exciting it gets. If you’re looking to work on some of the hardest problems in systems engineering and rethink what a data engineering platform looks like, the team is hiring. And if you want a simpler way of building a data platform on top of object storage, try out Bauplan.

Welcome aboard, Bauplan!
