Traditionally, most SaaS applications followed a relatively straightforward system architecture - a frontend client makes API calls via REST or GraphQL to a stateless backend which does all core data manipulation via database queries.
Such an architecture requires essentially no reasoning about distributed systems (e.g. consensus, coordination, consistency, state invalidation, splitting computation across nodes, etc). Although the application may use a distributed system (a database), the application developer could generally treat it as a black box, with caches being the notoriously annoying classic exception.
This paradigm is rapidly changing. Specifically, I would argue that modern applications are increasingly becoming distributed systems in the sense that the compute and storage model of the application is spread across client, edge, and cloud.
Some examples of this include:
In simple terms, these architectures trade higher system complexity for faster, more reactive, more responsive applications, which sometimes offer secondary benefits such as multiplayer collaboration, offline-mode support, and reduced server-side compute costs.
Their emergence is driven by a number of recent technology improvements in areas like virtualization and distributed systems which make it easier to handle this degree of system complexity. It is also driven by the fact that applications are becoming data- and AI-driven, and therefore increasingly stateful and compute-heavy, which necessitates more aggressive methods for reducing latency and improving responsiveness.
This is an interesting trend because it substantially changes how applications are designed, it greatly increases application complexity, and it introduces the need to reason about distributed systems to a new set of engineers who generally never had to think much about it before (application engineers). The hatred most application teams have towards maintaining cache coherence with tools like Redis and Memcached is a good illustration of how messy it can get when people who aren’t trained in distributed systems have to reason about them.
This post explores this trend across a few dimensions:
There are numerous recent technology advances which have dramatically simplified building an application in this way.
Embedded libraries
There has been a range of activity recently in single-node, embedded, zero-dependency libraries which offer database-like characteristics in a very small form factor that can be run anywhere (e.g. locally, in the browser, etc).
Good recent examples of this include DuckDB (analytical queries), LanceDB (search queries), and KuzuDB (graph queries). SQLite is the original example of this and has been used for data storage in most embedded systems, IoT devices, and mobile devices for decades.
Such libraries are very interesting because they, in theory, allow you to run the same compute engine at every layer of your application - client, edge, & cloud. This is an extremely useful basis for “hybrid execution” architectures which dynamically route queries or compute across different nodes based on the query.
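To make this a bit more concrete, here is a minimal sketch of what such a hybrid execution layer might look like. Everything here is illustrative: the `QueryEngine` interface, the table-locality heuristic, and the option names are assumptions for the sake of the example, not the API of DuckDB, LanceDB, or any other specific library.

```typescript
// Hypothetical sketch of hybrid query routing across client and cloud.
// The QueryEngine interface and routing heuristic are illustrative assumptions.

interface QueryEngine {
  query(sql: string): Promise<Record<string, unknown>[]>;
}

interface HybridOptions {
  local: QueryEngine;               // e.g. an embedded engine running in the browser
  remote: QueryEngine;              // e.g. the same engine running in the cloud
  locallyCachedTables: Set<string>; // tables already synced down to the client
}

// Run the query locally only if every table it touches is cached on the client;
// otherwise fall back to the remote engine, which has the full dataset.
async function runQuery(sql: string, tables: string[], opts: HybridOptions) {
  const allLocal = tables.every((t) => opts.locallyCachedTables.has(t));
  const engine = allLocal ? opts.local : opts.remote;
  return engine.query(sql);
}
```

Real systems use far richer signals (data freshness, query cost estimates, device capability), but the core idea is the same: one query interface, multiple possible execution locations.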
“Edge” Compute & Storage
Tooling and infrastructure for provisioning globally distributed edge compute has massively improved over the last few years. Vendors like Fly and Cloudflare (Workers, D1) make it easy to provision compute & storage essentially anywhere on earth, and a range of newer startups like Turso & Tigris are starting to offer higher level cloud abstractions on top of these geo-distributed datacenters.
While edge infrastructure is not a new concept - CDNs have been around for a while helping to serve static content faster, and vendors like Fastly offer an array of edge services for application developers, such as web application firewalls - it is only more recently that the full range of cloud compute primitives became readily accessible beyond AWS-East.
This vastly simplifies what it takes to build an application that offloads some processing or storage to the edge, which is a fundamental requirement for many compute heavy but latency constrained products. The second you move a significant portion of your application workload to the edge, you start to enter into something that looks more like a distributed system.
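As a rough illustration, a Cloudflare Worker backed by D1 can serve reads from storage provisioned near the user. The sketch below is a minimal example, assuming a D1 binding named `DB` and a `users` table (both hypothetical), rather than a production setup:

```typescript
// Minimal Cloudflare Worker sketch serving reads from D1 at the edge.
// D1Database comes from @cloudflare/workers-types; the "DB" binding name
// and the `users` table are illustrative assumptions.

export interface Env {
  DB: D1Database;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const id = new URL(request.url).searchParams.get("id");
    // The query runs against storage provisioned close to the user rather
    // than a single centralized database region.
    const { results } = await env.DB
      .prepare("SELECT id, name FROM users WHERE id = ?")
      .bind(id)
      .all();
    return Response.json(results);
  },
};
```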
Modern Application Frameworks
Tightly coupled with the rise of edge infrastructure has been the evolution of frontend frameworks such as React which abstract certain aspects of client vs. server and edge vs. client. A good example of this is React server side rendering and server components.
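As a small sketch of what that abstraction looks like in practice (using Next.js App Router conventions; the `db` helper and `comments` table are hypothetical), a server component can keep its data access on the server or edge while shipping only rendered output to the client:

```tsx
// Sketch of a React Server Component (Next.js App Router conventions).
// The `db` module and `comments` table are hypothetical stand-ins.
import { db } from "./db";

// Runs on the server (or edge); only the rendered payload is sent to the
// client, so this data-access code never ships in the browser bundle.
export default async function CommentsPage() {
  const comments = await db.query<{ id: string; body: string }>(
    "SELECT id, body FROM comments ORDER BY created_at DESC LIMIT 20"
  );
  return (
    <ul>
      {comments.map((c) => (
        <li key={c.id}>{c.body}</li>
      ))}
    </ul>
  );
}
```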
CRDTs & Sync Engines
CRDTs, or conflict-free replicated data types, are a class of data structure which allow concurrent changes on different devices to be merged automatically without requiring any central server. Definitionally, any application built around CRDTs is a distributed system since they mediate state syncing & state conflicts across peer nodes.
CRDTs are the basis for many multiplayer SaaS applications (alongside operational transformation, which seems to be falling out of favor) such as Notion & Figma. Libraries like Automerge and Yjs have made it dramatically simpler to build CRDT-based applications, and as a result you see multiplayer increasingly becoming a table stakes component of many SaaS products.
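As a small illustration of the model, here is a sketch using the Yjs API, where two `Y.Doc` instances stand in for two peers editing the same note while offline:

```typescript
// Sketch: two peers edit the same Yjs document concurrently, then merge.
// The CRDT guarantees both replicas converge without a central server
// deciding the outcome.
import * as Y from "yjs";

const peerA = new Y.Doc();
const peerB = new Y.Doc();

// Concurrent, offline edits on each peer.
peerA.getText("note").insert(0, "Hello from A. ");
peerB.getText("note").insert(0, "Hello from B. ");

// Exchange state updates in both directions (order does not matter).
Y.applyUpdate(peerB, Y.encodeStateAsUpdate(peerA));
Y.applyUpdate(peerA, Y.encodeStateAsUpdate(peerB));

// Both documents now hold the same merged text.
console.log(
  peerA.getText("note").toString() === peerB.getText("note").toString()
); // true
```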
CRDTs are a subset of a broader class of technical solutions to rapidly syncing potentially divergent state across many nodes, which I would broadly describe as “sync engines”.
Linear, for example, has shared extensive information about their synchronization engine, which allows their product to be offline-first. Linear does not rely on specialized CRDT-based data structures, but rather on fast message passing with basic heuristics for how to handle conflicts. A range of startups, such as Orbiting Hail and ElectricSQL, aim to offer similar capabilities to other teams.
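A toy version of that kind of heuristic - a last-writer-wins merge over per-field timestamps, which is a sketch in the spirit of such sync engines rather than Linear's actual implementation - looks something like this:

```typescript
// Last-writer-wins merge over per-field timestamps: a simple conflict
// heuristic, not any specific product's algorithm.

interface FieldVersion<T> {
  value: T;
  updatedAt: number; // millisecond timestamp from the writing client
}

type VersionedRecord = Record<string, FieldVersion<unknown>>;

function mergeRecords(local: VersionedRecord, remote: VersionedRecord): VersionedRecord {
  const merged: VersionedRecord = { ...local };
  for (const [field, remoteVersion] of Object.entries(remote)) {
    const localVersion = merged[field];
    // Keep whichever write happened last; ties go to the remote copy.
    if (!localVersion || remoteVersion.updatedAt >= localVersion.updatedAt) {
      merged[field] = remoteVersion;
    }
  }
  return merged;
}
```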
Virtualization
Lightweight, portable virtualization technologies such as Firecracker, V8, and WebAssembly have matured substantially recently, and increasingly make it possible to run the same binary everywhere - whether locally, in the browser, on the edge, or in the cloud.
WebAssembly, in particular, is enabling a whole new class of applications which distribute compute & storage. Good examples of this include:
Streaming & Stream Processing
Advances in stream processing and incrementally materialized views make it dramatically more realistic to ship data across client, edge, and cloud. Passing an entire database or table over the network is a non-starter in most cases, but passing only diffs or incremental changes can be an effective basis for maintaining a local sync or copy of data somewhere.
Improvements in CDC (e.g. Debezium) and incrementally materialized views (e.g. Noria, Differential Dataflow, DBSP, Materialize, Feldera) are making this much more feasible. For example, PlanetScale Boost automatically accelerates database queries via an embedded KV cache that is kept coherent by maintaining an incrementally materialized view. This sort of pattern would have been very difficult to build just 4-5 years ago.
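A sketch of the pattern described in the tweet below - applying a change feed to an embedded SQLite cache at the edge - might look like this. The change-event shape is a simplified stand-in for what a tool like Debezium would emit, and the table schema is assumed for the example:

```typescript
// Sketch: keep an edge-local SQLite cache up to date by applying CDC diffs
// from the system of record, rather than re-shipping whole tables.
import Database from "better-sqlite3";

interface ChangeEvent {
  op: "insert" | "update" | "delete";
  id: string;
  name?: string;
  email?: string;
}

const cache = new Database("edge-cache.db");
cache.exec(
  "CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY, name TEXT, email TEXT)"
);

// Apply only the diff each time the change feed delivers an event.
function applyChange(event: ChangeEvent) {
  if (event.op === "delete") {
    cache.prepare("DELETE FROM users WHERE id = ?").run(event.id);
  } else {
    cache
      .prepare(
        "INSERT INTO users (id, name, email) VALUES (?, ?, ?) " +
          "ON CONFLICT(id) DO UPDATE SET name = excluded.name, email = excluded.email"
      )
      .run(event.id, event.name ?? null, event.email ?? null);
  }
}
```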
One data architecture I expect we'll see more in 2023 is #SQLite/#DuckDB deployed as caches at the edge, updated via change feeds from system-of-record: stellar read performance due to close local proximity to users and fully queryable data models tailored for specific use cases.
— Gunnar Morling 🌍 (@gunnarmorling) January 2, 2023
In addition to it being easier to build software in this way, it is also now more essential as a result of two significant market shifts.
The first is user expectations. The widespread adoption of tools like Figma, Notion, Linear, and similar - which offer offline support, multiplayer collaboration, full reactivity, and essentially instant responses for all actions - has set a new baseline for user experience in a lot of software.
The second is that applications have become dramatically more state and compute intensive. We are moving to a world where essentially all products are data-driven and AI-enabled, and where, as a result, data processing and model inference become core components of the system. As applications naturally become more stateful and compute-intensive, it becomes increasingly essential to adopt architectures which help mitigate the latency & compute overhead incurred.
Good examples of this include:
This shift in the way that applications are being built has some interesting implications for how developer tooling may need to evolve.
First, there will likely be an explosion of tools which aim to abstract many of the paradigms I describe here - particularly the distribution and syncing of data. For example, Liveblocks aims to abstract CRDTs for multiplayer collaboration, ElectricSQL aims to abstract sync engines, and Jamsocket aims to abstract session backends. In many cases, this harkens back to what Firebase did for mobile app sync long ago.
Second, I think there will begin to be more attention paid to “smart orchestration” engines. Assuming that it is easy enough to distribute and sync data across client, edge, and cloud, the next question is exactly how to distribute that data and route queries and workloads against it.
An interesting example of this is Cloudflare Smart Placement, where Cloudflare will dynamically execute your compute either close to the user or close to a cloud database based on the nature of the query and how stateful it is.
Similarly, companies like Motherduck which build around hybrid execution models will need to consider questions like:
Efforts like Apple Intelligence which adopt an analogous hybrid execution paradigm but for RAG systems will also need to consider when to route to the smaller, local model vs the smarter, bigger, remote model, in addition to having all the same data routing concerns for retrieval.
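A toy version of that routing decision might look like the sketch below. It is purely illustrative: the complexity scoring, thresholds, and model-client interfaces are assumptions for the example, not how Apple Intelligence or any specific product actually works.

```typescript
// Hypothetical sketch of routing between an on-device model and a larger
// remote model in a RAG pipeline. All interfaces and heuristics are assumed.

interface ModelClient {
  complete(prompt: string): Promise<string>;
}

interface Router {
  localModel: ModelClient;  // small, fast, private, runs on device
  remoteModel: ModelClient; // larger, smarter, higher latency and cost
}

// Crude stand-in for a real difficulty/capability classifier.
function estimateComplexity(prompt: string, retrievedChunks: string[]): number {
  return prompt.length + retrievedChunks.join(" ").length;
}

async function answer(prompt: string, retrievedChunks: string[], router: Router) {
  const complexity = estimateComplexity(prompt, retrievedChunks);
  // Keep simple, small-context questions on device; escalate the rest.
  const model = complexity < 4_000 ? router.localModel : router.remoteModel;
  const context = retrievedChunks.join("\n");
  return model.complete(`${context}\n\nQuestion: ${prompt}`);
}
```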
There is likely a huge amount of room for richer research in this domain - one way to think of it is database planning & optimization, but applied to distributed systems rather than to individual databases. Projects such as Suki and Hydro are interesting efforts roughly in this direction. A key question here is how much of this sort of system optimization can be generalized, vs. how much will need to be extremely application specific.
A secondary implication of all of this is that having expertise in distributed systems can be a significant advantage for an application engineering team.
Distributed systems engineering forms the basis of many of these more complex architectures, and is often the basis on which you can now build truly differentiated products, especially in more compute intensive categories such as design tools, CAD tools, engineering tools, AI-native tools, and similar.
Yet, few application engineers actually have much expertise in this area - most strong distributed systems people work on “traditional” distributed systems problems (e.g. databases, backend infrastructure, etc).
Frontend engineers haven’t fully realized how relevant some old ideas from distributed systems engineering now are to their work - Browsertech Digest
As a result, I think there is a pretty significant talent arbitrage in the market right now for people applying distributed systems ideas to the way modern web applications are built. Companies which push in these areas can build really differentiated products as a result - like RillData in exploratory data analysis, Womp in 3D editing, and similar.
This blog post by Tably is a particularly fun example of this, showing how the right application of these sorts of distributed systems ideas might allow you to build a really novel spreadsheet product. I think there is opportunity for more teams thinking this way.
Overall, data is becoming more “distributed” in applications. The world is moving towards applications that assume data is split between client, edge, and server, and which intelligently cache, sync, & store data across these layers and route queries/workloads among them.
This trend is facilitated by a range of technology advances, especially virtualization layers like WebAssembly and embedded zero-dependency DB libraries like pg-lite, and it is starting to become table stakes thanks to consumer expectations around application behavior and the rise of more compute- and ML-intensive applications.
As this trend matures, I think it will necessitate further evolution of developer tooling, specifically with respect to syncing, orchestration, and routing of data and compute across an increasingly complex system architecture.