Applied large language model startups have exploded in the past year. Enormous advances in underlying language modeling technology, coupled with the early success of products like Github CoPilot, have led to a huge array of founders using LLMs to rethink workflows ranging from code reviews to copywriting to analyzing unstructured product feedback.
Much has been written about this emerging ecosystem — I would recommend the excellent articles by Elad Gil, Leigh Marie Braswell, and Vinay Iyengar as starting points — and in general, it is exciting to see so many nascent startups in this area. However, I worry that many startups in this space are focusing on the wrong things early on. Specifically, having met and looked into numerous companies in this space, I believe that UX and product design, not data or modeling, are the predominant bottleneck holding back most applied large language model startups.
This article will explain why I think this is the case, highlight many of the key UX issues I observe, and offer recommendations for how a founder building on top of LLMs might account for this.
To start, let me paint a picture of the common journey I see many language model startups go through.
A technically superb founding team starts with a vision of a big use case that can obviously be re-imagined with current language model technology. They throw together a pitch deck that looks something like the following:
90% of the team’s early focus is on data, modeling, system architecture, and technical defensibility. The team is likely hyper-aware of the fact that many people believe LLMs are becoming commoditized, and as a result, tries to spend a lot of time articulating why their technical approach is unique and not easily mimicked by a hobbyist just using GPT-3.
After fundraising and a lot of time building, the startup eventually gets an MVP of the product together and puts it in front of customers. The demo blows prospects away, but then, things seem to fade. The prospect tries it out in a POC, but the novelty runs out, and they stop using it. Engagement is low. The customer churns. Despite how incredible the demo is, the startup struggles to find real product market fit.
In my experience, the root cause that typically drives this outcome is that it is extraordinarily hard to teach a human how to work together with a probabilistic system in an environment where they are not used to one. Any language model startup that is designed to be human-in-the-loop (which is the vast majority) will need to implicitly answer the following types of questions for their users:
These are, first and foremost, UX questions that require very careful design thinking and user research to solve. If you forgo such work and simply throw a naked GPT-3 text prompt into your UI, or just make a raw call to the GPT-3 API every time the user presses a given hotkey, you are essentially asking the user to figure these things out on their own. The cognitive burden on users of answering such questions via trial and error is immense. So if you don’t identify how to craft your product in a way that makes these answers so obvious to your users that they can pick them up in <5 minutes, they are very likely to give up on your product once the novelty effect wears off.
In my experience, despite these UX questions being the critical bottleneck to product market fit and the hardest thing to get right in most cases, they are chronically undervalued by most startups in this space. Indeed, if you speak with some of the few companies who have actually built widely adopted applied LLM products, you’ll find that almost universally they ended up having to spend more time on UX and the human-computer interface than they did on modeling.
I believe there are a few underlying factors that lead so many teams to pay so little attention to these sorts of UX questions despite their criticality.
First, these issues really only surface once someone starts trying to use the product in the context of their daily workflow. For precisely this reason, this market is characterized by an immense gap between a prospect being impressed by or interested in a demo and that prospect actually converting into a paid, engaged user. I believe this is also why many of the PLG and self-service applied LLM startups have high churn — on first use, the product is “cool,” but after some time, you realize it is too difficult to use effectively.
Second, many teams believe that improved model accuracy can allow them to avoid deeply considering these questions from a product & design perspective. Unfortunately, these challenges are essentially always present, regardless of your system’s accuracy (within some bounds). It doesn’t really matter if your LLM accuracy is 80% or 95%, as in either case, the user still needs to reason through failure modalities and understand what to expect when interacting with the system. Since you’re almost certainly not going to achieve 100% accuracy unless you have scoped down the use case to something extremely specific, you are generally better off getting to a baseline accuracy that is good enough and then building a product that allows a user to know how to work around the model (this paper on using language models to help write short stories is an excellent illustration of this concept).
Finally, there is a talent issue — most of the people building in this space come from research, Ph.D., or engineering backgrounds and simply don’t consider product and design decisions like this as first-class problems to solve.
To further explain why the UX of interfacing with LLMs is so hard to get right, let’s briefly explore one of the most popular emerging LLM use cases — copywriting. There are a number of notable companies utilizing GPT-3 or similar models for copywriting, such as Jasper.ai, Copy.ai, and Anyword. At a high level, these products all work in the following way — you tell it what you’re trying to write (e.g. advertising header), you provide some context (e.g. product description), and it generates suggestions for you which theoretically convert users at a high rate.
If you try these products yourself or speak to people who use them, you’ll find that the chief factor that differentiates them is not model accuracy. All of them are pretty good at optimizing copy for conversion, and frankly, it is not possible for a user to meaningfully compare the quality of the output copy without putting extensive time into manually A/B testing the services. In other words — differences in model accuracy are essentially imperceptible to the user. What does immediately differentiate the products from a user’s perspective, however, is the “periphery” of the product around the core language synthesis engine.
For example:
Solving these sorts of UX questions well has a substantially larger impact on a user’s ability to select good copy that they are happy with than marginally fine-tuning a language model to be slightly better at copywriting suggestions.
Indeed, many companies in this category have high churn, and I suspect that the vast majority of variance in churn/retention/engagement across such startups is driven by these UX-related factors. Products that handle them poorly are fun to try, but difficult to trust and hard to reason about over time. Products that handle them well are magical — using them feels effortless, and it is easy to see how they fit into a real workflow.
At this point, I’ve hopefully convinced you that it is important to think about the UX of how a human will interact with an LLM in the context of your product. But how might you actually go about doing that, and what are the right design patterns?
While this space is extraordinarily early, and there is a lot to be figured out, I want to share some common themes I see emerging in the hope that it sparks your imagination and gets you thinking. For each, I’ll reference several emerging startups that are doing interesting things in the context of that UX pattern.
If you’ve ever used Gmail, you have likely come across the “smart compose” feature. As you type a sentence, Gmail occasionally flashes a suggested auto-complete in grey. The user can either quickly accept it or keep typing and ignore it.
This feature is quite delightful to use because it is simple to understand, very accurate, and requires zero cognitive effort on the part of the user. I would argue that most of these benefits stem from the fact that these suggestions are system-prompted, not user-prompted. Suggestions are only shown when there is confidence that the result is very high quality, and the “push” based nature of the suggestions means that the user does not even need to be aware that this feature exists to utilize it.
Although smart compose is not technically built using LLMs, I would argue that these principles very cleanly translate to most LLM startups. The more you can move away from a user-driven invocation of the model to an automatic triggering of the model in the right instances, the more you can ensure that model output is perceived to be extremely trustworthy and high quality, and the less work you are asking the user to do. In your product, if it is critical to give the user more flexibility on when and how to trigger the model, I would explore different ways to “hint” to the user when the model is more likely to be effective or accurate vs. not.
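To make this concrete, here is a minimal sketch of what a “push”-style trigger might look like, assuming a hypothetical `complete_with_confidence` helper that returns a suggestion along with a crude confidence score; the threshold values are invented and would need to be tuned against real acceptance data.

```python
CONFIDENCE_THRESHOLD = 0.85   # invented value; tune against real acceptance data
MIN_PAUSE_SECONDS = 0.5       # only trigger once the user has stopped typing

def complete_with_confidence(text: str) -> tuple[str, float]:
    """Placeholder for a real model call; returns (suggestion, score in [0, 1])."""
    return " Thanks for the update!", 0.9

def maybe_suggest(document_text: str, seconds_since_last_keystroke: float) -> str | None:
    if seconds_since_last_keystroke < MIN_PAUSE_SECONDS:
        return None  # don't interrupt someone mid-thought
    suggestion, confidence = complete_with_confidence(document_text)
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # silently show nothing rather than a weak suggestion
    return suggestion

print(maybe_suggest("Hi Sam, just following up on", seconds_since_last_keystroke=1.2))
```

The key design choice is that the system, not the user, decides when the model shows up, so the user only ever sees output the product is willing to stand behind.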
An emerging area of research within language models is the field of “prompt engineering.” Researchers have realized that for a wide class of language model tasks, changing the way you prompt the model results in a massive delta in the accuracy of the model’s results.
A fun example of this is that if you ask language models to solve math problems, simply adding the phrase “Let’s think step by step” results in much higher accuracy.
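As a toy illustration (with a placeholder `call_llm` standing in for whatever completion API you use), wording decisions like this should be owned by the product, not discovered by the user:

```python
def call_llm(prompt: str) -> str:
    return "(model output)"  # placeholder for a real completion call

def solve_math_problem(question: str) -> str:
    # The zero-shot chain-of-thought trick: appending one phrase to the prompt
    # has been reported to meaningfully improve accuracy on reasoning problems.
    # The user never has to know this phrase exists.
    prompt = f"Q: {question}\nA: Let's think step by step."
    return call_llm(prompt)
```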
The fact that there is an entire research field around prompt engineering hints at a broader point — it is really, really complex to figure out how to interface with a large language model, and subtle differences in the way you “speak” with it can have a massive impact on how useful it is. The critical implication of this is that you don’t want to force users of your product to be prompt engineering researchers.
Unfortunately, this is the status quo for interacting with foundation models in most products today. In language generation, most products expose fairly “naked” text boxes, and the user is left trying to test and experiment with what creates a good output. One automated email follow-up tool I tried still required me to type “Write an email…” as the start of my prompt to get a good output.
To the extent possible, I would strongly encourage building an abstraction layer between what the user inputs and what you actually prompt the model with. A simple but good example of this is TattoosAI, which is a tattoo image generation application. Note that the user interface has abstracted things like style, color, and artist into categorical drop-downs and specific fields. Under the hood, the product clearly converts those categorical values into very specific prompts, which generate good tattoo results. This would otherwise be extraordinarily difficult for a user to get right on their own by directly prompting a model like Stable Diffusion.
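To illustrate the pattern, here is a hedged sketch of what such an abstraction layer might look like. The field names and prompt template are my own guesses for illustration, not TattoosAI’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class TattooRequest:
    subject: str            # free text, e.g. "a fox in a forest"
    style: str              # drop-down, e.g. "traditional", "watercolor"
    color: str              # drop-down, e.g. "black and grey", "full color"
    artist_influence: str   # drop-down, e.g. "in the style of Sailor Jerry"

def build_prompt(req: TattooRequest) -> str:
    # The user never sees or edits this string; the product owns the
    # prompt-engineering details so the user can't get them wrong.
    return (
        f"A {req.style} tattoo design of {req.subject}, {req.color}, "
        f"{req.artist_influence}, clean linework, high contrast, white background"
    )

print(build_prompt(TattooRequest("a fox in a forest", "watercolor",
                                 "full color", "in the style of a modern illustrator")))
```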
I suspect that, in many cases, it is not a good idea to provide a single open text field input as the primary way a user interfaces with your model. This likely gives too much freedom to your user and, as a result, opens them up to too many weird failure modalities based on not knowing the right way to talk to the model. Instead, identify constrained, defined situations where the model should be queried and productize that in such a way that the user hopefully can’t mess it up.
One of the biggest hurdles to overcome when building an LLM-based application is trust. As a user, to what extent can I default assume the output is good, versus to what extent do I need to double-check the output? The more the user needs to validate everything that the machine produces, the less the machine is actually providing any value.
Code autocompletion is an interesting case study in this context. Products like Github CoPilot and Replit AI mode are some of the earliest breakout use cases of large language models — CoPilot has hundreds of thousands of subscribers paying over $100 a year for the service, and many people love the product.
Yet, if you talk to a wide enough sample of people who use CoPilot, you will sometimes hear mixed feedback. Specifically, CoPilot regularly suggests code snippets that are incorrect, don’t compile, or contain errors, meaning that engineers must very carefully analyze code suggestions before moving forward. Essentially, CoPilot introduces a tradeoff between spending less time writing boilerplate code and spending more time reading “someone else’s” code (i.e., the LLM’s).
It is very interesting to contrast this with Google’s recent work in LLM-based code autocompletion. Specifically, Google has built a hybrid code completion system that is not only based on transformer models but also incorporates semantic engines, which are essentially the rule-based auto-complete systems that have traditionally powered code suggestions in IDEs. The combination of these systems allows for two forms of advanced auto-complete that significantly mitigate user trust issues:
Interestingly, Google directly tested pure LLM-based autocomplete suggestions and found a much lower user acceptance rate for them due to trust issues:
“This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct but doesn’t compile. Based on internal user experience research, this issue can lead to the erosion of user trust over time while reducing productivity gains… The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.”
Replit’s AI mode follows similar principles — “We apply a collection of heuristic filters to decide to discard, truncate or otherwise transform some suggestions; soon, we’ll also apply a reinforcement learning layer to understand the kinds of suggestion that are helpful to users, filtering out suggestions that are unlikely to be accepted to prioritize suggestions that are genuinely helpful.”
Generalizing this, I find that many best-in-class LLM-based products complement the LLM with some form of validation checks (often heuristic in nature) to avoid errant output. It can be catastrophic to suggest something to a user that seems nonsensical or obviously wrong. So especially if you have a product that allows a user to trigger the model in arbitrary situations, it often becomes essential to build mechanisms to mitigate the chances of this.
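As a minimal illustration of the idea, the sketch below gates code suggestions on a cheap syntax check (using Python’s `ast` module) before they are ever shown. Real systems like the hybrid approach described above go much further, but even this filters out one class of obviously broken output:

```python
import ast

def passes_basic_checks(snippet: str) -> bool:
    """Cheap heuristic gate: reject empty or syntactically invalid Python."""
    if not snippet.strip():
        return False
    try:
        ast.parse(snippet)   # syntax only; catches neither type nor logic errors
    except SyntaxError:
        return False
    return True

def filter_suggestions(suggestions: list[str]) -> list[str]:
    return [s for s in suggestions if passes_basic_checks(s)]

print(filter_suggestions(["def add(a, b):\n    return a + b", "def broken(:"]))
```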
Of course, not all use cases have validation checks as simple as testing whether code can compile. As such, I would also recommend thinking about affordances you can build that allow the user to know when to gut-check the output as well as how to gut-check the output. This might involve building some testing or evaluation framework that the user can use to define domain-specific tests or invariants that they want to hold, or it might involve workflows that ask the user to test/verify the output and edit it in certain cases.
Debuild is a good example of a simple app that has approached validation affordances well. Debuild allows you to create simple web applications with natural language — you start by typing something like “A to-do list that allows me to track, edit, and input tasks in a hierarchical format,” and it will create a React application based on this. Importantly, the workflow of the application natively incorporates debugging, testing, and validation steps. After you type the initial natural language prompt, rather than take you straight to the output, it takes you through a number of intermediate validation steps. For example — it shows you a list of “use cases” it believes your app needs to support based on the natural language prompt, and it allows you to edit/modify that list before moving on.
This step-by-step, validation-oriented workflow takes a use case that would likely otherwise be way too frustrating — as there is too much room for the LLM to get things wrong if you try to do everything at once — and makes it very delightful. As a user, I implicitly understand when I am expected to validate the output and how I am expected to do that, and as a result, I don’t really mind if the machine sometimes is a little bit wrong.
Related to this, I think there is a lot of room for LLM-based products to explore richer error messaging and fallback workflows. Rather than defaulting to always showing the model output, explore whether there are ways to predict or guess when the output may be low quality or uncertain. In such instances, see if you can do something different from a product perspective. This might involve sanitizing the user input and throwing an error message if the user prompt is too “weird” in one way or another. It might involve analyzing the confidence of the LLM’s response and either not showing the response or showing it with a warning if the confidence is too low (with the caveat that the “proper” way to analyze the confidence level of an LLM inference is an active area of research which is still not well understood). It might involve analyzing the output and putting the user into a fallback “you need to check this output” workflow under certain situations. I think there is a lot of design space to be explored here.
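Here is a rough sketch of what such a routing layer could look like, using mean token log-probability as a deliberately crude confidence proxy (as noted above, principled confidence estimation for LLMs remains an open research problem, and the thresholds here are invented):

```python
import math
from enum import Enum

class Route(Enum):
    SHOW = "show"                # display the output normally
    SHOW_WITH_WARNING = "warn"   # display it, but flag it for a double-check
    FALLBACK = "fallback"        # don't display; send the user to a review workflow

def route_response(token_logprobs: list[float]) -> Route:
    # Geometric-mean token probability as a crude, illustrative confidence proxy.
    mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    if mean_prob > 0.80:
        return Route.SHOW
    if mean_prob > 0.55:
        return Route.SHOW_WITH_WARNING
    return Route.FALLBACK

print(route_response([-0.05, -0.10, -0.02]))  # high confidence -> Route.SHOW
```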
Most generative products built around LLMs today are “stateless” in the sense that you provide an input, you get an output, and nothing about that interaction influences the future behavior of the model.
Yet, in my experience, this structure does not align well with how humans like to approach creative thinking. The creative process is typically much more iterative, with each step derived from the last. For example:
In this sense, creative idea generation is often “stateful” — each step plays off the last step’s output. I suspect most applied LLM products focused on creative generation or synthesis will likely benefit from treating user interactions more like sessions where the user can see some initial outputs, highlight what they like or don’t like, and then further nudge the model in the direction they want the output to go.
Many emerging LLM-based copywriting tools are starting to do basic versions of this. Copy.ai allows you to pick initial suggestions you like and generate “More like this,” while Anyword allows you to highlight outputs and “rephrase” them, which changes the wording but preserves the core meaning.
You can imagine many more sophisticated versions of this, where the system allows you to very elegantly define what characteristics you like in certain outputs (wording, tone, style, structure, length, etc), as well as what types of changes you want to make. These sorts of features offer users a much greater feeling of control and help users avoid the frustrations that can stem from re-prompting the model from scratch and “losing” something you liked a lot but which was just a little bit off.
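A hedged sketch of this “stateful session” pattern, with a placeholder `call_llm` standing in for a real model call and a prompt format that is purely illustrative:

```python
def call_llm(prompt: str) -> list[str]:
    return ["(candidate copy 1)", "(candidate copy 2)"]  # placeholder model call

class CopySession:
    """Keeps state across rounds so each generation builds on what the user liked."""

    def __init__(self, brief: str):
        self.brief = brief              # what the user is trying to write
        self.liked: list[str] = []      # candidates the user has marked as good

    def mark_liked(self, candidate: str) -> None:
        self.liked.append(candidate)

    def generate(self) -> list[str]:
        prompt = f"Write ad copy for: {self.brief}\n"
        if self.liked:
            examples = "\n".join(f"- {c}" for c in self.liked)
            prompt += (
                "The user liked these earlier candidates; match their tone and "
                f"style, but do not repeat them:\n{examples}\n"
            )
        return call_llm(prompt)

session = CopySession("a running shoe for trail runners")
first_round = session.generate()
session.mark_liked(first_round[0])
second_round = session.generate()   # now conditioned on what the user liked
```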
The fact that accuracy and trust will be an issue for many of these models greatly exacerbates the importance of this concept. If you build in the right affordances for the user to deal with “bad” output and select for “good” output via an iterative process, you drastically reduce the perceived negative impact of bad suggestions. Implicitly, this design tells the user not to expect that the model’s output will be perfect and gives them the tools to deal with that. This is a much more realistic mental model for how a user should interact with the LLM in most situations.
Some of the most delightful implementations of LLMs that I have used thus far are built to solve extremely specific and narrow workflows, such as AutoRegex, which converts English to regular expressions, and Warp’s tool for converting natural language to CLI commands. Both of these tools tackle an extremely narrow subset of language generation, and the result is compelling. The model essentially always does exactly what you expect it to do as a user, the perceived accuracy is very high, and there is no confusion as a user about how to prompt, trigger or use the model.
When you compare this with more general products, such as LLM-based text editors like Lex, the difference in user experience is profound. While Lex is extremely cool and fun to use, to me, it still feels much closer to a novelty and much further from a core tool that I can use day to day. It is simply too challenging to understand when to trigger the model and to guess what it will write.
Many startups in this space have begun to realize this. If you look at the evolution of the LLM-based copywriting products, many have gone from more general copywriting interfaces to highly segmented UIs for very specific copywriting use cases — e.g., Copy.ai has different parts of the product for writing “Facebook Primary Text,” “Facebook Headlines,” and “Facebook Link Descriptions.” In the automated code review space, I see many of the best startups starting by focusing on specific code ecosystems (e.g., Rust) and specific classes of code review use cases (e.g., style errors).
The most commonly discussed benefit of specificity is that it may allow for a more specialized, fine-tuned model to be used, leading either to performance improvements or a latency/cost reduction. While this is certainly true, it is far from the only benefit. Constraining the use case also allows you to make much more opinionated product and UX decisions at the interface between the model and the human, and to abstract away much more of the model prompting and model output.
For example, many companies working on analyzing unstructured user feedback via LLMs, such as Viable and Enterpret, have moved away from highly prompt-driven user interfaces where a user can ask whatever they want. Instead, they focus more on “out of the box,” segmented use cases such as trending negative feedback or new feedback tied to a specific recent release of the product. This approach removes the need for the user to identify what to ask the model as well as the right way to prompt it — the user doesn’t need to come up with the idea that analyzing trending negative feedback might be a good idea, nor do they need to figure out the right way to ask the model to provide trending negative feedback.
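One simple way to productize this pattern is to ship a small set of named, pre-tuned analyses rather than a free-form prompt box. The sketch below is illustrative only; the analysis names and templates are assumptions, not how Viable or Enterpret actually work:

```python
def call_llm(prompt: str) -> str:
    return "(analysis)"  # placeholder model call

CANNED_ANALYSES = {
    "trending_negative_feedback": (
        "You are analyzing customer feedback. Summarize the five negative themes "
        "that grew the most over the last 30 days, with one example quote each:\n\n{feedback}"
    ),
    "feedback_on_latest_release": (
        "Summarize feedback mentioning the most recent release, grouped into "
        "bugs, usability issues, and feature requests:\n\n{feedback}"
    ),
}

def run_analysis(name: str, feedback: str) -> str:
    # The user picks an analysis from a menu; the tuned prompt stays invisible.
    prompt = CANNED_ANALYSES[name].format(feedback=feedback)
    return call_llm(prompt)
```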
In coding, there is also a clear trend toward workflow specificity. Replit treats rewriting code, explaining code, and generating code as completely distinct use cases from both a UI perspective and from a model perspective. Some products are beginning to further segment even within code generation, treating full-function code completion based on a function signature (e.g., have the LLM write a function based on a comment) as distinct from in-line code autocomplete (e.g., guess the rest of the line of code I am writing) in terms of how they surface in the UI or are triggered.
Similarly, I suspect that most LLM-based text editors will, over time, move away from utilizing a hotkey command as the universal way to trigger an LLM, and move towards much more cleanly defined use cases that are visually surfaced in distinct ways or triggered via different mechanisms. Paragraph generation, sentence or phrase auto-complete, rewording/rewriting, information gathering, and having the model help you “brainstorm” what to write are likely quite distinct in how they should be productized; the Wordcraft paper on short story writing has some great illustrations and examples of this. While segmenting these use cases will certainly allow for individual models to be fine-tuned, I suspect that most of the marginal benefit comes from a simpler-to-understand product that is easier to use, trust, and interface with.
The last UX principle worth touching on is latency and performance. Many of the most interesting applied LLM use cases live in the context of a “flow state” task like writing, coding, or similar, which you ideally do not want to interrupt with a loading spinner that lasts 5 seconds. If you are building an LLM-based application, it is worth deeply studying the latency thresholds that your use cases will require to achieve a good UX. From there, you can identify the right model and system architecture.
For example, TabNine is a code completion startup with a very different architecture from Github CoPilot — rather than utilizing one monolithic language model, their product is built around an ensemble of specific models fine-tuned for different types of tasks. Notably, utilizing much smaller models allows them to do inline, real-time code suggestions as you type each character, something CoPilot cannot achieve.
Replit, similarly, has done extensive optimization work to get their median response time to 400ms. I suspect that virtually all generative writing LLM use cases will benefit massively from hitting the 100ms threshold over time.
Although performance improvements are obviously technical in nature, my argument would be that they really need to start from a deep understanding of the user, the user’s workflow, and the user experience you want to create. There is a huge design space of ways to modulate the cost, speed, size, and accuracy tradeoffs of language models — selecting different baseline pre-trained models, fine-tuning, prefix tuning, etc. — and in many applications, there are likely Pareto-optimal points in this search space that should be identified.
Tied to this, I suspect that in many cases, there will be other innovative ways to drive performance, such as intelligent use of caching, parallelization of compute, utilizing an ensemble of models (including some small language models or traditional model architectures), clever integrations between semantic/heuristic checks and the language model, and similar. Subtle UX changes can often also result in dramatic differences in perceived performance and should not be discounted. As an example, Replit has its code completion models “stream” the results line by line for multi-line suggestions rather than wait until the whole response can be shown. This results in a very different user perception of speed and a vastly improved UX.
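As an illustration of the streaming idea, here is a small sketch that surfaces a multi-line completion one line at a time as chunks arrive; `stream_tokens` is a placeholder for whatever streaming interface a model provider exposes:

```python
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Placeholder: yields chunks of text as a model would produce them."""
    yield from ["def add(a, b):", "\n", "    return ", "a + b", "\n"]

def stream_lines(prompt: str) -> Iterator[str]:
    buffer = ""
    for chunk in stream_tokens(prompt):
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line              # the UI can render each completed line immediately
    if buffer:
        yield buffer                # flush any trailing partial line

for line in stream_lines("write an add function"):
    print(line)                     # in a real product: append to the suggestion widget
```

The model takes just as long to finish, but the user starts seeing (and evaluating) the suggestion almost immediately, which changes the perceived latency dramatically.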
The best applied LLM startups will have very strong opinions about where they should sit among these tradeoffs and will likely identify clever system architectures to achieve a differentiated user experience.
Many people starting LLM-based companies think a lot about building proprietary data moats or data feedback loops that allow for data- or model-driven defensibility over time. I generally think that for the majority of vertical LLM use cases, it will be hard to achieve any substantive advantage in this manner: these models are becoming commoditized, the benefits of marginal accuracy beyond a certain point are limited for most of these use cases, and in many situations there is enough easily accessible or open data.
It is commonly believed that product and UX insights do not generally drive defensibility, given that they are easily copied. If one of the copywriting companies figures out the perfect workflow for LLM-assisted copywriting, can’t the others quickly mimic it? While this is true to an extent, it misses something — many UX insights in this space need to be coupled with unique, domain-specific technical work to enable them. Examples of this that have already been touched on in this article include Google’s semantic validation engine for its code completion system and TabNine’s ensemble approach to language modeling to allow for inline code suggestions while the developer types.
Superhuman is a good illustrative example of this in a different domain. A lot of what makes Superhuman special is how fast everything is, enabling the user to stay in a flow state, but this was only achieved via extreme engineering effort to ensure every single action in the product happens in <100 ms.
I am certain that many applied LLM teams that start by identifying the ideal user experience for their use case will need to do research-level work in areas like explainability, confidence analysis, prompt engineering, performance, and testing and validation to actually build that user experience. Domain-specific context integration, fine-tuning on customer data, and continuous learning are some other interesting areas for unique IP to be developed that will enable improved user experiences. This sort of domain-specific, “user-aware” research is often the hardest type of technical progress for others to duplicate or mimic. Relatedly, I suspect that many enduring startups in this space will build out complex hybrid systems that combine heuristics, symbolic approaches, transformer-based models, and legacy language models in an elegant way.
Great startups in this category will think from first principles about the UX they want to deliver and identify truly novel ways to achieve that technically in a way that few others can. The more specific these UX insights are to a specific domain, the better, as the less likely they are to be obviated by general research improvements in large language models.
In some sense, much of what I have written about in this article is not particularly new. The central issue of teaching a human how to interact with a probabilistic machine has been true for the vast majority of applied ML companies over the last decade. In each case, you need to figure out how to build a product that “gracefully handles the confusion matrix” and engender trust among the users interacting with the model. Indeed, I think many LLM companies can likely learn an immense amount from the best-in-class ML-enabled creative tools that exist today, such as Runway.
Yet, I can’t help but feel that these problems are even harder to solve and more critical to get right in the applied LLM space. This heavily relates to the ambition of startups in this category. It is, frankly, incredible to even consider the idea that machines can so foundationally influence complex, creative, knowledge-worker tasks like writing software. The scope of these tasks is, to some extent, much broader than that of most prior applied AI workflows, which dramatically increases the importance of creating carefully tuned products with outstanding UX, especially at the interface of the model and the human.
It is worth noting that many of these UX questions can be tested, de-risked, and iterated on with a very small amount of money. You can build a hacky prototype with a hard-coded backend to test the interaction pattern — no LLM is required. I think far more startups in the space should do things like this before they build a huge language model stack. Think through all the different ways a user could interact with the product and reason through ways to address the questions I posed at the start of the article. Take an off-the-shelf LLM and focus on refining the UX of interacting with the product before you try to do anything custom from an architecture or fine-tuning perspective. These are not hard things to do, but so few teams do them.
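For example, a “Wizard of Oz” style prototype with a hard-coded backend can be a handful of lines. Everything below is canned and hypothetical, but it is enough to put the interaction pattern, including deliberately bad outputs and realistic latency, in front of users:

```python
import random
import time

CANNED_RESPONSES = [
    "Here's a first draft of that follow-up email: ...",
    "Sorry, I couldn't come up with anything useful for that request.",
]

def fake_completion(user_input: str) -> str:
    time.sleep(random.uniform(0.2, 1.5))    # simulate realistic model latency
    return random.choice(CANNED_RESPONSES)  # include "bad" outputs on purpose

if __name__ == "__main__":
    while True:
        text = input("> ")
        print(fake_completion(text))
```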
This space desperately needs more founding teams with strong design and product instincts who treat UX as a first-class problem. The foundational interaction patterns for using LLMs have not been figured out, and there is a big opportunity for the startups that are at the forefront of this.
If you’re a team like this working on an applied LLM startup and thinking a lot about how to get the human-machine interface right, I’d love to chat with you — feel free to reach out at davis (at) innovationendeavors (dot) com. And if you liked this, I write more frequently on topics in computing here.