Harpinder (Harpi) is a Partner at Innovation Endeavors and an experienced entrepreneur. He has spent nearly two decades building products for emerging industries at the intersection of mobile, data, and deep tech. He co-founded two technology companies that were successfully acquired, both alongside Innovation Endeavors Partner Scott Brady.
Harpi led investments in: Panther, a threat detection and response platform powered by a highly scalable security data lake and detection-as-code; and Gatik, which is developing self-driving solutions for urban logistics.
Prior to becoming a partner at Innovation Endeavors, Harpi was co-founder and CEO at Slice Technologies (acquired by Rakuten), a data and analytics company that provides market insights about eCommerce to the world’s largest consumer brands. He previously co-founded FiberTower (acquired by First Avenue Networks), the largest independent provider of backhaul to wireless carriers, where he was the head of product.
Harpi grew up in Chandigarh, a city in northern India. He earned a BS in computer science at IIT-BHU in Varanasi and immigrated to the US to attend Indiana University Bloomington, where he earned an MS in computer science. He later earned an MBA from the Stanford Graduate School of Business. In addition to investing, Harpi is a lecturer in management at the Stanford Graduate School of Business.
Using AI to bring clarity and precision to private market investing: Our investment in Tetrix
Over the next few years, we expect that recent advances in data processing, cloud infrastructure, and artificial intelligence will enable a fundamentally new class of vertical software products that bake “intelligence” into every layer of the workflow. Large language models, with their ability to ingest and semantically reason over even the gnarliest unstructured data sets, have in particular unlocked new opportunities to go after markets that were impossible to address with legacy information processing and ML techniques. This pattern is exemplified by several companies we’ve already partnered with, like AlphaSense, Weave, and Trunk Tools, all of which are applying foundation models to bring greater productivity and decision-making insight to their industries.
Our conviction in this thesis only deepened when we incubated and partnered with an exceptional duo, Olivier Babin and Naunidh Singh Bhalla, as they led a Research Driven Ideation (RDI) investigation into the changing needs of private market investors. Combining their backgrounds in financial services with a unique perspective on how LLMs might allow them to rethink some of the core needs in the industry, the team conducted hundreds of customer interviews to identify a massive pain point around performance data collection and analysis. Today, we are excited to announce that we are leading the seed round for the solution to this problem: Tetrix.
In the financial services world, institutional capital allocators like pension funds, sovereign wealth funds, university endowments, and family offices are responsible for investing in a diversified portfolio of assets, which can include traditional investments (public stocks and bonds), as well as “alternative investments” like private equity, venture capital, and hedge funds. The risk-return profile of these alternative investments has garnered increasing interest from investors, and consequently, investment in alternatives is expected to grow from around $13 trillion today to $23 trillion by 2026.
Despite this growth, many of the workflows around alternatives investing remain extremely arcane. In particular, while there is robust data and reporting infrastructure that helps an allocator dissect their holdings and positions in public securities in real time, in the alternatives world allocators typically receive documents (like financial statements, SOIs, and investor letters) from their fund managers as unstructured PDFs. Performing even basic analysis in a spreadsheet first requires parsing this data into a usable format. While computer vision techniques such as OCR can be helpful in domains where documents are standardized, in this case each document has a unique format that can change quarter to quarter. Moreover, stylistic choices made by fund managers (such as using a portfolio company’s logo instead of its name) can limit the effectiveness of legacy text extraction techniques. So most allocators must employ a human team, most commonly via an outsourced service provider like MSCI, to manually key in data from these documents.
This manual dependency creates a fundamental three-way tradeoff between the speed of parsing, the accuracy of the data, and the amount of information that can be pulled. In our customer discovery interviews, we heard cases of numbers being reported inaccurately, investment teams waiting over a month for data to be processed, and back office teams trying to prepare Q2 2024 financial statements using Q4 2023 data. Most surprising to us were the implications of these challenges for the questions that allocators could ultimately answer about their holdings. For example, an allocator might struggle with the following simple queries:
Company X is about to IPO and I’m invested in it through multiple VC funds — how much do I own in total?
Are my managers marking the same company at similar or drastically different valuations?
Were my managers able to win allocation into the best companies in their sector over the last decade?
What is my exposure by geography, sector, or stage?
Now, enter Tetrix, a company building a data ingestion and analytics platform that helps capital allocators manage their investments in alternative assets. Using large language models and other natural language processing techniques, the product retrieves and parses reports and one-pagers from fund managers and extracts key data points. The flexibility of transformer-based models enables Tetrix to scale to reports in any format, parsing them automatically and handling the domain-specific semantic nuances of working with these documents.
Once the data is ingested, the Tetrix platform offers rich insights and benchmarking tools that help allocators answer questions about their underlying assets, benchmark managers within and across industries, plan and forecast future cash needs, generate reports for boards and investment committees, and more. The end result is that allocators get more accurate, more timely data, a deeper understanding of performance and risk, and faster throughput on decision-making and analytical workflows, enabling them to make better investment decisions.
The team and story behind Tetrix are unique and follow a long history we have of incubating companies through Research Driven Ideation. I first got to know Olivier and Naunidh when they were students at the Stanford GSB. Olivier was formerly a banker and VC, spending time at Goldman Sachs before joining the SoftBank Vision Fund. Naunidh graduated from the Singapore University of Technology and Design and was a software engineer and tech lead at JP Morgan before joining the GSB. The two had a deep shared interest in entrepreneurship, explored various ideas for months before they graduated, and were even voted “most likely to build a unicorn” by their Stanford class.
After they graduated, the team decided to forgo corporate jobs and undertake a structured search process to find a startup idea to work on. Because I had used the same RDI process with my co-founders to start both of my companies, the whole Innovation Endeavors team and I were excited to support them through the incubation process over the last year. After hundreds of interviews and dozens of ideas explored across financial services and the supply chain, it became clear to Olivier, Naunidh, and us that the idea that ultimately became Tetrix had the right combination of market size, willingness to pay, technical difficulty, and “why now” to make it an exciting opportunity.
Although we agreed with the team about the various reasons why this is a good market to tackle (the level of pain, the low quality of incumbent solutions, the opportunity for technical disruption using LLMs), what got us most excited to invest was the team itself. We were consistently impressed by their customer-centric, fail-forward-fast approach to the incubation process: constantly experimenting, learning, and refining while asking the right questions. The team has the technical chops to solve the hardest problems and the GTM sophistication required for this vertical. More than anything, we believe these qualities will propel them to build an incredible company. We are excited to continue to support Tetrix in this next phase of the journey. Welcome to the Innovation Endeavors family!
Building AI-powered software engineering tools: Essential technical considerations for founders
In 2022, GitHub caught lightning in a bottle by releasing GitHub Copilot, its AI coding assistant. Today, over 37,000 businesses, including a third of the Fortune 500, use the product, which can reportedly help developers code as much as 55% faster. This, however, is only the beginning of what’s possible when AI is applied to software engineering. Many aspiring founders, themselves engineers, want to get in on the ground floor of this industry revolution by bringing products to market that drive productivity gains across the SDLC.
We spent several months meeting researchers and companies pushing the boundaries of AI-enabled software engineering, from code generation and testing to migrations and beyond. In this two-part series, we will share our guide for building AI-powered developer tools and discuss some areas in particular that we are most excited to see disrupted by this technology.
In this post, a handbook for founder CTOs, we’re covering some common design patterns and their tradeoffs, as well as key engineering challenges and current methods for addressing them.
We’ve seen a number of design patterns emerge that serve as building blocks for bringing AI into software engineering tools, each with its own benefits and tradeoffs.
Solo programming versus pair programming interaction model
One core design decision founders will make concerns the interaction model between humans and AI in their product. We see two common modalities, which we call the solo programming approach and the pair programming approach. In the solo programming model, AI acts independently and receives feedback and guidance as if it were just another human engineer. The typical implementation we see is an AI agent opening pull requests or issues in a repo while engaging with and responding to other contributors. See here and here for examples of an AI bot working collaboratively but independently to close issues.
In the pair programming model, AI works hand-in-hand with the user to achieve a shared goal, which usually means working simultaneously on the same file. This is the interaction model you experience in AI-enabled IDEs like Replit, Sourcegraph, and GitHub Copilot. You can see an example below from Replit AI. Another potential implementation of pair programming is a chatbot, where the user converses with the AI to help it refine a generated snippet of code.
In our view, the solo programming model has greater upside potential for creating developer productivity gains because full autonomy implies offloading 100% of planning, structuring, and writing code to the AI. The drawback is that providing feedback to an agent like this is cumbersome, given that the primary channel for this feedback is typically pull request comments. By contrast, feedback to a pair programming “autocompletion” happens in the flow of composition with quick iteration cycles. People are accustomed to giving feedback in this format, as evidenced by new research suggesting that users are satisfied as long as the bot gives them a good starting point. The drawback of this approach is that there is a ceiling on how much productivity improvement a pair programming form factor like this can bring, as humans will always need to be kept in the loop.
How do you decide, then, if your product should leverage solo or pair programming? The answer lies in the value proposition you want to offer. If your core pitch to customers is that you can handle a task they otherwise would not undertake (like migrations or tech-debt cleanup), then it is essential that their product experience is a magical AI bot that pushes code to solve their problem. There may be little patience on the part of customers to provide in-depth feedback and human-in-the-loop supervision. After all, your customers didn’t want to do that task in the first place, so the bot should not create more work for the user than absolutely necessary.
If instead your value proposition is around providing a speed boost to workflows that engineers inexorably must perform (such as optimization or unit testing), we think a pair programming experience is likely the right fit. Engineers may be satisfied even with higher margins of error and more hands-on coaching of the bot as long as they get a quantifiable performance uplift on these frequent tasks.
Deterministic versus probabilistic code mutation
Most AI developer tools need to perform code mutation: making edits to lines, functions, modules, and files. Companies can approach this either deterministically or probabilistically.
Let’s consider the deterministic approach first. Under the hood, this involves leveraging pattern-matching rewrite rules (called codemods) that replace one piece of code with another the same way every time. For example, here is a snippet from Grit’s documentation that shows how to use GritQL to replace all console log messages with an alert.
Although we describe deterministic code changes as string matching plus replacement, there is a great deal of technical nuance involved. Grit, for example, deeply understands the abstract syntax tree structure of your code, allowing it to ignore matching patterns in, say, quotations or comments. While deterministic code changes are reliable, they require some upfront effort to program and configure, especially if there is branching or conditional logic. They are also not as “creative” or “adaptable” as LLMs, since they are intended to perform particular transformations on a particular pattern with high reliability and consistency.
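For intuition, here is a minimal sketch of the same idea in Python, using the standard ast module rather than GritQL (this is our own illustration of deterministic rewriting, not Grit's implementation): it rewrites every print(...) call to logging.info(...) and, because it works on the syntax tree rather than raw text, never touches the word "print" inside a string literal.

```python
import ast

class PrintToLogging(ast.NodeTransformer):
    """Deterministic codemod: rewrite every print(...) call to logging.info(...)."""

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)
        # Operating on the AST means 'print' inside a string literal is ignored.
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            node.func = ast.Attribute(
                value=ast.Name(id="logging", ctx=ast.Load()),
                attr="info",
                ctx=ast.Load(),
            )
        return node

source = 'print("starting")\nmsg = "do not print this"\nprint("done")'
rewritten = PrintToLogging().visit(ast.parse(source))
print(ast.unparse(rewritten))
# logging.info('starting')
# msg = 'do not print this'
# logging.info('done')
```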
Probabilistic approaches use AI to author code directly. LLMs are token-prediction machines, and their ability to code stems from the inclusion of source code in their training corpus. Hence, with an appropriate prompt, they are able to write text that looks like code and can compile. But given that the model is selecting the next token probabilistically, it is possible for AI to generate incorrect, vulnerable, or nonsense code as well. New coding models are released frequently and benchmarked against various evaluation suites. Leaderboards such as EvalPlus can be helpful pointers to keep up with the state of the art.
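As a contrast, here is a minimal sketch of the probabilistic path. The call_llm function is a hypothetical stand-in for whatever chat-completion API you use, and the syntax check illustrates why generated code needs at least basic validation before it reaches a repository.

```python
import ast

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in your provider's chat-completion client."""
    raise NotImplementedError

PROMPT = (
    "Write a Python function slugify(title: str) -> str that lowercases the input "
    "and replaces runs of non-alphanumeric characters with single hyphens. "
    "Return only the code."
)

candidate = call_llm(PROMPT)

# The model is predicting tokens, so the output may not even parse. A cheap
# first guardrail is a syntax check before running tests or opening a PR.
try:
    ast.parse(candidate)
except SyntaxError as err:
    candidate = call_llm(f"{PROMPT}\nYour previous attempt failed to parse: {err}. Try again.")
```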
Most products, we believe, will converge to using a combination of deterministic and probabilistic approaches to perform code mutation. Teams will need to make a decision about how much of each to leverage based on a few factors:
What reliability can you achieve with probabilistic methods alone?
Which customers, if any, will pay for a product with that level of reliability?
To what extent are you willing and able to use deterministic methods to assist AI and increase reliability?
Is there sufficient market opportunity for the universe of use cases bounded by the above decisions?
It will be interesting to see how the industry balances the percentage of code mutations that happen via codemods versus AI across different use cases. We believe that the choice of deterministic versus probabilistic architectures for a particular application can lead to vastly different outcomes for competing companies.
Zero-shot versus agent-driven architecture
In a zero-shot (or few-shot) approach, an LLM receives a prompt, perhaps one that is enriched using RAG or other in-context learning methods, and produces an output, which might be a code mutation, a docstring, or an answer to a question. The diagram below, from Google’s recent blog post on Vertex AI Codey, offers a great visualization of how RAG works in a few-shot LLM approach (more on chunking, embedding, and indexing in the next section).
Agents, by contrast, are multi-step reasoning engines that use a combination of zero-shot and few-shot LLM calls to iterate toward a goal. This may involve any number of intermediate planning and self-reflection steps. For example, you could ask an agent to plan out a debugging investigation, including what root causes it should explore and which functions or files it should dig into. Then, at each step, you could ask the AI to reflect on whether it has identified the bug or needs to engage in further exploration. Agents become particularly powerful when they can also leverage tools to either mutate the codebase (e.g., codemods) or build additional context by pulling from external sources like a language server. More on this idea of an agent-computer interface can be found in the SWE-agent paper by Yang et al., which highlights some of the considerations you should make when designing an “IDE” intended for use by AI systems rather than humans.
One example of an agentic workflow can be found in the paper “AutoCodeRover: Autonomous Program Improvement” by Zhang et al. In this case, the AI agent leverages both self-reflection and a toolkit, including functional primitives like code/AST search, to build context and suggest repairs for GitHub issues. The agent is first prompted to use its various tools to retrieve relevant functions that could be the source of the bug. After retrieving context, the AI is asked to reflect on whether it has enough information to identify the root cause. Once the AI answers affirmatively, it generates a patch and is asked to validate whether the patch can be applied to the program. If it cannot, the bot tries again. A diagram of the workflow is shown below.
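Separately from that diagram, here is a heavily simplified sketch of what such a retrieve-reflect-repair loop can look like in code. The call_llm, search_codebase, and apply_patch helpers are hypothetical placeholders, and this is our illustration of the pattern, not the AutoCodeRover implementation.

```python
def call_llm(prompt: str) -> str: ...        # hypothetical model call
def search_codebase(query: str) -> str: ...  # hypothetical code/AST search tool
def apply_patch(diff: str) -> bool: ...      # hypothetical patch apply + validation

def fix_issue(issue_text: str, max_rounds: int = 5) -> bool:
    context: list[str] = []
    for _ in range(max_rounds):
        # 1. Context building: let the model choose what to retrieve next.
        query = call_llm(
            f"Issue:\n{issue_text}\nContext so far:\n{context}\n"
            "Name the function or file we should inspect next."
        )
        context.append(search_codebase(query))

        # 2. Self-reflection: has the agent found the root cause yet?
        verdict = call_llm(
            f"Given this context:\n{context}\n"
            "Answer YES if the root cause is identified, otherwise NO."
        )
        if not verdict.strip().upper().startswith("YES"):
            continue

        # 3. Repair: generate a patch and check that it applies; retry if not.
        patch = call_llm(
            f"Write a unified diff that fixes:\n{issue_text}\nRelevant code:\n{context}"
        )
        if apply_patch(patch):
            return True
    return False
```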
To see an example of this idea taken even further, you might be interested in reading “Communicative Agents for Software Development” by Qian et al., which offers a deeper dive into a multi-agent system that has “CTO”, “designer”, “tester”, and “programmer” agents chatting with each other to decompose a task and accomplish a shared goal.
Whether a zero/few-shot approach is sufficient or you need to introduce LLM agents is more of a technical implementation detail than a design tradeoff. Agents may be harder to steer, but may lead to greater overall success given their ability to zoom in and out on particular subproblems within a complex task.
Multi-agent collaboration has emerged as a key AI agentic design pattern. Given a complex task like writing software, a multi-agent approach would break down the task into subtasks to be executed by different roles -- such as a software engineer, product manager, designer, QA…
Planning is a critical part of the agentic workflow. AI agent products can be segmented along two product directions depending on if humans assist with the planning or the AI composes the plan independently.
As an example, on the human-directed side, we can look at Momentic in the end-to-end testing space. In Momentic, a higher-level test plan is written by the user with detailed instructions on testing procedures that are then executed with the help of AI. This is useful because testers care more about verifying that the application follows the correct intent (e.g., displaying the weather) and would prefer not to get hung up on particular assertions (“the value of the weather html element is 72”), which create brittleness. AI is quite good at implementing this sort of fuzziness. For example, in the workflow below, the AI can assert that we logged in successfully whether the page says so explicitly or displays a landing page that can clearly only be accessed by an authenticated user.
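As a toy illustration of what a human-directed, intent-level plan might look like (a hypothetical sketch, not Momentic's actual format), the human supplies the steps and an AI executor decides how to satisfy each assertion against the live page:

```python
def execute_with_ai(step: str) -> None:
    """Hypothetical executor: drives the browser and judges whether the step's
    intent was satisfied, rather than checking a brittle, exact assertion."""
    raise NotImplementedError

test_plan = [
    "Go to the login page",
    "Sign in with the staging test account",
    # Fuzzy assertion: an explicit success banner OR a page that only an
    # authenticated user could see both count as a pass.
    "Assert that we are logged in successfully",
    "Open the weather widget",
    "Assert that today's weather is displayed",
]

def run_plan(plan: list[str]) -> None:
    for step in plan:
        execute_with_ai(step)
```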
For an example of an independent planning approach, we can look at Goast.ai, which seeks to automate root cause analysis and debugging workflows by ingesting observability data and dynamically searching a codebase to perform auto-remediation. As the diagram below shows, Goast uses a multi-agent architecture that involves (1) a context engine that retrieves semantic information from the codebase useful for exploring an RCA, (2) root cause analysis and implementation agents that perform the investigation and remediation, and, most critically for this section, (3) a solution planning agent.
The question of whether to use independent or human-directed planning in your product closely parallels our earlier discussion on solo vs pair programming approaches: it depends on the value proposition you want to offer. In the planning phase, the decision about whether to incorporate humans-in-the-loop for plan generation should largely be based on the delta between the effort required to create a plan for your use case versus the time required to implement it.
For example, in testing, it takes significantly longer to implement a test case in a browser automation framework like Selenium than to articulate the higher-level steps in natural language. In fact, QA teams generally want some level of control over the system and the actions it takes, so a completely AI-generated and executed test plan may run counter to the goals of the user.
By contrast, in debugging/RCA, the plan is the core task to be automated. While it can be challenging to figure out how to explore and prune the search space to zero in on the root cause of a bug, many times the fix itself is quite simple. And, if one seeks to create a fully autonomous SRE as the core value proposition, a human-assisted planning approach is counter to the goals of the product.
Technical challenges
Most companies building products in this category encounter the same technical roadblocks. Here, we outline a few of them and some common strategies that companies have been employing to address them.
Preprocessing and indexing
Although there have been many recent advances in expanding model context windows to hundreds of thousands or even millions of tokens, many codebases are still far too large to fit in a single context window. Even when it is possible, it’s not clear that this approach is actually helpful to the model, as models can struggle to effectively use long context windows (a problem called “lost in the middle”). Hence, companies face a significant challenge in pre-processing codebases so that, at inference time, the AI can parsimoniously retrieve the context it needs to answer a prompt. Because of how impactful preprocessing is to the rest of the AI stack, this is one of the key areas where companies in this space can differentiate technically.
Many companies will start by chunking the codebase into usable snippets and generating embeddings for them using a model like Voyage Code. These embeddings are then stored in a vector DB so they can be queried later, essentially allowing the AI to pull the K most relevant code snippets for a given prompt and perform re-ranking. There is a lot of nuanced complexity in crafting a chunking and retrieval strategy because one could generate embeddings for files, modules, logical blocks, or individual lines. The more granular your embeddings, the more fine-grained but also the more myopic your retrieval becomes. Pinecone provides some useful guidelines for thinking about chunking strategies. At a high level, you can think about chunking using a few different approaches (a minimal sketch of the chunk-embed-retrieve pipeline follows this list):
Size/length: embedding a consistent number of characters or lines per document
Structure-based: embedding based on blocks (if/else statements, functions, for-loops), modules, or classes
File-based: embedding entire files
Component-based: embedding based on discrete components that work together in a logical unit, such as front-end components (auth, feed, profile page, etc.) or microservices
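The sketch below shows the simplest version of this pipeline: size-based chunking with top-k cosine retrieval. The embed function stands in for any code-embedding model and is hypothetical, and a production system would persist vectors in a vector database rather than a Python list.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; swap in your code-embedding model of choice."""
    raise NotImplementedError

def chunk_by_lines(source: str, max_lines: int = 40, overlap: int = 5) -> list[str]:
    # Size-based chunking: fixed-length windows with a small overlap so that
    # logic spanning a boundary still appears intact in at least one chunk.
    lines = source.splitlines()
    step = max_lines - overlap
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), step)]

def build_index(files: dict[str, str]) -> list[tuple[str, str, np.ndarray]]:
    # Each entry is (file path, chunk text, chunk embedding).
    return [
        (path, chunk, embed(chunk))
        for path, source in files.items()
        for chunk in chunk_by_lines(source)
    ]

def retrieve(index: list[tuple[str, str, np.ndarray]], query: str, k: int = 5) -> list[str]:
    # Pull the k chunks whose embeddings are closest (by cosine similarity) to
    # the query embedding; a real system would typically re-rank these candidates.
    q = embed(query)

    def score(item: tuple[str, str, np.ndarray]) -> float:
        v = item[2]
        return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-9))

    return [chunk for _, chunk, _ in sorted(index, key=score, reverse=True)[:k]]
```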
If you’re interested in a concrete example of how chunking can be done, take a look at this blog post from Sweep.dev, which talks about how tree-sitter can be used to recursively chunk codebases based on the abstract syntax tree. Updating or refreshing this index can potentially be a drag for large codebases, but Cursor has done some interesting work leveraging Merkle trees to efficiently update their index.
At Cursor, we’re fascinated by the problem of deeply understanding codebases.
One useful primitive we’ve been focused on is code graph construction and traversal.
In addition to code, you can also think about embedding other kinds of files, like documentation, product specs, and release notes. A knowledge graph that captures semantic relationships between architectural concepts can be particularly helpful here. The best strategy, though, might be to combine multiple indexing strategies for different granularities of search or retrieval. A component-level vector DB might be helpful for understanding how broad parts of the system function at a high level, whereas a block-level embeddings database might be helpful for identifying targets for code mutations. If you expose these as tools to an agent, it may be able to dynamically reason about which database provides the most relevant context depending on the current state of its exploration.
One way to augment your chunking strategy is to pair it with a non-AI-based map of the codebase, including files, functions, dependencies, and call graphs. This context can be very valuable if you need to provide the AI with a list of libraries or functions available in the codebase, pull additional relevant or dependent code into the context window even if it didn’t appear in a vector similarity search, or locate and mutate files in the repo.
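As a small illustration of such a non-AI map, here is a function-to-callee graph builder using Python's ast module. This is a sketch for intuition only; real tools would also track imports, classes, and cross-file references.

```python
import ast
from collections import defaultdict

def call_graph(source: str) -> dict[str, set[str]]:
    """Map each function definition to the names it calls directly."""
    graph: dict[str, set[str]] = defaultdict(set)
    for fn in [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]:
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                graph[fn.name].add(node.func.id)
    return dict(graph)

example = '''
def load(path):
    return open(path).read()

def main():
    data = load("config.json")
    print(data)
'''
print(call_graph(example))
# {'load': {'open'}, 'main': {'load', 'print'}}  (set order may vary)
```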
Validation and assurance
Success for most AI developer tools depends on their ability to contribute code that gets accepted into their customers’ repositories. Trust in the tool is a critical prerequisite for this, and trust is greatly improved when the product can make assurances around its code’s safety, functionality, performance, and accuracy. There are several techniques companies employ to this end.
Linters and static analyzers
One of the most basic checks an AI tool can run before proposing a pull request is a static analyzer or linter. These tools help check syntax, style, security, memory leaks, and more. The AI can be prompted to leverage linters as a tool for self-reflection, iterating until the linter no longer complains.
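A sketch of that loop might look like the following. We use flake8 here purely as an example linter, and call_llm is again a hypothetical model call.

```python
import subprocess
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in your provider's chat-completion client."""
    raise NotImplementedError

def lint(path: Path) -> str:
    """Run flake8 and return its findings (an empty string means a clean pass)."""
    result = subprocess.run(["flake8", str(path)], capture_output=True, text=True)
    return result.stdout.strip()

def self_correct(path: Path, max_rounds: int = 3) -> bool:
    # Feed linter findings back to the model until the file passes or we give up.
    for _ in range(max_rounds):
        findings = lint(path)
        if not findings:
            return True
        revised = call_llm(
            f"The linter reported:\n{findings}\n\n"
            f"Rewrite this file to fix every finding:\n{path.read_text()}"
        )
        path.write_text(revised)
    return not lint(path)
```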
Testing
Comprehensive test coverage has long been a core acceptance criterion for development projects. Hence, it would make sense to require that any time AI modifies the codebase, all existing tests pass and additional tests are proposed to cover any new functionality being introduced. While this is great in theory, there can be some complications and limitations in practice.
First, there may not be an existing regression test suite for the part of the code you are touching. This could be due to poor development practices and hygiene, gaps in current test coverage, a completely greenfield feature, or the need to rewrite the entire existing test suite in a new language or framework (for example, during migration tasks). In these cases, the AI may first need to generate a test suite that accurately captures the intended behavior of the system, a difficult problem known as specification inference.
Second, test coverage itself is an expensive way of gaining a limited view into the correctness of your code. Test suites are not free to run, and metrics like code coverage alone are not enough to guarantee that the code has been thoroughly tested or is 100% functionally correct.
Finally, if AI is asked to respond to a failing test, it must generate a patch that provides a general solution to the bug instead of a hacky workaround for the particular test.
Formal Methods
Another less explored avenue for software validation is to employ formal methods like Model Checking, which provide strong logical guarantees on the behavior of finite state machines. Although these techniques provide more completeness and rigor, they become computationally very expensive and challenging to implement as your state machine gets larger. Hence, it may be impractical to use this technique (without significant levels of abstraction) for large codebases.
Human Feedback
The last line of defense against bad AI-generated code making its way into a repo is human review and feedback. As discussed in the previous section, AI tools can receive feedback through PR comments if taking a solo programming approach or in-line if leveraging pair programming. In some cases, a sandbox environment will need to be spun up to allow humans to click through applications end-to-end.
Conclusion
In this piece, we covered some of the core technical design decisions, tradeoffs, and challenges that product-builders in this space will face. One way to summarize these learnings would be to think in terms of two guiding questions:
#1: How much do you want humans to be in the loop?
Do you want to adopt the solo programming or pair programming model? If using an agentic approach, will humans or AI be doing the planning? How much human attention is needed to validate the code?
#2: How will you ensure the reliability and accuracy of your system?
Do you need to build deterministic codemods, or are your AI guardrails and validation enough to ensure accuracy? If you need to rely on deterministic methods, does this impact your breadth of use cases or scalability? Is an agentic architecture worth the added complexity? How does your ability to index or pre-process the codebase lead to better results?
Although the technical design principles discussed in this post form some of the building blocks of great products, they alone are not enough to create a big company, as market dynamics play a big role in the outcome of any startup. For that reason, we shared a companion guide for founder CEOs that covers how an idea or problem statement can form the basis for a large, independent company and how to craft a business model to bring that product to market.
If you’re building in this space, or would like to discuss some of the ideas in this post further, we would love to hear from you. Reach out to us at diyer [at] innovationendeavors [dot] com.
Building AI-powered software engineering tools: Essential commercial considerations for founders
In 2022, GitHub caught lightning in a bottle by releasing GitHub Copilot, its AI coding assistant. Today, over 37,000 businesses, including a third of the Fortune 500, use the product, which can reportedly help developers code as much as 55% faster. This, however, is only the beginning of what’s possible when AI is applied to software engineering. Many aspiring founders, themselves engineers, want to get in on the ground floor of this industry revolution by bringing products to market that drive productivity gains across the SDLC.
We spent several months meeting researchers and companies pushing the boundaries of AI-enabled software engineering, from code generation and testing to migrations and beyond. In this two-part series, we will share our guide for building AI-powered developer tools and discuss some areas in particular that we are most excited to see disrupted by this technology.
In this post, a handbook for founder CEOs, we cover the key business and commercial decisions that influence how startups in this space are built and a few opportunities that we think could lead to big companies.
We believe there are six factors founders should consider when trying to scale a developer productivity tool into a company: value proposition, pricing model, selling tooling vs. selling results, proof of value, recurring revenue, and scaling go-to-market.
Value proposition
We generally see three ways developer productivity tools create value for their customers:
They let developers offload less desirable tasks
Engineering tasks that are tedious, lower-impact, or lower-visibility are often seen as less desirable than those that are considered cool, present a high degree of technical challenge, or can lead to bonuses and promotions. In these applications, there is high motivation for engineers to bring in tools that allow them to better leverage their time by automating substantial portions of this work. Examples of this in practice include spaces like testing (Momentic, CamelQA, and Nova), documentation (Mutable.ai), code reviews (Bito.ai), and security fixes (Pixee). It’s not that these tasks are unimportant, but rather that there is less motivation for engineers to focus time in these areas versus delivering things like highly anticipated features or cost-slashing code optimizations.
They pull forward deferred work
In a constrained resource environment, engineering leaders consider many tasks important but not urgent, so they fall below the current sprint cutline. In these situations, it is possible for a new company to come in, perform the tasks that would have been deferred to a future sprint, and share in the value created. Code hygiene and tech debt are good examples. Savvy engineering leaders know that all tech debt eventually comes due but struggle to prioritize it against immediate feature work. Many try to dedicate a fixed percentage of sprint time to addressing tech debt, but this is crude and inefficient. This creates strong incentives to bring in a company like Grit.io, Second.dev, or ModelCode to handle these projects.
They upskill existing team members
A developer’s skill can be captured in many ways, but perhaps the three most relevant are (1) their total coding throughput, (2) the quality of their code, and (3) their ability to solve hard technical problems. For some tools, the core pitch is to broadly improve the engineering team’s skills along these attributes rather than to accomplish specific tasks like a code migration or a security patch. For example, coding copilots like Augment or GitHub Copilot cut down on the time required to plan and write code, thus increasing the effective throughput of individual devs. Code optimization tools such as Espresso AI and CodeFlash make it easier for devs at all skill levels to write optimized code, which enhances the overall quality of the codebase. Tools meant to aid in things like debugging investigations (like Goast or OneGrep) help all devs debug like experienced engineers. One benefit of tools addressing this value proposition is that it is relatively easy to justify a recurring contract: the tool becomes deeply ingrained in the developer’s workflow and is used often, so a subscription or recurring license model makes a lot of sense.
Pricing model
Once you’ve considered value creation, the next step is to consider value capture, which is your pricing and revenue model. Ideally, you want to price so that you win when your customer wins.
Seat-based pricing: This works well for products whose value can be clearly attributed to individual developers. Copilots, for example, are used by individual engineers and therefore make sense to price on a per-seat basis.
Outcomes-based pricing: This is ideal for products that deliver value at defined milestones and whose ROI is recognized as a step function, with most of the value accruing once the core job-to-be-done is complete. For example, a code migration tool can be priced on a per-project basis because all of the value is realized once the migration is complete.
Pay-as-you-go: Pay-as-you-go is the bread and butter of infrastructure companies, often manifesting as compute-based or storage-based pricing. This model is seen as favorable for customers who prefer not to be locked into a fixed contract; examples include paying by the lines or volume of code transformed (for code modernization companies) or by bug fixes accepted (for automated program repair companies). A variant of this pricing model is to implement some kind of “credits” system that is effectively compute-based pricing. For example, you may sell a bundle of 50 “credits” that get consumed whenever users leverage AI features; this allows you to scale revenue with infrastructure costs, which can be significant for AI applications.
Tiered licenses: This is the most freeform of the pricing strategies and generally consists of several tiers of annual licenses depending on the size of the customer and their level of pain. The pricing should scale with the size of the customer (so an enterprise pays more than a scale-up) but usually does not scale linearly. This can be used to expand pricing while giving the customer predictability on their bill.
Selling tools versus selling results
Some customers want the end results that a tool promises but are reluctant to allocate the corresponding financial and headcount budget needed to operationalize the tool itself. In short, they really want fish, not a fishing rod.
A concrete example of this is high-complexity migrations. A segment of the market will happily pay a consultancy like Accenture or Wipro to migrate their IBM mainframes to AWS. They outsource this work today precisely because they don’t have the willingness, skill, or bandwidth in-house to take on that project. For this segment, a tool their team could use to potentially do this migration themselves is uninteresting, as they really want to pay for outcomes.
At the same time, another segment of this market has been burned by consultants in the past and is deeply skeptical of accepting finished code from third parties. These teams don’t want a “skip the line” option for things that are lower on their backlog and would much rather invest in better tooling that helps their own employees be more productive and achieve the organization’s goals faster. Selling a tool to this segment is a higher-velocity, higher-margin business relative to selling services.
Proof of value
One of the critical stages in the sales process is the proof-of-value step, in which customers verify that your tool can “walk the walk”. We can think about the complexity of proving value across three tiers.
The lowest tier of complexity would be cases where an individual developer can evaluate the tool independently and see its value without needing IT approval. An AI-enabled IDE like Cursor is a great example: a developer can try it for a personal side project and develop conviction in the usefulness of the product before taking it through the approvals necessary to allow it to be used on company data.
The next tier of complexity is when the product needs IT approval but is still relatively easy to test and validate. These are products that only show their true value once they are integrated with the company’s internal systems: think AI-assisted debugging tools like Goast or OneGrep, or end-to-end testing tools like CamelQA or Blinq. Because they need to be integrated with confidential company data, IT will need to be involved to validate the security and compliance of those tools. However, the systems are relatively easy to test and validate because they are simple to integrate, show value soon after integration, and enhance pre-existing engineering workflows.
The most complex proof-of-value is one that requires IT approval and significant investment to test and validate. One example of this would be the code modernization space (companies like Second and Grit). These tools are authorized to make large, sweeping changes to codebases, and hence, IT will require extensive security reviews and controls before granting full access. Engineers will need to spend time prioritizing and assigning use cases to the AI, guiding the AI agent, and reviewing pull requests. Depending on the complexity of the modernization or migration, the value of the tool may only be clear months or quarters after the initial onboarding. Such products often command high ACVs and stickiness once trust is developed between the customer and vendor, but navigating this complex sales motion will require a skilled GTM team.
Tactically, the difficulty of proving value affects how companies should think about free trials, POCs, and paid pilots. If a product falls into the first tier (easy for an individual developer to try), it may be worthwhile to have a very generous free trial or “free-forever” tier for individuals to entice developers to kick the tires on your product. In the second tier, where IT needs to get involved, a free POC can encourage your champion to work through internal approvals to onboard your product and, because the product is easy to integrate and shows value quickly, does not dramatically increase your sales cycle or customer acquisition cost. In the third tier, a paid pilot is usually necessary to get buy-in and skin in the game from customers and to justify your support and onboarding costs.
Recurring revenue
Many workflows in software engineering tend to be one-time tasks or have lower residual value relative to their initial value creation opportunity. As an example, consider the category of “prompt-to-app” tools. In this case, the initial workflow is quite magical because AI automates a substantial portion of the overhead required to set up a web app. Once you have a strong starting point or boilerplate code, though, the job is complete, and the willingness to pay becomes primarily driven by other needs like hosting and maintenance, which may command different ACVs. By contrast, AI issue triage and bug fixes lack this self-cannibalization mechanic and could have more consistent recurring revenues as new bugs and issues are always being created.
Although non-recurring contracts can be quite large, the pressure to re-earn your revenue every year leaves a company vulnerable to disastrous, unpreventable churn. Ideally, SaaS businesses want their customer revenues to expand, not decline, with contract renewals to supplement new logo growth. There are, of course, exceptions to every rule. Large contract sizes or an extremely long tail of new projects per account are both ways around non-recurring contracts. But, broadly speaking, this is an area to tread lightly.
Scaling go-to-market
There are generally two ways of tackling a market: go after a select few high-ACV contracts or mass-market low-ACV contracts.
In the former case, you should plan on selling to large enterprises. You will most likely engage in a traditional 6–12 month sales cycle involving many senior architects or decision makers (usually the CTO/VP Engineering), countless security and compliance reviews, and rigid, formal procurement processes. The customer is likely to have a fair amount of legacy infrastructure and to demand a high degree of support and customizability, but will hopefully be less likely to churn and may scale revenue quickly if your product proves valuable. For your efforts, the customer should pay you on the order of $100K-1M a year. This means, however, that the problem needs to have executive-level visibility, impact, and prioritization.
In the latter case, you should plan on selling to individual developers, most likely at Series B-D high-growth companies. Your product should incorporate some element of self-service so you can grow through product-led growth and viral word-of-mouth via HackerNews or Twitter/X. This also means you should take cues from how consumer companies think about metrics, activation, and retention. The product should be easy to try with limited security/compliance overhead, as individual devs may be less motivated to jump through hoops to onboard your product, and it should have a well-orchestrated activation and onboarding funnel to ensure folks who sign up convert into long-term customers. Ideal pricing or contract sizes here are likely on the order of $20–100/developer/month, or roughly a $25K–100K contract for mid-size or growth-stage companies.
Given these options, when making a decision about which kind of company you want to build, here are a few questions to keep in mind that might lead you down one path or the other:
Does your product require executive-level approval? If so, does it have executive-level impact?
Does your problem resonate widely in the developer community? Is it something that developers would want to share with their peers?
Is the solution consistent or highly customized from customer to customer? Is it something that can be made easy to try? Is it something a developer might initially test on a side project?
Would you rather manage more account executives or developer relations engineers? Should you think about success in terms of “millions of developers” or “hundreds of companies”?
Exciting future areas
The space of developer productivity tooling is quite nascent, and it is difficult to predict exactly what new capabilities will emerge over even just the next few months. However, we thought it would be helpful to discuss four broad areas, one research and three applied, that we find particularly compelling.
Code-specific artifacts in foundation model architectures
From a technical capabilities perspective, one of the foundational areas of research that we expect will garner increased attention in the next few years is incorporating more code-specific artifacts into model architectures. Today, we use the same model architecture to learn both natural language and code. Hence, LLMs like StarCoder treat code as simply another language and rely on higher-quality training data as the primary means of improving coding performance. However, it is intuitive that the “grammar” and behavior of code are unique and deserve special attention in model architectures. As an example, consider the architecture proposed by Ding et al., which adds additional layers to a coding model to learn intermediate execution states and code coverage, resulting in an improvement in certain code understanding tasks over competing models.
Services businesses monopolizing specialized markets
Wipro’s gross revenue during the 2023–2024 fiscal year was close to $11B, and we believe some portion of this revenue can be disrupted by AI. Already, we see companies like Mechanical Orchard looking at mainframe-to-cloud migration, which could be applicable in areas as diverse as manufacturing, mining, accounting, insurance, pharma, chemicals, and oil and natural gas. There are many other projects in scope for consultants that can likely be automated with AI, including vendor integrations, data migrations, service re-architectures, optimizations, and custom applications. Existing foundation models and AI agents trained mostly on modern enterprise SaaS codebases in agile engineering cultures may not be effective in these markets or for these applications, and it may be interesting to explore whether this allows new companies to be built.
SRE workflows
Production outages and incidents have a number of factors that make them particularly painful. First, only a small fraction of logging, metrics, and other observability data is ever used, yet total volumes of data are increasing exponentially, making root cause analysis an even harder needle-in-a-haystack problem. Second, the worst outages involve multiple teams and services, where few people understand the full context of every subsystem involved; the current approach is to create an incident bridge with hundreds of engineers. Third, incidents and outages directly affect revenue. Facebook, for instance, notoriously lost close to $60 million in revenue due to an outage in 2021. Finally, observability costs are skyrocketing, meaning customers are paying more and still suffering from outages.
There is a lot more to be said about new tooling that needs to be built for the observability space, but suffice it to say that AI agents, such as the one from Cleric, could dramatically disrupt how incidents are triaged and managed today. In particular, we find it interesting that (1) automation can ingest more data at scale and develop a holistic picture of the health of a system, (2) AI agents are able to iteratively explore complex systems and bring in functional experts as needed by integrating with Slack, and (3) AI can dynamically identify which data is the most valuable to keep versus discard while also dynamically altering the volume of data in response to particular incidents.
Validation and verification
The methods we have to validate code today primarily consist of unit testing and static analysis. However, as AI-generated (or AI-modified) code becomes the majority of our codebases, the next problems to solve will be around verifying and validating that code at scale.
We believe the next large company in this category will be one that compounds insights across a number of related domains (AI, compilers, static analyzers, profilers, etc.) to deliver a complete end-to-end code quality platform. We imagine the core features of this product look something like:
Integrates from the IDE through to CI/CD to catch issues at all stages of the SDLC
Provides coverage across security, performance, and functional testing
Generates, prunes, and maintains a suite of unit and integration tests
Augments AI-driven evaluation with deterministic evaluation from static analysis, compilers, or formal methods
Leverages various compute optimizations to run evaluations efficiently
Is extensible enough to interact with or guide other AI agents that may be operating on the same codebase
Conclusion
In this piece, we covered the core elements that define how an idea or problem statement can form the basis of your business model for a large, independent company. One way to summarize these learnings would be to think in terms of two guiding questions:
#1: How and when will value get created for your customer?
How will your customer measure ROI? Do you continuously deliver value over time or at discrete intervals? How will you align the value creation (ROI) and value capture (pricing)? How do you ensure the customer embraces the product and is positioned to be successful with it?
#2: Will your champion be the CTO/VP Engineering or an individual developer?
In the former case, are you delivering enough ROI and ACV to justify a long, involved sales cycle? In the latter, can you grow virally through PLG, create value for individual developers, and make sustainable margin on per-developer pricing?
If you’re building in this space, or would like to discuss some of the ideas in this post further, we would love to hear from you. Reach out to us at diyer [at] innovationendeavors [dot] com or harpi [at] innovationendeavors [dot] com.