Since the launch of Airbyte 1.0 with AI Assist, hundreds of new Airbyte connectors have been built. We recently joined our partners at Airbyte for the Data Bytes meetup where we answered questions from an audience of data engineers, AI enthusiasts, and product builders about how we built AI Assist and what lessons we learned along the way.
Airbyte engaged Fractional AI to add AI Assist to the Airbyte Connector Builder to streamline API integrations by reducing the time and effort required to build a connector. This project presented a significant challenge: handling the complexity and inconsistency of unstructured developer docs. We had to build robust evaluation metrics (evals) to measure performance and systematically identify areas for improvement. You can explore our approach in more detail by reading our Airbyte case study.
The following questions and answers have been edited for length. You can also watch the full Q&A video.
Q: What is your advice for someone who wants to incorporate AI into their enterprise?
A: There are a lot of ideas for ways AI can help, but things often get stuck early in the ideation process or the proof-of-concept phase. I think the best opportunities for AI often lie in existing manual workflows. The way I think about large language models (LLMs) is that computers can now read, write, make junior employee-level decisions, and act as domain experts about everything. So that would be the number one thing I would focus on: is this a real, existing manual workflow that the LLM capability set can be applied to? Is it valuable enough? Rather than, “apply AI and it can know everything about everything.”
Q: What pitfalls did you encounter in the initial attempts to build AI Assist?
A: I think we failed to appreciate upfront how hard some of the pure software engineering parts of crawling and processing API documentation would be. The assumption that simply downloading and feeding documentation to an LLM like ChatGPT would suffice proved naive. API documentation varies widely in structure and complexity, including authentication requirements, interactive elements, and inconsistent formats. This variability led to issues with finding relevant information, crawling irrelevant pages, and handling rate limiting. Even with existing web crawling tools, achieving robust and reliable document processing was challenging.
Q: What are the core components of a successful AI project, including evaluations?
A: Building robust evaluation metrics (evals) is crucial. These are automated tests for your AI application where you run numerous examples and assess performance based on predefined metrics. A practical approach involves gathering existing data, building a test harness, and comparing model output with human-generated results. This process allows us to track progress and identify areas for improvement. For instance, in building API integrations, evals should cover aspects like authentication, endpoint definition, and schema accuracy. (Tools like Braintrust can be useful for building evals. -ed.)
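(For illustration, here's a minimal eval-harness sketch in the spirit of the approach described above. It is not Airbyte's actual code: the ConnectorSpec fields and metrics are simplified assumptions. -ed.)

```python
# A minimal eval-harness sketch (not Airbyte's actual code): compare a
# generated connector spec against a hand-built "ground truth" connector,
# field by field, and average each metric across many examples.
# ConnectorSpec and the metrics below are simplified for illustration.
from dataclasses import dataclass, field

@dataclass
class ConnectorSpec:
    auth_method: str                            # e.g. "api_key", "oauth2"
    endpoints: set[str] = field(default_factory=set)
    schema_fields: set[str] = field(default_factory=set)

def score(generated: ConnectorSpec, truth: ConnectorSpec) -> dict[str, float]:
    """One score per metric; 1.0 means exact agreement with ground truth."""
    def overlap(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if (a | b) else 1.0
    return {
        "auth": 1.0 if generated.auth_method == truth.auth_method else 0.0,
        "endpoints": overlap(generated.endpoints, truth.endpoints),
        "schema": overlap(generated.schema_fields, truth.schema_fields),
    }

def run_evals(cases: list[tuple[ConnectorSpec, ConnectorSpec]]) -> dict[str, float]:
    """Average each metric over all (generated, ground-truth) pairs."""
    totals: dict[str, float] = {}
    for generated, truth in cases:
        for metric, value in score(generated, truth).items():
            totals[metric] = totals.get(metric, 0.0) + value
    return {metric: total / len(cases) for metric, total in totals.items()}
```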
Q: How did your evaluation criteria evolve from the beginning to the end of the project?
A: The initial evaluation focused on a small set of connectors and a limited number of metrics. As the project progressed, the evaluation expanded significantly to include a wider range of connectors and more comprehensive metrics. This iterative approach helped identify areas where the system performed well and where it needed improvement. We built a more robust and complex workflow by iteratively addressing the challenges encountered during development. I was surprised that, by the end, we had included fallbacks like searching Google and Perplexity. This highlighted the need to accommodate unpredictable scenarios in real-world API documentation.
Q: How do you measure the success of the AI assistant and what does the output look like?
A: For this project, we built evals by comparing the model's output with existing, well-functioning connectors (ground truth). However, this comparison isn't always straightforward, as variations in naming conventions and schema structures may be more or less impactful to the developer. The AI Assist feature doesn't directly generate the YAML manifest. Instead, the system uses code to deterministically generate the final output, using LLM responses to answer specific questions like the authentication method. This approach leverages the LLM's strengths while maintaining control and ensuring output validity.
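(A hedged sketch of that pattern: the LLM answers a narrow, constrained question and deterministic code assembles the manifest. The ask_llm helper is a placeholder, and the manifest layout is illustrative rather than Airbyte's actual schema. -ed.)

```python
# Sketch of the "LLM answers narrow questions, code writes the manifest"
# pattern. ask_llm is a placeholder for whatever model call you use, and the
# manifest layout below is illustrative, not Airbyte's actual schema.
import yaml  # pip install pyyaml

ALLOWED_AUTH = ["api_key", "bearer", "oauth2", "basic"]

def ask_llm(question: str, docs: str) -> str:
    """Placeholder: send the question plus the relevant docs to your model."""
    raise NotImplementedError

def detect_auth_method(docs: str) -> str:
    answer = ask_llm(
        f"Which authentication method does this API use? "
        f"Answer with exactly one of: {ALLOWED_AUTH}",
        docs,
    ).strip().lower()
    # Constrain the model's answer; fall back rather than emit invalid config.
    return answer if answer in ALLOWED_AUTH else "api_key"

def build_manifest(docs: str, base_url: str) -> str:
    # Deterministic assembly: the LLM never writes the YAML directly.
    manifest = {
        "base_url": base_url,
        "authenticator": {"type": detect_auth_method(docs)},
    }
    return yaml.safe_dump(manifest, sort_keys=False)
```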
Q: Do you think current concepts like RAG and agents will still be relevant in a year, or will something new dominate the AI discussion?
A: When people talk about agents, I think there are multiple things they might mean. One thing they might mean is a thing that’s got a lot of autonomy — you give it a bunch of tools and let it decide. I’ve yet to see anything like that in practice for a significant system. I think the more interesting thing today is more around specialization and how you break your problem down into specific components that are experts in a very small subdomain. This is driven by the increasing complexity of AI projects and the recognition that effective solutions often involve breaking down problems into smaller, manageable components. So much of the mystery of what it’s like to build with LLMs is actually just software engineering under the hood, and I think that will drive more adoption of these types of agent systems. We’re seeing more and more of it — we’re talking about very tech-forward companies but we also see hundred-year-old big equipment manufacturers talking about these workflows in a very realistic way.
Q: Which frameworks are you using?
A: Very little framework code under the hood. There's some, but it's not substantial.
Q: Instead of using fallback mechanisms for finding information, have you considered combining different approaches in a single step?
A: I actually think it often starts the opposite way: you start with the larger problem and then break it down into smaller subcomponents. In practice, one area where we’ve had to break things down is deeply nested questions. We may be asking the LLM which of these authentication methods is used, and it falls off track and stops following the instructions, so we’ve had to chop it up into sub-pieces.
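(A rough sketch of that decomposition: instead of one deeply nested prompt, each sub-question gets its own narrow call. The ask_llm helper is a placeholder, and the questions are illustrative. -ed.)

```python
# Sketch of splitting one deeply nested question into narrow sub-questions,
# each answered in its own call. ask_llm is a placeholder model call and the
# questions are illustrative.
def ask_llm(question: str, docs: str) -> str:
    raise NotImplementedError  # placeholder: call your model of choice

def describe_auth(docs: str) -> dict[str, str]:
    # One big prompt ("find the auth method, where the credential goes, and
    # its exact parameter name") tends to drift; three small prompts stay on task.
    return {
        "method": ask_llm("Which authentication method is used? Answer in one word.", docs),
        "location": ask_llm("Is the credential sent in a header, query param, or body?", docs),
        "name": ask_llm("What is the exact header or parameter name for the credential?", docs),
    }
```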
We also built a content moderation system for Change.org, which runs a petition platform. It’s not just about “is this spam”; they try to allow a lot of content, but you can’t cross their community guidelines. So what we did was create specialist agents that each look at the content through a different lens, write out their reasoning, and give confidence scores; then we take all those different viewpoints and feed them to one bigger process that's like, “okay, now that you understand all these different angles, make a final decision.” So it's combining all these different sub-viewpoints.
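(A rough sketch of the “specialist reviewers, then one final judge” shape described here, not Change.org's production system. The call_model helper is a placeholder, and the lenses are examples. -ed.)

```python
# A rough sketch of the "specialist reviewers, then one final judge" shape:
# not Change.org's production system. call_model is a placeholder for your
# LLM client, and the lenses are examples.
LENSES = ["spam", "harassment", "misinformation", "graphic_content"]

def call_model(prompt: str) -> str:
    raise NotImplementedError  # placeholder model call

def review_with_lens(lens: str, content: str) -> str:
    # Each specialist sees the content through exactly one lens and must
    # write out its reasoning plus a 0-1 confidence score.
    return call_model(
        f"You review petitions only for {lens}. Explain your reasoning, then "
        f"give a confidence score from 0 to 1 that this content violates the "
        f"{lens} guideline.\n\nContent:\n{content}"
    )

def moderate(content: str) -> str:
    reviews = [f"[{lens}]\n{review_with_lens(lens, content)}" for lens in LENSES]
    # The final judge sees every specialist's viewpoint and makes one decision.
    return call_model(
        "Given these specialist reviews, decide whether to allow, remove, or "
        "escalate this content to a human moderator, and explain briefly.\n\n"
        + "\n\n".join(reviews)
    )
```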
(If you are optimizing for latency, try Parallelization and Predicted Outputs. Note that these approaches may increase costs. -ed.)
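(To make the parallelization idea concrete: when sub-questions are independent, the model calls can run concurrently. The sketch below uses the OpenAI Python SDK's async client as one example; the model name is illustrative. -ed.)

```python
# Hedged sketch of the parallelization idea: when sub-questions are
# independent (like the specialist reviews above), issue the model calls
# concurrently instead of one after another. This uses the OpenAI Python
# SDK's async client as one example; the model name is illustrative.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

async def ask_all(prompts: list[str]) -> list[str]:
    # gather() runs the requests concurrently, so total latency is roughly
    # the slowest single call rather than the sum of all calls.
    return await asyncio.gather(*(ask(p) for p in prompts))

# Usage: answers = asyncio.run(ask_all(["question 1", "question 2", "question 3"]))
```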
Q: Have you encountered challenges with AI agents providing reliable confidence scores and handling scenarios like recruitment?
A: I’ll start by saying this domain sounds very hard. Hiring sounds hard, and we struggle to train humans to do it today. If I struggle to get a pretty junior person to figure out how to reliably produce this output, then I also struggle to see how to get an LLM to do it. The analogy that jumps to mind is AI phone agent applications, where people are trying to put AI agents on the phone and those agents have to be robust to anything people say. I don’t get the sense that anyone’s figured this out yet. A hybrid approach combining structured decision-making (like a phone tree) with LLM-based logic might be more suitable, so that the LLM is assessing very specific, narrow things at each state. Building robust evaluation based on historical data, especially examples of "off the rails" scenarios, is still essential for identifying and mitigating potential issues.
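(A minimal sketch of that hybrid shape, assuming a deterministic state machine owns the flow and the LLM only answers one narrow classification question per state. The classify helper, states, and questions are invented for illustration. -ed.)

```python
# A minimal sketch of the hybrid shape: a deterministic state machine owns
# the conversation flow, and the LLM only answers one narrow classification
# question per state. classify is a placeholder model call; the states and
# questions are invented for illustration.
def classify(question: str, utterance: str, options: list[str]) -> str:
    raise NotImplementedError  # placeholder: ask the model to pick one option

STATES = {
    "greeting": (
        "Is the caller asking about scheduling, billing, or something else?",
        {"scheduling": "scheduling", "billing": "billing", "other": "handoff"},
    ),
    "scheduling": (
        "Does the caller want to book, cancel, or reschedule?",
        {"book": "book_flow", "cancel": "cancel_flow", "reschedule": "reschedule_flow"},
    ),
    # Terminal states ("handoff", "billing", "book_flow", ...) omitted here.
}

def next_state(state: str, utterance: str) -> str:
    question, transitions = STATES[state]
    choice = classify(question, utterance, list(transitions))
    # Anything the model can't map cleanly goes to a human, which keeps the
    # system from going "off the rails" on unexpected input.
    return transitions.get(choice, "handoff")
```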
Q: Do you have any specific recommendations or tricks for improving LLM performance, beyond providing examples?
A: While providing examples can be effective, there's no one-size-fits-all solution. Experimenting with various tactics, including prompt engineering techniques, and carefully measuring their impact through evals is crucial. Addressing specific failure cases by incorporating them into the prompt and leveraging tools like Anthropic's prompt generator can also yield improvements. The key is to adopt a data-driven approach, iteratively refining prompts based on observed performance rather than relying solely on anecdotal tricks.
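(One concrete version of that loop: keep a library of observed failure cases, fold them into the prompt as worked examples, and re-run your evals to confirm the change actually helps. The helper below is a simplified illustration, not a specific library's API. -ed.)

```python
# Simplified illustration of folding observed failure cases back into the
# prompt as worked examples, then re-measuring with evals (see the eval
# harness sketch earlier). Names here are not from any specific library.
from dataclasses import dataclass

@dataclass
class FailureCase:
    doc_snippet: str      # input that previously produced a wrong answer
    wrong_answer: str     # what the model said
    correct_answer: str   # what it should have said

BASE_PROMPT = "Identify the authentication method described in these docs."

def build_prompt(failures: list[FailureCase]) -> str:
    # Each known regression becomes an explicit worked example in the prompt.
    examples = "\n\n".join(
        f"Docs: {f.doc_snippet}\n"
        f"Incorrect answer: {f.wrong_answer}\n"
        f"Correct answer: {f.correct_answer}"
        for f in failures
    )
    return f"{BASE_PROMPT}\n\nLearn from these past mistakes:\n\n{examples}"

# After each prompt change, re-run the eval suite and keep the change only
# if the tracked metrics improve.
```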
Additional resources:
Databytes recap of the panel
Full video of the panel
Braintrust (evals)
Parallelization (OpenAI platform docs)