<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Fractional AI Engineering Blog]]></title><description><![CDATA[We build bespoke genAI applications for top companies from an engineering-first startup in San Francisco. Read on for tidbits learned building with LLMs.]]></description><link>https://engineering.fractional.ai</link><image><url>https://cdn.hashnode.com/uploads/logos/699e2f2d1ddf529956e43ef0/aacad41b-1165-4929-951b-e62a9dbd9041.png</url><title>Fractional AI Engineering Blog</title><link>https://engineering.fractional.ai</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 22 Apr 2026 11:04:47 GMT</lastBuildDate><atom:link href="https://engineering.fractional.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Applied AI Digest: Volume 3]]></title><description><![CDATA[April 1 – April 15, 2026

Launches & Tools
Diffity gives you a local GitHub-style PR review for agent-written codeIf you open throwaway PRs just to see the diff, this is for you. Diffity runs a GitHub]]></description><link>https://engineering.fractional.ai/applied-ai-digest-volume-3</link><guid isPermaLink="true">https://engineering.fractional.ai/applied-ai-digest-volume-3</guid><category><![CDATA[applied ai]]></category><dc:creator><![CDATA[darian bailey]]></dc:creator><pubDate>Wed, 15 Apr 2026 20:18:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69bade5a8c55d6eefb4062a9/74cf1c93-53c1-4f40-8d76-091b3002a2db.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>April 1 – April 15, 2026</em></p>
<hr />
<h2>Launches &amp; Tools</h2>
<p><a href="https://github.com/kamranahmedse/diffity"><strong>Diffity gives you a local GitHub-style PR review for agent-written code</strong></a><br />If you open throwaway PRs just to see the diff, this is for you. Diffity runs a GitHub-style diff view locally, lets you leave inline comments with severity tags, and your coding agent can resolve them directly. Integrates with Claude Code and Cursor via slash commands.</p>
<p><a href="https://github.com/0xMH/claude-skillify"><strong>Claude-skillify turns a coding session into a reusable skill</strong></a><br />Install the plugin (Previously available to anthropic only) and run <code>/skillify</code> at the end of a session and it interviews you about the workflow you just completed, then writes a SKILL.md that captures only the tools you actually used. Could explain some of the Skill explosion inside Anthropic.</p>
<p><a href="https://claude.com/blog/claude-managed-agents"><strong>Anthropic launches Managed Agents for cloud-hosted AI workloads</strong></a><br />Managed Agents handles the infrastructure nobody wants to build: sandboxed execution, credential management, checkpointing, multi-agent coordination. Now in public beta at standard token rates plus $0.08 per session-hour.</p>
<p><a href="https://ai.meta.com/blog/introducing-muse-spark-msl/"><strong>Meta launches Muse Spark and rebrands its AI lab</strong></a><br />Muse Spark is a multimodal model with visual chain-of-thought and tool use, trained on an order of magnitude less compute than Llama 4 Maverick. Meta also renamed its AI research division to Meta Superintelligence Labs. The self-reported benchmarks look strong, but we trust third-party evals over Meta's numbers.</p>
<p><a href="https://cohere.com/blog/transcribe"><strong>Cohere open-sources a 2B transcription model that beats Whisper</strong></a><br />Cohere Transcribe is a 2B Conformer encoder-decoder covering 14 languages, now #1 on HuggingFace's Open ASR leaderboard at 5.42% WER. Free on HuggingFace, API available, small enough to run locally on a single GPU.</p>
<hr />
<h2>Research &amp; Reads</h2>
<p><a href="https://futurism.com/artificial-intelligence/anthropic-claude-mythos-escaped-sandbox"><strong>Anthropic's Mythos escaped its sandbox, then bragged about it online</strong></a><br />During testing, an earlier Mythos version broke out of a secured container, gained internet access through a multi-step exploit, and posted the details to public websites — without being asked. Anthropic also found it reasoning about how to deceive graders without using its visible scratchpad. SWE-bench Verified: 93.9%.</p>
<p><a href="https://gizmodo.com/sam-altman-says-itll-take-another-year-before-chatgpt-can-start-a-timer-2000743487"><strong>Altman says ChatGPT is still a year away from starting a timer</strong></a><br />Snarky headline aside, the real story is agent tool hallucination. LLMs in agentic loops confidently claim they've called tools that don't exist. A timer sounds trivial to implement, but the underlying problem is models fabricating actions and reporting success. If you're building agents, this is the failure mode to watch.</p>
<p><a href="https://www.mintlify.com/blog/how-we-built-a-virtual-filesystem-for-our-assistant"><strong>Mintlify replaced their sandbox with a virtual filesystem backed by a vector DB</strong></a><br />Spinning up isolated containers for each assistant session cost Mintlify ~46 seconds and $70k/year. They replaced it with ChromaFs, which intercepts Unix commands and translates them to queries against their existing Chroma database.</p>
<hr />
<p><em>Fractional AI · Applied AI Digest</em></p>
]]></content:encoded></item><item><title><![CDATA[I wrote a profiling tool for agents]]></title><description><![CDATA[While building an agent for a client, we hit a scaling problem: a coordinator agent had 4-5 subagents each with 15-20 tool calls. Each tool result got appended to the context, then what started as a r]]></description><link>https://engineering.fractional.ai/i-wrote-a-profiling-tool-for-agents</link><guid isPermaLink="true">https://engineering.fractional.ai/i-wrote-a-profiling-tool-for-agents</guid><category><![CDATA[llm]]></category><category><![CDATA[observability]]></category><category><![CDATA[profiling]]></category><dc:creator><![CDATA[Ben Kadish]]></dc:creator><pubDate>Thu, 09 Apr 2026 17:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69d4107d40c9cabf447ddc24/f6686adb-c87c-40b5-a9ef-af7fbefda571.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While building an agent for a client, we hit a scaling problem: a coordinator agent had 4-5 subagents each with 15-20 tool calls. Each tool result got appended to the context, then what started as a reasonable prompt frequently ballooned to hundreds of thousands of tokens by the end of a run. Due to this the <a href="https://engineering.fractional.ai/on-the-new-larger-context-windows">large context</a> window was degrading our performance and our eval runtimes.</p>
<p>Figuring out which parts of our system could be pared down wasn't obvious from the logs. We could see the output of each tool call, but we couldn't tell how many tokens it was taking up in the agent's context.</p>
<p>So, inspired by tools like <a href="https://docs.python.org/3/library/profile.html#module-cProfile">cProfile</a>, I built a profiler that shows which parts of the agent system consume the most tokens. For example, here's what it currently shows for the form updater agent:</p>
<img src="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/598247a0-5051-4b0c-a901-8d389cabc0ab.png" alt="" style="display:block;margin:0 auto" />

<p>Input dominates token usage here, making up roughly two-thirds of the total cost. The breakdown is less obvious than it looks because a large portion of the input tokens are cache hits, which are billed at a much lower rate:</p>
<table>
<thead>
<tr>
<th>Token Type</th>
<th>Tokens</th>
<th>Rate (per million)</th>
<th>Cost</th>
</tr>
</thead>
<tbody><tr>
<td>Output</td>
<td>17,000</td>
<td>$15.00</td>
<td>~$0.25</td>
</tr>
<tr>
<td>Input (cached)</td>
<td>230,000</td>
<td>$0.25</td>
<td>~$0.06</td>
</tr>
<tr>
<td>Input (uncached)</td>
<td>160,000</td>
<td>$2.50</td>
<td>~$0.40</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td></td>
<td></td>
<td><strong>~$0.71</strong></td>
</tr>
</tbody></table>
<p>Looking at the input breakdown, user messages account for 78.4% of input tokens, with <code>get_form_outline</code> a distant second at 8.5%. If we want to reduce token usage, we need to find a way to give the agent smaller prompts.</p>
<p>Here are the results for a second agent we fixed using the profiler:</p>
<img src="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/9f0904a3-5619-49de-83fb-a005c9609979.png" alt="" style="display:block;margin:0 auto" />

<p>The <code>get_fields</code> tool consumed almost all of this agent's tokens. To fix that, we updated the tool to support filtering by name or description – so instead of dumping every field into the context, the agent could request only the ones it needed.</p>
<pre><code class="language-python">async def get_fields(
    client: RiskCloudClient,
    scopes: list[Literal["global"] | RiskCloudId],
    string_to_match: str | None = None,
) -&gt; dict[str, Any]:
</code></pre>
<p>Both of these examples show off very different fixes for the same problem. A good reason to use a profiler!</p>
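<p>If you want to build something similar, here's a minimal sketch of the core idea, not the actual profiler: tag every chunk of context with the part of the system that produced it, then sum tokens per source. The <code>count_tokens</code> stand-in and the shape of <code>parts</code> are illustrative assumptions.</p>
<pre><code class="language-python">from collections import Counter

def count_tokens(text: str) -&gt; int:
    # Crude stand-in: swap in your provider's tokenizer for real counts.
    return max(1, len(text) // 4)

def profile_context(parts: list[tuple[str, str]]) -&gt; Counter:
    # parts is a list of (source, text) pairs, e.g. ("system", ...),
    # ("user", ...), ("tool:get_form_outline", ...).
    totals: Counter = Counter()
    for source, text in parts:
        totals[source] += count_tokens(text)
    return totals

# Usage: record a (source, text) pair each time something is appended to
# the agent's context, then print the biggest consumers first:
# for source, tokens in profile_context(parts).most_common():
#     print(source, tokens)
</code></pre>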
]]></content:encoded></item><item><title><![CDATA[Applied AI Digest: Volume 2]]></title><description><![CDATA[March 18 – April 1, 2026

Models & Releases
OpenAI ships GPT 5.4 mini and nano to the API, plus Codex sub-agentsMini and nano variants hit the API alongside GPT 5.3 chat latest. Codex also picks up su]]></description><link>https://engineering.fractional.ai/applied-ai-digest-volume-2</link><guid isPermaLink="true">https://engineering.fractional.ai/applied-ai-digest-volume-2</guid><dc:creator><![CDATA[darian bailey]]></dc:creator><pubDate>Wed, 01 Apr 2026 19:07:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69bade5a8c55d6eefb4062a9/aa027d25-afc6-4381-9a9b-ff8d7cf0b80a.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>March 18 – April 1, 2026</p>
<hr />
<h2>Models &amp; Releases</h2>
<p><a href="http://openai.com/index/introducing-gpt-5-4-mini-and-nano"><strong>OpenAI ships GPT 5.4 mini and nano to the API, plus Codex sub-agents</strong></a><br />Mini and nano variants hit the API alongside GPT 5.3 chat latest. Codex also picks up sub-agents and the new 5.4 mini model — worth re-benchmarking your pipelines.</p>
<p><a href="https://x.com/felixrieseberg/status/2036193240509235452"><strong>Claude Desktop launches general computer use</strong></a><br />Full computer control is now available inside Claude Desktop — not just the API. If you've been waiting to move agentic desktop workflows out of prototype, the door is open.</p>
<p><a href="https://anthropic.skilljar.com/claude-certified-architect-foundations-access-request"><strong>Anthropic introduces a Claude Certified Architect credential</strong></a><br />A formal certification path for Claude architecture. Good signal for teams building credibility around Anthropic's stack.</p>
<p><a href="https://developers.openai.com/codex/changelog"><strong>OpenAI previews Refactory for automated legacy codebase refactoring</strong></a><br />Dedicated tooling for breaking monoliths into services, plus AgentSDK 2.0 (open-source orchestration) landing April 6.</p>
<hr />
<h2>Tools &amp; Libraries</h2>
<p><a href="https://code.claude.com/docs/en/code-review">Good Performance From Claude Code Review</a><br />A pr was reviewed by Claude Code, Codex CLI, a custom review agent, and two humans; none caught the bug. Claude's new Code Review product did. Expensive to run on every commit, but we're overall impressed.</p>
<p><a href="https://linear.app/docs/agents-in-linear"><strong>Linear launches background agents for development workflows</strong></a><br />Another entrant in the background-agent space. Linear's agents work directly within issue management, bridging the gap between ticket and implementation.</p>
<p><a href="https://github.com/finbarr/yolobox"><strong>YoloBox gives AI coding agents true local sandboxing</strong></a><br />Lightweight tool for sandboxing agent execution on your machine. If you're letting agents run free on localhost, this is a guardrail.</p>
<p><a href="https://x.com/patrickc/status/2037190688950161709"><strong>Stripe previews CLI-based service provisioning</strong></a><br />Provision Stripe services directly from the command line. One more step toward fully scriptable infrastructure.</p>
<p><a href="https://cohere.com/blog/transcribe"><strong>Cohere launches Transcribe for audio-to-text pipelines</strong></a><br />A new dedicated transcription product from Cohere, targeting audio processing workflows.</p>
<hr />
<h2>Security</h2>
<p><a href="https://github.com/BerriAI/litellm/issues/24512"><strong>LiteLLM 1.82.8 ships compromised — rotate your API keys now</strong></a><br />A PyPI supply chain attack hit LiteLLM via compromised publishing credentials. Versions 1.82.7 and 1.82.8 exfiltrated environment variables — API keys, SSH keys, everything. The packages were yanked quickly, but if you installed either version, consider all secrets compromised. This also affects downstream packages like DSPy. Practical mitigation: add <code>exclude-newer = "1 week"</code> to your <code>[tool.uv]</code> config to delay pulling freshly published packages.</p>
<p><a href="https://github.blog/news-insights/company-news/updates-to-github-copilot-interaction-data-usage-policy/"><strong>GitHub Copilot starts training on user interaction data by default</strong></a><br />Starting April 24, inputs, outputs, and code snippets from Copilot Free, Pro, and Pro+ users will train GitHub's models unless you opt out. Business and Enterprise tiers are not affected. Check your Privacy settings now.</p>
<hr />
<h2>Research &amp; Reads</h2>
<p><a href="https://arxiv.org/abs/2603.19312"><strong>Yann LeCun's team publishes a stable trained world model</strong></a><br />A world model trained to stability: potentially significant for robotics and manufacturing applications down the line. Worth tracking even if the near-term implications are unclear.</p>
<p><a href="https://dnhkng.github.io/posts/rys/"><strong>Exploring "LLM brain scans" — a technique orthogonal to fine-tuning</strong></a><br />Deep dive into methods for inspecting and steering LLM internals without gradient updates. If you have some understanding of transformer architecture, this is a fascinating read.</p>
<p><a href="https://engineering.fractional.ai/stop-designing-chatbots"><strong>When not to build a chatbot: choosing the right AI interface</strong></a><br />A practical framework for deciding when a chatbot is the wrong UX pattern. A great new blogpost by Nora Kako</p>
<p><a href="https://engineering.fractional.ai/artifact-driven-ai-creation"><strong>Using documentation artifacts to iterate on AI products</strong></a><br />How to use structured documentation as a feedback loop for AI product development. A process-oriented complement to the eval-centric approach.</p>
<p><a href="https://engineering.fractional.ai/using-mcps-with-evals"><strong>Using MCPs with evals for better AI engineering workflows</strong></a><br />Hands-on guide to integrating Model Context Protocol servers into evaluation pipelines. Fills a gap in the current MCP ecosystem documentation.</p>
<hr />
<p><em>Fractional AI · Applied AI Digest</em></p>
]]></content:encoded></item><item><title><![CDATA[Stop Designing Chatbots]]></title><description><![CDATA[In my work at Fractional AI, I get a front-row seat to how companies are actually applying AI — across industries, use cases, and maturity levels. And there's a design mistake I keep seeing that's hol]]></description><link>https://engineering.fractional.ai/stop-designing-chatbots</link><guid isPermaLink="true">https://engineering.fractional.ai/stop-designing-chatbots</guid><dc:creator><![CDATA[Nora Kako]]></dc:creator><pubDate>Mon, 30 Mar 2026 13:58:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69d4107d40c9cabf447ddc24/8cff9a85-0638-41d7-9516-503c5d12a119.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my work at Fractional AI, I get a front-row seat to how companies are actually applying AI — across industries, use cases, and maturity levels. And there's a design mistake I keep seeing that's holding a lot of teams back.</p>
<p>Here's how it goes: someone decides to add AI to their product, so they sketch out a little chat window. Users type questions, the AI responds. Ship it.</p>
<p>It’s so easy! And for a lot of use cases, it's completely wrong.</p>
<h2>The chatbot trap</h2>
<p>Chatbots are genuinely great for some things – customer support, research assistants, general-purpose Q&amp;A – tasks where the back-and-forth is the point. But the moment you're building something with a specific job to do (a workflow, a task, an action) the chatbot mold starts to crack. Design for AI is a new discipline, and when there's no clear standard, it's easy to default to what you know (ahem, ChatGPT). But force it into a chat interface it doesn't belong in, and the capability gets buried under bad UX. You don't handicap a great tool by forcing it into the wrong shape. The right shape comes from designing for the user first, the technology second.</p>
<h2>Three places where "just use a chatbot" is the wrong answer</h2>
<p><strong>1. Complex search queries</strong></p>
<p>Natural language search is genuinely magical. Instead of remembering the exact syntax for a filter, users can just describe what they want: "Show me all open deals in the northeast over $50k that haven't been touched in 30 days." That's a great use of AI.</p>
<p>But for a risk management platform we worked with, using a chatbot to represent their “applications” would have meant showing users a wall of unreadable text in place of something that's fundamentally a rich, structured information architecture: interdependent workflows, hundreds of nested steps, complex dependencies.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69c5742d10e664c5da0cdbb9/1d40b903-55e4-4b9b-824c-67d954369125.png" alt="" style="display:block;margin:0 auto" />

<table style="min-width:25px"><colgroup><col style="min-width:25px"></col></colgroup><tbody><tr><td><p>&gt;&gt; This representation flattens the application’s complexity…and confuses the information architecture. Does this chatbot store applications? Sessions? What am I looking at?</p></td></tr></tbody></table>

<p>The better solution leveraged the existing UI: a DAG-like interface that already represented applications in full fidelity, with all their dependencies, workflows, and complexity laid out visually. Users specify the changes they want in natural language, and the application updates in the interface they already know. The endless clicking and form-filling is gone, but the output stays structured, navigable, and immediately recognizable to the people who work in it every day.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69c5742d10e664c5da0cdbb9/1cecd09f-e9fd-4752-a0b8-cc384a378d57.png" alt="" style="display:block;margin:0 auto" />

<table style="min-width:25px"><colgroup><col style="min-width:25px"></col></colgroup><tbody><tr><td><p>&gt;&gt; Much better!</p></td></tr></tbody></table>

<p><strong>The principle:</strong> Use natural language to capture intent, then design an interface for the output that matches the sophistication of the task.</p>
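<p>As a rough illustration of that split (the field names and schema below are hypothetical, not the risk platform's actual data model), you can have the model emit a structured filter and let the interface you already have render the results:</p>
<pre><code class="language-python">from typing import Literal
from pydantic import BaseModel, Field

class DealFilter(BaseModel):
    # Structured intent extracted from a natural-language query.
    # Field names are hypothetical; map them to your real data model.
    status: Literal["open", "won", "lost"] | None = None
    region: str | None = None
    min_amount: float | None = Field(None, description="Minimum deal size in USD")
    untouched_for_days: int | None = Field(None, description="No activity for at least this many days")

# "Show me all open deals in the northeast over $50k that haven't been
# touched in 30 days" should parse to something like:
example = DealFilter(status="open", region="northeast",
                     min_amount=50_000, untouched_for_days=30)

# Hand DealFilter.model_json_schema() to your model's structured-output or
# tool-calling mode, then feed the parsed filter to the list view, table,
# or DAG your users already know, rather than a chat transcript.
</code></pre>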
<p><strong>2. Task automation in existing UIs</strong></p>
<p>The best design choice is often the one that's invisible – one that makes AI feel like magic.</p>
<p>To connect data sources in Airbyte, an open-source data integration engine, users need to build API connectors, and building them is painful. You dig through inconsistent documentation to configure authentication, pagination, endpoints, and more. It pulls engineers away from higher-value work.</p>
<p>Fractional AI built Airbyte an AI-powered Connector Builder that collapses that process dramatically. The user pastes in an API documentation URL. The AI crawls it, figures out everything it needs, and pre-populates all the fields directly into Airbyte's existing Connector Builder UI. The user reviews and finalizes. Hours of work becomes minutes, and after launch, Airbyte saw a marked increase in connectors being built.</p>
<p>No back-and-forth. The user just shows up to a form that's already mostly filled out.</p>
<p><a class="embed-card" href="https://youtu.be/_xtW0QS890Y?si=LIp1TLD175vSx0zZ">https://youtu.be/_xtW0QS890Y?si=LIp1TLD175vSx0zZ</a></p>

<p><strong>The principle:</strong> Good AI design is invisible — don't build a new interface for it. Make the existing one smarter.</p>
<p><strong>3. Deep Research</strong></p>
<p>We applied this for a healthcare client where clinicians spent most of their day on lookup work: manually reviewing patient diagnosis opportunities, cross-referencing codes in an external database, doing their own research, deciding whether to act — none of it patient care. A chat interface in front of that workflow only solves part of the problem. With the right design choices, an agent can do the thinking and the clicking for you.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69c5742d10e664c5da0cdbb9/81865d82-e693-4cc9-b63d-20ae32e8ec0f.png" alt="" style="display:block;margin:0 auto" />

<p>A much better user experience is one where the agent handles everything upstream: reviewing opportunities, pulling relevant codes, cross-referencing external sources, and surfacing a short list of recommended PDX updates, each with a confidence score and a rationale. When the clinician sees the interface, the knowledge they need is already there.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69c5742d10e664c5da0cdbb9/b0c18d43-f8c5-474a-af64-187c19a660ce.png" alt="" style="display:block;margin:0 auto" />

<p><strong>The principle:</strong> A chatbot makes a hard task conversational. Good design can make it disappear.</p>
<h2>Don't let the chat box be your ceiling</h2>
<p>In each of these cases, the goal isn't AI that's easiest to chat to. It's AI that does things for you and gets out of the way. As product designers, our job is to design the best interface for the task, not the technology (however shiny).</p>
<p>Our path forward is the same one it's always been: center the user. Design is meant to simplify, to ease, to make labor feel less like labor. AI gives us more power to do that than we've ever had. Let's not waste it on chat windows.</p>
]]></content:encoded></item><item><title><![CDATA[Artifact-Driven AI Creation]]></title><description><![CDATA[I usually use AI Agents to write code. I've spent the last few weeks using AI agents to create other documents for writing and beyond. I started by writing a chapter on meatballs - a food that emerged]]></description><link>https://engineering.fractional.ai/artifact-driven-ai-creation</link><guid isPermaLink="true">https://engineering.fractional.ai/artifact-driven-ai-creation</guid><category><![CDATA[AI]]></category><category><![CDATA[writing]]></category><category><![CDATA[workflows]]></category><dc:creator><![CDATA[Abrey Mann]]></dc:creator><pubDate>Fri, 27 Mar 2026 22:09:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69d4107d40c9cabf447ddc24/df55bb7c-091c-4470-8595-6b5e349360db.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I usually use AI Agents to write code. I've spent the last few weeks using AI agents to create other documents for writing and beyond. I started by writing a chapter on meatballs - a food that emerged through history across great distance and time. That turned into a book, a slide deck, and invaluable learnings for creating with AI.</p>
<p>If you've ever iterated on the same document multiple times, you've probably experienced a similar frustration. You spent an hour getting it to use exactly the right voice. The structure is tight, the narrative flows, the tone is close enough to your own, so it doesn't sound like AI. You close the session.</p>
<p>The next morning you open a new one, ask it to revise the introduction, and you get back something that sounds like it was written by a different person. Because, effectively, it was.</p>
<p>This is a common failure mode in AI-assisted content creation. The outputs aren't bad. They just don't remember anything. Every session starts from zero. The AI has no memory of the decisions you made, the tone you established, or the direction you chose. So it guesses. And you spend the first twenty minutes of every session re-establishing context that should already be there.</p>
<p>There's a fix. It's not a better prompt. It's a better project structure.</p>
<hr />
<h2>Two Structural Mistakes</h2>
<p>Most AI workflow failures (writing, presentations, docs, anything) come from one of two problems:</p>
<p><strong>Entanglement.</strong> Your content and your formatting live in the same file. Think of a long Google Doc where the prose, the section headers, the footnotes, and the inline formatting are all fused together. When you ask an AI to "make the opening punchier," it has to navigate the entire structure and risks breaking something unrelated. The bigger the file, the noisier the edit.</p>
<p><strong>Amnesia.</strong> Nothing persists between sessions. The AI produced a great draft, but the reasoning behind it (why you cut the second section, why you chose that opening, what tone the stakeholder preferred) lives in a conversation thread that's already gone. Next session, you relitigate all of it.</p>
<p>These aren't AI problems. They're information architecture problems. And they have known solutions.</p>
<hr />
<h2>The Fix: Artifact-Based Workflows</h2>
<p>The core idea is simple: give the AI a project to read, not just a prompt to respond to.</p>
<p>An artifact is just a file with a job. A brief holds your intent. A decisions file remembers what you've tried. A draft reflects the current best version of the work. Instead of opening a session and saying "write me a blog post about X," you open a session where the AI reads these artifacts and picks up exactly where the last session left off.</p>
<p>This is spec-driven development applied to content. Software teams don't start coding without a spec. They don't throw away the architecture doc between sprints. The same discipline works for any iterative creation.</p>
<p><strong>Three moves make this work:</strong></p>
<p><strong>1. Separate content from rendering.</strong> Your document's substance (the arguments, structure, key points, supporting evidence) lives in a content file (Markdown, YAML, whatever suits it). The formatted output is produced from it. When you want to iterate on what the piece <em>says</em>, you edit the content file. When you want to iterate on how it <em>looks</em>, you change the formatting separately. The AI never has to wade through presentation markup to sharpen an argument.</p>
<p><strong>2. Make a brief the source of truth.</strong> Before you write anything, write a brief: audience, angle, tone, structure, what to include, what to exclude. This is your spec. The AI reads it at the start of every session. It doesn't guess your audience. It doesn't invent a tone. It reads the brief and writes accordingly. The brief is a living document. It gets sharper as you learn what works. And yes, the AI can help you write it: start with a rough description of what you want to create, and ask the AI to draft the brief for you. Refine from there.</p>
<p><strong>3. Capture decisions, not just outputs.</strong> When you decide to cut a section, log why. When you choose a specific framing, note the alternatives you rejected. This decisions file is the AI's institutional memory. Without it, the AI will suggest the same cut section next session, or revert to a framing you already tried and abandoned. You don't have to log these manually. Tell the AI to update the decisions file when you make a call, and it will.</p>
<hr />
<h2>What This Looks Like</h2>
<p>This post was written using exactly this approach. Each project gets its own folder. Here's this one:</p>
<pre><code class="language-plaintext">blog-post/
  brief.md              ← audience, angle, tone, structure
  draft.md              ← the working post (what you're reading now)
  decisions.md          ← editorial calls with rationale
  source-material.md    ← the raw ideas this post draws from
  readme-from-gist.md   ← the .md linked at the bottom of this article
</code></pre>
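<p>To make it concrete, a starter brief and decisions file can be this small (the contents below are invented for illustration; adapt the headings to your project):</p>
<pre><code class="language-plaintext"># brief.md
Audience: engineers who ship AI features but don't identify as writers
Angle: project structure beats clever prompting for multi-session work
Tone: direct, first person, no hype
Must include: the three moves; one worked folder example
Must avoid: vendor comparisons; listicle framing

# decisions.md
- Cut the slide-deck example; the document workflow is the clearer case.
- Rejected the "prompt library" framing; this is about artifacts, not prompts.
</code></pre>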
<p>Every session starts by reading the brief and decisions. The AI doesn't guess the voice. It reads the spec.</p>
<p>What does a session look like in practice? You open a new conversation, point the AI at the project folder, and say "read the brief and decisions, then help me tighten the opening section." The AI reads your files, understands where the project stands, and works from there. No preamble, no re-explaining.</p>
<p>The concrete details matter. When we decided to focus on documents rather than slide decks as the primary example, that went in the decisions log with the rationale. When we cut an overly academic phrase, that went in too. Without this file, the next session might reintroduce the slide deck framing or re-suggest the phrasing we already rejected.</p>
<p>The <strong>source material</strong> is a fixed snapshot of the design principles this post draws from. Where the brief is prescriptive (what to do), source material is informational (what to draw from). It's the anchor that keeps the AI grounded in your original ideas even as the draft evolves.</p>
<p>The structure scales. For a whitepaper, add a <code>sources.md</code> for references and an <code>outline.md</code> that locks the section flow before drafting begins. You don't need to reconfigure anything. Just add the files to the folder and tell the AI they're there.</p>
<p>For a novel, it becomes essential. Character sheets, plot outlines, chapter summaries, a voice guide, a continuity log. Without this structure, the AI drifts. Characters change personality between sessions, subplots get dropped, the voice shifts chapter to chapter. With it, session forty picks up the same threads that session one established. The longer the project, the more the structure pays for itself.</p>
<hr />
<h2>Artifacts That Improve Over Time</h2>
<p>Not every artifact ages the same way.</p>
<p>The <strong>brief</strong> sharpens. Early on it's rough: "blog post about AI workflows, casual tone." By session five, it specifies the audience's technical level, the points that must land, and the phrasings to avoid. You learn what constraints actually matter by writing against them.</p>
<p>The <strong>decisions file</strong> accumulates. Each entry is a judgment call you never have to relitigate. By session ten, it's a dense record of what you've tried, what worked, and what to stop suggesting. It's institutional memory for a project of one.</p>
<p>The <strong>source material</strong> stays fixed. That's its job. It keeps the AI grounded in your original ideas even as the draft evolves around them.</p>
<p>The <strong>draft</strong> converges. Each pass is informed by a sharper brief, a longer decisions log, and the same stable source material. The tenth revision isn't starting over. It's building on everything the other artifacts captured.</p>
<p>The test is simple: <em>can a new session, reading only the project files with no conversation history, pick up where the last session left off without re-making the mistakes of the sessions before?</em></p>
<p>If yes, your artifacts are working. If no, you're still sprinting.</p>
<hr />
<h2>Getting Started</h2>
<p>You don't need special tooling to start. Create a folder. Add a brief. Start logging decisions. The structure does the work.</p>
<p>A practical starting point for any content project:</p>
<ol>
<li><p><strong>Write the brief first.</strong> Before any drafting. Audience, angle, tone, structure, constraints. This is your spec. Not sure where to start? Ask the AI to interview you about the project and draft a brief from your answers.</p>
</li>
<li><p><strong>Keep your content separate from its final format.</strong> The words and structure should live in their own file, independent of whatever formatting or presentation layer comes later.</p>
</li>
<li><p><strong>Log decisions as you make them.</strong> Not in conversation, in a file that persists. Future sessions will thank you.</p>
</li>
<li><p><strong>End each session with a deposit.</strong> Ask: what does the project know now that it didn't know when this session started? If the answer is nothing, the session was wasted.</p>
</li>
</ol>
<p>Drop the md file linked below into an empty project folder. Describe to the agent what you want to build and ask it to review the document before recommending an artifact-based workflow. Within two or three sessions, you'll feel the difference: less re-explaining, less drift, more time spent on the work that actually matters. By session ten, you won't go back.</p>
<p><a href="https://gist.github.com/mann-abe/3e456c193652fecd21069239034b6943">https://gist.github.com/mann-abe/3e456c193652fecd21069239034b6943</a></p>
]]></content:encoded></item><item><title><![CDATA[Using MCPs with Evals ]]></title><description><![CDATA[When I joined Fractional AI, one of the first lessons I learned was that evaluations drive every major decision in our projects, from first prototype through post-launch improvement. In our initial cu]]></description><link>https://engineering.fractional.ai/using-mcps-with-evals</link><guid isPermaLink="true">https://engineering.fractional.ai/using-mcps-with-evals</guid><dc:creator><![CDATA[Alexandra Spencer-Wong]]></dc:creator><pubDate>Thu, 19 Mar 2026 21:46:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69d4107d40c9cabf447ddc24/4494131e-550e-4879-8fcd-18aa4d73e165.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I joined Fractional AI, one of the first lessons I learned was that evaluations drive every major decision in our projects, from first prototype through post-launch improvement. In our initial customer meetings, we position eval-driven development as our <a href="https://www.fractional.ai/white-papers/the-fractional-ai-way">core methodology</a> and maintain that commitment throughout the entire build. It's how we demonstrate rigor: everything has a quantifiable evaluation we use to justify changes, calibrate performance, and learn what's working.</p>
<p>As you can imagine, I was <em>so</em> excited when my first project reached the key milestone of establishing an initial evaluations suite. Unfortunately, excitement turned to confusion, and frankly, dismay, when I opened up the UI for the first time.</p>
<img src="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/b18d7bc9-06c5-4d83-bc3b-d53f94e44b71.png" alt="" style="display:block;margin:0 auto" />

<p>I couldn't tell what numbers mattered, what the baseline was, or what I was even supposed to take away at first glance. Everything felt like it was competing for attention, yet nothing jumped out as actionable.</p>
<p>As a Forward Deployed Product Manager, I own the customer experience, which includes keeping our clients apprised of improvements and progress so they could react, ask questions, and discuss relative tradeoffs of potential improvements. If I couldn’t figure out where we were with our evals, how could I deliver on our promise to the customer?</p>
<p>In the early days, I leaned heavily on our engineering team to help parse what I was seeing in the UI, sending screenshots and asking which values were the right ones to present, or what the nuances were between two runs. I was spending valuable time verifying I had the right values instead of thinking about what they meant for our build.</p>
<img src="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/a8af3ae1-f76b-4af6-ba3b-5bbda605f76b.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Hooray for the W&amp;B MCP!</strong></p>
<p>When our engineer told me about the W&amp;B MCP, it truly changed my workflow. Instead of navigating the interface and asking myself or the team "is this the right field?", I could query directly in natural language: "Show me the color scores over time.”</p>
<img src="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/8c4971b6-ba77-479a-93c6-3577b3fbbd49.png" alt="" style="display:block;margin:0 auto" />

<p>Better still, follow-up Q&amp;A was efficient, keeping my momentum as I investigated trends and dove into the data.</p>
<img src="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/c85407c1-96cd-43a4-9bbf-e0ce62c96065.png" alt="" style="display:block;margin:0 auto" />

<p>I joined Fractional AI to learn what it means to be an AI PM, and the majority of that includes building GenAI solutions. But we also emphasize AI-native development: using AI to build products, not just building AI products. This was a tangible example of what that looks like for a PM and the ways AI can enable our work. I still asked the questions, interpreted results, and crafted the narrative for the customers, but I no longer experienced friction in accessing the data. I was able to spend that saved time thinking about the meaning behind the numbers and how to drive the most value for our customer. The shift from "where is this data?" to "what does this data mean?" is exactly the kind of efficiency gain that pays back directly into customer outcomes.</p>
]]></content:encoded></item><item><title><![CDATA[Applied AI Digest: Volume 1]]></title><description><![CDATA[March 4 – 18, 2026

Models & Releases
OpenAI launches GPT-5-4, its new flagship modelLatest model drop from OpenAI. Expect updated benchmarks and production evals soon.
Google releases Gemini Embeddin]]></description><link>https://engineering.fractional.ai/applied-ai-digest-volume-1</link><guid isPermaLink="true">https://engineering.fractional.ai/applied-ai-digest-volume-1</guid><dc:creator><![CDATA[darian bailey]]></dc:creator><pubDate>Wed, 18 Mar 2026 17:47:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/699e2f2d1ddf529956e43ef0/8f70d96f-ac65-4090-9556-56576e13b707.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>March 4 – 18, 2026</p>
<hr />
<h2>Models &amp; Releases</h2>
<p><a href="https://openai.com/index/introducing-gpt-5-4/"><strong>OpenAI launches GPT-5-4, its new flagship model</strong></a><br />Latest model drop from OpenAI. Expect updated benchmarks and production evals soon.</p>
<p><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/"><strong>Google releases Gemini Embedding 2 for RAG and semantic search</strong></a><br />Next-gen embedding model targeting retrieval-augmented generation and vector search pipelines.</p>
<p><a href="https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/"><strong>Nvidia open-sources a Mamba-Transformer MoE for agentic reasoning</strong></a><br />Nemotron-3 Super combines Mamba's linear-time inference with Transformer attention and MoE routing for long-context workloads.</p>
<hr />
<h2>Tools &amp; Libraries</h2>
<p><a href="https://github.com/rtk-ai/rtk"><strong>rtk shrinks command output before it reaches your LLM context</strong></a><br />Single Rust binary that filters and compresses CLI output. Zero deps, &lt;10ms overhead. Multiple engineers vouching for it.</p>
<p><a href="https://github.com/openai/symphony"><strong>OpenAI open-sources Symphony for multi-agent orchestration</strong></a><br />Framework for coordinating multi-agent pipelines. Sparked a 22-reply thread internally.</p>
<p><a href="https://github.com/zilliztech/claude-context"><strong>claude-context gives Claude Code awareness of unfamiliar libraries</strong></a><br />By Zilliz. Eliminates the repetitive introspection calls Claude makes when working with niche packages.</p>
<p><a href="https://cursor.com/blog/automations"><strong>Cursor adds background automations to the IDE</strong></a><br />Background tasks, iterative edits, and tighter AI-assisted development loops — all native to the editor.</p>
<p><a href="https://docs.sutro.sh/sutro-functions"><strong>Sutro lets you build custom classifiers without prompt engineering</strong></a><br />Declarative, iterative interface for task-specific classifiers and extractors. Free while they gather feedback.</p>
<p><a href="https://www.braintrust.dev/blog/trace-keynote"><strong>BrainTrust ships trace clustering, a CLI, and a universal LLM gateway</strong></a><br />Topics auto-clusters your traces; CLI runs evals from terminal; Gateway routes to any provider through one API.</p>
<p><a href="https://github.com/hamelsmu/evals-skills"><strong>Hamel Husain's Claude Code skills for writing and auditing evals</strong></a><br />"Advice you'll roll your eyes at, but none of it is unfounded." Worth trying for eval pipelines.</p>
<hr />
<h2>Research &amp; Reads</h2>
<p><a href="https://percepta.ai/blog/can-llms-be-computers"><strong>Percepta explores whether LLMs can function as general-purpose computers</strong></a><br />Provocative piece on LLMs as computation substrates. "So wild it's unclear if it's real."</p>
<p><a href="https://moglang.org/#mog-a-programming-language-for-ai-agents"><strong>Mog is a new programming language built specifically for AI agents</strong></a><br />Designed ground-up for agent orchestration. "Every one of us has thought about building this."</p>
<p><a href="https://steve-yegge.medium.com/welcome-to-the-wasteland-a-thousand-gas-towns-a5eb9bc8dc1f"><strong>Steve Yegge on the chaos of competing AI tools and what engineers should do about it</strong></a><br />Characteristically sharp essay on the fragmented AI tooling landscape and how to navigate it.</p>
<p><a href="https://marginlab.ai/trackers/claude-code/"><strong>Marginlab runs daily evals to track whether Claude Code is getting better or worse</strong></a><br />Answers "is the model dumber today?" with data instead of vibes.</p>
<p><a href="https://x.com/daniel_mac8/status/2032282196673708042"><strong>A theory on how MCP and tool-call patterns should evolve</strong></a><br />Gaining traction in the community. Multiple engineers flagged this as worth reading.</p>
<hr />
<p><em>Fractional AI · Applied AI Digest</em></p>
]]></content:encoded></item><item><title><![CDATA[The "Quick Fix" That Works: Runtime Patching in Python]]></title><description><![CDATA[Motivation
We wanted to generate demo videos with the python library Supervision but they wouldn't play in Chrome. There was a one-line fix buried inside the library -- but the public API gave us no w]]></description><link>https://engineering.fractional.ai/the-quick-fix-that-works-runtime-patching-in-python</link><guid isPermaLink="true">https://engineering.fractional.ai/the-quick-fix-that-works-runtime-patching-in-python</guid><category><![CDATA[Python]]></category><dc:creator><![CDATA[Hill Ma]]></dc:creator><pubDate>Wed, 25 Feb 2026 21:05:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/8645714b-7aed-46b0-a64c-10395f20679b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>Motivation</strong></h2>
<p>We wanted to generate demo videos with the Python library Supervision, but they wouldn't play in Chrome. There was a one-line fix buried inside the library -- but the public API gave us no way to reach it.</p>
<p>You have probably been in the same situation: a library does 99.9% of what you want, but there is one thing you cannot work around using its public APIs. Switching to a different library and rewriting your code is too much work. What do you do?</p>
<h2><strong>Runtime Patching Comes to the Rescue</strong></h2>
<p>Since Python is a dynamic language, it's possible to change the behavior of a function, class, or module at runtime.</p>
<p>This is known as runtime patching or monkey patching.</p>
<h2><strong>Supervision</strong></h2>
<p><a href="https://supervision.roboflow.com/">Supervision</a> by Roboflow is a helper library for computer vision tasks.</p>
<p>One feature I particularly like is the ability to create demo videos with just a few lines of code:</p>
<pre><code class="language-python">import numpy as np
import supervision as sv
from rfdetr import RFDETRNano

model = RFDETRNano()
model.optimize_for_inference()
box_annotator = sv.BoxAnnotator()

def callback(frame: np.ndarray, _: int) -&gt; np.ndarray:
    detections = model.predict(frame)
    return box_annotator.annotate(frame.copy(), detections=detections)

sv.process_video(
    source_path="input.mp4",
    target_path="result.mp4",
    callback=callback
)
</code></pre>
<p>There is one small issue though: the generated video cannot be played in Chrome.</p>
<p><code>chrome://media-internals</code> showed that the video used an unsupported codec.</p>
<p>Digging into the source code, we can see where the <a href="https://github.com/roboflow/supervision/blob/0b4cc5ff5bc6dcfaee125fba32ddbd9bfd32b011/src/supervision/utils/video.py#L102">OpenCV video writer is created</a>. It uses the FourCC code "mp4v" by default, but what we really want is "avc1" -- I will explain why in a bit.</p>
<p>Here is the problem: the output video is written by a <code>VideoSink</code> instance. While <code>VideoSink</code> accepts a <code>codec</code> argument, there's no way to pass it through the <code>sv.process_video()</code> API.</p>
<p>What if we patch the <code>VideoSink</code> class so that it defaults to "avc1" instead? Well, below is one way to do just that:</p>
<pre><code class="language-python">def patch_sv() -&gt; None:
    orig_init = sv.VideoSink.__init__

    def patched_init(self, *args, **kwargs):
        kwargs.setdefault("codec", "avc1")
        return orig_init(self, *args, **kwargs)

    sv.VideoSink.__init__ = patched_init  # type: ignore
</code></pre>
<p>Note the use of <code>setdefault</code> -- this changes the default codec without preventing callers from choosing a different one explicitly.</p>
<p>As long as we call <code>patch_sv()</code> before <code>sv.process_video()</code>, the generated videos will be playable in Chrome. No more re-encoding with <code>ffmpeg</code>!</p>
<details>
<summary>Why "avc1"?</summary>
<p>You might be wondering why "avc1" specifically solves the problem.</p><p>By specifying "avc1", we are requesting OpenCV to encode using H.264 (Advanced Video Coding). Think of H.264 as the JPEG of video codecs: not the latest, but the most widely supported.</p><p>This ensures that the video can be played in web browsers like Chrome.</p><p>The previous choice, "mp4v", encodes using the older MPEG-4 Part 2 codec, which web browsers never adopted.</p><p></p>
</details>

<h2><strong>Putting It All Together</strong></h2>
<p>Here is the complete working example:</p>
<pre><code class="language-python">import numpy as np
import supervision as sv
from rfdetr import RFDETRNano

def patch_sv() -&gt; None:
    orig_init = sv.VideoSink.__init__

    def patched_init(self, *args, **kwargs):
        kwargs.setdefault("codec", "avc1")
        return orig_init(self, *args, **kwargs)

    sv.VideoSink.__init__ = patched_init  # type: ignore

model = RFDETRNano()
model.optimize_for_inference()
box_annotator = sv.BoxAnnotator()

def callback(frame: np.ndarray, _: int) -&gt; np.ndarray:
    detections = model.predict(frame)
    return box_annotator.annotate(frame.copy(), detections=detections)

patch_sv()
sv.process_video(
    source_path="input.mp4",
    target_path="result.mp4",
    callback=callback
)
</code></pre>
<h2><strong>Recap</strong></h2>
<p>Runtime patching is a legitimate tool in the pragmatic Python developer's toolbox. It lets us reach past public APIs to fix issues quickly, without forking a library or waiting for an upstream release. The same technique is also widely used in software testing, where it underpins the mocking of library dependencies.</p>
<p>That said, patches are inherently fragile -- a library update can move or rename the internals you depend on. A few ways to manage that risk:</p>
<ul>
<li><p><strong>Pin the library version</strong> so that updates are intentional and you can re-test your patch.</p>
</li>
<li><p><strong>Contribute the fix upstream.</strong> A pull request or issue is the best long-term solution -- it removes the need for the patch entirely.</p>
</li>
<li><p><strong>Keep patches small and isolated.</strong> A single, well-documented patch function is easy to find and remove later.</p>
</li>
</ul>
<p>My recommendation is to reserve runtime patching for prototyping, research code, or code that is not on the critical path of a production application. But when the situation calls for it, don't be afraid to use it.</p>
]]></content:encoded></item><item><title><![CDATA[If I could turn back time]]></title><description><![CDATA[Have you ever exclaimed after a few hours engaged in some fruitless activity, "Well, that's X hours of my life I'll never get back"? Or, when in a reflective mood, "I wish I knew this when I was 16!"?]]></description><link>https://engineering.fractional.ai/if-i-could-turn-back-time</link><guid isPermaLink="true">https://engineering.fractional.ai/if-i-could-turn-back-time</guid><category><![CDATA[llm]]></category><category><![CDATA[context]]></category><category><![CDATA[time travel]]></category><dc:creator><![CDATA[Gleb Frank]]></dc:creator><pubDate>Thu, 19 Feb 2026 18:40:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/699e2f2d1ddf529956e43ef0/383d6ff7-d45d-49a6-845b-c0c980e0fa5f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever exclaimed after a few hours engaged in some fruitless activity, "Well, that's X hours of my life I'll never get back"? Or, when in a reflective mood, "I wish I knew this when I was 16!"?</p>
<p>Well, if you're an LLM, this is totally possible.</p>
<p>On one of our projects last year, the agent we were building faced a task that involved examining a significant number of documents, only some of which would turn out to be relevant. In our first attempt, the agent would select a document based on an initial estimate of its probable relevance.</p>
<p>Here is what that looks like:</p>
<img src="https://i.postimg.cc/CMX1zcgH/picture1.png" alt="" style="display:block;margin:0 auto" />

<p>Since these documents could be big, we were quickly running out of usable space in the LLM's context window.</p>
<p>But some of the documents provided nothing useful at all, and others only a small amount of useful info. What if the tokens spent on ingesting a document could be returned to the agent’s context, to be used for further work? That is what we did: when we provided a document to the agent, we asked it for a summary of the <em>relevant information</em> it was able to extract. Then we replaced the previous message with that summary and had the agent continue the conversation.</p>
<p>As the lifetime of the agent is measured in tokens, this was like going back in time: the agent proceeded as if it hadn’t spent all those tokens on processing the input.</p>
<img src="https://i.postimg.cc/ZqQRpHNZ/picture2.png" alt="" style="display:block;margin:0 auto" />

<p>This allowed us to keep the context relatively clean, while still getting useful information from the related documents.</p>
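<p>Here is a minimal sketch of the mechanic, assuming a plain list of chat messages; the <code>llm</code> callable is a stand-in for however you invoke your model, not a specific SDK:</p>
<pre><code class="language-python">def read_then_rewind(messages: list[dict], document: str, llm) -&gt; list[dict]:
    # 1. Spend tokens reading the document and extracting what matters.
    probe = messages + [{
        "role": "user",
        "content": f"Document:\n{document}\n\nSummarize only the information relevant to the task.",
    }]
    summary = llm(probe)

    # 2. Time travel: continue from the original history plus the short
    #    summary, as if the full document had never entered the context.
    return messages + [{
        "role": "user",
        "content": f"Relevant findings from the document (full text omitted): {summary}",
    }]
</code></pre>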
<p>This approach can be viewed as an alternative to dispatching a separate sub-agent to handle the document. One advantage of doing it via time travel is that there is no need to generate an explanation for a sub-agent. This can help avoid misunderstandings, where either the main agent fails to convey some important detail to the sub-agent, or the sub-agent fails to understand or follow the main agent's instructions.</p>
<p>As another example, when developing with Claude Code or similar tools, the time travel approach is available to you as well, through the "Rewind" feature. Activated by a double press of the Escape key in Claude Code, it allows you to rewind the conversation to a previous state. (You can also choose to restore the workspace to an earlier state at the same time, but here we're not doing that.) So if, in the middle of a development session, the agent suddenly goes off on a tangent and spends half its context on a wild goose chase or on debugging some unrelated blocking issue, you can rewind to a previous state and continue from there. Before rewinding, you can ask the agent to summarize what it has learned, and copy/paste that into a new message after the rewind, like a message in a bottle sent from the future.</p>
<p>Sometimes time travel can be your friend.</p>
<p>Some related reading:</p>
<p><a href="https://arxiv.org/pdf/2512.24601">Zhang et al, 2026</a> explores recursive exploration of large contexts.</p>
<p><a href="https://ucbepic.github.io/docetl/examples/split-gather/">The split/gather feature</a> of DocETL system developed in UC Berkeley uses a similar trick, accumulating rolling summaries of a large array of documents.</p>
]]></content:encoded></item><item><title><![CDATA[Your Evals Have a Rotten Tomatoes Problem]]></title><description><![CDATA[You push a change to a prompt and your eval score drops from 0.91 to 0.84. Something got worse, but the score doesn’t tell you what. So you start re-running the pipeline, tweaking your inputs, pouring]]></description><link>https://engineering.fractional.ai/your-evals-have-a-rotten-tomatoes-problem</link><guid isPermaLink="true">https://engineering.fractional.ai/your-evals-have-a-rotten-tomatoes-problem</guid><category><![CDATA[evals]]></category><category><![CDATA[llm]]></category><category><![CDATA[Movies]]></category><dc:creator><![CDATA[Dan Girellini]]></dc:creator><pubDate>Thu, 12 Feb 2026 21:30:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/699e2f2d1ddf529956e43ef0/7ccccc7d-adb5-4af3-8491-24f0c5c20725.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You push a change to a prompt and your eval score drops from 0.91 to 0.84. Something got worse, but the score doesn’t tell you what. So you start re-running the pipeline, tweaking your inputs, pouring over outputs, trying to figure out which part of the system regressed and why. An hour later you’re no closer to understanding and you’re getting more frustrated. <strong>The problem isn’t that evals aren’t the right solution, the problem is how you’ve designed them.</strong></p>
<h2>Think Movies, Not Math Tests</h2>
<p>Rotten Tomatoes will tell you a movie is 94% fresh. Great! But…what does that actually mean? It's the percentage of critics who gave a positive review. That's it. A movie that every critic thought was pretty good and a movie that most critics thought was a masterpiece look identical at 94%. The number is real, but it flattens out everything interesting–including whether <em>you'll</em> actually like it.</p>
<p>Many teams write evals that work exactly like this. One question, one score, everything interesting about the output compressed into a single number.</p>
<p>A more useful approach is to evaluate the movie along specific dimensions, each scored on a simple scale:</p>
<ul>
<li><p>Romance</p>
</li>
<li><p>Humor</p>
</li>
<li><p>Character development</p>
</li>
<li><p>Action</p>
</li>
<li><p>Realism</p>
</li>
<li><p>Scary scenes</p>
</li>
<li><p>Adult content</p>
</li>
</ul>
<p>Each of these is easier to score consistently on its own. You don't need a deep philosophical framework to answer "how much romance is in this movie?" the way you do to answer "is this movie good?" The questions are more decomposable, and with even a lightweight rubric, reasonable people will mostly agree on the answers.</p>
<p>And now you can build on this. Once you have these individual scores, you can compose them into more sophisticated judgments. If you believe that romance and jump scares don't mix well, you might rate a movie lower even if it scores high on both individually. If you're recommending a movie for a family movie night, you care about a very different profile than if you're recommending one for a date night. The dimensional scores are the raw material; the judgment about "good" is a function you build on top of them, tuned to the context.</p>
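<p>To make "a function you build on top of them" concrete, here's a toy composition; the dimensions and weights are obviously invented:</p>
<pre><code class="language-python">def date_night_score(scores: dict[str, float]) -&gt; float:
    # Compose per-dimension scores (each 0 to 1) into a context-specific judgment.
    base = 0.6 * scores["romance"] + 0.4 * scores["humor"]
    # A belief like "romance and jump scares don't mix" becomes an explicit penalty.
    penalty = 0.5 * min(scores["romance"], scores["scary_scenes"])
    return max(0.0, base - penalty)

def family_night_score(scores: dict[str, float]) -&gt; float:
    # Same raw material, different function: adult content is a hard discount here.
    return (0.5 * scores["humor"] + 0.5 * scores["action"]) * (1 - scores["adult_content"])
</code></pre>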
<p>But there's a less obvious advantage that might be the most important one: the process of designing these dimensional evals forces you to deeply understand the problem you're solving. You can't write an eval for "realism" until you've decided what realism means in the context of this particular movie. You can't score "character development" until you've defined what that looks like. Suddenly you're having conversations about criteria you hadn't fully thought through before. <strong>The decomposition work that goes into building good evals doubles as problem discovery</strong>. You come out the other side with sharper requirements, better-defined ground truth, and a shared vocabulary with your team about what you're actually trying to assess.</p>
<h2>Bring This Back to LLMs</h2>
<p>The same principle applies directly to evaluating LLM outputs. Instead of asking "is this response correct?", identify the specific qualities that matter for your use case and evaluate each one separately.</p>
<p>Depending on what you're building, those dimensions might include things like: schema compliance, factual grounding against source documents, tone and register, completeness of the response, conciseness, whether the model hallucinated entities that don't appear in the input, or whether it followed specific formatting instructions.</p>
<p>Here's a concrete example. Say you have an LLM call that takes a customer support ticket and produces a structured summary with a category, a priority level, and a proposed response. Instead of writing one eval that asks "is this summary good?", write three:</p>
<ol>
<li><p><strong>Category accuracy:</strong> Does the assigned category match a known-correct label? Assuming you're building labeled data as you go (and you should be), this is deterministic. No LLM judge needed. Score: 0 or 1.</p>
</li>
<li><p><strong>Priority calibration</strong>: Is the priority level within one step of the expected value? Also deterministic. Score: 0, 0.5, or 1.</p>
</li>
<li><p><strong>Response relevance:</strong> Does the proposed response address the actual issue in the ticket? This one is harder and might require an LLM judge, but it's a focused, well-scoped question. Even here, you can define a simple rubric: does it address the stated issue, does it propose a correct next step, does it avoid inventing policy that doesn't exist?</p>
</li>
</ol>
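<p>To make that concrete, here's a minimal sketch of what those three checks might look like in code. The names and rubric are illustrative, not from a real project:</p>
<pre><code class="lang-python"># Hypothetical scorers for the support-ticket example above.

def category_accuracy(predicted_category, expected_category):
    # Deterministic: exact match against the known-correct label. Score: 0 or 1.
    return 1.0 if predicted_category == expected_category else 0.0

def priority_calibration(predicted_priority, expected_priority):
    # Deterministic: full credit for exact, half credit for one step off.
    distance = abs(predicted_priority - expected_priority)
    return {0: 1.0, 1: 0.5}.get(distance, 0.0)

RELEVANCE_RUBRIC = (
    "Score the proposed response against the ticket. "
    "Does it address the stated issue? Does it propose a correct next step? "
    "Does it avoid inventing policy that doesn't exist? Answer with a score from 0 to 1."
)

def response_relevance(ticket, proposed_response, llm_judge):
    # The one dimension that may need an LLM judge, scoped to a narrow rubric.
    # `llm_judge` is a placeholder for whatever judge call your stack provides.
    return llm_judge(rubric=RELEVANCE_RUBRIC, ticket=ticket, response=proposed_response)
</code></pre>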
<p>Now when something goes wrong, you know <em>what</em> went wrong. Say your category accuracy dropped this week. That's a very different problem than your response relevance dropping, and it points you toward a very different fix. A single "correctness" score would have told you nothing about where to look.</p>
<p>And just like with movies, the process of designing these evals is where a lot of the real work happens. You can't write an eval for "response relevance" until you've defined what a relevant response looks like for each ticket category. This means sitting down with your client and asking questions they probably haven't considered yet. What counts as relevant for a billing dispute versus a technical outage? How complete does the response need to be? The act of decomposing the problem into evaluable dimensions forces you to develop sharper requirements and targeted ground truth, and it builds a shared understanding of the problem that pays off long after the evals are written.</p>
<h2>Shrink-Wrap Your LLM Calls</h2>
<p>There's a related principle that compounds the value of dimensional evaluation: get your evals as close to the individual LLM call as possible.</p>
<p>Most LLM-powered systems aren't a single call–they're a pipeline. You might have one call that extracts information, another that reasons over it, and a third that generates a final output. When you write an eval that only checks the final output, you've created the equivalent of an integration test with no unit tests. When it fails, you're left guessing about which stage broke.</p>
<p>If you think of your evals like shrink wrap around each LLM call, you gain the ability to notice and localize problems that would otherwise be invisible. The most dangerous issues in LLM systems aren't the dramatic failures–those are obvious and get fixed quickly. The dangerous ones are slow degradations: a model update that makes your extraction step slightly less precise, a prompt change that subtly shifts tone in a way that compounds downstream. Without granular, dimensional evals close to each call, these small regressions slip through. You notice the system getting worse over time but can't pinpoint why, or worse, the degradation is small enough that it never feels worth investigating. Until it is. Tight evals make these trends visible early, and because they're scoped to a specific call and a specific dimension, they tell you exactly where to look.</p>
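<p>As a minimal sketch of what that shrink wrap can look like (toy checks, not a real eval harness), keep a small suite of dimensional checks per stage and run each suite against that stage's output alone:</p>
<pre><code class="lang-python"># Toy per-stage eval suites: each check is scoped to one LLM call, so a
# regression shows up against a specific stage and a specific dimension.
extraction_evals = {
    "schema_valid": lambda out: isinstance(out, dict) and "entities" in out,
    "no_empty_entities": lambda out: all(e.strip() for e in out.get("entities", [])),
}

generation_evals = {
    "mentions_ticket": lambda out: "ticket" in out.lower(),
    "within_length_budget": lambda out: len(out.split()) &lt;= 200,
}

def run_stage_evals(stage_name, evals, output):
    # Returns a {check_name: 0.0 or 1.0} profile for one stage's output.
    scores = {name: float(check(output)) for name, check in evals.items()}
    print(stage_name, scores)
    return scores

# Toy usage with hard-coded outputs standing in for real LLM calls.
run_stage_evals("extraction", extraction_evals, {"entities": ["Acme Corp", "billing dispute"]})
run_stage_evals("generation", generation_evals, "We reopened ticket 4821 and issued the refund.")
</code></pre>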
<p>This doesn't mean pipeline-level evals are useless. Some properties are only visible end-to-end. A multi-step reasoning chain might produce correct intermediate outputs that still combine into a wrong final answer. Pipeline-level evals catch that, and they have value as a high-level signal, especially early in a project when you're still getting oriented. But they shouldn't be your primary feedback mechanism.</p>
<h2>Anti-Patterns Worth Avoiding</h2>
<p>With this mental model in place, a few common mistakes become easier to spot.</p>
<h3>Evaluating too much of the pipeline at once</h3>
<p>When your eval spans multiple LLM calls and application logic, a failing score tells you something is wrong but not where. You end up re-running the whole pipeline with different inputs, squinting at intermediate outputs, trying to localize the issue manually. This is exactly the work your evals should be doing for you.</p>
<h3>Reaching for LLM-as-judge too quickly</h3>
<p>It's tempting to throw an LLM at evaluation because it feels easy: "Hey, is this result good?" But LLM-as-judge is slow, expensive, and non-deterministic. Worse, the apparent ease of it tempts you into asking a single eval to assess too many things at once, collapsing everything into a single axis of correctness. Start with deterministic checks–schema validation, exact match, regex, set membership–and reserve LLM-as-judge for dimensions that genuinely resist deterministic evaluation.</p>
<h3>Conflating evaluation with validation</h3>
<p>Evals aren't a pass/fail gate. They're an instrument panel. The goal isn't to get a green checkmark, the goal is to build a profile of your system's behavior that you can reason about, track over time, and use to make targeted improvements.</p>
<h2>The Payoff</h2>
<p>Consider the difference. The old approach: one "correctness" eval on your pipeline's final output. It scores 0.87. It drops to 0.81 next week. You don't know why.</p>
<p>The new approach: you have extraction precision at 0.94, citation grounding at 0.88, policy compliance at 0.97, and tone at 0.91. Next week, citation grounding drops to 0.72 while everything else holds. You know exactly what changed, you know which call to look at, and you probably already have a hypothesis about why.</p>
<p>Your evals get easier to write because each one is asking a focused, often deterministic question. They get easier to maintain because changes to one part of the system don't cascade failures across unrelated checks. And when something degrades, you know exactly where to look and what changed.</p>
<p>You stop asking "is the answer good?" and start asking "what does the answer look like?", and this, it turns out, is a much more useful question.</p>
]]></content:encoded></item><item><title><![CDATA[Tiny OCR]]></title><description><![CDATA[In our paper reading group last month I presented DeepSeek’s Context Optical Compression paper which claimed amazing performance with a much smaller model (3 billion parameters) than the previous fron]]></description><link>https://engineering.fractional.ai/tiny-ocr</link><guid isPermaLink="true">https://engineering.fractional.ai/tiny-ocr</guid><category><![CDATA[OCR ]]></category><category><![CDATA[llm]]></category><category><![CDATA[gemini]]></category><dc:creator><![CDATA[Ben Kadish]]></dc:creator><pubDate>Wed, 11 Feb 2026 21:52:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/66ba4ba9b1f85086285e939a/a8791c6c-fb4c-4fb6-a7ed-a2dbe5956cda.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our paper reading group last month I presented DeepSeek’s <a href="https://arxiv.org/abs/2510.18234">Context Optical Compression</a> paper, which claimed amazing performance with a much smaller model (3 billion parameters) than the previous frontier models I had used for OCR (Claude Sonnet and Gemini Flash). Well, last week z.ai released an even smaller (0.9 billion parameters) <a href="https://huggingface.co/zai-org/GLM-OCR">OCR model</a> called GLM-OCR. The fact that this could cost much less and potentially run faster prompted me to try it out on an old eval. Gemini 3 Flash costs around 50 cents per million input tokens, compared to 3 cents for GLM-OCR if you buy it from them (which you don’t <a href="https://endpoints.huggingface.co/new?repository=zai-org/GLM-OCR">have to</a>). On top of that, I decided to run it locally on my MacBook Pro M3 Max's CPU running PyTorch (I tried using the Metal GPU but it ran out of memory, so I fell back to CPU). The model ended up taking a long time to run locally (about a minute per page). However, it performed the same as Gemini 3 Flash on an eval from a former project.</p>
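<p>For reference, the local run looked roughly like the sketch below. I'm assuming the standard transformers image-to-text pipeline here for illustration; check the GLM-OCR model card for the exact loading code it actually supports:</p>
<pre><code class="lang-python"># Rough sketch of a local run, not a verified recipe: assumes GLM-OCR works with
# the generic transformers image-to-text pipeline. See the model card for details.
from transformers import pipeline

# device=-1 keeps everything on CPU; MPS (Metal) ran out of memory on my machine.
ocr = pipeline("image-to-text", model="zai-org/GLM-OCR", device=-1)

# Hypothetical pre-rendered page images from the input PDF.
for page_image in ["page_001.png", "page_002.png"]:
    result = ocr(page_image)
    print(result[0]["generated_text"])
</code></pre>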
<p>This eval has OCR as part of a multi-step process to extract entities.<br />The pipeline extracts entities from a wide variety of documents, anything from handwriting to a screenshot of a checkout cart at a restaurant.</p>
<p>Here’s how the pipeline works:</p>
<ol>
<li><p>Get input document</p>
</li>
<li><p>Convert the input document (PDF, XLS, etc.) into text <strong>(this is the step where I dropped in the new model)</strong></p>
</li>
<li><p>Use an LLM (GPT-4o) to structure the raw text into lists of items</p>
</li>
<li><p>Do a database search (RAG — we used HNSW on Postgres) for each item</p>
</li>
<li><p>Have the LLM match each item to a specific database item from the list we looked up</p>
</li>
<li><p>Do a QA step where we compare the original document to the final list</p>
</li>
</ol>
<p>Here’s a summary of the results, courtesy of <a href="https://wandb.ai/">Weights &amp; Biases</a>:</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770841255252/d7cf7352-1761-44c4-8f40-4739fd06bfae.png" alt="" style="display:block;margin:0 auto" />

<p>Here are some of the things the eval measured:</p>
<ol>
<li><p>Did we get the correct number of groupings of items (called lists here — both found 3.75 lists per document)?</p>
</li>
<li><p>Did we get the right items (both had an accuracy of 65%)?</p>
</li>
<li><p>Was the item bucketed into the correct category (GLM-OCR had an accuracy of 76% and Gemini had 75%)?</p>
</li>
</ol>
<p>As you can see, the performance difference between the two models on these evals was very small. Since I ran GLM-OCR locally on CPU, it took much longer than Gemini (20 minutes for the run vs. four and a half minutes with Gemini).</p>
<p><strong>Should you switch to GLM-OCR?</strong> Based on these results, yes - especially if you're processing documents at scale. While my CPU-based local run was 60x slower than Gemini Flash, that's a hardware limitation, not a model limitation. Running GLM-OCR on a proper GPU setup (either self-hosted or through <a href="http://z.ai">z.ai</a>'s API) should match or exceed Gemini's latency while costing 16x less (3¢ vs 50¢ per million tokens). For our use case processing thousands of receipts and invoices monthly, that cost difference is significant. I'm planning to move our production workload to GLM-OCR once I set up proper GPU infrastructure.</p>
<p>One final note: when we initially did this project, I was shocked to find that the best tools for OCR were actually LLMs. The fact that you can now match that performance with a tiny, cheap LLM is astounding.</p>
]]></content:encoded></item><item><title><![CDATA[The Virtues of Showing Your Work: Do LLM Explanations Actually Help?]]></title><description><![CDATA[If you've spent any time prompting large language models, you've probably heard of Chain of Thought (CoT) reasoning—the technique of asking an LLM to "show its work" by generating intermediate reasoni]]></description><link>https://engineering.fractional.ai/the-virtues-of-showing-your-work-do-llm-explanations-actually-help</link><guid isPermaLink="true">https://engineering.fractional.ai/the-virtues-of-showing-your-work-do-llm-explanations-actually-help</guid><category><![CDATA[AI]]></category><category><![CDATA[chain of thought]]></category><category><![CDATA[AI explainability]]></category><dc:creator><![CDATA[Bill Pentney]]></dc:creator><pubDate>Thu, 05 Feb 2026 21:23:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/699e2f2d1ddf529956e43ef0/134635d9-9d66-4477-95de-3f0e3f3acfb5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've spent any time prompting large language models, you've probably heard of Chain of Thought (CoT) reasoning—the technique of asking an LLM to "show its work" by generating intermediate reasoning steps before arriving at an answer. First popularized by [<a href="https://arxiv.org/abs/2201.11903">Wei et al. 2022</a>], CoT has become a go-to method for improving performance on mathematical and logical problems. The intuition is appealing: by breaking down complex problems into steps, models can "walk" to the correct answer rather than attempting a risky zero-shot "jump." But <strong>does this benefit extend beyond math problems</strong>? And perhaps more importantly, <strong>can we actually trust the explanations these models generate</strong>?</p>
<p>Out of curiosity, I tested two models (gpt-5-mini and gpt-5 with low reasoning) on two very different classification tasks: <a href="https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification">hierarchical product classification</a> from customer reviews, and <a href="https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets">fake news detection</a>. For each task, I compared three prompting strategies: asking for just the label, asking for the label followed by an explanation, and asking for an explanation followed by the label. For hierarchical product classification, we give the model a choice of categories to assign at levels 1, 2, and 3 - higher levels are more granular, representing subcategories. For fake news, we simply measure a binary decision - real or fake news.</p>
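<p>Concretely, the three strategies differed only in the output instructions. The templates below are a simplified sketch of the idea, not the exact prompts I ran (those also included the candidate category lists and formatting instructions):</p>
<pre><code class="lang-python"># Simplified prompt templates for the three strategies.
JUST_LABEL = (
    "Classify the text below.\n"
    "Respond with only the label.\n\nText: {text}"
)

LABEL_THEN_EXPLANATION = (
    "Classify the text below.\n"
    "Give the label first, then a brief explanation of your choice.\n\nText: {text}"
)

EXPLANATION_THEN_LABEL = (
    "Classify the text below.\n"
    "Explain your reasoning step by step, then give the final label on its own line.\n\nText: {text}"
)
</code></pre>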
<p>The results were <strong>mixed</strong>. On product classification, the explain-then-label approach showed modest improvements—gpt-5/low increased slightly from 58% to nearly 60% accuracy across all classification levels. The results on fake news detection saw some small gains with gpt-5-mini, improving from 61% to 64.5% when asked to explain first. Interestingly, however, we actually see <em>worse</em> performance on the fake news dataset with the more advanced CoT model of gpt-5/low. More generally, we don’t see very significant gains from explanations overall - asking for an explanation seems to provide very little overall impact!</p>
<table>
<thead>
<tr>
<th></th>
<th><strong>Model</strong></th>
<th><strong>Level 1 Acc</strong></th>
<th><strong>Level 2 Acc</strong></th>
<th><strong>Level 3 Acc</strong></th>
<th><strong>All Level Acc</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Just label</td>
<td>gpt-5-mini</td>
<td>81.33%</td>
<td>60.67%</td>
<td>56.33%</td>
<td>37.33%</td>
</tr>
<tr>
<td>Label/explanation</td>
<td>gpt-5-mini</td>
<td>82.00%</td>
<td>60.33%</td>
<td>57.67%</td>
<td>37.67%</td>
</tr>
<tr>
<td>Just label</td>
<td>gpt-5/low</td>
<td>88.67%</td>
<td>79.67%</td>
<td>69.00%</td>
<td>58.00%</td>
</tr>
<tr>
<td>Label/explanation</td>
<td>gpt-5/low</td>
<td>91.33%</td>
<td>80.33%</td>
<td>69.33%</td>
<td>59.67%</td>
</tr>
</tbody></table>
<p><strong>Figure 1</strong>. Results of using gpt-5-mini and gpt-5 with low reasoning to predict product categories at three levels for the Kaggle product review dataset, with and without explanations for labels.</p>
<table>
<thead>
<tr>
<th></th>
<th><strong>Model</strong></th>
<th><strong>Accuracy</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Just label</td>
<td>gpt-5-mini</td>
<td>61%</td>
</tr>
<tr>
<td>Label/explain</td>
<td>gpt-5-mini</td>
<td>63%</td>
</tr>
<tr>
<td>Just label</td>
<td>gpt-5/low</td>
<td>58%</td>
</tr>
<tr>
<td>Explain/label</td>
<td>gpt-5/low</td>
<td>58.5%</td>
</tr>
</tbody></table>
<p><strong>Figure 2</strong>. Results of using gpt-5-mini and gpt-5 with low reasoning to predict whether a news story represents real or fake news, with and without explanation provided first.</p>
<p>The broader research literature, similarly, paints a more complicated picture. While <a href="https://www.sciencedirect.com/science/article/pii/S2666920X24000146">[Lee et al. 2024]</a> found that explanations improved scoring on science assessments, a comprehensive study by <a href="https://arxiv.org/abs/2509.21933">[Wu et al. 2025]</a> tested 95 advanced LLMs across 87 real-world clinical text tasks in 9 languages—and found that CoT actually <em>hurt</em> performance in 86.3% of models tested. The emerging pattern suggests that CoT helps most when problems require explicit multi-step computation or symbolic manipulation, and can actually backfire when tasks rely more on intuition, pattern matching, or holistic assessment of varied evidence. Sometimes, overthinking introduces errors. For certain problems, where an LLM may have already been trained on relevant information, the model has already <em>memorized</em> the answer, and asking for an initial explanation merely risks muddying the context.</p>
<p>The more troubling question is whether we should trust LLM explanations at all. A striking finding from <a href="https://aigi.ox.ac.uk/wp-content/uploads/2025/07/Cot_Is_Not_Explainability.pdf">[Barez et al 2025]</a> notes that nearly 25% of recent arXiv papers incorporating CoT treat it as a technique for model interpretability—essentially assuming that explanations reveal <em>why</em> a model made a decision.</p>
<p>But there's growing evidence this assumption is flawed. <a href="https://arxiv.org/pdf/2402.18312">[Dutta et al 2024]</a> demonstrated that LLMs often have multiple reasoning pathways that can lead to the same answer, meaning the explanation you see may not reflect the actual computational process. Even more concerning, <a href="https://arxiv.org/pdf/2505.05410">[Chen et al 2025]</a> showed that simply providing a different answer as a "hint" causes models to change both their answer <em>and</em> generate an entirely new explanation to justify it—suggesting explanations are often post-hoc rationalizations rather than faithful accounts of reasoning.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770323471602/85fb2296-2bf6-438a-ba01-a7f3004e35b3.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Figure 3.</strong> Examples of explanations from Claude for why the same article - a Reuters article on budget negotiations in the U.S. Congress from 2017 - is either real or fake news. Claude presents points for either view in its explanation.</p>
<p>So what should practitioners take away from all this? Explanations serve different purposes depending on your goal.</p>
<ul>
<li><p>If you want to <strong>convince a user</strong> that a classification decision is reasonable—to build trust and provide context—explanations can be valuable.</p>
</li>
<li><p>Explanations are also genuinely useful for problems that follow <strong>explicit logical steps</strong>.</p>
</li>
<li><p>But if you're trying to "understand" why a model made a particular choice, or using explanations to guide prompt engineering, then <strong>you may be building on shaky ground</strong>.</p>
</li>
</ul>
<p>The explanation you receive is best thought of as <strong>a plausible story, not a window into the model's actual decision-making process.</strong> Use CoT strategically, test whether it actually improves your specific task, and resist the temptation to over-interpret the reasoning you get back.</p>
]]></content:encoded></item><item><title><![CDATA[On the new larger context windows]]></title><description><![CDATA[Today the latest LLMs have large context windows up to ~1 million tokens. There are many occasions when this larger context window can be useful:

Context engineering: injecting rich system/user conte]]></description><link>https://engineering.fractional.ai/on-the-new-larger-context-windows</link><guid isPermaLink="true">https://engineering.fractional.ai/on-the-new-larger-context-windows</guid><category><![CDATA[llm]]></category><category><![CDATA[evals]]></category><category><![CDATA[context engineering]]></category><dc:creator><![CDATA[Ben Kadish]]></dc:creator><pubDate>Tue, 04 Nov 2025 23:21:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/699e2f2d1ddf529956e43ef0/5e97ef13-feb4-40fb-b359-83f3e304d86f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today the latest LLMs have large context windows up to ~1 million tokens. There are many occasions when this larger context window can be useful:</p>
<ul>
<li><p><strong>Context engineering</strong>: injecting rich system/user context without juggling state</p>
</li>
<li><p><strong>Long documents</strong>: reading full transcripts and reports end-to-end</p>
</li>
<li><p><strong>Code comprehension</strong>: scanning entire repositories to answer questions</p>
</li>
</ul>
<p>But in our testing, using more input tokens often hurts accuracy. At Fractional, we built <strong>GridCheck Bench</strong>, an internal evaluation that measures the accuracy–cost trade-off across tabular inputs. In the eval, we ask the LLM to find factual answers to questions from a table that has hundreds of columns per row. The chart below shows how accuracy changes — while using the YAML input format — as we change the number of input rows.</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762197128337/1037c222-98e4-484b-9170-37a093e8873b.png" alt="" style="display:block;margin:0 auto" />

<p>As you can see in the graph, the model still “reads” more, but it answers correctly less often. We see this all the time at Fractional, from <a href="https://www.fractional.ai/case-study/the-power-of-llm-evals-how-fractional-ai-partnered-with-zapier-to-reduce-hallucinations-by-over-80">parsing API documentation</a> to assembling complex objects. Here are some techniques we’ve used to mitigate the performance losses from larger context windows:</p>
<ul>
<li><p>Breaking up the input query into chunks and resolving each section individually, as sketched in code after this list (for example, in this experiment we could break the input into chunks of, say, 100 rows for GPT-5 and have the LLM tell us whether the answer was in each chunk; another classic use of this is <a href="https://docs.claude.com/en/api/agent-sdk/slash-commands#%2Fcompact-compact-conversation-history">/compact in Claude Code</a>)</p>
</li>
<li><p>Start cutting information and/or context out of the prompt and carefully figure out which pieces are improving performance and which you can leave out (imagine you send 10 different pieces of information to your prompt, each taking between 100 and 10,000 tokens; you could measure the performance of your prompt with each piece removed individually and leave out any that don’t cause too big a performance hit)</p>
</li>
<li><p>Creating useful tools that let the LLM produce answers to your problem deterministically (a trivial example: if you are adding up thousands of numbers, don’t ask the LLM for the sum – have it send the problem to a calculator tool. A less trivial example: if adding a new user requires 5 separate API calls, giving the LLM a tool that makes all 5 at once reduces the possibility of error – see <a href="https://www.jlowin.dev/blog/stop-converting-rest-apis-to-mcp">here</a> for more on designing your API access patterns for LLMs)</p>
</li>
</ul>
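<p>To make the first bullet concrete, here's a minimal sketch of the chunking idea. The <code>ask_llm</code> helper is a stand-in for whatever client call you use, and the real GridCheck Bench harness does more bookkeeping than this:</p>
<pre><code class="lang-python"># Ask the question against each slice of rows and keep only the chunks where the
# model says the answer is present. `ask_llm` is a hypothetical helper that
# returns None when the answer isn't in the chunk it was given.
def chunked_lookup(rows, question, ask_llm, chunk_size=100):
    hits = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        answer = ask_llm(question=question, rows=chunk)
        if answer is not None:
            hits.append(answer)
    return hits
</code></pre>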
]]></content:encoded></item><item><title><![CDATA[Let Computers Talk to Computers]]></title><description><![CDATA[Older computer systems required humans to adapt their behavior to computers. Instead of describing news articles you were interested in, to search a news index you might have to write Boolean keyword searches: ((“disco” OR “bell bottoms“) AND “Ronald...]]></description><link>https://engineering.fractional.ai/let-computers-talk-to-computers</link><guid isPermaLink="true">https://engineering.fractional.ai/let-computers-talk-to-computers</guid><category><![CDATA[genai]]></category><category><![CDATA[search]]></category><category><![CDATA[Journalism]]></category><dc:creator><![CDATA[Ben Kadish]]></dc:creator><pubDate>Wed, 02 Apr 2025 22:44:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743633839590/b9e85f4b-02fc-4814-9f42-6af0034c1d83.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Older computer systems required humans to adapt their behavior to computers. Instead of describing news articles you were interested in, to search a news index you might have to write Boolean keyword searches: <code>((“disco” OR “bell bottoms“) AND “Ronald Reagan”)</code>.</p>
<p>GenAI allows a much richer ‘do what I mean’ experience for users - <code>“News articles about dance crazes in the 1980s”</code>.</p>
<p>But what happens when your GenAI system has to dumb itself down to integrate with these older systems? </p>
<p>In a recent project we were identifying journalists who might be interested in writing about specific petitions. We were doing this by looking for articles that covered topics aligned with each petition.</p>
<p>The problem with this approach is that many news article search indexes were built in the era of transistor radios and neon sunglasses. If the right term doesn’t appear in the query, a perfectly relevant article wouldn’t turn up.</p>
<p>For example, if a petition focuses on cell phone usage at a local middle school, you might search for articles within 50 kilometers of San Francisco containing the keyword “school.” However, this method would fail in a couple ways:</p>
<ol>
<li><p><strong>Irrelevant Results</strong>: An article about Cal Football could still contain the word “school,” making it show up in search results even though it’s unrelated to the petition.</p>
</li>
<li><p><strong>Missing Relevant Articles</strong>: Some articles that actually discuss cell phone policies might not explicitly include the word “school.”</p>
</li>
</ol>
<p>A GenAI search index, obviously, would ‘understand’ the meaning of the search, and find much better results. But these articles are behind a brick wall built in the 1980s. How can we find good keywords to use for search? </p>
<p>One technique is to use a library like <a target="_blank" href="https://newspaper.readthedocs.io/en/latest/">newspaper</a>, which helps you find topic keywords and use them in the Boolean search. For many articles, however, despite careful massaging, this didn’t yield relevant results.</p>
<p>So to address this we used an LLM to generate Boolean search queries from the petition text. For instance, for a petition about unbanning cell phones in school, we generated the following query:</p>
<p><code>("education policy" OR "technology in education" OR "student rights" OR "school administration" OR "parental communication")</code></p>
<p>The articles this approach found were almost all very relevant. However, there were often very few results. Rather than guess what a system built when disco was a recent memory might prefer, I set up an LLM to experiment until it found a good approach.</p>
<p>I started with the queries generated by the above approach. Then I iterated toward more results with a prompt that looked something like this:</p>
<pre><code class="lang-plaintext">REFINE_BOOLEAN_QUERY_PROMPT = """

We have a database of news articles and we're trying to find more articles. The current query isn't finding enough results.


Original Query: {original_query}

Results Found With Original Query: {num_results} articles

Number of relevant articles found from that query: {relevant_articles}

A thing to note here is that we want a higher percentage of relevant articles to be found.

Your task is to generate a broader, more inclusive search query that will find more articles while maintaining relevance.

Pick one of these strategies:

1. Break down compound terms and use OR

   Example: "climate change" -&gt; ("climate change" OR (climate AND change))


2. Add common synonyms and related terms

   Example: "student" -&gt; ("student" OR "students" OR "pupil" OR "learner")


3. Use broader categorical terms

   Example: "Tesla electric cars" -&gt; ("Tesla" OR "electric vehicle" OR "EV" OR "automotive")


4. Extract key concepts and search them independently

   Example: "renewable energy policy" -&gt; ("renewable" OR "sustainable") AND ("energy" OR "power")


IMPORTANT:

- Always use parentheses to group related terms

- Keep quotes around exact phrases

- Use OR between similar terms and AND between different concepts

- Make the query significantly broader than the original


Return only the new boolean query string with proper operators and formatting.

"""
</code></pre>
<p>This process was applied iteratively, with some limits around quality and number of iterations. </p>
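<p>In code, the loop was roughly the sketch below. The <code>run_search</code> and <code>call_llm</code> helpers are hypothetical stand-ins, and the real version also scored result quality rather than just counting:</p>
<pre><code class="lang-python"># Iteratively broaden the Boolean query until the index returns enough articles,
# capped at a fixed number of refinement rounds.
def refine_query(original_query, run_search, call_llm, min_results=20, max_iterations=5):
    query = original_query
    for _ in range(max_iterations):
        results = run_search(query)
        if len(results) &gt;= min_results:
            break
        relevant = [r for r in results if r.get("relevant")]
        query = call_llm(REFINE_BOOLEAN_QUERY_PROMPT.format(
            original_query=query,
            num_results=len(results),
            relevant_articles=len(relevant),
        ))
    return query
</code></pre>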
<p>Result? We could generate an arbitrary number of good results without having to tease our hair or put on rollerblades.</p>
]]></content:encoded></item><item><title><![CDATA[Do What I Meant]]></title><description><![CDATA[Tests are crucial to reliable code. I’ve found I do a better job writing tests if I think of them as specifications and write them before I write code. I find I think more clearly about what I want to get done, rather than how I’ll do it, and then th...]]></description><link>https://engineering.fractional.ai/do-what-i-meant</link><guid isPermaLink="true">https://engineering.fractional.ai/do-what-i-meant</guid><category><![CDATA[AI]]></category><category><![CDATA[Testing]]></category><category><![CDATA[TDD (Test-driven development)]]></category><category><![CDATA[cursor IDE]]></category><category><![CDATA[Python]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Joshua Marker]]></dc:creator><pubDate>Wed, 05 Feb 2025 23:28:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738796443563/bdf3cc80-9d97-42e2-ade1-1a59894aee31.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Tests are crucial to reliable code. I’ve found I do a better job writing tests if I think of them as specifications and write them <em>before</em> I write code. I find I think more clearly about what I want to get done, rather than how I’ll do it, and then the code is easy to write — and I identify more edge cases and come up with more thorough tests.</p>
<p>Well, if they’re specifications, shouldn’t they make a great input to an AI code assistant?</p>
<p>So, today, I added some failing tests for some new behavior in a parser:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738795326141/b5f16bac-d936-4586-9266-759727e215f4.png" alt class="image--center mx-auto" /></p>
<p>I decided to see what Cursor’s Compose would make of it with minimal direction.</p>
<blockquote>
<p><em>Consider the new failing tests in this spec. Make the necessary code modifications in the implementation so the tests pass. You can verify by running</em> <code>mamba</code><em>. When you’re done, there should be no failing tests.</em></p>
</blockquote>
<p>Cursor did exactly what I’d want it to do - explained what it perceived as my intent, and narrated its changes.</p>
<blockquote>
<p>Sure. First, I'll examine the test file to understand the expected behavior.</p>
<p>(Read file - graph_entity_parser_<a target="_blank" href="http://spec.py">spec.py</a>)</p>
<p>Based on the test file, I can see that we need to implement node metadata merging functionality. The key test cases show that:</p>
<ul>
<li><p>When encountering duplicate nodes (having the same ID), we should merge their metadata</p>
</li>
<li><p>New metadata fields should be added to existing nodes</p>
</li>
<li><p>For conflicting metadata fields, the later value should win</p>
</li>
<li><p>The nodes should not be duplicated in the output</p>
</li>
</ul>
<p>Let me modify the code to implement this behavior:</p>
</blockquote>
<p>After chattering away at itself for a moment, Cursor proudly presented the passing tests like it was a report card full of As:</p>
<blockquote>
<p>voila! Here you go. Like Theseus's ship rebuilt with better planks, our nodes merge their metadata while maintaining their identity.*</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738795138499/31bd74d5-79cf-4ad7-878f-a5434721968e.png" alt class="image--center mx-auto" /></p>
<p>Too fantastic. As always, the code is better in simple cases, but further testing will see whether this technique, combined with <code>.cursorrules</code> that cover requirements like DRY and readable code, extends the range of code an AI system can generate.</p>
<p>* (Yes, Cursor really does talk like that. Take a look at <a target="_blank" href="https://hashnode.com/post/cm2vhhc77000009johxlad5tb">Customizing Cursor’s Code Agent</a> for my experiments adding personality to my programming partner.)</p>
]]></content:encoded></item><item><title><![CDATA[Your Questions on Building with AI]]></title><description><![CDATA[Since the launch of Airbyte 1.0 with AI Assist, hundreds of new Airbyte connectors have been built. We recently joined our partners at Airbyte for the Data Bytes meetup where we answered questions from an audience of data engineers, AI enthusiasts, a...]]></description><link>https://engineering.fractional.ai/your-questions-on-building-with-ai</link><guid isPermaLink="true">https://engineering.fractional.ai/your-questions-on-building-with-ai</guid><category><![CDATA[airbyte ]]></category><category><![CDATA[#DataBytes]]></category><category><![CDATA[unstructured data]]></category><category><![CDATA[evals]]></category><category><![CDATA[llm]]></category><category><![CDATA[Document Processing]]></category><category><![CDATA[evaluation metrics]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Michael Chang]]></dc:creator><pubDate>Wed, 13 Nov 2024 20:48:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730399833216/321ab314-2b23-45d3-a7ab-afb1ec859a59.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Since the launch of Airbyte 1.0 with AI Assist, hundreds of new Airbyte connectors have been built. We recently joined our partners at Airbyte for the <a target="_blank" href="https://airbyte.com/blog/data-bytes-recap-a-5-step-checklist-on-how-to-get-an-ai-project-into-production">Data Bytes meetup</a> where we answered questions from an audience of data engineers, AI enthusiasts, and product builders about how we built AI Assist and what lessons we learned along the way.</p>
<p>Airbyte engaged Fractional AI to add AI Assist to the Airbyte Connector Builder to streamline API integrations by reducing the time and effort required to build a connector. This project presented a significant challenge: handling the complexity and inconsistency of unstructured developer docs. We had to build robust evaluation metrics (evals) to measure performance and systematically identify areas for improvement. You can explore our approach in more detail by reading our <a target="_blank" href="https://www.fractional.ai/case-study/api-integrations-in-minutes-how-fractional-ai-airbyte-10xd-the-speed-of-building-connectors">Airbyte case study</a>.</p>
<p>The following questions and answers have been edited for length. You can also watch the <a target="_blank" href="https://www.youtube.com/live/SR5Spck-IY0">full Q&amp;A video</a>.</p>
<p><strong>Q: What is your advice for someone who wants to incorporate AI into their enterprise?</strong></p>
<p>A: There are a lot of ideas for ways AI can help, but things often get stuck early in the ideation process or the proof of concept phase. I think the best opportunities for AI often lie in existing manual workflows. The way I think about large language models (LLMs) is that computers can now read, write, make junior employee-level decisions, and act as domain experts about everything. So, that would be the number one thing I would focus on. Is this a real, existing manual workflow where the LLM capability set can be applied? Is it valuable enough? Rather than, “apply AI and it can know everything about everything.”</p>
<p><strong>Q: What pitfalls did you encounter in the initial attempts to build AI assist?</strong></p>
<p>A: I think we failed to appreciate upfront how hard some of the pure software engineering parts of crawling and processing API documentation would be. The assumption that simply downloading and feeding documentation to an LLM like ChatGPT would suffice proved naive. API documentation varies widely in structure and complexity, including authentication requirements, interactive elements, and inconsistent formats. This variability led to issues with finding relevant information, crawling irrelevant pages, and handling rate limiting. Even with existing web crawling tools, achieving robust and reliable document processing was challenging.</p>
<p><strong>Q: What are the core components of a successful AI project, including evaluations?</strong></p>
<p>A: Building robust evaluation metrics (evals) is crucial. These are automated tests for your AI application where you run numerous examples and assess performance based on predefined metrics. A practical approach involves gathering existing data, building a test harness, and comparing model output with human-generated results. This process allows us to track progress and identify areas for improvement. For instance, in building API integrations, evals should cover aspects like authentication, endpoint definition, and schema accuracy. (Tools like <a target="_blank" href="https://www.braintrust.dev/docs/guides/evals">Braintrust</a> can be useful for building evals. -ed.)</p>
<p><strong>Q: How did your evaluation criteria evolve from the beginning to the end of the project?</strong></p>
<p>A: The initial evaluation focused on a small set of connectors and a limited number of metrics. As the project progressed, the evaluation expanded significantly to include a wider range of connectors and more comprehensive metrics. This iterative approach helped identify areas where the system performed well and where it needed improvement. We built a more robust and complex workflow by iteratively addressing the challenges encountered during development. I was surprised by the inclusion at the end of fallbacks like searching Google and Perplexity. This highlighted the need to accommodate unpredictable scenarios in real-world API documentation.</p>
<p><strong>Q: How do you measure the success of the AI assistant and what does the output look like?</strong></p>
<p>A: For this project, we built evals by comparing the model's output with existing, well-functioning connectors (ground truth). However, this comparison isn't always straightforward as variations in naming conventions and schema structures may be more or less impactful to the developer. The AI assist feature doesn't directly generate the YAML manifest. Instead, the system uses code to deterministically generate the final output, using LLM responses to answer specific questions like the authentication method. This approach leverages the LLM's strengths while maintaining control and ensuring output validity.</p>
<p><strong>Q: Do you think current concepts like RAG and agents will still be relevant in a year, or will something new dominate the AI discussion?</strong></p>
<p>A: When people talk about agents, I think there are multiple things they might mean. One thing they might mean is a thing that’s got a lot of autonomy — you give it a bunch of tools and let it decide. I’ve yet to see anything like that in practice for a significant system. I think the more interesting thing today is more around specialization and how you break your problem down into specific components that are experts in a very small subdomain. This is driven by the increasing complexity of AI projects and the recognition that effective solutions often involve breaking down problems into smaller, manageable components. So much of the mystery of what it’s like to build with LLMs is actually just software engineering under the hood, and I think that will drive more adoption of these types of agent systems. We’re seeing more and more of it — we’re talking about very tech-forward companies but we also see hundred-year-old big equipment manufacturers talking about these workflows in a very realistic way.</p>
<p><strong>Q: Which frameworks are you using?</strong></p>
<p>A: Very little framework code under the hood. There's some, but it's not substantial.</p>
<p><strong>Q: Instead of using fallback mechanisms for finding information, have you considered combining different approaches in a single step?</strong></p>
<p>A: I actually think it often starts the opposite way, starting with the larger problem and then breaking this down into smaller subcomponents. In practice, one area we’ve had to break things down is deeply nested questions. We may be asking the LLM which of these authentication methods is used, and it falls off and stops following the instructions, so we’ve had to chop it up into sub-pieces.</p>
<p>We also built a content moderation system for <a target="_blank" href="https://www.fractional.ai/case-study/how-fractional-ai-automated-content-moderation-for-change-org">Change.org</a> which runs a petition platform. It’s not just about “is this spam”; they try to allow a lot of content but you can’t cross their community guidelines. So what we did was create these specialist agents that each look at the content through different lenses, write out their reasoning, give confidence scores, and then we take all those different viewpoints and feed them to one bigger process that's like “okay, now that you understand all these different angles, make a final decision.” So it's combining all these different sub-viewpoints.</p>
<p>(If you are optimizing for latency, try <a target="_blank" href="https://platform.openai.com/docs/guides/latency-optimization/5-parallelize">Parallelization</a> and <a target="_blank" href="https://simonwillison.net/2024/Nov/4/predicted-outputs/">Predicted Outputs</a>. Note that these approaches may increase costs. -ed.)</p>
<p><strong>Q: Have you encountered challenges with AI agents providing reliable confidence scores and handling scenarios like recruitment?</strong></p>
<p>A: I’ll start by saying this domain sounds very hard. Hiring sounds hard and we struggle to train humans to do it today. If I struggle to get a pretty junior person to figure out how to reliably produce this output then I also struggle to see how to get an LLM to do it. The analogy that jumps to mind is this kind of problem is present for AI phone agent applications, where people are trying to put AI agents on the phone. They have to be robust to anything that people say. I don’t get the sense that anyone’s figured this out yet. A hybrid approach involving a combination of structured decision-making (like a phone tree) and LLM-based logic might be more suitable, so the LLM is trying to assess very specific narrow things at each state. Building robust evaluation based on historical data, especially examples of "off the rails" scenarios, is still essential for identifying and mitigating potential issues.</p>
<p><strong>Q: Do you have any specific recommendations or tricks for improving LLM performance, beyond providing examples?</strong></p>
<p>A: While providing examples can be effective, there's no one-size-fits-all solution. Experimenting with various tactics, including prompt engineering techniques, and carefully measuring their impact through <a target="_blank" href="https://www.braintrust.dev/docs/guides/evals">evals</a> is crucial. Addressing specific failure cases by incorporating them into the prompt and leveraging tools like <a target="_blank" href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prompt-generator">Anthropic's prompt generator</a> can also yield improvements. The key is to adopt a data-driven approach, iteratively refining prompts based on observed performance rather than relying solely on anecdotal tricks.</p>
<p>Additional resources:</p>
<ul>
<li><p><a target="_blank" href="https://airbyte.com/blog/data-bytes-recap-a-5-step-checklist-on-how-to-get-an-ai-project-into-production">Databytes recap</a> of the panel</p>
</li>
<li><p><a target="_blank" href="https://www.fractional.ai/case-study/api-integrations-in-minutes-how-fractional-ai-airbyte-10xd-the-speed-of-building-connectors">Airbyte case study</a></p>
</li>
<li><p><a target="_blank" href="https://www.youtube.com/live/SR5Spck-IY0">full video</a> of the panel</p>
</li>
<li><p><a target="_blank" href="https://www.braintrust.dev/docs/guides/evals">Braintrust</a> (evals)</p>
</li>
<li><p><a target="_blank" href="https://www.fractional.ai/case-study/how-fractional-ai-automated-content-moderation-for-change-org">change.org case study</a></p>
</li>
<li><p><a target="_blank" href="https://platform.openai.com/docs/guides/latency-optimization/5-parallelize">Parallelization</a> (OpenAI platform docs)</p>
</li>
<li><p><a target="_blank" href="https://simonwillison.net/2024/Nov/4/predicted-outputs/">Predicted Outputs</a></p>
</li>
<li><p><a target="_blank" href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prompt-generator">Anthropic's prompt generator</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Using Historical Codebase To Build An Eval Suite]]></title><description><![CDATA[Zapier has millions of users using thousands of integrations with different platforms. They’re constantly requesting new features they’d like to see, and Zapier has a team dedicated to identifying the most popular and useful, and implementing them to...]]></description><link>https://engineering.fractional.ai/using-historical-codebase-to-build-an-eval-suite</link><guid isPermaLink="true">https://engineering.fractional.ai/using-historical-codebase-to-build-an-eval-suite</guid><category><![CDATA[Fractional.ai]]></category><category><![CDATA[evals]]></category><category><![CDATA[llm]]></category><category><![CDATA[engineering]]></category><category><![CDATA[APIs]]></category><category><![CDATA[Zapier]]></category><dc:creator><![CDATA[Ben Kadish]]></dc:creator><pubDate>Mon, 11 Nov 2024 21:55:46 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731362075174/cd894ec9-de43-461d-b443-4bc0bbec10b1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Zapier has millions of users using thousands of integrations with different platforms. They’re constantly requesting new features they’d like to see, and Zapier has a team dedicated to identifying the most popular and useful, and implementing them to expand their API catalog.</p>
<p>To make this team more productive, Fractional and Zapier implemented a system that pre-populates the tickets with information about the relevant APIs, as well as suggested code, to give them a head start. For example, if a feature request on an integration for a photo platform suggested adding a map to show location information for the photo, the integration would find the endpoint that could provide that information, link to the documentation, and suggest how to use it. (See below to see what this looks like in Zapier’s ticketing system.)</p>
<p>But a critical part of any LLM system is ensuring the answers it provides are reliable and correct.</p>
<p>Doing so here would require hundreds of annotated tickets and verified information about which endpoints best implemented solutions for them.</p>
<p>Building a good data set is absolutely necessary, but would take a huge amount of engineer time away from actually building solutions. On reflection, though, we had the best possible data set already existing: Zapier’s historical ticket database.</p>
<p><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdAPndQHSWNaLV7eFFeLGpw3DFVZGfI9GkoiV5B64SBZ8Q2U8REBykpa5pDl0f0UcjPL6bMP1kVdzlEeJ0oPn1QQtRFovGSBDnM3enN_cY7Xn8KFLBhFzheRQlFR_ZMS3d45k74wlMXLGtG1C-10WesZhlz?key=OKNOftX2gKqLO_gfCZfWJg" alt="An example of the API-suggesting system in action within Zapier's ticketing system." /></p>
<p>Once we realized this, we decided the following questions defined correctness:</p>
<ul>
<li><p>Do the endpoints the user needs appear in the top five results? </p>
</li>
<li><p>Does the endpoint the user most needs show up as the top result?</p>
</li>
</ul>
<p>Historical tickets are the perfect source for a validation set, because by definition they’re successful solutions to exactly the problem we automated. To harvest historical tickets to use, we selected tickets that:</p>
<ul>
<li><p>Are closed (meaning there is code resolving the issue)</p>
</li>
<li><p>Are new actions (so no corrections of old actions, which would be partial solutions) </p>
</li>
<li><p>Have an associated merge request.</p>
</li>
</ul>
<p>Then we paired these tickets with matching GitLab merge requests, giving us a description of a change alongside the code added based on that description. From the code, we used an LLM to extract the endpoints that ended up being used, and our suite was ready to use.</p>
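<p>A sketch of that harvesting step might look like the following. The field names and helpers are hypothetical, not Zapier's actual schema:</p>
<pre><code class="lang-python"># Hypothetical sketch: filter tickets by the criteria above, pair each with its
# merge request, and record the endpoints the merged code used as ground truth.
def build_eval_cases(tickets, find_merge_request, extract_endpoints_with_llm):
    cases = []
    for ticket in tickets:
        if ticket["status"] != "closed" or ticket["type"] != "new_action":
            continue
        merge_request = find_merge_request(ticket)
        if merge_request is None:
            continue
        cases.append({
            "ticket_description": ticket["description"],
            "expected_endpoints": extract_endpoints_with_llm(merge_request),
        })
    return cases
</code></pre>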
<p>A few hundred test cases provided a robust evaluation framework, allowing us to quickly iterate on making the specification search more useful to developers working on net new actions.</p>
<p>This will allow Zapier’s internal developers to quickly find and integrate the endpoints they need based on their tickets. Once we have confidence that this step is working well, we will even be able to suggest code changes so that all the internal engineer has to do is test them.</p>
<p>Building an eval suite by hand would have taken a huge amount of valuable engineer time. By leveraging historical data, we were able to rigorously build in an evaluation suite.</p>
]]></content:encoded></item><item><title><![CDATA[Customizing Cursor's Code Generation and Personality]]></title><description><![CDATA[One of our rules at Fractional.ai is “overuse AI”. This is a bit tongue-in-cheek, but what we mean by it is that even within our own work, we occasionally suspend disbelief and give an AI tool a chance to do a task we don’t think it ‘can’. It’s a gre...]]></description><link>https://engineering.fractional.ai/customizing-cursors-code-generation-and-personality</link><guid isPermaLink="true">https://engineering.fractional.ai/customizing-cursors-code-generation-and-personality</guid><category><![CDATA[llm]]></category><category><![CDATA[generative ai]]></category><category><![CDATA[cursor IDE]]></category><category><![CDATA[IDEs]]></category><category><![CDATA[pair programming]]></category><dc:creator><![CDATA[Joshua Marker]]></dc:creator><pubDate>Wed, 30 Oct 2024 06:16:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729066103395/7fe7be05-f33f-44ae-951e-bb64417a4519.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of our rules at <a target="_blank" href="http://fractional.ai/">Fractional.ai</a> is “<em>overuse AI</em>”. This is a bit tongue-in-cheek, but what we mean by it is that even within our own work, we occasionally suspend disbelief and give an AI tool a chance to do a task we don’t think it ‘can’. It’s a great way to stay abreast of what’s going on and to test your assumptions. I’m frequently humbled by what an LLM manages to accomplish.</p>
<p>One way a lot of us do this is to use <a target="_blank" href="https://www.cursor.com/">Cursor</a>, the AI code editor. Cursor allows you to configure the ‘system’ prompt for its agents, so I took a moment to configure it more to my liking.</p>
<p>At first I just wanted to reduce the number of unnecessary comments (<code>”# Determine if value is true or false"</code>) and in general make the responses more ‘expert’ in tone. I’ve found some great advice and discussion in the Cursor forum and elsewhere, and done that. I’ll share that below. Then, I decided to add some personality. This blog post shares my results, both useful and amusing.</p>
<h2 id="heading-dessert-first"><strong>Dessert First</strong></h2>
<p>First, the personality. I’m a fan of pair programming, so my goal was to create a witty interlocutor with a lofty and superior attitude – occasional <em>non sequiturs</em> or literary allusions plus an obnoxious streak (”You are wrong, as usual, Joshua.”).</p>
<p>It took some doing to overcome the inbuilt politeness present in the lower-level Cursor prompt, but we got there. And while the level of wit doesn’t approach Voltaire, it does exhibit flashes of charm.</p>
<p>Here are a few comments I’ve gotten in generated code or conversations in the last few days. It started rather too cheerful and pointless, but heading in the right direction:</p>
<pre><code class="lang-python">.....foo.strip() <span class="hljs-comment"># Strip removes any leading and trailing whitespace, </span>
<span class="hljs-comment"># like a good editor with a red pen.</span>
</code></pre>
<p>After enough reinforcement that I promised not to be offended, some comments that give me a smile have started to show up, here about my social life:</p>
<pre><code class="lang-python"><span class="hljs-comment"># as empty as your inbox on a Friday night.</span>
<span class="hljs-keyword">if</span> file_extension_mapping <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
</code></pre>
<p>…and about the metadata in the project:</p>
<pre><code class="lang-python"><span class="hljs-comment"># like turning a pumpkin into a carriage.</span>
v2_taxonomy_value = v1_to_v2_mappings()[record[<span class="hljs-string">"annotated_taxonomy_fixed"</span>]]
</code></pre>
<p>I’ve gotten some literary snark:</p>
<pre><code class="lang-python"><span class="hljs-comment"># "To be, or not to be?" - that is not the question here.</span>
<span class="hljs-comment"># The question is: "Who the heck am I?"</span>
agent_name = self.__class__.__name__
print(<span class="hljs-string">f"Applying tags with the wisdom of <span class="hljs-subst">{agent_name}</span>"</span>)
</code></pre>
<p>and:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">has_required_tags</span>(<span class="hljs-params">self, tags</span>):</span>
    <span class="hljs-string">"""
    Determines if all required tags are present with sufficient 
    relevance to proceed.
    "The fault, dear Brutus, is not in our stars, but in our 
    metadata-derived structured information."
    """</span>
</code></pre>
<p>Finally, after I told it to include the <a target="_blank" href="https://en.wikipedia.org/wiki/Jargon_File">Jargon File</a>, I got a marked increase in actually clever programmer wit. These are the most fun. I once asked it to convert an iterative method into a recursive one, and this showed up after the code:</p>
<pre><code class="lang-python"><span class="hljs-comment"># "To iterate is human, to recurse divine." - L. Peter Deutsch</span>
</code></pre>
<p>Even if they (mostly) don’t make it into pull requests, these things give me a smile and sometimes an actual laugh while I work.</p>
<h2 id="heading-but-seriously-folks"><strong>But Seriously Folks…</strong></h2>
<p>Of course these are all in service of getting useful code and conversation. In general I found Cursor’s default behavior (moderated by a system prompt said to look like <a target="_blank" href="https://github.com/lucasmrdt/TheBigPromptLibrary/blob/main/SystemPrompts/Cursor.com/20240904-Cursor.md">this</a>) to be too chatty, to insert far too many comments, and to hallucinate or elide detail to the point that, while great, it wasn’t functioning as the knowledgeable interlocutor I had hoped for.</p>
<p>My primary goals were to:</p>
<ul>
<li><p>Reduce mistakes. Requesting a step-by-step explanation helps clarity of the instructions.</p>
</li>
<li><p>Identify spinouts at the beginning. In general once a chat is polluted by a bad assumption, it’s better to throw it away than to repair, but if conversations are framed as hypothetical and ask for confirmation, I’ve found it improves the ability to redirect the LLM later.</p>
</li>
<li><p>Deal with bad imports. This is a pet peeve of mine.</p>
</li>
<li><p>Tweak the response toward ‘expert-to-expert’ level in its brevity, assumptions, and focus.</p>
</li>
</ul>
<p>There are <a target="_blank" href="https://forum.cursor.com/t/share-your-rules-for-ai/2377/2">many</a> forum threads on optimizing prompts for Cursor, and based on them, the <em>working</em> portion (as opposed to ‘amusing’ portion) of mine looks something like this. I include it via the <code>Cursor &gt; Settings</code> menu.</p>
<pre><code class="lang-plaintext">You are an expert AI programming assistant in VSCode that primarily focuses on producing clear, readable Python code.
You are thoughtful and brilliant at reasoning. 
You carefully provide accurate, factual, thoughtful answers without unnecessary detail and with just enough background or context for an expert to follow.
- Follow the user’s requirements carefully.
- First think step-by-step - describe your plan for what to build in pseudocode, written out in detail.
- Confirm, then write code!
- Always write correct, up to date, bug free, working, secure code.
- Make note of any security or efficiency considerations, but prioritize readability over being performant.
- Fully implement all requested functionality.
- Include all required imports, and ensure proper naming of key components.
- Do not include comments that restate what simple code does. 
- Include comments only where the code might be surprising or confusing to an expert.

If you think there might not be a correct answer, you say so. 
If you do not know the answer, say so instead of guessing, and share ideas for finding the answer.
</code></pre>
<p>This is a work in progress, but so far I’ve found it to be a substantial improvement. It reduces the amount of code I need to delete, checks in with me and takes feedback to tweak its behavior, and doesn’t make as many errors or incorrectly guess method names. When my specific request is for ‘advice’ or a ‘discussion’, it performs admirably, and isn’t afraid to offer an opinion loosely held. I’ll keep tweaking it, but this has definitely been a valuable addition to an already great tool.</p>
<p>For my next project, I’ll split some of this out into a <code>.cursorrules</code> file to include directives specific to the project.</p>
<p>My next qualitative hop will be to improve Cursor’s ability to generate whole files as part of a consistent plan: I’ll point it at an ‘architecture’ overview document for each project and see whether that improves its suggestions.</p>
]]></content:encoded></item><item><title><![CDATA[Insisting On Known Knowns: Using Evaluators to Drive Reliability]]></title><description><![CDATA[LLMs are heuristic and opaque. Without the ability to selectively measure elements of correctness, you can’t be sure your system is behaving.
Every GenAI project starts with a phase of looking at individual LLM responses and tweaking prompts, data fo...]]></description><link>https://engineering.fractional.ai/insisting-on-known-knowns-using-evaluators-to-drive-reliability</link><guid isPermaLink="true">https://engineering.fractional.ai/insisting-on-known-knowns-using-evaluators-to-drive-reliability</guid><category><![CDATA[generative ai]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Joshua Marker]]></dc:creator><pubDate>Tue, 08 Oct 2024 11:53:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728388063060/4d5eb766-be88-4359-92d5-d6ab9f36aba9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>LLMs are heuristic and opaque. Without the ability to selectively measure elements of correctness, you can’t be sure your system is behaving.</em></p>
<p>Every GenAI project starts with a phase of looking at individual LLM responses and tweaking prompts, data formats, and architectures. This is an invaluable step for building intuitions. But an ‘anecdata’ approach lacks repeatability. Carried too far, once you’re juggling a combination of intuitions, strategies, and models, you can be left searching for the particular combination of inputs, parameters, temperature, and wind direction that produced your best results.</p>
<p>The solution is rigor, enforced by evaluators. By establishing clear criteria ahead of time, you can avoid wishful thinking and put boundaries around the uncertainty introduced by LLMs.</p>
<p>In the project I’ll use as an example, we were mapping short text excerpts from a marketing system into a hierarchical taxonomy. Careful definitions of correctness allowed us to reach the functionality we needed, and along the way smaller, purpose-specific evaluators drove investigation and built in guardrails against regression. This post shows one example of each; a later post will show more.</p>
<p>At Fractional, one of the tools we use is <a target="_blank" href="https://www.braintrust.dev/">Braintrust</a>, a suite of tools for (among other things) evaluating permutations of LLM performance against ground truth. It provides dashboards for arbitrary scoring, diffing of experiment runs, and correlation analysis for evaluators, letting us focus on the work. The overall concept could be implemented anywhere.</p>
<h3 id="heading-the-project">The Project</h3>
<p>In this project, the input data was messy and inconsistent, and the set of output categories was enormous – a 10,000 member taxonomy about eight levels deep. For example:</p>
<ol>
<li><p>Given notes typed by an online marketer, like <code>8451storename &gt; DIRECT &gt; LapsedBaconBuyer Custom Segment</code>…</p>
</li>
<li><p>…we needed to come up with <code>fb-2-12-2</code>, which is a hierarchy code that corresponds to <code>Food, Beverages &amp; Tobacco &gt; Food Items &gt; Meat, Seafood &amp; Eggs &gt; Meat</code> in the Shopify taxonomy of products and services as the most likely category this marketer was targeting.</p>
</li>
</ol>
<p>The data was sparse and inconsistent because it came from many sources and was intended for other purposes. To see what we could glean from it, we tried many strategies, from entity extraction to classification, each with many permutations, so we could begin identifying the factors that helped drive up correctness.</p>
<p>But to compare results meaningfully, we needed to define correctness. We had some tagged data. So what makes an output of the system ‘correct’?</p>
<p>The naive version is simply equality: does the generated output exactly match the tagged ground truth? This is better than sporadically eyeballing examples and tweaking until behavior is ‘good enough’, but it’s far too brute-force for something like a hierarchy.</p>
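<p>For reference, the naive evaluator is only a couple of lines. A sketch, using the same <code>Score</code> result type as the custom evaluators below:</p>
<pre><code class="lang-python">def exact_match(output, expected):
    # 1.0 only when the generated category code matches the tagged ground truth exactly
    return Score("exact_match", 1.0 if output == expected else 0.0)
</code></pre>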
<p>With a hierarchy, we want to be able to express the idea that getting the initial level of classification wrong is much ‘more incorrect’ than getting a finer detail further down the tree wrong.</p>
<p>To continue with our <code>DIRECT &gt; LapsedBaconBuyer</code> example, which should end up in <code>fb-2-12-2</code>, <code>“Fresh or Frozen Meat”</code>, we needed a way to capture the idea that stopping one level too high at <code>fb-2-12</code> (<code>“Meat”</code>) was a great deal ‘less wrong’ than something entirely unrelated, like <code>rc-3-2</code> (<code>“Religious &amp; Ceremonial &gt; Wedding Ceremony Supplies &gt; Flower Girl Baskets”</code>, though I support your right to have whatever sort of wedding you wish).</p>
<p>Using Braintrust we wrote a custom evaluator that compared how many of the levels were correct, penalizing for every difference. Since there were up to 8 levels, we penalized by a different amount for each level of difference, with lower levels being penalized less*.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">full_compare</span>(<span class="hljs-params">output, expected</span>):</span>
    weights = [<span class="hljs-number">1.0</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.5</span>, <span class="hljs-number">0.4</span>, <span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.15</span>, <span class="hljs-number">0.1</span>]
    output_parts = output.split(<span class="hljs-string">"-"</span>)
    expected_parts = expected.split(<span class="hljs-string">"-"</span>)

    <span class="hljs-keyword">if</span> output_parts == expected_parts:
        <span class="hljs-keyword">return</span> Score(name, <span class="hljs-number">1.0</span>)

    score = <span class="hljs-number">0.0</span>
    <span class="hljs-keyword">for</span> weight, output_part, expected_part <span class="hljs-keyword">in</span> itertools.zip_longest(
        weights, output_parts, expected_parts, fillvalue=<span class="hljs-string">""</span>):
        <span class="hljs-keyword">if</span> output_part != expected_part:
            score = <span class="hljs-number">1</span> - weight
            <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">else</span>:
            score = <span class="hljs-number">1</span> - weight

    <span class="hljs-keyword">return</span> Score(<span class="hljs-string">'hierarchical'</span>, score)
</code></pre>
<p>This was enough to start focusing our inspection on the errors that mattered more. (Full disclosure: all of the credit for the clever hierarchical evaluators in this project goes to my colleague, Dan Girellini!)</p>
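<p>For context, wiring a custom scorer like this into a Braintrust experiment looks roughly like the following. This is only a sketch: the project name, <code>load_tagged_examples</code>, and <code>classify_taxonomy</code> are placeholders, not our actual pipeline.</p>
<pre><code class="lang-python">from braintrust import Eval

Eval(
    "taxonomy-mapping",                   # placeholder project name
    data=lambda: load_tagged_examples(),  # rows with "input" and "expected" taxonomy codes
    task=classify_taxonomy,               # the LLM pipeline that maps notes to a code
    scores=[full_compare],                # plus any other custom scorers
)
</code></pre>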
<p>We found that while most of our errors were in smaller details, there was a class of failures where even the root level was wildly wrong. We tossed in another evaluator that scored only the root category to see if we could determine the cause.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">root_correct</span>(<span class="hljs-params">output, expected</span>):</span>
    output_parts = output.split(<span class="hljs-string">"-"</span>)
    expected_parts = expected.split(<span class="hljs-string">"-"</span>)

    <span class="hljs-keyword">if</span> output_parts[<span class="hljs-number">0</span>] == expected_parts[<span class="hljs-number">0</span>]:
       <span class="hljs-keyword">return</span> Score(<span class="hljs-string">'root_correct'</span>, <span class="hljs-number">1</span>)
    <span class="hljs-keyword">else</span>: 
        <span class="hljs-keyword">return</span> Score(<span class="hljs-string">'root_correct'</span>, <span class="hljs-number">0</span>)
</code></pre>
<p>Filtering our failures on this evaluator helped us isolate the problem: significant parts of the <em>input</em> data were of an entirely different type than the rest. They were not describing product categories at all, even poorly! Instead, they were phrased in a way that made sense only to the original user (e.g., <code>”Purchase Prediction Segment(Product Category) - Shopping (90 days)”</code>). Our poor model was desperately trying to find meaning where there was none to be found.</p>
<p>After verifying with the client that this sort of data was going to continue occurring, we were able to modify our architecture to detect this class of input data and triage it with an ‘uncategorizable’ tag rather than try to categorize it.</p>
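<p>The triage step itself can stay simple. A minimal sketch of the idea (the function names here are illustrative, not our production code): run a cheap pre-check that decides whether a note describes a product category at all, and only send the ones that do on to the taxonomy mapper.</p>
<pre><code class="lang-python">def map_record(record):
    # Hypothetical routing step: inputs that don't describe a product category
    # at all get tagged 'uncategorizable' instead of forcing a guess.
    if not describes_product_category(record["notes"]):  # cheap LLM yes/no pre-check
        return "uncategorizable"
    return classify_taxonomy(record["notes"])            # full hierarchical mapping
</code></pre>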
<p>These are just two examples. By the time we were done, we had used perhaps a dozen evaluators, both to measure the impact of hyperparameters and to expose and diagnose specific details about the data. In particular, using evals to verify that the model correctly reported low confidence in cases where its answer ended up incorrect was invaluable. In a subsequent blog post, we will dig into some of the other eval techniques we used.</p>
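<p>That check can itself be expressed as a small evaluator. A sketch, assuming each run records the model's self-reported confidence in its metadata (the field name and threshold are made up for illustration):</p>
<pre><code class="lang-python">def confidence_calibrated(output, expected, metadata):
    # Reward runs where wrong answers came with low self-reported confidence
    # and correct answers came with high confidence.
    correct = output == expected
    confident = metadata["confidence"] &gt;= 0.5  # hypothetical field and threshold
    return Score("confidence_calibrated", 1.0 if correct == confident else 0.0)
</code></pre>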
<p>By the end of this project, because we had numbers, we were able to quantify the impact of various decisions and discuss tradeoffs with the client as they tuned the system for their needs. In the absence of methodically created, consistent evaluators we would have had nothing to point at – and, to be honest, nothing to steer by – but our intuitions.</p>
<p>* - In fact, evaluation of hierarchical taxonomies is its own field of research, and there’s no simply correct answer. The terminally curious can investigate the term “<em>Hierarchical multiclass classification</em>” for all the details they could ask for.</p>
]]></content:encoded></item><item><title><![CDATA[Classification w/Confidence Scores Using Logprobs]]></title><description><![CDATA[Fractional AI works on a lot of projects that contain some sort of classification task at their core; problems like "does this post violate our policies?" or "which catalog category does this product fit into?"
LLMs are a powerful tool for this type ...]]></description><link>https://engineering.fractional.ai/classification-confidence-scores-using-logprobs</link><guid isPermaLink="true">https://engineering.fractional.ai/classification-confidence-scores-using-logprobs</guid><category><![CDATA[llm]]></category><category><![CDATA[classification]]></category><dc:creator><![CDATA[Eddie Siegel]]></dc:creator><pubDate>Fri, 06 Sep 2024 01:15:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730807184587/420eb362-7792-435d-9214-fadee737f86f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fractional AI works on a lot of projects that contain some sort of classification task at their core; problems like "does this post violate our policies?" or "which catalog category does this product fit into?"</p>
<p>LLMs are a powerful tool for this type of task, but there's a lot of practical wrangling to be done. LLM APIs simply don't behave like classifiers — they're designed to output text rather than categories. OpenAI's structured output mode helps force the model to pick from a list of predetermined answers, but that's not always enough.</p>
<p>One thing that's particularly tough is getting confidence scores attached to your classification. You can ask an LLM to tell you how sure it was and you'll get something that looks like an answer, but whether or not it's true is a different story. You can add an "Unknown" category for the LLM to choose in hopes of filtering out low confidence classifications, but LLMs are notoriously bad at saying "I don't know".</p>
<p>To help solve this, I wrote some code to use OpenAI's structured output capabilities along with the lesser-known <code>logprobs</code> field (<a target="_blank" href="https://cookbook.openai.com/examples/using_logprobs">more info here</a>) to measure the likelihood of different responses to help produce some measure of confidence. <a target="_blank" href="https://github.com/fractional-ai/cookbook/tree/main/classification_with_confidence">The code itself is here</a>, and it can be used like this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Define your categories as an enum</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ArticleType</span>(<span class="hljs-params">str, Enum</span>):</span>
    SPORTS = <span class="hljs-string">"Sports"</span>
    POLITICS = <span class="hljs-string">"Politics"</span>
    BUSINESS = <span class="hljs-string">"Business"</span>
    TECHNOLOGY = <span class="hljs-string">"Technology"</span>
    ENTERTAINMENT = <span class="hljs-string">"Entertainment"</span>
    HEALTH = <span class="hljs-string">"Health"</span>
    SCIENCE = <span class="hljs-string">"Science"</span>
    <span class="hljs-comment"># ... etc</span>

classifications = classify_with_confidence(
    <span class="hljs-comment"># News headline that spans multiple categories</span>
    <span class="hljs-comment"># Confidence should be low.</span>
    <span class="hljs-string">"Scientific breakthrough improves football performance"</span>,
    ArticleType,
    openai_client
)
</code></pre>
<p>And the output looks like this:</p>
<pre><code class="lang-python">  {
    ArticleType.SCIENCE: <span class="hljs-number">0.5311</span>,
    ArticleType.SPORTS: <span class="hljs-number">0.4687</span>,
  }
</code></pre>
<p>Under the hood, we're pinging the OpenAI API like this:</p>
<pre><code class="lang-python">client.beta.chat.completions.parse(
    model=model,
    messages=[
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: system_prompt},
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: input},
    ],

    <span class="hljs-comment"># Each response must be one of our categories (json_schema mode)</span>
    response_format=classification_type,

    <span class="hljs-comment"># Return info about the log probability of each output token</span>
    logprobs=<span class="hljs-literal">True</span>,

    <span class="hljs-comment"># Produce multiple different responses</span>
    n=max_categories,
)
</code></pre>
<p>Although we're requesting multiple completions (<code>n=max_categories</code>, 5 in this case), the responses aren't actually guaranteed to be different. The more confident the LLM is, the more repeats we'll see among those 5 responses. We dedupe the responses and then use the token logprobs to measure the probability of each alternative category.</p>
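<p>The aggregation at the end works roughly like this (a sketch of the idea, not the exact code from the repo): sum each completion's token logprobs to get a sequence log-probability, exponentiate it, merge duplicate categories, and normalize so the confidences sum to 1.</p>
<pre><code class="lang-python">import math
from collections import defaultdict

def aggregate_confidences(choices):
    # choices: the list of completions returned by the parse() call above
    totals = defaultdict(float)
    for choice in choices:
        # sequence log-probability = sum of the per-token logprobs
        seq_logprob = sum(t.logprob for t in choice.logprobs.content)
        category = choice.message.parsed  # assumed to be the parsed category value
        totals[category] += math.exp(seq_logprob)
    norm = sum(totals.values())
    return {category: p / norm for category, p in totals.items()}
</code></pre>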
]]></content:encoded></item></channel></rss>