Build Better AI Agent Tools Using Evaluations

PLUS - SemTools Supercharges CLI Agents for Document Search

Sep 15, 2025

Welcome back to AlteredCraft's Delta Notes! Thank you for joining us as we explore the latest developments in AI engineering. This edition highlights practical techniques for building better AI agents, from Anthropic's evaluation methods to new MCP tools that enhance coding accuracy. We're also examining why startups might want to train their own models and how semantic search is transforming CLI-based document analysis. Let's dive in.

TUTORIALS & CASE STUDIES

Build Better AI Agent Tools Using Evaluations

[Image: an engineer using Claude Code to evaluate the efficacy of agentic tools]

Estimated read time: 15 min

Anthropic shares proven techniques for creating effective tools that AI agents can use reliably. Learn how to prototype, evaluate, and optimize your Model Context Protocol (MCP) tools by collaborating with Claude Code. Key principles include choosing focused tools over many generic ones, returning meaningful context, and prompt-engineering tool descriptions for maximum agent performance.
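To make the "evaluate" step a bit more concrete, here is a minimal sketch of an evaluation loop in Python. It is not Anthropic's harness: the `EvalTask`, `AgentResult`, `run_eval`, and `stub_agent` names are all hypothetical, and the stub stands in for a real agent that would call your MCP tools. The idea is simply to run a fixed set of tasks, compare answers against expected values, and track how many tool calls each task needed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    prompt: str    # what we ask the agent
    expected: str  # the answer we consider correct


@dataclass
class AgentResult:
    answer: str      # the agent's final answer
    tool_calls: int  # how many tool calls it made along the way


def run_eval(agent: Callable[[str], AgentResult], tasks: list[EvalTask]) -> dict:
    """Run every task through the agent and summarize accuracy and tool usage."""
    correct = 0
    total_calls = 0
    for task in tasks:
        result = agent(task.prompt)
        # Case-insensitive exact match; real evals often need a looser grader.
        if result.answer.strip().lower() == task.expected.strip().lower():
            correct += 1
        total_calls += result.tool_calls
    n = len(tasks)
    return {
        "accuracy": correct / n if n else 0.0,
        "avg_tool_calls": total_calls / n if n else 0.0,
    }


def stub_agent(prompt: str) -> AgentResult:
    """Hypothetical stand-in for an agent that uses your tools."""
    canned = {"Who owns ticket 42?": AgentResult("alice", 2)}
    return canned.get(prompt, AgentResult("unknown", 1))


tasks = [
    EvalTask("Who owns ticket 42?", "Alice"),
    EvalTask("Who owns ticket 7?", "Bob"),
]
report = run_eval(stub_agent, tasks)
print(report)
```

Rerunning the same task set after each change to a tool's name, description, or return format gives you a before-and-after comparison instead of a gut feeling.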

Build Custom AI Research Agents for Tech Intelligence

Estimated read time: 10 min

Learn how to create specialized AI agents that scout tech forums and aggregate millions of texts to deliver personalized insights. This tutorial demonstrates building a research agent using prompt chaining with small and large language models, data caching strategies, and structured workflows that outperform generic ChatGPT searches for developer-specific tech trend analysis.

Practical Guide to Choosing Open Source LLMs

Estimated read time: 10 min

This guide explains how to select the right open source LLM for your specific project needs, moving beyond benchmark scores to practical considerations like hardware constraints, deployment complexity, and real-world performance. It introduces AI Sheets for side-by-side model testing with actual data, covers VRAM requirements for different model sizes, and compares inference providers including Groq and Cerebras for production deployments.
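For a rough feel of the VRAM math, here is a back-of-the-envelope sketch in Python. It is not taken from the guide: the `estimate_vram_gb` helper is hypothetical, and the 1.2x overhead factor for activations and KV cache is an assumption that varies a lot with context length and batch size. The core idea is that weight memory is roughly parameter count times bytes per parameter.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(params_billion: float, precision: str = "fp16",
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes).

    overhead is an assumed multiplier for activations and KV cache,
    not a measured value.
    """
    # 1e9 params * bytes per param / 1e9 bytes per GB = params_billion * bytes
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * overhead


for precision in ("fp16", "int8", "int4"):
    print(f"7B @ {precision}: ~{estimate_vram_gb(7, precision):.1f} GB")
```

Treat the result as a lower bound for planning; measure real usage on your own hardware before committing to a model size.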

OpenAI Studies Why Language Models Hallucinate

Estimated read time: 8 min

OpenAI's new research explains that language models hallucinate because evaluation methods reward guessing over admitting uncertainty. For developers building RAG systems and AI agents, this insight suggests prioritizing confidence scores and uncertainty handling in production applications. The paper challenges common misconceptions, showing that smaller models can actually be better calibrated than larger ones.
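The incentive problem is easy to see with a little arithmetic. The Python sketch below is a toy illustration, not code or numbers from the paper: `expected_score` and `should_answer` are hypothetical helpers, and the probabilities are made up. Under plain right-or-wrong grading (no penalty for a wrong answer), guessing always scores at least as well as abstaining; once wrong answers cost points, abstaining becomes the better choice at low confidence.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float,
                  abstain_score: float = 0.0) -> bool:
    """Answer only when the expected score beats saying "I don't know"."""
    return expected_score(p_correct, wrong_penalty) > abstain_score


# Binary grading: wrong answers cost nothing, so even a 10% guess is worth it.
print(should_answer(0.10, wrong_penalty=0.0))
# Penalized grading: a wrong answer costs as much as a right one earns.
print(should_answer(0.10, wrong_penalty=1.0))
print(should_answer(0.60, wrong_penalty=1.0))
```

With a penalty of k per wrong answer, answering only pays off when confidence exceeds k / (1 + k), which is why a confidence threshold is a natural knob for production systems.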

Disciplined AI Software Development Methodology

Estimated read time: 15 min

A structured methodology for collaborating with AI on software projects addresses common issues like code bloat and architectural drift. The four-stage approach uses systematic constraints including 150-line file limits, mandatory benchmarking infrastructure, and focused component implementation. Includes practical tools, example projects, and model-specific guidance for developers using AI assistants.
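A constraint like the 150-line file limit is simple to enforce mechanically. Here is a small Python sketch of such a check. It is not the methodology's own tooling: the `count_lines` and `files_over_limit` helpers and the default `*.py` pattern are assumptions for illustration.

```python
from pathlib import Path

MAX_LINES = 150  # the file-size limit mentioned in the methodology


def count_lines(text: str) -> int:
    """Count lines the way most editors would display them."""
    return len(text.splitlines())


def files_over_limit(root: Path, pattern: str = "*.py",
                     max_lines: int = MAX_LINES) -> list[tuple[str, int]]:
    """Return (path, line_count) for every matching file above the limit."""
    offenders = []
    for path in sorted(root.rglob(pattern)):
        n = count_lines(path.read_text(encoding="utf-8", errors="replace"))
        if n > max_lines:
            offenders.append((str(path), n))
    return offenders
```

Calling `files_over_limit(Path("."))` from a pre-commit hook or CI step would flag oversized files before an AI assistant (or a human) lets them grow further.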

TOOLS

SemTools Supercharges CLI Agents for Document Search

Estimated read time: 8 min

LlamaIndex introduces SemTools, a CLI toolkit that enhances coding agents like Claude Code with semantic search capabilities. Testing on 1,000 arXiv papers showed that combining Unix tools with semantic search significantly improves document analysis accuracy and detail compared to traditional grep-based approaches, making CLI-based agents a powerful alternative to custom RAG implementations.

Chroma's MCP Server Enhances AI Coding Accuracy

Estimated read time: 3 min

Chroma introduces a Model Context Protocol (MCP) server that improves AI coding performance by exposing source code dependencies. The Package Search MCP Server provides three tools for semantic and regex-based code exploration, reducing hallucinations in AI-generated code.

Google Launches MCP Servers for Cloud AI Integration

Estimated read time: 8 min

Google has released open-source Model Context Protocol (MCP) servers that enable AI assistants to interact with Google Cloud services using natural language. The gcloud-mcp server allows developers to automate complex cloud workflows, execute gcloud commands through AI agents, and integrate with popular tools like Claude Desktop, Cursor, and Gemini CLI. Additional servers include observability-mcp for accessing logs, metrics, and traces.

Agent OS Transforms AI Coding Agents Into Trusted Developers

Estimated read time: 15 min

Agent OS is a free, open-source system that changes how AI coding agents understand and work with your codebase. By providing three layers of context (standards, product details, and specs), it helps AI agents write code that consistently matches your style. The framework replaces chaotic prompting with structured workflows, supporting both Claude Code and Cursor while remaining completely customizable to your team's development practices.

Claude API Adds Web Fetch Tool with Security Safeguards

Estimated read time: 5 min

Anthropic introduces a new web fetch tool for the Claude API that lets LLMs retrieve content from URLs while guarding against prompt injection attacks. To prevent data exfiltration, the tool only fetches URLs that were provided by the user or returned by previous search results, not arbitrary URLs generated by Claude. Developers can further enhance security using domain allow-lists.
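The policy described above can be sketched in a few lines of Python. To be clear, this is not the Claude API or its parameters: `is_fetch_allowed`, `seen_urls`, and `allowed_domains` are hypothetical names, and the sketch only illustrates the two checks (URL provenance, then an optional domain allow-list) for anyone building a similar guard in their own agent.

```python
from typing import Optional
from urllib.parse import urlparse


def is_fetch_allowed(url: str, seen_urls: set[str],
                     allowed_domains: Optional[set[str]] = None) -> bool:
    """Allow a fetch only for URLs already present in the conversation.

    seen_urls: URLs the user supplied or that earlier search results returned.
    allowed_domains: optional allow-list; subdomains of a listed domain pass.
    """
    # Provenance check: a URL the model made up was never "seen", so it
    # cannot be used to smuggle data out in a query string.
    if url not in seen_urls:
        return False
    if allowed_domains is None:
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


seen = {"https://docs.example.com/page"}
print(is_fetch_allowed("https://docs.example.com/page", seen, {"example.com"}))
print(is_fetch_allowed("https://evil.test/?q=secret", seen, {"example.com"}))
```

Exact string matching on URLs is deliberately strict here; a production guard would likely normalize URLs before comparing them.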

NEWS & EDITORIALS

AGI Evangelism, Scaling Costs, and Developer Tradeoffs

Estimated read time: 6 min

TechCrunch’s interview with Karen Hao dissects how OpenAI’s AGI-first race reshaped the field, prioritizing speed and scale over algorithmic advances, safety, and efficiency. For developers, it spotlights rising infrastructure spend, data and energy strain, and externalized labor harms, while arguing for domain-specific systems (e.g., AlphaFold) that deliver measurable value with less compute. Practical takeaway: optimize for targeted capability, not maximal model size.

Why AI Startups Must Train Their Own Models

Estimated read time: 8 min

AI startups currently rely on lab APIs, but the barrier to training custom models is rapidly collapsing. With DeepSeek reportedly achieving o1-level reasoning for just $6M through distillation, companies like Cursor are already transitioning from API wrappers to proprietary models. The key insight: whoever controls Token Factor Productivity and gathers user interaction data will dominate the AI application landscape.

Mistral AI Secures €1.7B for Enterprise AI Solutions

Estimated read time: 3 min

Mistral AI announced a €1.7B Series C funding round led by ASML, valuing the company at €11.7B. The investment will accelerate development of custom decentralized frontier AI solutions for complex engineering challenges. This partnership signals growing enterprise demand for tailored AI models beyond general-purpose LLMs, offering developers opportunities in specialized industrial AI applications.
