Designing a 70-tool Claude tool-use catalog
Forward's agent loop calls Claude Sonnet 4.6 with a catalog of 70 tools spanning Procore reads, Autodesk drawing search, OneDrive file retrieval, and field-calculation primitives. Here's what we learned organizing it for production.
The shape of the problem
Forward is an SMS / iMessage bot for commercial construction project managers. A foreman texts “status on RFI 142” or “latest revision of sheet A-401” and gets back the answer in 8–15 seconds, sourced from the connected systems (Procore, Autodesk Construction Cloud, Microsoft OneDrive). Every response cites the data it came from inline.
To do this, our agent loop calls Claude Sonnet 4.6 with a catalog of 70 tools. They span:
- Procore reads: RFIs, submittals, drawings, daily logs, cost codes, change orders, schedule activities, subcontractor directory, project list, photos.
- Autodesk Construction Cloud reads: drawing search, sheet metadata, revision history.
- OneDrive reads: document retrieval by name + by content search.
- Field calculations: NEC, IBC, IMC, ASHRAE, ASPE primitives (wire size, voltage drop, HVAC load, ramp slope, etc.).
- Code lookups: spec section text, OSHA citations, IECC climate zones.
- Mutations (gated by a PM approval queue): create RFI, post daily log, upload photo, draft change order.
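For a sense of how a catalog this size stays manageable in code, here is a hypothetical registry shape: tool definitions filed by access-pattern group, each bound to the handler that runs when the model calls it, flattened into the single list sent with every API call. None of these names come from Forward’s codebase.

```python
# Hypothetical registry shape for a large tool catalog.
from typing import Callable

TOOL_GROUPS: dict[str, list[dict]] = {
    "procore": [], "autodesk": [], "onedrive": [],
    "calcs": [], "code_lookups": [], "mutations": [],
}
HANDLERS: dict[str, Callable[[dict], dict]] = {}

def register(group: str, definition: dict):
    """Decorator: file a tool definition under its group and bind its handler."""
    def wrap(fn: Callable[[dict], dict]):
        TOOL_GROUPS[group].append(definition)
        HANDLERS[definition["name"]] = fn
        return fn
    return wrap

def catalog() -> list[dict]:
    """Flatten the groups into the single tools list sent with every API call."""
    return [tool for group in TOOL_GROUPS.values() for tool in group]
```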
Why 70 tools, not 1
We could have shipped a single “query Procore” tool that takes a natural-language query and returns whatever Procore data is relevant. That’s the retrieval-augmented-generation (RAG) shape. We tried it first. It failed in three places:
- Latency: vector-embedding queries against live Procore data added 2–4 seconds to every response. SMS users tolerate ~10 seconds; the RAG layer alone burned 30% of that budget.
- Freshness: a vector store goes stale. Construction data changes hourly — RFI status, daily logs, COR updates. A stale RAG answer is worse than a slow API call.
- Citation: the model could produce plausible-looking source citations to documents that didn’t support the answer. With direct tool calls, the citation is structurally the tool response payload, so the model can’t fabricate it.
The 70-tool catalog forces the model to make explicit reads against live Procore APIs. The trade-off: tool descriptions need to be airtight so the model knows when to use which.
Prompt caching is non-negotiable
With 70 tools, the system prompt + tool catalog is ~28K tokens. At Sonnet 4.6 input pricing, every message would cost ~$0.21 in input tokens alone. With Anthropic prompt caching, the system prompt + tool catalog is cached after the first call — subsequent calls in the 5-minute TTL pay the cached read rate (10% of the input rate). After warm-up:
- First message (cache miss): $0.21 input + $0.018 output
- Subsequent messages (cache hit): $0.021 input + $0.018 output
- Cache hit rate in production: 93%
- Effective per-message cost: $0.034
- At 100 messages/PM/month: $3.40 per PM/month
At $3.40/PM/month in inference, the unit economics work even on a $29/seat plan. Without caching, they don’t.
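A minimal sketch of what the cached call can look like with the Anthropic Python SDK. SYSTEM_PROMPT, TOOLS, and the exact model ID are stand-ins for the real 28K-token prompt and 70-tool catalog; the key detail is the cache_control breakpoint on the system block, which covers the tool catalog plus the system prompt.

```python
# Minimal sketch, assuming the anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()

def call_agent(messages: list[dict]):
    response = client.messages.create(
        model="claude-sonnet-4-6",   # model ID per the post; check the exact API name
        max_tokens=1024,
        tools=TOOLS,                 # identical on every call, so it can be cached
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Breakpoint here caches the whole prefix (tool catalog + system
            # prompt) for the 5-minute TTL; later calls pay the cached-read rate.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=messages,
    )
    # usage shows whether this call wrote the cache or read from it
    u = response.usage
    print("cache_read:", u.cache_read_input_tokens,
          "cache_write:", u.cache_creation_input_tokens)
    return response
```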
Tool grouping by access pattern
We group tools by which APIs they hit, not by which feature they implement. Within the Procore group:
- procore_list_rfis — for “show me open RFIs”-shaped queries.
- procore_get_rfi — for “status on RFI 142”-shaped queries.
- procore_search_rfis — for “RFIs about cable tray”-shaped queries.
Three tools for one entity, because the user’s intent shapes the API call. A single tool with optional params would force the model to figure out which params to set; three tools with non-overlapping descriptions make the choice explicit.
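To make the split concrete, here is roughly what the three definitions might look like in Anthropic tool-schema form. The names match the post; the schemas and description text are illustrative guesses, not the production catalog.

```python
# Illustrative definitions only; the real descriptions follow the fuller
# template described in the next section.
RFI_TOOLS = [
    {
        "name": "procore_list_rfis",
        "description": "List RFIs on the current project, optionally filtered by status. "
                       "Use for 'show me open RFIs'. NOT for a specific RFI number.",
        "input_schema": {
            "type": "object",
            "properties": {
                "status": {"type": "string", "enum": ["open", "closed", "all"]},
                "limit": {"type": "integer"},
            },
        },
    },
    {
        "name": "procore_get_rfi",
        "description": "Fetch a single RFI by its number. Use for 'status on RFI 142'. "
                       "NOT for topic searches.",
        "input_schema": {
            "type": "object",
            "properties": {"rfi_number": {"type": "integer"}},
            "required": ["rfi_number"],
        },
    },
    {
        "name": "procore_search_rfis",
        "description": "Full-text search over RFI subjects and bodies. Use for "
                       "'RFIs about cable tray'. NOT when the user gives an RFI number.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]
```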
Tool descriptions that actually work
Tool descriptions in our catalog follow a template:
```
<one-sentence purpose>

When to use this: <intent shape — phrased exactly how a user
might describe their need>

When NOT to use this: <the adjacent tool that's a better fit,
with a one-line distinguisher>

Returns: <field-by-field schema, no surprises>
```

The “When NOT to use this” line is the most important. The model’s default behavior is to grab the first tool whose description plausibly matches. Telling it explicitly when not to use a tool cut our tool-selection error rate by ~40%.
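Applied to procore_get_rfi, a filled-in description might read something like this (illustrative wording and field names, not the shipped copy):

```
Fetch a single RFI by its number from the current project.

When to use this: the user names a specific RFI, e.g. "status on
RFI 142" or "what did the architect say on 142?"

When NOT to use this: if the user is asking about a topic rather
than a number ("RFIs about cable tray"), use procore_search_rfis.

Returns: rfi_number, subject, status, ball_in_court, created_at,
due_date, latest_response (plain text), link.
```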
The project-disambiguation problem
A texter on a multi-project GC says “RFI 142”. Which project’s RFI 142? We solve this before the tool call, in a project-resolver step:
- Each phone number belongs to a tenant plus a single project (a “dedicated project line”) or to a tenant’s multi-project pool (a shared line).
- Dedicated-line message → project is implicit. Tool calls auto-scope.
- Shared-line message → resolver runs a quick classifier: recent project (last 5 messages from this phone number), explicit project tag in the message, or single-project tenant.
- If ambiguous, the bot replies asking which project before firing any tool call. Costs one round-trip; cheaper than a wrong answer.
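Here is one possible ordering of those signals as a resolver function. The types and names are hypothetical, and the precedence (explicit tag before recent context) is our illustration rather than Forward’s documented order.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Project:
    id: str
    name: str

@dataclass
class Tenant:
    projects: list[Project]

@dataclass
class PhoneLine:
    dedicated_project_id: Optional[str] = None

@dataclass
class Message:
    project_id: Optional[str]

def resolve_project(line: PhoneLine, tenant: Tenant, text: str,
                    recent: list[Message]) -> Optional[str]:
    """Return a project ID to scope tool calls, or None to ask the user first."""
    # 1. Dedicated project line: the project is implicit in the phone number.
    if line.dedicated_project_id:
        return line.dedicated_project_id
    # 2. Explicit project name in the message body.
    lowered = text.lower()
    for p in tenant.projects:
        if p.name.lower() in lowered:
            return p.id
    # 3. Single-project tenant: nothing to disambiguate.
    if len(tenant.projects) == 1:
        return tenant.projects[0].id
    # 4. Recent context: the project from the last 5 messages, if they agree.
    seen = {m.project_id for m in recent[-5:] if m.project_id}
    if len(seen) == 1:
        return seen.pop()
    # 5. Still ambiguous: None, so the bot asks which project before any tool call.
    return None
```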
Approval queue for mutations
Forward never writes to Procore from a Claude tool call directly. Mutations go through a draft state:
- Model issues a draft tool call, e.g. procore_draft_daily_log, with the proposed content.
- Server creates a draft row, returns a draft ID to the model, and SMSes the PM a one-tap approval link.
- PM taps “Approve” on the dashboard or texts back “ok”.
- A separate worker process picks up the approved draft and applies it to Procore via the real write API.
This is non-negotiable for production. The cost of an accidental write to Procore (wrong daily log on the wrong project) is much higher than the friction of a one-tap approval.
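A sketch of how a draft-only tool handler can be wired. The post confirms the flow, not this code; the in-memory store, SMS stub, and approval URL are stand-ins for the real pieces.

```python
import uuid

DRAFTS: dict[str, dict] = {}          # stand-in for the real drafts table

def send_sms(to: str, body: str) -> None:
    print(f"SMS to {to}: {body}")     # stand-in for the real SMS provider

def handle_draft_daily_log(tool_input: dict, project_id: str, pm_phone: str) -> dict:
    """Tool handler for procore_draft_daily_log: never touches Procore."""
    draft_id = str(uuid.uuid4())
    DRAFTS[draft_id] = {
        "project_id": project_id,
        "kind": "daily_log",
        "payload": tool_input,
        "status": "pending",
    }
    send_sms(pm_phone, f"Approve the drafted daily log? https://app.example/approve/{draft_id}")
    # The model only ever sees a draft ID; the approved draft is applied to
    # Procore later by a separate worker via the real write API.
    return {"draft_id": draft_id, "status": "pending_approval"}
```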
Failure modes we hit
- Tool-selection drift after long conversations — mitigated by a conversation-window cap (the model sees only the last 8 turns).
- Over-reliance on cached project context — the bot would assume the current project when a user switched topics. Mitigated by running the project-resolver step on every message, not once per conversation.
- API rate limits — Procore caps at 3,600 req/hr per token. Mitigated by per-tool batched reads + 60s LRU cache on the most expensive reads (sheet metadata).
- Long tool responses overflowing context — tools that can return large lists (RFIs, submittals) accept a limit param and return metadata plus a summary by default; the model requests detail on specific items if needed (sketched below).
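A sketch of the summary-by-default shape for list tools, assuming a generic Procore client whose list_rfis method and field names are illustrative:

```python
# Hypothetical list handler: returns compact summaries so a 200-RFI project
# doesn't blow out the context window; the model calls procore_get_rfi when
# it needs the full record for a specific item.
def handle_list_rfis(client, project_id: str, status: str = "open",
                     limit: int = 20) -> dict:
    rfis = client.list_rfis(project_id, status=status)   # assumed client method
    return {
        "total": len(rfis),
        "returned": min(limit, len(rfis)),
        "items": [
            {"number": r["number"], "subject": r["subject"],
             "status": r["status"], "ball_in_court": r.get("ball_in_court")}
            for r in rfis[:limit]
        ],
    }
```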
Practical takeaways
- Tool catalogs are a system-prompt-shaped problem. Treat tool descriptions like UX copy: tested, iterated, removed when they don’t work.
- Group by access pattern, not by feature. The model picks tools by intent shape, not by your product taxonomy.
- Prompt caching is the difference between economic and uneconomic on large catalogs. Architect for it from day 1.
- Mutations go through an approval queue. Always.
- Disambiguate before you call. A clarifying question is cheaper than a confidently wrong answer.
Forward’s demo line is live and answers Procore questions in real time — text any of the queries above to +1 (682) 300-6750. No signup needed. If you’re building something similar over a different vertical API, happy to talk shop; founder email is josh@getforward.xyz.