Overview
- Role: UX Lead
- Team: 5 (me as UX Lead + 4 full-stack engineers)
- Organization: Tecolote Research, a defense contractor
- Timeline: 2025, 5-day sprint
- Live app: Internal tool (not publicly accessible)
- Tools: Figma, JavaScript, Node.js, Chart.js, Perch (internal Claude Sonnet wrapper), Ollama fallback
The Problem
Defense acquisition programs produce Selected Acquisition Reports (SARs) packed with cost, schedule, and performance data. The data sits trapped in static PDF files.
Program managers and cost analysts need to identify trends and compare programs. To do this, they dig through hundreds of reports by hand. Each SAR holds rich data, but the data stays isolated in one document at a time. Seeing patterns across programs is close to impossible.
Analysis that should take hours stretches into weeks. Patterns stay buried. Decision makers work with fragmented visibility into acquisition trends.
"What should take hours of analysis stretches into weeks of manual work. Critical patterns remain invisible."
Discovery and Research
You learn how analysts work by watching them work. I ran semi-structured interviews with cost analysts across several defense programs.
Three findings shaped the build.
Finding 1: Search patterns are predictable.
Analysts look for the same things across reports: cost growth, schedule slippage, milestone comparisons. They need cross-program views, not document-by-document reading.
Finding 2: The bottleneck is access, not analysis skill.
These are expert users. The problem is the time they spend locating and extracting data before analysis starts.
Finding 3: Trust requires traceability.
Every data point needs a source. Analysts will not adopt a tool that gives answers without showing where the answers came from.
Constraints and Adaptation
We scoped the sprint around the F-35 program. The original flow was narrow and templated: upload the latest SAR, view changes in executive summary and total funding between 2017 and 2021. Initial stack was C#, Razor, and Perch.
Two days in, the plan broke. Working through F-35 data with our subject matter expert, we hit subprogram nesting and inconsistent SAR formatting across submission years. The AI failed to reliably locate comparable funding fields. Our SME called the pivot. He had the program knowledge to see we would spend the whole sprint fighting the data instead of proving the concept.
We switched to the DDG 1000 Zumwalt-class destroyer program. Cleaner data, no subprograms, consistent structure across submission years. At the same time, the team hit Razor limitations and moved the stack to JavaScript and Node.js. Two pivots in one day.
Design Approach
The core decision was a hybrid interface. The whiteboard version was pure template: upload, pick years from a dropdown, click Analyze. Clean, but the template forced every analyst through the same question shape. In testing I watched analysts try to ask things the template did not cover. I pushed for conversational entry with scaffolded guardrails:
- A free-text question field at the top
- Example chips showing the kinds of questions the tool handled well
- Optional refinement dropdowns for submission year and SAR section
Analysts asked freely. The interface taught them the tool's range without forcing them to read a manual.
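A minimal sketch of how the scaffolded entry might come together, assuming hypothetical field names (`question`, `year`, `section`) rather than the project's actual form model: the free-text question stays primary, and the optional dropdowns append constraints instead of replacing it.

```javascript
// Illustrative sketch: merge the free-text question with the optional
// refinement dropdowns into a single query string. The dropdowns narrow
// the question; they never force it into a template.
function composeQuery({ question, year, section }) {
  const parts = [question.trim()];
  if (year) parts.push(`Restrict to the ${year} SAR submission.`);
  if (section) parts.push(`Focus on the "${section}" section.`);
  return parts.join(" ");
}

// With no refinements selected, the question passes through untouched.
composeQuery({ question: "How did delivery dates shift?" });
```

The design choice this encodes: refinements are additive, so an analyst who ignores every dropdown still gets a working query.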
Trust anchored before the model speaks
The team designed a Quick Glance panel that populates on upload: ship count, delivery percentage, budget expended, key milestones, program costs. I did not push for this one. In testing, it did critical work. Analysts had a grounded reference to sanity-check AI output against before they read a generated response. Trust started before the model spoke.
Document-grounded responses
Every AI response references the source PDF content. Analysts see the answer and verify it against the original report.
Designing around AI unreliability
Three failure modes showed up in testing.
- The model reversed correct answers when challenged. In one test it narrated "I second-guessed myself when you asked if I was sure," and then gave the wrong answer.
- The model produced different output formats for identical queries. Sometimes a table, sometimes an ASCII chart, sometimes a flat refusal.
- The model referenced previously uploaded files when answering questions about a new one.
We designed around each failure. Source references next to every response. Spec-style prompts to enforce consistent output. The active document always explicit in the UI.
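The third fix can be sketched as a stateless query builder, assuming an illustrative shape for the payload (function and field names here are mine, not the project's internal API): every request carries only the active document's extracted text, with no chat history, so the model has nothing earlier to reference.

```javascript
// Illustrative sketch: each query is stateless and scoped to the active
// document. No conversation history is sent, so the model cannot pull
// in previously uploaded files.
function buildQueryPayload(activeDoc, question, formatSpec) {
  return {
    // System prompt carries the output format spec.
    system: formatSpec,
    // Exactly one message, naming the active document explicitly --
    // the same document the UI shows as active.
    messages: [
      {
        role: "user",
        content:
          `Document: ${activeDoc.filename}\n\n` +
          `${activeDoc.text}\n\n` +
          `Question: ${question}`,
      },
    ],
  };
}

const ddg = { filename: "DDG1000_SAR_2021.pdf", text: "extracted text" };
const payload = buildQueryPayload(ddg, "How did delivery dates shift?", "format rules");
```

Swapping documents replaces the context entirely, which is what makes the "active document" label in the UI trustworthy.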
Prompt Engineering
- Upload: SAR PDF
- Ask: Free text + chips
- Prompt spec: Format rules
- Model: Perch / Ollama
- Render: Table + chart
The biggest shift in my thinking: I stopped treating prompts as instructions and started treating them as component specs. To get consistent output from the AI, I had to specify the output the same way I would specify a Figma component. Exact hex codes. Exact layout structure. Exact typography. Exact semantic color rules.
From the SAR Schedule Analysis Results Format Prompt (excerpted; the full prompt ran several hundred lines):
"When analyzing SAR documents for schedule-related queries, present results in the following standardized format with exact styling. Analysis Summary Box: light blue background (#d1ecf1) with teal left border (#17a2b8). Schedule color coding: red (#dc3545) for significant delays and milestone slips, teal italic (#17a2b8) for delivered items, progressive blue gradient for assessment years from #5B9BD5 to #1E3A8A. Chart.js line chart with borderWidth 3, pointRadius 6, Y-axis integer years with callback formatter, tooltip converts decimal years to MMM YYYY format..."
The prompt specified CSS variables, Chart.js configuration, responsive breakpoints, tooltip callback functions, and semantic color meanings. It ran longer than the spec for most components I have designed. Once I moved to spec-style prompts, output became consistent across runs. Analysts learned the visual language once and read every result fluently.
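A sketch of the chart configuration the prompt describes. The specific values (borderWidth 3, pointRadius 6, integer-year axis, decimal-year tooltips) come from the excerpted prompt; the surrounding structure and the `decimalYearToLabel` helper are assumptions for illustration, not the project's actual code.

```javascript
// Convert a decimal year (e.g. 2019.5) to "MMM YYYY" (e.g. "Jul 2019"),
// as the prompt's tooltip rule requires.
function decimalYearToLabel(value) {
  const year = Math.floor(value);
  const monthIndex = Math.min(11, Math.round((value - year) * 12));
  const months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];
  return `${months[monthIndex]} ${year}`;
}

// Chart.js v3+ style options object matching the prompt spec.
const scheduleChartConfig = {
  type: "line",
  data: { /* milestone series filled in per query */ },
  options: {
    elements: {
      line: { borderWidth: 3 },  // from the prompt spec
      point: { radius: 6 },      // from the prompt spec
    },
    scales: {
      y: {
        ticks: {
          // Y-axis shows integer years only.
          callback: (value) => (Number.isInteger(value) ? String(value) : null),
        },
      },
    },
    plugins: {
      tooltip: {
        callbacks: {
          // Tooltip converts the decimal-year data value to MMM YYYY.
          label: (ctx) => decimalYearToLabel(ctx.parsed.y),
        },
      },
    },
  },
};
```

Because the prompt pins this entire object down, two runs of the same query render the same chart, which is what let analysts learn the visual language once.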
How It Works
- Upload. You drop a SAR PDF. Quick Glance populates with ship count, delivery percentage, budget expended, key milestones, and program costs.
- Ask. You type a question, pick an example chip, or narrow by year and section.
- Analyze. The Node.js backend extracts PDF text and routes the query through Perch (Claude Sonnet) with the format-specific super prompt. Ollama stands by as a local fallback.
- Read. The response renders in the format the prompt specified: comparison table, key changes box, schedule evolution chart. Source references tie every claim back to the PDF.
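The Analyze step's fallback routing can be sketched as follows, assuming hypothetical `callPerch` and `callOllama` client functions (the actual Perch wrapper is internal): the primary path goes to Perch, and any failure falls through to the local Ollama model.

```javascript
// Illustrative sketch: route the formatted prompt to Perch first,
// and fall back to a local Ollama model if the primary call fails.
async function routeQuery(prompt, { callPerch, callOllama }) {
  try {
    // Primary path: Perch (internal Claude Sonnet wrapper).
    return { backend: "perch", answer: await callPerch(prompt) };
  } catch (err) {
    // Fallback path: local Ollama model keeps the tool answering
    // even if the primary service is unreachable.
    return { backend: "ollama", answer: await callOllama(prompt) };
  }
}
```

Tagging each response with the backend that produced it also makes the fallback visible in logs, so a demo that silently degraded to local inference would still be traceable afterward.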
Results and Impact
By the Day 5 showcase, the tool did what we set out to prove. Upload a SAR, ask a question, get a sourced answer in seconds. A cross-year schedule comparison that would normally take an analyst hours of manual PDF review ran end to end in minutes on stage. The demo held on Perch with Ollama standing by. Nothing fell over.
The project validated conversational query with structured prompting for defense acquisition data. Analysts in this domain are trained to distrust ungrounded AI output by default. The tool earned trust anyway.
What I Learned
The prompt is a design artifact. I stopped treating prompts as instructions and started treating them as component specs. Hex codes, callback functions, semantic rules. Once the prompt was specified with the same rigor as a Figma component, output became reliable. Staff-level design work in an AI context means taking ownership of the layer where language meets rendering.
Trust is designed before the model speaks. The Quick Glance panel was not my call. Watching it work taught me something. Analysts did not trust AI output more because the AI got better. They trusted it more because they had a grounded reference to check it against. In regulated domains, the pre-AI moment of the interface matters as much as the AI moment.
Constraint sharpens decisions. We pivoted the program, the stack, and the interface inside five days. Every constraint forced a clearer call. The F-35 data was too messy, so we picked a cleaner program and proved the concept. Razor was painful, so we switched stacks and kept moving. The AI would not produce consistent output, so I rewrote the prompt as a spec. Staff design is not about knowing everything. It is about building the conditions where the person who does know something is heard, and moving fast when they are.