
From 50 to 100: What Failing Specs Taught Us About Writing Better Ones

During Sprint 5 we committed to scoring every ticket before development. SL-026 came back with a 50. Here's what we learned — and why by the end of the sprint we were writing 100s on the first try.

Speclint Team

The scoring isn't the product. The behavior change is.

When SL-026 came back with a 50, we could have argued with the scorer. "That's pedantic." "Any developer would understand what we meant." We've all had that conversation. We've all been wrong about it.

Instead, we rewrote the spec. The second version scored 75. We rewrote it again. It scored 100. The implementation took half the time of similarly scoped tickets, and code review caught exactly zero logic errors related to intent. By the end of Sprint 5, we were writing 100s on the first attempt — not because the scorer got easier, but because we finally understood what it was actually measuring.

This post traces that arc. The numbers are real. The failure modes are embarrassing in retrospect and probably familiar to you right now.


Sprint 5 as a Learning Curve

We made a simple rule at the start of the sprint: no ticket goes to development until it scores at least 80. That rule felt bureaucratic when we wrote it. It turned into the most useful forcing function we've had.

The first week was painful. Our average score was 65. Half the tickets needed rewrites. A few needed complete rethinking. The team complained — not loudly, but enough. The complaint was always some version of "the developer knows what I mean." Which is true. Until it isn't. Until you're in a review meeting explaining why the implementation doesn't match the intent, and realizing neither of you can point to a line in the spec that resolves the dispute.

By mid-sprint, something shifted. The rewrites got faster. The feedback patterns started repeating. We weren't learning the scorer — we were learning what made a spec actually unambiguous.

By the final week, SL-036 and SL-027 each scored 100 on the first attempt. We didn't spend extra time on them. We just wrote them differently.


SL-026: The Ticket That Started It

SL-026 was the persona scoring feature: the ability for Speclint to score a spec against a specific buyer persona rather than a generic rubric. It was a feature that mattered. And the spec for it was a mess.

Here's a compressed version of the original acceptance criteria:

> The system should consider the user persona when scoring. The scoring output should reflect how relevant the spec is to the defined persona. The user can see which persona was used.

Score: 50.

The feedback was specific: no measurable outcome, no defined actor for the downstream behavior, and "reflect how relevant" flagged as an untestable assertion. All fair. All fixable. All things we had written without noticing.

We rewrote it with explicit actors, specific outputs, and a defined measurable threshold. Score: 75. Still short, because one AC used "appropriately weighted," another untestable hedge. We cut that phrase and replaced it with a concrete rule. Score: 100.

The implementation that followed was quieter than usual. No Slack messages asking for clarification. No PR comments questioning intent. The spec said what it meant.


The Three Patterns That Kill Scores

Across Sprint 5, the same failure modes appeared on almost every low-scoring ticket. They're not exotic. They're the writing habits every team develops when specs live in a wiki that nobody reads carefully.

1. Vague outcomes ("the user can see the data")

This phrase — or something like it — appeared in seven tickets. It feels like a complete thought. It isn't. "See" doesn't tell you what format the data appears in, what triggers its appearance, whether it updates in real time, or what the empty state looks like. A developer who fills those gaps is making product decisions you didn't intend to delegate.

The fix is mechanical: replace vague visibility with a concrete, observable presentation. Instead of "the user can see the score," write "the user sees a numeric score between 0 and 100 displayed below the spec input, with a color indicator (red <60, yellow 60–79, green 80–100)." Now there's something to test.
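The rewritten AC is mechanical enough to read as code. A minimal sketch of the threshold rule (the function name is ours, not part of any Speclint API):

```python
def score_color(score: int) -> str:
    """Map a 0-100 spec score to the indicator color from the example AC.

    Thresholds mirror the criterion exactly: red <60, yellow 60-79,
    green 80-100. Anything outside the range is rejected.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < 60:
        return "red"
    if score < 80:
        return "yellow"
    return "green"
```

Notice there is no judgment call left in the function. That is the test for a well-written AC: if you can't translate it into an `if` statement or an assertion, it isn't done.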

2. Missing actor ("it should redirect")

"It" is doing a lot of work in most specs. It should redirect. It updates automatically. It displays an error. Who is "it"? The system? The API? The frontend? The cron job?

This matters most when two engineers are working adjacent features and each assumes "it" refers to their component. By the time the bug surfaces, both implementations are complete and neither is obviously wrong.

The fix is also mechanical: every AC gets a named subject. "The API returns a 302 redirect." "The frontend displays a loading state." "The background job marks the record as processed." Subjects make handoffs explicit.

3. Untestable assertions ("the experience feels smooth")

This one's harder to catch because it sounds good. "Feels smooth," "loads quickly," "clearly communicates the error" — these are outcomes that a thoughtful PM cares about. But they're not acceptance criteria. They're design intentions that haven't been converted into testable requirements.

If smooth matters, define it: "The score result renders within 1.5 seconds on a standard broadband connection." If clear matters, define it: "The error message includes the reason for failure and a link to documentation." If you can't define it, you don't know what you're asking for yet — and that's important information to have before development starts.
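Once defined, the assertion becomes a test you can actually run. A hedged sketch of a timing check against the 1.5-second budget from the example above (the harness and names are illustrative; `fetch_score` stands in for whatever call renders the result):

```python
import time

def within_render_budget(fetch_score, budget_s: float = 1.5) -> bool:
    """Time a scoring call against a wall-clock budget.

    fetch_score is any zero-argument callable that performs the
    request; this harness only measures elapsed time. 1.5 s is the
    threshold from the example AC, not a Speclint guarantee.
    """
    start = time.perf_counter()
    fetch_score()
    elapsed = time.perf_counter() - start
    return elapsed <= budget_s
```

"Feels smooth" can't fail a CI run. `within_render_budget` can.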


What a 100-Point Spec Actually Looks Like

Here's SL-027, the batch scoring feature, which scored 100 on the first attempt. With annotations.


Title: Batch Scoring via API — Return per-spec scores with aggregate summary

Context: Teams scoring large backlogs need to submit multiple specs in one API call and receive individual scores plus a batch summary. This enables bulk analysis without multiple roundtrips.

Acceptance Criteria:

  1. [Named actor, specific output] The API accepts a JSON array of up to 50 spec objects in a single POST /v1/score/batch request and returns a response array where each object contains the original spec ID, the numeric score (0–100), and a flags array listing specific failure reasons.

  2. [Measurable threshold, no hedging] If any spec in the batch fails validation (missing required fields), the API returns a 422 with a per-spec error array. Valid specs in the same batch are still scored and returned. The batch does not fail atomically.

  3. [Defined empty/edge state] If the batch contains zero items, the API returns a 400 with error: "empty_batch". If the batch exceeds 50 items, the API returns a 400 with error: "batch_limit_exceeded" and the current limit in the response body.

  4. [Observable system behavior, not user perception] The aggregate summary object includes: total_specs, average_score, scores_above_80 count, and most_common_flag (the flag that appeared most frequently across all specs in the batch).

  5. [Explicit non-goal] This endpoint does not support persona-based scoring. Persona scoring requires individual calls to POST /v1/score.


Every AC has a subject. Every outcome is numeric or enumerable. The edge cases are defined before development, not during code review. The non-goal prevents scope creep from a well-intentioned developer who notices the gap.
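Because every edge case is enumerated, the validation and summary rules translate directly into code. A minimal sketch, assuming the error names from the spec; the function shapes, field names like "text", and the dict-based handler are our illustration, not the actual Speclint implementation:

```python
from collections import Counter

MAX_BATCH = 50  # current limit, per AC 3

def validate_batch(specs):
    """Apply the batch-level rules from ACs 2-3; returns (status, body)."""
    if len(specs) == 0:
        return 400, {"error": "empty_batch"}
    if len(specs) > MAX_BATCH:
        return 400, {"error": "batch_limit_exceeded", "limit": MAX_BATCH}

    errors = [
        {"id": s.get("id"), "error": "missing required fields"}
        for s in specs
        if "id" not in s or "text" not in s
    ]
    if errors:
        # 422 with a per-spec error array; per AC 2, valid specs in the
        # same batch would still be scored (scoring itself not shown here)
        return 422, {"errors": errors}
    return 200, {"ok": True}

def summarize(results):
    """Build the aggregate summary object from AC 4.

    We read "scores_above_80" as strictly greater than 80 -- an
    assumption; the AC would ideally pin this down too.
    """
    flags = [f for r in results for f in r["flags"]]
    return {
        "total_specs": len(results),
        "average_score": sum(r["score"] for r in results) / len(results),
        "scores_above_80": sum(r["score"] > 80 for r in results),
        "most_common_flag": Counter(flags).most_common(1)[0][0] if flags else None,
    }
```

The point isn't that a PM should write this code. It's that a developer could, without asking a single question, because the spec already made every decision the code needs.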

That last point matters more than it looks. Explicit non-goals are a form of documentation that protects everyone — the PM who might get blamed for a missing feature, the developer who might build the wrong thing trying to be helpful, and the reviewer who has to evaluate both.


Why the Trend Matters More Than the Score

We didn't set out to hit 100. We set out to ship features that behaved as intended. The score was a proxy — a fast way to find the specs that had hidden ambiguity before a developer invested a day building against them.

What we didn't expect was the compounding effect. When specs are unambiguous, code review gets faster. When code review gets faster, PRs merge faster. When PRs merge faster, the feedback loop between shipping and learning compresses. By late Sprint 5, we weren't just writing better specs — we were moving faster because of it. The correlation between spec score and implementation quality was hard to ignore and easy to explain: clear input produces clear output.

The behavior change happened somewhere between the third rewrite of SL-026 and the first draft of SL-027. We stopped writing specs that described what we hoped would happen and started writing specs that described what we could verify had happened. That's not a process change. It's a writing habit. And like most writing habits, it compounds.


Score Your Own Specs

If you're not sure where your specs land, find out. Pick the five most recent tickets your team shipped. Score them. Look for the patterns — vague outcomes, missing actors, untestable assertions. They'll be there.

The goal isn't a perfect score. It's finding the failure modes before development does.

→ Start with a free API call at speclint.ai