There is a particular moment on any mid-market MSP's L2 shift where a senior engineer — someone we're paying $180,000 a year to think hard about hard problems — is tabbing between fourteen browser windows trying to figure out which customer's Exchange is the one that's actually down. This is not a productivity problem. It is a design problem. And for a long time we treated it as the former.

When we started building what became Espira.io, the working assumption in our team was that the answer to tool sprawl was a better dashboard. A single pane of glass. One more console, but authoritative this time — the screen that would replace the other thirteen.

We were wrong in a specific and instructive way. This is the story of that wrongness.

What we thought the problem was

Every MSP founder has the same wall chart. It looks like this: on the y-axis, tickets per engineer per month. On the x-axis, number of customers. The line is supposed to be flat — that's the promise of the MSP model, that you add customers faster than you add engineers. In practice, the line bends upward at about the fifty-customer mark, and the reason is almost always that your top performers have started spending more time finding the right view than they do fixing the right thing.

So naturally, we thought: build the view. Aggregate every signal from every tool into one place. Make the dashboard so good that engineers live there and never tab out. We called this the federation thesis, and we spent roughly four months building the first version of it.

What happened next was predictable in retrospect. We shipped the federation dashboard to our own NOC first — the right way to dogfood — and watched our engineers use it for three weeks. It was beautiful. Every customer's status, every open ticket, every host up/down, all on one screen.

Our MTTR got worse.

Why a beautiful dashboard made things slower

The thing we missed is the difference between information density and decision density. A federated dashboard gave engineers more information per square inch. It did not give them more decisions per square inch. And in the work of an MSP L2, a decision is always the unit that matters.

Imagine an engineer gets paged. The page tells them: "Customer Synergi, VM PRD-APP-02, CPU steady state at 99% for 12 minutes." In the federation dashboard, that engineer now has to: (1) navigate to Synergi's tenant view, (2) find PRD-APP-02 in the VM list, (3) pivot to that VM's host, (4) pivot to the host's datastore, (5) open the most recent change log, (6) correlate with last night's patch window, and then (7) decide whether to roll back, isolate, or wait.

Every one of those steps is a query. The federation dashboard made each query fast. But the engineer is still running seven queries to reach a decision they could have reached in two if the dashboard had been designed around the decision instead of the data.

This is the same mistake every "single pane of glass" vendor makes. You can't collapse fourteen tools into one by putting fourteen views on one screen. You collapse them by asking: what decision is the engineer trying to make, and what's the shortest path to that decision?

The rewrite: inverting the architecture

Version two of Espira.io inverted the flow. Instead of aggregating state and letting engineers query it, we started with the set of decisions an engineer actually makes during an incident and worked backwards to the UI.

For an L2 engineer working an incident, the decisions are roughly:

  1. Is this real? (Is the alert a false positive, a known flake, or something genuinely new?)
  2. Is this known? (Have we seen this exact signature on this customer or another customer recently?)
  3. Is this correlated? (Is something else we changed or deployed responsible?)
  4. What's the runbook? (Once we've decided what to do, what does the organization's documented procedure for it look like?)
  5. Can I execute it safely? (Do I have the rights, the window, and the rollback?)

Espira.io v2 is organized around those five decisions. The dashboard still federates — there's a VM Status donut, a Host Status donut, a datastore radar — but those views exist to support decision #1 (is this real?) and then get out of the way. Decisions 2 through 5 happen in a different surface: Insight AI, our correlation layer, which reasons over cross-tenant history the moment an alert surfaces and proposes a suggested runbook before the engineer has finished opening the ticket.
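To make the ordering concrete, here is a minimal sketch of the five decisions as a single pass over one incident. It is an illustration under stated assumptions, not Espira.io's implementation: `IncidentContext`, `walk_decisions`, the `cpu-saturation` signature, and the runbook text are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class IncidentContext:
    """Hypothetical bundle of everything the five decisions need."""
    customer: str
    resource: str
    signature: str
    known_flakes: Set[str] = field(default_factory=set)
    seen_signatures: Set[str] = field(default_factory=set)
    recent_changes: Dict[str, str] = field(default_factory=dict)  # resource -> change
    runbooks: Dict[str, str] = field(default_factory=dict)        # signature -> runbook
    can_execute: bool = False


def walk_decisions(ctx: IncidentContext) -> List[Tuple[str, str]]:
    """Answer the five questions in order; stop early if the alert is not real."""
    trail: List[Tuple[str, str]] = []
    if ctx.signature in ctx.known_flakes:
        trail.append(("Is this real?", "no, known flake"))
        return trail
    trail.append(("Is this real?", "yes"))
    known = ctx.signature in ctx.seen_signatures
    trail.append(("Is this known?", "yes" if known else "no"))
    change = ctx.recent_changes.get(ctx.resource)
    trail.append(("Is this correlated?", change or "no recent change found"))
    runbook = ctx.runbooks.get(ctx.signature)
    trail.append(("What's the runbook?", runbook or "none on file"))
    safe = bool(runbook) and ctx.can_execute
    trail.append(("Can I execute it safely?", "yes" if safe else "no"))
    return trail


# The paging example from earlier in the post, with made-up supporting data.
ctx = IncidentContext(
    customer="Synergi",
    resource="PRD-APP-02",
    signature="cpu-saturation",
    seen_signatures={"cpu-saturation"},
    recent_changes={"PRD-APP-02": "patch window last night"},
    runbooks={"cpu-saturation": "roll back last patch"},
    can_execute=True,
)
trail = walk_decisions(ctx)
for question, answer in trail:
    print(f"{question} {answer}")
```

The point of the shape is that the engineer reads one ordered trail of answers instead of issuing a separate lookup for each question.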

The tool sprawl problem turned out not to be "too many tabs." It was "too many queries per decision." When we designed around decisions instead of dashboards, the fourteen tools genuinely did collapse — but they collapsed into a flow, not a screen.

The measurable result

Six months after the v2 cutover, our NOC's MTTR on P1 tickets dropped from 18 minutes to 11 minutes. Our L2 engineers report opening, on average, three distinct tools during an incident, down from the eleven we measured pre-v2. (Fourteen was the theoretical number, pulled from our customer environment audits; our own NOC was slightly below that on a good day.) The engineers also report that the work feels less exhausting. That is the metric that matters, and it is the one we cannot put in a slide.
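For readers who prefer relative numbers, the figures above can be turned into percentage reductions with a few lines of arithmetic. This uses only the numbers quoted in this post; `percent_reduction` is a helper named for this example.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


mttr_drop = percent_reduction(18, 11)   # P1 MTTR, in minutes
tools_drop = percent_reduction(11, 3)   # distinct tools opened per incident

print(f"MTTR reduction: {mttr_drop:.1f}%")    # 38.9%
print(f"Tool reduction: {tools_drop:.1f}%")   # 72.7%
```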


What we'd tell another platform team

Three things, if we were starting over.

One: resist the aesthetic pull of the dashboard. Dashboards are the most interviewable artifact in operations software — they look great in demos, they render nicely in decks, and they give buyers something to point at. But dashboards are artifacts of a query mindset. If you're building for people who run operations, you're building for a decision mindset. These are different architectural problems and they produce different UIs.

Two: measure the queries per decision, not the time per query. "How long does it take to load this view?" is the wrong question. "How many views does an engineer load before they can decide what to do?" is the right one. The first is engineering-tractable and misses the point. The second is vague and, in our experience, the only one that correlates with MTTR.
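One way to make "queries per decision" less vague is to count view loads between the alert and the first action on each incident. The sketch below assumes you can export a simple event log; the `Event` shape, the `view_loaded` and `action_taken` event names, and the incident IDs are hypothetical, not a format any particular tool emits.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    incident_id: str
    timestamp: float  # seconds since the alert fired (assumed unit)
    kind: str         # "view_loaded" or "action_taken" (assumed names)


def queries_per_decision(events: Iterable[Event]) -> Dict[str, int]:
    """Count views loaded before the first action, per decided incident."""
    counts: Dict[str, int] = {}
    decided = set()
    for ev in sorted(events, key=lambda e: (e.incident_id, e.timestamp)):
        if ev.incident_id in decided:
            continue  # ignore anything after the first action
        counts.setdefault(ev.incident_id, 0)
        if ev.kind == "view_loaded":
            counts[ev.incident_id] += 1
        elif ev.kind == "action_taken":
            decided.add(ev.incident_id)
    # Incidents that never reached an action are left out of the metric.
    return {incident: counts[incident] for incident in decided}


# Two made-up incidents: one takes seven views to decide, the other two.
log = [Event("INC-1", float(t), "view_loaded") for t in range(7)]
log.append(Event("INC-1", 7.0, "action_taken"))
log += [
    Event("INC-2", 0.0, "view_loaded"),
    Event("INC-2", 1.0, "view_loaded"),
    Event("INC-2", 2.0, "action_taken"),
    Event("INC-2", 3.0, "view_loaded"),  # after the decision, not counted
]

per_incident = queries_per_decision(log)
average = mean(per_incident.values())
print(per_incident, average)
```

Tracked over time, the average is the number to watch; the per-incident breakdown shows which alert types still force engineers to hunt.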

Three: the AI layer is load-bearing but not central. Insight AI is what closes the gap between "the data is there" and "the engineer has decided." But it's load-bearing in the way a door frame is load-bearing — without it the wall falls, but it doesn't define the room. The room is defined by the decisions. The AI helps get you to them faster.

If you're running an MSP and you've been told the answer to tool sprawl is a better dashboard, the honest answer is: maybe. But before you buy one, count the queries an engineer runs between alert and action. If the number is more than three, no dashboard is going to save you. You need to redesign around the decision.

That's what we did. Our engineers used to open eleven tools per incident. Now it's three.

WRITTEN BY
Karl Adriaenssens
Lead architect at Espira Solutions. Previously ran HPC engineering for University of Alaska Fairbanks and designed cloud migration strategy for the City of Grand Junction. Writes occasionally about the places where platform design and operations meet.