When a QA or SRE team gets the first batch of load-test numbers, the question is always the same — "is this normal?" Here's how percentiles became the core of SLOs in industry, and how to read them.
Why p95 / p99 instead of the average?
Sort 100 response times from fastest to slowest: p95 is the 95th value, meaning 95% of requests were at least that fast. The mean gets dragged around by outliers without telling you how many users hit them, and for user experience, the slowest 5% are the ones who complain.
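Example: 98 requests at 100ms, two at 10,000ms. Mean = 298ms looks fine, but p99 = 10,000ms reflects the reality that the slowest users waited 10 seconds. (With only one 10,000ms request in 100 samples, the 99th sorted value is still 100ms, so the tail has to be wider than one sample before p99 sees it.) Set SLOs on the mean and you'll almost never breach, and you'll never fix the things that hurt most.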
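To make the mean-versus-tail gap concrete, here is a minimal sketch on hypothetical data: 98 requests at 100ms and two at 10,000ms, chosen so the tail is wide enough to show up at p99. The helper name `percentile_nearest_rank` is illustrative, and it assumes the nearest-rank method (the k-th sorted value), which may not be the method your reporting tool uses.

```python
import math

def percentile_nearest_rank(samples_ms, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-indexed."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical run: 98 fast requests, two very slow ones.
samples = [100] * 98 + [10_000] * 2

mean_ms = sum(samples) / len(samples)
p50_ms = percentile_nearest_rank(samples, 50)
p99_ms = percentile_nearest_rank(samples, 99)

print(f"mean={mean_ms:.0f}ms p50={p50_ms}ms p99={p99_ms}ms")
# mean=298ms p50=100ms p99=10000ms
```

The mean sits at a value that no request actually experienced, while p50 and p99 each describe a real group of users.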
What is Apdex? How do you set T?
Apdex buckets each request against a target response time T: satisfied (≤ T), tolerated (> T and ≤ 4T), frustrated (> 4T). Formula: (satisfied + tolerated / 2) / total. Score range: 0–1.
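The formula is short enough to sketch directly. The function name `apdex` and the sample values below are illustrative, and the boundary handling (exactly T counts as satisfied, exactly 4T as tolerated) is an assumption that follows the usual Apdex convention.

```python
def apdex(samples_ms, t_ms):
    """Apdex score: (satisfied + tolerated / 2) / total."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for s in samples_ms if s <= t_ms)
    tolerated = sum(1 for s in samples_ms if t_ms < s <= 4 * t_ms)
    # Everything slower than 4T is frustrated and contributes zero.
    return (satisfied + tolerated / 2) / len(samples_ms)

# Hypothetical run with T = 500ms:
# 3 satisfied (120, 300, 450), 2 tolerated (800, 1500), 1 frustrated (2500).
score = apdex([120, 300, 450, 800, 1500, 2500], t_ms=500)
print(round(score, 2))  # 0.67
```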
Common T values in industry:
- Backend APIs: T = 200–500ms
- Initial web page load: T = 1s (comfortably inside the 2.5s "good" threshold for LCP)
- Interactive actions (form submit, search): T = 500ms
- Heavy operations (reports, uploads): T = 3–5s
Apdex < 0.7 generally means UX is degrading noticeably; < 0.5 means fewer than half of requests were satisfied and at least some were frustrated.
Three curves you'll see in the wild
Flat: p50 / p90 / p95 are close, e.g. 120 / 180 / 200ms. System is healthy; no meaningful tail up to p95 (check p99 separately if you have enough samples).
Long-tail: p50 = 200ms but p95 = 2000ms. Most requests fly, a few crawl. Usual suspects: slow SQL, flaky external APIs, GC pauses. Priority: chase the tail, not the average — lowering the mean here barely helps anyone.
Avalanche: p50, p90, p95 all jump together — 800 / 1500 / 3000ms. The whole system is saturating, typically CPU or a connection pool. Treatment: add capacity or rate-limit.
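If you want a first-pass label before eyeballing the chart, the three shapes can be roughly told apart from two ratios. This is a sketch only: the function `classify_curve`, the `baseline_p50` input, and both default thresholds (`tail_ratio=3.0`, `shift_ratio=2.0`) are assumptions made for illustration, not values from any standard. Tune them against your own healthy runs.

```python
def classify_curve(p50, p95, baseline_p50, tail_ratio=3.0, shift_ratio=2.0):
    """Label a latency distribution as 'flat', 'long-tail', or 'avalanche'.

    baseline_p50 is the median from a known-healthy run (an assumed input).
    """
    # Avalanche: even the median has moved well above the healthy baseline.
    if p50 >= shift_ratio * baseline_p50:
        return "avalanche"
    # Long-tail: the median is fine, but p95 is far above it.
    if p95 >= tail_ratio * p50:
        return "long-tail"
    return "flat"

# The example numbers from the three cases above, with assumed baselines.
print(classify_curve(p50=120, p95=200, baseline_p50=120))   # flat
print(classify_curve(p50=200, p95=2000, baseline_p50=200))  # long-tail
print(classify_curve(p50=800, p95=3000, baseline_p50=200))  # avalanche
```

The avalanche check runs first on purpose: once the median itself has shifted, the tail ratio stops being the interesting signal.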
Common mistakes
Tiny samples + p99: at n = 100, p99 is decided by the one or two slowest requests, so a single outlier shifts everything. Aim for at least 1,000 samples before quoting p99.
SLO based on the mean: widely cited guidance (Google SRE Book, AWS Well-Architected) steers latency SLOs toward high percentiles such as p95 or p99, not the mean.
Not specifying which percentile algorithm: linear interpolation (R-7, Excel's default and this tool's default) and nearest-rank (R-1) can disagree for small samples, by anything from a few milliseconds to most of the gap between the two slowest requests. Document which one your report uses.
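A small sketch shows how far the two methods can drift apart on a tiny sample. The function names `nearest_rank` and `linear_r7` and the five-sample dataset are illustrative; the formulas follow the standard definitions of R-1 (rank = ceil(p/100 · n)) and R-7 (interpolate at position (n − 1) · p/100, zero-indexed).

```python
import math

def nearest_rank(samples, p):
    """R-1 style: pick the value at rank ceil(p/100 * n), no interpolation."""
    xs = sorted(samples)
    rank = max(1, math.ceil(p * len(xs) / 100))
    return xs[rank - 1]

def linear_r7(samples, p):
    """R-7 style: interpolate linearly between the two surrounding values."""
    xs = sorted(samples)
    h = (len(xs) - 1) * p / 100          # fractional zero-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

# Hypothetical tiny sample with one slow request.
data = [100, 110, 120, 130, 2000]

print(nearest_rank(data, 95))        # 2000
print(round(linear_r7(data, 95)))    # 1626
```

Same data, same percentile, and the two reports differ by several hundred milliseconds. With thousands of samples the gap usually shrinks to noise, which is one more reason to name the method in the report.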
Comparing across mismatched sample sizes: 100 samples at 3 a.m. vs 10,000 samples at peak — different distributions, not comparable. Bucket into equal sample sizes or fixed time windows first.
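One way to do the fixed-window bucketing is sketched below. It assumes each sample is a `(timestamp_seconds, latency_ms)` pair; the function names, the 60-second window, and the `min_samples` cutoff are all illustrative choices, and the percentile uses nearest-rank for simplicity.

```python
import math
from collections import defaultdict

def p95_nearest_rank(values):
    """Nearest-rank p95 of a non-empty list."""
    xs = sorted(values)
    return xs[max(1, math.ceil(95 * len(xs) / 100)) - 1]

def p95_by_window(samples, window_s=60, min_samples=100):
    """Group (timestamp_s, latency_ms) pairs into fixed windows.

    Returns {window_start: {"n": count, "p95": value, "enough": bool}},
    where "enough" flags windows too small to compare fairly.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        window_start = int(ts // window_s) * window_s
        buckets[window_start].append(latency_ms)
    return {
        start: {
            "n": len(values),
            "p95": p95_nearest_rank(values),
            "enough": len(values) >= min_samples,
        }
        for start, values in sorted(buckets.items())
    }

# Hypothetical samples spanning two one-minute windows.
samples = [(0, 100), (10, 200), (59, 300), (60, 400), (61, 500)]
report = p95_by_window(samples, window_s=60, min_samples=3)
for start, row in report.items():
    print(start, row)
```

Only compare windows where `enough` is true; a quiet window with a handful of requests says very little about the tail.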
Wire it into monitoring
Use this tool to read a single test run quickly. For ongoing monitoring, push the same metrics into Grafana / Datadog / New Relic and set alert rules:
- p95 > 500ms for 5 min → Slack notification
- Apdex < 0.7 for 10 min → PagerDuty page
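The "for N min" part matters as much as the threshold: it keeps a single bad minute from paging anyone. Real alert rules live in your monitoring platform's own configuration, so the snippet below is only a sketch of the logic, assuming one metric value per minute; the function name `sustained` and the sample series are hypothetical.

```python
def sustained(series, minutes, breached):
    """True if each of the last `minutes` per-minute values breaches."""
    if len(series) < minutes:
        return False  # not enough history yet to call it sustained
    return all(breached(v) for v in series[-minutes:])

# Hypothetical per-minute metrics, oldest first.
p95_per_min = [420, 510, 530, 560, 610, 590]
apdex_per_min = [0.82, 0.75, 0.69, 0.66, 0.71, 0.68]

slack_alert = sustained(p95_per_min, 5, lambda v: v > 500)
pager_alert = sustained(apdex_per_min, 10, lambda v: v < 0.7)

print("Slack notification:", slack_alert)  # True: last 5 minutes all above 500ms
print("PagerDuty page:", pager_alert)      # False: fewer than 10 minutes of data
```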
Workflow: when JMeter / k6 finishes, paste the response-time column here to eyeball the numbers; if something is on fire, dive into Grafana to find the root cause. That's the standard QA / SRE loop.
Try it now: copy the response-time column from your last JMeter Summary Report into the tool above — p95, p99, and Apdex in 30 seconds.