This guide is for SRE and Infrastructure leaders who need clearer signals on change risk, operational load, and how delivery patterns intersect with reliability outcomes. It focuses on using Metrics → Delivery, Teams → Iterations, Developer Coaching, and gitStream (if enabled), plus reliability/incident metrics where configured.
TL;DR – SRE / Infra / Reliability:
- Use Metrics → Delivery to find flow patterns that increase change risk.
- Use Teams → Iterations (Completed) to track unplanned work and reliability-driven scope shifts.
- Use Developer Coaching to spot workload patterns that signal operational strain.
- If configured, pair incident / reliability metrics with Delivery trends to strengthen your story.
- Use gitStream to standardize safe-change behavior with low noise.
Start here in 15 minutes
- Pick one reliability-critical service or team.
- In Metrics → Delivery, set the time window to the last 4–8 weeks.
- Scan for:
  - Spikes in PR size.
  - Periods with slower Review or Deploy Time.
- Open Teams → Iterations → Completed for the same team and:
  - Estimate how much work was unplanned (operational / incident-driven).
- Write a one-line summary:
  “When X happens in delivery, we see more reliability load / incidents.”
- Use that summary to propose one experiment (e.g., smaller PRs or extra review on a service).
Who this guide is for
This path is for people who:
- Own or influence availability, incident response, and change management.
- Need to show how delivery practices affect reliability and operational load.
- Partner with DevEx, Platform, QA/Release, and PMO.
What you likely care about
- Are change patterns increasing reliability risk?
- Is operational work visible and linked to planning, or hidden as “background noise”?
- Where is unplanned reliability work eroding feature capacity?
- Which low-noise standards reduce risk without slowing flow?
Before you begin
- Git integration and key repos are connected.
- Teams, services, and ownership are clear enough to slice metrics by team or area.
- If available, incident / reliability metrics are configured and mapped to teams/services.
- Developer Coaching is enabled for relevant teams (where available).
- gitStream is enabled on at least some reliability-critical repos (if your org uses it).
Step 1: Use Delivery metrics to identify risky change patterns
Goal: Connect reliability issues to concrete delivery behavior.
Where: Metrics → Delivery
- Select a team or service that has seen incidents or reliability concerns.
- Choose a timeframe that includes recent incidents (e.g., last 4–8 weeks).
- Review:
  - Cycle Time stage trends (especially Review and Deploy Time).
  - PR size patterns and any spikes in large, late changes.
  - Any visible trends around rushes to deploy before cutoffs.
- Mark 1–2 concrete risk signals, such as:
  - “Frequent large PRs merged shortly before deploy.”
  - “Review Time compressed when incident backlog is high.”
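The rushed-large-change signal above can be sketched as a quick script. This is illustrative only: the sample rows, field names, and thresholds below are assumptions, not any product's export format — adapt them to whatever PR data you can pull.

```python
from datetime import datetime, timedelta

# Illustrative PR rows: (id, lines_changed, merged_at, deployed_at).
# Swap in real data from your Git provider or a metrics export.
prs = [
    ("PR-101", 1240, datetime(2024, 5, 6, 16, 30), datetime(2024, 5, 6, 17, 0)),
    ("PR-102", 85,   datetime(2024, 5, 6, 10, 0),  datetime(2024, 5, 7, 9, 0)),
    ("PR-103", 730,  datetime(2024, 5, 7, 17, 55), datetime(2024, 5, 7, 18, 10)),
]

LARGE_PR_LINES = 500              # assumed "large change" threshold
RUSH_WINDOW = timedelta(hours=1)  # merge-to-deploy gap that suggests a rush

def risky(pr):
    """Flag PRs that are both large and deployed shortly after merge."""
    _, lines, merged, deployed = pr
    return lines >= LARGE_PR_LINES and (deployed - merged) <= RUSH_WINDOW

flagged = [pr[0] for pr in prs if risky(pr)]
print(flagged)  # → ['PR-101', 'PR-103']
```

Even a rough flag like this turns “we sometimes rush big changes” into a countable weekly signal you can track alongside incidents.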
Step 2: Make reliability work visible in Iterations
Goal: Show how unplanned reliability work affects delivery capacity.
Where: Teams → Iterations (Completed)
- Open the last few completed iterations for teams covering critical services.
- Review:
  - Unplanned work that came from incidents / reliability tasks.
  - Scope removed or delayed because of operational load.
  - Patterns across iterations (e.g., every sprint loses 20–30% of capacity to incidents).
- Use these patterns to:
  - Quantify reliability work in terms of lost feature capacity.
  - Make the case for more SRE capacity or automation.
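Quantifying lost capacity is simple arithmetic once you have the per-iteration numbers. The figures and field names below are invented for illustration; pull the real ones from Teams → Iterations (Completed).

```python
# Illustrative iteration data: completed work vs. the portion that was
# unplanned (incident- or reliability-driven). Units can be points or items.
iterations = [
    {"name": "Sprint 14", "completed": 40, "unplanned": 9},
    {"name": "Sprint 15", "completed": 38, "unplanned": 12},
    {"name": "Sprint 16", "completed": 42, "unplanned": 11},
]

for it in iterations:
    share = it["unplanned"] / it["completed"]
    print(f'{it["name"]}: {share:.0%} of completed work was unplanned')

# Aggregate share across the window — the "lost feature capacity" headline.
avg = sum(i["unplanned"] for i in iterations) / sum(i["completed"] for i in iterations)
print(f"Average unplanned share: {avg:.0%}")
```

A single percentage (“~27% of our capacity goes to incident-driven work”) is usually the most persuasive line in a staffing or automation proposal.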
Step 3: Use Developer Coaching to spot operational strain
Goal: Find hotspots where a few people carry too much reliability burden.
Where: Developer Coaching (if enabled)
- Look for contributors who:
  - Handle a disproportionate share of reviews or critical PRs.
  - Frequently appear in incident / operational work.
- Compare those hotspots with:
  - High Cycle Time or Rework in their services.
  - Known incident trends.
- Use this to justify:
  - Spreading knowledge via pairing, documentation, or ownership changes.
  - Targeted automation or standards for high-risk areas.
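One way to make “a few people carry too much” concrete is the top reviewer's share of total reviews. This sketch assumes you can tally reviews per person for a service over a period (the names, counts, and the 50% threshold are illustrative):

```python
from collections import Counter

# Illustrative reviewer tallies for one service over a month; replace with
# real counts from Developer Coaching or your Git provider.
reviews = Counter({"ana": 34, "ben": 9, "chris": 7, "dana": 4})

total = sum(reviews.values())
top, top_count = reviews.most_common(1)[0]
top_share = top_count / total

print(f"{top} handled {top_share:.0%} of reviews")
if top_share > 0.5:  # assumed threshold for "concentrated" load
    print("Review load is concentrated; consider pairing or rotating ownership.")
```

If one person holds well over half the reviews for a reliability-critical service, that is both a burnout risk and a single point of failure for incident response.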
Step 4: Pair reliability metrics with delivery trends (if configured)
Goal: Tell a clean “change → incident → improvement” story.
- Identify periods or services with higher incident volume or failure signals.
- Overlay those periods with:
  - Spikes in large or rushed PRs.
  - Increased unplanned work in Iterations.
- Capture 1–2 specific narratives per quarter to bring to leadership and DevEx/QA:
  - “When we tightened review standards and reduced oversized PRs, incidents dropped the next month.”
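The overlay can start as two weekly series and a simple correlation check. The weekly counts below are invented to show the shape of the analysis; use your own Delivery and incident exports, and treat the result as supporting evidence, not proof of causation.

```python
# Illustrative weekly series: oversized/rushed PRs merged vs. incidents opened.
weeks     = ["W1", "W2", "W3", "W4", "W5", "W6"]
large_prs = [2, 5, 1, 6, 2, 7]
incidents = [1, 3, 0, 4, 1, 5]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(large_prs, incidents)
print(f"large-PR weeks vs. incident weeks: r = {r:.2f}")
# A strong positive r supports the "rushed/oversized changes → incidents"
# narrative; pair it with the concrete risk signals from Step 1.
```

Even without the coefficient, plotting the two series side by side per week is often enough for a leadership conversation.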
Step 5: Use gitStream to standardize safe-change patterns
Goal: Turn reliability learnings into guardrails.
Where: gitStream Hub (if enabled)
- Start with patterns that directly reduce risk:
  - Flagging changes in critical services for extra review.
  - Protecting against massive PRs in sensitive areas.
  - Encouraging AI review or additional checks for high-risk files.
- Roll guardrails out to a few services, then expand once teams are comfortable.
- Use Delivery and incident trends to verify impact.
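Guardrails like these are typically expressed as gitStream automations in a `.cm` file. The sketch below is a hedged example, not a drop-in rule set: the service path, reviewer group, label, and size threshold are placeholders, and exact action and filter names should be verified against the gitStream documentation for your version.

```yaml
# .cm file — illustrative guardrails; paths, teams, and thresholds are placeholders.
manifest:
  version: 1.0

automations:
  extra_review_for_critical_service:
    # Flag changes touching a reliability-critical service for extra review.
    if:
      - {{ files | match(regex=r/^services\/payments\//) | some }}
    run:
      - action: add-reviewers@v1
        args:
          reviewers: [my-org/sre-team]
      - action: add-label@v1
        args:
          label: critical-change

  warn_on_oversized_pr:
    # Nudge authors to split very large changes in sensitive areas.
    if:
      - {{ branch.diff.size > 500 }}
    run:
      - action: add-comment@v1
        args:
          comment: "This PR changes 500+ lines — consider splitting it before review."
```

Starting with comment- and label-level automations (rather than hard blocks) keeps the noise low while teams adjust, which matches the roll-out-then-expand approach above.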
Recommended operating rhythm
Weekly
- Review Delivery stage trends for high-risk services.
- Scan Completed Iterations for reliability-driven unplanned work.
- Bring one reliability+flow observation to your platform/DevEx or EM partners.
Monthly / per release
- Summarize how delivery patterns correlated with incidents.
- Agree on one safe-change experiment (standard or automation) to test.
- Update gitStream and team standards based on outcomes.