This is the “production reality” half of the deployment runbook: how to scale safely, keep the environment governable, and avoid the most common causes of MTTR spikes.
1) Gradual Expansion (Production)
1.1 Telemetry + tuning before scaling
Before you expand scope beyond the pilot rings, validate three things with evidence:
- Stability (crash/incident rate in the pilot rings)
- Performance (CPU/IO p95 at peak load)
- Noise (detection and false-positive baseline)
TAC rule: if you can’t show stability + p95 performance + noise baseline, you’re not ready to scale.
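To make that gate mechanical rather than a judgment call, the three numbers can be computed from a pilot telemetry export. Below is a minimal sketch assuming a flat CSV with endpoint_id, cpu_pct, crashed, and detections columns; the column names and data shape are assumptions to adapt to your actual export, not a Harmony Endpoint schema.

```python
import csv
import statistics

def scale_readiness(csv_path: str) -> dict:
    """Compute stability, p95 performance, and noise from a pilot telemetry CSV."""
    cpu_samples, endpoints = [], set()
    crashes = detections = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            endpoints.add(row["endpoint_id"])
            cpu_samples.append(float(row["cpu_pct"]))
            crashes += int(row["crashed"])        # 1 if the agent crashed during the sample
            detections += int(row["detections"])  # raw detection count, the noise proxy
    n = len(endpoints)
    return {
        "stability_crashes_per_100": 100 * crashes / n,
        "performance_cpu_p95": statistics.quantiles(cpu_samples, n=20)[18],  # 95th percentile
        "noise_detections_per_endpoint": detections / n,
    }
```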
1.2 Exceptions management (governance)
An exception must be:
- Scoped by Virtual Group (never global by default)
- Justified (incident / validated false positive / business requirement)
- Time-bounded with a review date (and owner)
Avoid: “global permanent exceptions” for a single application.
Prefer: function-based scoping (e.g., Dev vs Finance) and the smallest possible exception surface.
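One way to keep those properties enforceable is to treat every exception as a structured record and lint it before it ships. A minimal sketch with illustrative field names (this is not a product schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PolicyException:
    virtual_group: str   # scope: never "Global" by default
    justification: str   # incident / validated false positive / business requirement
    owner: str
    review_date: date

    def problems(self) -> list[str]:
        issues = []
        if self.virtual_group.lower() in {"global", "all", "*"}:
            issues.append("global scope: narrow it to a Virtual Group")
        if not self.justification.strip():
            issues.append("missing justification")
        if self.review_date <= date.today():
            issues.append("review date passed: renew with evidence or remove")
        return issues
```

A scheduled job that prints problems() for every active exception is often enough to keep the exception surface small.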
2) Continuous Operations (Day-2)
2.1 Recommended operational cadence
A recurring cadence keeps the environment "boring" (in a good way): review health and coverage dashboards, triage new detections and false positives, and re-check active exceptions against their review dates.
2.2 Controlled upgrades (no component drift)
Golden rule: never change the module (blade) set in the same change window as a client version upgrade.
Why (TAC view): combining an upgrade with a module change multiplies variables and makes RCA unreliable when something breaks.
Best practice: roll upgrades out by rings via the Deployment Policy with the module set frozen, then change modules in a separate, later window.
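If changes flow through any kind of request pipeline, the golden rule is easy to enforce automatically. A sketch of the pre-flight check, with hypothetical field names:

```python
def upgrade_preflight(change: dict) -> None:
    """Reject change requests that bundle a version upgrade with a module change."""
    upgrades_version = change["new_client_version"] != change["current_client_version"]
    changes_modules = set(change["new_modules"]) != set(change["current_modules"])
    if upgrades_version and changes_modules:
        raise ValueError("Split this change: upgrade the client first, "
                         "change modules in a separate, later window")
```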
3) Policy Best Practices (engineering-grade)
3.1 Enforcement strategy by maturity
Practical note: "start restrictive" only works if you have triage capacity and governed exceptions. In many orgs the fastest path is: start stable → harden quickly by waves.
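For illustration, "harden by waves" can be as simple as one group flipping to prevent mode per soak window; the group names and soak length below are assumptions:

```python
from datetime import date, timedelta

def hardening_waves(groups: list[str], start: date, soak_days: int = 14):
    """Yield (group, date) pairs: one group moves to prevent mode per soak window."""
    for i, group in enumerate(groups):
        yield group, start + timedelta(days=i * soak_days)

# Highest-risk groups harden first, broad user populations last.
plan = hardening_waves(["privileged-admins", "finance", "dev", "all-users"], date(2026, 1, 5))
for group, when in plan:
    print(f"{when}: switch {group} from detect-only to prevent")
```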
3.2 Group-based policy (AD / Virtual Groups)
Group policies by:
- Risk (high-risk / privileged)
- Function (dev, finance, third-party)
- Technology (VDI, macOS, specialized endpoints)
This prevents an ungovernable monolithic policy.
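As a sketch of what "non-monolithic" looks like in practice, group assignment can be a small, ordered decision function; the tag and group names here are placeholders:

```python
def policy_group(tags: set[str]) -> str:
    """Map endpoint tags to a policy group: risk first, then technology, then function."""
    if "privileged" in tags:
        return "HighRisk-Privileged"           # risk always wins
    if "vdi" in tags:
        return "Tech-VDI"                      # technology constraints next
    if "macos" in tags:
        return "Tech-macOS"
    for function in ("dev", "finance", "third-party"):
        if function in tags:
            return f"Func-{function.title()}"  # function-based scoping last
    return "Baseline"
```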
3.3 User experience and ticket reduction
3.4 Documentation and change control
Every policy change should capture:
- Reason (incident / false positive / audit requirement)
- Scope (which groups)
- Expected impact (what could break)
- Rollback plan (how to revert safely)
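A change record does not need tooling to be useful; even a plain dictionary with the four fields, validated before the change ships, works. A minimal sketch with hypothetical values:

```python
change_record = {
    "reason": "validated false positive on an internal build tool (hypothetical INC-1234)",
    "scope": ["Func-Dev"],  # which groups; never "Global"
    "expected_impact": "unsigned internal build tools stop being blocked for Dev",
    "rollback": "remove the exception rule and re-push the previous policy version",
}

missing = [k for k in ("reason", "scope", "expected_impact", "rollback")
           if not change_record.get(k)]
assert not missing, f"change record incomplete: {missing}"
```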
4) TAC-Style Runbooks (must exist before go-live)
4.1 “Installed but not visible / policy not applied”
Checklist:
- Is the endpoint in the correct group?
- Is the Deployment Policy hitting the target?
- Are portal connectivity constraints in play (proxy/DNS/SSL inspection)?
- Is the client version compatible with the tenant/policies?
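The connectivity item is the one most worth scripting, because it separates DNS failures from SSL-inspection problems in seconds. A minimal sketch; the portal hostname is a placeholder, so substitute the actual endpoints from your tenant's connectivity documentation:

```python
import socket
import ssl

PORTAL_HOST = "portal.example.invalid"  # placeholder: use your tenant's real portal FQDN

def check_portal_reachability(host: str = PORTAL_HOST, port: int = 443) -> None:
    ip = socket.gethostbyname(host)  # raises on failure -> DNS or proxy problem
    print(f"DNS ok: {host} -> {ip}")
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            issuer = dict(field[0] for field in tls.getpeercert()["issuer"])
            # An unexpected issuer here usually means SSL inspection is re-signing traffic.
            print(f"TLS ok, certificate issuer: {issuer.get('organizationName', 'unknown')}")
```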
4.2 “Performance degraded”
Process:
- Quantify: compare current CPU/IO p95 at peak against the pre-deployment baseline
- Isolate: reproduce in a small test group, changing one variable (module/setting) at a time
- Remediate: apply the narrowest exclusion that resolves it, with an owner and review date
- Re-validate: re-measure p95 in the test ring before expanding the fix
4.3 “False positive on a critical app”
Process:
- Collect evidence (hash, path, signer, behavior)
- Create a granular exception (group + app) with expiration
- Validate in a small ring, then expand
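Step one of that process is mechanical and worth scripting so every report carries the same evidence. A sketch covering hash and path (signer extraction is platform-specific and omitted); the file path is hypothetical:

```python
import hashlib
import json
from pathlib import Path

def evidence(path_str: str) -> dict:
    """Collect the basic false-positive evidence bundle for one file."""
    path = Path(path_str)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()  # chunk this for huge binaries
    return {
        "sha256": digest,
        "path": str(path.resolve()),
        "size_bytes": path.stat().st_size,
    }

print(json.dumps(evidence("C:/Apps/critical/app.exe"), indent=2))  # hypothetical path
```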
5) High-Value Recommendations (incident prevention)
- Do not change modules during upgrades
- Upgrade by rings via Deployment Policy
- Air-gapped/offline: plan packages and manual updates (no improvisation)
- FDE: plan keys/recovery/helpdesk workflows before mass encryption
- VPN + Endpoint on the same host: validate interoperability and impact (latency, split tunneling, DNS)
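For the VPN coexistence item, the impact is measurable before and after the agent (or the VPN) is in the path. A rough sketch that times DNS resolution and the TCP handshake against an internal host; the hostname is a placeholder:

```python
import socket
import time

def probe(host: str = "intranet.example.invalid", port: int = 443, runs: int = 5) -> None:
    """Print DNS and TCP-connect latency; run with VPN down, then up, and compare."""
    for _ in range(runs):
        t0 = time.perf_counter()
        ip = socket.gethostbyname(host)            # DNS path is split-tunnel sensitive
        t1 = time.perf_counter()
        with socket.create_connection((ip, port), timeout=5):
            pass                                   # TCP handshake only
        t2 = time.perf_counter()
        print(f"dns={1000 * (t1 - t0):.1f} ms  connect={1000 * (t2 - t1):.1f} ms")
```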
6) Validation metrics (what Security and IT both need)
- Coverage: % of endpoints active with the expected blades enabled
- Health: crash/incident rate per 100 endpoints
- Performance: CPU/IO p95 at peak
- Efficacy: unique detections, meaningful blocks, response time
- Operations: endpoint MTTR, ticket volume per wave
- Governance: number of active exceptions + average age (stale exceptions = risk)
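The governance metric is the one most often left unmeasured. A sketch that computes it from any list of exception records; the input shape (created/expires dates) is an assumption:

```python
from datetime import date

def exception_governance(exceptions: list[dict]) -> dict:
    """Count active exceptions, their average age, and how many have gone stale."""
    today = date.today()
    active = [e for e in exceptions if e["expires"] >= today]
    ages = [(today - e["created"]).days for e in active]
    return {
        "active_exceptions": len(active),
        "avg_age_days": sum(ages) / len(ages) if ages else 0.0,
        "stale_over_90_days": sum(age > 90 for age in ages),  # stale exceptions = risk
    }
```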
7) Official references
- sk154072 — Harmony Endpoint Client Deployment and Upgrade Best Practice
- sk182659 — Harmony Endpoint Onboarding Best Practices
- Infinity Portal Administration Guide