WiliRGasparetto
MVP Diamond

Threat Emulation in production: how to run “zero-day control” without becoming a bottleneck

Threat Emulation (TE) is one of the strongest controls in the Threat Prevention stack — and also one of the most commonly mis-operated. In the field, I see two extremes:

  • “It’s enabled, but it doesn’t protect anything critical” (traffic never enters the pipeline, broad bypass, overly permissive mode, no governance)
  • “It’s protecting, but it became a bottleneck” (latency, timeouts, emulation failures, ticket storms)

This post is about operating TE as a pipeline: technical flow, delivery modes, failure handling, governance, and practical gates.

 

1) What Threat Emulation really is (in practice)

Threat Emulation is behavior-based sandbox analysis for files, designed for unknown/zero-day threats. The real value is pre-execution: preventing a “live” file from reaching users without a reliable verdict.

The question TE answers best:

“Does this file execute malicious behavior in a realistic environment?”

 

2) End-to-end technical flow (the pipeline you must see)

2.1 Interception and file copy

The gateway intercepts the transfer and creates a copy of the file to submit to TE.
TE only protects what actually enters the pipeline.

Critical point: if relevant traffic bypasses inspection where applicable (for example, HTTPS download paths outside the enforced inspection scope, broad bypass rules, or delivery paths that never traverse the gateway), TE never sees the file.

2.2 Submission to the TE engine (cloud or on-prem)

The file copy is sent to:

  • Cloud sandbox (ThreatCloud/TE cloud), or
  • On-prem TE (appliance/local service)

User experience depends on the delivery mode (Section 3).

2.3 Multi-environment sandbox execution

The file is executed/analyzed across multiple environments (different OS/app stacks), increasing detection and reducing evasion.

Typical behavioral signals include:

  • process chain creation
  • filesystem/registry modifications
  • persistence mechanisms (run keys, tasks, services)
  • outbound callbacks / network activity
  • secondary downloads / dropper behavior

2.4 Verdict and action

  • Malicious: block/prevent per policy + event + artifacts (hash/IOC)
  • Benign: release per delivery mode
  • Inconclusive/failure: this is where operational risk lives (Section 5)
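The verdict-to-action mapping above can be sketched in a few lines. This is an illustrative Python model only (the real decision lives in gateway policy); the `fail_open` flag anticipates the failure-handling choice discussed in Section 5.3, and all names are my own, not product API:

```python
from enum import Enum

class Verdict(Enum):
    MALICIOUS = "malicious"
    BENIGN = "benign"
    INCONCLUSIVE = "inconclusive"   # includes emulation failure

def action_for(verdict: Verdict, fail_open: bool) -> str:
    """Map a sandbox verdict to a delivery action (illustrative logic)."""
    if verdict is Verdict.MALICIOUS:
        return "block"      # block/prevent per policy + event + artifacts (hash/IOC)
    if verdict is Verdict.BENIGN:
        return "release"    # release per delivery mode
    # Inconclusive/failure: the operational risk lives in this branch
    return "release" if fail_open else "block"
```

Note how the whole fail-open vs fail-closed debate collapses into that last branch: everything else is deterministic.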

 

3) What defines success: delivery mode (Maximum Prevention vs Rapid Delivery)

This is a deliberate security vs UX decision.

3.1 Maximum Prevention (pre-delivery)

The file is not delivered until a verdict is returned.

Where it fits

  • privileged users, finance/legal, jump hosts
  • higher exposure segments
  • low tolerance for “first execution” risk

Cost

  • higher perceived latency
  • higher sensitivity to timeouts/emulation failures

3.2 Rapid Delivery (post-delivery)

The file is delivered immediately; TE analyzes in parallel.
In this mode, TE functions more as risk telemetry than as deterministic blocking: a malicious verdict arrives after the file has already reached the user.

Where it fits

  • productivity is the top priority
  • higher latency to cloud sandbox
  • accepted residual risk with strong compensating controls
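Segmenting the two modes by group risk can be sketched as below. Group names and thresholds are placeholders for illustration, not anything the product defines:

```python
def delivery_mode(group: str, critical_groups: set) -> str:
    """Pick a TE delivery mode per user group (illustrative sketch).

    Critical groups wait for a verdict (Maximum Prevention); everyone
    else receives the file immediately while TE runs in parallel
    (Rapid Delivery).
    """
    return "maximum_prevention" if group in critical_groups else "rapid_delivery"

# Hypothetical segmentation matching Section 3.1
CRITICAL = {"finance", "legal", "privileged", "jump_hosts"}
```

The point of writing it down, even as pseudocode, is that the segmentation becomes an explicit, reviewable decision instead of a single global toggle.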

 

4) Threat Extraction as the bridge between security and productivity

Threat Extraction solves the biggest Maximum Prevention pain point:

  • TE analyzes the original file
  • Extraction delivers a sanitized version first (e.g., remove macros/active content, convert to PDF)

Operational rule

  • Extraction keeps business running
  • TE decides “release original” vs “block”
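A minimal sketch of that Extraction-first flow, assuming a simple "convert active content to PDF" sanitization (field names are mine, not product API):

```python
def handle_download(filename: str, has_active_content: bool) -> dict:
    """Deliver a sanitized copy now; hold the original for the TE verdict."""
    if has_active_content:
        # e.g. strip macros / convert to PDF so business keeps running
        sanitized = filename.rsplit(".", 1)[0] + ".pdf"
    else:
        sanitized = filename
    return {
        "delivered_now": sanitized,            # Extraction output, immediate
        "pending_te_verdict": filename,        # TE decides release vs block
    }
```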

 

5) Where environments break (and why this becomes bypass or incidents)

5.1 “TE catches nothing”

Common causes:

  • relevant traffic never enters the pipeline
  • broad category/domain bypass
  • TE enabled but not applied to the real risk flows

5.2 “TE became a bottleneck”

Common causes:

  • Maximum Prevention applied broadly without rings
  • timeouts (latency/link saturation)
  • high file volume spikes (updates, DevOps, VDI)
  • aggressive policy for unsupported files

5.3 Emulation failure becomes an operational backdoor

Failure handling determines real risk:

  • Fail-open (deliver on failure) → less friction, higher exposure
  • Fail-closed (block on failure) → higher security, requires governance/tuning

TAC point: decide and document this explicitly — don’t let defaults decide for you.
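One way to make that decision explicit is a hybrid policy per group: fail-closed where exposure is lowest-tolerance, fail-open elsewhere. A sketch under that assumption (group names illustrative):

```python
def on_emulation_failure(group: str, fail_closed_groups: set) -> str:
    """Hybrid failure policy: block on failure only for high-risk groups.

    Writing this as an explicit function forces the choice to be
    documented instead of inherited from a default.
    """
    return "block" if group in fail_closed_groups else "deliver"

FAIL_CLOSED = {"finance", "privileged"}
```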

 

6) Blueprint that works in production

6.1 Ring-based rollout (mandatory)

  • Ring 0: IT/SecOps
  • Ring 1: business pilot
  • Ring 2: gradual expansion

Gates

  • events per user under control
  • top blocks make sense
  • exceptions have owner/expiry
  • latency within acceptable bounds
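The gates above can be encoded as a pre-expansion check. Thresholds here are placeholders to be tuned per environment, and the metric names are my own:

```python
def ready_to_expand(metrics: dict) -> bool:
    """Gate check before moving to the next ring (illustrative thresholds)."""
    return (
        metrics["events_per_user"] <= 0.5          # events per user under control
        and metrics["top_blocks_reviewed"]         # top blocks make sense
        and metrics["exceptions_without_owner"] == 0  # exceptions have owner/expiry
        and metrics["p95_latency_ms"] <= 3000      # latency within acceptable bounds
    )
```

If any gate fails, stay in the current ring and fix the cause first; expanding past a red gate is how Section 5.2 bottlenecks are born.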

6.2 Decision matrix (security vs UX)

  • general users: Extraction/convert + TE with controlled tolerance
  • critical groups: Maximum Prevention
  • Dev/IT: Rapid Delivery or policy by file type/volume (with telemetry)

6.3 Exception governance (to avoid policy rot)

Every exception needs:

  • owner
  • justification
  • minimal scope (group/app/domain)
  • expiry/review
  • evidence of impact
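Those five fields map naturally onto a tracked record, so stale exceptions can be flagged automatically. A minimal sketch (field names illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PolicyException:
    """One TE exception with the metadata that prevents policy rot."""
    owner: str           # who answers for it
    justification: str   # why it exists
    scope: str           # minimal scope: group/app/domain
    expires: date        # expiry/review date
    evidence: str        # evidence of impact (ticket, event IDs)

    def is_stale(self, today: date) -> bool:
        """An exception past its expiry must be reviewed or removed."""
        return today >= self.expires
```

A weekly job that lists every record where `is_stale()` is true turns "exceptions have expiry" from a slide bullet into an operational control.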

 

7) Minimal evidence pack for TAC-grade discussion

If you want real help here, share (anonymized):

  • gateway version + Jumbo take
  • TE location (cloud/on-prem)
  • mode (Maximum/Rapid)
  • whether Extraction is enabled
  • symptoms (latency? failure? bypass?)
  • timestamp + 2–3 example events
  • impacted apps/sites

 

8) Questions for the community

  1. Do you run Maximum Prevention for everyone or segment by risk? What gates do you use to expand?
  2. How do you handle emulation failures: fail-open, fail-closed, or hybrid per group?
  3. What was your biggest Extraction win: reduced hold time, reduced macros, or fewer exceptions?