Re: Quantum Maestro Troubleshooting in Practice

WiliRGasparetto · ‎2026-02-19

Operational runbook (MHO + SGMs + traffic + VPN) with field commands

If you run Quantum Maestro in production, you’ve probably seen the pattern: issues that “look like VPN” or “look like policy” often turn out to be Security Group health, a single divergent SGM, a physical/link problem (cable/port/optics), or an unstable uplink. The key to reducing MTTR is discipline: evidence + commands, without skipping layers.

Below is a practical “copy-and-run” runbook, with good vs bad interpretation.

1) Mental model

MHO (Orchestrator): controls the Security Group (inventory, health, ports, fabric).
SGMs: run the dataplane (sessions, inspection, VPN, state).
Typical symptom patterns:
- Unhealthy SG → everything becomes a symptom (policy/VPN/traffic).
- Unhealthy single SGM → intermittent behavior (“sometimes it works”).

2) clish vs gclish (why this becomes a real incident)

clish

Local node context.
Useful for inspecting a single member, but risky for configuration in Maestro because it can introduce drift (one member behaving differently from the rest).

gclish

Global Security Group context.
Operational rule:
- use gclish when the intent is global consistency (uniform validation/collection/adjustment);
- use clish only when you need to inspect/act on a specific member in a controlled way.

A recurring field root cause: a change made with clish on a single member → the SGM starts handling traffic differently → intermittent symptoms that are hard to reproduce.
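
One way to make this kind of drift visible is to compare per-member configuration dumps. The sketch below is only an illustration: the function name config_groups, the directory layout, and the file names are hypothetical, and it assumes you have already saved one plain-text dump per member (for example, each member's show configuration output) into a single directory.

```shell
# Sketch: group per-member configuration dumps by checksum.
# Assumption: "$1" is a directory holding one plain-text dump per member
# (hypothetical file names such as 1_01.conf, 1_02.conf).
config_groups() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # cksum prints "<crc> <bytes>"; together they act as a fingerprint
        fp=$(cksum < "$f" | awk '{ print $1 "-" $2 }')
        printf '%s %s\n' "$fp" "$(basename "$f")"
    done | sort | awk '
        { members[$1] = members[$1] " " $2; count[$1]++ }
        END { for (fp in count) printf "%d member(s):%s\n", count[fp], members[fp] }
    ' | sort -n
}
```

Groups are printed smallest first, so a lone divergent member shows up on the first line. A checksum mismatch only says that the dumps differ; some lines may legitimately differ per member, so treat it as a prompt to diff the files rather than as proof of drift.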

3) Fast triage: start with the Security Group

3.1 Global Maestro / Security Group health

On the MHO:

orch_stat -all

What this proves:

- whether all SGMs are present/operational
- whether any member is degraded/missing
- signals of port/fabric issues

Good: all members OK, stable links, no critical port down.
Bad: missing/degraded member, unstable links → fix the foundation before analyzing VPN/policy.

3.2 Security Group sanity check

On the Security Group (an SGM, not the MHO); the same applies to the other asg commands below:

asg diag verify

What this proves: high-level SG consistency and quick integrity checks.
Bad: critical alerts → return to orch_stat -all and isolate the failing member/port.

3.3 Capacity before taking member-level actions

asg perf -v

What this proves: whether the SG has enough headroom (CPU/memory) to absorb load during isolation/actions.
Bad: SG near its limits → avoid disruptive actions.

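3.4 Full health-check report (use with care)

hcp -r all

Note: hcp is the HealthCheck Point diagnostic tool, and -r all runs its full test set and produces a report. It gathers evidence and does not repair state, so it should not be the first “blind” step: run it once the checks above have narrowed the scope.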

4) Physical and link health (where most “bugs” actually start)

When you see intermittency, “traffic disappears,” or only some users/flows fail, first prove whether there is physical/L1–L2 instability.

4.1 Inventory/port-map quick reference

orch_stat -p

or

cat /etc/maestro.json

Use this to confirm interface/port mapping in the Maestro context.

4.2 Counters and drops (all members)

g_all netstat -ni

What to look for: increasing RX-ERR/TX-ERR/drops.
If these counters climb, they often explain VPN flapping, broken sessions, and “policy is OK but traffic fails.”
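
Because “increasing” only means something between two samples, a small helper can do the comparison. This is a sketch under stated assumptions: the function name netstat_err_delta is hypothetical, it expects two saved outputs of plain netstat -ni taken on the same member (not the combined g_all output, whose per-member framing is not parsed here), and it relies on the usual Linux header row starting with Iface.

```shell
# Sketch: report error/drop counters that grew between two saved
# "netstat -ni" outputs from the same member.
# Columns are located by header name, not by position.
# Example collection (file names are placeholders):
#   netstat -ni > before.txt; sleep 60; netstat -ni > after.txt
netstat_err_delta() {
    awk '
        BEGIN { n = split("RX-ERR RX-DRP TX-ERR TX-DRP", names, " ") }
        FNR == 1 { pass++ }                          # 1 = before, 2 = after
        $1 == "Iface" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        !("TX-DRP" in col) || NF < col["TX-DRP"] { next }   # banner or short line
        {
            for (j = 1; j <= n; j++) {
                v = $(col[names[j]]) + 0
                if (pass == 1)
                    before[$1, names[j]] = v
                else if ((($1, names[j]) in before) && v > before[$1, names[j]])
                    printf "%s %s +%d\n", $1, names[j], v - before[$1, names[j]]
            }
        }
    ' "$1" "$2"
}
```

Calling netstat_err_delta before.txt after.txt prints one line per counter that grew, in the form interface, counter name, increase. No output means none of the four counters moved during the interval.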

4.3 Per-interface physical errors (CRC/symbol errors)

ethtool -S <interfacename>

Good: no CRC/errors increasing.
Bad: CRC/symbol errors → treat as L1/L2 (cable/optics/port/switch) before focusing on VPN.
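
The same two-sample idea applies to ethtool -S. The helper below is a rough sketch: the name ethtool_err_delta is hypothetical, and the counter-name filter (crc, symbol, err) is an assumption, since statistic names vary by NIC driver and may need adjusting for your hardware.

```shell
# Sketch: show error-type counters that increased between two saved
# "ethtool -S <interfacename>" outputs for the same interface.
# Example collection (eth1 and the file names are placeholders):
#   ethtool -S eth1 > before.txt; sleep 60; ethtool -S eth1 > after.txt
ethtool_err_delta() {
    awk -F: '
        FNR == 1 { pass++ }                  # 1 = before, 2 = after
        NF < 2 { next }                      # not a "name: value" line
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name !~ /crc|symbol|err/) next
            v = $NF + 0
            if (pass == 1)
                before[name] = v
            else if ((name in before) && v > before[name])
                printf "%s +%d\n", name, v - before[name]
        }
    ' "$1" "$2"
}
```

A line such as a CRC counter with a positive increase during a short interval is the evidence that justifies treating the incident as L1/L2 first.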

4.4 Real link flap (carrier)

asg_ifconfig | grep carrier | grep -v "carrier: 0"

Bad: carrier oscillation → intermittent behavior is highly likely.

4.5 Hardware health (sensors)

g_all cpstat -f sensors os

What this shows: thermal, power, and fan readings. Out-of-range values can cause instability and erratic behavior.

4.6 Maestro port state

show maestro port <port>

Confirms the port’s state/configuration in the Maestro domain.

5) The turning point: “no log” — does the traffic exist in the SG dataplane?

This step quickly separates “problem before the gateway” from “problem inside the gateway.”

5.1 Prove the session/connection on the SG

Example (intentionally generic IPs):

asg search -v 10.10.40.25 \* 203.0.113.50 443 tcp

Interpretation:

- No output: traffic is likely not reaching the SG (or it is taking a different path). Return to L1/L2/L3 and capture at the correct point.
- Output present: traffic exists in the dataplane; you now have a basis to correlate with NAT, routing, policy, and VPN.

If the connection does not “exist” for the SG, changing policy/VPN is usually wasted effort.

6) Single-SGM failure: how to investigate and restore consistency

Typical symptom: intermittent failures, “some flows drop,” “works after some time.”

6.1 Controlled action to reintegrate a suspected member (when needed)

On the suspected SGM:

clusterXL_admin down
clusterXL_admin up

Risk: medium (sessions anchored to that member can be impacted).
Pre-condition: confirm headroom with asg perf -v.
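
To make “headroom” concrete, a back-of-the-envelope estimate helps: with N active members at an average CPU utilization of u percent, taking one member down leaves roughly u × N / (N − 1) percent on each remaining member. The sketch below just does that arithmetic. The function name projected_load is hypothetical, the inputs are numbers you read yourself (for example from asg perf -v), and the even-redistribution assumption is an idealization, since real distribution depends on the traffic mix.

```shell
# Sketch: estimate per-member CPU after removing one member,
# assuming load is redistributed evenly across the remaining members.
#   $1 = number of active members, $2 = current average CPU in percent
projected_load() {
    awk -v n="$1" -v u="$2" 'BEGIN {
        if (n < 2) { print "n/a"; exit 1 }   # no remaining member to absorb load
        printf "%.1f\n", u * n / (n - 1)
    }'
}
```

For example, projected_load 4 60 prints 80.0, while projected_load 3 70 prints 105.0, which means the remaining members would be oversubscribed and the member should not be taken down during peak load.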

6.2 Check state and drift indicators

cphaprob list
tail $FWDIR/log/blade_config

What to look for:

- cphaprob list: HA/cluster participation/state signals and inconsistencies
- blade_config: alerts and errors that indicate configuration drift

Closing

Maestro troubleshooting requires discipline: start with SG health, then validate physical and link stability, then prove the traffic exists in the dataplane, and only then go deeper into policy and VPN. If you follow this sequence with objective commands, “phantom incidents” drop sharply, and troubleshooting becomes engineering, not guesswork.
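
Keeping the evidence from each step is easier with a small wrapper that records what was run, when, and with what result. This is a minimal sketch, not part of any Check Point tooling: the function name collect_evidence and the output layout are hypothetical, and it assumes each command is available to a non-interactive sh on the node where you run it.

```shell
# Sketch: run each command in order and save its output, start time,
# and exit status to a numbered file under the given directory.
#   $1 = output directory, remaining arguments = one command string each
collect_evidence() {
    outdir=$1; shift
    mkdir -p "$outdir" || return 1
    i=0
    for cmd in "$@"; do
        i=$((i + 1))
        out=$(printf '%s/%02d.txt' "$outdir" "$i")
        {
            printf '# command: %s\n' "$cmd"
            printf '# started: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
            rc=0
            sh -c "$cmd" 2>&1 || rc=$?
            printf '# exit status: %s\n' "$rc"
        } > "$out"
    done
}
```

A call such as collect_evidence /var/tmp/incident "asg diag verify" "asg perf -v" (the directory is a placeholder) leaves 01.txt and 02.txt behind, so the outputs can be attached to a ticket or compared against a later run.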

israelfds95 · ‎2026-02-19

very good, very useful

WiliRGasparetto · ‎2026-02-19

Thanks, Israel.

WiliRGasparetto · ‎2026-02-20

Throughout my career I’ve learned to start with the fundamentals, because 90% of problems are solved there. In Maestro’s case, it’s no different — most of the issues I’ve resolved happened because the analyst didn’t know how to differentiate between clish and gclish, which ended up causing misconfigurations.

the_rock · ‎2026-02-20

Awesome work. By the way, I could not agree more with what you said. I can't even count how many times I've been on calls with people where something very simple turned out to solve the issue in the end.

Best,
Andy
"Have a great day and if its not, change it"

WiliRGasparetto · ‎2026-02-20

I’ve already seen troubleshooting cases that lasted for days turn out to be just a simple VLAN issue. Usually, people miss the basics, focus on the more complex aspects, and forget to check the fundamentals.

Serge_Wuethrich · ‎2026-02-20

Good and useful guideline in general. I would just like to point out that a few of your commands (asg diag, asg perf, asg search) no longer exist in R82; they have been moved to insights or cluster-cli.

Check the release notes for more changes regarding Maestro:

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_RN/Content/Topics-RN/Software-Chan...

WiliRGasparetto · ‎2026-02-20

Thank you very much for the tip. I still haven’t had the opportunity to work with R82 on Maestro, so I’ll take a look. It will be good for me to understand the differences between them and, perhaps, even update the title of this topic to R81.20, since it may indeed be obsolete in R82.

Dom_Galvao · ‎2026-02-20

very good content with practical examples.

emmap · ‎2026-02-23

This guide seems to conflate the MHOs and the SMO. The SMO is not an orchestrator, it's an SGM.

I don't think there's a file called /etc/maestro.json. For a port inventory at the MHO you would use orch_stat -p, or the MHO WebUI in R82+.

That 'last resort' of just deleting the security group with no follow up is terrible advice. What's going on there, you're just going to remove the group entirely and give up? Please review this and make sure you're not suggesting steps that will cause massive problems. There are many other things that can be attempted in a troubleshooting context before going nuclear here.

_Val_ · ‎2026-02-23

@emmap /etc/maestro.json file is mentioned in sk164712.

About removing a security group, I agree that would be a very bad move in a production environment.

My understanding is, @WiliRGasparetto is writing this based on his lab trials.

emmap · ‎2026-02-23

Yep, you're right that is a file, my mistake.

WiliRGasparetto · ‎2026-02-23

I’m going to remove that step, and I’ll look for better approaches. I included it only as a last-resort option when there was truly no solution and in coordination with Check Point TAC, but presenting it as a standard solution was a bad idea. Thank you very much for the feedback.

WiliRGasparetto · ‎2026-02-23

I also added the command `orch_stat -p` as the first option and then the verification with `cat /etc/maestro.json`. I found your point very helpful, @emmap .

batata · ‎2026-02-27

Nice

WiliRGasparetto · ‎2026-03-02

Thanks.