Scope: Multi-Domain Security Management (MDSM), covering the Multi-Domain Server (MDS) and Domain Management Servers (DMS). Focus on HA behavior, sync/replication, evidence-driven troubleshooting, and safe operational actions.
Operational thesis
In MDSM, most incidents look like “SmartConsole/publish/sync problems,” but MTTR typically spikes because:
- engineers collect logs in the wrong context (MDS vs Domain), and/or
- they perform disruptive actions (restart/failover) before collecting minimum evidence.
1) Architecture mental model (what matters in production)
1.1 MDS vs DMS (key concepts)
1.2 HA reality (where teams commonly misread behavior)
- DMS HA is per-domain Active/Standby: typically one Active DMS per domain and one or more Standby peers.
- MDS HA (infrastructure layer): commonly deployed as Primary/Secondary (design dependent) with domain distribution for load.
TAC rule: “MDS up” does not mean “Domain up.” Always validate per domain.
2) Rule #1: Context correctness (mdsenv) — without this you’re blind
2.1 List domains and overall health
✅ To list domains and overall process status:
mdsstat
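2.2 What mdsenv actually does (correction)
mdsenv switches the current shell environment to the named Domain, so that variables such as $FWDIR point at that Domain. It does not list domains; use mdsstat for that.
mdsenv <DomainName>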
2.3 TAC verification: prove you are in the correct Domain
Immediately after switching context, validate:
echo $FWDIR
echo $MDS_FWDIR
Practical interpretation
- After mdsenv, the value printed by echo $FWDIR must point to the Domain you intend to troubleshoot, not to the MDS level or to another Domain.
- If you don’t validate context, you will tail the wrong logs and chase ghosts.
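The context check can be scripted so that it fails loudly instead of relying on a visual read of the path. The following is a minimal sketch, assuming that Domain directories sit under a path of the form .../customers/<DomainName>/...; verify that layout on your own system first. The function name in_domain_context is hypothetical and is not a Check Point tool.

```shell
# Returns 0 if the given FWDIR value appears to belong to the given Domain.
# ASSUMPTION: Domain paths contain "/customers/<DomainName>/".
in_domain_context() {
    domain="$1"
    fwdir="${2:-${FWDIR:-}}"   # default to the current shell's $FWDIR
    case "$fwdir" in
        */customers/"$domain"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is to run mdsenv <DomainName> and then in_domain_context <DomainName> || echo "wrong context", refusing to collect evidence until the check passes.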
3) “15-minute triage” runbook (evidence-first)
3.1 Baseline health (MDS level)
mdsstat
What you’re looking for:
Do not default to “wait 10 minutes and restart.” First determine which domain/layer is failing and why.
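To narrow down which domain or layer is failing, the mdsstat table can be filtered for rows where some process is not reported as up. This is a sketch only: it assumes mdsstat prints a pipe-delimited table whose rows start with a Type column (MDS or CMA), then Name, then IP address, then one column per process. The function name mdsstat_not_up is hypothetical. Any state that does not start with "up" is flagged, so states that are legitimate in your deployment may need to be excluded.

```shell
# Reads mdsstat output on stdin and prints "<Type> <Name>" for every row
# where at least one process column does not start with "up".
# ASSUMPTION: columns are Type | Name | IP address | process states...
mdsstat_not_up() {
    awk -F'|' '
        NF >= 7 {
            type = $2; name = $3
            gsub(/ /, "", type)
            gsub(/^ +| +$/, "", name)
            if (type != "MDS" && type != "CMA") next   # skip header rows
            for (i = 5; i <= NF; i++) {
                state = $i
                gsub(/^ +| +$/, "", state)
                if (state != "" && state !~ /^up/) { print type, name; next }
            }
        }'
}
```

Typical use would be mdsstat | mdsstat_not_up, followed by entering each reported Domain with mdsenv.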
3.2 Enter the affected domain
mdsenv <DomainName>
echo $FWDIR
3.3 Live log tails (while reproducing)
From the correct domain context:
tail -F $FWDIR/log/fwd.elg
tail -F $CPDIR/log/cpd.elg
Management logs (correction: cpm.elg may not exist everywhere)
$FWDIR/log/cpm.elg is not guaranteed in all versions/architectures. In some cases, management-side logging you need may be in other files (e.g., asm.elg).
TAC rule: first confirm what exists:
ls -lh $FWDIR/log | egrep "cpm|asm"
Then tail the relevant file:
tail -F $FWDIR/log/<existing_file>.elg
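The "confirm what exists, then tail it" step can be wrapped in a small helper. This is a minimal sketch: first_existing_log is a hypothetical name, and the candidate file names are only the ones mentioned above.

```shell
# Prints the path of the first candidate file that exists in the log
# directory; returns 1 if none of them exist.
first_existing_log() {
    log_dir="$1"
    shift
    for name in "$@"; do
        if [ -f "$log_dir/$name" ]; then
            printf '%s\n' "$log_dir/$name"
            return 0
        fi
    done
    return 1
}
```

From the correct Domain context, one way to use it is tail -F "$(first_existing_log "$FWDIR/log" cpm.elg asm.elg)".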
4) Stop/Start a specific Domain (surgical and correct)
Correction: use Domain name (not IP)
Use the Domain/DMS name as shown in mdsstat:
mdsstop_customer <DMS_Name>
mdsstart_customer <DMS_Name>
mdsstat
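Because a restart can mask the root cause, it helps to copy the Domain's current logs aside before running mdsstop_customer. The sketch below is one possible way to do that. The function name snapshot_logs and the evidence directory layout are hypothetical, and it copies only *.elg files from the directory you pass in.

```shell
# Copies all .elg files from a log directory into a new timestamped
# folder under dest_root and prints the folder path.
snapshot_logs() {
    src_dir="$1"      # e.g. "$FWDIR/log" in the Domain context
    dest_root="$2"    # hypothetical evidence location chosen by you
    stamp=$(date +%Y%m%d_%H%M%S)
    dest="$dest_root/evidence_$stamp"
    mkdir -p "$dest" || return 1
    for f in "$src_dir"/*.elg; do
        if [ -f "$f" ]; then
            cp -p "$f" "$dest/"    # -p keeps timestamps for correlation
        fi
    done
    printf '%s\n' "$dest"
}
```

For example, after mdsenv <DomainName>, running snapshot_logs "$FWDIR/log" /var/tmp preserves the pre-restart state before the Domain is stopped.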
TAC discipline
5) Sync/replication drift (when to suspect it, how to approach)
Typical symptoms:
- objects/policy appear inconsistent between peers
- policy install behaves differently across peers
- behavior diverges after change/upgrade
TAC approach
- confirm that changes were actually Published (unpublished changes will not replicate)
- compare logs and state on both peers (Active vs Standby)
- avoid “restart as a sync method”; use the HA/sync workflow in SmartConsole when drift is confirmed
6) Backup/Restore in HA (don’t create a disaster)
Commands:
mds_backup
mds_restore
Mandatory post-restore validation (correction)
After mds_restore, validate health before releasing the platform:
mdsstat
Then validate critical domains:
mdsenv <DomainName>
echo $FWDIR
# verify key processes and relevant logs in the proper context
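One way to make the post-restore check less error-prone is to compare a saved copy of the mdsstat output against the list of Domains you expect to exist. The sketch below assumes the same pipe-delimited mdsstat table layout (Type, then Name) and that Domain rows carry the type CMA. The function name missing_domains is hypothetical.

```shell
# $1: file containing saved mdsstat output
# remaining arguments: Domain names that must be present after restore
# Prints every expected Domain that has no matching CMA row.
missing_domains() {
    stat_file="$1"
    shift
    for d in "$@"; do
        if ! awk -F'|' -v want="$d" '
            { t = $2; n = $3; gsub(/ /, "", t); gsub(/^ +| +$/, "", n) }
            t == "CMA" && n == want { found = 1 }
            END { exit !found }' "$stat_file"; then
            printf '%s\n' "$d"
        fi
    done
}
```

A possible flow is mdsstat > /var/tmp/mdsstat_after_restore.txt, then missing_domains /var/tmp/mdsstat_after_restore.txt <DomainName> ..., and only then the per-Domain mdsenv checks. An empty result means every expected Domain has a row; it says nothing about process health.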
7) Kernel debug (advanced) — never run broad, always filter
Do not run “wide open” debug in production. Prefer module-scoped debug:
fw ctl debug -m <module> <flags>
Example (generic drop visibility, short window):
fw ctl debug -m fw + drop
fw ctl kdebug -f
# reproduce for 30–60s
fw ctl debug 0
TAC rule: unfiltered kernel debug + long duration = performance risk + low signal + messy RCA.
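The short-window pattern can be enforced by a wrapper that always resets the debug flags after a fixed time. This is a sketch under assumptions: the function name timeboxed_drop_debug, the default 30-second window, and the output path are illustrative choices, and only the fw ctl commands listed above are used.

```shell
# Enables drop debug in the fw module, captures kernel debug output for a
# fixed number of seconds, then resets the debug flags.
timeboxed_drop_debug() {
    seconds="${1:-30}"
    out="${2:-/var/log/kdebug_drop.txt}"   # hypothetical output path

    fw ctl debug -m fw + drop              # module-scoped flag only
    fw ctl kdebug -f > "$out" 2>&1 &       # stream debug output to a file
    pid=$!

    sleep "$seconds"                       # reproduce the issue in this window

    kill "$pid" 2>/dev/null || true        # stop the capture
    wait "$pid" 2>/dev/null || true
    fw ctl debug 0                         # reset flags to defaults
}
```

For example, timeboxed_drop_debug 45 /var/tmp/drop_debug.txt captures for 45 seconds. The design choice here is that the reset is part of the same function as the enable step, so a forgotten fw ctl debug 0 is less likely.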
8) Recurring pitfalls (what causes repeated escalations)
- Tailing logs without mdsenv <Domain> → wrong context, wrong evidence
- Assuming mdsenv lists domains → it doesn’t; mdsstat does
- Restarting a domain prematurely → masks root cause (storage/locks/connectivity)
- Restoring without validating mdsstat → partial recovery → cascading incidents
- Running broad kernel debug in production → instability + unusable data