Multi-Domain (MDS/DMS): HA, Sync, and Advanced Troubleshooting (Field Runbook)

WiliRGasparetto · ‎2026-03-27

Scope: Multi-Domain Security Management (MDS + DMS). Focus on HA behavior, sync/replication, evidence-driven troubleshooting, and safe operational actions.

Operational thesis

In MDSM, most incidents look like “SmartConsole/publish/sync problems,” but MTTR typically spikes because:

- engineers collect logs in the wrong context (MDS vs Domain), and/or
- they perform disruptive actions (restart/failover) before collecting minimum evidence.

1) Architecture mental model (what matters in production)

1.1 MDS vs DMS (key concepts)

MDS (Multi-Domain Server): hosts the multi-domain framework and multiple Domain Management contexts.
DMS (Domain Management Server / Domain): manages objects/policy/gateways for a specific domain.

1.2 HA reality (where teams commonly misread behavior)

DMS HA is per-domain Active/Standby: typically one Active DMS per domain and one or more Standby peers.
- “Failover” is operational/manual (promotion of Standby to Active). Do not assume automatic role switching.
MDS HA (infrastructure layer): commonly deployed as Primary/Secondary (design dependent) with domain distribution for load.

TAC rule: “MDS up” does not mean “Domain up.” Always validate per domain.

2) Rule #1: Context correctness (mdsenv) — without this you’re blind

2.1 List domains and overall health

✅ To list domains and overall process status:

mdsstat

2.2 What `mdsenv` actually does (correction)

mdsenv without arguments returns the shell to the MDS-level context; it does not list domains.
To switch into a specific Domain context, pass the Domain name as shown in mdsstat:

mdsenv <DomainName>

2.3 TAC verification: prove you are in the correct Domain

Immediately after switching context, validate:

echo $FWDIR
echo $MDS_FWDIR

Practical interpretation

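$FWDIR must point into the Domain you intend to troubleshoot, not the MDS-level directory. On typical installations the Domain path sits under the MDS customers/ directory and contains the Domain name.
If you don’t validate context, you will tail the wrong logs and chase ghosts.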
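
The check above can be made mechanical. The following is a minimal sketch, not an official tool: in_domain_context is a hypothetical helper name, and it assumes that a Domain's $FWDIR path contains the Domain name as a path component (confirm this on your own system before relying on it).

```shell
# Hypothetical helper: succeed only when $FWDIR references the expected Domain.
# Assumption: the Domain's $FWDIR contains the Domain name as a path component.
in_domain_context() {
    expected="$1"
    if [ -z "$expected" ]; then
        echo "usage: in_domain_context <DomainName>" >&2
        return 2
    fi
    case "${FWDIR:-}" in
        */"$expected"/*)
            return 0 ;;
        *)
            echo "WARNING: FWDIR='${FWDIR:-unset}' does not reference '$expected'" >&2
            return 1 ;;
    esac
}

# Usage on the MDS (not executed here):
#   mdsenv <DomainName>
#   in_domain_context <DomainName> && tail -F "$FWDIR/log/fwd.elg"
```

Gating the tail on the check means a failed context switch produces a warning instead of evidence from the wrong Domain.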

3) “15-minute triage” runbook (evidence-first)

3.1 Baseline health (MDS level)

mdsstat

What you’re looking for:

- whether the issue is global (MDS plane) vs isolated to one domain
- domains with abnormal state
- key daemons not running

Do not default to “wait 10 minutes and restart.” First determine which domain/layer is failing and why.
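
To make the "which domain/layer is failing" decision less dependent on eyeballing a long table, the mdsstat output can be filtered for rows that report a stopped process. This is a rough sketch under one assumption: that a stopped process is shown with the word "down" in its column. Verify that against the output of your version; flag_down_rows is a hypothetical helper name.

```shell
# Hypothetical helper: print only the rows of mdsstat-style output (read from
# stdin) that contain the word "down".
# Assumption: stopped processes are reported as "down" in their column.
flag_down_rows() {
    grep -w "down"
    # grep exits 1 when nothing matches; treat "no rows flagged" as success
    [ $? -le 1 ]
}

# Usage on the MDS (not executed here):
#   mdsstat | flag_down_rows
```

An empty result only means no row matched the assumed keyword; it is not proof that every Domain is healthy.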

3.2 Enter the affected domain

mdsenv <DomainName>
echo $FWDIR

3.3 Live log tails (while reproducing)

From the correct domain context:

tail -F $FWDIR/log/fwd.elg
tail -F $CPDIR/log/cpd.elg

Management logs (correction: `cpm.elg` may not exist everywhere)

$FWDIR/log/cpm.elg is not guaranteed to exist in all versions/architectures. In some cases, the management-side logging you need is in other files (e.g., asm.elg).

TAC rule: first confirm what exists:

ls -lh "$FWDIR/log" | grep -E "cpm|asm"

Then tail the relevant file:

tail -F $FWDIR/log/<existing_file>.elg
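
The "confirm what exists, then tail" step can be folded into one helper so the tail never starts on a missing file. A minimal sketch: first_existing_log is a hypothetical name, and the candidate file names are just the ones mentioned above.

```shell
# Hypothetical helper: print the first existing, non-empty log among the candidates.
# Usage: first_existing_log <log_dir> <name> [<name>...]
first_existing_log() {
    dir="$1"
    shift
    for name in "$@"; do
        if [ -s "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    echo "none of the candidate logs exist in $dir" >&2
    return 1
}

# Usage in the Domain context (not executed here):
#   log=$(first_existing_log "$FWDIR/log" cpm.elg asm.elg) && tail -F "$log"
```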

4) Stop/Start a specific Domain (surgical and correct)

Correction: use Domain name (not IP)

Use the Domain/DMS name as shown in mdsstat:

mdsstop_customer <DMS_Name>
mdsstart_customer <DMS_Name>
mdsstat

TAC discipline

- collect minimum evidence first (status + logs + timestamp window)
- treat this as a controlled change (can impact publish/install for that domain)
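
One way to make "status + logs + timestamp window" repeatable is a small wrapper that stores each command's output under a UTC timestamp before anything is stopped. This is a sketch only: snapshot is a hypothetical helper and /var/log/evidence is an assumed directory, not a product default.

```shell
# Hypothetical helper: run a command and save its output under a UTC-timestamped header.
# Usage: snapshot <evidence_dir> <label> <command> [args...]
snapshot() {
    dir="$1"
    label="$2"
    shift 2
    mkdir -p "$dir" || return 1
    file="$dir/${label}_$(date -u +%Y%m%dT%H%M%SZ).txt"
    {
        printf '# captured: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        printf '# command: %s\n' "$*"
        "$@" 2>&1
    } > "$file"
    printf '%s\n' "$file"   # print the path so it can be attached to the case
}

# Usage before a Domain stop/start (not executed here; the directory is an assumption):
#   snapshot /var/log/evidence mdsstat_before mdsstat
#   snapshot /var/log/evidence fwd_tail tail -n 200 "$FWDIR/log/fwd.elg"
```

Running the same snapshot after mdsstart_customer gives a before/after pair with timestamps that bracket the change.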

5) Sync/replication drift (when to suspect it, how to approach)

Typical symptoms:

- objects/policy appear inconsistent between peers
- policy install behaves differently across peers
- behavior diverges after change/upgrade

TAC approach

- confirm that changes were actually Published (unpublished changes will not replicate)
- compare logs and state on both peers (Active vs Standby)
- avoid “restart as a sync method”; use the HA/sync workflow in SmartConsole once drift is confirmed

6) Backup/Restore in HA (don’t create a disaster)

Commands:

mds_backup
mds_restore

Mandatory post-restore validation (correction)

After mds_restore, validate health before releasing the platform:

mdsstat

Then validate critical domains:

mdsenv <DomainName>
echo $FWDIR
# verify key processes and relevant logs in the proper context
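
For more than a handful of critical domains, the per-domain check can be looped. The sketch below is illustrative only: check_domains is a hypothetical helper, it assumes mdsenv is available in the running shell, and it assumes a Domain's $FWDIR contains the Domain name. Each switch happens in a subshell so the caller's own context is left untouched.

```shell
# Hypothetical helper: for each named Domain, switch context in a subshell and
# confirm that $FWDIR references that Domain.
# Assumptions: mdsenv is available in this shell; the Domain's $FWDIR contains
# the Domain name as a path component.
check_domains() {
    failed=0
    for d in "$@"; do
        if (
            mdsenv "$d" >/dev/null 2>&1
            case "${FWDIR:-}" in
                */"$d"/*) exit 0 ;;
                *)        exit 1 ;;
            esac
        ); then
            echo "OK    $d"
        else
            echo "CHECK $d (context not confirmed)"
            failed=1
        fi
    done
    return "$failed"
}

# Usage after mds_restore (not executed here):
#   check_domains <DomainName1> <DomainName2>
```

A passing result only confirms that the context switch works; process state and logs for each Domain still need to be reviewed.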

7) Kernel debug (advanced) — never run broad, always filter

Do not run “wide open” debug in production. Prefer module-scoped debug:

fw ctl debug -m <module> <flags>

Example (generic drop visibility, short window):

fw ctl debug -m fw + drop
fw ctl kdebug -f
# reproduce for 30–60s
fw ctl debug 0

TAC rule: unfiltered kernel debug + long duration = performance risk + low signal + messy RCA.
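
The short-window discipline can be enforced rather than remembered. The following sketch wraps the drop example above in a fixed capture window and resets the debug flags even if the run is interrupted. timeboxed_drop_debug, its arguments, and the default output path are hypothetical; only the fw ctl commands come from the example above.

```shell
# Sketch: run the module-scoped drop debug for a fixed window, then reset flags.
# Usage: timeboxed_drop_debug [seconds] [output_file]   (both arguments are hypothetical)
timeboxed_drop_debug() (
    duration="${1:-30}"                     # capture window in seconds
    out="${2:-/var/log/kdebug_drop.txt}"    # assumed output path
    trap 'fw ctl debug 0' EXIT              # reset flags whenever this subshell exits
    trap 'exit 130' INT TERM                # route Ctrl-C / kill through the EXIT trap
    fw ctl debug 0                          # start from default flags
    fw ctl debug -m fw + drop               # module-scoped flag, as in the example
    fw ctl kdebug -f > "$out" 2>&1 &        # stream kernel debug output to a file
    kpid=$!
    sleep "$duration"                       # reproduce the issue during this window
    kill "$kpid" 2>/dev/null || true
    wait "$kpid" 2>/dev/null || true
    exit 0
)

# Usage (not executed here):
#   timeboxed_drop_debug 45 /var/log/kdebug_drop.txt
```

The function body runs in a subshell, so the traps do not leak into the operator's interactive session.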

8) Recurring pitfalls (what causes repeated escalations)

- Tailing logs without mdsenv <Domain> → wrong context, wrong evidence
- Assuming mdsenv lists domains → it doesn’t; mdsstat does
- Restarting a domain prematurely → masks root cause (storage/locks/connectivity)
- Restoring without validating mdsstat → partial recovery → cascading incidents
- Running broad kernel debug in production → instability + unusable data
