WiliRGasparetto
MVP Diamond

Multi-Domain (MDS/DMS) — HA, Sync, and Advanced Troubleshooting (Field Runbook)

Scope: Multi-Domain Security Management (MDS + DMS). Focus on HA behavior, sync/replication, evidence-driven troubleshooting, and safe operational actions.

Operational thesis

In MDSM, most incidents look like “SmartConsole/publish/sync problems,” but MTTR typically spikes because:

  1. engineers collect logs in the wrong context (MDS vs Domain), and/or

  2. they perform disruptive actions (restart/failover) before collecting minimum evidence.

 

1) Architecture mental model (what matters in production)

1.1 MDS vs DMS (key concepts)

  • MDS (Multi-Domain Server): hosts the multi-domain framework and multiple Domain Management contexts.

  • DMS (Domain Management Server / Domain): manages objects/policy/gateways for a specific domain.

1.2 HA reality (where teams commonly misread behavior)

  • DMS HA is per-domain Active/Standby: typically one Active DMS per domain and one or more Standby peers.

    • “Failover” is operational/manual (promotion of Standby to Active). Do not assume automatic role switching.

  • MDS HA (infrastructure layer): commonly deployed as Primary/Secondary (design dependent) with domain distribution for load.

TAC rule: “MDS up” does not mean “Domain up.” Always validate per domain.

 

2) Rule #1: Context correctness (mdsenv) — without this you’re blind

2.1 List domains and overall health

To list domains and overall process status:

mdsstat

2.2 What mdsenv actually does (correction)

  • mdsenv without arguments sets the shell to the MDS-level context; it does not list domains.

  • To switch into a specific Domain context:

mdsenv <DomainName>

2.3 TAC verification: prove you are in the correct Domain

Immediately after switching context, validate:

echo $FWDIR
echo $MDS_FWDIR

Practical interpretation

  • $FWDIR must point into the directory tree of the Domain you intend to troubleshoot, not the MDS-level installation ($MDS_FWDIR).

  • If you don’t validate context, you will tail the wrong logs and chase ghosts.
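The check above can be scripted so it fails loudly instead of relying on a visual read. This is a minimal sketch, not an official tool: the function name in_domain_context is hypothetical, and it assumes Domain directories sit under a path of the form .../customers/<DomainName>/..., which should be confirmed on your own system before relying on it.

```shell
# Return 0 only if $FWDIR looks like it belongs to the given Domain.
# ASSUMPTION: Domain directories live under ".../customers/<DomainName>/".
# Confirm this layout on your version before trusting the result.
in_domain_context() {
    local domain=$1
    case "${FWDIR:-}" in
        */customers/"$domain"/*)
            return 0
            ;;
        *)
            echo "WARNING: FWDIR='${FWDIR:-unset}' does not match domain '$domain'" >&2
            return 1
            ;;
    esac
}

# Example use after switching context (DomainName is a placeholder):
#   mdsenv DomainName
#   in_domain_context DomainName || echo "Wrong context, stop here"
```

Running this before every tail or grep costs nothing and catches the most common mistake: a shell that is still in the MDS context or in a previously used Domain.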

 

3) “15-minute triage” runbook (evidence-first)

3.1 Baseline health (MDS level)

mdsstat

What you’re looking for:

  • whether the issue is global (MDS plane) vs isolated to one domain

  • domains with abnormal state

  • key daemons not running

Do not default to “wait 10 minutes and restart.” First determine which domain/layer is failing and why.
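To separate "global" from "one domain" quickly, the status table can be filtered for rows that report a stopped process. The sketch below is an illustration under assumptions: the function name list_down_rows is hypothetical, and it assumes mdsstat prints pipe-delimited rows (Type, Name, IP address, then one column per process) and marks a stopped process with the word "down". Check this against the output of your version and adjust the column index if it differs.

```shell
# Read mdsstat-style output on stdin and print "Type Name" for every row
# where a process column starts with "down".
# ASSUMPTION: rows look like "| Type | Name | IP | proc | proc | ...",
# so after splitting on "|" the process columns start at field 5.
list_down_rows() {
    awk -F'|' '
        {
            for (i = 5; i <= NF; i++) {
                if ($i ~ /^[ \t]*down/) {
                    type = $2
                    name = $3
                    gsub(/^[ \t]+|[ \t]+$/, "", type)
                    gsub(/^[ \t]+|[ \t]+$/, "", name)
                    print type, name
                    break
                }
            }
        }
    '
}

# Example use on the MDS:
#   mdsstat | list_down_rows
```

An empty result does not prove the platform is healthy; it only says no process column reported "down" at that moment. A row of Type MDS in the output points at the MDS plane, while rows for individual domains point at per-domain problems.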

3.2 Enter the affected domain

mdsenv <DomainName>
echo $FWDIR

3.3 Live log tails (while reproducing)

From the correct domain context:

tail -F $FWDIR/log/fwd.elg
tail -F $CPDIR/log/cpd.elg

Management logs (correction: cpm.elg may not exist everywhere)

$FWDIR/log/cpm.elg is not guaranteed in all versions/architectures. In some cases, management-side logging you need may be in other files (e.g., asm.elg).

TAC rule: first confirm what exists:

ls -lh "$FWDIR"/log | grep -E "cpm|asm"

Then tail the relevant file:

tail -F $FWDIR/log/<existing_file>.elg
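The "confirm what exists, then tail" step can be wrapped in a small helper so the tail command never points at a missing file. This is a sketch only: existing_logs is a hypothetical name, and the default candidates are just the two file names mentioned above (cpm.elg, asm.elg). Pass other names if your version logs elsewhere.

```shell
# Print the full path of each candidate log file that exists in a directory.
# With no candidates given, it checks cpm.elg and asm.elg.
existing_logs() {
    local dir=$1
    shift
    if [ "$#" -eq 0 ]; then
        set -- cpm.elg asm.elg
    fi
    local f
    for f in "$@"; do
        if [ -f "$dir/$f" ]; then
            printf '%s\n' "$dir/$f"
        fi
    done
    return 0
}

# Example use from the correct domain context:
#   existing_logs "$FWDIR/log"
#   tail -F $(existing_logs "$FWDIR/log")
```

tail -F accepts several files at once and labels each chunk with a file header, so following every existing candidate in one terminal is usually enough while reproducing the issue.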

 

4) Stop/Start a specific Domain (surgical and correct)

Correction: use Domain name (not IP)

Use the Domain/DMS name as shown in mdsstat:

mdsstop_customer <DMS_Name>
mdsstart_customer <DMS_Name>
mdsstat

TAC discipline

  • collect minimum evidence first (status + logs + timestamp window)

  • treat this as a controlled change (can impact publish/install for that domain)
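One way to make "collect minimum evidence first" a habit is to snapshot the log tails into a timestamped directory before touching the domain. The helper below is a sketch under assumptions: snapshot_logs is a hypothetical name, and the output location and line count are placeholders to adapt.

```shell
# Copy the last <lines> lines of each existing log file into a new
# timestamped directory under <outdir>, then print that directory's path.
snapshot_logs() {
    local outdir=$1 lines=$2
    shift 2
    local stamp dest f
    stamp=$(date +%Y%m%d_%H%M%S)
    dest="$outdir/evidence_$stamp"
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            tail -n "$lines" "$f" > "$dest/$(basename "$f").tail"
        fi
    done
    printf '%s\n' "$dest"
}

# Example use before a stop/start (paths and line count are placeholders):
#   dest=$(snapshot_logs /var/log/tmp 500 "$FWDIR/log/fwd.elg" "$CPDIR/log/cpd.elg")
#   mdsstat > "$dest/mdsstat_before.txt"
```

The directory name carries the timestamp, which gives the "timestamp window" for later correlation, and saving mdsstat output next to the log tails keeps the before-state with the evidence.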

 

5) Sync/replication drift (when to suspect it, how to approach)

Typical symptoms:

  • objects/policy appear inconsistent between peers

  • policy install behaves differently across peers

  • behavior diverges after change/upgrade

TAC approach

  • confirm that changes were actually Published (unpublished changes will not replicate)

  • compare logs and state on both peers (Active vs Standby)

  • avoid “restart as a sync method” — use the HA/sync workflow in SmartConsole when drift is confirmed

 

6) Backup/Restore in HA (don’t create a disaster)

Commands:

mds_backup
mds_restore

Mandatory post-restore validation (correction)
After mds_restore, validate health before releasing the platform:

mdsstat

Then validate critical domains:

mdsenv <DomainName>
echo $FWDIR
# verify key processes and relevant logs in the proper context

 

7) Kernel debug (advanced) — never run broad, always filter

Do not run “wide open” debug in production. Prefer module-scoped debug:

fw ctl debug -m <module> <flags>

Example (generic drop visibility, short window):

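fw ctl debug 0                # reset debug flags to defaults before starting
fw ctl debug -m fw + drop     # enable only the "drop" flag in the fw module
fw ctl kdebug -T -f           # stream output with timestamps; stop with Ctrl+C
# reproduce for 30–60s, then stop the stream
fw ctl debug 0                # always reset flags when done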

TAC rule: unfiltered kernel debug + long duration = performance risk + low signal + messy RCA.
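The "short window" part of the rule can be enforced rather than remembered. The sketch below wraps any capture command in a hard time limit. It assumes the GNU coreutils timeout utility is available on the system, and both the function name timeboxed_capture and the output path in the usage comment are hypothetical.

```shell
# Run a command for at most <seconds>, writing its output to <outfile>.
# ASSUMPTION: the coreutils "timeout" utility is installed.
# timeout exits with 124 when it had to stop the command; for a capture
# that is the expected ending, so it is reported as success.
timeboxed_capture() {
    local secs=$1 out=$2
    shift 2
    local rc=0
    timeout "$secs" "$@" > "$out" 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        return 0
    fi
    return "$rc"
}

# Example use (output path is a placeholder):
#   fw ctl debug 0
#   fw ctl debug -m fw + drop
#   timeboxed_capture 60 /var/log/kdebug_drop.txt fw ctl kdebug -T -f
#   fw ctl debug 0
```

Writing to a file instead of the terminal also keeps the capture usable for later analysis, and the final fw ctl debug 0 still has to be run explicitly once the capture ends.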

 

8) Recurring pitfalls (what causes repeated escalations)

  • Tailing logs without mdsenv <Domain> → wrong context, wrong evidence

  • Assuming mdsenv lists domains → it doesn’t; mdsstat does

  • Restarting a domain prematurely → masks root cause (storage/locks/connectivity)

  • Restoring without validating mdsstat → partial recovery → cascading incidents

  • Running broad kernel debug in production → instability + unusable data
