Using vpn debug + kernel evidence (only when needed) — CheckMates-style
⚠️ Operational impact warning: VPN debug—and especially kernel debug—can generate high log volume and impact performance. Use a controlled window, ideally with console access, and always apply filters before collecting kernel evidence.
1) The right mindset: what you must prove (and at which layer)
Before turning on debug, define the hypothesis and the evidence you expect:
Layer 0 — IKE / NAT-T Connectivity
- Does the peer respond on UDP/500 (IKE) and, if NAT exists, UDP/4500 (NAT-T)?
- Any upstream block? Asymmetric routing? ISP ALG interference?
Layer 1 — IKE Negotiation (Phase 1 / IKE SA)
- Compatible proposals (encryption / integrity / DH group)?
- Authentication matches (PSK / certificate / ID)?
- Certificate chain / CRL / time sync (clock skew) OK?
Layer 2 — Child SA / Phase 2 (Traffic Selectors)
Layer 3 — Data Plane
2) Preparation (before debug)
2.1 Minimum checklist (prevents unnecessary debug)
- Verify time/NTP (common root cause for certificate/IKE failures).
- Confirm policy is installed and rules allow UDP/500, UDP/4500, ESP (proto 50) between peers.
- Confirm route to the peer and return path (especially multi-ISP).
- Confirm Site-to-Site NAT exemption and ensure there is no “surprise NAT” in the path.
2.2 Define the debug “target”
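Use the real IPs from the case. The examples in this guide use safe placeholders (public addresses from the RFC 5737 documentation ranges, internal networks from RFC 1918 private space):
- Remote peer (public): 203.0.113.10
- Local (public): 198.51.100.5
- Local network: 10.10.10.0/24
- Remote network: 10.20.20.0/24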
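A typo in the target address silently produces an empty capture or an unfiltered kernel debug, so it can help to sanity-check the values before using them. The sketch below is only an illustration: is_ipv4 is a hypothetical helper (not a Check Point command), and the variable names are assumptions.

```shell
#!/bin/sh
# Case targets (example values from this guide; replace with the real ones).
PEER_IP="203.0.113.10"
LOCAL_IP="198.51.100.5"

# is_ipv4 (hypothetical helper): succeeds only for a dotted quad with octets 0-255.
is_ipv4() {
  echo "$1" | awk -F. '
    NF != 4 { exit 1 }
    { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }
  '
}

# Report each target before it is used in a capture or kernel filter.
for ip in "$PEER_IP" "$LOCAL_IP"; do
  if is_ipv4 "$ip"; then echo "ok: $ip"; else echo "bad address: $ip"; fi
done
```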
3) Standard evidence pack (CheckMates-ready): 3 sessions + consistent logs
Session A — VPN debug (user space)
- Enter Expert mode.
- Normalize/truncate before starting:
vpn debug trunc ALL=5
- Enable VPN debug:
vpn debug on
- (Optional, very useful) Enable timestamps:
vpn debug timeon
- Follow logs in real time (separate tab is ideal):
tail -f $FWDIR/log/vpnd.elg
tail -f $FWDIR/log/ike.elg
vpn debug writes evidence to vpnd.elg and ike.elg under $FWDIR/log/.
When to use ikefail
If the issue is intermittent and you only want to record failed IKE negotiations and error events, enable:
vpn debug ikefail
Session B — Network capture (tcpdump) for IKE/NAT-T/ESP
Goal: prove “Did the peer respond?”, “Did it migrate to NAT-T?”, “Is ESP flowing?”
tcpdump -ni <wan_iface> -vvv "(udp port 500 or udp port 4500 or proto 50)"
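On a busy gateway it is often worth narrowing the capture to the peer and writing it to a file for later analysis. The sketch below only prints the command so it can be reviewed first; capture_cmd is a hypothetical helper, and the interface name (eth0) and output path are assumptions. It relies only on the standard tcpdump -w option and the host filter primitive.

```shell
#!/bin/sh
# capture_cmd (hypothetical helper): PRINTS a peer-scoped tcpdump command; it does not run it.
capture_cmd() {
  # $1 = WAN interface, $2 = peer public IP, $3 = output pcap path
  printf 'tcpdump -ni %s -w %s "host %s and (udp port 500 or udp port 4500 or proto 50)"\n' "$1" "$3" "$2"
}

# Example values are assumptions; substitute the real interface, peer, and path.
capture_cmd eth0 203.0.113.10 /var/log/vpn_peer.pcap
```

Paste the printed line into Session B once it looks right, and stop it with Ctrl+C after reproducing the issue.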
Practical interpretation
- Only UDP/500 outbound and nothing returns → upstream block / routing / ACL.
- UDP/500 then UDP/4500 → NAT-T in action (expected when NAT exists). With NAT-T, ESP is encapsulated in UDP/4500, so no separate proto 50 packets appear.
- ESP outbound but not inbound → return path / upstream ACL / ISP / asymmetric routing / remote drop.
- ESP both directions but application fails → likely Phase 2 / selectors / NAT exemption / policy.
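The interpretation above boils down to counting packets per protocol and direction. As a rough sketch, the helper below does that counting over tcpdump text output. It assumes the non-verbose one-line-per-packet format of "tcpdump -n" (with -vvv the addresses move to a continuation line and this parser will not match). classify_ike_capture is a hypothetical name, and the sample lines are illustrative, not real capture output.

```shell
#!/bin/sh
# classify_ike_capture (hypothetical helper): count IKE / NAT-T / ESP packets per direction.
# $1 = local public IP, $2 = peer public IP; tcpdump text is read from stdin.
classify_ike_capture() {
  awk -v l="$1" -v p="$2" '
    # Address fields look like "a.b.c.d.port" for UDP and "a.b.c.d" for ESP.
    function ip_of(s,   a) { split(s, a, "."); return a[1] "." a[2] "." a[3] "." a[4] }
    function port_of(s,   a, n) { n = split(s, a, "."); return (n >= 5) ? a[5] : "" }
    $2 == "IP" && $4 == ">" {
      src = $3; dst = $5; sub(/:$/, "", dst)
      if (ip_of(src) == l && ip_of(dst) == p) dir = "out"
      else if (ip_of(src) == p && ip_of(dst) == l) dir = "in"
      else next                      # not between the local gateway and the peer
      if ($6 ~ /^ESP/) kind = "esp"  # native ESP (IP protocol 50)
      else if (port_of(src) == "500" || port_of(dst) == "500") kind = "udp500"
      else if (port_of(src) == "4500" || port_of(dst) == "4500") kind = "udp4500"
      else next
      count[kind "_" dir]++
    }
    END {
      printf "udp500 out=%d in=%d\n", count["udp500_out"], count["udp500_in"]
      printf "udp4500 out=%d in=%d\n", count["udp4500_out"], count["udp4500_in"]
      printf "esp out=%d in=%d\n", count["esp_out"], count["esp_in"]
    }'
}

# Illustrative input: three IKE retransmissions leave, nothing comes back.
classify_ike_capture 198.51.100.5 203.0.113.10 <<'EOF'
10:00:00.100000 IP 198.51.100.5.500 > 203.0.113.10.500: isakmp: phase 1 I ident
10:00:02.100000 IP 198.51.100.5.500 > 203.0.113.10.500: isakmp: phase 1 I ident
10:00:06.100000 IP 198.51.100.5.500 > 203.0.113.10.500: isakmp: phase 1 I ident
EOF
```

For the sample input, the summary shows three outbound UDP/500 packets and no inbound traffic of any kind, which matches the first bullet (upstream block / routing / ACL).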
Session C — (Optional) IKE monitor (snoop) when you must “see the conversation”
If you need deeper visibility:
vpn debug mon
This generates a snoop capture (for example ikemonitor.snoop) for analysis.
⚠️ Sensitive data warning: can record sensitive information (e.g., XAUTH/password) depending on the scenario. Use only when necessary and handle per policy.
To stop:
vpn debug moff
4) When (and how) to use Kernel Debug without hurting the environment
Golden rule: never run kernel debug “blind.” Filter by VPN peer or 5-tuple, then capture.
4.1 Filter by VPN peer (fastest for VPN incidents)
fw ctl set int simple_debug_filter_off 1
fw ctl set str simple_debug_filter_vpn_1 "203.0.113.10"
The first command clears any previously configured filters; simple_debug_filter_vpn_<N> then restricts kernel debug output to the given VPN peer.
4.2 Filter by 5-tuple (flow-specific issues)
Example (test HTTPS between two hosts):
fw ctl set int simple_debug_filter_off 1
fw ctl set str simple_debug_filter_saddr_1 "10.10.10.50"
fw ctl set str simple_debug_filter_daddr_1 "10.20.20.80"
fw ctl set int simple_debug_filter_dport_1 443
fw ctl set int simple_debug_filter_proto_1 6
The filter logic supports AND (same index) and OR (different indices), including covering both directions if needed.
4.3 Disable filters at the end (mandatory hygiene)
fw ctl set int simple_debug_filter_off 1
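For orientation, here is one way the filter could fit into a complete kernel debug capture. This is a sketch under assumptions, not a validated procedure: the buffer size, the "-m VPN all" module flags, and the output path are choices made for illustration and should be checked against Check Point documentation for your version. kernel_debug_plan is a hypothetical helper that only prints the sequence for review.

```shell
#!/bin/sh
# kernel_debug_plan (hypothetical helper): PRINTS a filtered kernel debug sequence; runs nothing.
kernel_debug_plan() {
  # $1 = VPN peer IP, $2 = output file for the kernel debug text
  cat <<EOF
fw ctl debug 0
fw ctl debug -buf 8200
fw ctl set int simple_debug_filter_off 1
fw ctl set str simple_debug_filter_vpn_1 "$1"
fw ctl debug -m VPN all
fw ctl kdebug -T -f > $2
# reproduce the issue, stop kdebug with Ctrl+C, then clean up:
fw ctl debug 0
fw ctl set int simple_debug_filter_off 1
EOF
}

# Example values are assumptions; substitute the real peer and a path with free space.
kernel_debug_plan 203.0.113.10 /var/log/kdebug_vpn.txt
```

Printing the plan first keeps the golden rule visible: the filter is set before any debug flags are enabled, and both the flags and the filter are reset at the end.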
5) End-to-end “copy/paste” sequence
Start (Session A)
vpn debug trunc ALL=5
vpn debug timeon
vpn debug on
Network capture (Session B)
tcpdump -ni <wan_iface> -vvv "(udp port 500 or udp port 4500 or proto 50)"
(Optional) IKE monitor (Session C)
vpn debug mon
Reproduce the issue
Stop (Session A)
vpn debug off
vpn debug timeoff
Stop monitor (if used)
vpn debug moff
If you enabled kernel debug filters
fw ctl set int simple_debug_filter_off 1
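The most common mistake with this sequence is forgetting the stop step when the reproduce step fails. One possible guard is a small wrapper that always runs the stop commands when it exits. The sketch below is an assumption-laden illustration: run, start_vpn_debug, stop_vpn_debug, and with_vpn_debug are hypothetical names, and it defaults to a dry run (DRY_RUN=1) that only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Start commands for Session A, as listed in this guide.
start_vpn_debug() {
  run vpn debug trunc ALL=5
  run vpn debug timeon
  run vpn debug on
}

# Stop commands for Session A, as listed in this guide.
stop_vpn_debug() {
  run vpn debug off
  run vpn debug timeoff
}

# with_vpn_debug (hypothetical wrapper): debug on, run the given command, debug off.
with_vpn_debug() {
  (
    trap stop_vpn_debug EXIT   # runs when this subshell exits, even if "$@" fails
    start_vpn_debug
    "$@"
  )
}

# Dry run: the echo stands in for the step that reproduces the issue.
with_vpn_debug echo "reproduce the issue here"
```

In the dry run, the three start commands are printed first, then the reproduce step runs, then the two stop commands are printed. Set DRY_RUN=0 only on a gateway, inside the agreed debug window.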
6) What to look for in the logs (evidence-driven diagnosis)
6.1 IKE (Phase 1 / IKE SA) failures
Common patterns:
- “no proposal chosen” → proposal mismatch (encryption/integrity/DH).
- “invalid ID / ID mismatch” → peer identity mismatch (DN/FQDN/IP).
- “peer not responding” → network/ACL/routing/NAT-T/ISP (prove with tcpdump).
How to close the case with evidence
- If tcpdump shows no reply on UDP/500/4500 → not “VPN config,” it’s connectivity/path.
- If the peer replies and negotiation stops at a consistent point → correlate timestamps in ike.elg/vpnd.elg to identify the exact failure.
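A quick first pass over the debug file can show which of the common patterns is present before reading it line by line. The sketch below counts case-insensitive matches for the phrases listed in this section. scan_ike_failures is a hypothetical helper, and the exact wording of messages in ike.elg varies by version, so treat a zero count as "not found with this wording", not as proof of absence.

```shell
#!/bin/sh
# scan_ike_failures (hypothetical helper): count lines matching common IKE failure phrases.
# $1 = path to a debug file, for example "$FWDIR/log/ike.elg"
scan_ike_failures() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 1
  fi
  for pat in "no proposal chosen" "invalid id" "id mismatch" "peer not responding"; do
    # grep -c prints the number of matching lines (0 when there are none).
    printf '%s: %d\n' "$pat" "$(grep -c -i -F -- "$pat" "$1")"
  done
}

# Example (run on the gateway, in Expert mode):
#   scan_ike_failures "$FWDIR/log/ike.elg"
```

Once a pattern shows up, search for it with grep -n -i in the same file to get line numbers, then line those up with the timestamps in the tcpdump capture.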
6.2 Phase 2 / Traffic Selector failures
Typical signs:
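- IKE comes up, but traffic doesn’t pass.
- ESP is missing or flows in only one direction.
- Connections get the wrong treatment because of NAT or routing (for example, traffic is NATed or routed so that it no longer matches the expected tunnel).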
Proof points
7) VSX (when applicable)
Before collecting anything, switch to the correct VS:
vsenv <VSID>
Then run the steps above (each VS has its own contexts/logs).