Re: Standby member no internet

melcu · ‎2024-09-28

Hi Mates,

So it's been discussed a lot but my story is a little bit different. I have a client with a bunch of Active/Standby ClusterXL clusters in which the Standby member cannot access the internet at all.

Long story short: I almost ran out of search keywords in this forum and on Google regarding the issue. First of all, sk43807 was followed line by line with no luck. Then I tried fwha_forw_packet_to_not_active set to 1 and to 0, with no change at all, and this is why (please see the diagram). There is more than 1 interface but you get the picture.

Both members are running only on private IP addresses. All traffic is hidden by NAT behind a public IP address, and the CORE router knows to route the /32 of that public IP address to the VIP address of the cluster. When the ACTIVE node (doesn't matter, fw1 or fw2) sends any packets, they are NAT-ed behind that public IP address and sent on their way. The return traffic is forwarded by the router to the VIP and everything works (as the VIP is bound to the Active member).

When the Standby member tries to access anything, I can see (and I'm very sorry but I cannot put real captures here due to IP address privacy) that packets that originate from the Standby are forwarded to the Active member over the SYNC interface. The Active member then matches the traffic against its rulebase, applies NAT, and the packets go out to CORE and then to the internet. The return traffic is the funny part. It arrives on the Active member and vanishes there. It's not dropped (fw ctl zdebug + drop shows nothing), it simply vanishes and is not forwarded to the Standby member (which is by design, I presume).

So eventually I've lost all my hopes of making this work.

Any help or guidance will really be appreciated.

Wish all the best,

_Val_ · ‎2024-09-29

Forwarding outbound traffic through the sync interface is correct in this case. Return packet disappearing without forwarding back is not. Check if you have any associated drops. Also, if stuck, please open a support request.

AkosBakos · ‎2024-09-29

Hi @melcu

Run # fw ctl zdebug + drop on the standby member. Maybe something meaningful will appear.

A

----------------
\m/_(>_<)_\m/

melcu · ‎2024-09-29

I've created a lab with the exact IP schema from the above diagram. Pinging google DNS I can see this on the Active member, but nothing on the standby member. See the attached.

Later edit: Also interfaces and routes are identical.

[Expert@gw01:0]# ip ro ls
10.134.0.0/24 dev eth1 proto kernel scope link src 10.134.0.11
10.144.70.0/24 dev eth0 proto kernel scope link src 10.144.70.21
10.144.0.0/16 via 10.144.70.1 dev eth0 proto routed
default via 10.134.0.1 dev eth1 proto routed

[Expert@gw02:0]# ip ro ls
10.134.0.0/24 dev eth1 proto kernel scope link src 10.134.0.12
10.144.70.0/24 dev eth0 proto kernel scope link src 10.144.70.22
10.144.0.0/16 via 10.144.70.1 dev eth0 proto routed
default via 10.134.0.1 dev eth1 proto routed
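
A quick way to confirm that the two tables really match, apart from each member's own source address, is to normalize and diff them. This is a minimal Python sketch, not a Check Point tool; the function names are made up for illustration and the input is simply the `ip ro ls` text pasted from above.

```python
def normalize_routes(ip_route_output):
    """Return the set of routes with the member-specific 'src <ip>' pair removed."""
    routes = set()
    for line in ip_route_output.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if "src" in tokens:
            i = tokens.index("src")
            del tokens[i:i + 2]  # drop 'src' and the address that follows it
        routes.add(" ".join(tokens))
    return routes


def route_diff(member_a, member_b):
    """Return (routes only on A, routes only on B) after normalization."""
    a, b = normalize_routes(member_a), normalize_routes(member_b)
    return sorted(a - b), sorted(b - a)


GW01 = """
10.134.0.0/24 dev eth1 proto kernel scope link src 10.134.0.11
10.144.70.0/24 dev eth0 proto kernel scope link src 10.144.70.21
10.144.0.0/16 via 10.144.70.1 dev eth0 proto routed
default via 10.134.0.1 dev eth1 proto routed
"""

GW02 = """
10.134.0.0/24 dev eth1 proto kernel scope link src 10.134.0.12
10.144.70.0/24 dev eth0 proto kernel scope link src 10.144.70.22
10.144.0.0/16 via 10.144.70.1 dev eth0 proto routed
default via 10.134.0.1 dev eth1 proto routed
"""

only_gw01, only_gw02 = route_diff(GW01, GW02)
print("only on gw01:", only_gw01)
print("only on gw02:", only_gw02)
```

Two empty lists mean the members differ only in their own interface addresses, which is what you would expect from a healthy pair.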

Getting interfaces with or without topology goes well. At least in my lab as I don't have any access in their environment. Here I can do whatever I want as it's a lab .. no impact 🙂

the_rock · ‎2024-09-29

Can you send output of below command from expert mode of both members please?

Andy

ip r g 8.8.8.8

the_rock · ‎2024-09-29

@melcu

For the context this is what I see in my lab cluster and fw2 is standby currently.

Andy

[Expert@CP-FW-01:0]#
[Expert@CP-FW-01:0]# ip r g 8.8.8.8
8.8.8.8 via 172.16.10.1 dev eth0 src 172.16.10.248
cache
[Expert@CP-FW-01:0]# ssh admin@172.16.10.247
admin@172.16.10.247's password:
Last login: Fri Sep 27 08:35:02 2024 from 100.65.16.2
[Expert@CP-FW-02:0]# ip r g 8.8.8.8
8.8.8.8 via 172.16.10.1 dev eth0 src 172.16.10.247
cache
[Expert@CP-FW-02:0]# ^C
[Expert@CP-FW-02:0]# cphaprob roles

ID Role

1 Master
2 (local) Non-Master

[Expert@CP-FW-02:0]# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=113 time=9.13 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=113 time=6.36 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=113 time=6.00 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 6.008/7.169/9.133/1.399 ms
[Expert@CP-FW-02:0]#

the_rock · ‎2024-09-29

If I were you, apart from what the guys already asked, I would make sure routes are exactly the same on both members. Also, to confirm, navigate to the cluster object in SmartConsole, open network, then topology and click get interfaces WITHOUT topology, just to verify there are no errors.

Andy

melcu · ‎2024-09-29

[Expert@gw01:0]# ip ro get 8.8.8.8
8.8.8.8 via 10.134.0.1 dev eth1 src 10.134.0.11
cache ipid 0x40ef mtu 1500 advmss 1460 hoplimit 64

[Expert@gw02:0]# ip ro get 8.8.8.8
8.8.8.8 via 10.134.0.1 dev eth1 src 10.134.0.12
cache mtu 1500 advmss 1460 hoplimit 64

fw1 is MASTER

fw2 is NON-MASTER (as it's the standby unit). If I flip them, internet works on FW2 but not on FW1 🙂

the_rock · ‎2024-09-29

I have a gut feeling I know what could be wrong. So it's 100% NOT the specific member, since the issue happens regardless of which one is master.

Can you check how below is set? No need to send a screenshot, just verify.

Andy

the_rock · ‎2024-09-29

@melcu I know this may sound silly (trivial), but can you confirm 100% that you do indeed have a proper rule in smart console allowing the traffic?

Andy

melcu · ‎2024-09-29

In the lab it's wide open. There is a specific rule for the VIP and the members to access the internet, and the unsafe one from ANY to the members and the cluster (but the public IP is protected by an IPS profile, just in case).

the_rock · ‎2024-09-29

To clear ANY doubts, run ping in one ssh window, then below in another and send what you get.

Andy

[Expert@CP-FW-02:0]# fw up_execute dst=8.8.8.8 ipp=0
Rulebase execution ended successfully.
Overall status:
----------------
Active clob mask: 2
Required clob mask: 0
Match status: POSSIBLE
Match action: Accept

Per Layer:
------------
Layer name: network
Layer id: 0
Match status: POSSIBLE
Match action: Accept
Possible rules: 3 4 6 7 8 9 16777215

Layer name: appc+urlf
Layer id: 6
Match status: MATCH
Match action: Accept
Matched rule: 5
Possible rules: 5 16777215

Layer name: content-awareness-layer
Layer id: 3
Match status: MATCH
Match action: Accept
Matched rule: 1
Matched rules: 1

Layer name: final-allow-layer
Layer id: 7
Match status: MATCH
Match action: Accept
Matched rule: 1
Matched rules: 1

[Expert@CP-FW-02:0]#

melcu · ‎2024-09-29

the_rock · ‎2024-09-29

K, so if it's accepted, rules are fine. Not sure then, maybe some kernel parameter...let's see if anyone else may have an idea. Anyway, I have to go now, get ready for some biking event.

Hope you find the resolution soon.

Andy

melcu · ‎2024-09-29

Told ya! It's driving me crazy!

Even looking in the logs, it shows that traffic is accepted by fw01, is NATed behind the public IP and goes out. Even on the FortiGate I see traffic coming from the NAT IP (when initiated by the standby member), but when it returns it gets dropped or something by fw01.

Perimeter firewall (FortiGate, but that is irrelevant):

FGT-FW01 (vdom-) # diagnose sniffer packet any 'host 8.8.8.8 and host 91.208.215.149'
interfaces=[any]
filters=[host 8.8.8.8 and host 91.208.215.149]
11.250797 91.208.215.149 -> 8.8.8.8: icmp: echo request
11.250813 91.208.215.149 -> 8.8.8.8: icmp: echo request
11.250814 91.208.215.149 -> 8.8.8.8: icmp: echo request
11.280356 8.8.8.8 -> 91.208.215.149: icmp: echo reply
11.280358 8.8.8.8 -> 91.208.215.149: icmp: echo reply
11.280368 8.8.8.8 -> 91.208.215.149: icmp: echo reply
11.280369 8.8.8.8 -> 91.208.215.149: icmp: echo reply

So it sees the traffic originating from the standby node and NATed behind 91.208.215.149.

The return traffic reaches the VIP (on the active node):

[Expert@gw01:0]# tcpdump -vvv -ni eth1 host 8.8.8.8
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
17:27:25.095506 IP (tos 0x0, ttl 59, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.095623 IP (tos 0x0, ttl 58, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.095908 IP (tos 0x0, ttl 57, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.095948 IP (tos 0x0, ttl 56, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.096033 IP (tos 0x0, ttl 55, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.096062 IP (tos 0x0, ttl 54, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.096142 IP (tos 0x0, ttl 53, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.096177 IP (tos 0x0, ttl 52, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
17:27:25.096261 IP (tos 0x0, ttl 51, id 0, offset 0, flags [none], proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: ICMP echo reply, id 10267, seq 8, length 64
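
One detail in the gw01 capture above may be worth a second look (this is an observation, not something confirmed in this thread): all nine lines are the same echo reply (id 10267, seq 8), captured within under a millisecond, with the TTL falling from 59 to 51 one step at a time. That pattern is typically what a packet bouncing between two hops looks like, rather than a packet that silently disappears. The Python sketch below groups tcpdump lines by ICMP id and sequence number and flags that pattern. The helper names are hypothetical, and the sample is rebuilt from the capture with a single timestamp reused for brevity.

```python
import re
from collections import defaultdict

# Matches the verbose tcpdump format shown above: IP TTL, then the ICMP echo fields.
LINE_RE = re.compile(r"ttl (\d+),.*?ICMP echo (request|reply), id (\d+), seq (\d+)")


def ttl_runs(capture):
    """Map (kind, icmp_id, seq) to the list of TTLs seen, in capture order."""
    seen = defaultdict(list)
    for line in capture.splitlines():
        match = LINE_RE.search(line)
        if match:
            ttl, kind, icmp_id, seq = match.groups()
            seen[(kind, int(icmp_id), int(seq))].append(int(ttl))
    return dict(seen)


def looks_like_loop(ttls):
    """True if one packet was seen 3+ times with the TTL dropping by exactly 1 each time."""
    return len(ttls) >= 3 and all(a - b == 1 for a, b in zip(ttls, ttls[1:]))


# Sample rebuilt from the gw01 capture (TTL 59 down to 51, same id and seq).
TEMPLATE = (
    "17:27:25.095506 IP (tos 0x0, ttl {ttl}, id 0, offset 0, flags [none], "
    "proto: ICMP (1), length: 84) 8.8.8.8 > 91.208.215.149: "
    "ICMP echo reply, id 10267, seq 8, length 64"
)
SAMPLE = "\n".join(TEMPLATE.format(ttl=t) for t in range(59, 50, -1))

for key, ttls in ttl_runs(SAMPLE).items():
    print(key, ttls, "possible loop" if looks_like_loop(ttls) else "ok")
```

If the real capture is flagged the same way, comparing MAC addresses with `tcpdump -e` on the same interface would show which two devices the reply is bouncing between.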

But the standby member 🙂 sees only the outgoing packets and nothing coming back.

[Expert@gw02:0]# tcpdump -vni any host 8.8.8.8
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
17:30:39.739541 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP (1), length: 84) 91.208.215.149 > 8.8.8.8: ICMP echo request, id 10325, seq 176, length 64
17:30:40.739875 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP (1), length: 84) 91.208.215.149 > 8.8.8.8: ICMP echo request, id 10325, seq 177, length 64
17:30:41.740411 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP (1), length: 84) 91.208.215.149 > 8.8.8.8: ICMP echo request, id 10325, seq 178, length 64
17:30:42.739779 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP (1), length: 84) 91.208.215.149 > 8.8.8.8: ICMP echo request, id 10325, seq 179, length 64

At first I thought that the packet would return on a different interface, and that's why I used "-i any". It doesn't come back from FW01.

Pretty sure this is a kernel issue. Btw, even running fwaccel off doesn't solve it.

the_rock · ‎2024-09-29

Just came back from my race...I thought about this while running and now that I checked your topology, I am certain your issue is the fact you have external if configured as sync. Can you create SEPARATE sync interface and test?

Andy

melcu · ‎2024-09-29

Hmm SYNC was not on this interface

Before:
eth0       UP                    sync(secured), unicast
eth1       UP                    non sync(non secured), unicast

Virtual cluster interfaces: 2
eth0            10.144.70.20
eth1            10.134.0.10

After:
eth0       UP                    non sync(non secured), unicast
eth1       UP                    sync(secured), unicast

Virtual cluster interfaces: 2

eth0            10.144.70.20
eth1            10.134.0.10

Behavior is the same..
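
To keep track of which interface ends up carrying sync in each state, the status lines above can be parsed and compared with the interface that holds the default route. A minimal Python sketch follows; it assumes the older status layout pasted in this post (name, link state, role), the function names are invented for illustration, and the route line is copied from the routing tables earlier in the thread.

```python
def sync_interfaces(status_text):
    """Return interface names whose role description starts with 'sync('."""
    names = []
    for line in status_text.strip().splitlines():
        parts = line.split(None, 2)  # name, link state, role description
        if len(parts) == 3 and parts[2].startswith("sync("):
            names.append(parts[0])
    return names


def default_route_dev(route_text):
    """Return the device of the default route, or None if there is none."""
    for line in route_text.strip().splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "default" and "dev" in tokens:
            return tokens[tokens.index("dev") + 1]
    return None


BEFORE = """
eth0       UP                    sync(secured), unicast
eth1       UP                    non sync(non secured), unicast
"""

AFTER = """
eth0       UP                    non sync(non secured), unicast
eth1       UP                    sync(secured), unicast
"""

ROUTES = "default via 10.134.0.1 dev eth1 proto routed"

external = default_route_dev(ROUTES)
for label, state in (("before", BEFORE), ("after", AFTER)):
    sync = sync_interfaces(state)
    print(label, "sync on", sync, "| shares external interface:", external in sync)
```

With the data from this thread, the "after" state puts sync on the same interface as the default route. Whether that has any bearing on the problem is not established here.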

Just tested: after clusterXL_admin down on FW1, FW2 instantly reaches the internet.

So definitely something in the kernel.

the_rock · ‎2024-09-29

But then how come your topology shows it is?? Can you send topology screenshot again? And also commands below from BOTH members?

Andy

If you've got time, let's do a remote session, since it's a lab

Let me know

[Expert@CP-FW-01:0]# cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 4
Required secured interfaces: 1

Interface Name: Status:

eth0 (LM) UP
eth1 (LM) UP
eth2 (LM) UP
eth3 (S) UP

S - sync, HA/LS - bond type, LM - link monitor, P - probing

Virtual cluster interfaces: 3

eth0 172.16.10.246
eth1 192.168.10.246
eth2 172.31.10.246

[Expert@CP-FW-01:0]# cphaprob -i list

There are no pnotes in problem state

[Expert@CP-FW-01:0]# cphaprob syncstat

Delta Sync Statistics

Sync status: OK

Drops:
Lost updates................................. 0
Lost bulk update events...................... 0
Oversized updates not sent................... 0

Sync at risk:
Sent reject notifications.................... 0
Received reject notifications................ 0

Sent messages:
Total generated sync messages................ 1458146
Sent retransmission requests................. 0
Sent retransmission updates.................. 0
Peak fragments per update.................... 1

Received messages:
Total received updates....................... 3962798
Received retransmission requests............. 0

Sync Interface:
Name......................................... eth3
Link speed................................... 1000Mb/s
Rate......................................... 25340 [Bps]
Peak rate.................................... 802520[Bps]
Link usage................................... 0%
Total........................................ 27707 [MB]

Queue sizes (num of updates):
Sending queue size........................... 512
Receiving queue size......................... 256
Fragments queue size......................... 50

Timers:
Delta Sync interval (ms)..................... 100

Reset on Sun Sep 22 12:15:49 2024 (triggered by fullsync).

[Expert@CP-FW-01:0]#

the_rock · ‎2024-09-29

Forgot to mention, IF topology shows something different, go to the SmartConsole cluster object, network and then edit topology, click "get interfaces WITHOUT topology", make sure it saves without errors, publish, install policy, test.

Andy

PhoneBoy · ‎2024-09-30

So an upstream router is basically providing the NAT here only for the public VIP?
I'm with @AkosBakos, we probably need to see fw ctl zdebug + drop output from the active member while trying to initiate communication from the secondary.

melcu · ‎2024-09-30

OOOook, so basically after one of my idiot colleagues deleted my VMs I fully reinstalled both the SMS and, this time, 4 gateways: 2 x R80.30 and 2 x R81.20.

R81.20 worked by default

R80.30 - I was back to square 1.

It seems that somehow, by setting fwha_silent_standby_mode to 1 on the Standby member, suddenly everything works.

But I've changed a lot of things: in the NAT rulebase I've included all 3 IPs (VIP + 2 gateways), and in the policy I've allowed everything outbound and everything inbound (hopefully the public IP is protected by an IPS profile upfront).

I'm now waiting for the gateways to reboot (as I flipped a lot of values with fw ctl set) and I will try to secure the policy (no full inbound, no full outbound).

I'll keep you posted.

Btw, fw ctl zdebug + drop didn't show anything.

the_rock · ‎2024-09-30

"OOOook . so basically after one of my idiot colleague deleted my VMs I fully reinstalled both SMS and this time 4 gateways: 2 x R80.30 and 2 x R81.20."

Don't say that mate...maybe he (she) had a good reason to do it, who knows. Anyway, glad you got it going.

Andy

Andrejs__Андрей · ‎2024-09-30

morning,

try setting these on both members:

fwha_silent_standby_mode = 0
fwha_cluster_hide_active_only = 0
fwha_forw_packet_to_not_active = 1
ccl_force_use_ccp = 1

regards,
Andrey.
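
The parameter list in the post above is written as name = value pairs, while the gateway takes each one as an `fw ctl set int <name> <value>` command in Expert mode. A small Python sketch of that conversion is below; the function name is made up for illustration, and the script only prints the commands, it does not run them.

```python
def to_fw_ctl_commands(param_block):
    """Turn 'name = value' lines into 'fw ctl set int name value' strings."""
    commands = []
    for line in param_block.strip().splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip anything that is not a name = value pair
        commands.append(f"fw ctl set int {name.strip()} {int(value)}")
    return commands


PARAMS = """
fwha_silent_standby_mode = 0
fwha_cluster_hide_active_only = 0
fwha_forw_packet_to_not_active = 1
ccl_force_use_ccp = 1
"""

for command in to_fw_ctl_commands(PARAMS):
    print(command)
```

Each value can be read back with `fw ctl get int <name>`. As far as I know, values set this way do not survive a reboot unless they are also added to `$FWDIR/boot/modules/fwkern.conf`, so it is worth re-checking them after the gateways restart.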

melcu · ‎2024-10-01

Morning,

In my lab it worked by flipping fwha_silent_standby_mode to 1 on the Standby member, but in their environment, when I asked them to do the same, not even the internal network worked.

Very strange. I will let TAC handle this but it's kind of annoying.

Andrejs__Андрей · ‎2024-10-01

OK, I will be interested to know which solution turns out better for their environment.

I provided the solution that I use on R81.10 open servers (dedicated HW and on VMware).

regards,
Andrey.

the_rock · ‎2024-10-01

I still find it a bit odd any of that would be needed. I created cluster in the lab who knows how many times and on different versions and NEVER had to do any of those values manually.

Andy

melcu · ‎2024-10-01

I completely agree with that! But who (I mean, I never did this) uses a private IP address that is translated to a public IP directly on the gateway and then routed to the internet? Either I had a public IP address directly on the gateway itself and the traffic was hidden behind that IP, or otherwise (like 99.5% of the time) I use private IP space and let the main router do NAT and whatever else it needs to do.

At some point in time I remember that I had a cluster using something similar to this, but it was in the age of R77 and some static ARP entries were needed on the edge router in order for those NAT IP addresses to be reached from the internet. It was mainly used for DNATs (so inbound connections).

Anyway tomorrow along with TAC I will have a deep look into the client's network and I will try to better understand what's wrong and how is this fixable 🙂

melcu · ‎2024-10-02

Mwahaha! :)) Spent 3 hours with TAC and we couldn't figure out what's wrong.

I'll keep you posted when I have the root cause.

the_rock · ‎2024-10-02

K, fair enough...keep us posted.

Andy

ScottF · ‎2024-10-04

I am also running R81.20 on dedicated VMware hardware, and up until you posted this my standby members could not access the internet.
What I was seeing was the standby member sending its internet-destined traffic, with its internet IP address as the source, to the active member over the sync interface (which is an internal VLAN), but that traffic was never seen on the active member. I ran fw ctl zdebug + drop on the active member as well as tcpdumps on each interface: nothing.

I ran these commands on both members and now my R81.20 standby members have internet access. What a relief. Thank you!

fw ctl set int fwha_silent_standby_mode 0
fw ctl set int fwha_forw_packet_to_not_active 1
fw ctl set int fwha_cluster_hide_active_only 0
fw ctl set int ccl_force_use_ccp 1

I have a TAC case open about this as well, and I posted this community link in the case with details of the success of the commands.
