Route propagation through virtual switch issues

S_K_S · ‎2021-12-29

Hi all and happy holidays!

I have a weird issue with a newly provisioned VSX cluster (2 appliances in VSLS mode, R80.40 with the latest JHF take). Several virtual firewalls are already configured and linked through a virtual switch. The local interfaces of each VS (for example, bond0.10 of VS2, bond1.20 of VS3 and so on) are all propagated through the virtual switch and are reachable from the other firewalls: if I ping an interface belonging to VS3 from VS2 through the virtual switch, that works fine. However, static routes configured on each VS and also propagated through the virtual switch don't seem to work, even though they appear in the routing tables of the other virtual firewalls.

We have tested one such static route from the VS that propagates it, and we know it works there because a remote host in the respective subnet responds to ping, but the same ping from any VS to which that route is propagated fails. I have run tcpdump on the wrp interface of each VS and can see only traffic directed at local subnets defined as interfaces on the other virtual systems, but no traffic from or to remote subnets defined and propagated as static routes. So it looks like the route propagation just doesn't work, for whatever reason, which is weird. Has anybody encountered such a problem?

S_K_S · ‎2021-12-29

And just to add: all policies have been installed multiple times (including the VS0 policy), and we have tried removing and re-adding some of the problematic routes, to no avail. A reboot also didn't help.

genisis__ · ‎2021-12-29

Sounds like a TAC case. What does 'ip route get' tell you about one of the added routes (at VS level in expert mode)? Have you tried deleting a route and re-adding it via SmartConsole or the VSX provisioning tool?

One thing I've seen (on R77.x VSX) is a similar issue, but in that case the route was added in SmartConsole but not visible at the command line.

In that specific case rebooting the VSX nodes resolved it. I've not seen this type of issue on R80.x.

What does the MAC table look like on another VS where the route has been learned, i.e. does it have the correct ARP entries?

S_K_S · ‎2021-12-29

So far we have tried deleting and re-adding one of the non-working routes through SmartConsole, then updating the topology and installing the policy across all firewalls. The route shows up in the other VSs with netstat -rn or route, and with ip route get the next hop is correct: the wrp interface of the respective VS and the IP address of the VS propagating the route. The ARP table of each VS contains the MAC addresses of the wrp interfaces of the other VSs, but you need to ping them first, because the entries expire when there is no traffic. At the moment, the only traffic that should flow is from and to one of the subnets that should be propagated via a static route; that's where the monitoring servers reside (there is also another propagated subnet for DNS which has the same problem, but we have not tried the delete/re-add gymnastics with it).

Also, if we run a ping from the monitoring server in that subnet to the wrp interface of a VS which should receive the propagated route to the server (and does receive it, according to the routing table), tcpdump on the wrp interface of that VS sees nothing, while the ICMP packets are visible on the wrp interface of the VS propagating the route. Everything is allowed in the policy, fw ctl zdebug drop doesn't show any relevant drops on either the source or the destination VS, and there is nothing useful in the SmartConsole logs either. So it looks entirely like a route propagation problem, or rather something related to it.

genisis__ · ‎2021-12-30

Very strange. What does fw monitor report (do this for each VS and the VSW)? Either way, it sounds like TAC needs to get involved.

Also, are you running Jumbo Take 139? And when you installed R80.40, was it an in-place upgrade from an older version or a clean build?

Chris_Atkinson · ‎2021-12-30

Out of interest what's the subnet / mask involved?

CCSM R77/R80/ELITE

S_K_S · ‎2021-12-30

So far we have tested 3 non-working routes - two are /32, one is /24 from the 10.0.0.0/8 space.
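The mask matters because a routing lookup always prefers the most specific matching prefix, so a /32 and an overlapping /24 can resolve to different next hops. A minimal Python sketch of that longest-prefix selection follows; the prefixes and interface names are made up for illustration and are not taken from this setup.

```python
import ipaddress

# Hypothetical routing table: (prefix, outgoing interface).
ROUTES = [
    ("0.0.0.0/0", "eth0"),
    ("10.20.30.0/24", "wrp128"),
    ("10.20.30.40/32", "bond0.10"),
]

def best_route(destination, routes):
    """Return (prefix, interface) of the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, interface in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, interface))
    if not matches:
        return None
    # The most specific route is the one with the largest prefix length.
    network, interface = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), interface

if __name__ == "__main__":
    for host in ("10.20.30.40", "10.20.30.41", "192.0.2.1"):
        print(host, "->", best_route(host, ROUTES))
```

Comparing this kind of lookup with what ip route get reports on each VS is one way to rule out an overlapping, more specific route.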

The VSX is on a "re-purposed" 15600 appliance which used to be a standard gateway (R80.10). We installed R80.40 on it, I think with JHF 48, then updated to JHF 139. All the tests done so far are on JHF 139. I've been thinking about downgrading to a lower JHF version to check whether this makes a difference: we have several VSX systems with an identical setup on JHF 91 which work fine.

fw monitor shows the same thing as tcpdump, plus the inbound and outbound interfaces (which are correct). Running it with the same filter on either VS shows output from the VS which propagates the route and nothing from the VSs which "receive" it. (I think it has behaved like that since earlier versions: at least as far back as R77, I've seen fw monitor capture packets from, say, VS 5 while being executed on VS 10, as long as there is traffic on VS 5 which matches the filter.) On the virtual switch I see packets coming in on the inbound wrp interface but nothing on the outbound one; same with tcpdump.

Something interesting I've noticed while running tcpdump specifically for ARP on the wrp interface of the destination VS is that there is a significant delay between the request and the response to each ARP query: several seconds, sometimes even more than 10 it seems. That's for wrp interfaces of VSs which are active on the same VSX, i.e. not going to the other appliance in the VSLS pair. Also, I've tried pinging the default gateway of VS1 (the VRRP address of an upstream switch) from VS2 through the virtual switch and this didn't work, while pinging the interface of VS1 leading to that gateway from VS2 through the virtual switch works. It looks like some internal communication problem: the only addresses which are reachable between the VSs are the internal ones, and nothing outside of the VSX is reachable unless it's accessed through the VS which has a direct route, not a propagated one.
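To put numbers on the ARP delay, the gap between each request and its reply can be computed from a capture. The following is a minimal Python sketch, assuming the capture was taken with tcpdump -tt (epoch timestamps) and that the lines resemble the hypothetical samples below; real output may be formatted differently, for example when -e is used.

```python
import re

# Patterns assume tcpdump -tt style lines; adjust them to the actual capture.
REQUEST = re.compile(r"^(\d+\.\d+)\s.*ARP, Request who-has (\d+\.\d+\.\d+\.\d+)")
REPLY = re.compile(r"^(\d+\.\d+)\s.*ARP, Reply (\d+\.\d+\.\d+\.\d+) is-at")

def arp_delays(lines):
    """Return [(ip, seconds)] from the first request for an IP to its reply."""
    pending = {}  # ip -> timestamp of the first unanswered request
    delays = []
    for line in lines:
        match = REQUEST.match(line)
        if match:
            # Keep the earliest request so retransmissions do not hide the delay.
            pending.setdefault(match.group(2), float(match.group(1)))
            continue
        match = REPLY.match(line)
        if match and match.group(2) in pending:
            started = pending.pop(match.group(2))
            delays.append((match.group(2), round(float(match.group(1)) - started, 3)))
    return delays

# Made-up sample lines, not taken from the capture discussed in this thread.
SAMPLE = [
    "1640780000.100000 ARP, Request who-has 10.1.1.2 tell 10.1.1.1, length 28",
    "1640780001.100000 ARP, Request who-has 10.1.1.2 tell 10.1.1.1, length 28",
    "1640780003.600000 ARP, Reply 10.1.1.2 is-at 00:12:c1:00:00:02, length 28",
]

if __name__ == "__main__":
    for ip, seconds in arp_delays(SAMPLE):
        print(f"{ip}: reply after {seconds} s")
```

Timing from the first request rather than the last one is a deliberate choice here, since retries would otherwise make a slow responder look fast.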

genisis__ · ‎2021-12-30

TAC needs to get involved. Also, please confirm whether the 15600 was actually rebuilt from scratch, or whether it originally ran R80.10 and was then upgraded in place to R80.40.

I ask because when I had an issue with R80.40, TAC recommended a clean build (by the way, we have 15600s as well, and of course the issues we faced are not identical). It sounds a little daunting, but it's not. If you have the option, please do consider doing a clean build of R80.40 and then installing the Jumbo. Clearly, TAC should be the primary point of contact for any action plan.

_Val_ · ‎2021-12-30

Please open a TAC case for this.

HeikoAnkenbrand · ‎2021-12-30

Hi @S_K_S,

We see the same and other problems with routing between VSs over virtual switches at some customers, and we have several TAC cases open for them. One workaround was to turn off SecureXL for both VSs. This can cause performance problems, so it is only an emergency solution.

I would open a TAC case as described by @_Val_.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

genisis__ · ‎2021-12-30

It's interesting: we are all running R80.40 in VSX mode, and all of us have had connectivity issues going through a virtual switch.

In my case the system works fine for ages, then we start to get packet loss for any VSs going through a VSW. A reboot resets the issue.

In our case a number of bugs were identified. We have two bugs remaining: T139 takes care of one of them, and the other, which relates to PIM, is not yet integrated (I have requested that it be incorporated in a near-future Jumbo).

It's been about 3 weeks since we installed a Jumbo, so if our issue has not been resolved, we should get a recurrence within the next few weeks.

VSs that do not utilise a VSW have never seen an issue.

So there is a common thread amongst us all, and that is the use of a VSW.

S_K_S · ‎2021-12-31

We have already tried disabling SecureXL on the affected VS's (one of the first things to check as it's the usual suspect in such cases) but this didn't help.

We'll see whether the downgrade to a lower JHF take achieves anything, probably in the first week of January, as everyone monitoring the project is off at the moment. If nothing changes, we'll open a case with TAC.

genisis__ · ‎2021-12-31

We are currently running JHFA125, so far no reported issues in over 3 weeks.

Alex- · ‎2021-12-31

I've had my fair share of routing and DHCP relay issues in R80.20/30/40 VSX, especially with VSW. Most of them were solved with a kernel value change or a private hotfix, until something else popped up.

In the end, I decided to go to R81 T44 with a fresh install and it works perfectly, no such issues since.
