Packet Loss over Site-to-Site VPN

Wyman · ‎2021-05-28

Hi. We've started to have some packet loss issues between two of our offices. Office A has R80.40 gateways and Office B has R80.30 gateways. Users in Office B have reported that they cannot print (the print server is in Office A).

Office B isn't getting these issues with our other office, which is running R80.30. We ran pathping and can see that packet loss occurs at the Office A side of the tunnel when the packet gets to the external VIP of our cluster. Pinging from A to B shows packet loss as soon as the packet hits the internal VIP of the gateway.

Apart from the cluster upgrade, which happened last week, no other changes have been made. This particular issue only seems to have started yesterday so we're not quite sure why this is. Could it be caused by the differences in OS on each side of the tunnel?

PhoneBoy · ‎2021-05-28

It could be MTU issues or similar.
Packet captures from the relevant gateway might give you an idea what’s going on.

Timothy_Hall · ‎2021-05-29

As PhoneBoy said, this sounds like an MTU issue; I would assume printing will send max-size packets that would run afoul of a low MTU somewhere. First off, check the MTU on all interfaces of the upgraded members and make sure they are 1500, and verify the rest of the interface settings too. It is doubtful that Gaia 3.10 is the cause of your issue, but it is a newer OS with updated drivers, and some interface settings may not have quite made it through the upgrade.
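One way to test the MTU theory is to send pings with the don't-fragment (DF) bit set and narrow down the largest packet that gets through. Below is a minimal sketch of that search, not a Check Point tool. `probe` is a hypothetical callback you would have to supply yourself: it should send a DF-set ping with the given ICMP payload size and return True if a reply comes back. The 28-byte overhead assumes an IPv4 header without options (20 bytes) plus the ICMP echo header (8 bytes).

```python
from typing import Callable

IP_ICMP_OVERHEAD = 20 + 8  # IPv4 header (no options) + ICMP echo header


def payload_for_mtu(mtu: int) -> int:
    """ICMP payload size that yields an IPv4 packet of exactly `mtu` bytes."""
    return mtu - IP_ICMP_OVERHEAD


def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Binary-search the largest packet size (bytes) for which `probe` succeeds.

    `probe(payload)` is a hypothetical hook, not a real API: it should send a
    DF-set ping carrying `payload` bytes and report whether it was answered.
    """
    if not probe(payload_for_mtu(low)):
        raise ValueError(f"even {low}-byte packets fail; this is not just an MTU problem")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(payload_for_mtu(mid)):
            low = mid        # a packet of size mid passes, so the limit is at least mid
        else:
            high = mid - 1   # a packet of size mid fails, so the limit is below mid
    return low
```

For an interface MTU of 1500, `payload_for_mtu(1500)` gives 1472, which is the payload size to try first. On a link that is already dropping packets, a single lost reply would mislead the search, so a real `probe` should retry a few times before reporting failure.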

How exactly was the upgrade performed at Office A? In place with CPUSE, or a reimage/new box with a reconfiguration by hand? Either way, you could have lost your fwkern.conf file (or others) that may have had some kind of MTU mitigation settings in it. See this thread for a list of files whose customizations may have been lost during the upgrade and that you will need to reintroduce: https://community.checkpoint.com/t5/Security-Gateways/Hand-edited-Files-to-Check-After-Gateway-Upgra...

Also, this SK sounds similar to your issue. What Jumbo HFA level are you running on your R80.40 boxes?

sk167953: Traffic is dropped with "dropped by fwmultik_process_f2p_cookie_inner Reason: fwmultik_f2p...


the_rock · ‎2021-05-29

That's a tough one to figure out. PhoneBoy and Tim made good points, though I would find it a bit odd if the upgrade had caused any issues with MTU. I don't think the OS version matters at all; I see people still running VPN tunnels between R77.30 and R81 gateways with no issues. Personally, I would contact TAC, just to verify that nothing in the config changed on the upgraded cluster.

Best,
Andy

genisis__ · ‎2021-05-30

As the_rock suggests, this sounds like a TAC case, but it is also worth gathering the items below to submit to TAC.

Ensure you're running JHFA118 on R80.40 (latest GA release); on R80.30 it is JHFA228 (you may want to consider the latest ongoing take, JHFA236, which has been out almost 3 weeks now, so it's getting close to GA).

Gather cpinfos from Office A and Office B
tcpdumps from Office A and Office B
Do some VPN debugging on both sites:
vpn debug trunc
vpn debug on TDERROR_ALL_ALL=5

replicate issue

vpn debug off
vpn debug ikeoff

Collect $FWDIR/log/vpnd.elg from both devices.

Ensure Path MTU Discovery is working. (I think there was a post about MTU discovery and allowing inbound ICMP type 3 code 4, but I would allow it specifically to the gateways rather than opening generic inbound access to them.)
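For anyone checking captures for those ICMP messages, here is a minimal sketch of how a "fragmentation needed" message (type 3, code 4) can be recognised and its next-hop MTU read, following the RFC 1191 layout of the ICMP header. The function name is made up for illustration, and the input is assumed to be the raw ICMP message with the IP header already stripped.

```python
import struct
from typing import Optional

ICMP_DEST_UNREACHABLE = 3  # ICMP type: destination unreachable
ICMP_FRAG_NEEDED = 4       # ICMP code: fragmentation needed and DF set


def next_hop_mtu(icmp_message: bytes) -> Optional[int]:
    """Return the next-hop MTU from an ICMP type 3 code 4 message, else None.

    Header layout per RFC 1191: type (1 byte), code (1 byte), checksum (2),
    unused (2), next-hop MTU (2), all in network byte order.
    """
    if len(icmp_message) < 8:
        return None  # too short to hold a full ICMP header
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp_message[:8])
    if icmp_type == ICMP_DEST_UNREACHABLE and code == ICMP_FRAG_NEEDED:
        return mtu  # may be 0 if the sending router predates RFC 1191
    return None
```

If these messages never show up in the tcpdumps while large DF-set packets are being dropped, that would suggest something on the path is filtering them.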

By the way, is Office A a virtual system or a physical gateway?

I think with the above you will be giving TAC a lot to go on, so they can escalate quickly if required.

the_rock · ‎2021-05-30

All very valid points, and I'm positive that's what TAC would ask him for anyway : )

Best,
Andy

Wyman · ‎2021-06-02

Thanks for the tips, everyone. The performance looks to have improved but I will keep this for future reference.

genisis__ · ‎2021-06-02

What was the issue and resolution?

Wyman · ‎2021-06-02

Hi. It looks to have been caused by high bandwidth utilisation.
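For future reference, utilisation can be estimated from two readings of an interface byte counter taken a known interval apart. This is only a rough sketch: the function name and parameters are illustrative, and it assumes the counter did not wrap or reset between the two readings.

```python
def link_utilisation(bytes_start: int, bytes_end: int, interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity used between two byte-counter samples.

    Assumes the counter is monotonic over the interval (no wrap or reset).
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    if bytes_end < bytes_start:
        raise ValueError("counter went backwards; it probably wrapped or was reset")
    bits_sent = (bytes_end - bytes_start) * 8
    return bits_sent / (interval_s * link_bps)
```

For example, 112,500,000 bytes in 10 seconds on a 100 Mbit/s link works out to 0.9, i.e. 90% utilisation.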

the_rock · ‎2021-06-02

Thanks for letting us know, that's interesting.

Best,
Andy
