Re: ClusterXL Not Automatically Failing Over

John_Pinegar · ‎2019-05-21

Appliances: (2) 5400, 16GB RAM, Gaia R80.10

I have been experiencing this issue for over 18 months and haven't made progress with TAC. I am currently running R80.10 and saw the same behavior on R77.30 (I upgraded to R80.10 in an attempt to resolve it).

Description: When physical memory consumption approaches 16GB, traffic begins to drop. Running 'fw ctl zdebug drop' shows many 'Reason: PSL Drop: TCP segment out of maximum allowed sequence.' errors. If I'm lucky enough to catch things at this point, I can manually fail over to the standby node and the issue is immediately resolved. If I don't, the primary node eventually stops passing traffic and does not automatically fail over to the standby node. I cannot get in or out of my network, and I cannot remotely manage the gateway without using the lights-out port (which I added because of this issue).

This cluster is in my HQ office, and all 26 remote locations are in a VPN community with it (the remote locations are 1450 appliances running R77.20.86). When this issue occurs, everyone in the company is impacted.
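A threshold check along these lines could give some warning before memory reaches the point where traffic starts to drop, leaving time for a manual failover. This is only a minimal sketch, not a Check Point tool: it assumes a Linux-style /proc/meminfo (Gaia is Linux-based), and the function names and the 90% threshold are hypothetical choices for illustration.

```python
# Minimal sketch: flag high physical memory use from /proc/meminfo-style text.
# parse_meminfo, used_fraction, needs_attention and WARN_FRACTION are
# illustrative names, not part of any Check Point tooling.

WARN_FRACTION = 0.90  # assumed threshold; adjust for your environment


def parse_meminfo(text):
    """Return a dict of field name -> numeric value (kB for size fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_fraction(fields):
    """Fraction of RAM in use, not counting free memory, buffers, or page cache."""
    total = fields["MemTotal"]
    reclaimable = (
        fields.get("MemFree", 0)
        + fields.get("Buffers", 0)
        + fields.get("Cached", 0)
    )
    return (total - reclaimable) / total


def needs_attention(text, threshold=WARN_FRACTION):
    """True when the used fraction is at or above the threshold."""
    return used_fraction(parse_meminfo(text)) >= threshold
```

Feeding needs_attention() the contents of /proc/meminfo on a schedule and alerting when it returns True is one way to use it; what to do at that point (for example, a manual failover) is left to the operator.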

QoS definitely has an impact on this issue. Memory usage climbs by 1GB/day with QoS enabled and by about 100MB/day with QoS disabled, so with QoS disabled the issue occurs much less frequently. With QoS enabled, I've got about a week before it occurs. In the past, after manually failing over, I would reboot the non-active node. Last week I tried something different: I failed over to the standby (cpstop && cpstart) and, once the primary was showing 'standby', I failed back over. Two days later, at some point after business hours, the primary stopped passing traffic and didn't fail over.
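The growth rates above can be turned into a rough time-to-exhaustion estimate. The sketch below is plain arithmetic: the 7GB of headroom is an assumption inferred from "about a week" at 1GB/day, not a measured figure, and the function name is hypothetical.

```python
# Rough projection: days until the leak consumes the remaining headroom.
# HEADROOM_MB is an assumption (about a week at 1GB/day), not a measurement.

HEADROOM_MB = 7 * 1024


def days_until_exhaustion(headroom_mb, growth_mb_per_day):
    """Days until the headroom is used up at a constant daily growth rate."""
    if growth_mb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return headroom_mb / growth_mb_per_day


with_qos = days_until_exhaustion(HEADROOM_MB, 1024)    # 1GB/day with QoS enabled
without_qos = days_until_exhaustion(HEADROOM_MB, 100)  # about 100MB/day with QoS disabled

print(f"QoS enabled:  {with_qos:.0f} days")
print(f"QoS disabled: {without_qos:.0f} days")
```

Under these assumptions, disabling QoS stretches the interval from about a week to roughly ten weeks, which fits "much less frequently" but only delays the problem if memory keeps climbing.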

I find it hard to believe that I'm the only one experiencing this issue. If anyone has any ideas, I'd greatly appreciate the help.

PhoneBoy · ‎2019-05-22

I suspect the memory leaks are the real issue here. Has any work been done on the TAC case(s) around that?

John_Pinegar · ‎2019-05-23

We have performed memory leak tests at TAC's request and have not found anything definitive.
