We have (or rather had) some serious issues with our gateways: the latency through the FW modules grew and grew over time. We saw latencies *within* our network climb from the usual 1-2 ms up to 130 ms whenever traffic passed the firewall. Once the latency reached a certain point, it all collapsed ( ! ): the modules spontaneously stopped forwarding any traffic at all. Unfortunately, without any HA failover. The cluster just kept running without passing new traffic, which is a rather bad condition for a central firewall.
So after weeks of debugging and testing we found a hint pointing to a kernel bug in the underlying Linux: when the firewall had to issue ICMP unreachable packets (which in our case apparently was an unusual traffic pattern), the kernel leaked memory. This led to growing memory consumption, which 1) caused the growing latency, and 2) eventually caused the collapse: once memory was full, all CPUs went to 100%. We also found that ongoing connections were not affected when it all collapsed, which was most probably the reason the HA mechanisms never triggered: the cluster members just exchanged their traffic as before.
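For anyone wanting to watch for this on their own boxes: on the 2.6-era kernels in use at the time, the dst/route cache entry count and its limit are exposed under /proc, so a small watcher can show the leak building up long before the collapse. A minimal Python sketch; the paths match older kernels, and the 90% warning threshold is my own assumption, not anything Check Point ships:

#!/usr/bin/env python3
"""Sketch: watch the kernel's dst/route cache fill up.

Assumes an older 2.6-era kernel that still exposes
/proc/net/stat/rt_cache; on such kernels the "dst cache overflow"
message fires once the entry count exceeds net.ipv4.route.max_size.
"""
import time

RT_CACHE_STAT = "/proc/net/stat/rt_cache"       # per-CPU stats, hex columns
ROUTE_MAX_SIZE = "/proc/sys/net/ipv4/route/max_size"

def dst_entries():
    with open(RT_CACHE_STAT) as f:
        lines = f.read().splitlines()
    # First column ("entries") is the global total, repeated on every CPU row.
    return int(lines[1].split()[0], 16)

def max_size():
    with open(ROUTE_MAX_SIZE) as f:
        return int(f.read())

if __name__ == "__main__":
    limit = max_size()
    while True:
        used = dst_entries()
        pct = 100.0 * used / limit
        print(f"dst cache: {used}/{limit} entries ({pct:.1f}%)")
        if pct > 90:   # illustrative threshold
            print("WARNING: approaching 'dst cache overflow' condition")
        time.sleep(60)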
We then had to convince CP Support/R&D that this really was our problem. We now have a bugfix for it, running on one module, and the memory leak seems to be gone.
How we discovered the latency issue: we use a graphing tool called SmokePing, where latency is shown as "smoke" in the graphs. So we could see the growing latency once we knew where to look. Our quick fix for the issue was simply to reboot the cluster nodes every 5 days and fail over to the freshly rebooted node.
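For illustration, here is roughly what such a probe does, as a minimal Python sketch: repeated pings through the firewall, so a slowly rising baseline and growing spread become visible. The target address and probe counts are placeholders, and this is of course no replacement for SmokePing itself:

#!/usr/bin/env python3
"""Sketch of a SmokePing-style probe: several pings per round,
logging median RTT, spread ("smoke") and loss."""
import re
import statistics
import subprocess
import time

TARGET = "10.0.0.1"   # placeholder: a host on the far side of the firewall
PROBES = 10           # pings per measurement round

def ping_rtt_ms(host):
    """Return one RTT in ms, or None on loss/timeout."""
    out = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    return float(m.group(1)) if m else None

while True:
    rtts = [r for r in (ping_rtt_ms(TARGET) for _ in range(PROBES)) if r is not None]
    if rtts:
        print(f"median {statistics.median(rtts):.1f} ms, "
              f"spread {max(rtts) - min(rtts):.1f} ms, "
              f"loss {PROBES - len(rtts)}/{PROBES}")
    time.sleep(300)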
We have an R80.10 FW cluster, both on the modules and the management; we upgraded in November 2017, and the error first occurred in January 2018. The error line logged when it collapsed was
Feb 1 23:45:12 2018 fwm-name1 kernel: dst cache overflow
The underlying problem in the Linux kernel was CVE-2009-0778.
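In hindsight, even a trivial log watcher would have alerted us on that kernel message much earlier. A minimal sketch, assuming syslog lands in /var/log/messages as it does on our boxes:

#!/usr/bin/env python3
"""Sketch: tail the syslog and alert on the kernel message above.
Adjust LOGFILE if your syslog is configured differently."""
import time

LOGFILE = "/var/log/messages"
PATTERN = "dst cache overflow"

with open(LOGFILE) as f:
    f.seek(0, 2)                 # start at end of file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if PATTERN in line:
            # Hook your real alerting in here (mail, SNMP trap, ...).
            print("ALERT:", line.strip())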
Maybe this helps someone to identify such errors.