EVSolovyev
Collaborator

Strange (periodic) packet loss

Good afternoon.

We have an HA cluster of 15600 appliances on R80.40 with the current JHF. We are facing a problem: some time after the active node is restarted, it starts losing packets. Everything works fine, then suddenly a couple of packets (usually 2) are lost. Rebooting the active device makes the problem go away for a while, but then it comes back.

Our setup: the CP cluster is connected via a 10 Gb SFP+ port on an interface card to the core switch, from which VLANs run throughout a large multi-storey building. The CP is the gateway for everything.

We looked for errors on the CP port toward the switch - none. We decided to find a small spare switch and connect a test segment of users through it for testing (we are still searching for one). fw ctl zdebug drop shows nothing. While we look for a switch, can you offer any ideas on debugging, please?

The screenshot shows pings to the gateway for this subnet (the CP's VLAN interface) and to Google's DNS server.

2022-01-14_102343.png

HeikoAnkenbrand
Champion

Hi @EVSolovyev 

I would check the following:
1) Is your internet connection OK before the firewall?
2) Check whether you have errors on the interfaces (RX-OVR, RX-ERR, RX-DRP) - a small loop to track these counters over time is sketched after this list:
     # netstat -in
3) Is multi-queueing enabled for the 10G interface?
     # mq_mng --show -v
4) Do you see high utilisation of the software interrupts (si) on the SecureXL instances?
     # fw ctl affinity -l   -> check which cores are used for SecureXL (interfaces)
     # top, then press 1    -> view the load on the SecureXL cores
5) Does running the "show asset" command return "Line Card Type: N/A" rather than properly identifying the installed 4-port 10GBase-F SFP+ line card?
    > show asset
6) Do you see any interface errors in the file /var/log/messages?
    # grep "NETDEV" /var/log/messages
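If you want to see whether those counters grow steadily or in bursts, here is a minimal watch loop - just a sketch, assuming expert mode and an arbitrary log path; adjust the interval to taste:

# append a timestamped snapshot of the interface counters every 60 seconds
while true; do
    echo "===== $(date) =====" >> /var/log/if_counters.log
    netstat -in >> /var/log/if_counters.log
    sleep 60
done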
   


➜ CCSM Elite, CCME, CCTE
EVSolovyev
Collaborator


Hello, @HeikoAnkenbrand

Thank you for the detailed answer.

1. I think the internet connection is fine - at the moment I see no dropped packets to the internet (about 2k ping packets).
2. There are no errors, but I do see drops:

2022-01-14_141751.png

3. Multi-queueing is enabled on 4 cores:

2022-01-14_142111.png

4. At times I see full utilization on some cores, but it is rare and I have not yet been able to catch the process that causes it. I think it is fw_worker.

2022-01-14_142803.png

  

2022-01-14_143014.png

5. No, all is good:

2022-01-14_143532.png

6. No, there is nothing there:

2022-01-14_143749.png

But....:

photo_2022-01-13_18-43-48.jpg

What can it be?

HeikoAnkenbrand
Champion

You can see that the RX-DRP counters are high. If the drops keep increasing, it means the SNDs can no longer handle the traffic.
As a first step I would give the system more SND cores - change from 4 to 6. More cores will then also be used for multi-queueing. A rough sketch of the procedure is below.
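For reference, this is how the split is usually changed on R80.40 - the menu wording differs between builds and the core numbers are only an example, so verify on your own appliance before changing anything:

# 1) Check the current SND/worker split
fw ctl affinity -l

# 2) Free cores for the SND by reducing the number of CoreXL firewall instances.
#    Example only: on a 16-core box, going from 12 to 10 workers leaves 6 cores
#    for the SND - adjust to your actual core count.
cpconfig
#    -> choose "Configure Check Point CoreXL"
#    -> lower the number of firewall instances
#    -> exit, fail over, and reboot the member

# 3) After the reboot, confirm that multi-queue picked up the extra cores
mq_mng --show -v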



➜ CCSM Elite, CCME, CCTE
genisis__
Leader

I agree with that; we increased our SND allocation to 6 cores, and we are running 15600s as well.

EVSolovyev
Collaborator

I'm sorry, but I don't understand your advice.... My SND configuration:

2022-01-14_204813.png

Here we can see:

Best Practice - We recommend to allocate an additional CPU core to the CoreXL SND only if all these conditions are met:

  • There are at least 8 processing CPU cores.

  • In the output of the top command, the idle values for the CPU cores that run the CoreXL SND instances are in the 0%-5% range.

  • In the output of the top command, the sum of the idle values for the CPU cores that run the CoreXL Firewall instances is significantly higher than 100%.

If at least one of the above conditions is not met, the default CoreXL configuration is sufficient.

When I look at cpview for the CPUs (screenshots are above), I see that the CPUs running at about 100% utilization are mapped to fw_workers. I never see 100% utilization of the SND cores.

And I'm sorry, but I can't understand how to add 2 cores from fw to SND.... Do I just need to decrease the number of fw_workers in cpconfig, and the freed cores are automatically added to the SND after rebooting? Or do I need to increase the number of SND cores some other way, which I have not yet been able to find?

genisis__
Leader

Are these VSX appliances using virtual switches?

EVSolovyev
Collaborator

VSX is disabled and not used.

genisis__
Leader

OK, thanks. I have a similar issue with VSX and VSWs, but as you're not running VSX there's no point bringing that to the table.

Can you confirm which Jumbo you're running? I would ensure you're running at least JHF Take 125 (GA is Take 139).

EVSolovyev
Collaborator

2022-01-14_135716.png

genisis__
Leader

Looks OK to me. I know we had issues with DNS packets, and as soon as we updated to that Jumbo the issue was resolved.

Can you also confirm that no debugging is running ('fw ctl debug 0' resets it), so we can be sure it's not a resource issue caused by that.

Additionally, do the silly checks like duplex settings. Run this one-liner (it prints driver, speed and duplex for each physical interface):

ifconfig -a | grep encap | awk '{print $1}' | grep -v lo | grep -v bond | grep -v ":" | grep -v ^lo | xargs -I % sh -c 'ethtool %; ethtool -i %' | grep '^driver\|Speed\|Duplex\|Setting' | sed "s/^/ /g" | tr -d "\t" | tr -d "\n" | sed "s/Settings for/\nSettings for/g" | awk '{print $5 " "$7 "\t " $9 "\t" $3}' | grep -v "Unknown" | grep -v "\."

 

and of course what Heiko has suggested to check.

EVSolovyev
Collaborator

Duplex was the first thing we checked. ) I see no high CPU utilization.

2022-01-14_150442.png

Alex-
Advisor

You mention it happens on the active device. Is it always the same device displaying this behavior, or does it happen on whichever machine becomes active after a reboot?

In addition to everything that was explained here, you might want to do a failover and run the hardware diagnostics tool during a maintenance window. It could indicate whether it's an issue with your NICs.

EVSolovyev
Collaborator

It happens on whichever machine becomes active after a reboot, but not immediately - only after some time, maybe hours or days. In the CP cluster we have 2 devices, but the core switch is a single device with multiple line cards.

genisis__
Leader

I have a similar issue where I get packet loss about 4 weeks after a reboot, but that is on a VSX system, and R&D have confirmed a bug.

You may want to log a TAC case, just in case it's a bug; additionally, perhaps install the latest GA Jumbo, as TAC will likely ask for this to be done.

Timothy_Hall
Champion

First off you need to make sure your cluster is stable, as losing 2 ping packets in a row will generally happen when there is a non-graceful failover between members.  What is the failover count shown by cphaprob state from expert mode?

RX-DRPs could be the source of the drops, but those could also be non-IPv4 packets hitting the interface and being dropped.  Please provide updated output of netstat -ni along with ethtool -S eth3 so we can distinguish between ring-buffer drops and unknown-protocol drops.  Given that you are getting them on all interfaces, they are probably unknown-protocol drops.  If they are actually ring-buffer drops, you can look at the history of when that counter increments with sar -n EDEV to see whether it is creeping up slowly or coming in clumps when you experience the loss.  A quick counter-diff sketch follows below.
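A quick way to tell the two apart - a sketch only; the exact ethtool -S counter names depend on the NIC driver, so check what your interface actually exposes:

# snapshot the counters, wait a minute, snapshot again
ethtool -S eth3 > /tmp/ethtool_1.txt; netstat -ni > /tmp/netstat_1.txt
sleep 60
ethtool -S eth3 > /tmp/ethtool_2.txt; netstat -ni > /tmp/netstat_2.txt

# show only the counters that moved in between
diff /tmp/ethtool_1.txt /tmp/ethtool_2.txt
diff /tmp/netstat_1.txt /tmp/netstat_2.txt

# If RX-DRP grows while the driver's ring-buffer counters (e.g. rx_missed_errors
# or rx_no_dma_resources, depending on the driver) stay flat, the drops are most
# likely unknown-protocol packets rather than buffer exhaustion.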

The high CPU utilization on some workers could be caused by elephant flows, and any "mice" trapped on a worker core with an elephant flow will be degraded and possibly lose packets.  Any elephant flows in the last 24 hours reported by running fw ctl multik print_heavy_conn?  (A simple logging loop for this is sketched below.)
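If you want to correlate the loss with elephant flows, you could timestamp the heavy-connection report in a loop - a sketch, with an arbitrary log path and interval:

# log the heavy-connection table every 5 minutes
while true; do date; fw ctl multik print_heavy_conn; sleep 300; done >> /var/log/heavy_conns.log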

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
EVSolovyev
Collaborator

Hello.

Thank you for the answer.

Yes, the cluster is stable. "What is the failover count shown by cphaprob state from expert mode?" The info is in the screenshots below. The 15 failovers are due to colleagues rebooting the active device when the losses start to occur; a reboot solves the problem for a few days.

2022-01-17_141618.png

2022-01-17_142206.png

2022-01-17_142832.png

Yes, sometimes we do have high-traffic connections:

2022-01-17_143040.png

Timothy_Hall
Champion

The constant rate of RX-DRP reported by sar would seem to indicate the presence of non-IPv4 protocols, and probably not packet loss due to the SNDs being overloaded; please provide the output of ethtool -S eth1-01 to be sure.

It looks like you have quite a few elephant flows squashing "mice" connections when they get trapped on a worker core with an elephant flow, which could be the source of your packet loss.  Make sure that priority queueing is enabled for when workers get fully loaded by running this command: fw ctl multik prioq

Beyond that, you'll need to upgrade to R81 or later to take advantage of the pipeline paths, which can spread the processing of an elephant flow across multiple worker cores.

Also please provide the output of fw ctl pstat just in case it is a resource limitation on the firewall.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
EVSolovyev
Collaborator

Hello.

Sorry for the delay in replying - my colleagues rebooted the devices and the problem went away.

There are no losses now, but there are delays. Pinging the gateway (the CP) from this machine:

Снимок.PNG1.PNG

On the SND cores I am not able to catch any high load, but I can see the ceiling being hit on the workers.

2022-01-21_153023.png

2022-01-21_152927.png

But I see no elephant flows in cpview:

2022-01-21_153651.png

ethtool -S eth1-01:

2022-01-21_153209.png

fw ctl multik prioq:

2022-01-21_153839.png

fw ctl pstat:

2022-01-21_154048.png

I have read the release notes for R81 but found no mention of this. Can you please tell me where you got this information?

genisis__
Leader

As a symptom this sounds exactly like my issue: after a reboot the packet loss goes away for about 5 to 6 weeks.

Can you get Check Point TAC to check that you are not experiencing this bug: PRJ-25443
Timothy_Hall
Champion

Yeah, you'll need to run these commands again when the cluster is having issues, as I don't see a smoking gun in those outputs.  The only slightly unusual thing is the rx_no_dma_resources drops on your interface, but there probably aren't enough of them to be significant.

The pipeline paths, which are enabled by default in R81, aren't really documented; I learned about them on a call with R&D.  🙂  However, if you are not having elephant-flow issues, I don't think the pipeline paths will help much, but we'll need to identify what the actual issue is before making that determination.

Generally, if firewall performance degrades over time, it is some kind of occasional failure to free a resource that eventually starts to run short.  The fw ctl pstat output would probably be most relevant if that is the case; a simple way to collect it over time is sketched below.
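Since the degradation takes days to show up, it may help to log fw ctl pstat periodically so a healthy snapshot can be compared against one taken while the loss is occurring - a minimal sketch with an arbitrary path and interval; nohup keeps it running after you log out:

# append a timestamped resource snapshot every 15 minutes
nohup bash -c 'while true; do date; fw ctl pstat; sleep 900; done' >> /var/log/fw_pstat.log 2>&1 &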

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Dolev
Employee

Hi @EVSolovyev,

Do you have, or have you had, an SR opened with Check Point Support on this issue by any chance? If so, can you reply to me with the SR#?
Thanks,

Dolev

EVSolovyev
Collaborator

Good afternoon.

No, we have not opened an SR at this time. The problem is that our support contract ran out, and we did not renew it in time. The purchase process is now underway; we are a state company, and such processes are very slow. If we had tech support, I would have opened an SR right away.

