Stefano_Cappell
Participant

Short communications disruptions

We are experiencing short (30 sec to 2 min) communication disruptions, where all connectivity is lost and the main cluster member no longer responds (while the standby member still does).

Looking through /var/log/messages, we can see a pattern. Every time there is something like:

Starting CUL mode because CPU-02 usage (81%) on the local member increased above the configured threshold (80%).

Then multiple logs like:

 cerbero1 kernel: [fw4_1];[censored_public_ip:44288 -> Censored_public_ip:53] [ERROR]: malware_res_rep_rad_query:     rad_kernel_api_async_get_resource() failed with error: Service is down

And then:

cerbero1 kernel: [fw4_1];CLUS-120202-1: Stopping CUL mode after 80 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

 

What may cause this problem?

 

Thanks in advance

15 Replies
_Val_
Admin

Can you please provide more details concerning the platform in use and software version? 

Timothy_Hall
Champion

CUL getting activated indicates that a CPU is spiking; the new CPU Spike Detective tool might help, or your firewall may just need some tuning.  You can investigate the outage periods in historical mode with cpview -t.
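For example, a minimal sketch of stepping back to the outage window:

cpview -t    # open cpview in history mode
             # then use '+' and '-' to step forward/backward to the time of the disruption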

Please post Super Seven outputs.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Erik_Schelz
Participant

Did you find any solution to this problem? I am facing the exact same issue. I had a support call open, but support could not help me. 😞

Kind Regards

the_rock
Legend

Hi Erik,

 

Let us try to help you here. Can you provide some more details about the issue, such as the following:

- when did the issue happen?

- any particular changes made that could have caused it?

- any odd cluster behavior when this happens?

- what is the output of fw tab -t connections on the firewall?

- is the connections limit set to a particular number or automatic on the gateway (cluster) object?

- can you run top, free -m, and ps -auxw when this occurs? (example commands below)

- what type of hardware do you use? How many CPUs/cores? How much RAM?
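For example, a rough sketch of gathering that data on the active member while the disruption is occurring (these are the commands named above; the -s summary flag on fw tab is just an addition for readability):

fw tab -t connections -s    # connection table: current and peak entry counts
top -b -n 1                 # one-shot snapshot of per-process CPU and memory usage
free -m                     # memory and swap usage in MB
ps auxw                     # full process listing with CPU/memory columns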

 

Cheers.

Erik_Schelz
Participant

Hi, 

thanks for responding.

The issue happens occasionally. We couldn't find any pattern for when it occurs. We have been struggling with this for quite some time now, and I think the last bigger change we made beforehand was enabling Anti-Bot.

Just as Stefano describes in the initial post, the active cluster member hangs and, for example, some VPN connections via the Mobile Access client are terminated. I guess only those which are handled by the affected fw_kern thread. The passive cluster member is still reachable and fully functional.

While the issue is active, it is not possible to log in to the active gateway, so I can't provide the debugging output from top, ps, etc. The particular timeframe is also missing in "cpview -t".

Connection Limit is set to "Automatically".

The hardware is a 5900 Appliance with 16 CPUs and 32GB RAM.

If more information is needed, I will happily provide it.

the_rock
Legend

Ok, I see what you are saying... yes, that's a tricky situation, because it's frustrating not being able to get any info since the member is "hanging" and you can't even access it. Just curious, did you try rebooting the firewall when this happened? What did TAC suggest?

Personally, it's hard to say 100% whether this could be a hardware issue; the only way to really know for sure would be to run the hardware diagnostics tool and see what results you get. Based on the hardware specs you provided, you have a pretty powerful device.

Timothy_Hall
Champion

What code version and Jumbo HFA (JHFA) level are you on?

The CUL warnings are a symptom of high CPU load, not the underlying cause.  Depending on your version, the Spike Detective may be available to help you track down the high CPU utilization: sk166454: CPU Spike Detective.

The behavior described could also be indicative of an elephant flow. What does the output of fw ctl multik print_heavy_conn show, ideally run within 24 hours of the problem happening?
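For reference, roughly what that looks like (the comment describes the output fields as they appear in the sample quoted further down; exact wording may differ by version):

fw ctl multik print_heavy_conn
# per connection: source:port -> destination:port, IP protocol, instance load %,
# connection instance load %, start time, duration, and service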

Please provide output of the Super Seven commands for further tuning recommendations: S7PAC - Super Seven Performance Assessment Command...

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Erik_Schelz
Participant

Hi Tim,

I gathered the information you requested in the attached text file. I used the script you linked and added the output of fw ctl multik print_heavy_conn, the contents of the log file that spike_detective created for one of the spikes, and a "cpinfo -y all". All the spike_detective log files look very similar, with around 70% overhead for kiss_thin_nfa_exec_one_buf_parallel_xlate.

I hope you can use that information and look forward to hearing from you.

Kind Regards,

Erik

the_rock
Legend

Maybe @Timothy_Hall can double check for you, but I had a quick look and could not really see anything abnormal.

Timothy_Hall
Champion

There are a couple of things:

1) RX-DRP percentages are fine, but the presence of any TX-DRPs on numerous interfaces is fairly rare; I've only seen that a few times.  I'd be curious to see whether the TX-DRPs start incrementing during the disruptions you mention.  After a disruption, wait at least 10 minutes and then run sar -n EDEV, as this will show when the network error counters started moving; please post the results.  This could indicate some kind of hang on your interfaces or some kind of buffering problem between SecureXL and the NICs.
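A small sketch of the sar invocations involved (the -f form for reading a specific day's history file is spelled out further down; XX stands for the day of the month):

sar -n EDEV                        # per-interface error/drop counters for today
sar -n EDEV -f /var/log/sa/saXX    # the same counters for an earlier day XX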

2) While the acceleration path percentages are fine, I find it a bit odd that 0% of your traffic is fully accelerated and 0% of your connections are templated with SecureXL enabled.  Please provide the output of enabled_blades; I suspect that you may need to tune your Threat Prevention policy so it is not inspecting every last bit of traffic traversing the firewall.
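A quick way to watch those acceleration figures yourself (standard SecureXL commands, interpreted as described above):

enabled_blades     # list of enabled software blades
fwaccel stat       # SecureXL status and accept-template state
fwaccel stats -s   # summary of accelerated vs. firewall-path (F2F) packet percentages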

3) Spike Detective is showing a lot of CPU usage by "kiss_thin_nfa_exec_one_buf_parallel_xlate". I'm not sure what that is exactly, but I get a "stealth" hit for this string in SecureKnowledge for sk165173: VSX Security Gateway crashes when the IPS blade is enabled, and an FWK core is created.  So it is probably something in IPS.

4) Based on this elephant flow:

[fw_0]; Conn: dmz_host2:841 -> internal_host:2049 IPP 6; Instance load: 60%; Connection instance load: 99%; StartTime: 27/07/21 20:04:41; Duration: 1577; IdentificationTime: 27/07/21 20:07:14; Service: 6:2049;

I suspect that you have extremely heavy traffic kicking up between your DMZ and internal network at very high speeds, causing the disruptions (backups? Massive NFS-based file transfers?), and that it is thoroughly saturating your active member, though not enough to cause a failover.  This is probably due to this heavy traffic getting sucked into Threat Prevention inspection when it should not be, so you'll probably need to either tune your Threat Prevention policy or fast_accel whatever these heavy flows are to avoid killing the rest of the firewall.  This theory tracks with what I mentioned in points 2 and 3 above.
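A rough, hedged sketch of what fast_accel for that NFS flow could look like (syntax follows sk156672 and may vary by version; the IP addresses below are placeholders standing in for dmz_host2 and internal_host, while the port and protocol come from the heavy-connection line above):

fw ctl fast_accel enable
fw ctl fast_accel add 192.0.2.10 198.51.100.20 2049 6   # dmz_host2 -> internal_host, TCP/2049 (NFS)
fw ctl fast_accel show_table                            # verify the entry was added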

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Erik_Schelz
Participant

Thanks for the thorough explanation. We are using the following blades: fw vpn cvpn urlf ips identityServer anti_bot mon.

I will run the sar command the next time the issue arises. I guess doing it two days later is not very effective.

The elephant flow you mentioned is not causing any problems as far as I know, but that might be because there is not much other traffic at that time. The elephant flow at 11:41 made my VPN connection hang and the system was unresponsive. When I SSH into it while this happens, the connection is established, but I don't get a command prompt until the issue goes away. For the same timeframe there is no historical data in cpview.

sk173405 sadly does not completely match the scenario.

Thanks for your help, I will keep you informed.

 

Timothy_Hall
Champion

Based on that enabled_blades list I'd say it is almost certainly IPS that is the culprit.

The sar command has 30 days of history built in.  To access the network counter historical data you would do this:

sar -n EDEV -f  /var/log/sa/saXX  (XX is the day number you want to see, so July 26 would be 26.)

It might also be interesting to see the CPU loads and where the cycles were being spent during the disruption (us, sy, wa, etc.), so also run sar -f /var/log/sa/saXX and post the results.  You can also try poking around with cpview in history mode (which also keeps 30 days of historical data): cpview -t will put you in history mode, then you use + and - to move forward or backward in time.  If you post the sar data, please also mention exactly when that day the disruption occurred.
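For example, assuming (purely for illustration) that the disruption happened on the 27th of the month:

sar -n EDEV -f /var/log/sa/sa27   # interface error/drop history for the 27th
sar -f /var/log/sa/sa27           # CPU usage history (us/sy/wa/si) for the 27th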

If the SSH session connects at the TCP level but you cannot log in during the disruption, generally that means all CPUs are being monopolized in the kernel by INSPECT/SecureXL (si/sy space), and the sshd daemon running up in process/user space cannot get enough CPU cycles to service your login request after the TCP/22 connection was initially created by the kernel.

If you can figure out a way to cause or reproduce the disruption, try running ips off beforehand and see if that helps.  Note that doing this may expose your organization to attacks for the time period IPS is disabled, and don't forget to turn it back on with ips on when done!
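A tiny sketch of that test sequence (same ips off/on commands as above; remember the exposure while IPS is disabled):

ips off      # temporarily disable IPS inspection
# ... reproduce or wait for the disruption, gather data ...
ips on       # re-enable IPS when done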

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
HeikoAnkenbrand
Champion

Hi @Stefano_Cappell 

CUT>>>
cerbero1 kernel: [fw4_1];...[ERROR]: malware_res_rep_rad_query:     rad_kernel_api_async_get_resource() failed with error: Service is down

And then:
cerbero1 kernel: [fw4_1];CLUS-120202-1: Stopping CUL mode after 80 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
<<<CUT

I think the RAD service is not available (marked red) and is being restarted again and again. Anti-Bot, Anti-Virus, URL Filtering, and HTTPS Inspection use the Resource ADvisor (RAD) process to enforce their policies/profiles. RAD forwards the relevant reputation/categorization requests to the Check Point cloud; the request is made to cws.checkpoint.com.
cws.checkpoint.com resolves to one of Akamai's servers. The server may change, and once it does, RAD is not able to recognize it.

Solution:

To identify that this is indeed the issue, do the following:

1) From within the Security Gateway, identify the current cws.checkpoint.com IP by either pinging it or resolving it with nslookup.

2) Use 'netstat -nap | grep rad' and see which IP RAD uses at the moment.

If the results of steps 1 and 2 do not match, then this IP change is indeed the issue.

Possible solutions:

1) Install policy.
2) Restart the RAD process by running 'rad_admin restart' (see the sketch below).
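A minimal sketch of the verification and recovery steps above, using only the commands named in this post:

nslookup cws.checkpoint.com     # step 1: the IP the FQDN currently resolves to
netstat -nap | grep rad         # step 2: the IP(s) the rad process is connected to right now
# If the two do not match:
rad_admin restart               # restart the RAD process (or install policy)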

---

If that is not the problem but the RAD service is still unavailable, I would open a TAC case.

 

➜ CCSM Elite, CCME, CCTE
HeikoAnkenbrand
Champion

It can also be a hotfix-related issue. With that hotfix, RAD is multi-threaded, so you will see more RAD processes and the CPU usage can exceed 100%; this is normal behavior.

More read here:
sk163793 

➜ CCSM Elite, CCME, CCTE
Timothy_Hall
Champion

Certainly possible, although the gateway is running the latest GA Jumbo HFA for R80.40 (Take 118).  However, this nasty-sounding bug was fixed in the ongoing Take 119. @Stefano_Cappell, does this look like your issue:

sk173405: Security Gateway or Management Server is stuck - user is unable to run any command

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
