Re: 6800 stability Issue - R80.10 w/ Take 203 or R80.20 w/ Take 74

Mark_Thomasson1 · 2019-05-27

Our company purchased a pair of the new 6800 gateways. We initially applied Take 203 before it became GA last week, as we weren't comfortable deploying new hardware without any hotfixes. Supposedly the new 6800s can only run the R80.10 and R80.20 ISOs, and general-release Takes and hotfixes can't be installed.

A week after putting them into production, the NIC on the backup member of the 6800 HA pair hung twice within 24 hours. The submitted cpinfo indicated issues with the NIC registers, and there was a recommendation to RMA the quad-port NIC card to "fix" the issue.

The following SK described what we were seeing in our logs:

https://supportcenter.checkpoint.com/supportcenter/portaleventSubmit_doGoviewsolutiondetails=&soluti...

Not more than 96 hours later, the same behavior occurred on the primary member of the 6800 cluster, resulting in a failover to the backup. Again we RMAed the NIC card. On the recommendation of support, we changed the timeout for logging from 10 seconds to 2 seconds to better capture in the logs the behavior of the card before it failed. The card failed a second time on what was the primary member; this time the unit flapped over 30 times over a period of 45 minutes before it finally failed. This time the issue was more severe, as the Cisco switch actually shut down the attached ports in a self-protection mode due to the prolonged flapping.

We are still awaiting a solution from Check Point Support and R&D on this issue.

We initially thought the problem was isolated to our environment, but later heard the issue occurred in Check Point's lab environment in Canada on a pair of 4800s after they applied Take 203. They initially thought the issue was flow control, but the issues persisted after they turned flow control off.

It was later conveyed to me that another customer who purchased 6800s was having the same issues but was on R80.20 and Take 74. This customer had replaced the NIC and was also looking to RMA the entire chassis.

Vladimir · 2019-05-28

@Mark_Thomasson1, please clarify whether the instability manifested in R80.20 only after JHFA 74 was installed.

I'll be deploying a pair of 6500s this weekend and would like to get some feedback from the community on their experience with the 6000 series.

Thank you,

Vladimir

Mark_Thomasson1 · 2019-05-28

The issues that we reported were with the 6800s with the quad-port 10GB cards. The issue relayed to me from another customer was with a 6800 on R80.20 and JHFA Take 74. The 6000 line has special ISOs. Take 203 is GA for R80.10; Take 74 for R80.20 is not.

Danny · 2019-05-28

Have you tried R80.20 Ongoing Take 80?

Mark_Thomasson1 · 2019-05-28

The 6000 series has special ISOs that don't allow installation of standard Takes, only those intended for that platform:

R80.10 - Take 203, which is GA

R80.20 - Take 74, which is not GA

If you don't have confidence in either of those Takes, you are left with the "unpatched" version of R80.10 or R80.20 as your only option.

Timothy_Hall · 2019-05-28

Is this only happening to NIC ports that are located on a slot expansion card, or is it happening on the built-in NIC ports as well?

The fake TX hang would seem to indicate a driver issue, whereas a "real" hang would normally indicate a hardware issue. Can you please provide the output of the ethtool -i (interfacename) and ethtool -S (interfacename) and ethtool -k (interfacename) for an interface that has hung? (Hopefully just after the hang and prior to a reboot/reset) In the case of the last command all NIC card offloads should be *off* as this has caused some strange issues in the past, and it is certainly possible they are using some kind of new NIC hardware in these 6000 series boxes that may have some type of new offload enabled.
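
For anyone who wants to check the offload state quickly, here is a minimal Python sketch that lists every feature reported as "on" in saved ethtool -k output. The helper name enabled_offloads and the sample text are illustrative assumptions, not a Check Point or ethtool tool; the sample only mimics the older ethtool -k layout, so the parsing may need adjusting if your output differs.

```python
# Hypothetical helper (not part of ethtool or Gaia): list offloads reported as "on".
SAMPLE_OUTPUT = """Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
generic-receive-offload: off
"""


def enabled_offloads(ethtool_k_output):
    """Return the names of features whose state is 'on' in ethtool -k text."""
    enabled = []
    for line in ethtool_k_output.splitlines():
        # Split on the last colon so the header line ("... for eth0:") yields no state.
        name, sep, value = line.rpartition(":")
        fields = value.split()
        # Some ethtool versions append a tag such as "[fixed]" after the state.
        if sep and fields and fields[0] == "on":
            enabled.append(name.strip())
    return enabled


if __name__ == "__main__":
    print(enabled_offloads(SAMPLE_OUTPUT))
```

Feed it text captured with ethtool -k (interfacename) right after a hang; an empty list means no offload is reported as enabled.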

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Mark_Thomasson1 · 2019-05-28

This is only happening with the quad 10GB expansion cards for us.

Not troubleshooting by posting the output of commands here. This issue has already been escalated to R&D.

Timothy_Hall · 2019-05-29

> Not troubleshooting by posting the output of commands here. this issue has already been escalated with R&D

Understood, please post the solution here once it is found by R&D for the benefit of all. Thanks!

Mark_Thomasson1 · 2019-06-04

As the customer facing the issue, I don't know how Alex can state the issue is NOT the platform when no root cause has been established.

Since the 6000 series has special builds of R80.10 and R80.20 that MAY or MAY NOT be a contributing factor, I again don't know how anyone can definitively rule that out.

We have taken the 6800 out of production as a result.

We did have another crash/hang yesterday while the unit was sitting on the network and not passing any network load.

We were able to upload a cpinfo to our SR shortly after the change, in hopes of further addressing the issue.

phlrnnr · 2019-09-23

Hi @Timothy_Hall ,

I was assured by Check Point that the cards in my 6800 are not affected by this. However, I seem to be having the problem on R80.20 Take 87. I'll be opening a TAC case shortly. In the meantime, here is the output of the commands you asked for, captured before I reboot the machine.

[Expert@<removed>:0]# ethtool -i eth2-01
driver: ixgbe
version: 3.9.15-NAPI
firmware-version: 0x800000cb
bus-info: 0000:06:00.0

[Expert@<removed>:0]# ethtool -S eth2-01
NIC statistics:
rx_packets: 317574224
tx_packets: 608453986
rx_bytes: 70420161246
tx_bytes: 702313270365
rx_errors: 5249807244155220
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 2624903623517988
collisions: 0
rx_over_errors: 0
rx_crc_errors: 2624903622077610
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 20999228976620880
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_pkts_nic: 321846298
tx_pkts_nic: 609048206
rx_bytes_nic: 2624976580835660
tx_bytes_nic: 784557258971
lsc_int: 3
tx_busy: 0
non_eop_descs: 0
broadcast: 2624903622077622
rx_no_buffer_count: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 2624903622077610
rx_short_length_errors: 2624903622077610
tx_flow_control_xon: 2624903622077610
rx_flow_control_xon: 2624903622077610
tx_flow_control_xoff: 2624903622077610
rx_flow_control_xoff: 2624903622077610
rx_csum_offload_errors: 2
alloc_rx_page_failed: 0
alloc_rx_buff_failed: 0
rx_no_dma_resources: 41997972621937425
hw_rsc_aggregated: 0
hw_rsc_flushed: 0
fdir_match: 2624903622077610
fdir_miss: 2624903622077610
fdir_overflow: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_queue_0_packets: 608453986
tx_queue_0_bytes: 702313270365
rx_queue_0_packets: 317574224
rx_queue_0_bytes: 70420161246

[Expert@<removed>:0]# ethtool -k eth2-01
Offload parameters for eth2-01:
Cannot get device udp large send offload settings: Operation not supported
Cannot get device GRO settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off

Timothy_Hall · 2019-09-23

Most of those ethtool counters look completely invalid. Either something is seriously wrong in the ixgbe NIC driver, or there is some kind of corruption occurring. You could try to remove and reload the ixgbe driver from the kernel with the following commands, but keep in mind it will cause an outage on all network interfaces that utilize the ixgbe driver (and a failover if you are using a cluster), not just the eth2-01 interface:

modprobe -r ixgbe; modprobe ixgbe
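
To make "completely invalid" a bit more concrete, here is a rough Python sketch that flags event counters far larger than the number of packets the interface has actually handled. The function names, the 1000x ratio, and the choice to skip byte counters are assumptions made for illustration, not a Check Point or ixgbe diagnostic; the sample values are copied from the ethtool -S output posted above.

```python
# Heuristic sketch only: thresholds and names are assumptions, not a vendor diagnostic.
SAMPLE_OUTPUT = """NIC statistics:
rx_packets: 317574224
tx_packets: 608453986
rx_bytes: 70420161246
rx_errors: 5249807244155220
tx_errors: 0
rx_crc_errors: 2624903622077610
rx_csum_offload_errors: 2
"""


def parse_counters(ethtool_s_output):
    """Parse 'name: value' lines from ethtool -S text into a dict of ints."""
    counters = {}
    for line in ethtool_s_output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "NIC statistics:" header has no numeric value and is skipped.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def implausible_counters(counters, ratio=1000):
    """Return non-byte counters exceeding ratio x (rx_packets + tx_packets)."""
    total_packets = counters.get("rx_packets", 0) + counters.get("tx_packets", 0)
    limit = max(total_packets, 1) * ratio
    return sorted(
        name
        for name, value in counters.items()
        # Byte counters can legitimately dwarf packet counts, so they are skipped.
        if "bytes" not in name and value > limit
    )


if __name__ == "__main__":
    print(implausible_counters(parse_counters(SAMPLE_OUTPUT)))
```

This is only a plausibility check: a flagged counter suggests the statistics cannot be trusted, not what caused the corruption.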

Alexander_Kim · 2019-06-04

Hi,

A quick update from R&D: this issue is definitely not appliance-related (6k series or any other).

We are still studying the issue. Most likely the cause will be found in the ixgbe driver code, in a rare type of traffic bursts.

We will update with more details as soon as we are sure.

We would like to reach out to customers who are experiencing this problem and provide them with a driver which we believe should resolve the fake TX hang.

Please feel free to contact Keren Nitzan (kerenni@checkpoint.com) and myself (alexkim@checkpoint.com) if you have a customer with this issue.

phlrnnr · 2019-06-04

Thank you for this update! Do you know if this issue manifests itself in the 6800 appliance R80.20 GA code (Check_Point_R80.20_T101_R80.10_Dual_6000_T18.iso), or only after an Ongoing Jumbo Hotfix Accumulator is applied (e.g. post Take 74)?

Alexander_Kim · 2019-06-04

From what we know, it's not a matter of a specific JHF. We are gathering more information about such cases, but it seems it can happen on R80.20 GA too. It's more related to the characteristics of the traffic flowing through the NIC.

Mark_Thomasson1 · 2019-06-04

As for Alex postulating this was due to a "rare type of traffic bursts", I will comment that 3 of the 5 hangs we have observed have been on passive members of a cluster, and yesterday's was essentially on a unit that is offline awaiting a fix.

I will applaud him for encouraging other customers who are facing similar issues to come forward.

Mark_Thomasson1 · 2019-08-05

This turned out to be a hardware problem with the 4-port cards, affecting a small subset of customers. A full Root Cause Analysis describing the nature of the problem is not yet available.

As a permanent fix, we will be getting a different quad-port NIC which is not susceptible to the same problem (design improvement) once those cards are available.

Timothy_Hall · 2019-08-05

Interesting, thanks for the follow-up.

6800 stability Issue - R80.10 w/ Take 203 or R80.20 w/ Take 74