biskit
Advisor

Network Card Issue

Hi all,

I've got a network issue which isn't Check Point per se, but it's leaving one of my VSX cluster members down so figured I'd put it out there and see if anyone has any ideas...

Everything was working perfectly, but after nearly 500 days uptime I did a routine reboot.  The server never came back.

Connecting via the local console and doing some testing with tcpdump I have concluded that the NIC is receiving traffic, but not sending traffic.  I've proved this beyond doubt.  So this is the problem.

If I boot the server from a Knoppix live CD I can configure the interfaces and they all work perfectly.  So the hardware is fine.  Something has gone screwy with the GAiA TCP stack on the server, receiving but not sending.

Does anyone know what I can do?  Or is the best option to reinstall from the ISO?  (I hope not - it's a little drastic!)

Thanks,

Matt


Accepted Solutions
Timothy_Hall
Legend

TL;DR version of this thread: friends don't let friends use Broadcom NICs...

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Timothy_Hall
Legend

If the system is a member of a cluster, that interface may not be transmitting anything because it thinks it is the standby/backup member for that interface.  I think this is the most likely scenario.  Does the problematic interface appear in cphaprob -a if or show vrrp interfaces?  Is the cluster showing a healthy condition via cphaprob state?

When booted into Gaia, do commands fw ctl iflist and sim if show that interface as being correctly bound to the Check Point code?

Any chance clustering got disabled from cpconfig some time ago, and then took effect when the reboot happened?

After booting into Gaia and trying to pass traffic for a few minutes, run ethtool -S (interfacename) and post the results to diagnose whether it is a NIC driver/hardware issue (doubtful).  What do the counters for the switch port attached to that interface show on the switch itself?
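
To pick the non-zero counters out of a long ethtool -S listing, a small shell helper along these lines may help. It is only a sketch: nonzero_counters is a made-up name, not an ethtool or Check Point tool, and it assumes the usual "name: value" layout that ethtool -S prints.

```shell
# Print only the counters from `ethtool -S` output whose value is non-zero.
# Reads the statistics text on stdin, so it works on live or saved output.
nonzero_counters() {
    awk -F': *' 'NF == 2 && $2 ~ /^[0-9]+$/ && $2 + 0 != 0 {
        gsub(/^[ \t]+/, "", $1)   # strip the leading indentation ethtool adds
        print $1 ": " $2
    }'
}

# Typical use on the gateway (eth0 is the interface discussed in this thread):
#   ethtool -S eth0 | nonzero_counters
```

If only rx_* counters show up and no tx_* counters do, that points at the transmit side specifically.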

 

biskit
Advisor

One thing I should have mentioned is that it does this even after "cpstop".  I presume it's a dumb Red Hat box at that point?

With "cpstart"ed I ran your commands.  As I can't SSH to the box I've captured the outputs via iLO.

ethtool -S eth0 gives this (only the first few lines have any data; the rest of the output is all zeros):

 

Capture.PNG


And just for good measure, here's ifconfig, showing RX packets but no TX packets.

There are four interfaces used on this system.  All four have exactly the same problem - receiving packets but not transmitting, even with Check Point completely stopped, and only since rebooting it.  It all worked perfectly prior to the reboot.  It's been rebooted many times since to no avail.  Odd 🙄

biskit
Advisor

Oops, here is the ifconfig!

Capture.PNG

Timothy_Hall
Legend

That is strange, what does your routing table look like?  (netstat -rn) Do the TX counters remain at zero even if you try to initiate traffic from the firewall itself instead of trying to transit traffic across it?  When doing so are ARP requests even sent?  Also please provide output of ethtool (interfacename) and ethtool -i (interfacename).
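
One way to check whether locally generated traffic moves the TX counter at all is to sample it before and after a ping. This is a rough sketch, assuming the kernel exposes per-interface counters under /sys/class/net; tx_packets, tx_delta and the SYSFS_NET override are hypothetical names used only for this example.

```shell
# Read the transmit packet counter for an interface from sysfs.
# SYSFS_NET is overridable only so the function can be exercised off-box.
tx_packets() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/tx_packets"
}

# Run a command and report how many packets the interface sent meanwhile.
tx_delta() {
    iface=$1; shift
    before=$(tx_packets "$iface") || return 1
    "$@" >/dev/null 2>&1 || true   # traffic generator, e.g. ping; its exit status is ignored
    after=$(tx_packets "$iface") || return 1
    echo $((after - before))
}

# Example (replace the placeholder with a directly connected address):
#   tx_delta eth0 ping -c 3 <next-hop address>
```

A result of 0 after pinging from the firewall itself would mean that not even ARP requests are leaving the interface.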

 

biskit
Advisor

netstat -rn is identical on the working and the broken machines.

TX counters remain at zero when I ping out from the server.

ethtool outputs are:

Capture4.PNG

Capture5.PNG

Timothy_Hall
Legend

Sure enough, it is a Broadcom NIC (tg3).  Broadcom cards are terrible and well-known for having random stability & performance problems.  If you run ethtool -i on the other working interfaces what driver are they using?
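
To compare drivers across all interfaces in one go rather than running ethtool -i on each, something like this may work. It is a sketch that assumes the kernel exposes the usual device/driver symlink under /sys/class/net; list_nic_drivers and SYSFS_NET are names invented for this example.

```shell
# Print "<interface> <driver>" for each network interface that is backed by
# a device driver, using the sysfs driver symlink.
list_nic_drivers() {
    root=${SYSFS_NET:-/sys/class/net}
    for dev in "$root"/*; do
        [ -L "$dev/device/driver" ] || continue   # skips lo and other virtual interfaces
        echo "$(basename "$dev") $(basename "$(readlink "$dev/device/driver")")"
    done
}

# Example: list_nic_drivers
```

ethtool -i remains the better source for the driver and firmware versions; this only shows which driver each interface is bound to.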

If you have an Intel-based NIC available in that server, move that card configuration to the Intel NIC and you should be fine.

 

biskit
Advisor

Thanks Tim.

Unfortunately I only have the 4 on-board NICs.  Both cluster members are identical and use the same driver version.

I presume there's a way to reinstall just the NIC drivers?  But therein lies the problem of how I'd get those drivers on to the server with no working NICs to start with.

I presume JHFA won't include newer drivers?  It's R77.30 on Take 221 at the moment.  The plan very soon is to upgrade to R80.20 but there are still some hoops to jump through before I can plough on with that.  I'm a month or two away.

So would your conclusion be to dig out the DVD and start again?  Or try to mount a USB stick to copy newer drivers over?

Timothy_Hall
Legend

Try these commands to forcibly rip the tg3 driver out of the kernel and cause a re-initialization.  Note that this will cause an outage on all your interfaces so do it from the console/LOM if possible to avoid cutting yourself off:

ifdown eth0; modprobe -r tg3; modprobe tg3; ifup eth0

The latest Jumbo HFA for R77.30 does not include a newer tg3 driver than the 3.122n one you are already using so that is unlikely to help; R80.30 with the 2.6.18 kernel is also still using version 3.122n for tg3.  Try contacting TAC as they may have a new tg3 driver available that could address what sure looks like a Broadcom bug to me; do not try to load your own unsupported version of the tg3 driver. 

There is a known R80.20 limitation with Broadcom cards; try tampering with the line speed of eth0 and see if that shakes it loose:

GAIA-3205 (R80.20)

Interface that uses the tg3 driver is not able to set the speed to 1000 Mbit/sec in the following scenario:

  1. The interface is connected to a 1000 Mbit/sec switch port
  2. Disable auto-negotiation of the speed on the interface
  3. Manually configure the speed on the interface to 10 Mbit/sec or 100 Mbit/sec
  4. The speed on the interface is set at 10 Mbit/sec or 100 Mbit/sec
  5. Enable the auto-negotiation again
  6. The speed on the interface stays at the previous value of 10 Mbit/sec or 100 Mbit/sec
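
For anyone wanting to try the speed toggle from expert mode, the GAIA-3205 steps can be sketched with plain ethtool as below. Treat it as an illustration under assumptions: bounce_speed, run and DRY_RUN are hypothetical names, the 5-second pause is arbitrary, and ethtool changes are runtime-only, so the interface settings configured through Gaia itself are not touched. Run it from the console/LOM since the link will drop.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Force a lower speed with auto-negotiation off, then hand control back
# to auto-negotiation and show the resulting link settings.
bounce_speed() {
    iface=$1
    speed=${2:-100}          # forced speed in Mbit/sec
    run ethtool -s "$iface" speed "$speed" duplex full autoneg off
    run sleep 5              # arbitrary pause to let the link settle
    run ethtool -s "$iface" autoneg on
    run ethtool "$iface"     # check the reported Speed/Duplex afterwards
}

# Example: DRY_RUN=1 bounce_speed eth0 100   (prints the commands only)
```

If the final ethtool output still reports 10 or 100 Mbit/sec with auto-negotiation on, that would match the limitation described above.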

 

If I had a time machine the first thing I would do is go back and prevent Broadcom (and maybe Emulex) from ever manufacturing NIC cards...

 

biskit
Advisor

No further forward 😢

I ran the commands to force the reload and that didn't work.

In fact, now when I try and alter the interface in CLISH loads of options are missing - including being able to set the ipv4-address.  If I enter #set interface eth0 <TAB> I get this:

Capture6.PNG

If I manually type #set interface eth0 ipv4<TAB>, normally expecting it to auto-complete the rest of the command, now it just reports "CLINFR0329 Invalid command".

If I run #show configuration, the interface IP config is all there. 

TX is still zero on all interfaces.

I've manually set the link speed to 100M, then back to auto-neg.  It returns to 1000M/Full, but makes no difference to TX.

I've rebooted again.  No difference.

So I think my only option now is to rebuild and hope that works!

 

Timothy_Hall
Legend

Yeah I think you are stuck reloading.  Any chance you can pop in an Intel expansion NIC card prior to reloading?

 

biskit
Advisor

Hi Tim,

Unfortunately no spare NICs available to test with.  And besides, it had been down for 10 days already so the customer was keen to do whatever it took to get their VSX cluster working again.

I've reinstalled, JHFA'd, vsx_util reconfigure'd, and we're back in business.

VSXBCapture2.PNG

Thanks for all your help 😁

Matt

biskit
Advisor

Quick addition for the benefit of others facing the same problem.  sk101515  😀
