Re: "UDP checksum is incorrect" - 100s of IPv6 DROPs in fw.log - TP/IPS responsible? - Page 2

Jerry · ‎2024-10-22

hi folks

quick one

one of my customers just upgraded to R82 last night and found in fw.log 100s of drops due to "UDP checksum is incorrect".

knowing how UDP works I presume that TP/IPS is to blame but which protection is responsible for that?

any clues?

Jerry

shais · ‎2024-10-22

The above will apply only until the next reboot.

To allow it to survive reboot, add to
$FWDIR/boot/modules/fwkern.conf

udp_is_verify_cksum=0
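
A minimal sketch of making that change repeatable, assuming bash in expert mode. `set_fwkern_param` is a hypothetical helper name, not a Check Point tool. The file path is passed in as an argument, and the example call uses the location shown later in this thread. The entry is written without spaces around `=`, matching the file contents shown later in this thread.

```shell
#!/bin/bash
# Hypothetical helper: set name=value in a fwkern.conf-style file exactly once.
set_fwkern_param() {
    local file="$1" name="$2" value="$3" tmp
    touch "$file" || return 1
    tmp="$(mktemp)" || return 1
    # Drop any existing assignment of this parameter (with or without spaces around '=').
    grep -v -E "^[[:space:]]*${name}[[:space:]]*=" "$file" > "$tmp" || true
    # Append the new assignment in name=value form.
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    # Overwrite in place so the original file's ownership and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example call on the gateway (read at boot, so it needs a reboot to take effect):
#   set_fwkern_param "$FWDIR/boot/modules/fwkern.conf" udp_is_verify_cksum 0
```

Running the helper twice leaves a single entry, so it is safe to re-run after an upgrade or a config restore.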

Jerry · ‎2024-10-22

Thanks a lot, Done. cprestart or reboot?

Jerry

Jerry · ‎2024-10-25

that still did not resolve the problem Shais ... are you able to continue and find out with me what is the root cause of that and if we could potentially fix it? Still having no response from the message I've sent to you ....

Jerry

the_rock · ‎2024-10-25

Hey bud,

So sorry about yesterday, had to deal with some other issues, will do some tests in the lab shortly and update you.

Andy

Best,
Andy

Jerry · ‎2024-10-25

no worries Andy, in fact I was hoping CP would also be interested in helping, for their own benefit, but so far the radio silence speaks for itself ...

Jerry

the_rock · ‎2024-10-25

No worries bud, maybe @shais is just busy. I will always do my best to help you. Btw, I updated the fwkern.conf file, so let me reboot and see if the values below change, though there are no errors now, so not sure it will make any difference.

Andy

[Expert@R82:0]# cd $FWDIR/boot/modules
[Expert@R82:0]# more fwkern.conf
udp_is_verify_cksum=0
[Expert@R82:0]#

Jerry · ‎2024-10-25

thanks mate, I always appreciate your commitment and solid contribution, however,

let me just point out something very specific here: there is a difference between your test setup and topology and the one where I see the issue on R82

1. I use LACP 6x10GB interfaces with Fast Layer3+4 hashing

2. my errors appear after reboot, though not immediately; as you know, I've applied the kernel changes and the RX checksum errors still come back

3. ping me on Teams so maybe we can t-shoot it together when you have time buddy?

4. I'm under the impression that driver version and vendor matters so maybe that is the key?

5. I really wonder what CP R&D says, so let's wait until they "find time" .... 🙂

Jerry

Timothy_Hall · ‎2024-10-25

One key determination you'll need to make is whether the traffic being dropped due to these errors is "garbage" or not. Does the RX-DRP counter seem to slowly increment regardless of load? sar -n EDEV can help determine this.

It is possible there is some kind of occasional broadcast traffic getting splattered onto the network that the firewall would not process anyway; this traffic may have just been ignored by the R81.20 NIC driver but now it is getting reported. The classic example is the RX-DRP counter which before Gaia 3.10 indicated a drop of desirable traffic that we wanted to process but got lost due to buffering issues. But in Gaia 3.10 the counter is incremented for these buffering issues AND also for "garbage" traffic like unknown EtherTypes and improperly pruned VLAN tags that occasionally splatter into the network.

Unfortunately almost all NICs will not pass errored frames up to the Gaia OS where they can be seen by tools like tcpdump, so unless you have a specialized sniffer appliance that can show you those errored frames this is going to be a tough road.
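
To check whether the error counter creeps up regardless of load, one option alongside sar is to sample the kernel's per-interface counters twice and compare. A rough sketch, assuming the standard Linux sysfs counters under /sys/class/net/<iface>/statistics are available on the gateway; `read_rx` and `rx_delta` are hypothetical helper names.

```shell
#!/bin/bash
# read_rx <iface> [root]: print "<rx_packets> <rx_errors> <rx_dropped>" from sysfs.
# The optional root argument only exists so the helper can be pointed at a test directory.
read_rx() {
    local dir="${2:-/sys/class/net}/$1/statistics"
    echo "$(cat "$dir/rx_packets") $(cat "$dir/rx_errors") $(cat "$dir/rx_dropped")"
}

# rx_delta "<before>" "<after>": print how much each counter grew between two samples.
rx_delta() {
    local b=($1) a=($2)
    echo "packets=$((a[0]-b[0])) errors=$((a[1]-b[1])) dropped=$((a[2]-b[2]))"
}

# Example (one-minute window on a single bond member):
#   before="$(read_rx eth1-01)"; sleep 60; after="$(read_rx eth1-01)"
#   rx_delta "$before" "$after"
```

If `errors` grows by a similar amount in every window while `packets` swings with load, that would be consistent with background "garbage" traffic rather than loss of production traffic, though it does not prove it.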

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

the_rock · ‎2024-10-25

I know, I agree with you 100%.

However, I believe this is worth pointing out...

As BASIC as my lab is, I made that kernel value permanent, rebooted, and checked the interface counters: it's exactly the SAME...not a single value changed.

Andy

Jerry · ‎2024-10-22

happy to do some lab rabbiting if you'd prefer a Zoom sesh, e.g. tomorrow? 🙂

Cheers

Jerry

Jerry · ‎2024-10-23

hi Shais, were you able to secure some time for the troubleshooting session re. the above?

the workaround you guys suggested seems to work as far as fw.log goes, but the NIC errors have not disappeared: they are still there and still increasing. have you got any other hints/clues regarding that issue? See my post above from today.

In case you'd be up for a deeper dive please let me know.

Jerry

the_rock · ‎2024-10-23

Bro, I would still open TAC case for the reference. I say this because if you bring this issue up to your local SE, 100% the first thing they will ask is if you have case open for it.

Andy

Jerry · ‎2024-10-23

hi mate, the only problem is that opening a TAC case is only an option if the device affected by the issue is under support, right? 🙂 my LAB device isn't under support, hence I'm quite reluctant to deal with TAC, as this wouldn't be an easy journey when the LAB device runs on EVALs. Hope that makes sense?

That's why I've replied to Shais but so far no response 😞

Jerry

the_rock · ‎2024-10-23

Yes, I'm so sorry, totally overlooked that part...let's hope @shais responds.

Andy

G_W_Albrecht · ‎2024-10-23

If the LAB is used to prepare the R82 rollout for a customer with a valid support contract, you can open an SR# under the User Account of that customer.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Jerry · ‎2024-10-23

hi, thanks for your contribution, but as I mentioned earlier my LAB is "just a lab", meaning it has always run on EVALs since it isn't used for much beyond "lab'ing". So I cannot associate that device with ANY customer I support nor with ANY account I administer in CP UC, which makes a TAC case a simple NO GO. Hence our friend from CP here could be the only option for t-shooting the interface error issue. meanwhile I bet I'm not the only one using the driver below:

[Expert@cp15k:0]# ethtool --driver eth1-01
driver: ixgbe
version: 5.15.2 (V1.0.1_ckp)
firmware-version: 0x800000cb
expansion-rom-version:
bus-info: 0000:87:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Jerry

Timothy_Hall · ‎2024-10-23

So I assume you were not seeing these errors prior to the R82 upgrade?

It is possible that the updated R82 NIC driver is logging these errors now where it didn't before. So the problem could have been there before but now it is being logged. Either you are receiving frames that really do have a CRC/checksum error that are being discarded, or there is some kind of problem with the checksum offload and it is flagging frames that it shouldn't. Either way you won't be able to see these errored frames in a packet capture to identify what they are, as the NIC will not pass errored frames to the Gaia OS at all.

Question is are these frames something that the firewall needs to process, or are they "junk" traffic like broadcasts that we can't process anyway?

Next step would be to disable the checksum offloads at the NIC level from expert mode and see what happens:

ethtool -K eth1-01 rx-checksumming off

Jerry · ‎2024-10-23

before R82 I definitely did not have such errors on any interface. meanwhile I did try this, look at the output:

[Expert@cp:0]# ethtool -K eth1-01 rx-checksumming off
ethtool: bad command line argument(s)
For more information run ethtool -h

Jerry

Timothy_Hall · ‎2024-10-23

Try this:

ethtool -K eth1-01 rx off
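
Since the bond has six members, the same change has to be made on each of them. A small sketch, assuming bash in expert mode and the interface names shown elsewhere in this thread; `disable_rx_csum` is a hypothetical helper name, and by default it only prints the commands so they can be reviewed first.

```shell
#!/bin/bash
# Print (default) or run the RX checksum offload change for each interface given.
# Set APPLY=1 in the environment to actually execute ethtool instead of printing.
disable_rx_csum() {
    local ifc
    for ifc in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            ethtool -K "$ifc" rx off
        else
            echo "ethtool -K $ifc rx off"
        fi
    done
}

# Review first, then apply:
#   disable_rx_csum eth1-01 eth1-02 eth1-03 eth1-04 eth2-01 eth2-02
#   APPLY=1 disable_rx_csum eth1-01 eth1-02 eth1-03 eth1-04 eth2-01 eth2-02
```

The current state can be checked with `ethtool -k eth1-01 | grep rx-checksumming` (lowercase `-k` shows settings). As far as I know, `ethtool -K` changes are runtime-only on Linux, so they would need to be reapplied after a reboot.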

Jerry · ‎2024-10-23

that worked. is that the final step, or should I try something else and watch whether the RX errors keep increasing, Tim?

Jerry

the_rock · ‎2024-10-23

Doesn't that command simply hide the actual number of errors, even if they are technically present?

Andy

Jerry · ‎2024-10-23

take a look at this guys, check this out, errors are also on the BOND itself (lacp L3+4):

Jerry

the_rock · ‎2024-10-23

Hey bud,

I assume rebooting the appliance does not do much? Or does the issue go "away" for 10-15 mins and then it's back again?

Andy

Jerry · ‎2024-10-23

hi mate

you mean reboot after this command: ethtool -K eth1-01 rx off

I did not try yet but can do in about 2h when business hours end in London and I'm officially OOO 🙂

If, after running that command on each of the 6 eths in my bond, the errors go away for longer, I'll definitely share that update with you, as always. but what if it still persists as shown in my screenshot? Should CP R&D get involved re. the Gaia driver t-shooting, or should we consider this a much broader issue?

Just on the side, I did an R82 upgrade on another "appliance" yesterday and, after a "default and smooth process taking roughly 2h" .... no errors on interfaces whatsoever. Same line card (10G FIBRE), so ... the same driver etc. look at this (2 diff. appliances):

[Expert@cp15k:0]# ethtool --driver eth1-01
driver: ixgbe
version: 5.15.2 (V1.0.1_ckp)
firmware-version: 0x800000cb
expansion-rom-version:
bus-info: 0000:87:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

vs

[Expert@cp58:0]# ethtool --driver eth1-01
driver: ixgbe
version: 5.15.2 (V1.0.1_ckp)
firmware-version: 0x73b90000
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

whilst cp58 has no errors on the interfaces whatsoever:

[Expert@cp58:0]# ifconfig eth1-01
eth1-01 Link encap:Ethernet HWaddr 00:1C:7F:61:E8:02
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:11337028 errors:0 dropped:0 overruns:0 frame:0
TX packets:14317656 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7943578450 (7.3 GiB) TX bytes:10475299922 (9.7 GiB)
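
For comparing the two appliances, the saved driver outputs can be checked field by field. A small sketch, assuming each `ethtool --driver` output above was saved to a file; `driver_diff` and the file names are made up for illustration.

```shell
#!/bin/bash
# driver_diff <file_a> <file_b>: print the fields whose values differ between
# two saved "ethtool --driver" outputs, as "<field>: <a> vs <b>".
driver_diff() {
    awk '
        {
            # Split on the first colon only, since values like bus-info contain colons.
            p = index($0, ":")
            key = substr($0, 1, p - 1)
            val = substr($0, p + 1)
            sub(/^ +/, "", val)
        }
        NR == FNR { a[key] = val; next }
        (key in a) && a[key] != val { print key ": " a[key] " vs " val }
    ' "$1" "$2"
}

# Example:
#   ethtool --driver eth1-01 > /tmp/cp15k.txt    # on cp15k
#   ethtool --driver eth1-01 > /tmp/cp58.txt     # on cp58
#   driver_diff /tmp/cp15k.txt /tmp/cp58.txt
```

With the two outputs above, the driver and version lines match; only firmware-version (0x800000cb vs 0x73b90000) and bus-info differ.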

Jerry

the_rock · ‎2024-10-23

I meant reboot in general. Btw, below is what I see in eve-ng lab.

Andy

[Expert@R82:0]# ethtool --driver eth0
driver: vmxnet3
version: 1.6.0.0-k-NAPI
firmware-version:
expansion-rom-version:
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
[Expert@R82:0]# mount
/dev/mapper/vg_splat-lv_current on / type xfs (rw,inode32)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/vda2 on /boot type ext3 (rw)
/dev/mapper/vg_splat-lv_log on /var/log type xfs (rw,inode32)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cgroup on /sys/fs/cgroup type tmpfs (rw,uid=0,gid=0,mode=0755)
/sys on /var/log/jail/sys type none (rw,bind)
/proc on /var/log/jail/proc type none (rw,bind)
/usr/lib on /var/log/jail/lib type none (rw,bind)
/usr/lib64 on /var/log/jail/lib64 type none (rw,bind)
/usr/lib on /var/log/jail/usr/lib type none (rw,bind)
/usr/lib64 on /var/log/jail/usr/lib64 type none (rw,bind)
/usr/bin on /var/log/jail/usr/bin type none (rw,bind)
/usr/bin on /var/log/jail/bin type none (rw,bind)
/opt/CPshrd-R82/monitoring on /var/log/jail/opt/CPshrd-R82/monitoring type none (rw,bind)
/var/log/opt/CPsuite-R82/fw1/tmp on /var/log/jail/opt/CPsuite-R82/fw1/tmp type none (rw,bind)
/opt/CPsuite-R82/fw1/oracle_oi on /var/log/jail/opt/CPsuite-R82/fw1/oracle_oi type none (rw,bind)
/opt/CPsuite-R82/fw1/oracle_oi on /var/log/jail/etc/fw/oracle_oi type none (rw,bind)
/var/opt/CPsuite-R82/fw1/conf/scrub_watermark on /var/log/jail/opt/CPsuite-R82/fw1/conf/scrub_watermark type none (rw,bind)
/var/opt/CPsuite-R82/fw1/conf/scrub_watermark on /var/log/jail/etc/fw/conf/scrub_watermark type none (rw,bind)
/opt/CPsuite-R82/fw1/scripts/tex_watermark on /var/log/jail/opt/CPsuite-R82/fw1/scripts/tex_watermark type none (rw,bind)
/opt/CPsuite-R82/fw1/scripts/tex_watermark on /var/log/jail/etc/fw/scripts/tex_watermark type none (rw,bind)
/var/log/jail/opt/CPsuite-R82/fw1/scripts on /var/log/jail/var/log/jail/opt/CPsuite-R82/fw1/scripts type none (rw,bind)
/opt/CPsuite-R82/fw1/scripts/cpfc_jail on /var/log/jail/opt/CPsuite-R82/fw1/scripts/cpfc_jail type none (rw,bind)
/var/log/dlp/smtp on /var/log/jail/opt/CPsuite-R82/fw1/dlp/smtp type none (rw,bind)
/var/log/dlp/http on /var/log/jail/opt/CPsuite-R82/fw1/dlp/http type none (rw,bind)
/var/log/dlp/ftp on /var/log/jail/opt/CPsuite-R82/fw1/dlp/ftp type none (rw,bind)
/var/log/dlp/fingerprint on /var/log/jail/opt/CPsuite-R82/fw1/dlp/fingerprint type none (rw,bind)
/tmp/scrub on /var/log/jail/tmp/scrub type none (rw,bind)
[Expert@R82:0]#

Jerry · ‎2024-10-23

well mate VMWare is a totally different kettle of fish 🙂

Jerry

the_rock · ‎2024-10-23

Yea, I know lol

Anywho, let's see if @shais can help further.

Andy

Jerry · ‎2024-10-24

hi chaps

the update I've promised is as follows. it took Gaia R82 some time (again), but unfortunately the issue still persists: not in the logs, but the error counters are still present on each physical 10G interface as well as on the BOND. please see below:

*** physical 10G SFP+ interfaces ***

eth1-01 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:4915134 errors:1026 dropped:0 overruns:0 frame:0
TX packets:9763534 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4391624842 (4.0 GiB) TX bytes:5995868400 (5.5 GiB)

eth1-02 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:9681626 errors:1056 dropped:0 overruns:0 frame:0
TX packets:9726841 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6111685985 (5.6 GiB) TX bytes:9917356185 (9.2 GiB)

eth1-03 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:8034042 errors:985 dropped:0 overruns:0 frame:0
TX packets:4877023 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:8419293653 (7.8 GiB) TX bytes:3472150420 (3.2 GiB)

eth1-04 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:5667213 errors:1127 dropped:0 overruns:0 frame:0
TX packets:5140141 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4686993232 (4.3 GiB) TX bytes:5163720723 (4.8 GiB)

eth2-01 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:7864861 errors:960 dropped:0 overruns:0 frame:0
TX packets:3445062 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7061148299 (6.5 GiB) TX bytes:2654281510 (2.4 GiB)

eth2-02 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:6338698 errors:789 dropped:0 overruns:0 frame:0
TX packets:9442289 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6556240235 (6.1 GiB) TX bytes:9210719409 (8.5 GiB)

*** and the bond of 6x10G ***

bond1 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
inet6 addr: fe80::21c:7fff:fe69:35bc/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:9216 Metric:1
RX packets:42538935 errors:5943 dropped:0 overruns:0 frame:0
TX packets:42425808 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:37240011883 (34.6 GiB) TX bytes:36426317412 (33.9 GiB)
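
One thing the counters above do show: the six per-interface RX error counts add up to exactly the bond1 figure (5943, roughly 0.014% of the packets bond1 received), so the bond counter looks like an aggregate of its members rather than a separate source of errors. A quick sketch of that check, assuming ifconfig output in the format shown above; `sum_rx_errors` is a hypothetical helper name.

```shell
#!/bin/bash
# sum_rx_errors: read ifconfig-style output on stdin and add up the
# "errors:N" field of every "RX packets:" line.
sum_rx_errors() {
    awk '/RX packets:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^errors:/) { split($i, kv, ":"); total += kv[2] }
    } END { print total + 0 }'
}

# The six RX lines from the physical interfaces above:
sum_rx_errors <<'EOF'
RX packets:4915134 errors:1026 dropped:0 overruns:0 frame:0
RX packets:9681626 errors:1056 dropped:0 overruns:0 frame:0
RX packets:8034042 errors:985 dropped:0 overruns:0 frame:0
RX packets:5667213 errors:1127 dropped:0 overruns:0 frame:0
RX packets:7864861 errors:960 dropped:0 overruns:0 frame:0
RX packets:6338698 errors:789 dropped:0 overruns:0 frame:0
EOF
# prints 5943, the same value bond1 reports
```

On a live box the same helper could be fed with `for i in eth1-01 eth1-02 eth1-03 eth1-04 eth2-01 eth2-02; do ifconfig "$i"; done | sum_rx_errors`.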

@Timothy_Hall - any ideas, clues or hints?

@the_rock - what'd you do next though?

Cheers!

Jerry

the_rock · ‎2024-10-24

I will do some more tests in the lab later bro.

Andy

Jerry · ‎2024-10-24

much appreciated. I think I've personally exhausted all the options on the table to narrow down that investigation. if the driver is the root cause, then imho the issue is in GAIA R82 and sooner or later the checksum errors will reappear.

so unless you prove me wrong, I'm under the impression that I won't be the only one having that issue, considering that R82 is a pretty fresh GA. other than this issue all works perfectly fine, and overall I'm happy with all the new built-in stuff and how the performance of that platform improved the security posture. however, those checksum errors are quite worrying, don't you think folks?

hope we can get some answers from R&D and soon be in a position to get it fixed for all, not just myself.

Cheers!

Jerry
