"UDP checksum is incorrect" - 100s of IPv6 DROPs in fw.log - TP/IPS responsible?
hi folks
quick one
one of my customers just upgraded to R82 last night and found in fw.log 100s of drops due to "UDP checksum is incorrect".
knowing how UDP works I presume that TP/IPS is to blame but which protection is responsible for that?
any clues?
The above will apply only until the next reboot.
To allow it to survive reboot, add to
$FWDIR/boot/modules/fwkern.conf
udp_is_verify_cksum=0
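A minimal sketch of making that edit idempotent, so that repeated runs leave exactly one line for the parameter. The helper name `set_kernel_param` is hypothetical (it is not a Check Point tool); only the parameter name and value come from this thread.

```shell
# set_kernel_param FILE NAME VALUE
# Hypothetical helper: make sure FILE ends up with exactly one "NAME=VALUE" line
# and leave every other line untouched.
set_kernel_param() {
    local file=$1 name=$2 value=$3 tmp
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Drop any existing line for this parameter, with or without spaces around "=".
    # grep exits non-zero when nothing is left to print, which is fine here.
    grep -v -E "^[[:space:]]*${name}[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    # Write back through the original file so its owner and permissions are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Usage would be `set_kernel_param <path to fwkern.conf> udp_is_verify_cksum 0`; the listing later in this thread shows the file under `$FWDIR/boot/modules`. After the reboot, the running value can be checked with `fw ctl get int udp_is_verify_cksum`.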
Thanks a lot, Done. cprestart or reboot?
that still did not resolve the problem, Shais ... are you able to keep working with me to find the root cause and see whether we can fix it? I still have no response to the message I sent you ....
Hey bud,
So sorry about yesterday, had to deal with some other issues, will do some tests in the lab shortly and update you.
Andy
no worries Andy, in fact I was hoping CP would also be interested in helping, for their own benefit, but so far the radio silence speaks for itself ...
No worries bud, maybe @shais is just busy. I will always do my best to help you. Btw, I updated the fwkern.conf file, so let me reboot and see if the values below change, though there are no errors now, so I'm not sure it will make any difference.
Andy
[Expert@R82:0]# cd $FWDIR/boot/modules
[Expert@R82:0]# more fwkern.conf
udp_is_verify_cksum=0
[Expert@R82:0]#
thanks mate, I always appreciate your commitment and solid contribution, however,
let me just point out something very specific here: there is a difference between how you test this and the topology where I see the issues on R82
1. I use LACP 6x10GB interfaces with Fast Layer3+4 hashing
2. my errors appear after a reboot, though not immediately; as you know I've applied the kernel parameter change and the RX checksum errors still reappear
3. ping me on Teams so maybe we can t-shoot it together when you have time buddy?
4. I'm under the impression that the driver version and vendor matter, so maybe that is the key?
5. I really wonder what CP R&D says, so let's wait until they "find time" .... 🙂
One key determination you'll need to make is whether the traffic being dropped due to these errors is "garbage" or not. Does the RX-DRP counter seem to slowly increment regardless of load? sar -n EDEV can help determine this.
It is possible there is some kind of occasional broadcast traffic getting splattered onto the network that the firewall would not process anyway; this traffic may have just been ignored by the R81.20 NIC driver but now it is getting reported. The classic example is the RX-DRP counter which before Gaia 3.10 indicated a drop of desirable traffic that we wanted to process but got lost due to buffering issues. But in Gaia 3.10 the counter is incremented for these buffering issues AND also for "garbage" traffic like unknown EtherTypes and improperly pruned VLAN tags that occasionally splatter into the network.
Unfortunately almost all NICs will not pass errored frames up to the Gaia OS where they can be seen by tools like tcpdump, so unless you have a specialized sniffer appliance that can show you those errored frames this is going to be a tough road.
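One way to check whether the counters climb regardless of load is to sample them twice and compare. The sketch below is an alternative to sar, not what the post prescribes: it assumes the standard Linux sysfs statistics files are available on the gateway, and the function names and the 60-second interval are illustrative only.

```shell
# read_rx_counters IFACE [SYSROOT]
# Prints "rx_errors rx_dropped rx_crc_errors" for IFACE, read from sysfs.
# SYSROOT defaults to /sys/class/net and is only overridable for testing.
read_rx_counters() {
    local iface=$1 root=${2:-/sys/class/net}
    local dir="$root/$iface/statistics"
    printf '%s %s %s\n' \
        "$(cat "$dir/rx_errors")" \
        "$(cat "$dir/rx_dropped")" \
        "$(cat "$dir/rx_crc_errors")"
}

# counter_delta "BEFORE" "AFTER"
# Takes two samples from read_rx_counters and prints how much each counter grew.
counter_delta() {
    # Intentionally unquoted: split both samples into six positional parameters.
    set -- $1 $2
    printf 'rx_errors=+%d rx_dropped=+%d rx_crc_errors=+%d\n' \
        $(( $4 - $1 )) $(( $5 - $2 )) $(( $6 - $3 ))
}
```

For example: `before=$(read_rx_counters eth1-01); sleep 60; after=$(read_rx_counters eth1-01); counter_delta "$before" "$after"`. Repeating this during busy and quiet periods gives a rough idea of whether the error rate follows the traffic load.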
March 27th with sessions for both the EMEA and Americas time zones
I know, I agree with you 100%.
However, I believe this is worth pointing out...
As BASIC as my lab is, I made that kernel value permanent, rebooted, and checked the interface counters: it's exactly the SAME...did not change a single value.
Andy
happy to do some lab rabbiting, should you prefer a Zoom sesh, e.g. tomorrow? 🙂
Cheers
Jerry
hi Shais, were you able to secure some time for the troubleshooting session re. the above?
the workaround you all suggested seems to work as far as fw.log is concerned, but it did not make the NIC errors disappear: they still exist and keep increasing. Have you got any other hints/clues regarding that issue? See my post above from today.
In case you'd be up for a deeper dive, please let me know.
Bro, I would still open a TAC case for reference. I say this because if you bring this issue up with your local SE, 100% the first thing they will ask is whether you have a case open for it.
Andy
hi mate, the only problem is that opening a TAC case is only an option if the device affected by the above issue is under support, right? 🙂 My LAB device isn't under support, hence I'm quite reluctant to deal with TAC, as this wouldn't be an easy journey when your LAB device runs on EVALs. Hope that makes sense?
That's why I've replied to Shais but so far no response 😞
Yes, I'm so sorry, totally overlooked that part...let's hope @shais responds.
Andy
If the LAB is used to prepare the R82 rollout for a customer with a valid support contract, you can open an SR# under the User Account of that customer.
hi, thanks for your contribution, but as I mentioned earlier my LAB is "just a lab": it has always run on EVALs, as it isn't used for much beyond "lab'ing", so I cannot associate that device with ANY customer I support nor with ANY accounts I administer in CP UC. A TAC case is therefore simply a NO GO, hence our friend from CP here could be the only option for t-shooting the interface errors. Meanwhile, I bet I'm not the only one using the driver below:
[Expert@cp15k:0]# ethtool --driver eth1-01
driver: ixgbe
version: 5.15.2 (V1.0.1_ckp)
firmware-version: 0x800000cb
expansion-rom-version:
bus-info: 0000:87:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
So I assume you were not seeing these errors prior to the R82 upgrade?
It is possible that the updated R82 NIC driver is logging these errors now where it didn't before. So the problem could have been there before but now it is being logged. Either you are receiving frames that really do have a CRC/checksum error that are being discarded, or there is some kind of problem with the checksum offload and it is flagging frames that it shouldn't. Either way you won't be able to see these errored frames in a packet capture to identify what they are, as the NIC will not pass errored frames to the Gaia OS at all.
Question is are these frames something that the firewall needs to process, or are they "junk" traffic like broadcasts that we can't process anyway?
Next step would be to disable the checksum offloads at the NIC level from expert mode and see what happens:
ethtool -K eth1-01 rx-checksumming off
before R82 I definitely did not have such errors on a per-interface basis; meanwhile I did try this, look at the output:
[Expert@cp:0]# ethtool -K eth1-01 rx-checksumming off
ethtool: bad command line argument(s)
For more information run ethtool -h
Try this:
ethtool -K eth1-01 rx off
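To confirm that the setting took effect, the lowercase `ethtool -k` form lists the current offload features. A small sketch, assuming the usual `rx-checksumming: on|off` line in that output; the helper name `rx_csum_state` is made up for illustration.

```shell
# rx_csum_state
# Reads `ethtool -k IFACE` output on stdin and prints "on" or "off" for
# RX checksum offload, or "unknown" if the line is not present.
rx_csum_state() {
    awk '
        $1 == "rx-checksumming:" { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}
```

Usage: `ethtool -k eth1-01 | rx_csum_state`. Changes made with `ethtool -K` are runtime settings on Linux and are normally lost at reboot, so the check is worth repeating after a restart.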
that worked. Is that the final step, or should I try something else and see whether the RX errors keep increasing, Tim?
Does not that command simply hide the actual number of errors, even if they are technically present?
Andy
take a look at this guys, check this out, errors are also on the BOND itself (lacp L3+4):
Hey bud,
I assume rebooting the appliance does not do much? Or does the issue go "away" for 10-15 mins and then it's back again?
Andy
hi mate
you mean reboot after this command: ethtool -K eth1-01 rx off
I did not try yet but can do so in about 2h, when business hours end in London and I'm officially OOO 🙂
If the errors go away for longer after running that command on each of the 6 eth's in my bond, I'll definitely share that update with you, as always. But what if the issue still persists, as shown in my screenshot? Shall CP R&D get involved re. the driver for Gaia t-shooting, or shall we consider this a much broader issue?
Just as an aside, I did an R82 upgrade on another "appliance" yesterday and, despite the "default and smooth process taking roughly 2h" .... there are no errors on the interfaces whatsoever. Same line card (10G FIBRE), so ... same driver etc. Look at this (2 diff. appliances):
[Expert@cp15k:0]# ethtool --driver eth1-01
driver: ixgbe
version: 5.15.2 (V1.0.1_ckp)
firmware-version: 0x800000cb
expansion-rom-version:
bus-info: 0000:87:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
vs
[Expert@cp58:0]# ethtool --driver eth1-01
driver: ixgbe
version: 5.15.2 (V1.0.1_ckp)
firmware-version: 0x73b90000
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
whilst cp58 has no errors on the interfaces whatsoever:
[Expert@cp58:0]# ifconfig eth1-01
eth1-01 Link encap:Ethernet HWaddr 00:1C:7F:61:E8:02
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:11337028 errors:0 dropped:0 overruns:0 frame:0
TX packets:14317656 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7943578450 (7.3 GiB) TX bytes:10475299922 (9.7 GiB)
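The two `ethtool --driver` listings above can also be compared mechanically. The sketch below assumes each output was saved to a file; the function name `driver_info_diff` and the idea of saving the outputs are illustrative, not something done in this thread.

```shell
# driver_info_diff FILE_A FILE_B
# Compares two saved `ethtool --driver` outputs and prints only the fields
# whose values differ, as "field: value_in_A | value_in_B".
driver_info_diff() {
    awk '
        {
            key = $1                          # e.g. "firmware-version:"
            val = $0
            sub(/^[^ \t]+[ \t]*/, "", val)    # rest of the line after the field name
        }
        NR == FNR { first[key] = val; next }  # first file: remember each value
        (key in first) && first[key] != val {
            printf "%s %s | %s\n", key, first[key], val
        }
    ' "$1" "$2"
}
```

For the cp15k and cp58 outputs above this reports only `firmware-version` (0x800000cb vs 0x73b90000) and `bus-info` as different: the ixgbe driver build is identical, but the NIC firmware is not.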
I meant reboot in general. Btw, below is what I see in eve-ng lab.
Andy
[Expert@R82:0]# ethtool --driver eth0
driver: vmxnet3
version: 1.6.0.0-k-NAPI
firmware-version:
expansion-rom-version:
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
[Expert@R82:0]# mount
/dev/mapper/vg_splat-lv_current on / type xfs (rw,inode32)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/vda2 on /boot type ext3 (rw)
/dev/mapper/vg_splat-lv_log on /var/log type xfs (rw,inode32)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cgroup on /sys/fs/cgroup type tmpfs (rw,uid=0,gid=0,mode=0755)
/sys on /var/log/jail/sys type none (rw,bind)
/proc on /var/log/jail/proc type none (rw,bind)
/usr/lib on /var/log/jail/lib type none (rw,bind)
/usr/lib64 on /var/log/jail/lib64 type none (rw,bind)
/usr/lib on /var/log/jail/usr/lib type none (rw,bind)
/usr/lib64 on /var/log/jail/usr/lib64 type none (rw,bind)
/usr/bin on /var/log/jail/usr/bin type none (rw,bind)
/usr/bin on /var/log/jail/bin type none (rw,bind)
/opt/CPshrd-R82/monitoring on /var/log/jail/opt/CPshrd-R82/monitoring type none (rw,bind)
/var/log/opt/CPsuite-R82/fw1/tmp on /var/log/jail/opt/CPsuite-R82/fw1/tmp type none (rw,bind)
/opt/CPsuite-R82/fw1/oracle_oi on /var/log/jail/opt/CPsuite-R82/fw1/oracle_oi type none (rw,bind)
/opt/CPsuite-R82/fw1/oracle_oi on /var/log/jail/etc/fw/oracle_oi type none (rw,bind)
/var/opt/CPsuite-R82/fw1/conf/scrub_watermark on /var/log/jail/opt/CPsuite-R82/fw1/conf/scrub_watermark type none (rw,bind)
/var/opt/CPsuite-R82/fw1/conf/scrub_watermark on /var/log/jail/etc/fw/conf/scrub_watermark type none (rw,bind)
/opt/CPsuite-R82/fw1/scripts/tex_watermark on /var/log/jail/opt/CPsuite-R82/fw1/scripts/tex_watermark type none (rw,bind)
/opt/CPsuite-R82/fw1/scripts/tex_watermark on /var/log/jail/etc/fw/scripts/tex_watermark type none (rw,bind)
/var/log/jail/opt/CPsuite-R82/fw1/scripts on /var/log/jail/var/log/jail/opt/CPsuite-R82/fw1/scripts type none (rw,bind)
/opt/CPsuite-R82/fw1/scripts/cpfc_jail on /var/log/jail/opt/CPsuite-R82/fw1/scripts/cpfc_jail type none (rw,bind)
/var/log/dlp/smtp on /var/log/jail/opt/CPsuite-R82/fw1/dlp/smtp type none (rw,bind)
/var/log/dlp/http on /var/log/jail/opt/CPsuite-R82/fw1/dlp/http type none (rw,bind)
/var/log/dlp/ftp on /var/log/jail/opt/CPsuite-R82/fw1/dlp/ftp type none (rw,bind)
/var/log/dlp/fingerprint on /var/log/jail/opt/CPsuite-R82/fw1/dlp/fingerprint type none (rw,bind)
/tmp/scrub on /var/log/jail/tmp/scrub type none (rw,bind)
[Expert@R82:0]#
well mate VMWare is a totally different kettle of fish 🙂
hi chaps
the update I promised is as follows: it took Gaia R82 some time (again), but unfortunately the issue still persists (not in the logs, though, but the error counters are still present on each physical 10G interface as well as on the BOND). Please see below:
*** physical 10G SFP+ interfaces ***
eth1-01 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:4915134 errors:1026 dropped:0 overruns:0 frame:0
TX packets:9763534 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4391624842 (4.0 GiB) TX bytes:5995868400 (5.5 GiB)
eth1-02 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:9681626 errors:1056 dropped:0 overruns:0 frame:0
TX packets:9726841 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6111685985 (5.6 GiB) TX bytes:9917356185 (9.2 GiB)
eth1-03 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:8034042 errors:985 dropped:0 overruns:0 frame:0
TX packets:4877023 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:8419293653 (7.8 GiB) TX bytes:3472150420 (3.2 GiB)
eth1-04 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:5667213 errors:1127 dropped:0 overruns:0 frame:0
TX packets:5140141 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4686993232 (4.3 GiB) TX bytes:5163720723 (4.8 GiB)
eth2-01 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:7864861 errors:960 dropped:0 overruns:0 frame:0
TX packets:3445062 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7061148299 (6.5 GiB) TX bytes:2654281510 (2.4 GiB)
eth2-02 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9216 Metric:1
RX packets:6338698 errors:789 dropped:0 overruns:0 frame:0
TX packets:9442289 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6556240235 (6.1 GiB) TX bytes:9210719409 (8.5 GiB)
*** and the bond of 6x10G ***
bond1 Link encap:Ethernet HWaddr 00:1C:7F:69:35:BC
inet6 addr: fe80::21c:7fff:fe69:35bc/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:9216 Metric:1
RX packets:42538935 errors:5943 dropped:0 overruns:0 frame:0
TX packets:42425808 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:37240011883 (34.6 GiB) TX bytes:36426317412 (33.9 GiB)
@Timothy_Hall - any ideas, clues or hints?
@the_rock - what'd you do next though?
Cheers!
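To put those counters in proportion, the ifconfig output above can be reduced to an error rate per interface. A sketch, assuming the classic `RX packets:N errors:N` layout shown in this thread; `rx_error_summary` is a hypothetical helper name.

```shell
# rx_error_summary
# Reads classic ifconfig output on stdin and prints one line per interface:
# "IFACE RX_PACKETS RX_ERRORS ERRORS_PER_MILLION_PACKETS"
rx_error_summary() {
    awk '
        /Link encap:/ { iface = $1 }
        /RX packets:/ {
            split($2, p, ":")    # "packets:4915134" -> p[2] = 4915134
            split($3, e, ":")    # "errors:1026"     -> e[2] = 1026
            ppm = (p[2] > 0) ? e[2] * 1000000 / p[2] : 0
            printf "%s %d %d %.0f\n", iface, p[2], e[2], ppm
        }
    '
}
```

Usage: `ifconfig bond1 | rx_error_summary`. On the figures above, bond1 comes out at roughly 140 errors per million received packets. The six member counters (1026 + 1056 + 985 + 1127 + 960 + 789) add up to exactly the 5943 shown on bond1, so the bond figure looks like the aggregate of its members rather than an additional set of errors.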
I will do some more tests in the lab later bro.
Andy
much appreciated. I think I've personally exhausted all the options on the table to narrow down this investigation; if the driver is the root cause, then imho the issue lies in Gaia R82 and sooner or later the checksum errors will reappear.
so unless you prove me wrong, I'm under the impression that I won't be the only one having this issue, considering that R82 is a pretty fresh GA. Other than this issue everything works perfectly fine, and overall I'm happy with all the new built-in stuff and with how the performance of the platform has improved the security posture; however, those checksum errors are quite worrying, don't you think, folks?
hope we can get some answers from R&D and soon be in a position to get it fixed for everyone, not just myself.
Cheers!
