PetterD
Collaborator

Problem with local logging for some gateways - buffer overflow

Hello,

I have a strange issue with log latency from some gateways. The issue keeps recurring and came back again a few weeks ago.
I had a previous post about this issue, but back then it automagically solved itself, and now it has been back for the last few weeks.

It never affects VSX, any of the 7 VSes, or another cluster (which are all connected to the same VLAN as the Management), but it does occur for the active member of 2 of the 4 VE clusters and for about 4 of 20 other gateways, so in total 6 gateways have this issue.


The environment is:

Firewall-1 Management/Log Server R81.20 running on VMware on top of Nutanix.
Several gateways, a VSX cluster, appliances, and 4 VE clusters, also running on VMware/Nutanix, on R81.10 and R81.20.

We do have a TAC case open for this now, but we are having a hard time finding the root cause of the issue.
The issue is "always" present, but the logging delay varies from 30 minutes to 5 hours.

When the issue occurs we see the "Writing logs locally due to high log rate (buffer overflow)" message in the output of "cpstat fw -f log_connection".
Also, when running an fwd debug with "TDERROR_ALL_FWLOG_DISPATCH=5", we see the following in fwd.elg:

 


###########
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] logbuf_write: writes logs to local disk because overflow
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..--> changeWritingLogStatusToLocal
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..<-- changeWritingLogStatusToLocal
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..--> create_default_log
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...--> connect_to_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] connect_to_server: server default
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....--> connect_to_local_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] connect_to_local_server: 'default' as DEFAULT server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .....--> set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .....<-- set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] connect_to_local_server: connected to local server successfuly
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....<-- connect_to_local_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...<-- connect_to_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] create_default_log: connected to default log server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...--> log_local_write
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...<-- log_local_write
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] create_default_log: Buffer Overflow ! Save the cyclic buffer content locally. Start at -870268199323566080, end at 2550727950
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...--> disconnect_from_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....--> set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....<-- set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] disconnect_from_server: stop logging at 'default'
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...<-- disconnect_from_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] create_default_log: disconnected from default log server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..<-- create_default_log
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .<-- logbuf_write
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .--> log_has_connected_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .<-- log_has_connected_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] log_add_e__logclient: writes logs to local disk because overflow
###########
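
For reference, this is roughly how we toggle that fwd debug (standard "fw debug" syntax; the path assumes a default $FWDIR):

# Enable verbose debug for the fwd log-dispatch topic used above
fw debug fwd on TDERROR_ALL_FWLOG_DISPATCH=5

# Watch the debug output as it is written
tail -f $FWDIR/log/fwd.elg

# Turn the debug off again when finished
fw debug fwd off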

Now the challenge is finding the root cause, as it only seems to happen for these 6 gateways, and "buffer overflow" doesn't say much about where the issue is.

Why do these 6 gateways, most of which have very little traffic and CPU usage and roughly the same amount of logs as the other gateways, log locally and only sporadically send logs to the Management?

According to "cpstat mg -f log_server" its been weeks/months since log connections were established from the gateways so it doesnt look like its the network that resets this.

Would this indicate a capacity issue on the log server or the gateways, or congestion in the network, and are there any ways to identify the root cause? "Buffer overflow" doesn't tell us much about where to investigate further 🙂

All 6 of these gateways have to traverse 2 virtual systems to reach the Log Server (which runs on the Management). But we have several other gateways that log over the same path and work fine, with an almost identical amount of logging (52/60 Log Handling Rate for the respective firewalls).

I would suspect an issue with the Log Server, but then it "should" potentially affect all gateways and not always the same 6, unless there is some sort of priority.


TAC points to the version and wants us to upgrade to the latest R81.20; they suspect the issue is on the gateway side (currently focusing on just one of the 6), but we do have gateways on the same version that work and never have log latency.

Also, the fact that we had this issue for a while, that it automagically disappeared for weeks and then came back for weeks, makes me suspect the "network" somehow, but then it should affect more gateways than just these 6. And we aren't seeing any capacity/traffic issues on the virtual systems.

 

CCSM / CCSE / CCVS / CCTE
6 Replies
PhoneBoy
Admin

Are the relevant gateways showing logs collecting in $FWDIR/log?
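
(A quick way to check this, assuming default paths, is something like the following.)

# On a suspect gateway: look for locally written log files growing while the issue is active
ls -lhtr $FWDIR/log/ | tail -20
du -sh $FWDIR/log
df -h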

Was your R81.20 Management installation a fresh install or did you migrate in place from a previous version?
If you didn't do a fresh install followed by a migrate_import, I highly recommend doing so.
We fixed a long-standing bug in disk partitioning that will improve disk performance in a VM substantially.
However, it requires a fresh install of R81.20 on a new VM.
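
Roughly, the flow looks like this (check the exact syntax and prerequisites in the R81.20 Installation and Upgrade Guide for your build; the export path is just an example):

# On the old Management, with the R81.20 upgrade tools installed: export the database
cd $FWDIR/scripts
./migrate_server export -v R81.20 /var/log/mgmt_export.tgz

# On the freshly installed R81.20 VM: verify and then import the exported package
cd $FWDIR/scripts
./migrate_server verify -v R81.20 /var/log/mgmt_export.tgz
./migrate_server import -v R81.20 /var/log/mgmt_export.tgz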

PetterD
Collaborator

Hi!

We do see log files growing in $FWDIR/log. The issue only seems to affect the same 6 gateways.

The SmartCenter installation was performed with a "migrate_server export -t R81.20" from the old R81.10 installation and a fresh install on a new VM using the .ova file, adding a SCSI disk aligned with the existing one and using LVM to extend the log partition.
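
For reference, this is roughly how we verify the extended log partition after the LVM change (standard LVM tools; on Gaia the log volume is normally mounted on /var/log):

# Show physical volumes, volume groups and logical volumes after the extension
pvs
vgs
lvs

# Confirm the filesystem on the log partition was actually grown
df -h /var/log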

We did more testing last week, and after every reboot of the SmartCenter logging was immediately in sync again for a while, but the issue came back. The last reboot was 6 days ago and (so far) the issue hasn't returned. The only difference (apart from national holidays with almost nobody at any office for 6 days) is that, prior to the last time the issue came back, we did a policy install on one of the VSes that this log traffic traverses. Definitely not sure if this is related, but it is something we are following up on.

Any other good ways to find potential disk issues on the Management side? We have reserved 100% of the CPU resources for the VM, but it will of course still share things like disk with other VMs.
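
In case it helps, these are the basic checks we have been running on the Management to look for disk pressure (nothing Check Point specific, just standard sysstat tools, assuming they are collecting disk statistics):

# Historical per-device disk statistics; look for intervals with high await/%util
sar -d

# Live disk utilization, 5-second samples
iostat -x 5 5

# Make sure no partition is filling up
df -h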

CCSM / CCSE / CCVS / CCTE
Timothy_Hall
Legend

Every one of my experiences with Nutanix has been overwhelmingly negative as far as performance goes: network issues, disk performance issues, etc.  Your issues may be related to where the 6 problematic gateways are assigned to real hardware by Nutanix, with that hardware being oversubscribed.  You could be experiencing network errors or "stealing" of CPU cycles by the hypervisor.
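
A quick way to spot hypervisor CPU stealing from inside the VM is the steal counters (the "st" column in vmstat, %steal in sar), for example:

# "st" column on the far right shows CPU time stolen by the hypervisor
vmstat 5 5

# %steal per core; values consistently above a few percent are suspicious
sar -P ALL 1 10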

It sounds like the logging rate from the inspection code on these gateways is overwhelming the fwd daemon on the gateways, as it can't empty its logging buffer fast enough because the sending of logs to the SMS/Log Server is so bogged down.  The logging buffer can be increased, but this is only addressing the symptom of the problem and not the cause:

sk16417: How to increase the log buffer size on Linux platforms

sk118936: Gateway/Virtual System does not send logs to its Log Server
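
As a very rough sketch only: the buffer change itself is a kernel parameter set at runtime with fw ctl and made persistent in fwkern.conf. The parameter name fw_log_bufsize below is an assumption, so confirm the exact name and supported values in sk16417 before changing anything:

# Check the current value (parameter name assumed - verify in sk16417)
fw ctl get int fw_log_bufsize

# Change it at runtime (lost on reboot)
fw ctl set int fw_log_bufsize 16384

# Make it persistent across reboots
echo 'fw_log_bufsize=16384' >> $FWDIR/boot/modules/fwkern.conf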

Please post the output of the following commands taken on a problematic gateway, ideally while the log problem is happening:

netstat -ni

cpprod_util FwIsUsermode

sar -n EDEV

sar -P ALL

fw ctl affinity -l -r -v

free -m

fwaccel stats -s

On a system with 20 or more cores, fwd is automatically assigned its own dedicated CPU core.  This might be needed if fwd is getting starved for CPU resources by a busy kernel.
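
If it helps, a rough wrapper like this can capture all of the above in one pass to a timestamped file the next time the delay is visible (output path and filename are just examples):

#!/bin/bash
# Collect the gateway diagnostics listed above into one timestamped file
OUT=/var/log/log_delay_$(hostname)_$(date +%Y%m%d_%H%M%S).txt
{
  echo "### netstat -ni";              netstat -ni
  echo "### cpprod_util FwIsUsermode"; cpprod_util FwIsUsermode
  echo "### sar -n EDEV";              sar -n EDEV
  echo "### sar -P ALL";               sar -P ALL
  echo "### fw ctl affinity -l -r -v"; fw ctl affinity -l -r -v
  echo "### free -m";                  free -m
  echo "### fwaccel stats -s";         fwaccel stats -s
} > "$OUT" 2>&1
echo "Output written to $OUT"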

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
PetterD
Collaborator

Thanks for the reply!


Yes, I'm definitely not very confident in the Nutanix solution, and there have been both disk and CPU issues in the environment before; the problem is pinpointing where. CPU usage is low on the Management all the time, and we also have a 100% CPU reservation on the Management VM in Nutanix, so it shouldn't have to "fight" the other VMs for physical CPU.

Out of the 6 gateways we have had issues with, two are appliances with very little traffic that are not running virtualized.
We did two reboots of the Management on May 16th, and since the last reboot the issue has not reoccurred.

Since it also happens for gateways running on physical hardware (5200 appliances, not clustered), I'm led to believe it is "something" on the Management, especially since rebooting clears the issue and has now kept it away for 6 days. However, it is strange that it never occurs for the 20+ other gateways, so unless there is some sort of prioritization on the Management, it is a very weird coincidence 🙂  Will keep monitoring the issue and post the output of the commands if the problem comes back!

 

 

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

Let us know how you sort this out.

Best,

Andy

Timothy_Hall
Legend

OK, thanks. From your description it wasn't clear where you thought the issue was, but it sounds like it is on the SMS in Nutanix.  Here are the revised commands you should run on the SMS when it happens again:

netstat -ni

sar -n EDEV

sar -P ALL

free -m

iotop -d 5 -b -n 3

cpstat os -f multi_cpu -o 1 -c 5
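
Since the problem comes and goes, it may also be worth sampling these periodically in the background on the SMS so the data is already there when it recurs. A minimal sketch (interval and output path are just examples; start it with nohup and stop it with kill):

#!/bin/bash
# Sample the SMS diagnostics above every 10 minutes and append to one file
OUT=/var/log/sms_log_delay_samples.txt
while true; do
  {
    echo "===== $(date) ====="
    netstat -ni
    sar -n EDEV
    sar -P ALL
    free -m
    iotop -d 5 -b -n 3
    cpstat os -f multi_cpu -o 1 -c 5
  } >> "$OUT" 2>&1
  sleep 600
done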

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
