Hello,
I have a strange issue with log latency from some gateways. The issue keeps recurring and came back again a few weeks ago.
I posted about this issue before, but it then automagically resolved itself, and now it's been back for the last few weeks.
It never affects the VSX cluster or any of its 7 VSes, nor another cluster (all of which are connected to the same VLAN as the Management). It does occur on the active member of 2 of the 4 VE-Clusters and on about 4 of 20 other gateways, so in total 6 gateways have this issue.
Environment is:
Firewall-1 Management/Log Server R81.20 running on VMware on top of Nutanix.
Several gateways, a VSX cluster, appliances and 4 VE-Clusters also running on VMware/Nutanix, on R81.10 and R81.20.
We have a TAC case open for this issue now but are having a hard time finding the root cause.
The issue is "always" present, but the logging delay varies from 30 minutes to 5 hours.
When the issue occurs we see the "Writing logs locally due to high log rate (buffer overflow)" message in the output of "cpstat fw -f log_connection".
Also when running fwd-debug with "TDERROR_ALL_FWLOG_DISPATCH=5" we see the following in fwd.elg:
###########
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] logbuf_write: writes logs to local disk because overflow
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..--> changeWritingLogStatusToLocal
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..<-- changeWritingLogStatusToLocal
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..--> create_default_log
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...--> connect_to_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] connect_to_server: server default
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....--> connect_to_local_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] connect_to_local_server: 'default' as DEFAULT server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .....--> set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .....<-- set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] connect_to_local_server: connected to local server successfuly
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....<-- connect_to_local_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...<-- connect_to_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] create_default_log: connected to default log server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...--> log_local_write
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...<-- log_local_write
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] create_default_log: Buffer Overflow ! Save the cyclic buffer content locally. Start at -870268199323566080, end at 2550727950
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...--> disconnect_from_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....--> set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ....<-- set_new_server_status
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] disconnect_from_server: stop logging at 'default'
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ...<-- disconnect_from_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] create_default_log: disconnected from default log server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] ..<-- create_default_log
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .<-- logbuf_write
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .--> log_has_connected_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] .<-- log_has_connected_server
[FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] log_add_e__logclient: writes logs to local disk because overflow
###########
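To check whether the switch to local logging clusters around particular times of day, the fwd.elg output could be tallied per hour with something like the rough Python sketch below. It is based only on the line format in the excerpt above. The function names and the `year` parameter are my own assumptions (the debug lines carry no year), and it counts only the `logbuf_write` line so each overflow event is counted once.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines shaped like the excerpt above, e.g.:
# [FWD 11164 3943094272]@gw-fwi01[13 May 11:39:55] logbuf_write: writes logs ...
LINE_RE = re.compile(
    r"^\[FWD \d+ \d+\]@(?P<host>[^\[]+)"
    r"\[(?P<ts>\d{1,2} \w{3} \d{1,2}:\d{2}:\d{2})\] (?P<msg>.*)$"
)

# The message that opens each overflow sequence in the excerpt.
OVERFLOW_MSG = "logbuf_write: writes logs to local disk because overflow"


def overflow_events(lines, year=2024):
    """Yield (host, datetime) for every overflow line.

    `year` is an assumption: the debug timestamps do not include one.
    Month names are parsed with %b, so an English/C locale is assumed.
    """
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or OVERFLOW_MSG not in match.group("msg"):
            continue
        ts = datetime.strptime(
            f"{year} {match.group('ts')}", "%Y %d %b %H:%M:%S"
        )
        yield match.group("host"), ts


def count_per_hour(lines, year=2024):
    """Return a Counter keyed by (host, 'DD Mon HH:00')."""
    counts = Counter()
    for host, ts in overflow_events(lines, year):
        counts[(host, ts.strftime("%d %b %H:00"))] += 1
    return counts
```

Feeding it the lines of fwd.elg from an affected gateway (for example `count_per_hour(open(path))`, where `path` is wherever the file was copied to) gives an hourly count that can be compared against backup windows, policy installs or traffic peaks on the virtual systems in the path.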
Now the problem is finding the root cause, as it seems to happen for only 6/20 gateways and "buffer overflow" doesn't say much about where the issue is.
Why do 6 gateways, most of which have very little traffic and CPU usage and the same amount of logs as other gateways, log locally and only sporadically send logs to the Management?
According to "cpstat mg -f log_server" it's been weeks/months since the log connections were established from the gateways, so it doesn't look like the network is resetting them.
Would this indicate capacity issues on the log server or the gateways, or congestion in the network, and are there any ways to identify the root cause? "buffer overflow" doesn't tell us much about where to investigate further 🙂
All 6 of these gateways have to traverse 2 virtual systems to reach the Log Server (which runs on the Management). But we have several other gateways that log the same way and work fine, with an almost identical amount of logging (52/60 Log Handling Rate for these respective firewalls).
I would suspect an issue with the Log Server, but then it "should" potentially affect all gateways and not always these 6, unless there is some sort of priority.
TAC points to the version and wants us to upgrade to the latest R81.20, and suspects the issue is on the gateway side (currently focusing on just one of the 6), but we have gateways on the same version that work and never have log latency.
Also, the fact that we had this issue for a while, that it automagically went away for weeks and then came back for weeks, makes me suspect the "network" somehow, but then it should affect more gateways than just these 6. We also aren't seeing any capacity/traffic issues on the virtual systems.
CCSM / CCSE / CCVS / CCTE