Re: Correlation between severe packet loss on Sync and Active VS showing a lot of "First packet isn't Syn"

Vincent_Croes · ‎2021-11-30

Hi CheckMates

I was wondering if anyone had the following experience:

The Sync network is experiencing severe packet loss. The commands 'cphaprob syncstat' and 'fw ctl pstat' confirm this, and fwk.elg reports: "State synchronization is in risk. Please examine your synchronization network to avoid further problems !"
- The cause is now known: incorrect Cisco storm control parameters on the Sync switchport.
- On the Active VS, we see an increase in connections over time.
- On the Standby VS, we see a huge increase in connections over time, completely out of step with the Active VS. Given the issues on the Sync network, I can imagine the states not being properly synced or cleared.

Now the most important part: on the Active VS, we see a whole lot of connections being dropped due to "First packet isn't Syn". The TCP flags on the dropped packets are ACK, RST-ACK, and FIN-ACK.

After the incident on the Sync network was resolved, we saw a huge decrease in these kinds of connection drops. Does anyone know the correlation between an issue on the Sync network and a huge increase in "First packet isn't Syn" drops? In my eyes, these drops mean that the connection has already been erased from the state table. The timeout for the protocol in use should be 7200 seconds. All of these connections were allowed at some point, yet they were dropped with the "First packet isn't Syn" message long before reaching a lifespan of 7200 seconds.

Anyone have some pointers?

PhoneBoy · ‎2021-12-05

You sure there's not packet loss on other interfaces as well?
I would suspect the issues are related to issues with the sync network and maybe only receiving partial data.
That might cause some connections to get removed from the primary.

Vincent_Croes · ‎2021-12-05

No packet loss was observed on other interfaces. I'll include a nice graph, which shows the amount of 'first packet isn't syn' drop messages on the environment. You can clearly see when the issues on the Sync interfaces started:

_Val_ · ‎2021-12-06

I believe sync issues lead to VS flapping back and forth, which means you have random packets trying to get through a "wrong" VS, causing all described.

Vincent_Croes · ‎2021-12-06

Is there anything in the logs I can search for to confirm this? ClusterXL messages are not showing up at this given timeframe.

_Val_ · ‎2021-12-06

It is either flaps or another networking issue; the Standby VS should not get packets sent to it. Are you using VMACs?

Vincent_Croes · ‎2021-12-06

No, not using VMAC. I was thinking more about why the Active VS might clear out sessions faster than expected, resulting in "First packet isn't Syn" (FPIS) drops. I can see the session being accepted first and then, after a random amount of time, dropped due to FPIS. No aggressive aging was active on the Active VS during the incident.
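One way to put numbers on "accepted first, then dropped after a random amount of time" is to pair each accept with a later FPIS drop on the same connection and compare the gap with the 7200-second timeout. The following is only a minimal Python sketch: the record fields (`time`, `action`, `src`, `sport`, `dst`, `dport`, `reason`) and the sample values are invented for illustration and do not reflect a real log export schema.

```python
from datetime import datetime

TCP_SESSION_TIMEOUT = 7200  # seconds, the timeout quoted in this thread
FPIS = "First packet isn't Syn"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Invented sample records; a real log export will use different field names.
logs = [
    {"time": "2021-11-30 09:00:00", "action": "accept", "src": "10.0.0.5",
     "sport": 40001, "dst": "10.1.0.9", "dport": 443, "reason": ""},
    {"time": "2021-11-30 09:00:00", "action": "accept", "src": "10.0.0.6",
     "sport": 40002, "dst": "10.1.0.9", "dport": 443, "reason": ""},
    {"time": "2021-11-30 09:25:00", "action": "drop", "src": "10.0.0.5",
     "sport": 40001, "dst": "10.1.0.9", "dport": 443, "reason": FPIS},
    {"time": "2021-11-30 11:30:00", "action": "drop", "src": "10.0.0.6",
     "sport": 40002, "dst": "10.1.0.9", "dport": 443, "reason": FPIS},
]


def conn_key(rec):
    """Direction-independent key so reply packets match the original accept."""
    a = (rec["src"], rec["sport"])
    b = (rec["dst"], rec["dport"])
    return (min(a, b), max(a, b))


def early_fpis_drops(records, timeout=TCP_SESSION_TIMEOUT):
    """Return (key, age_seconds) for FPIS drops logged less than `timeout` after the accept."""
    accepted = {}
    early = []
    ordered = sorted(records, key=lambda r: datetime.strptime(r["time"], TIME_FORMAT))
    for rec in ordered:
        ts = datetime.strptime(rec["time"], TIME_FORMAT)
        key = conn_key(rec)
        if rec["action"] == "accept":
            accepted[key] = ts
        elif rec["action"] == "drop" and FPIS in rec["reason"]:
            start = accepted.get(key)
            if start is not None:
                age = (ts - start).total_seconds()
                if age < timeout:
                    early.append((key, age))
    return early


for key, age in early_fpis_drops(logs):
    print(key, f"dropped {age:.0f}s after accept")
```

The timeout is an idle timer, but a connection that is younger than 7200 seconds cannot have been idle for 7200 seconds, so any hit here was removed from the table by something other than the normal timeout.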

_Val_ · ‎2021-12-06

That theory does not explain why the Standby member experiences growth in connections.

Vincent_Croes · ‎2021-12-06

I assume that if the Active instance has severe packet loss towards the Standby one, the Standby one has outdated info and is not updating its connections table as it should. I don't know the inner workings of ClusterXL, so I can only assume. But seeing that the mechanism to keep both tables in sync was impacted, I can imagine the Standby one not having an accurate representation of the situation.

_Val_ · ‎2021-12-06

Sorry, I do not understand this argument at all. The drops you are having, they are not on the sync network, correct? If they are on the production network, then they do not have anything to do with a sync discrepancy on the standby.

Vincent_Croes · ‎2021-12-06

The "First packet isn't Syn" drops are indeed not on the Sync network, correct. The fact remains, when the issue on the Sync interface was solved, the drops disappeared on the production network. So I am looking for any correlation between them.
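One way to put numbers on that correlation is to bucket the drop logs per hour and compare the rate inside and outside the Sync incident window. This is a rough Python sketch only; the timestamps and the incident window below are invented for illustration and are not taken from the actual logs.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Invented timestamps of "First packet isn't Syn" drop logs.
drop_times = [
    "2021-11-30 08:15:00",
    "2021-11-30 10:05:00", "2021-11-30 10:20:00", "2021-11-30 10:50:00",
    "2021-11-30 11:10:00", "2021-11-30 11:30:00", "2021-11-30 11:45:00",
    "2021-11-30 13:00:00",
]

# Invented incident window on the Sync network.
incident_start = datetime(2021, 11, 30, 10, 0)
incident_end = datetime(2021, 11, 30, 12, 0)


def drops_per_hour(timestamps):
    """Count drops per hourly bucket, keyed by the start of the hour."""
    counts = Counter()
    for ts in timestamps:
        hour = datetime.strptime(ts, TIME_FORMAT).replace(minute=0, second=0)
        counts[hour] += 1
    return counts


def mean_rate(counts, start, end):
    """Average drops per hour over the hourly buckets in [start, end)."""
    hours = (end - start).total_seconds() / 3600
    total = sum(n for hour, n in counts.items() if start <= hour < end)
    return total / hours if hours > 0 else 0.0


counts = drops_per_hour(drop_times)
before = mean_rate(counts, datetime(2021, 11, 30, 8, 0), incident_start)
during = mean_rate(counts, incident_start, incident_end)
print(f"before: {before:.1f}/h, during: {during:.1f}/h")
```

A clear jump in the hourly rate that lines up with the start and end of the Sync incident supports the correlation, although it does not by itself explain the mechanism.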

Timothy_Hall · ‎2021-12-09

OK, I had to really think about this one for a while.

What does the CPU utilization look like on the Active VS when the problems start? Sync network issues will cause excessive CPU utilization on the Active, and the Cisco storm control could certainly cause this when policing kicks in. Do you have Aggressive aging enabled under Inspection Settings?

Sounds to me like the Cisco sync network policing started to cause sync issues that incurred extremely high CPU load on your Active VS once your sync traffic reached a certain level. As the VS gets further and further behind trying to process traffic and simultaneously catch the sync network back up, the connections table (and/or memory utilization) hits the default 80% threshold and Aggressive Aging kicks in. It then removes connections from the state table earlier and earlier than the default 7200 seconds, thus causing the out of state messages. While the connections table continues to be brutally pruned back by Aggressive Aging on the Active, state sync is so far behind that the Standby member still has a large number of connections showing because it has not received the delete notifications from the Active.

Once everything catches back up, normalcy returns. Does this scenario sound plausible?
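The scenario above can be illustrated with a toy model. This is only a sketch under stated assumptions and not how ClusterXL is implemented: the table limit, the connection rates, and the deterministic "only every Nth delete notification arrives" loss pattern are all made up, and only the 80% threshold comes from the description above.

```python
AGGRESSIVE_AGING_THRESHOLD_PCT = 80  # default threshold mentioned above


def aggressive_aging_active(used, limit, threshold_pct=AGGRESSIVE_AGING_THRESHOLD_PCT):
    """Assumption: aging engages when usage is at or above the threshold."""
    return used * 100 >= threshold_pct * limit


def simulate(steps, new_per_step, limit, prune_per_step, deliver_every):
    """Toy model of the Active and Standby connection counts over time.

    The Active adds connections each step and prunes early once the threshold
    is reached. Only every `deliver_every`-th delete notification reaches the
    Standby (a stand-in for a lossy Sync network).
    """
    active = standby = 0
    deletes_sent = 0
    history = []
    for _ in range(steps):
        active += new_per_step
        standby += new_per_step  # simplification: new-connection updates always arrive
        if aggressive_aging_active(active, limit):
            pruned = min(prune_per_step, active)
            active -= pruned
            for _ in range(pruned):
                deletes_sent += 1
                if deletes_sent % deliver_every == 0:
                    standby -= 1  # this delete notification got through
        history.append((active, standby))
    return history


history = simulate(steps=10, new_per_step=100, limit=1000,
                   prune_per_step=150, deliver_every=10)
for step, (active, standby) in enumerate(history, start=1):
    print(f"step {step:2d}: active={active:4d} standby={standby:4d}")
```

With these made-up numbers the Active ends at 700 entries while the Standby still shows 970, which is the same shape as the reported symptom: the Active prunes connections early while the Standby keeps growing.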


Vincent_Croes · ‎2021-12-09

We most certainly have Aggressive Aging enabled in our configuration. The CPU utilization looks like this (capture from the entire day, you can see when it ramps up a little bit):

I would understand the "First packet isn't syn" behavior if I could find an "Aggressive aging" log in SmartLog for the Active VS, but I can't. Your scenario makes a lot of sense, but I cannot find proof of it.

Note: something I did notice during the incident: when capturing traffic to a pcap file, I saw that the Active VS was sending out 10x as much data as the Standby one. Seeing as the Active one was policed, I suspect this is a coping mechanism just to get these messages to the Standby one.

Vincent_Croes · ‎2021-12-06

Also, I checked whether the "Origin" in the logs changes rapidly, but it does not. That is why I ruled out a possible VS flap.
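The check described above can be sketched roughly as follows. This is a minimal Python example that assumes the "Origin" values have been pulled from the logs in chronological order; the member names are made up.

```python
def count_origin_changes(origins):
    """Count how often consecutive log entries come from a different origin."""
    changes = 0
    for prev, cur in zip(origins, origins[1:]):
        if prev != cur:
            changes += 1
    return changes


# Invented examples: one stable origin versus an origin that alternates.
stable = ["gw-a"] * 6
flapping = ["gw-a", "gw-b", "gw-a", "gw-b", "gw-a", "gw-b"]

print("stable:", count_origin_changes(stable))
print("flapping:", count_origin_changes(flapping))
```

A count near zero over the incident window is consistent with no failovers taking place, while a high count would point towards the VS flapping between members.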
