Ryan_Ryan
Advisor

Out of state drops

Hi all,

 

We have a fairly new deployment of Maestro: 2x Orchestrators, 3x 6600s, a single site, and one security group.

We are seeing a mix of consistent and intermittent failures. Examples:

- From one host to another, all protocols go out via one SGM and come back via another, and everything is dropped 100% of the time. The same issue occurs for at least another 4 destination IP addresses.

- As above, but from the same source host to another host in the same destination subnet, everything goes via the same SGM and all of it works. The same good result for several other destination IP addresses.

- From an external IP to an internal IP, sometimes we get out-of-state drops due to a different SGM being used, and sometimes it works OK.

R81.10 Take 87

We are using the default per-port auto-topology distribution, with layer 4 distribution disabled. All ports are correctly identified as internal or external. The only conclusion I can draw at this stage is that it always seems to be the same SGM that drops the traffic; whether that is relevant or not, I am not sure.

We have done dxl calcs, and the SGM that is dropping the traffic is not in the path. I have tried applying asg_excp_conf, but that didn't work; we still saw traffic hitting the incorrect SGM.
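For reference, the distribution settings were checked from the security group along these lines (read-only checks only; asg stat is just to confirm all members are up):

show distribution configuration    # clish on the SG: distribution mode and L4 distribution state
asg stat                           # expert mode: overview of SGM states in the security group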

 

Of course we do have an SR open for it, but two weeks on we have found more problems than solutions, and the customer is getting a bit concerned.

Any help would be greatly appreciated!

 

 

19 Replies
Daniel_Szydelko
Collaborator

Hi,

Can you provide more details:

- Is this a regular Security Group or VSX?

- Is NAT used for the problematic connections?

- What does asg search show for a problematic connection?

- What syntax was used for asg_excp_conf?

- What happened (any changes) to introduce the issue?

- Which SGM is dropping the connection? Have you tried to see what happens when you administratively bring that SGM down? If it starts working fine, then do the same with another SGM (excluding the problematic one).

- What kind of drops do you observe? Can you provide a log?

BR

Daniel.

Ryan_Ryan
Advisor

Hi there,

 

- Regular security group, non-VSX

- No NAT

- asg search:

Lookup for conn: <192.168.10.10, *, 10.10.2.10, *, *>, may take few seconds...
<192.168.10.10, 5, 10.10.2.10, 0, icmp> -> 1_01, [O]1_03
Legend:
O - Owner

Lookup for conn: <192.168.10.10, *, 10.10.2.10, *, *>, may take few seconds...
<192.168.10.10, 63715, 10.10.2.10, 3389, tcp> -> 1_01, [O]1_03
Legend:
O - Owner

 

- Exception tried like so:

asg_excp_conf set 2 0 0 10.10.2.10 0
asg_excp_conf set 1 10.10.2.10 0 0 0

 

- These flows were something new, so they have never worked; the intermittent flow has been happening off and on since Maestro was put in.

- I am keen to stop member 2, but I am not sure what impact that will cause, as there is production traffic going through it. Will we drop all sessions, or will they statefully fail over to another SGM?

 

Drops are seen like so:

g_fw ctl zdebug + drop 


[1_02]@;731232026;[vs_0];[tid_3];[fw4_3];fw_log_drop_ex: Packet proto=1 10.10.2.10:0 -> 192.168.10.10:5 dropped by fw_first_packet_state_checks Reason: ICMP reply does not match a previous request;

[1_02]@;731235835;[vs_0];[tid_2];[fw4_2];fw_log_drop_ex: Packet proto=6 10.10.2.10:3389 -> 192.168.10.10:63649 dropped by fw_first_packet_state_checks Reason: First packet isn't SYN;
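For completeness, this is roughly how I have been cross-checking the expected placement against where the connection actually lands, and narrowing the debug to just the affected pair (sketch only; I am assuming dxl calc takes the source IP and then the destination IP, so adjust to your environment):

dxl calc 192.168.10.10 10.10.2.10        # expected SGM according to the distribution matrix
asg search 192.168.10.10 10.10.2.10      # owner/backup actually holding the connection (output above)
g_fw ctl zdebug + drop | grep -E '10\.10\.2\.10|192\.168\.10\.10'    # drops for this pair only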

Daniel_Szydelko
Collaborator

Hi,

Really strange. This connection should be handled by SGM 1_03. It is also strange that asg_excp_conf didn't work; this is another signal that something is wrong with asg_excp_conf (I have seen the same from another environment running R81.10 JHF Take 79).

You can stop SGM 1_02, but I would suggest doing it in a maintenance window anyway.

BR

Daniel.

Ryan_Ryan
Advisor

Tried that. As soon as SGM 2 was stopped everything started working, so I rebooted it; everything worked until it came back and joined the cluster, at which point it stopped again. I proceeded to reboot 3 and then 1 and had the same effect: any time only two members were active the traffic worked, and as soon as the third came back into a load-bearing state the traffic dropped again. So frustrating!

Timothy_Hall
Champion

It sounds like, for whatever reason, the MHOs and the SGMs are not using the same distribution algorithm. The MHO uses the distribution algorithm to pick a downlink port leading to the SGM that will handle the packet; the connection owner SGM then uses the (supposedly) same distribution algorithm to determine who the backup should be if it fails, and HyperSyncs the connection to that backup.

With 2 members in the SG the backup is always just the other SGM, but once you get to 3 SG members it sounds like the MHO and SGMs are not always in agreement about what the distribution should be. When the MHO directs a packet to an SGM that was not HyperSynced to by the connection owner, the out-of-state drop happens. The fact that asg_excp_conf didn't work is surprising, so I'd suspect the MHO of being in the wrong here more than the SGMs.
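To make that concrete, here is a toy illustration (not a Check Point tool and not the real Maestro hash; it just shows why two components hashing the same flow against different member views can pick different SGMs, which is exactly when the out-of-state drop appears):

#!/bin/bash
# Toy model: both sides "hash" the src/dst pair, but one believes there are 3
# usable members while the other effectively works from a different view.
ip2int() { local IFS=.; set -- $1; echo $(( ($1<<24) | ($2<<16) | ($3<<8) | $4 )); }
src=192.168.10.10; dst=10.10.2.10
h=$(( $(ip2int $src) ^ $(ip2int $dst) ))
echo "MHO view (3 members): send flow to SGM 1_0$(( h % 3 + 1 ))"
echo "Owner's view (different matrix, 2 members): sync backup to SGM 1_0$(( h % 2 + 1 ))"
# If those two answers differ, packets land on an SGM that holds no state for
# the connection, and the firewall logs it as out of state.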

Ryan_Ryan
Advisor

That does make perfect sense. I did try rebooting the MHO as well, but that had no effect on the traffic.

 

I can find the distribution mode on the SG using 'show distribution configuration'. This command doesn't seem to apply to the MHO; is there a way to see what the MHO thinks the mode is?

Timothy_Hall
Champion

As far as I know, the MHO is supposed to inherit the correct distribution settings from the SMO Master. You can try running distutil update on the SMO Master; it is not really documented, but it appears to re-sync the SGMs and the MHO as far as the distribution algorithm is concerned.

The following SK seems to match your problem, but it is for R80.20SP:

sk164712: Traffic distribution inconsistency between Orchestrators and Security Gateways

Try the following commands from the SMO Master, which will run some diagnostics to verify the distribution:

clish: show distribution verification verbose

expert mode: distutil verify -v

 

 

Ryan_Ryan
Advisor

Thanks, I hadn't found sk164712 before in my searches.

show distribution verification verbose: this matches across all three members, and all checks show as passed.

distutil update: this ran and completed immediately, gave no output, and made no change to the traffic behaviour; still broken.

 

uplink_trunk_mode/state shows disabled

cat /etc/mlx_conf.json | grep sym_l4 | head -n1: interestingly, this does show one result, but I couldn't work out the naming convention to figure out which interface it refers to.
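To work out which interface that entry belongs to, I'll print some surrounding context from the JSON (assuming the flag sits inside a per-port block, the neighbouring lines should name the port):

grep -n -B5 -A5 sym_l4 /etc/mlx_conf.json    # show the matching entry together with nearby lines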

 

We are using a bond interface which is VLAN tagged. In the orchestrator topology the physical NICs are added to the SG; neither the bond nor the VLAN SVIs show up in the available list to add. All the traffic flowing across those interfaces is working OK though.

 

 

 

Lari_Luoma
Ambassador

Hi Ryan,

If you have no NAT, you should go to manual-general distribution mode. Auto-topology works best when there is NAT.

Ryan_Ryan
Advisor

Thanks for the response. We do have a few NAT rules, i.e. hide NAT out to the internet etc., and a few static NATs, but the broken traffic doesn't have NAT enabled.

Should we be OK to try general mode?

 

Lari_Luoma
Ambassador

I saw you mentioned earlier that you don't have NAT. If you do have even a few hide-NAT rules, auto-topology is the best way to go.
Is only a certain traffic flow impacted, or is this a more general problem?
Is this traffic going from an internal interface to another internal interface, or to an external one?

Have you tested with only one SGM in the security group, or at least shutting down #2?

Ryan_Ryan
Advisor

OK, sorry, I didn't articulate that well. The flow in question has no NAT, but there is NAT in use elsewhere. This issue affects only several source-to-destination pairings, and it affects them 100% of the time; I have seen it happen on one other flow, but that self-resolved and didn't break again. The broken flows are part of a subnet where other flows work: RDP to .10 works, but .11 has this issue.

The traffic is from an internal interface to an external one.

 

If I shut down any of the 3 SGMs the flows immediately start working, and they start failing again once that SGM is re-introduced.

 

 

 

nealsrkr
Explorer

Hi Ryan, we were experiencing a similar issue and were recommended to install JHF Take 95. Let us know what TAC recommends for you. Thanks.

Lari_Luoma
Ambassador

Thanks! This is definitely a correction-related issue. You see the traffic on a third member because it's being corrected. The question is why it's dropped, as the correction mechanism should take care of it and send it to the owner of the flow for processing. The ones you see in asg search are the actual owner and backup SGMs.

This traffic is corrected because it doesn't have NAT enabled and, in auto-topology mode, is going from internal to external. I know you already tried asg_excp_conf to move it to the SMO, but you said it didn't work. Did you get any error messages, or did it take the command fine but nothing happened?

I'm skeptical about recommending general mode because there is a risk that it could break the NATted connections. General mode uses only part of the hide-NAT ports, and you might end up running out of available ports for NAT. It would also most likely introduce correction for those connections, and if we have a correction problem, they might end up not working.
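As a rough back-of-the-envelope (the numbers are illustrative, not your actual pool size): if each member can only use a share of the hide-NAT high-port range, the usable pool per SGM shrinks quickly:

ports=50000; members=3                                            # assume ~50,000 usable high ports per hide-NAT IP
echo "approx. hide-NAT ports per SGM: $(( ports / members ))"     # ~16,666 each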

I know we had some issues (not related to correction, though) with JHF Take 87. If you can, I would recommend upgrading to the latest recommended jumbo. Before I give any further recommendations, let me talk internally with people and I will get back to you.

Ryan_Ryan
Advisor

Thanks Lari, based on some other comments it seems a JHF upgrade is the way to go.

 

As for the asg_excp_conf exception, it took the command and said it was successfully applied across all 3 SGMs; however, nothing seemed to happen, and I could still see drops by member 2 in zdebug. I will plan the upgrade and also do some more playing with the exception; maybe I was just using incorrect syntax.

Ryan_Ryan
Advisor

So the update is: we applied the JHF across all the devices, which didn't make any difference to the issue. We then switched to manual-general mode, and that fixed it straight away.

 

On a side note, whilst upgrading, when the second orchestrator went down for its reboot we lost all traffic in and out for the duration of the reboot (over a 5-minute outage). I was expecting it to be fairly hitless; is there any particular order they should be done in? I did this order: manager 2, manager 1, orchestrator 2, orchestrator 1, gateway 3, 2, then 1.

Lari_Luoma
Ambassador

Are all your interfaces bonded? If not, that explains the outage. With bonded interfaces, rebooting an orchestrator should be seamless. When we tested it a week or so ago, we lost maybe one ping packet in the process.

Ryan_Ryan
Advisor

Yes, we have bonded interfaces. Oh well, no big deal. Thanks for the help!

Timothy_Hall
Champion

Yeah, you shouldn't have had a 5-minute outage while rebooting one Orchestrator. Something is not correct with the bonding setup for your uplinks, and your upstream switches probably did not detect that the Orchestrator was down. Make sure you are using LACP and that it is seen as working properly on your uplink devices.
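A couple of quick checks (the bond name below is just an example; run these wherever your uplink bond lives, and also verify the LACP partner state on the switch side):

show bonding groups                                        # Gaia clish: bonds, members and their state
cat /proc/net/bonding/bond1 | grep -iE 'bonding mode|mii status|aggregator|partner'    # standard Linux bonding/LACP status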
