Hello Checkmates,
Recently we started the process of migrating to Maestro in our DCs.
For that we decided to go with 2 x MHO-140s and either 3 x SG9300 or 3 x SG9400, depending on the size of the DC.
Racked everything in February, and since then we're battling with some weird things.
Starting with the connections and setup: we have a single site with 2 MHOs, and each SG appliance is connected to both.
Ports 1 and 2 of each MHO are connected to our ACI for SG management (VLAN 168), and the uplinks are on ports 5, 7, 8, 11 and 12. For the SGs (downlinks) we use ports 25 (changed to downlink!), 27 and 29.
Now, for the installation process, we re-imaged the appliances with R82 T777 (back in February) and configured the Management, LOM and other standard settings.
We built an SG (VSNext) with only one appliance, installed JHF Take 60 (again, back in February) and added it to Management, a brand-new VM with R82.
We added a 2nd node to the SG, and after everything synchronized we created a VS. We added that new VS to Management and all was fine.
When we wanted to push an updated policy to the newly created SG (VS0), we got several failures due to different SIC or communication issues (as per the examples below). And the FUN begins 🙂 .
From the investigation we started with the Support engineer and our Professional Services guys, we noticed that, for whatever reason, when we run SIC verification from Management, 2 or 3 times out of 5 we get different SIC errors like:
But when we check on the SG directly, we can see that SIC is OK.
That error shows only when we have 2 or more nodes in the Security Group, either with VSNext or without.
We also applied the certificate hotfix for JHF Take 60 with no change, and we also tried JHF Take 73 without success; the same SIC or policy push errors were seen.
We've verified that the SIC certificate is valid and that Management knows it, and we re-did SIC at least 10 times with no change. As long as we have a 2nd member in the Security Group, the SIC error starts to show while validating or pushing policies.
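For reference, a minimal sketch of the checks we mean (standard Check Point CLI; adjust names for your environment):

# On the SGM, in expert mode: show the gateway-side SIC trust state
cp_conf sic state
# On the Management Server: list the valid SIC certificates the internal CA knows about
cpca_client lscert -stat Valid -kind SIC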
Last Saturday we re-imaged all the appliances in the Singapore DC to R82 T779 and applied JHF Take 91; same behavior.
While waiting for the Support engineer to check with the BU what can be wrong, I want to ask whether any of you who work with Maestro have seen this behavior, and if so, how you handled it.
Thank you,
PS: in order to exclude the dedicated Management interface as a possible cause, we shut down 3 ports from the MHO side, so right now we are working with a single interface for the management of the SGs. The same SIC errors are seen...
@Lari_Luoma , @Anatoly Any comments?
How is your magg bond configured, and does it match what's on the switches? Are the switches doing any sort of MAC learning from outbound packets? If so this would need to be disabled.
Either way, whether we go with a normal Security Group or with VSNext, the management interface is configured active/backup.
On VSNext it's a MAGG underneath, while on VS0 it's a WRP interface.
On ACI side we have all ports set as access, no MAC filtering.
[Global] ALVS-SGFW022-s01-01:0> show interface magg1
Statistics:
SD-WAN: Not Configured
1_02:
Statistics:
SD-WAN: Not Configured
[Global] ALVS-SGFW022-s01-01:0>

[Global] ALVS-SGFW022-s01-01:0> show interfaces all
Statistics:
SD-WAN: Not Configured
Leading to Virtual Switch: mgmt-switch (ID 500)
Thank you,
Good day!
Do you see any regular packet drop when you ping from the Management Server to the SG management IP? Could it be a general connectivity issue? I see 6.5% RX errors on the magg interfaces, which doesn't look good.
If we see significant packet loss from the Management Server to the SG, then it is expected that SIC verification and policy push fail. We need to narrow down the problem area from the bottom up.
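A minimal sketch of that bottom-up check, using the IP and interface names from this thread:

# From the Management Server: look for loss toward the SG management IP
ping -c 100 10.18.169.41
# On each SGM, in expert mode: check the error counters on the magg bond
ifconfig magg1 | grep -i error
netstat -i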
hello @Gennady ,
No, we do not have packet loss between Management and the SGs.
And even if we had packet loss, how could we explain that for a VS created on top of the Security Group we get 5 out of 5 successful SIC validations and no errors when pushing the policy?
Thank you,
In addition,
Did you have a chance to capture traffic on the SG (on all SGMs at the same time) at the moment when you verify SIC from the Management Server?
The fact that the problem appears only when you add more than 1 SGM to the SG points to some distribution problem. It should not affect the magg interface unless packets from the Management Server in fact arrive on some data interface, or the SMO role is flapping (which is very unlikely).
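A sketch of such a capture (SIC runs over TCP port 18191; g_tcpdump, where available on Scalable Platforms, runs the same capture on every SGM at once):

# capture SIC traffic on all SGMs simultaneously
g_tcpdump -nni magg1 host 10.18.169.41 and tcp port 18191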
Yes, we did packet captures on Management and on both SG members and shared them with Support.
Indeed it points to a distribution problem, but the distribution setup is for the data path and not for management, per my understanding.
The SMO role was not flapping, as we checked, and it's always the first member that we used to build the SG.
Still, while doing tcpdumps on both members as we were checking SIC, we saw that almost every time SIC failed, the other member was showing traffic at some point. Why it is shifting, I can't say.
Thank you,
MAC learning would explain the shifting. If you do the tcpdumps with the MACs shown (-e) do you see the inbound MACs change when SIC stops working and you see traffic on the other SGM?
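For example, a sketch with the interface and IP from this thread:

# -e prints the link-layer header, so the source MAC of every inbound packet is visible
tcpdump -nnei magg1 host 10.18.169.41 and tcp port 18191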
We'll run the captures again, watch the MACs, and come back.
Still, I repeat: if we are using a single port between the MHOs and ACI for management, then we have a single inbound MAC, so?
Thank you,
No, it's a MAC per SGM, not per MHO/port. If it works stably with one SGM, then it's not MAC flapping. I'm not sure whether on VSNext each VS also has its own MAC address there, but that might need checking too.
I have the same understanding, and in order to exclude VSNext, we wiped it and created a standard Security Group as well.
And with that we see the same SIC behavior.
Ty,
Not MAC filtering, MAC learning. If the ACI is learning MAC addresses from outbound packets, it's going to be constantly changing the MAC table for the magg interfaces, and packets are going to get lost, leading to SIC failures. On the SG management port, each SGM uses its own MAC address, so MAC learning can break things.
As far as I know, we don't do any MAC learning/filtering on the ACI side.
But if we did, we should see alerts on the ACI side for MAC flapping on the leaf ports where we have the management connected, those first 2 ports from each MHO.
And since we have only one port active right now, there is no MAC flapping, unless the Management IP (10.18.169.41 in the Singapore case) jumps between the 2 members for whatever reason...
I'll double-check and come back.
Thank you,
As promised, we checked the ACI side for MAC flapping and we don't have any alerts for the ports that we use for SG Management.
I do have Maestro configured. Unfortunately, it is on R81.20 and it is connected to regular Cisco Nexus. That is why I cannot be more helpful for your problem investigation.
However, please, take a look at this SK. It may give you a lead.
sk168181 - Communication problems with ClusterXL clusters connected to Cisco ACI
From a general Maestro standpoint: magg interfaces are excluded from distribution, and this should not be a problem unless management traffic is somehow sent/received via a data interface. Usually that happens because of some routing mistake. I am sure that it was already checked.
If you would like to look into SMO state in more details, then you can try this zdebug command:
fw ctl zdebug -T -d 'SMO,smo' -m cluster + conf
It is very lightweight and shows changes in the SMO role.
"-d" puts a filter in the kernel to match the string "SMO" or the string "smo"; otherwise the cluster module with the conf flag returns too much data.
ty @Gennady , I will check the SMO state and come back.
As for the ACI and the SK you provided, that makes sense, but we should see alerts on the ACI side if an IP changes its MAC.
As you can see, we have the same MAC for the SG IP:
ty,
PS: I ran the zdebug on both nodes while checking SIC and getting failures, and I did not get any messages, so during the SIC check the SMO was not changing, I guess...
If there are no messages in zdebug during the problem replication, then we can rule out an SMO flap.
An example of SMO role change is below:
Just to clarify for myself, what would cause an SMO change,
other than the shutdown/reload of the active one?
And can I trigger it via the CLI, just to confirm?
Thank you,
Good day!
The easy way to trigger SMO role change is by brief shutdown of downlink port from MHO side.
Use "orch_stat" to find logical port number for downlink interface which leads to SMO. Then you can use tor_util on both MHO at the same time (you can try just one MHO at first for the test):
"tor_util set_port_admin_state <logical_port_number> down && sleep 1 && tor_util set_port_admin_state <logical_port_number> up"
Example:
"tor_util set_port_admin_state 65 down && sleep 1 && tor_util set_port_admin_state 65 up"
The command above will disable the downlink port for 1 second and then bring it back up. This results in the SMO role changing from 1_01 to 1_02 for ~35 seconds, and then it goes back from 1_02 to 1_01. You will see both events in zdebug. The zdebug command should be executed on 1_02, because you will break SSH to 1_01.
Be careful if the test is in production, because an SMO role flap disrupts all SMO local connections:
Hi,
I am not an ACI expert and not sure how many ACI features you are using, but I know ACI is somewhat different from a traditional switched network. Is there a possibility to connect the MAGG to a traditional (non-ACI) switch to see what happens?
Maybe you can rule out ACI.
I don't think it is relevant here, but did you configure a Primary Interface in the MAGG bond? I have seen strange things when creating an Active/Backup bond without a Primary Interface.
Martijn
hello @Martijn ,
We don't have anything fancy on this VLAN (VLAN 168, where we have the SG management assigned).
As for the MAGG creation, I did not do anything in the CLI (per the documentation you set the interface preference in the CLI); we did everything in Gaia.
I just set the primary interface on magg1 to eth1-Mgmt1, and the SIC error is still there 😫.
[Global] ALVS-SGFW022-s01-01:0> show bonding groups
1_02:
[Global] ALVS-SGFW022-s01-01:0>
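For the record, the bond state can also be checked from expert mode; a sketch, assuming the magg bond is exposed under /proc/net/bonding like a regular Linux bond:

# show which slave is configured as primary and which is currently active
grep -iE 'primary|currently active' /proc/net/bonding/magg1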
Thank you,
Please have a check through this SK and see if it helps: https://support.checkpoint.com/results/sk/sk168181
Hello mates,
Following the SK that I was pointed to, sk168181 (thank you @Gennady and @emmap ), it states that what we're facing is due to "Dataplane Endpoint Learning".
Symptoms
Cause
Cisco ACI has several proprietary features which cause problems with Check Point clusters.
Check Point Clustering
Most switches learn information about hosts connected to the network by listening to ARP requests and replies. Check Point clustering relies on this behavior to ensure traffic is always sent to the Active cluster member by using GARP.
Key points about ClusterXL Clusters:
Key points about Chassis and Maestro Security Groups:
Cisco Endpoint Learning Features
Cisco ACI does not behave like most switches, as explained above.
Solution
There are several possible solutions. Note: Each of the following is an independent option. They are not all required.
So, we checked our DC ACI and could confirm what the SK states.
Based on the information we're seeing in the Faults/Events, I was able to confirm that the management issue could be caused by "Dataplane Endpoint Learning",
as it shows that we indeed had 2 MACs pointing to the same IP address, 10.4.169.181: 0012.C110.00B5 and 0012.C120.00B5.
To confirm this, we also tested by temporarily moving the management ports outside ACI, and as soon as the DC guys helped us move the connections, we could validate SIC 5-7 times in a row. Previously we could not get 3 out of 5 validations.
As we can't keep the management outside of ACI, we discussed with the colleagues responsible for the DC ACI disabling "IP Data-plane Learning" for the specific IPs, to confirm whether that fixes our problem. Just a few minutes after the specific IPs were set to not do "IP Data-plane Learning", we could validate the SIC communication the same way we did while outside ACI.
In conclusion, if you have ACI implemented in your data centers, or wherever you install Maestro, make sure that the Security Group management is outside ACI if that is possible; if you can't, then add the management IPs specifically and disable "IP Data-plane Learning"!
The confusing part was that we would expect to see alerts on ACI, as with any other MAC or IP flapping, but that was not the case. On a Maestro Security Group, it seems that even though one SG appliance is selected as "primary" for management, traffic shows up with the MAC of each node in certain cases.
[Expert@ALVA-SGFW01-s01-01:0]# ifconfig | grep 00:12
wrp0      Link encap:Ethernet  HWaddr 00:12:C1:10:00:B4
[Expert@ALVA-SGFW01-s01-01:0]#

[Expert@ALVA-SGFW01-s01-02:0]# ifconfig | grep 00:12
wrp0      Link encap:Ethernet  HWaddr 00:12:C1:20:00:B4
[Expert@ALVA-SGFW01-s01-02:0]#

[Expert@ALVA-SGFW01-s01-03:0]# ifconfig | grep 00:12
wrp0      Link encap:Ethernet  HWaddr 00:12:C1:30:00:B4
[Expert@ALVA-SGFW01-s01-03:0]#

[Expert@ALVA-SGFW01-s01-03:0]# asg stat -i tasks
[Expert@ALVA-SGFW01-s01-03:0]#
Thank you everyone for the support; I learned something new in the last months.
Now we're moving forward with planning the migration from the 2 x 15K clusters to vFWs on Maestro.
Thank you and have a great week,