PJ_WONG
Contributor

Having more than two SGMs in an SG causes network interruption

Hi Checkmates,

Has anyone faced an issue like this?

We have 2 SGMs in an SG, and the network works fine. However, when we add two more SGMs, for 4 SGMs in total, we see intermittent network interruption: some websites cannot be reached or pinged.

The interruption stops immediately once any 2 of the 4 SGMs are set to the down state. One TAC engineer blamed the uplink switch, but there were no changes on the uplink; adding SGMs to the SG should only concern the downlink.

It shouldn't be asymmetric routing at the Maestro, since the Maestro should handle this properly by itself.

Any help is appreciated.

 

Thanks.
PJ

10 Replies
Danny
Champion

@Anatoly @Lari_Luoma can you please assist?

RS_Daniel
Advisor

Hello,

Just some sanity checks: when you added the two new SGMs, did you have auto-cloning enabled? Did you also check that the licenses on the new SGMs were fine? Are the new SGMs using the same SFP/DAC cable SKUs as the old ones for the downlink connections?

I had a similar scenario with 3 SGMs. After many tests and much troubleshooting, we rebooted all 3 SGMs one by one and the traffic started to work fine. In our case, the way to tell whether an SGM was "failing" was to ping from that specific SGM to any directly connected IP address: problematic SGMs failed, working SGMs succeeded.
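To make that per-SGM ping check repeatable, here is a minimal Python sketch that reads saved ping output for each SGM and lists the ones that lost packets. It assumes the standard Linux `ping` summary line ("N packets transmitted, M received, X% packet loss"). The member labels and sample text are made up for illustration, and how the output gets collected from each SGM is deliberately left out.

```python
import re
from typing import Dict, List, Optional

# Matches the summary line printed by Linux ping, with or without "+N errors".
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)


def packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet-loss percentage, or None if no summary line is found."""
    match = SUMMARY.search(ping_output)
    return float(match.group(3)) if match else None


def failing_members(outputs: Dict[str, str], threshold: float = 0.0) -> List[str]:
    """List members whose loss exceeds the threshold or whose output is unreadable."""
    failing = []
    for member, text in outputs.items():
        loss = packet_loss(text)
        if loss is None or loss > threshold:
            failing.append(member)
    return sorted(failing)


if __name__ == "__main__":
    # Hypothetical sample data: labels and text are placeholders, not real captures.
    sample = {
        "member_1": "4 packets transmitted, 4 received, 0% packet loss, time 3005ms",
        "member_3": "4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3000ms",
    }
    print(failing_members(sample))
```

Treating unreadable output as a failure is a deliberate choice here: a member that returns nothing is worth looking at too.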

Regards

emmap
Employee

Yeah, we need to know how the new SGMs are being added: are they patched up to the same JHF as the existing ones, are they all the same model, etc. Other than that, it should be investigated as if it were any regular gateway: on a problematic one, check connectivity, check policy, check zdebug drops, etc., and compare it with a working one to see what's different.
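The "compare a problematic one with a working one" step can be sketched as a small Python helper. It takes facts you have already gathered by hand for each SGM and reports only the fields that are not identical everywhere. The field names and values below are hypothetical placeholders; nothing here queries a gateway.

```python
from typing import Dict


def differing_fields(facts_by_member: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, object]]:
    """Return {field: {member: value}} for every field that differs between members.

    A field missing on one member shows up as None for that member.
    """
    fields = set()
    for facts in facts_by_member.values():
        fields.update(facts)

    diffs = {}
    for field in sorted(fields):
        values = {member: facts.get(field) for member, facts in facts_by_member.items()}
        if len(set(values.values())) > 1:
            diffs[field] = values
    return diffs


if __name__ == "__main__":
    # Hypothetical, hand-collected facts; labels and values are placeholders.
    facts = {
        "working_sgm": {"version": "R81.20", "jhf": "105", "model": "model-A"},
        "problem_sgm": {"version": "R81.20", "jhf": "99", "model": "model-A"},
    }
    for field, values in differing_fields(facts).items():
        print(field, values)
```

An empty result only means the fields you collected match, so the comparison is as good as the list of things you chose to record.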

PJ_WONG
Contributor

Hi,

Thanks for the comment. Auto-cloning was enabled and the licenses were installed on the SGMs. The DAC cables were bought together with the working SGMs, as we just migrated to Maestro.

We are able to replicate this issue in our lab; the version is R81.20 JHF 105.

May I know which version was used in your case?

Much appreciated.

Lari_Luoma
Ambassador

Hi!

Sorry for being late to the game here. Did you test only with ping? What distribution mode is enabled? I have seen problems where ICMP traffic is not corrected properly, causing ping replies to be dropped, while TCP traffic was working the whole time. In the logs you would see errors like "ICMP reply does not match an existing connection". Can you verify this?
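To illustrate why correction matters more as members are added, here is a toy Python model. It is NOT Check Point's distribution algorithm: it simply assumes each packet is assigned to a member by hashing its unordered IP pair, and that hide NAT makes the reply arrive addressed to a NAT IP instead of the client. All addresses are documentation-range placeholders. Under those assumptions the reply lands on a different member than the request for roughly 1 - 1/N of flows, so those flows depend on correction working.

```python
import hashlib


def member_for(ip_a: str, ip_b: str, members: int) -> int:
    """Toy distribution: hash the unordered IP pair and pick a member by modulo."""
    key = "|".join(sorted((ip_a, ip_b))).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % members


def mismatch_fraction(members: int, nat_ip: str = "203.0.113.1") -> float:
    """Fraction of synthetic flows whose reply hashes to a different member."""
    flows = [
        (f"10.0.0.{c}", f"198.51.100.{s}")
        for c in range(1, 51)
        for s in range(1, 41)
    ]
    mismatched = 0
    for client, server in flows:
        outbound = member_for(client, server, members)  # request: client -> server
        inbound = member_for(server, nat_ip, members)   # reply: server -> NAT IP
        if outbound != inbound:
            mismatched += 1
    return mismatched / len(flows)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "members:", round(mismatch_fraction(n), 2))
```

This says nothing about where the real fault is; it only shows that going from 2 to 4 members increases the share of traffic that relies on correction, which fits a problem that only shows up with more SGMs active.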

PJ_WONG
Contributor

Hi!

Sorry for the late reply; I was trying to replicate the issue in the lab. To answer your questions:

1) Did you test only with ping?

"No, we also tried browsing websites, and random websites would just keep loading."

2) What distribution mode is enabled?

"Auto-topology mode is used.

Another thing worth pointing out is that we noticed L4 distribution was enabled at the orchestrator (it should be configured as N/A by default). I have turned it off at the orchestrator and the SG, but I have not yet brought up the 3rd and 4th SGMs to test the network, so this might already have solved the issue."

3) In the logs you would see errors like "ICMP reply does not match an existing connection". Can you verify this?

"I searched the logs for this but did not find any similar entries."

In my lab, I am able to replicate this with an unmanaged switch. However, I have some reservations, as it is just a dumb switch.

When I set up the lab with a stacking switch from another brand, with the switch's link aggregation configured as dynamic LACP, the network was very stable.

I suspect the LACP configuration at the EU site. They use a Huawei switch with static LACP. sk179447 suggests specific settings for Check Point with Huawei. The description there is similar, except that our bond interface is working now.

While their production network is stable with "static LACP" and 2 active SGMs, I plan to test with dynamic LACP or the sk179447 recommended config in the next attempt.

I would appreciate any insight on this! I am not certain whether the bond interface or L4 distribution is affecting the traffic when 3 or more SGMs are active.


Thanks,
PJ

PJ_WONG
Contributor

Hi Lari,

When we run zdebug on three active SGMs, we see "ICMP reply does not match a previous request". May I know what you did in your previous case to rectify it?

image.png
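For anyone checking saved debug output for the same symptom, a minimal Python sketch follows. It counts lines containing either of the two messages quoted in this thread. The exact format of the surrounding lines is not assumed, so it uses plain case-insensitive substring matching; the file name in the usage part is a hypothetical placeholder.

```python
from collections import Counter
from typing import Iterable

# The two message texts quoted in this thread.
MESSAGES = (
    "ICMP reply does not match an existing connection",
    "ICMP reply does not match a previous request",
)


def count_icmp_mismatches(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each known ICMP mismatch message."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for message in MESSAGES:
            if message.lower() in lowered:
                counts[message] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_icmp.py saved_debug_output.txt  (hypothetical file name)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for message, hits in count_icmp_mismatches(handle).items():
            print(hits, message)
```

Running it once per SGM capture makes it easy to see whether the drops appear on every member or only on some.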

 

Best Regards,
PJ

Lari_Luoma
Ambassador

Last time I saw this, I changed the distribution mode to general and then back to auto-topology. I haven't seen the issue since. If your environment is in production and has hide NAT, I recommend doing this in a maintenance window.

This seems to be a software issue, and for a permanent fix it would be better to open a TAC ticket.

PJ_WONG
Contributor

Hi Lari,

Thanks for the recommendation! We have not tried this yet, and it is definitely worth trying in the next maintenance window.

We have already engaged TAC with R&D, but so far they do not seem to have pinpointed the issue.

Aside from this, we tried g_reboot but the issue was not resolved. May I know whether you tried g_reboot in your case? Just curious, as rebooting might help with a software issue.

Thanks,
PJ

Lari_Luoma
Ambassador

I think a reboot could solve it too.
