GUEYDON_Olivier
Contributor

SIC issue

Hi all,

I have a cluster of two 5000 appliances running R80.10.

My trouble is that one of the members, the standby, has lost SIC with the SMS. The active member is running well, but I can't push new policies.

I tried 30 times to reset SIC between the standby and the SMS, but always got the error (300, 148).

So there are now about 30 revoked certs on the SMS ...

My question is: must I reset SIC on both gateways (even the active one)?

If so, since I can't push policies, what would happen to the active GW?

Thanks a lot for your help,

27 Replies
Anthony_Joubai1
Contributor

Hello,

The first question is why the standby member is not able to maintain SIC.

There is basically no need to reset it.

Please describe the architecture first: Full HA or distributed?

Once SIC is established, can you push once and then you are blocked?
Is this a new installation?

Regards,

Anthony

0 Kudos
GUEYDON_Olivier
Contributor

Hello Anthony (French?)

The cluster had been working well since February. There is no NAT (LAN firewalling), and it runs in Active/Passive HA mode.

Both GWs are on the same VLAN, which is not the SMS's VLAN, but as I said, the active one works well. I can ping the SMS from the active and the standby, and both GWs from the SMS.

Ports 18191 and 18192 are in LISTEN state on the standby GW.
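
For anyone repeating the listening-port check, here is a minimal shell sketch. It assumes expert mode on the gateway and standard Linux `netstat -an` output. The function name check_sic_ports is made up for illustration; only the two port numbers come from the post above.

```shell
# Report whether each of the two ports shows up in LISTEN state.
# Reads `netstat -an`-style output on stdin and prints one line per port.
check_sic_ports() {
    input=$(cat)
    for port in 18191 18192; do
        # Match ":<port>" followed by whitespace on a line that ends in LISTEN.
        if printf '%s\n' "$input" | grep -E "[:.]${port}[[:space:]].*LISTEN" >/dev/null; then
            echo "$port listening"
        else
            echo "$port NOT listening"
        fi
    done
}

# On the gateway this would be used roughly as:
#   netstat -an | check_sic_ports
```

A listening port only shows that the daemon is up on the standby. It says nothing about whether packets from the SMS actually reach it.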

I started having problems when I pushed a modified policy. I ran into a service port conflict error (uncheck the "Match for Any" checkbox ...).

Thanks,

0 Kudos
Anthony_Joubai1
Contributor

Hello,

Could you run the following command on both firewalls:
fw ctl zdebug drop | grep <SmartcenterIP>
and then try to check the SIC status on the object.

We may have the answer.

On the cluster object, which IPs are used for the cluster members (private/public)?

 

I would be delighted to discuss this case in the French section 🙂

CheckMates en Français 

0 Kudos
GUEYDON_Olivier
Contributor

Yes, I'd also prefer to move to the French section. How can we do that?

Here are my IPs:

Active GW : 10.30.255.241

Standby GW : 10.30.255.243

SmartCenter : 10.33.1.130

The results of the command :

On the active GW :

Packet proto=6 10.33.1.130:55216 -> 10.30.255.243:18192 dropped by fw_tcp_state_update Reason: Illegal post SYN packet;

On the standby GW : nothing
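
To make drop lines like the one above easier to compare across members, a small filter can pull out the fields that matter. This is only a sketch: summarise_drops is a hypothetical helper name, and it assumes the line layout is exactly as quoted in this post (other versions may print drops differently).

```shell
# Summarise drop lines of the form quoted above:
#   ... <src>:<sport> -> <dst>:<dport> dropped by <function> Reason: <text>;
# Lines that do not match this layout are silently skipped.
summarise_drops() {
    sed -n -E 's/.* ([0-9.]+):[0-9]+ -> ([0-9.]+):([0-9]+) dropped by ([^ ]+) Reason: (.*);.*/src=\1 dst=\2 dport=\3 by=\4 reason="\5"/p'
}

# Intended use on a gateway (the grep target is the SMS address from this thread):
#   fw ctl zdebug drop | grep 10.33.1.130 | summarise_drops
```

Fed the line above, it reports dst=10.30.255.243 and dport=18192: the packet dropped on the active member was addressed to the standby.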

0 Kudos
Anthony_Joubai1
Contributor

A possible solution is to set the following on both members.

On the fly:
# fw ctl set int fwha_forw_packet_to_not_active 1

This doesn't survive a reboot.

If it corrects the problem, we will add it to fwkern.conf.

0 Kudos
Kaspars_Zibarts
Employee

Judging from the logs, traffic from mgmt arrives on a different interface (not 10.30.255.x), so you can try the /32 routing solution provided here in case packets are not forwarded to the standby member correctly. It's towards the end of the thread.

https://community.checkpoint.com/message/13561-re-problem-accessing-standby-cluster-member-from-non-... 

0 Kudos
Lloyd_Crosby
Contributor

Just a few questions.

Does the traffic traverse the VIP or the actual IP for the secondary?

What does the traffic from the Management server look like?

fw monitor -e 'accept host(10.33.1.130);'

Run that on both members.

See if it's hitting the VIP or the secondary at all.

Can you SSH from the MGMT server to the secondary firewall?

What does the routing look like?

Run ip route get x.x.x.x towards the mgmt from FW1 and FW2, and do the same from the MGMT towards both.

What do the traceroutes look like?

Are you NATting the MGMT behind the cluster? If you aren't, put in a no-NAT rule to make sure you aren't.

If you are opening a ticket with the TAC, get this information for them:

-Cpinfo of the GWs

-Cpinfo of the MGMT

-fw monitor and tcpdump captures showing the communication between the mgmt and the GWs

-CPD messages.

Jerry
Mentor

If I were in your shoes, I would first check every aspect of the routing between the ClusterXL cluster and the SMS/MDS, then troubleshoot the SIC issue by narrowing down the protocol flow with fw monitor. It may sometimes happen that one of the HA members loses SIC, but when it does, the cluster essentially has no management capability at all, as you cannot push policy to one member only, can you?

See if the certs can be removed from the non-working member first, and then try to re-establish SIC with the cluster instead of the non-working member (cpconfig on the active member).

If not, look for an SK in the Support Center. There was something about it ... let me try to locate it myself just now ...

Jerry
0 Kudos
Jerry
Mentor

sk62873 then sk30579

Jerry
0 Kudos
Vincent_Bacher
Advisor

fw debug cpca on TDERROR_ALL_ALL=5

might be interesting as well

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite
GUEYDON_Olivier
Contributor

OK, here is the latest news:

@Anthony: fwha_forw_packet_to_not_active 1 didn't solve it.

@Jerry: I couldn't find a way to remove the GW certificates.

The ICA WebGUI (Management Tool) lists all the revoked certificates for the non-working GW. I think I can delete them (can you confirm?). There is also the valid one for the active GW.

I found this: Regenerate the Internal CA Without Breaking SIC

Could it be useful?

0 Kudos
Jerry
Mentor

Yes, indeed you can (just check the issue date beforehand) 🙂

Regarding Regenerate: indeed, that may be helpful.

You’re the man, best of luck!

Jerry
0 Kudos
JozkoMrkvicka
Mentor

Have you tried fw unloadlocal on the problematic node during the issue, to see if this is rulebase related?

Also, did you try "fw ctl zdebug + drop | grep <SMS_IP>" as was suggested?

Kind regards,
Jozko Mrkvicka
GUEYDON_Olivier
Contributor

Hi guys,

This seems to be an endless problem. I've tried everything you suggested, but with no luck.

I think a factory default is the next step. As it's the first time for me, I'm a little bit confused 🙂

Will I be able to reattach the GW to the cluster after that?

Is there anything to do before the factory default?

0 Kudos
Jerry
Mentor

Don’t do a factory default, just re-do full HA SIC again (yes, from scratch), unless you don’t have console/mgmt access to the ClusterXL HW (VM/APPL).

Do the SIC again with the Management, but first make sure you’ve got IP connectivity with the MGMT server and that the HA members can individually reach the SMS host/s.

What about the routing table? Did you compare the routing tables from BOTH HA members?

I have had such a case where the two members had DIFFERENCES between them. It might simply be a matter of connectivity or ... NIC associations?
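
One way to compare the two routing tables, sketched under assumptions: each member's table has already been saved to a text file (for example with `netstat -rn`), and both files have been copied to the same machine. The function name compare_routes and the file names are hypothetical.

```shell
# Print routes that appear in only one of two saved routing tables.
# Lines prefixed "<" exist only in the first file, ">" only in the second.
# Prints nothing when the two tables contain the same set of lines.
compare_routes() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Sort first so that a different line order alone is not reported.
    sort "$1" > "$a"
    sort "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
}

# Example with hypothetical file names:
#   compare_routes routes_fw1.txt routes_fw2.txt
```

Interface names or member-specific addresses may legitimately differ between the two files, so any reported line still needs a human look.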

Jerry
0 Kudos
Anthony_Joubai1
Contributor

Hello,

I agree with Jerry: don't reset the box as long as the root cause is not identified.

Please contact your support 🙂

A quick remote session and the problem will be solved without resetting the box.

Regards,

Anthony

Jerry
Mentor

Agree with Anthony: call Support, unless you’ve got no valid support contract with the vendor. If you don’t, then ... you may need to dig into the details of the design, mate 🙂

Jerry
0 Kudos
GUEYDON_Olivier
Contributor

Jerry, what do you mean by "re-do full HA SIC again (yes, from scratch)"?

Are you talking about Regenerate the Internal CA Without Breaking SIC?

It's the only thing I haven't done yet (too scared to lose the working box!)

About connectivity, both GWs can ping the SMS, but they don't ping each other.

SMS can ping both GWs.

The routing tables are the same on both GWs.

0 Kudos
Jerry
Mentor

In that case yes, I think this is the only way to regenerate SIC with the CA without breaking the communication entirely. I would be careful, though, so that if something bad happens I can still securely access Management IPs from the HA members.

Jerry
0 Kudos
GUEYDON_Olivier
Contributor

I found something strange in the SmartCenter logs:

From SMS to non-working GW: DROP / CPD_amon (18192) / Data received before SYN was acknowledged. Stripping all packet data.

0 Kudos
Jerry
Mentor

My advice would be to compare the “takes” on both GWs and re-run the latest “take” you’ve got on the bad GW. This would reinstall the broken CPD daemon. This is what I would do; the decision is still yours.

Jerry
0 Kudos
JozkoMrkvicka
Mentor

Asymmetric routing.

I had the exact same issue with SIC and policy installation (it took around 20 minutes and ended with a timeout error), and it turned out that the issue was with routing.

Run a traceroute in the normal situation and compare it with one taken during the issue.

Kind regards,
Jozko Mrkvicka
0 Kudos
GUEYDON_Olivier
Contributor

Well, in fact, all the traffic from the bad gateway FW-2 is dropped ...

Cluster member IP is being spoofed.

Since last Friday. I don't know what I have done 🙁

0 Kudos
Jerry
Mentor

Honestly, I’d re-do ClusterXL from scratch, otherwise you may look for causes for 7 days and end up nowhere. Hard time for you, mate, but the truth is that the investigation may take you days if not weeks ...

Follow the SKs and rebuild HA. Easiest option IMHO.

Jerry
0 Kudos
Kaspars_Zibarts
Employee

As I mentioned before, it looks like traffic from FW2 is being routed via FW1 (at least that's what your log is showing).

Make sure that traffic to/from management reaches FW2 directly, not via FW1. The easiest way is to add a /32 route on the gateway in front of the firewalls pointing directly to each cluster member's physical IP 10.30.255.x.

Also make sure that the return route is correct on FW2. It seems odd that traffic from FW2 to the SMS is being routed via FW1.

0 Kudos
GUEYDON_Olivier
Contributor

Hi guys,

I did almost everything you suggested, and the cluster is finally working well.

The issue came from a bad SIC certificate. I restored a snapshot from July on the bad GW, created a new certificate on the SMS, and reinitialized SIC.

And then: Trust established.

The two GWs are now synced.

Thanks again for your help.

Cheers!

Jerry
Mentor

Told you a “new SIC Cert” would fix that 🙂

Jerry
0 Kudos
