Re: SIC issue

GUEYDON_Olivier · ‎2018-08-09

Hi all,

I have a cluster of two 5000 appliances running R80.10.

My trouble is that one of the members, the standby, has lost SIC with the SMS. The active member is running well, but I can't push new policies.

I tried 30 times to reset SIC between the standby and the SMS, but always got the error (300, 148).

So there are now about 30 revoked certs on the SMS ...

My question is: must I reset SIC on both gateways (even the active one)?

If so, as I can't push policies, what would happen to the active GW?

Thanks a lot for your help,

Anthony_Joubai1 · ‎2018-08-09

Hello,

The first question is why the standby member is not able to maintain SIC.

There is basically no need to reset it.

Please describe the architecture first: Full HA or distributed?

Once SIC is established, can you push once and then you are blocked?
Is it a new installation?

regards,

Anthony

GUEYDON_Olivier · ‎2018-08-09

Hello Anthony (French?)

The cluster had been working well since February. There is no NAT (LAN firewalling), and it runs in Active/Passive HA mode.

Both GWs are on the same VLAN, but not on the same one as the SMS. As I said, though, the active one works well. I can ping the SMS from the active and the standby, and both GWs from the SMS.

Ports 18191 and 18192 are listening on the standby GW.
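A listening port on the gateway only shows that the daemon is up. It does not show that a TCP handshake from the management side actually completes. As a minimal sketch, assuming a host with Python 3 that sits on the same network path as the SMS, the helper below attempts a full TCP connect to the two ports mentioned above. The function names `tcp_port_open` and `check_ports` are made up for this illustration and are not part of any Check Point tooling.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a full TCP handshake to host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host: str, ports=(18191, 18192)) -> dict:
    """Map each port to the result of a connect attempt."""
    return {port: tcp_port_open(host, port) for port in ports}


# Example, using the standby member IP quoted later in this thread:
# print(check_ports("10.30.255.243"))
```

If the connect fails from the management side while the port is listening locally, the problem is more likely on the path (routing, or drops on the other member) than in the daemon itself.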

I started having problems when I pushed a modified policy. I ran into an error about a service port conflict (uncheck the "Match for Any" checkbox ...).

Thanks,

Anthony_Joubai1 · ‎2018-08-09

Hello,

Could you run the following command on both firewalls,

fw ctl zdebug drop | grep <SmartcenterIP>

and then try to check the SIC status on the object?

We may have the answer.

On the cluster object, which IPs are used for the cluster members (private/public)?

I would be delighted to discuss this case in the French section:

CheckMates en Français

GUEYDON_Olivier · ‎2018-08-09

Yes, I'd also prefer to move to the French section. How can we do that?

Here are my IPs:

Active GW : 10.30.255.241

Standby GW : 10.30.255.243

SmartCenter : 10.33.1.130

The results of the command :

On the active GW :

Packet proto=6 10.33.1.130:55216 -> 10.30.255.243:18192 dropped by fw_tcp_state_update Reason: Illegal post SYN packet;

On the standby GW : nothing
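The notable detail in that output is that a packet addressed to the standby member (10.30.255.243) was dropped on the active member. The sketch below, in Python, parses a drop line of this shape and flags that situation. The regular expression is modelled only on the single line quoted above, so other versions or drop reasons may need a different pattern, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Pattern derived from the one drop line quoted in this thread (an assumption,
# not an official format description).
DROP_RE = re.compile(
    r"proto=(?P<proto>\d+)\s+"
    r"(?P<src>[\d.]+):(?P<sport>\d+)\s+->\s+"
    r"(?P<dst>[\d.]+):(?P<dport>\d+)\s+"
    r"dropped by\s+(?P<func>\S+)\s+Reason:\s*(?P<reason>[^;]+)"
)


def parse_drop(line: str) -> Optional[dict]:
    """Return the fields of a drop line as strings, or None if it does not match."""
    match = DROP_RE.search(line)
    return match.groupdict() if match else None


def seen_on_wrong_member(drop: dict, local_ip: str) -> bool:
    """True when the dropped packet targets an IP other than the logging member."""
    return drop["dst"] != local_ip


line = ("Packet proto=6 10.33.1.130:55216 -> 10.30.255.243:18192 dropped by "
        "fw_tcp_state_update Reason: Illegal post SYN packet;")
drop = parse_drop(line)
print(drop["dst"], drop["dport"], drop["reason"])
print(seen_on_wrong_member(drop, local_ip="10.30.255.241"))
```

A packet for the standby showing up (and being dropped) on the active member suggests that management traffic for 10.30.255.243 transits the active member instead of reaching the standby directly.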

Anthony_Joubai1 · ‎2018-08-09

A possible solution is to run the following on both members.

On the fly (does not survive a reboot):
# fw ctl set int fwha_forw_packet_to_not_active 1

If it corrects the problem, we will add it to fwkern.conf.

Kaspars_Zibarts · ‎2018-08-09

Judging from the logs, traffic from mgmt arrives on a different interface (not 10.30.255.x), so you can try the /32 routing solution provided in the thread linked below, in case packets are not being forwarded to the standby member correctly. It's towards the end of that thread.

https://community.checkpoint.com/message/13561-re-problem-accessing-standby-cluster-member-from-non-...

Lloyd_Crosby · ‎2018-08-11

Just a few questions.

Does the traffic traverse the VIP or the actual IP for the secondary?

What does the traffic from the management server look like?

fw monitor -e 'accept host(10.33.1.130);'

Run that on both members.

See if it's hitting the VIP or the secondary at all.

Can you SSH from the MGMT server to the secondary firewall?

What does the routing look like?

Run ip route get x.x.x.x towards the mgmt from FW1 and FW2, and do the same from the MGMT towards both.

What do the traceroutes look like?

Are you NATting the MGMT behind the cluster? If you aren't, put in a no-NAT rule to make sure you aren't.

If you are opening a ticket with the TAC, get this information for them:

- cpinfo of the GWs

- cpinfo of the MGMT

- fw monitor and tcpdump captures showing the communication between the mgmt and the GWs

- CPD messages.

Jerry · ‎2018-08-09

If I were in your shoes, I would first determine all aspects of the routing between the ClusterXL cluster and the SMS/MDS, then troubleshoot the SIC issue by narrowing down the protocol flow with fw monitor. It can sometimes happen that one of the HA members loses SIC, but when it does, the HA effectively has no management capabilities at all, as you cannot push policy to one member only, can you?

See if the certs can be removed from the non-working member first, and then try to re-establish SIC with the cluster instead of the non-working member (cpconfig on the active member).

If not, look for an sk in the Support Center. There was something about it ... let me try to locate it myself just now ...

Jerry

Jerry · ‎2018-08-09

sk62873 then sk30579

Jerry

Vincent_Bacher · ‎2018-08-09

fw debug cpca on TDERROR_ALL_ALL=5

might be interesting as well


GUEYDON_Olivier · ‎2018-08-09

Ok, here are the latest news:

@Anthony: fwha_forw_packet_to_not_active 1 didn't solve it.

@Jerry: I couldn't find a way to remove the GW certificates.

The ICA Management Tool (web GUI) lists all the revoked certificates for the non-working GW. I think I can delete them (can you confirm?). There is also the valid one for the active GW.

I found this: Regenerate the Internal CA Without Breaking SIC

Could it be useful?

Jerry · ‎2018-08-09

Yes, indeed you can (just check the issue date beforehand).

Regarding the regeneration: indeed, that may be helpful.

You’re the man, best of luck!

Jerry

JozkoMrkvicka · ‎2018-08-09

Can you run fw unloadlocal on the problematic node while the issue is occurring, to see whether this is rulebase related?

Also, did you try "fw ctl zdebug + drop | grep <SMS_IP>" as was suggested?

Kind regards,
Jozko Mrkvicka

GUEYDON_Olivier · ‎2018-08-10

Hi guys,

This seems to be an endless problem. I've tried everything you suggested, but with no luck.

I think a factory default is the next step. As it would be the first time for me, I'm a little bit confused.

Will I be able to reattach the GW to the cluster after that?

Are there any actions to take before the factory default?

Jerry · ‎2018-08-10

Don’t do a factory default, just re-do full HA SIC again (yes, from scratch), unless you don’t have console/mgmt access to the ClusterXL HW (VM/APPL).

Do the SIC again with the management, but first make sure you’ve got IP connectivity with the MGMT server and that the HA members can each reach the SMS host(s) individually.

What about the routing table? Did you compare the routing tables from BOTH HA members?

I have had a case like this where the two members had DIFFERENCES between them. It might simply be a matter of connectivity or ... NIC associations?
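Comparing the two routing tables by eye is error-prone, so a small script can help. This is a minimal sketch in Python that assumes each member's routing table has been saved as plain text, one route per line. The sample routes, the next hop 10.30.255.254 and the interface name eth1 are hypothetical and only illustrate the comparison.

```python
def normalize(dump: str) -> set:
    """Collapse whitespace and drop empty lines so formatting noise is ignored."""
    return {" ".join(line.split()) for line in dump.splitlines() if line.strip()}


def route_diff(fw1_dump: str, fw2_dump: str) -> dict:
    """Return the routes present on only one of the two members."""
    fw1, fw2 = normalize(fw1_dump), normalize(fw2_dump)
    return {"only_fw1": sorted(fw1 - fw2), "only_fw2": sorted(fw2 - fw1)}


# Hypothetical dumps, not taken from the cluster discussed in this thread.
fw1_routes = """
default via 10.30.255.254 dev eth1
10.30.255.0/24   dev eth1
"""
fw2_routes = """
default via 10.30.255.254 dev eth1
10.30.255.0/24 dev eth1
10.33.1.130 via 10.30.255.241 dev eth1
"""

print(route_diff(fw1_routes, fw2_routes))
```

Entries that carry each member's own interface address will naturally differ between the two dumps. The lines worth a closer look are extra or missing destinations and differing next hops.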

Jerry

Anthony_Joubai1 · ‎2018-08-10

Hello,

I agree with Jerry: don't reset the box as long as the root cause has not been identified.

Please contact your support.

A quick remote session and the problem will be solved without resetting the box.

regards,

Anthony

Jerry · ‎2018-08-10

Agree with Anthony, call support, unless you’ve got no valid support contract with the vendor. If you don’t, then ... you may need to dig into the details of the design, mate.

Jerry

GUEYDON_Olivier · ‎2018-08-10

Jerry, what do you mean by "re-do full HA SIC again (yes, from scratch)"?

Are you talking about Regenerate the Internal CA Without Breaking SIC?

It's the only thing I haven't done yet (too scared to lose the working box!)

About connectivity: both GWs can ping the SMS, but they can't ping each other.

The SMS can ping both GWs.

The routing tables are the same on both GWs.

Jerry · ‎2018-08-10

In that case, yes, I think this is the only way to regenerate SIC with the CA without breaking the communication entirely. However, I would be careful, so that if something bad happens I can still securely access the management IPs from the HA members.

Jerry

GUEYDON_Olivier · ‎2018-08-10

I found something strange in the SmartCenter logs:

From SMS to non-working GW: DROP / CPD_amon (18192) / Data received before SYN was acknowledged. Stripping all packet data.

Jerry · ‎2018-08-10

My advice would be to compare the “takes” on both GWs and re-run the latest “take” you’ve got on the bad GW. This would reinstall the broken CPD daemon. This is what I would do; the decision is still yours.

Jerry

JozkoMrkvicka · ‎2018-08-11

Asymmetric routing.

I had exactly the same issue with SIC and policy installation (it took around 20 minutes and ended with a timeout error), and it turned out that the issue was with routing.

Do a traceroute in the normal situation and compare it with one taken during the issue.

Kind regards,
Jozko Mrkvicka

GUEYDON_Olivier · ‎2018-08-10

Well, in fact, all the traffic from the bad gateway FW-2 is dropped ...

Cluster member IP is being spoofed.

Since last Friday. I don't know what I have done.

Jerry · ‎2018-08-10

Honestly, I’d re-do ClusterXL from scratch, otherwise you may look for causes for 7 days and end up nowhere. Hard time for you, mate, but the truth is that the investigation may take you days, if not weeks ...

Follow the sk’s and rebuild HA. Easiest option IMHO.

Jerry

Kaspars_Zibarts · ‎2018-08-11

As I mentioned before, it looks like traffic from FW2 is being routed via FW1 (at least that's what your log is showing).

Make sure that traffic to/from management reaches FW2 directly, not via FW1. The easiest way is to add a /32 route on the gateway in front of the firewalls, pointing directly to each cluster member's physical IP (10.30.255.x).

Also make sure that the return route is correct on FW2. It seems odd that traffic from FW2 to the SMS is being routed via FW1.
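The /32 suggestion relies on longest-prefix matching: a host route for a member's physical IP wins over any wider route covering the same subnet. The Python sketch below illustrates that selection with the standard ipaddress module. The route table and the next-hop labels are hypothetical and do not describe the actual upstream router in this thread.

```python
import ipaddress


def best_route(dest: str, routes):
    """Pick the matching route with the longest prefix, as a router would."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, next_hop))
    if not matches:
        return None
    net, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), next_hop


# Hypothetical table on the router in front of the firewalls.
routes = [
    ("10.30.255.0/24", "via-cluster-VIP"),
    ("10.30.255.241/32", "direct-to-FW1"),
    ("10.30.255.243/32", "direct-to-FW2"),
]

print(best_route("10.30.255.243", routes))
print(best_route("10.30.255.10", routes))
```

With the host routes in place, traffic for each member's physical IP no longer depends on whichever member currently answers for the wider subnet.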

GUEYDON_Olivier · ‎2018-08-16

Hi guys,

I did almost everything you suggested, and the cluster is finally working well.

The issue came from a bad SIC certificate. I restored a snapshot from July on the bad GW, created a new certificate on the SMS, and reinitialized SIC.

And then: Trust established.

The two GWs are now synced.

Thanks again for your help.

Cheers!

Jerry · ‎2018-08-16

Told you a “new SIC cert” would fix that 🙂

Jerry
