R81_10_JUMBO_HF_MAIN_Bundle_T79 Problem with Standby

DirkB · ‎2022-11-24

ClusterXL on 3800 appliances, previously on R81.10 T78.

Updated yesterday to R81.10 T79; the update ran without any errors.

The active member works fine, with no operational problems, but the standby goes into an error state.

I have seen this error before in previous posts for other Takes ... https://community.checkpoint.com/t5/Management/Anti-Bot-amp-Anti-Virus-IPS-update-error-on-Standby-M... and others

Roll back, wait for a fix, or ignore?

Thank you

Lesley · ‎2022-11-24

Hey Dirk,

It depends. If the upgrade just finished, it needs some 'time'. In that case I would reboot the standby unit + mgmt.

If that does not help, there is an SK about the management not showing the correct status.

Maybe check after reboot if it is really the case you cannot reach updates.checkpoint.com

DirkB · ‎2022-11-24

Thanks Lesley,

No, ... the update was a few hours earlier, and I did an extra reboot just to check ...

DirkB · ‎2022-11-24

From the active cluster member, access is OK (checked in Gaia).

From the standby, there is no access.

Of course, DNS and proxy settings (no proxy) are the same for all cluster members ...

Lesley · ‎2022-11-24

Maybe this SK will help:

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

DirkB · ‎2022-11-24

I'll try it ... but I'm skeptical: until yesterday, with T78, everything was still running.

the_rock · ‎2022-11-24

I can't comment on a production environment, but I did this upgrade in the lab recently and never saw this problem on the cluster.

DirkB · ‎2022-11-24

Thank you Rock, I have removed the warning from the post and will keep you up to date on what the cause was. ... I will also not roll back for now.

the_rock · ‎2022-11-24

@Lesley gave a good reference SK, but personally, I don't believe this is a bug. It may be worth troubleshooting with TAC.

Andy

Lesley · ‎2022-11-24

Final idea: perform a failover and let the member do its updates (if that is even possible). If it can update once it is active, the problem is related to it being the standby; if not, something is wrong with this member itself. If the update succeeds, make it standby again and watch whether the issue recurs.

the_rock · ‎2022-11-24

very logical point!

DirkB · ‎2022-11-24

Lesley, thanks for the input.

Yes, I already had that thought in mind 😉 ... but I work remotely ... I have to assume that the session will drop and might not come back ... an increased risk for me ...

svori · ‎2022-11-24

Probably the same result, but sometimes SmartConsole can fool you with "cosmetic errors".

Check the status from the CLI on the gateway:

cpstat antimalware -f update_status

DirkB · ‎2022-11-24

@svori ... it was not a cosmetic error in the console: there really was no access, from either Gaia or the CLI.

Thanks for your input. The cpstat output is OK now because the problem was temporarily solved by the failover.

DirkB · ‎2022-11-24

So, ... I took the risk of a failover. The session was gone, but came back after about 20 sec 😥 (different from usual; previously there was no interruption of the sessions ...)

After the failover everything was green (maybe I didn't wait long enough), and the updates (IPS, Anti-Bot) were pulled as well, but:
after failing back (to the initial position) the standby again has no connection to the CP cloud ... I'll fail over again and check the other cluster member's access ...

the_rock · ‎2022-11-24

Maybe compare the output of the commands below on both members:

ip r g 8.8.8.8

curl_cli -k google.com

Andy
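One possible way to make that comparison easier is to capture each command's output and exit code under a header on both members and then diff the two files. This is only a sketch: the `capture` helper and the `/var/tmp` file name are hypothetical, and it assumes the commands are run from expert mode.

```shell
#!/bin/sh
# Run each command passed as an argument and print its output and exit code
# under a header line, so captures from two members can be compared with diff.
# "capture" is a hypothetical helper name, not a Check Point tool.
capture() {
    for cmd in "$@"; do
        echo "### $cmd"
        # 2>&1 keeps error messages (for example a resolver error) in the capture
        sh -c "$cmd" 2>&1
        echo "exit=$?"
    done
}

# On each member (hypothetical output path):
#   capture "ip r g 8.8.8.8" "curl_cli -k google.com" > /var/tmp/net_$(hostname).txt
# Then copy both files to one machine and compare:
#   diff net_memberA.txt net_memberB.txt
```

Recording the exit code next to the output helps when the visible text looks similar on both members but one command actually failed.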

DirkB · ‎2022-11-24

Currently, for both the active and the standby member, the route lookup for Google DNS returns the external interface, followed by "cache".

For the test against the Google site, only the active member gets a response (301 Moved); on the standby I get error (6) "couldn't resolve".
If I do a failover, this behavior follows the roles: the new active member can resolve what did not work before (as standby).

The new standby can't resolve what worked before (as active).

Basically, this behavior was already visible in Gaia when checking for updates ...
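The error (6) reported above matches curl's documented exit code for a host name that could not be resolved, which points at DNS rather than routing. A small sketch of mapping a few network-related exit codes to a diagnosis follows; `explain_curl_exit` is a hypothetical helper, and it assumes that curl_cli returns the same exit codes and accepts the same `-s` and `-o` options as stock curl.

```shell
#!/bin/sh
# Map a curl exit code to a short diagnosis. The meanings follow the
# EXIT CODES section of the curl manual; only a few codes are covered.
# "explain_curl_exit" is a hypothetical helper name.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok: transfer completed" ;;
        5)  echo "proxy: proxy name could not be resolved" ;;
        6)  echo "DNS: host name could not be resolved" ;;
        7)  echo "connect: could not connect to host" ;;
        28) echo "timeout: operation timed out" ;;
        *)  echo "other: see the curl manual for exit code $1" ;;
    esac
}

# Possible use on a member (assumption: curl_cli behaves like stock curl here):
#   curl_cli -k -s -o /dev/null google.com; explain_curl_exit $?
```

A result of 6 on the standby next to 0 on the active member would be consistent with the observation that routing looks identical on both members while name resolution fails only on the standby.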

Naama_Specktor · ‎2022-11-24

@DirkB

Hi 🙂

My name is Naama Specktor and I am a Check Point employee.

If you opened a TAC SR, I would appreciate it if you shared it with me here or via PM.

Thanks!

Naama

DirkB · ‎2022-11-24

@Naama_Specktor Thank you for the offer ... but I'm a "bit up in the air" with TAC ;), because we have no partner at the moment and the new contracts still take some time ... I will test a few more things and report back here.

DirkB · ‎2022-11-24

The problem is recurring and moves to the standby member each time a failover occurs.

It will probably be best if I roll back the update, so that I can see whether everything runs normally again without it (?)

the_rock · ‎2022-11-24

Yes, but before you do so ... can you run the command I mentioned previously on the "problematic" member from expert mode -> curl_cli -k google.com

Andy

DirkB · ‎2022-11-24

Hi Rock, thank you ... I replied to your hint directly further up, sorry.

the_rock · ‎2022-11-24

Sorry, sorry, I see it now. So, logically (but this is just me): if all this worked fine with the previous jumbo, then I would say roll back and see if the problem goes away. Happy to do a remote session and check for you if you like, just message me privately. Not saying we would fix it, but I have a few things in mind we could check.

DirkB · ‎2022-11-25

Hello all,

the error with R81.10 Take 79 persisted. I tried some more SKs (such as https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...), but nothing helped.

The standby was always (and after every failover) without access to the cloud, or to the internet/DNS in general (per @the_rock's hint, curl_cli -k google.com returns error (6) "couldn't resolve"). The active cluster member always works fine, as does the former standby once it becomes active after a failover.

Also noticeable is that sessions break on failover (I worked on the problem remotely over VPN). This did not happen with Take 78; only with Take 79 is there a timeout of about 10-15 seconds after a failover.

So I rolled back to Take 78 (via Gaia). The difficulty was that only the active member could be uninstalled, because the check before uninstalling requires access to the CP cloud. Fortunately, the automatic failover works 😉

It might be interesting to know that the effect (the standby has no connection to the CP cloud and can't do DNS resolution) remained even after one member had been rolled back to Take 78. Only when both members in the cluster were back on R81.10 T78 was the spook over and everything worked again, without further changes.

When I find out which specific config might have caused this behavior, I'll let you know here.

Thank you for support!

DirkB · ‎2022-11-25

Hints if you run into this error:

Updating from R81.10 T78 to Take 79 results in the standby having no connection to the CP cloud.

After a failover, the former standby (now the active member) works fine again without further changes to the config, but the new standby then has the access problem. This is consistently reproducible.
On failover, sessions break for a short time (timeout of about 10-15 seconds).
Nothing helped, no SK.
Only the rollback helped in my case.
If the error occurs elsewhere, maybe Check Point will look into it. So far we have never had such problems with updates (due to environmental configuration).
