DirkB
Contributor

R81_10_JUMBO_HF_MAIN_Bundle_T79 Problem with Standby

ClusterXL on 3800 appliances, previously running R81.10 T78.

Updated yesterday to R81.10 T79; the update completed without any errors.

The active member works fine, with no problems in operation, but the standby goes into an error state:

cp-t79-fails-standby.jpg

 

I have seen this error in previous posts for other Takes, e.g. https://community.checkpoint.com/t5/Management/Anti-Bot-amp-Anti-Virus-IPS-update-error-on-Standby-M... and others.

Should I roll back, wait for a fix, or ignore it?

Thank you

 

24 Replies
Lesley
Contributor

Hey Dirk,

It depends. If the upgrade has only just finished, it may need some time. In that case I would reboot the standby unit and the management server.

If that does not help, there is an SK about the management not showing the correct status.

After the reboot, maybe check whether it really is the case that you cannot reach updates.checkpoint.com.

DirkB
Contributor

Thanks Lesley,

No ... the update was done a few hours earlier, and I did an extra reboot just to check ...

DirkB
Contributor

From the active cluster member, access is OK (Gaia):

cp-t79-act-ok.jpg

From the standby, there is no access:

cp-t79-standby-fails.jpg

Of course, the DNS and proxy settings (no proxy) are the same for all cluster members.

Lesley
Contributor

DirkB
Contributor

I'll try it ... though I'm skeptical: until yesterday, with T78, everything was still working.

the_rock
Champion

I can't comment on production environments, but I did this in the lab recently and never saw this problem on the cluster.

DirkB
Contributor

Thank you Rock, I have removed the warning from the post and will keep you updated on what it turns out to be. I will also not roll back for now.

the_rock
Champion

@Lesley gave a good reference SK, but personally, I don't believe this is a bug. It may be worth troubleshooting with TAC.

Andy

Lesley
Contributor

Final idea: perform a failover and let it do its updates (if that is even possible). If it can update, the issue is related to the member being the standby; if not, something is wrong with this member itself. If the update succeeds, make it standby again and watch whether the issue recurs.

the_rock
Champion

very logical point!

DirkB
Contributor

Lesley, thanks for the input.

Yes, I already had that thought in mind 😉 ... but I work remotely, and I have to assume the session will drop and may not come back, so it is an increased risk for me.

rickardsv
Participant

Probably the same result, but sometimes SmartConsole can fool you with "cosmetic errors".

Check the status from the CLI on the gateway:

cpstat antimalware -f update_status

DirkB
Contributor

@rickardsv ... it was not a cosmetic error in the console view; there really was no access from Gaia or from the CLI.

Thanks for your input. cpstat is OK now because the failover temporarily solved the problem.

DirkB
Contributor

So, I took the risk of a failover. The session dropped but came back after about 20 seconds 😥 (unlike usual, when there was no interruption of sessions).


After the failover everything was green (maybe I didn't wait long enough), and the updates (IPS, Anti-Bot) were now pulled as well, but:
after failing back (to the initial state), the standby again has no connection to the CP cloud. I'll fail over again and check access on the other cluster member.

the_rock
Champion

Maybe compare the output of the commands below on both members:

ip r g 8.8.8.8

curl_cli -k google.com
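
To make the outputs easy to compare side by side, the two commands can be wrapped so each member prints the command, its output, and its exit code in the same layout. This is only a rough sketch for expert mode: `run_check` and `collect_checks` are made-up helper names, and `ip route get` is just the long form of `ip r g`.

```shell
#!/bin/bash
# Sketch: run the checks from this thread on one cluster member and
# print each command, its output, and its exit code.
# run_check and collect_checks are hypothetical helper names.

run_check() {
    local label="$1"
    shift
    local rc=0
    echo "=== ${label}: $* ==="
    # Keep stderr as well, since curl reports resolver errors there.
    "$@" 2>&1 || rc=$?
    echo "exit code: ${rc}"
}

collect_checks() {
    # Same two checks as above, written out in long form.
    run_check "route lookup" ip route get 8.8.8.8
    run_check "HTTP test" curl_cli -k google.com
}
```

After sourcing this on each member, something like `collect_checks > /tmp/checks_$(hostname).txt` (the file name is only an example) gives one file per member that can be compared with `diff`. Differences in the source address are expected; a different exit code for the HTTP test is the interesting part.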

Andy

DirkB
Contributor

Currently, on both the active and the standby member, the route lookup for Google DNS returns the external interface, followed by "cache".


For the test against the Google site, only the active member gets a response (301 Moved); the standby returns error (6) "couldn't resolve".
After a failover the behavior follows the role: the new active member can resolve names, which it could not do before as standby.

The new standby cannot resolve names, which it could do before as active.


Basically, this is the same behavior already seen in Gaia when checking for updates.
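
For anyone reading along: the (6) here is curl's exit status for a failed host name lookup, which points at DNS resolution rather than at routing or the HTTP connection itself. Below is a small sketch that maps a few common curl exit codes to a short diagnosis. It assumes `curl_cli` returns the standard curl exit codes, and `classify_curl_exit` is a made-up helper name.

```shell
#!/bin/bash
# Sketch: translate a curl exit status into a short diagnosis.
# Assumption: curl_cli uses the standard curl exit codes.
# classify_curl_exit is a hypothetical helper name.

classify_curl_exit() {
    case "$1" in
        0)  echo "ok: request completed" ;;
        5)  echo "dns: could not resolve proxy" ;;
        6)  echo "dns: could not resolve host" ;;
        7)  echo "connect: failed to connect to host" ;;
        28) echo "timeout: operation timed out" ;;
        *)  echo "other: curl exit code $1" ;;
    esac
}
```

It could be used as `curl_cli -k google.com; classify_curl_exit $?` on each member. A "dns" result on the standby combined with "ok" on the active would match what is described in this post.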


@DirkB 

Hi 🙂

My name is Naama Specktor and I am a Check Point employee.

If you have opened a TAC SR, I would appreciate it if you shared it with me here or via PM.

Thanks!

Naama

DirkB
Contributor

@Naama_Specktor Thank you for the offer ... but I'm a "bit up in the air" with TAC ;), because we have no partner at the moment and the new contracts will still take some time. I will test a few more things and report back here.

DirkB
Contributor

The problem is reproducible and moves to whichever member is standby each time a failover occurs.

It is probably best if I roll back the update, so that I can see whether everything runs normally again without it (?)

the_rock
Champion

Yes, but before you do so, can you run the command I mentioned previously on the "problematic" member from expert mode: curl_cli -k google.com

Andy

DirkB
Contributor

Hi Rock, thank you ... I replied to your hint directly further up, sorry.

the_rock
Champion

Sorry, I see it now. Logically (but this is just me), if all this worked fine with the previous Jumbo, then I would say roll back and see if the problem goes away. Happy to do a remote session and check for you if you like, just message me privately. Not saying we would fix it, but I have a few things in mind we could check.

DirkB
Contributor

Hello all,

The error with R81.10 Take 79 persists. I tried some more SKs (e.g. https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...), but nothing helped.

The standby always (and after every failover) has no connection to the cloud, or to the internet/DNS in general (following @the_rock's hint, curl_cli -k google.com returns error (6) "couldn't resolve"). The active cluster member always works fine, as does the former standby once it becomes active after a failover.


Also noticeable: sessions break on failover (I worked on the problem remotely via VPN). This did not happen with Take 78; the timeout of about 10-15 seconds after failover occurs only with Take 79.


So I rolled back to Take 78 (via Gaia). The difficulty was that the Take could only be uninstalled on the active member, because the check before uninstalling requires access to the CP cloud. Fortunately, the automatic failover works 😉


It may be interesting to know that the effect (the standby has no connection to the CP cloud and cannot do DNS resolution) remained even after one member had been rolled back to Take 78. Only when both cluster members were back on R81.10 T78 was the spook over, and everything worked again without further changes.

When I find out which specific config might have caused this behavior, I'll let you know here.

Thank you for the support!

DirkB
Contributor

Hints if you run into this error:

Updating from R81.10 T78 to Take 79 leaves the standby without a connection to the CP cloud.

  • After a failover, the former standby (now active) works fine again without any changes to the config, but the new standby has the access problem. This is consistently reproducible.
  • On failover, sessions break for a short time (timeout of about 10-15 seconds).
  • No SK helped.
  • Only a rollback helped in my case.
  • If the error occurs elsewhere, maybe Check Point will look into it. So far we have never had such problems with updates (due to environmental configuration).
