Solved: Policy push overwrote default route on cluster active gateway

the_rock · ‎2022-11-24

Hey guys,

I really hope someone can shed some light on this. One of our colleagues went into a client's environment (they use Smart-1 Cloud and a 6000 series cluster) and simply added a couple of IP addresses to a block group. Once the policy was applied, we noticed that the active member could not be accessed.

At this point, thankfully, SSH to the backup member worked fine, so once we SSH-ed to the active member from the backup, we noticed that the default route was gone. Now, in my 15 years with CP, I had NEVER seen or heard of a problem like this. Keep in mind, failover never happened; however, there was an Internet outage, as the default route was gone. The default route was added back via clish afterwards, and we pushed policy a couple of times after that and it was fine.

Now, just to try and figure this out ourselves, we downloaded audit.log from the /var/log/audit directory, but it was not useful at all, as it does not have any timestamps. We searched it for words such as route, default and delete, but no luck. We are 99.99% sure that something other than the policy push caused this, but it is really hard to say what at this point.

We also checked the /var/log/messages files, but no luck there either. No one was even logged into the firewalls before this issue happened, so it raises the question of HOW this happened.
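Since audit.log carries no timestamps, a keyword scan can at least record which line numbers mention routing changes. Below is a minimal sketch, assuming the logs are plain text. `find_keyword_lines` and `KEYWORDS` are names made up for illustration, and the keywords are simply the ones searched for above.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Keywords taken from the search described in the post.
KEYWORDS = ("route", "default", "delete")


def find_keyword_lines(
    lines: Iterable[str], keywords: Sequence[str] = KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any keyword.

    Matching is case-insensitive; line numbers start at 1 so they can
    stand in for the missing timestamps when comparing notes.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

Passing an open file handle for a copy of audit.log or /var/log/messages as `lines` would list every matching line with its position in the file.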

We ended up opening a TAC case for it, but after a Zoom meeting, the engineer told us he would consult further internally and see what else can be done to try and find the reason.

If anyone else has an idea or any other file(s) we could check, it would be greatly appreciated!

Thanks as always.

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-12-14

@Ilya_Yusupov provided me with an updated cpisp_update script from the $FWDIR/bin directory and it worked fine in my lab, so that is the solution!

Cheers and thanks again Ilya for all your efforts, truly grateful!! 🙌

Best,
Andy
"Have a great day and if its not, change it"

DirkB · ‎2022-11-25

As a newcomer to Check Point, just a shot in the dark: is a Cloning Group defined? We once had strange effects with it (admin, cadmin ...), and access to cluster members was also lost (but not the default route 😉). But routing is a group feature ...

the_rock · ‎2022-11-25

Hey @DirkB, thanks for the response, but that's definitely not it, sorry. Let's see what TAC comes back with, as we are totally out of ideas where to even look next; we checked everything we could humanly think of.

Best,
Andy
"Have a great day and if its not, change it"

DirkB · ‎2022-11-25

Just a casual, maybe silly question in case DHCP is involved (I don't know 😉): are kernel routes activated? Kernel routes are not activated by default.

Is ping activated for the default route? If so, I think the route is deleted from the table by design if the ping to the next hop fails (even temporarily), and re-added when the ping succeeds ... but it would be unclear to me how the static entries would be accessed. Perhaps with a save config (along with the policies) ... what are the timestamps of the configs (can you track the order)? Perhaps it is just a coincidence in time.

the_rock · ‎2022-11-25

I don't personally think that would have anything to do with it, regardless of the ping failing. Let us see what TAC says next, because we absolutely have to give the customer a reason; this cannot happen again, especially given that it caused an Internet outage.

Best,
Andy
"Have a great day and if its not, change it"

BikeMan · ‎2022-11-25

Hi,

You could check /var/log/routed.log. There may be more info in it.

Rgds,

the_rock · ‎2022-11-25

OK, that's a good idea, ty. I will check that in a couple of hours and report back.

Cheers.

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-11-25

I attached a file with the relevant times when it happened and messages from routed_messages and routed.log. Not sure if they matter; I have seen the same messages for the last year, so it is hard to believe they are relevant, but who knows : - )

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-12-06

In case anyone ever has this issue, please be mindful that this happened AFTER we upgraded the customer from R80.40 to R81.10, and we found out with help from TAC escalations that it is caused by ISP redundancy. So it appears something in R81.10 is different from R80.40 here; what exactly, no clue. I also replicated this in my lab. I will update once I find out 100% what exactly is causing the behavior, as it has happened twice already.

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-12-07

For anyone who has ISP redundancy: TAC gave the change below once the issue was replicated, so you would have to do this. What exactly it does, I am not 100% sure, as I never got an answer, though the comment in the script says that uncommenting the line disables the script. I will definitely get an answer as to the EXACT purpose of this.

Please uncomment the line "exit 1" in the $FWDIR/bin/cpisp_update file on the gateway:

# To disable the script uncomment the following line.
# exit 1

set MISP_MAX_ISPS = 10
set ISP_STATUS_FILE = "$FWDIR/conf/cpispstatus.conf"
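To confirm which state a given copy of the script is in, the excerpt above can be checked mechanically. This is a rough sketch that relies only on the comment text shown in the excerpt; `script_disabled` and `MARKER` are hypothetical names, and the rest of the real script is not assumed to look any particular way.

```python
from typing import Optional

# Comment text copied from the excerpt of cpisp_update quoted above.
MARKER = "To disable the script uncomment the following line"


def script_disabled(script_text: str) -> Optional[bool]:
    """Report whether the `exit 1` line after the marker comment is active.

    Returns True if it is uncommented (script disabled), False if it is
    still commented out (script enabled), and None if the marker or the
    expected line cannot be found.
    """
    lines = script_text.splitlines()
    for index, line in enumerate(lines):
        if MARKER not in line:
            continue
        for following in lines[index + 1:]:
            stripped = following.strip()
            if not stripped:
                continue  # skip blank lines after the marker
            if stripped == "exit 1":
                return True
            if stripped.lstrip("#").strip() == "exit 1":
                return False
            return None  # something unexpected follows the marker
    return None
```

Running it on the script text before and after an edit gives a quick check that only the intended line changed state.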

Best,
Andy
"Have a great day and if its not, change it"

Blason_R · ‎2022-12-07

That's a wonderful finding. Let me replicate that in my scenario and test it out.

Thanks and Regards,
Blason R
CCSA,CCSE,CCCS

the_rock · ‎2022-12-07

I hope you never have the same problem, as it's really frustrating. But now that we know what was causing it, it's an easy fix. I will definitely make sure to get an explanation of whether the upgrade caused this, as it never happened on R80.40.

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-12-11

Hey @Blason_R. I'm working with the Israel folks offline on this issue, as I was advised that this TAC suggestion is not really a good one: it disables the ISPR script, as I suspected. I actually found what I believe is a pretty good workaround, which is to change every line in the cpisp_update script that contains the words static-route from off to on. I tested it in the lab with the line # exit 1 in place (meaning the script is NOT disabled) and it worked fine.

Here is what I mean by this. Say you have ISPR enabled, you edit the first ISP link in the gateway properties and give it a bogus DG of 9.10.11.12, while, for argument's sake, the proper DG is 182.183.184.50. If you do this and push policy, the Internet will still work, as Gaia will show 2 entries for the DG after the policy push: the right one, and the bogus one below it. Now, obviously no one in their right mind would put a bogus IP for the default route, but I simply wanted to test it in the lab to make sure the connection would still work, and it did.

The customer obviously has the right DG set in the gateway properties for ISPR. I am waiting to confirm whether what I modified in the script is even supported, but it definitely appears to work.

Best,
Andy
"Have a great day and if its not, change it"

BikeMan · ‎2022-12-07

Hi,

Could you share the versions you are using on the Smart-1 and on the firewall?

I am managing about 180 modules all over the world, and the new company standard is to use ISP redundancy everywhere. We are currently running R80.40 and R80.20; an upgrade to R81.10 is planned for next year...

Thanks,

the_rock · ‎2022-12-08

I hate to say this, but unless CP has actually fixed this permanently, you WILL most likely encounter this problem. Mgmt is S1C (Smart-1 Cloud) and the gateways are 6400s, all running R81.10. Now, keep in mind, when S1C was on R81.10 (upgrades are managed and scheduled by CP) and the gateways were on R80.40, this NEVER happened. Once the gateways were upgraded to R81.10, that's when the issue occurred, the 1st time maybe a week later and the 2nd time 2 weeks after that. This is why I'm pressing TAC to explain whether this is an issue in R81+ and how it can be avoided without having to modify that file.

Another interesting thing I will also try to confirm is whether this ONLY affects the HA ISP config or load sharing as well. As soon as I get the info, I will update here.

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-12-08

Also @BikeMan , to add to my last comment, I also saw something interesting below:

https://sc1.checkpoint.com/documents/R81.10/WebAdminGuides/EN/CP_R81.10_Quantum_SecurityGateway_Guid...

See this part:

The ISP Redundancy Script

When the Security Gateway starts, or an ISP link state changes, the $FWDIR/bin/cpisp_update script runs on the Security Gateway.

This script changes the default route of the Security Gateway.

Warning - We do not recommend that you make any changes to this script.

***************************************

Now, here is my own logic. NEITHER of those scenarios applied to the customer. Obviously, once the gateways were upgraded, they had to be rebooted, so according to the document, it would imply that the default route would change every time the firewall is rebooted?? That makes no sense. Also, their primary ISP link never failed either. Anyway, let's see what TAC says.

Best,
Andy
"Have a great day and if its not, change it"

Ilya_Yusupov · ‎2022-12-09

Hi @the_rock ,

I reviewed the case and I am sorry, but the provided solution is wrong here: uncommenting that line disables ISPR, which means ISP failover will not work.

In this case the customer is running BGP, which is not supported with ISPR. Even if it worked for the customer on the previous version, we can't guarantee that it worked correctly, as dynamic routing and ISPR impact each other's routing.

There is some discussion with R&D about the case, but I am not sure there will be a solution.

I will keep you updated once we have some conclusions.

Thanks,

Ilya

the_rock · ‎2022-12-09

Thanks @Ilya_Yusupov , very grateful for your update. So, just wondering, I could not find any documents or articles stating that ISPR is not supported with BGP. Would you be able to provide that please?

Thanks in advance.

Best,
Andy
"Have a great day and if its not, change it"

Ilya_Yusupov · ‎2022-12-09

Sure, I will share it on Sunday, as I am not in front of a laptop. It's not only BGP, but dynamic routing in general with ISPR.

the_rock · ‎2022-12-09

No rush. I will ask via the TAC case.

Best,
Andy
"Have a great day and if its not, change it"

the_rock · ‎2022-12-09

OK, TAC provided the link, but here is what makes no sense to me personally.

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

this part:

PMTR-68991: ISP Redundancy is not supported if Dynamic Routing is configured (because the ISP Redundancy feature must create a static default route that overrides the default route created by dynamic routing).

Well, the customer's DG IS INDEED derived from ISPR and NOT BGP, so I can't connect the dots here as to why this would even apply to them.

Best,
Andy
"Have a great day and if its not, change it"

Ilya_Yusupov · ‎2022-12-09

You are right, but as I mentioned, dynamic routing and ISPR may impact each other's routes.

To answer your question I need to understand exactly what happened; for that, R&D are working with TAC to get more data.

the_rock · ‎2022-12-09

Just wondering, I know this may not be advisable, but would it make sense to actually change the line below to on instead of off in the cpisp_update script, as that is what got invoked both times the issue happened...

clish -c "set static-route default off"

I mean, I know it says it's NOT recommended to change the script, but I'm not sure what else we can do now...
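Before touching anything, it may help to list which lines in the script would actually turn a static route off. This is a read-only sketch, assuming a copy of the script text is available; `static_route_off_lines` is a hypothetical helper that only reports lines and never modifies the script.

```python
import re
from typing import List, Tuple

# Matches an uncommented command that mentions static-route and ends in "off",
# such as the clish line quoted above.
ROUTE_OFF = re.compile(r"\bstatic-route\b.*\boff\b")


def static_route_off_lines(script_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, command) for active lines that set a static route off."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        command = line.strip()
        if command.startswith("#"):
            continue  # commented-out lines are never executed
        if ROUTE_OFF.search(command):
            hits.append((number, command))
    return hits
```

The output is a review list only. Whether changing any of those lines is supported is exactly the open question in this thread.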

Best,
Andy
"Have a great day and if its not, change it"

Ilya_Yusupov · ‎2022-12-11

@the_rock ,

I will contact you offline to continue the discussion about this case, and we will come back to this thread once we have full conclusions.

Thanks,

Ilya

the_rock · ‎2022-12-11

Responded to your email, very grateful for doing so @Ilya_Yusupov

Toda!

Cheers.

Best,
Andy
"Have a great day and if its not, change it"

_Val_ · ‎2022-12-11

What is not clear here? You cannot use ISP Redundancy and Dynamic routing on the same cluster. It is either or.

the_rock · ‎2022-12-12

If you read my exact response, this is the part that's confusing -> because the ISP Redundancy feature must create a static default route that overrides the default route created by dynamic routing

Logically, in the client's case, ISPR indeed DOES create a default route that actually overrides anything created by BGP.

I honestly wish CP documentation were clearer and more concise in a lot of cases, because there are many cases where people are left wondering about the statements provided.

Best,
Andy
"Have a great day and if its not, change it"

Ilya_Yusupov · ‎2022-12-13

Hi all,

First of all, @the_rock, thank you for raising this case and for your patience with my questions ;).

I would like to update the thread: we do indeed have a race issue with the ISPR script, where while the script is running we may lose the clish lock and then cannot set the route back during the policy push.

I was able to replicate the issue in my lab by running clish -c commands in a loop during the policy push process.
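The failure mode described here can be pictured with a toy model. The sketch below is not Gaia's actual locking implementation: `ToyConfigDb` and `run_update` are invented for illustration, and the assumption that a competing session simply takes the lock away is a simplification. It only shows why a remove-then-restore update leaves the route missing if the lock is lost between the two steps.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ToyConfigDb:
    """Toy stand-in for a config database guarded by a single writer lock."""

    routes: Set[str] = field(default_factory=lambda: {"default"})
    lock_holder: Optional[str] = None

    def take_lock(self, who: str) -> None:
        # Assumption: a new session can take the lock, evicting the old holder.
        self.lock_holder = who

    def set_route(self, who: str, name: str, present: bool) -> bool:
        # Writes are rejected unless the caller currently holds the lock.
        if self.lock_holder != who:
            return False
        if present:
            self.routes.add(name)
        else:
            self.routes.discard(name)
        return True


def run_update(db: ToyConfigDb, competing_session: bool) -> bool:
    """Remove and then restore the default route, as a two-step update would."""
    db.take_lock("isp_script")
    db.set_route("isp_script", "default", present=False)
    if competing_session:
        # Another session grabs the lock between the two steps.
        db.take_lock("other_clish")
    # Returns False when the restore step is rejected.
    return db.set_route("isp_script", "default", present=True)
```

With `competing_session=False` the route is restored; with `True` the restore is rejected and `"default"` stays out of `db.routes`, which mirrors the outage seen after the policy push.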

We are working on a fix and I hope it will be released soon in the Jumbos.

Regarding Dynamic Routing and ISPR support: as long as the DR protocol is not publishing a default GW, there should be no problem running Dynamic Routing and ISPR together.

I'm running such a configuration in my lab and I don't see any caveats.
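The condition above (no default gateway coming from the dynamic routing protocol) can be expressed as a small check. This is a sketch under assumptions: the list of dynamically learned prefixes has to be collected from the gateway by some other means, and `dynamic_default_routes` is a hypothetical helper name.

```python
import ipaddress
from typing import Iterable, List


def dynamic_default_routes(prefixes: Iterable[str]) -> List[str]:
    """Return the prefixes that are default routes (0.0.0.0/0 or ::/0).

    An empty result means dynamic routing is not announcing a default
    gateway, which is the case described as safe to combine with ISPR.
    """
    defaults = []
    for prefix in prefixes:
        network = ipaddress.ip_network(prefix, strict=False)
        if network.prefixlen == 0:
            defaults.append(prefix)
    return defaults
```

A non-empty result would point to the overlap with the static default route that ISPR manages.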

Thanks,

Ilya

the_rock · ‎2022-12-13

Thanks a lot @Ilya_Yusupov, my colleague and I are very grateful for all your help with this issue. I told the customer's IT manager that we disabled the ISPR script for now and they understood. Even though it's not great to leave it like that, it fixed the policy push issues, so we have to, for now at least. This is why I commented that I wished the documentation were more precise, because in all honesty, the way it is written makes it a bit unclear whether this would be fully supported or not.

If you could confirm with someone from R&D whether what I did in my lab (which does work) is supported, it would be something we can do for the customer.

So, again, just as a reminder: in the cpisp_update file on my lab gateway, I put the # back in front of the exit 1 line, and at the end of every line that contains the words static-route, I put the word on instead of off (which is there by default), and that worked just fine.

Tested in an R81.10 lab with a single gateway, R81.20 with a single gateway AND R81.10 with an HA cluster.

Logically, to me, it makes no difference whether it is a cluster or not, and I'm 100% sure about that based on my testing.

Again, always GRATEFUL for your help!

👌👌👌

Andy

Best,
Andy
"Have a great day and if its not, change it"

Ilya_Yusupov · ‎2022-12-13

@the_rock ,

We are working on a fix which will address the issue, so you will not need any workaround. It's not recommended to change the script manually.

Thanks,

Ilya
