Re: Starting cpri_d: FAILED on boot

Adam276 · 2025-03-04

I am having a problem moving my old secondary firewall back into place after a failed hardware/software upgrade of the secondary. I tried moving a cluster to R81.20 Take 90 (secondary only first) and hit a reboot issue when pushing policy (I will look into that later). Everything is still working fine on the old R80.10 primary, since I was only getting the secondary ready to switch over to when the reboot problem appeared. FYI, the old and new hardware are Dell servers/NICs that are on the Check Point Open Server HCL.

My main problem right now, though, is getting the old secondary (R80.10) functional again. The cpri_d process fails on boot. Because I had to change SIC in management for the new secondary hardware/software upgrade, I reset SIC on the old secondary gateway and changed the version back to R80.10 in management. The SIC test succeeds from SmartConsole, and I can push policy with no errors on the push itself. The secondary still shows red, though.

This is on bootup of the system...

Starting the system...

Starting cpri_d: FAILED

The GUI shows green status for the primary (that system wasn't touched), but the secondary shows a red alert in the SmartConsole Gateways view. I have tried rebooting, pushing policy again, etc., with no change.

SmartConsole shows this alert for the gateway:
'Firewall' is not responding. Verify that 'Firewall' is installed on the gateway. If 'Firewall' should not be installed verify that it is not selected in the Products list of the gateway (SmartConsole > Security Gateway > General Properties > Software Blades List).

Trying to start cprid manually gives this error...

[Expert@Ode-Fw21:0]# $CPDIR/bin/cpridstart
DIAGDIR: Undefined variable.

Looking at the /opt/CPsuite-R80/fw1/log/lpd.elg log file I see this...

[lpd 2962 4158621392]@Ode-Fw2[3 Mar 9:24:56] [init][INFO][logger.cpp:62 : initLogger] ####################################################
[lpd 2962 4158621392]@Ode-Fw2[3 Mar 9:24:56] [init][INFO][logger.cpp:63 : initLogger] welcome to lpd log!
[lpd 2962 4158621392]@Ode-Fw2[3 Mar 9:24:56] [main][ERROR][daemon_main.cpp:42 : main] LPDException: Failed to fetch DIAGDIR environment variable
[lpd 2962 4158621392]@Ode-Fw2[3 Mar 9:24:56] [main][INFO][daemon_main.cpp:59 : main] Exit code: 3
[lpd 6621 4157658832]@Ode-Fw2[3 Mar 9:25:55] [init][INFO][logger.cpp:62 : initLogger] ####################################################
[lpd 6621 4157658832]@Ode-Fw2[3 Mar 9:25:55] [init][INFO][logger.cpp:63 : initLogger] welcome to lpd log!
[lpd 6621 4157658832]@Ode-Fw2[3 Mar 9:25:55] [main][ERROR][daemon_main.cpp:42 : main] LPDException: Failed to fetch DIAGDIR environment variable
[lpd 6621 4157658832]@Ode-Fw2[3 Mar 9:25:55] [main][INFO][daemon_main.cpp:59 : main] Exit code: 3
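Since the same lpd.elg entries can show up on more than one member, a quick way to compare members is to summarize the ERROR lines in each file. This is a minimal Python sketch, assuming the line layout is exactly what the excerpt above shows; the regex is inferred from those few lines only, not from any Check Point format specification, and `parse_line` / `error_summary` are hypothetical helper names.

```python
import re
import sys
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed layout, inferred only from the lpd.elg excerpt in this thread:
# [proc pid tid]@host[timestamp] [component][LEVEL][source] message
LINE_RE = re.compile(
    r"^\[(?P<proc>\S+) (?P<pid>\d+) (?P<tid>\d+)\]"
    r"@(?P<host>[^\[]+)\[(?P<ts>[^\]]+)\] "
    r"\[(?P<component>[^\]]+)\]\[(?P<level>[A-Z]+)\]\[(?P<source>[^\]]+)\] "
    r"(?P<message>.*)$"
)


@dataclass
class LogEntry:
    pid: int
    host: str
    timestamp: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(int(m["pid"]), m["host"].strip(), m["ts"], m["level"], m["message"])


def error_summary(lines: Iterable[str]) -> Counter:
    """Count how many times each distinct ERROR message appears."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "ERROR":
            counts[entry.message] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 lpd_errors.py /path/to/lpd.elg
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for message, n in error_summary(f).most_common():
            print(f"{n:4d}  {message}")
```

Running it against a copy of lpd.elg from each cluster member gives two short summaries that are easy to compare side by side. Lines that don't match the assumed layout are skipped rather than treated as errors.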

cphaprob shows the secondary correctly in Backup state. The primary is Active, of course.

Has anyone experienced this?

No rule or object changes were made besides resetting SIC for the secondary member and changing the version on the cluster object back to the old R80.10 version.

the_rock · 2025-03-04

Nothing comes up on the support site about it, at least that I can find. Have you asked TAC? One suggestion: maybe console into it, run the halt command, unplug the power cord, wait 30 seconds, plug it back in and see if that helps.

Andy

Adam276 · 2025-03-04

I see the same entries in lpd.elg on the other old firewall, so I assume they are not related.

I also don't see a cprid process running on the old primary, so that may not be related either. The primary might give the same error on reboot, which would also point to it being unrelated. Not sure. I don't want to touch the primary until I have a functional secondary, though.

The main issue seems to be that SmartConsole shows the alert below for the second member.

'Firewall' is not responding. Verify that 'Firewall' is installed on the gateway. If 'Firewall' should not be installed verify that it is not selected in the Products list of the gateway (SmartConsole > Security Gateway > General Properties > Software Blades List).

I can't find anything about that yet either. I might revert the old secondary to a snapshot from yesterday that I took as a precaution. The only changes on it, though, were the SIC reset and the version change back and forth in management, so a revert seems unlikely to change anything. I will still have to redo SIC to get it going, since SIC was recently reset in management for that cluster member.

The strange thing is that it complains about the Firewall blade, yet cphaprob stat shows the member in Backup state, which means the process checks behind the HA state are at least passing.

the_rock · 2025-03-04

I actually remember that EXACT error, but for Identity Awareness rather than the Firewall blade. In the lab it disappeared when I rebooted the firewall, but I never understood why it came up in the first place, and I could not really open a TAC case since it's a lab.

Andy

Adam276 · 2025-03-05

I set up a bare-bones lab with a new management server and cluster gateway VMs (I did not import my management data), using a simple firewall policy (no VPN), and reproduced the exact problem: one of the gateways showed the 'Firewall is not responding' error in the GUI. The steps: change the R80.10 gateway cluster object version to R81.20, attempt the upgrade on the secondary, change the version back to R80.10, and put the old secondary gateway back in place. One of the two gateways then shows 'Firewall is not responding' in the SmartConsole gateway status (when you click on the gateway in the Gateways view). In my lab it was the firewall I hadn't touched, so it seems the change to R81.20 modifies something in the parent cluster object that isn't reverted when you change the version back to R80.10, and that causes this.

I did not attempt a snapshot revert of the firewall. Since all I did on the old gateway was reset SIC and change the version in management, the 'Firewall is not responding' alert pushed me to look at management. A session history revert of the management changes, back to before the cluster version was changed from R80.10 to R81.20, fixed the issue. No policy push was needed; the management alert for the gateway went green about a minute later. The result was the same in the lab and on the cluster that experienced the issue.

Now I get to move on to the other issue: pushing policy for the first time to the new R81.20 secondary gateway (the one I put in place for the R80.10 to R81.20 migration) caused loss of connectivity to that gateway and multiple reboots. I didn't get to the primary because of that, so there is more work to do there. I won't post about that in this thread; I will create another thread if needed. That was R81.20 Take 90, so I might install the latest HFA and try again, or wait for the next official HFA because of reported IKE issues, which would be critical for me.

EDIT: Given the info above, the errors that come up on boot and in that log were not related to the SmartConsole 'Firewall is not responding' alert. The firewalls appear to be working just fine back on R80.10 after the session history revert; they have likely had those errors for a while and probably show them on the primary as well.

Adam276 · 2025-03-05

I changed the title to reflect what was fixed in this thread.

PhoneBoy · 2025-03-05

How did you perform the migration to R81.20 from R80.10?
The supported method requires going through an intermediate version (R80.40).

Adam276 · 2025-03-05

It was not a 'migration' per se. Management was already on R81.20; it was upgraded from R80.40 a while ago. For the R80.10 gateway cluster, it was a hardware and version upgrade at the same time: a fresh install of R81.20 on new hardware, set up to match the original hardware's Gaia OS config. I did a lab test of the gateway cluster upgrade, going from VMs on R80.10 to new VMs on R81.20 (basic access rules), and that worked perfectly in the lab. In production, though, the secondary didn't work as mentioned, so I had to revert, and that is what triggered this thread's issue, which I solved by reverting the management session history. I am pretty sure the 'Firewall is not responding' issue is not related to the secondary failure and reboot issues. It seemed to appear only when going back to the old version on the gateways after changing the version to R81.20 and then back to R80.10.

My other post describes the steps I followed in the VM lab for the gateway cluster upgrade; in production I only got halfway through before needing to revert to the previous version.
https://community.checkpoint.com/t5/Security-Gateways/Firewall-is-not-responding-SmartConsole-alert-...

Adam276 · 2025-03-24

Is it a hard requirement to transition a gateway cluster through R80.40 if you don't need states to sync during the transition? If you are installing from scratch on new hardware and doing the 'Minimum Downtime' upgrade (MVC disabled), which doesn't synchronize any states, I would think you could still go directly from R80.10 to R81.20 for a gateway cluster. It would be like installing an R81.20 gateway cluster from scratch. This assumes the READY state works properly for the versions involved. You just run cpstop on the primary, and the new secondary takes over without trying to synchronize or do anything with the old primary (Minimum Downtime, no state sync). After testing with the new secondary active, you could then put the new primary in place.

If you are upgrading the gateway in place using CPUSE, or if you need states to sync during the transition to the new firewalls and hardware, then going to R80.40 first would be required for the gateway. But a minimal downtime move to new hardware freshly installed with R81.20 seems to work, in a lab anyway.

For management, I would assume you must go through R80.40 first because of changes to objects, formats, etc. Our management is already on R81.20, though (it was R80.40 before being migrated to R81.20 with migrate_server).

This assumes that an R81.20 management doesn't, for any reason, require you to transition an R80.10 gateway object version through R80.40 first (because that does something to the object that going directly to R81.20 doesn't). My lab testing was successful with a simple network policy, doing a non-MVC Minimal Downtime upgrade from R80.10 to R81.20 with new hardware and a fresh install. That isn't a robust test of all functionality, of course.

Are you aware of problems doing that with a fresh install and non-MVC Minimal Downtime upgrade if you don't need states to sync?


'Firewall' is not responding SmartConsole alert after changing back to previous cluster version