No interactive SSH/HTTPS access to the firewall

S_K_S · ‎2021-01-27

We have recently installed a couple of 6200 appliances as an R80.40 VSX cluster. They haven't entered production yet and have only a very basic configuration: IP addresses, some SNMP, NTP and so on. The firewalls are properly licensed, with 3 virtual firewalls configured (blank policies at the moment, except for VS0, which has a few access-related rules). Until yesterday both were accessible over SSH and HTTPS without issues, and the Smart Center could communicate with them and push database and policy. For whatever reason, the first node no longer is. If I SSH to it, the connection is established (so Layer 3 is fine) and I can enter the username and password, but after pressing Enter no prompt is displayed. I can keep pressing Enter and the cursor moves one row down each time, but that's all. Text can be typed, but if you try to execute a command, nothing happens. Basically it's like typing in Notepad. Likewise, I can connect to the Gaia portal via HTTPS and enter the username and password, but once I try to log in, the browser just keeps trying indefinitely, without even timing out. In the Smart Console the first node shows as unreachable (red X). The best part is that it stays like that even after the on-site people power-cycled the appliance. Has anyone encountered anything like this?

Chris_Atkinson · ‎2021-01-27

Something seems incomplete with the installation / provisioning of this cluster. Are you in contact with TAC or a partner?

SSH login delays can be caused by missing or unreachable DNS servers, but there is likely more to it in this instance.

Note also that after setup, VSX gateways won't have an accessible Web UI for admin functions (only SSH, console, LOM).

CCSM R77/R80/ELITE

S_K_S · ‎2021-01-27

We have applied some DNS configuration to both cluster members (identical). The second one works without issues. I would expect to have some delays if DNS fails to resolve something but not a full inability to interact with the devices.

HTTPS access to the Gaia portal works on VSX, just with very few options like updating the Deployment Agent and installing/uninstalling hotfixes. On the second node that page is accessible without issues.

The provisioning process went like this:

1. We received the appliances;

2. Used LOM to apply initial configuration (IP addresses on the interfaces, routing);

3. Used HTTPS to complete the setup on the Gaia Portal;

4. Enabled HA + VSLS via cpconfig;

5. Created a VSX cluster on the Smart Console;

6. Installed initial policies on VS0 on both members;

7. Created virtual switch and virtual firewalls + policies for the latter;

8. Installed the virtual firewall policies.

All steps above completed without issues and for several days the problematic node was accessible.

Bob_Zimmerman · ‎2021-01-27

Does VS0 have any cluster interfaces?

Are you able to get access via the serial console or LOM (VGA console redirection)? Or do they show the same problem?

S_K_S · ‎2021-01-27

No cluster interfaces on VS0 (this option is not even configurable on VSX if I'm not mistaken?).

Currently we don't have remote console access because the console cable could not be attached (no free ports on the respective console server or something like that), and the LOM is unreachable because we couldn't get a routable IP address for it. For the initial configuration we used a remote session to a laptop connected to the LOM. We are now trying to arrange for one of the console ports used by another firewall to be "borrowed" for the problematic one so we can troubleshoot that way. I have a gut feeling that console access won't work either, though...

Bob_Zimmerman · ‎2021-01-28

VS0 can absolutely have clustered interfaces. The only technical distinction between VS0 and any other VS is that traffic which the member itself initiates uses the routing table in VS0. All other features are available and identical on all VSs. It isn't advisable to put through-traffic on VS0 (just from a clarity-of-purpose standpoint), but it is definitely possible.

Can you try connecting with SSH with a few added -v flags? It shouldn't prompt you for the password until you've successfully negotiated SSH and the server has sent back a password or keyboard-interactive authentication method, but it would be good to have hard proof of whether the negotiation is working or not.

S_K_S · ‎2021-01-29

It doesn't look like there is a problem with routing or the network layer generally - the two cluster members are pingable separately, no packet loss, different TTLs or anything. I've tried to ssh -v but the result is the same.

Something new from today is that the second node has started behaving exactly like the first one. It's not completely inoperable yet, but it looks like it will be shortly... No idea which part of the configuration we have entered might have caused this (if that's the case at all). We'll consider resetting to factory defaults if nothing else comes to mind.

Bob_Zimmerman · ‎2021-01-29

Successful ping and successful TCP connections but unsuccessful protocol negotiations is a symptom of drive failure. Once a system is booted, the network kernel is wired in RAM and can't be swapped out. The network kernel handles TCP connection establishment and ICMP. Once a connection is established, it asks the drive for a copy of the daemon listening on the port so it can hand the connection over. In 'ssh -vv' that would show as a successful connection and an outgoing negotiation (you should see "Local version string"), but no response.
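
A rough way to test this hypothesis from a client, without reading through 'ssh -vv' output, is to open a TCP connection to port 22 and wait for the server's identification line ("SSH-2.0-..."), which an SSH server normally sends right after accepting the connection. The sketch below is Python; the function name probe_ssh_banner and the three stage labels are made up for illustration, and it is not a Check Point tool.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Report how far a connection to an SSH port gets.

    Returns (stage, detail), where stage is one of:
      "tcp_failed" - the TCP handshake itself failed
      "no_banner"  - TCP connected, but no identification line arrived
      "banner"     - the server sent its "SSH-..." identification line
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return ("tcp_failed", str(exc))

    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            # Handshake completed, but nothing answered within the timeout.
            return ("no_banner", None)
        except OSError as exc:
            return ("no_banner", str(exc))

    if not data:
        return ("no_banner", "connection closed by peer")
    return ("banner", data.decode("ascii", errors="replace").strip())
```

Calling probe_ssh_banner with the member's management address (a placeholder you would fill in) and getting "no_banner" back would match the pattern described above: the kernel completes the handshake, but sshd never answers.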

Chris_Atkinson · ‎2021-01-29

To clarify: is it a 6200 or a 16200, and is JHF T91 applied?


S_K_S · ‎2021-01-29

It's 6200 with JHF take 91.

Alex- · ‎2021-01-29

Is there something to know about 16200 series and Take 91 when using VSX?

Chris_Atkinson · ‎2021-01-29

Not really, it's simply a more capable platform, with less potential for oversubscription than the smaller 6200, performance-wise.


S_K_S · ‎2021-01-30

Are there any concerns about the 6200 appliances with VSX and/or that JHF take? We have ordered 6 of them (3 clusters) for 3 sites, and the plan is to deploy 3 virtual firewalls per site. I believe the customer discussed this with Check Point pre-sales and got a recommendation for this series from them, given that the sites in question are small, the expected traffic volume is low, and VSX is deployed mainly because it makes the implementation of the security design easier and more efficient than a single gateway. For the bigger sites, bigger models will be used.

Chris_Atkinson · ‎2021-01-30

No, not really relevant to your current issue, since the boxes aren't yet in production, but it's always good to confirm the JHF level etc.


S_K_S · ‎2021-02-01

OK, today we got console access to the two appliances by asking an on-site technician to re-connect some console cables. Via the console connections there was no difference compared to SSH or HTTPS - credentials could be entered but nothing after that. We asked for another power cycle of the appliances and this time they booted OK and we managed to connect to them via SSH. At the moment they are accessible.

Now, the question is - what may have caused that?

I have gone through /var/log/messages, but nothing indicating a crash seems to have been recorded. It does look like the system load increased drastically at some point, although I have no idea what the reason could be, given that the firewall is not in production and most of its interfaces are shut down. Here's one of the prevalent messages in the logs:

Jan 26 13:53:00 2021 ESAG0-6-M-AFW-01_ESAC monitord[27532]: set_led libdb_set_one failed, Error:Timeout waiting for response from database server.
Jan 26 13:53:04 2021 ESAG0-6-M-AFW-01_ESAC snmpd: Error: Timeout waiting for response from database server.

Around that time (2 PM system time) the CPD process began restarting literally every minute, according to cpwd.elg. I've run through the historical records in cpview and noticed that the CPU load went up across all cores at that time and stayed like that for a bit more than half an hour, after which it returned to almost 100% idle and stayed there until today, when we rebooted the appliances. When the CPU load dropped, however, nearly 1000 interrupts disappeared (from around 3100 to 2100) and did not reappear until we power cycled, so it looks like a number of processes got terminated and never recovered.
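
For anyone trying to pin down when the database timeouts started and which daemons were affected, here is a minimal Python sketch that scans /var/log/messages-style lines for that error. The line format is assumed from the two sample lines quoted above only, and the names LINE_RE, NEEDLE and summarize_db_timeouts are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Format assumed from the two quoted samples:
# "<Mon> <day> <HH:MM:SS> <year> <host> <daemon>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) (?P<daemon>[^\s:\[]+)(?:\[\d+\])?: (?P<msg>.*)$"
)
NEEDLE = "Timeout waiting for response from database server"

SAMPLE = [
    "Jan 26 13:53:00 2021 ESAG0-6-M-AFW-01_ESAC monitord[27532]: set_led "
    "libdb_set_one failed, Error:Timeout waiting for response from database server.",
    "Jan 26 13:53:04 2021 ESAG0-6-M-AFW-01_ESAC snmpd: Error: Timeout waiting "
    "for response from database server.",
]


def summarize_db_timeouts(lines):
    """Return (first_seen, last_seen, per_daemon_counts) for the timeout error."""
    first = last = None
    per_daemon = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or NEEDLE not in m["msg"]:
            continue
        # Collapse repeated spaces (single-digit days) before parsing.
        ts = datetime.strptime(" ".join(m["ts"].split()), "%b %d %H:%M:%S %Y")
        first = ts if first is None or ts < first else first
        last = ts if last is None or ts > last else last
        per_daemon[m["daemon"]] += 1
    return first, last, per_daemon
```

Running summarize_db_timeouts over the whole file (for example, over the lines of an open file handle) gives the first and last occurrence, which can then be lined up against the CPU spike seen in cpview.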

Any ideas?

Bob_Zimmerman · ‎2021-02-01

If the console could prompt for authentication, it probably isn't related to the storage subsystem (drive, SATA connection, drive controller).

What does $CPDIR/log/cpd.elg have to say?

Are there any core dumps in /var/log/dump/usermode?

S_K_S · ‎2021-02-01

In cpd.elg it's the same as in cpwd.elg - CPD re-initializing every minute. Here is a random sample:

==============================
Cpd Initializing
==============================

cpd_enable_epoll: Enabling epoll...
cpd_enable_epoll: epoll succesfully enabled..
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] SIC initialization started
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] cpsic_init: msg client name = cpd
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] cpsic_init: context id = 0
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] get_my_sicname_from_registry: Read the machine's sic name: CN=ESAG0-6-M-AFW-02_ESAC,O=ESAG0-F-M-MFW-001.mgmt.mmpds.esa.local.rhe7ge
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] Initialized sic infrastructure
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] SIC certificate read successfully
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] FAILED to find registry entry PROVIDER-1//CPLocalAuthDir
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] Initialized SIC authentication methods
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] cpsic_init: Failed to init message daemon
[CPD 28309 4133583168]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:40:08] CPSIC Error: Messaging mechanism failure - Could not initialize messaging daemon.
Failed to initialize SIC. Exiting ...
[CPD 28539 4133247296]@ESAG0-6-M-AFW-02_ESAC[29 Jan 20:41:08] CPD: Fri Jan 29 20:41:08 2021

Same from cpwd.elg:

[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:35:08] [INFO] Process CPD (pid = 7824) exited with exit_code -1
[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:36:08] [SUCCESS] CPD ctx=4 started successfully (pid=8037)
[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:36:08] [INFO] Process CPD (pid = 8037) exited with exit_code -1
[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:36:08] [SUCCESS] CPD ctx=2 started successfully (pid=8038)

The stream continues in both files until we rebooted the appliances today.

Nothing in /var/log/dump/usermode either.
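
To put a number on "restarting every minute", a small Python sketch like the one below could count process exits per minute from cpwd.elg. The pattern is inferred only from the four lines quoted above, so other cpwd.elg line variants are not handled, and EXIT_RE and count_exits_per_minute are illustrative names.

```python
import re
from collections import Counter

# Matches lines such as:
# "...[29 Jan 21:35:08] [INFO] Process CPD (pid = 7824) exited with exit_code -1"
EXIT_RE = re.compile(
    r"\[(?P<day>\d{1,2} \w{3}) (?P<hhmm>\d{2}:\d{2}):\d{2}\] "
    r"\[INFO\] Process (?P<name>\S+) \(pid = (?P<pid>\d+)\) "
    r"exited with exit_code (?P<code>-?\d+)"
)

SAMPLE = [
    "[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:35:08] "
    "[INFO] Process CPD (pid = 7824) exited with exit_code -1",
    "[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:36:08] "
    "[SUCCESS] CPD ctx=4 started successfully (pid=8037)",
    "[cpWatchDog 27820 4135847680]@ESAG0-6-M-AFW-02_ESAC[29 Jan 21:36:08] "
    "[INFO] Process CPD (pid = 8037) exited with exit_code -1",
]


def count_exits_per_minute(lines):
    """Count process exits keyed by (process name, 'DD Mon HH:MM')."""
    counts = Counter()
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            counts[(m["name"], f'{m["day"]} {m["hhmm"]}')] += 1
    return counts
```

Sorting the resulting counter by key shows when the restart loop began and whether it ever paused before the reboot.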

Chris_Atkinson · ‎2021-02-03

Have you had any success with TAC in investigating this issue further?

The only suggestion I have at this time is to possibly investigate the relevance of sk171753


S_K_S · ‎2021-02-08

Not much progress with the TAC case. The support contract goes through another company which raises/manages the cases so we don't have direct visibility, unfortunately. At least the issue has not re-appeared after we power cycled the appliances but the possibility of reappearance is looming over them...

I have checked sk171753 (thanks for the suggestion) but it does not seem to be relevant.

S_K_S · ‎2021-02-11

According to TAC the issue might have been caused by a corruption of the CPEPS database as per sk101484. We'll have to wait for the issue to re-occur to see if this is the case.

PGilBz · ‎2021-11-09

Hello, how did you solve this issue? We got the same behavior in a 13500 cluster...
