State: Connection with 'fw-vsx-n01' is lost
Hello,
One of the VSs (out of 4) started reporting today:
State: Connection with 'fw-vsx-n01' is lost
There are no issues reported from the SSH session -- this node is active and handles the load.
cphaprob state, cphaprob -a if, and cphaprob -ia list revealed nothing wrong (the checks are sketched below).
I tried rebooting the management, but it didn't help.
Management: R81.20 Take 65
CP VSX: R81.20 Take 90
Thank you.
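For reference, this is roughly how the checks above were run from expert mode on the affected member (a minimal sketch; the VS ID 5 used with vsenv is an assumption based on the output later in this thread):

# switch into the context of the affected VS (ID 5 is an assumption)
vsenv 5
# cluster state, interface and device checks for this VS
cphaprob state
cphaprob -a if
cphaprob -ia list
# back to VS0 for the box-wide view
vsenv 0
vsx stat -v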
What does 'vsx stat -v' show?
Looks healthy too.
The one complaining about one of the nodes being down is VS5 (fw-vs-cloud):
VSX Gateway Status
==================
Name: fw-vsx-ext-n01
Access Control Policy: fw-vsx-external-vsx
Installed at: 14Jan2025 15:27:58
Threat Prevention Policy: <No Policy>
SIC Status: Trust
Number of Virtual Systems allowed by license: 6
Virtual Systems [active / configured]: 3 / 3
Virtual Routers and Switches [active / configured]: 2 / 2
Total connections [current / limit]: 32393 / 96500
Virtual Devices Status
======================
ID | Type & Name | Access Control Policy | Installed at | Threat Prevention Policy | SIC Stat
-----+-------------------------+-----------------------+-----------------+--------------------------+---------
1 | S fw-vs-test | fw-vs-test-policy | 23Jan2025 16:18 | <No Policy> | Trust
2 | W vsw-ext | <Not Applicable> | | <Not Applicable> | Trust
3 | W vsw-transit | <Not Applicable> | | <Not Applicable> | Trust
4 | S fw-vs-ext | fw-vs-ext-policy | 27Jan2025 11:47 | <No Policy> | Trust
5 | S fw-vs-cloud | fw-vs-cloud-policy | 24Jan2025 11:00 | <No Policy> | Trust
Type: S - Virtual System, B - Virtual System in Bridge mode,
R - Virtual Router, W - Virtual Switch.
Does it happen periodically? Maybe the cpd process crashes.
Have a look at this: https://support.checkpoint.com/results/sk/sk101484
What does cpwd.elg say? (A quick way to check is sketched below.)
Akos
\m/_(>_<)_\m/
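A quick way to grep cpwd.elg for those keep-alive errors (a minimal sketch; it assumes $CPDIR resolves to the right per-VS context after vsenv, which is the usual VSX behaviour):

# cpwd.elg lives under $CPDIR/log
grep -i "did not send keep-alive" $CPDIR/log/cpwd.elg | tail -20
# list the daemons WatchDog monitors, their PIDs and restart counts
cpwd_admin list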
Hi @AkosBakos
No, this is the first time it happened.
I checked cpwd.elg on the affected node, and although I see 'did not send keep-alive message for 1 number of times' errors, none of them are related to CPD; they come from MSGD instead:
[cpWatchDog 19785 4133372096]@fw-vsx-n01[27 Jan 11:47:49] [ERROR] MSGD (pid=30558) did not send keep-alive message for 1 number of times
[cpWatchDog 19785 4133372096]@fw-vsx-n01[27 Jan 11:49:34] [ERROR] MSGD (pid=30661) did not send keep-alive message for 1 number of times
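To see which daemons are affected and whether WatchDog actually restarted them, something like this can help (a sketch; the awk field position assumes the exact log format shown above):

# tally keep-alive errors per daemon name ($7 matches the format above)
grep "did not send keep-alive" $CPDIR/log/cpwd.elg | awk '{print $7}' | sort | uniq -c
# compare against WatchDog's view of each daemon (PID, state, restart count)
cpwd_admin list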
sk101484 (https://support.checkpoint.com/results/sk/sk101484) says:
- In some scenarios cpwd.elg shows repeatedly:
[ERROR] CPD (pid_of_cpd) did not send keep-alive message for x number of times
\m/_(>_<)_\m/
Hi @AkosBakos
Yes, but the SK mentions that it's the CPD daemon reporting the error. In my case it's MSGD -- do you think it's the same?
There are no traces of core dumps in the logs and no high CPU observed. SIC is also fine. I haven't tried to push the policy on the affected VS, though.
Because of the lack of information, I can't say whether this is the same issue or not, but some of the symptoms match.
In this case, the best thing you can do is ask TAC about this issue.
Akos
\m/_(>_<)_\m/
Concur. Anything of note flagged in HCP?
Otherwise some suggestions:
- Attempt policy install
- Restart cpd per sk97638 (sketched after this list)
- Failover / Reboot gateways
- Patch with latest recommended JHF
- Open a TAC case (attach HCP & CPinfo)
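The cpd restart in the affected VS context would look roughly like this (a sketch; the cpwd_admin stop/start invocations follow sk97638, and VS ID 5 is an assumption):

# run in the context of the affected VS (ID 5 is an assumption)
vsenv 5
# stop cpd gracefully via WatchDog (per sk97638)
cpwd_admin stop -name CPD -path "$CPDIR/bin/cpd_admin" -command "cpd_admin stop"
# verify CPD is no longer listed as running before restarting it
cpwd_admin list | grep CPD
# start cpd again under WatchDog supervision (per sk97638)
cpwd_admin start -name CPD -path "$CPDIR/bin/cpd" -command "cpd"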
Thank you for the hint @AkosBakos !
With the help of our CP partner, the issue has been identified: CPD had crashed and had been failing to restart ever since. No CPEPS database corruption was observed, so killing the stale process and stopping/starting CPD manually in the context of the affected VS fixed the issue.
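For anyone hitting the same symptom, the manual recovery described above would look roughly like this (a minimal sketch, not the exact commands used; the placeholder PID must come from the ps output, and the cpwd_admin syntax follows sk97638):

# in the context of the affected VS (ID 5 is an assumption)
vsenv 5
# find the stale cpd process in this context
ps auxw | grep cpd | grep -v grep
# kill the stale instance (<stale_cpd_pid> is a placeholder)
kill -9 <stale_cpd_pid>
# then stop/start CPD cleanly via WatchDog (per sk97638)
cpwd_admin stop -name CPD -path "$CPDIR/bin/cpd_admin" -command "cpd_admin stop"
cpwd_admin start -name CPD -path "$CPDIR/bin/cpd" -command "cpd"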