maller2
Participant

VSX R81.20 Policy installation fails randomly.

Hello

For the last few days we have been having an issue with policy installation on our VS1. The error message is not always the same, but lately it has been this:

" Installation failed. Reason: TCP connectivity failure ( port = 18191 )( IP = 198.18.0.20 )[ error no. 10 ]."

This VS1 is a cluster, and on the other member installation always works.
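
Since the error reports a TCP connectivity failure on port 18191, one thing that can be checked at the moment an install fails is whether a TCP connection to that port can be opened at all. Below is a minimal Python sketch, assuming it is run from a host that is expected to reach the address in the error message. `tcp_port_open` is a hypothetical helper name, not a Check Point tool, and it only checks that the TCP handshake completes; it says nothing about whether the daemon behind the port then responds.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration: it only tests the TCP handshake,
    not whether the service behind the port actually answers.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the error quoted above the call would be `tcp_port_open("198.18.0.20", 18191)`. Comparing the result for both cluster members during a failed install may help narrow down whether the problem is on the network path or on the member itself.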

In /var/log/messages we can see a lot of spike messages like this:

             spike_detective: spike info: type: cpu, cpu core: 8, top consumer: cpd,

Sep 26 02:12:55 2024 cpd: Destroying the lists of sensors
Sep 26 03:14:08 2024 cpd: Destroying the lists of sensors
Sep 26 07:14:39 2024 cpd: Destroying the lists of sensors
Sep 26 07:28:39 2024 cpd: Destroying the lists of sensors

We are going to increase the number of CoreXL CPUs assigned, but I'm not sure whether this is the root cause of the issue.

Any suggestions?

Thanks

Accepted Solutions
AkosBakos
Leader

Upgrade to the latest GA take.

----------------
\m/_(>_<)_\m/


11 Replies
Tal_Paz-Fridman
Employee

Depending on the JHF you are running it might be connected to the following issue:

Multi-Domain Management Server or Security Management Server do not respond because of a high number of CPD processes in a zombie state

https://support.checkpoint.com/results/sk/sk182370

 

maller2
Participant

Hi 

I checked; it doesn't seem to be related to defunct processes:

[Expert@mrtdca01vsxfw:1]# ps aux | grep -i cpd | grep -i defunct | wc -l
0
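
The same check can also be done on the process state rather than by grepping for the word "defunct". Here is a rough Python sketch, assuming the text passed in is the output of `ps -eo stat,comm`; `count_defunct` is a hypothetical helper name used only for illustration.

```python
def count_defunct(ps_output, name="cpd"):
    """Count zombie processes called `name` in `ps -eo stat,comm` output.

    Assumption: each line is "<STAT> <COMMAND>", and zombies have a STAT
    value starting with "Z" (they usually show as "cpd <defunct>").
    """
    count = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        stat, comm = parts
        if stat.startswith("Z") and comm.split()[0] == name:
            count += 1
    return count
```

The header line that `ps` prints is skipped naturally, because its first column ("STAT") does not start with "Z".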

Thanks
Chris_Atkinson
Employee

And which JHF is used with this deployment?

CCSM R77/R80/ELITE
maller2
Participant

This is Check Point CPinfo Build 914000250 for GAIA
[MGMT]
HOTFIX_R81_20_JUMBO_HF_MAIN Take: 41
[IDA]
No hotfixes..
[CPFC]
No hotfixes..
[FW1]
HOTFIX_R81_20_JUMBO_HF_MAIN Take: 41
HOTFIX_GOT_TPCONF_AUTOUPDATE
HOTFIX_PUBLIC_CLOUD_CA_BUNDLE_AUTOUPDATE

Chris_Atkinson
Employee

Take 41 has known memory leaks, which can present as policy install issues when memory gets low. How is the memory utilisation?

Fixes in the context of CPD are also noted multiple times in more recent takes.

CCSM R77/R80/ELITE
AkosBakos
Leader

Hi @Tal_Paz-Fridman 

When the policy install fails, is the CPD process restarting?

There are a lot of CPD-related fixes, especially after Take 41.

We ran into this in the summer:

PRJ-51068, PRHF-31283 (Security Management): In a rare scenario, the FWK and CPD processes may exit with core dumps at approximately the same time.

PRJ-47797, PRHF-29709 (VSX): A memory leak may occur in the CPD process.

 

I hope it helps.

Br

Akos

----------------
\m/_(>_<)_\m/
maller2
Participant

Hi 

Memory seems to be OK:

Virtual System Capacity Summary:
Physical memory used: 42% (11446 MB out of 27074 MB) - below watermark
Kernel memory used: 12% (3484 MB out of 27074 MB) - below watermark
Virtual memory used: 6% (1636 MB out of 27074 MB) - below watermark

I've noticed the following pattern in /var/log/messages:

1- Before trying to install: nothing new in messages.

2- After the install fails with the message 'Installation failed. Reason: TCP connectivity failure ( port = 18191 )( IP = 198.18.0.20 )[ error no. 10 ].', there are a lot of messages like this for approximately 8-10 minutes:

Sep 26 17:09:24 2024 mrtdca01vsxfw spike_detective: spike info: type: thread, thread id: 23018, thread name: cpd, start time: 26/09/24 17:09:17, spike duration (sec): 6, initial cpu usage: 99, average cpu usage: 99, perf taken: 1

Sep 26 17:09:53 2024 mrtdca01vsxfw spike_detective: spike info: type: cpu, cpu core: 5, top consumer: cpd, start time: 26/09/24 17:09:46, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 1

Sep 26 17:09:53 2024 mrtdca01vsxfw spike_detective: spike info: type: thread, thread id: 23018, thread name: cpd, start time: 26/09/24 17:09:46, spike duration (sec): 6, initial cpu usage: 99, average cpu usage: 99, perf taken: 0

Sep 26 17:10:04 2024 mrtdca01vsxfw spike_detective: spike info: type: thread, thread id: 23018, thread name: cpd, start time: 26/09/24 17:09:58, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 0

Sep 26 17:10:10 2024 mrtdca01vsxfw spike_detective: spike info: type: cpu, cpu core: 3, top consumer: cpd, start time: 26/09/24 17:09:52, spike duration (sec): 17, initial cpu usage: 95, average cpu usage: 74, perf taken: 0

And the node is marked as LOST in MDS.

3- After 8-10 minutes I see this in messages:

Sep 26 17:11:04 2024 mrtdca01vsxfw xpand[14067]: show_asset CDK: asset_get_proc started.
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: init sensors
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Using /etc/hw_info/sensors.xml as active sensors data file (for thresholds and translation data)
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Loading driver name [nct7904]
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Loading driver name [lm63]
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Loading driver name [pac1014a]
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Loading driver name [i2c-i801]
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 Vcore
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 Vcore
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 DDR4-1
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 DDR4-2
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 DDR4-1
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 DDR4-2
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VCC 12V
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VCC 3V
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VCC 5V
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor 3VSB
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor 5VSB
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VBAT
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor Intake Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor Outlet Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 1
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 2
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 3
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 4
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Checking whether to add Power supply sensors
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor BIOS
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 Vcore
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 Vcore
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 DDR4-1
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 DDR4-2
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 DDR4-1
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 DDR4-2
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VCC 12V
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VCC 3V
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VCC 5V
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor 3VSB
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor 5VSB
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor VBAT
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU0 Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor CPU1 Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor Intake Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor Outlet Temp
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 1
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 2
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 3
Sep 26 17:11:04 2024 mrtdca01vsxfw cpd: Adding sensor System Fan 4
Sep 26 17:11:04 2024 mrtdca01vsxfw xpand[14067]: show_asset CDK: asset_get_proc started.

The node is OK again in MDS and policy install now works.

It seems that the CPD daemon restarts and then installation works.
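
To see how much of the spike time is attributed to cpd over a longer period, the spike_detective lines can be summarised with a short script. This is a minimal Python sketch, assuming the line format is exactly as in the excerpts quoted above (comma-separated `key: value` pairs after `spike info:`). The function names are hypothetical and this is not an official parser.

```python
import re
from collections import Counter

# Matches the part of a syslog line that follows "spike info: "
SPIKE_RE = re.compile(r"spike_detective: spike info: (?P<fields>.+)$")


def parse_spike(line):
    """Return the key/value fields of a spike_detective line, or None."""
    match = SPIKE_RE.search(line.strip().rstrip(","))
    if not match:
        return None
    fields = {}
    for part in match.group("fields").split(", "):
        key, sep, value = part.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def spike_seconds_by_process(lines):
    """Sum 'spike duration (sec)' per (spike type, process name)."""
    totals = Counter()
    for line in lines:
        fields = parse_spike(line)
        if not fields:
            continue  # not a spike_detective line
        name = fields.get("top consumer") or fields.get("thread name") or "unknown"
        try:
            seconds = int(fields.get("spike duration (sec)", "0"))
        except ValueError:
            seconds = 0
        totals[(fields.get("type", "?"), name)] += seconds
    return totals
```

It could be fed with something like `spike_seconds_by_process(open("/var/log/messages"))`. Lines that are not spike reports, such as the sensor messages, are ignored.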

 

Lesley
Leader

Upgrade the firewall and the issue will be solved.

Rebooting the system is only a temporary fix; the issue will come back.

I can confirm what has been posted before. Update and done.

-------
If you like this post please give a thumbs up(kudo)! 🙂
AkosBakos
Leader

Upgrade to the latest GA take.

----------------
\m/_(>_<)_\m/
the_rock
Legend

I agree with that, Akos.

Lesley
Leader

I agree with the rock that he agrees with Akos

-------
If you like this post please give a thumbs up(kudo)! 🙂
