We are experiencing short (30 seconds to 2 minutes) communication disruptions, during which all connectivity is lost and the main cluster member stops responding (while the standby member still does).
Looking through /var/log/messages, we can see a recurring pattern. Every time, there is first a message like:
Starting CUL mode because CPU-02 usage (81%) on the local member increased above the configured threshold (80%).
Then multiple log entries like:
cerbero1 kernel: [fw4_1];[censored_public_ip:44288 -> Censored_public_ip:53] [ERROR]: malware_res_rep_rad_query: rad_kernel_api_async_get_resource() failed with error: Service is down
And then:
cerbero1 kernel: [fw4_1];CLUS-120202-1: Stopping CUL mode after 80 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
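To check how consistently this pattern repeats, the three message types can be grouped into episodes with a small script. This is only a rough sketch: the regular expressions are based solely on the three log lines quoted above (other firmware versions may word them differently), and the function name `summarize_cul_episodes` is made up for illustration.

```python
import re

# Patterns derived only from the log lines quoted in this post (assumption:
# the wording is the same for every occurrence in /var/log/messages).
START_RE = re.compile(
    r"Starting CUL mode because (CPU-\d+) usage \((\d+)%\).*?threshold \((\d+)%\)"
)
STOP_RE = re.compile(r"Stopping CUL mode after (\d+) sec")
RAD_RE = re.compile(r"rad_kernel_api_async_get_resource\(\) failed")


def summarize_cul_episodes(lines):
    """Group log lines into CUL episodes (start -> RAD errors -> stop)."""
    episodes = []
    current = None
    for line in lines:
        start = START_RE.search(line)
        if start:
            # A new start replaces any episode that never logged a stop.
            current = {
                "cpu": start.group(1),
                "usage": int(start.group(2)),
                "threshold": int(start.group(3)),
                "rad_errors": 0,
                "duration_sec": None,
            }
            continue
        if current is None:
            continue  # ignore lines outside a CUL episode
        if RAD_RE.search(line):
            current["rad_errors"] += 1
            continue
        stop = STOP_RE.search(line)
        if stop:
            current["duration_sec"] = int(stop.group(1))
            episodes.append(current)
            current = None
    return episodes


# Example using the (shortened) lines from this post.
sample = [
    "Starting CUL mode because CPU-02 usage (81%) on the local member "
    "increased above the configured threshold (80%).",
    "cerbero1 kernel: [fw4_1]; [ERROR]: malware_res_rep_rad_query: "
    "rad_kernel_api_async_get_resource() failed with error: Service is down",
    "cerbero1 kernel: [fw4_1];CLUS-120202-1: Stopping CUL mode after 80 sec "
    "(short CUL timeout), because no member reported CPU usage above the "
    "configured threshold (80%) during the last 10 sec.",
]
for episode in summarize_cul_episodes(sample):
    print(episode)
```

On the real system the same function could be given the open log file instead of `sample`, since it only iterates over lines. Seeing which CPU is named at each start, and how many RAD errors fall inside each episode, may help narrow down what is driving the CPU spike.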
What may cause this problem?
Thanks in advance