Hi all,
Setup is 2 Quantum Spark 1590s in an active/standby cluster running the latest version, R81.10.17, build 996004721.
I consistently have problems with the TED interface showing as a "problem" and therefore the standby member showing as down. It seems to happen every few days to a week, and the fix is usually some combination of cpstop/cpstart and rebooting. I've even had to go as far as re-initializing SIC (this may have been unrelated). I typically notice the red icon in SmartConsole, which reports a problem with ClusterXL and usually a message about not being able to connect to the cloud. When I SSH into the gateway, cphaprob list shows the TED interface in a problem state.
The facility where these are located is very remote and our only ISP options are point-to-point wireless connections, of which we have 2. One is connected to the Check Points and the other is used by our SD-WAN (not Check Point). When I see this issue, I typically chalk it up to something flaky happening overnight with our ISP. In my (limited) experience with point-to-point wireless, a gust of wind can cause connectivity issues.
As I said, I can usually get it back relatively quickly by rebooting the offending gateway, and I forget about it until the next time I see the red X. However, yesterday was the first time I saw it break in "real time." GW1 was the active member and GW2 had the TED problem, but there was another issue: GW1 was not responding to SSH or HTTPS. I had someone local physically reboot GW1, at which point GW2 became active. Once GW1 booted, it became the active member again and GW2 still showed the TED problem. I ran cpstop/cpstart on GW2 and once services started, it became the active member with everything reporting OK. When I checked the cluster status again, GW1 was now showing the TED problem. (I should note, I was multitasking after running cpstart, so it was anywhere from 15 to 45 minutes before I checked the status again.)
I got pulled away for the rest of the day on a mission-critical task, so I didn't get back to this until today. GW1 was still showing as down, with the TED interface in a problem state. I rebooted GW1 and it's been back to normal since (3-ish hours).
Has anybody seen something like this before? Could it be related to the less-than-ideal ISP? Is there anything I can tweak in the settings to try to resolve it? I've not yet opened a TAC case on it, primarily because I've been assuming it's just a weird ISP thing. I also feel like I'll open a case with TED broken, they'll ask me to reboot, which will "fix" it and - voila - case closed. And there is no rhyme or reason to why it's sometimes fixed for 2 days and other times 2 weeks...
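Since the timing looks random, a timestamped record of when each member flips into the problem state might help correlate it with ISP drops, and would give TAC something to look at after a reboot has cleared the symptom. Below is a rough Python sketch of that idea, meant to run on a separate machine against saved cphaprob list output. It assumes the output contains "Device Name:" and "Current state:" lines; the SAMPLE text is made up for illustration and is not real output from these gateways, and the function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical sample, NOT real output from these gateways. It only mimics
# the assumed "Device Name:" / "Current state:" layout of `cphaprob list`.
SAMPLE = """\
Device Name: Interface Active Check
Current state: OK

Device Name: ted
Current state: problem
"""


def failing_devices(cphaprob_output):
    """Return (device, state) pairs whose reported state is not OK."""
    failing = []
    device = None
    for raw in cphaprob_output.splitlines():
        line = raw.strip()
        name = re.match(r"Device Name:\s*(.+)", line)
        if name:
            device = name.group(1).strip()
            continue
        state = re.match(r"Current state:\s*(.+)", line)
        if state and device is not None:
            value = state.group(1).strip()
            if value.upper() != "OK":
                failing.append((device, value))
            device = None  # wait for the next "Device Name:" line
    return failing


def log_line(cphaprob_output, now=None):
    """Build one timestamped (UTC) summary line for a log file."""
    now = now or datetime.now(timezone.utc)
    bad = failing_devices(cphaprob_output)
    summary = ", ".join(f"{dev}={st}" for dev, st in bad) or "all OK"
    return f"{now.isoformat()} {summary}"


print(log_line(SAMPLE))
```

Running something like this every few minutes against each member's output, and appending the result to a file, would show whether the failures line up with known wireless outages or follow some other pattern. If the real output format differs from the assumed one, the two regular expressions are the only part that should need adjusting.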