Issue on the sync interface
Hi guys!
I currently have a ticket open with TAC for this case, but so far there has been no resolution.
So I decided to ask for other opinions in the meantime, haha.
My customer has an R80.10 cluster (5800 appliances in HA mode) whose members are synchronized over a cable between them.
Every day, at a random time, the sync interface flaps and member 2 (Standby) tries to take over the Active state of the cluster. Most of the time, some VPNs also go down in the same minute.
In /var/log/messages I always get the same log structure:
"
Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to DOWN due to pnote Interface Active Check (desc eth8 interface is down, 8 interfaces required, only 7 up)
Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to DOWN
Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to DOWN
Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down (the change may not be allowed).
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 2021025.4 and now 2021028.4
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];fwha_set_backup_mode: Try to update local state to ACTIVE because of ID 0 is not ACTIVE or READY. (This attempt may be blocked by other machines)
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to READY
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to READY
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_state: ID 0 (state ACTIVE -> DOWN) (time 2021028.4)
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1]; member 0 is down
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_state_change_implied: Try to update local state from READY to ACTIVE because all other machines confirmed my READY state
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to ACTIVE
Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to ACTIVE
Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to ACTIVE due to pnote Interface Active Check (desc <NULL>)
Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_process_state_msg: Update state of member id 0 to ACTIVE due to the member report message
Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];fwha_set_backup_mode: Try to update local state to STANDBY because of ID 0 is ACTIVE or READY and with higher priority
Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to STANDBY
Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to STANDBY
Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_state: ID 0 (state DOWN -> ACTIVE) (time 2021029.5)
"
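A log excerpt like the one above can be condensed into a timeline of local state changes, which makes it easier to line up flaps with VPN drops. The following is only a minimal sketch: it assumes the line layout shown in this post (timestamp with year, hostname, `kernel:`, then the `fwha_update_local_state` message), and the function name `local_state_changes` is made up for illustration.

```python
import re

# Matches lines such as:
#   Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state:
#   Local machine state changed to DOWN
# The layout is taken from the excerpt in this thread; other versions may differ.
LOCAL_STATE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s+(?P<host>\S+)\s+kernel:.*"
    r"fwha_update_local_state: Local machine state changed to (?P<state>\w+)"
)

def local_state_changes(lines):
    """Return a list of (timestamp, host, new_state) for each local state change."""
    changes = []
    for line in lines:
        m = LOCAL_STATE.search(line)
        if m:
            changes.append((m.group("ts"), m.group("host"), m.group("state")))
    return changes
```

For example, passing the lines of a copied `/var/log/messages` file (`local_state_changes(open("messages"))`) would list every DOWN/READY/ACTIVE/STANDBY transition with its timestamp.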
Does anyone have any idea what could cause this behavior?
Note: so far I have made the following changes:
- Updated the Jumbo Hotfix to take 121;
- Moved the synchronization interface from Sync to eth8;
- Replaced the cable connecting the cluster members;
- Changed the CCP mode from multicast to broadcast.
Thanks in advance!
Carrier indicates the number of times that the interface lost link integrity (green light) with the attached switch. Usually caused by a loose cable but could be a bad NIC.
--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com
CET (Europe) Timezone Course Scheduled for July 1-2
Thank you Tim. Now I just don't know when those errors occurred; if it was during the replacement process or if they started before. I will monitor. Thanks again.
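One way to find out whether the carrier counter is still climbing is to save the `ifconfig` output for the interface at two points in time and compare the counters. The sketch below assumes the legacy net-tools output format, where the TX line ends in `carrier:N`; the function names are hypothetical and not part of any Check Point tool.

```python
import re

# Legacy net-tools ifconfig prints the counter on the TX line,
# e.g. "... overruns:0 carrier:3". This format is an assumption.
CARRIER = re.compile(r"\bcarrier:(\d+)")

def carrier_count(ifconfig_output):
    """Return the carrier counter found in `ifconfig <iface>` output, or None."""
    m = CARRIER.search(ifconfig_output)
    return int(m.group(1)) if m else None

def carrier_delta(before, after):
    """Return how many carrier losses occurred between two saved outputs."""
    a, b = carrier_count(before), carrier_count(after)
    if a is None or b is None:
        raise ValueError("no 'carrier:' field found in ifconfig output")
    return b - a
```

A delta of zero over a period that included a failover would suggest the link itself did not drop during that event.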
In my case, TAC RMA'd the second 5800 in the cluster in question, and the Sync interfaces are now operating at 1000 Mbps full duplex using auto-negotiation. Hopefully this resolves the unexpected failover issue.
How many inside interfaces (logical or physical) are defined in your customer's environment? I have exactly the same symptoms as you. The only difference is that the logs appear only on the active member. The issue started after I upgraded from R77.30 to R80.10. I have tried various takes; I am currently on take 121 and the issue is still occurring. Have you tried the latest take, 169?
Hello friends!
Timothy Hall, sorry for the delay in my feedback, but here is the news...
I put your question to TAC and they answered with this:
" I think it is unlikely that is the cause as we have many customers on version 103 and above and I have never seen that be a cause for this problem. We are also only seeing this on the sync interface but interface active check is performed on all cluster interfaces. If this was the cause of the issue I would expect to see the problem on multiple interfaces not just the sync interface flapping. The next time the issue occurs please upload a fresh cpinfo the next time the issue occurs and in the mean time provide the output of the command below. Thanks."
After that, this week, I changed a kernel parameter (fwha_timer_cpha_res) from 1 to 2, at TAC's request.
So far, the flapping has not happened again.
I'll keep monitoring the gateway until next week (before celebrating victory, haha).
@Kevin Tran, the gateways have Jumbo take 154.
If the problem remains (I hope not), I will probably install a newer GA take.
Thank you for the suggestion.
Hello friends!
Great news!
After changing the kernel parameter, the flaps stopped.
Note: To survive a reboot, I put the new setting in the fwkern.conf file:
fwha_timer_cpha_res=2
Thanks everyone for the help, especially Timothy Hall
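For anyone scripting the same change, here is a small sketch of adding or updating a `name=value` line in the text of fwkern.conf without creating duplicates. It works on a string only and makes assumptions: the helper name `set_kernel_param` is made up, and the file location (commonly `$FWDIR/boot/modules/fwkern.conf`) is not stated in this thread, so check it against your own system and TAC's instructions before writing anything back.

```python
def set_kernel_param(conf_text, name, value):
    """Return conf_text with `name=value` present exactly once.

    Lines are written as name=value with no spaces around '=',
    matching the line quoted in this post.
    """
    out, replaced = [], False
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == name:
            if not replaced:
                out.append(f"{name}={value}")
                replaced = True
            # Any further definitions of the same parameter are dropped.
        else:
            out.append(line)
    if not replaced:
        out.append(f"{name}={value}")
    return "\n".join(out) + "\n"
```

After a reboot, the value in effect can usually be checked with `fw ctl get int fwha_timer_cpha_res`.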
