MATEUS_SALGADO
Contributor

Issue on the sync interface

Hi guys!


Currently I have a ticket open with TAC for this case, but so far nothing...

So I decided to hear other opinions in the meantime. haha

The issue is that my customer has an R80.10 cluster (5800 appliances in HA mode), where the synchronization interface between the members is connected with a direct cable.

Every day, at a random time, the sync interface flaps and member 2 (Standby) tries to take over the Active state of the cluster. Most of the time, some VPNs also drop in that same minute.

In /var/log/messages I always get the same log structure:

"

Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to DOWN due to pnote Interface Active Check (desc eth8 interface is down, 8 interfaces required, only 7 up)

Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to DOWN

Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to DOWN

Sep 27 13:37:10 2018 fw02 kernel: [fw4_1];fwha_state_change_implied: Try to update state to ACTIVE because member is down (the change may not be allowed).

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];check_other_machine_activity: Update state of member id 0 to DEAD, didn't hear from it since 2021025.4 and now 2021028.4

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];fwha_set_backup_mode: Try to update local state to ACTIVE because of ID 0 is not ACTIVE or READY. (This attempt may be blocked by other machines)

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to READY

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to READY

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_state: ID 0 (state ACTIVE -> DOWN) (time 2021028.4)

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1]; member 0 is down

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_state_change_implied: Try to update local state from READY to ACTIVE because all other machines confirmed my READY state

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to ACTIVE

Sep 27 13:37:11 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to ACTIVE

Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];fwha_report_id_problem_status: Try to update state to ACTIVE due to pnote Interface Active Check (desc <NULL>)

Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_process_state_msg: Update state of member id 0 to ACTIVE due to the member report message

Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];fwha_set_backup_mode: Try to update local state to STANDBY because of ID 0 is ACTIVE or READY and with higher priority

Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_set_new_local_state: Setting state of fwha_local_id(1) to STANDBY

Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_local_state: Local machine state changed to STANDBY

Sep 27 13:37:12 2018 fw02 kernel: [fw4_1];FW-1: fwha_update_state: ID 0 (state DOWN -> ACTIVE) (time 2021029.5)


"
Does anyone have an idea what could be causing this behavior?

Note: So far I have already tried a few changes, such as:

- Updated the Jumbo Hotfix to Take 121;

- Moved the synchronization interface from Sync to eth8;

- Replaced the cable connecting the cluster members;

- Changed the CCP mode from multicast to broadcast (verification commands sketched below).
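
For anyone checking the same things, these are roughly the commands I would use from expert mode on each member (R80.10 syntax; exact output can vary by take):

cphaprob state                # overall cluster state of both members
cphaprob -a if                # monitored cluster interfaces, required-interface count and CCP mode
cphaprob list                 # pnotes / critical devices, e.g. the "Interface Active Check" seen in the logs
cphaconf set_ccp broadcast    # the command used to switch CCP from multicast to broadcast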

Thanks in advance!

35 Replies
Timothy_Hall
Legend

Carrier indicates the number of times that the interface lost link integrity (green light) with the attached switch.  Usually caused by a loose cable but could be a bad NIC.
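
From expert mode, something along these lines should show the relevant counters for the sync interface (eth8 here is just the interface from the logs above; exact counter names depend on the NIC driver):

ifconfig eth8 | grep -i carrier            # carrier count appears in the TX line
ethtool -S eth8 | grep -iE 'carrier|err'   # driver statistics; counter names vary by NIC
netstat -i                                 # quick per-interface RX/TX error overview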

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Mike_Jensen
Advisor

Thank you Tim. Now I just don't know whether those errors occurred during the replacement process or whether they started before. I will monitor. Thanks again.

Mike_Jensen
Advisor

In my case, TAC RMA'd the second 5800 in the cluster in question and now the Sync interfaces are operating at 1000 Mbps / full duplex, using auto-negotiation. Hopefully this resolves the unexpected failover issue.
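
If it is useful to anyone else, the negotiated speed/duplex can be confirmed from expert mode with ethtool (again assuming eth8 as the sync interface):

ethtool eth8    # shows Speed, Duplex, Auto-negotiation and Link detected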

Kevin_Tran
Explorer

How many inside interfaces (logical or physical) are defined in your customer's environment? I have exactly the same symptoms as yours. The only difference is that the logs only appear on the active member. The issue started after I upgraded from R77.30 to R80.10. I have tried various takes, am currently on Take 121, and the issue is still occurring. Have you tried the latest Take 169?
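
In case it helps with comparing takes, the installed Jumbo Hotfix take can be listed from expert mode with:

cpinfo -y all    # lists installed hotfixes per product, including the Jumbo Hotfix Accumulator take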

MATEUS_SALGADO
Contributor

Hello friends!

Timothy Hall, sorry for the delay in my feedback, but here is the news...

I put your question to TAC and they answered with this:

" I think it is unlikely that is the cause as we have many customers on version 103 and above and I have never seen that be a cause for this problem.  We are also only seeing this on the sync interface but interface active check is performed on all cluster interfaces.  If this was the cause of the issue I would expect to see the problem on multiple interfaces not just the sync interface flapping.  The next time the issue occurs please upload a fresh cpinfo the next time the issue occurs and in the mean time provide the output of the command below.  Thanks."

After that, this week, at TAC's request, I changed a kernel parameter (fwha_timer_cpha_res) from 1 to 2.
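
For anyone wanting to try the same thing, this is roughly how the value is read and changed on the fly from expert mode (the runtime change alone does not survive a reboot):

fw ctl get int fwha_timer_cpha_res      # show the current value (1 was the default here)
fw ctl set int fwha_timer_cpha_res 2    # apply the new value at runtime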

So far, the flapping has not happened again.

I'll keep monitoring the gateway until next week (before celebrating victory..hahaha).

@Kevin Tran, the gateways have Jumbo Take 154.

If the problem remains (I hope not), I will probably install a new GA take.
Thank you for the suggestion.

MATEUS_SALGADO
Contributor

Hello friends!

Great news!

After changing the kernel parameter, the flaps have not happened again.

Note: To make the change survive a reboot, I put the new configuration in the fwkern.conf file:

fwha_timer_cpha_res=2
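
In other words, roughly (on each member, from expert mode; the file should be $FWDIR/boot/modules/fwkern.conf):

echo 'fwha_timer_cpha_res=2' >> $FWDIR/boot/modules/fwkern.conf    # persist the setting across reboots
fw ctl get int fwha_timer_cpha_res                                 # verify the running value after reboot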

Thanks everyone for the help, especially Timothy Hall
