Routed Daemon Issues and Failovers

Mark_Devanney · ‎2020-04-23

Hi all,

Some background to this issue: we have around 600 vSEC gateways running ClusterXL HA on underlying VM infrastructure, on R80.10 with some specific hotfixes relevant to our setup. Since January we have started to see the devices flip (ClusterXL) because, as far as we can tell, the routed daemon hits errors and then dumps core. We have had loads of these now. Nothing has been service affecting, as failover is doing its job. We also have BGP terminating on the same gateways, but again we have set things up to minimize any flips, which is saving us from seeing service issues.

We see the issue on multiple devices across multiple sites, with very different traffic levels, from low-end test setups (2 cores) to live high-throughput production firewalls with multiple blades in use (8 cores). We have had TAC involved and our Diamond engineers haven't picked up anything as yet. We have also enabled debugging on a mix of devices, hoping we catch the issue and get more data to analyse.

Some info from messages is below as a starter for ten. The example shows a failover event at 06:15 today, and I'm happy to supply more info about our setup. Thanks in advance.

Apr 23 06:15:25 2020xxx clish[27173]: cmd by cronuser: Start executing : show route (cmd md5: 32edc6d9ebbb96f075ea7f0477b6285c)
Apr 23 06:15:25 2020xxx clish[27173]: cmd by cronuser: Processing : show route (cmd md5: 32edc6d9ebbb96f075ea7f0477b6285c)
Apr 23 06:15:25 2020xxx kernel: routed_mon[28067]: segfault at 0000000000000210 rip 00000000081131dc rsp 00000000f6764a30 error 4
Apr 23 06:15:25 2020xxx kernel: do_coredump: corename = |/etc/coredump/compress.sh /var/log/dump/usermode/routed_mon.28065.core
Apr 23 06:15:25 2020xxx kernel: do_coredump: argv_arr[0] = /etc/coredump/compress.sh
Apr 23 06:15:25 2020xxx kernel: do_coredump: argv_arr[1] = /var/log/dump/usermode/routed_mon.28065.core
Apr 23 06:15:26 2020xxx xpand[5086]: invalid binding <iterate rt:instance:default:af:inet:rt >, connection closed by routed
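With this many gateways, it may help to pull the segfault and core-dump lines out of each device's messages file automatically. The following is only a rough sketch in Python, assuming the log lines look like the excerpt above. The function name find_segfaults and the regular expressions are my own illustration, not a Check Point tool, and the patterns may need adjusting for other message formats.

```python
import re

# Patterns written against the excerpt above (assumption: other gateways
# log the same format; adjust if yours differ).
SEGFAULT_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s+segfault at (?P<addr>[0-9a-f]+)"
)
CORENAME_RE = re.compile(r"do_coredump: corename = .*?(?P<path>/\S+\.core)\s*$")


def find_segfaults(lines):
    """Return one dict per segfault line, with the core path that follows it."""
    events = []
    for line in lines:
        seg = SEGFAULT_RE.search(line)
        if seg:
            events.append({
                "timestamp": seg["ts"],
                "host": seg["host"],
                "process": seg["proc"],
                "pid": int(seg["pid"]),
                "core": None,
            })
            continue
        core = CORENAME_RE.search(line)
        # Attach the core file name to the most recent segfault, if any.
        if core and events and events[-1]["core"] is None:
            events[-1]["core"] = core["path"]
    return events


# Lines copied from the excerpt above.
SAMPLE = [
    "Apr 23 06:15:25 2020xxx kernel: routed_mon[28067]: segfault at "
    "0000000000000210 rip 00000000081131dc rsp 00000000f6764a30 error 4",
    "Apr 23 06:15:25 2020xxx kernel: do_coredump: corename = "
    "|/etc/coredump/compress.sh /var/log/dump/usermode/routed_mon.28065.core",
]

if __name__ == "__main__":
    for event in find_segfaults(SAMPLE):
        print(event)
```

Run against messages files collected from several gateways, the resulting list could be grouped by process or time of day to see whether the crashes line up with something like the cron-driven "show route" call in the excerpt.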

PhoneBoy · ‎2020-04-25

The errors suggest it's dumping core, which means TAC will have to analyze the core files to see what's going on.

Chris_Atkinson · ‎2020-04-25

Are your hotfixes on top of Jumbo T249 or another take?

TAC is definitely the best route to analyse and resolve given the additional variables described here.

CCSM R77/R80/ELITE

Mark_Devanney · ‎2020-04-27

Thanks both. The only issue, which may be related to the problem, is that although the devices seem to core dump, the dumps all seem to be corrupt, so when we send them up to TAC they can't get anything useful from them. We're putting debug on across all our sites with the same code/setup, nearly 600 devices, in the hope that we see the issue and get more useful info. We will also ping TAC/our Diamond engineer about why the core dumps are all corrupt, which seems odd but is more likely down to something common/obvious like disk space. We had 3 clusters do this again over the weekend, but unfortunately we didn't have debugging on them because they are some of our more important devices from an app/service point of view.

Thanks
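To test the disk space theory quickly, something along these lines could be run on a copy of the dump directory or on the gateway itself. This is a minimal sketch, not a confirmed diagnosis. The directory /var/log/dump/usermode is taken from the log excerpt, while the function name dump_dir_report and the 1 GiB free-space threshold are hypothetical choices of mine. An empty dump file is only one possible sign of a truncated core.

```python
import os
import shutil


def dump_dir_report(dump_dir, min_free_bytes=1 << 30):
    """Report free space on the dump filesystem and list zero-byte files.

    min_free_bytes is an arbitrary illustrative threshold (1 GiB by default).
    """
    usage = shutil.disk_usage(dump_dir)
    empty_files = sorted(
        entry.name
        for entry in os.scandir(dump_dir)
        if entry.is_file() and entry.stat().st_size == 0
    )
    return {
        "free_bytes": usage.free,
        "low_space": usage.free < min_free_bytes,
        "empty_files": empty_files,
    }


if __name__ == "__main__":
    # Path taken from the do_coredump lines in the log excerpt.
    target = "/var/log/dump/usermode"
    if os.path.isdir(target):
        print(dump_dir_report(target))
    else:
        print(f"{target} not found on this machine")
```

If low_space comes back true, or the directory holds empty files from around the time of a flip, that would be worth passing on to TAC alongside the dumps.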

Chris_Atkinson · ‎2020-04-30

For awareness, R80.20 and above include some improvements in this space that may be worth considering once the issue is better understood.

R80.20 What's NEW

"Improved clustering infrastructure for RouteD (Dynamic Routing) communication"

(Source: sk122485)

CCSM R77/R80/ELITE
