Routed Daemon Issues and Failovers

Hi all,

Some background: we have around 600 vSEC gateways running ClusterXL HA on underlying VM infrastructure. They run R80.10 with some hotfixes specific to our setup. Since January we have seen the clusters fail over (ClusterXL) because, as far as we can tell, the routed daemon hits errors and then core dumps. We have had a lot of these now. Nothing has been service affecting, as failover is doing its job (we also have BGP terminating on the same gateways, but that is set up to minimize the impact of any flips, which is saving us from service issues). We see the problem on multiple devices across multiple sites, at very different traffic levels, from low-end test setups (2 cores) to live high-throughput production firewalls with multiple blades in use (8 cores). We have had TAC involved and our Diamond engineers haven't picked up anything yet. We have also enabled debugging on a mix of devices, hoping to catch the issue and get more data to analyse.

Some info from the messages log is below as a starting point. The example shows a failover event at 06:15 today, and I'm happy to supply more info about our setup. Thanks in advance.

 

Apr 23 06:15:25 2020xxx clish[27173]: cmd by cronuser: Start executing : show route (cmd md5: 32edc6d9ebbb96f075ea7f0477b6285c)
Apr 23 06:15:25 2020xxx clish[27173]: cmd by cronuser: Processing : show route (cmd md5: 32edc6d9ebbb96f075ea7f0477b6285c)
Apr 23 06:15:25 2020xxx kernel: routed_mon[28067]: segfault at 0000000000000210 rip 00000000081131dc rsp 00000000f6764a30 error 4
Apr 23 06:15:25 2020xxx kernel: do_coredump: corename = |/etc/coredump/compress.sh /var/log/dump/usermode/routed_mon.28065.core
Apr 23 06:15:25 2020xxx kernel: do_coredump: argv_arr[0] = /etc/coredump/compress.sh
Apr 23 06:15:25 2020xxx kernel: do_coredump: argv_arr[1] = /var/log/dump/usermode/routed_mon.28065.core
Apr 23 06:15:26 2020xxx xpand[5086]: invalid binding <iterate rt:instance:default:af:inet:rt >, connection closed by routed
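
With this many gateways, it may help to pull the crash events out of each messages log automatically. Below is a minimal sketch in Python, assuming the log lines look like the excerpt above. The function name `find_crashes` and both regular expressions are illustrative and derived only from these sample lines; this is not a Check Point tool, and other builds may log differently.

```python
import re

# Both patterns are based solely on the log lines quoted in this post.
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+)"
)
# The core name may be piped through a helper script ("|/path/helper /path/core").
CORENAME_RE = re.compile(r"do_coredump: corename = (?:\|\S+ )?(?P<core>\S+)")


def find_crashes(lines):
    """Return one dict per segfault line, with the core path logged after it (if any)."""
    events = []
    for line in lines:
        seg = SEGFAULT_RE.search(line)
        if seg:
            events.append({
                "timestamp": line[:15],      # classic syslog "Mon DD HH:MM:SS" prefix
                "process": seg.group("proc"),
                "pid": int(seg.group("pid")),
                "core": None,
            })
            continue
        core = CORENAME_RE.search(line)
        # Attach the core path to the most recent segfault that has none yet.
        if core and events and events[-1]["core"] is None:
            events[-1]["core"] = core.group("core")
    return events
```

Calling `find_crashes(open(path))` on a copy of the messages file from each gateway would give a list that can be compared across sites, for example to see whether the crashes line up with the cron-driven `show route` command seen in the excerpt.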

 

 

4 Replies
Admin

The errors suggest it's dumping core, which means TAC will have to analyze the core files to see what's going on.
Employee++

Are your hotfixes on top of Jumbo T249 or another take?

TAC is definitely the best route to analyse and resolve given the additional variables described here.


Thanks both. The only issue, and it may be related to the problem, is that although the devices do core dump, the dumps all seem to be corrupt, so when we send them to TAC they can't get anything useful from them. We're enabling debug across all our sites with the same code/setup, nearly 600 devices, in the hope that we catch the issue and get more useful info. I will also ask TAC and our Diamond engineer why the core dumps are all corrupt, which seems odd, but it's more likely something common/obvious like disk space. We had 3 clusters do this again over the weekend, but unfortunately we didn't have debugging on them, as they are some of our more important devices from an app/service point of view.
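
A rough way to test the disk-space theory before sending more dumps to TAC is sketched below in Python. It makes two assumptions that nothing in this thread confirms: that the dumps land on the filesystem holding `/var/log/dump/usermode` (the path in the log lines earlier), and that the compress script produces gzip output. The helper names `free_gib` and `gzip_is_intact` are made up for illustration.

```python
import gzip
import shutil
import zlib


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if `path` decompresses cleanly from start to end.

    Assumption: the core file is gzip-compressed. A truncated or
    non-gzip file returns False.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks so a large core is never held in memory at once.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

For example, `free_gib("/var/log/dump/usermode")` shows how much room is left for a new dump, and a dump that fails `gzip_is_intact` was most likely cut short while being written, which would point at space or a write problem rather than at routed itself.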

Thanks

 

 

Employee++

For awareness, R80.20 and above include some improvements in this space that may be worth considering once the issue is better understood.

 

R80.20 What's NEW 

"Improved clustering infrastructure for RouteD (Dynamic Routing) communication"

(Source: sk122485)
