Hi all,
Some background to this issue: we have around 600 vSEC gateways running ClusterXL HA on underlying VM infrastructure. They run R80.10 with some specific hotfixes relevant to our setup.

Since January we have started to see the devices flip (ClusterXL failover) due to what looks like the routed daemon hitting errors and then core dumping. We have had loads of these now. Nothing has been service affecting, as failover is doing its job. We also have BGP terminating on the same gateways, but again we have that set up to minimise the effect of any flips, which is saving us from seeing service issues.

We see the issue on multiple devices across multiple sites, with very different traffic levels, from low-end test setups (2 cores) to live high-throughput production firewalls with multiple blades in use (8 cores). We have had TAC involved and our Diamond engineers haven't picked up anything as yet. We have also enabled debugging on a mix of devices, hoping to catch the issue and get more data to analyse.
Some info from messages is below as a starter for ten. The example shows a failover event at 06:15 today, but I am happy to supply more info about our setup. Thanks in advance.
Apr 23 06:15:25 2020xxx clish[27173]: cmd by cronuser: Start executing : show route (cmd md5: 32edc6d9ebbb96f075ea7f0477b6285c)
Apr 23 06:15:25 2020xxx clish[27173]: cmd by cronuser: Processing : show route (cmd md5: 32edc6d9ebbb96f075ea7f0477b6285c)
Apr 23 06:15:25 2020xxx kernel: routed_mon[28067]: segfault at 0000000000000210 rip 00000000081131dc rsp 00000000f6764a30 error 4
Apr 23 06:15:25 2020xxx kernel: do_coredump: corename = |/etc/coredump/compress.sh /var/log/dump/usermode/routed_mon.28065.core
Apr 23 06:15:25 2020xxx kernel: do_coredump: argv_arr[0] = /etc/coredump/compress.sh
Apr 23 06:15:25 2020xxx kernel: do_coredump: argv_arr[1] = /var/log/dump/usermode/routed_mon.28065.core
Apr 23 06:15:26 2020xxx xpand[5086]: invalid binding <iterate rt:instance:default:af:inet:rt >, connection closed by routed
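For anyone wanting to check their own gateways for the same pattern, here is a rough Python sketch (not a Check Point tool) that pulls kernel segfault lines out of a messages file and counts them by process name and `rip` value. The idea is that if every crash across the estate lands on the same `rip`, it is probably the same bug each time. The regex is only assumed from the log lines above, and the names `find_segfaults` and `count_by_rip` are made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the sample kernel lines in this post, e.g.
# "Apr 23 06:15:25 host kernel: routed_mon[28067]: segfault at ... rip ... rsp ... error 4"
SEGFAULT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\ssegfault at (?P<addr>[0-9a-f]+)\s"
    r"rip (?P<rip>[0-9a-f]+)"
)


def find_segfaults(lines):
    """Return one dict per kernel segfault line (ts, host, proc, pid, addr, rip)."""
    events = []
    for line in lines:
        match = SEGFAULT_RE.match(line)
        if match:
            events.append(match.groupdict())
    return events


def count_by_rip(events):
    """Count crashes per (process name, instruction pointer) pair."""
    return Counter((event["proc"], event["rip"]) for event in events)
```

To use it, open a copy of the messages file, pass the file object to `find_segfaults`, and print the result of `count_by_rip`. Lines that do not match the assumed format are simply skipped, so the pattern may need adjusting for other syslog layouts.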