R80.40 VSX - Temporary loss of routing

Alex- · ‎2020-12-30

I have a TAC SR open, but I'm posting this here in case someone has experienced the same issue.

The system has been running VSX R80.40 Take 83 on a new appliance for some time, and for the past few days there have been actual losses of connectivity on one VS. Investigation on the network side showed nothing specific, but in the /var/log messages the outages can be correlated with the creation of "messages_routed.vs1", and VS1 is the one with the issue.

The file always contains entries like these:

Dec 29 17:34:09.749148 [routed] ERROR: recv(header) returns 0
Dec 29 17:34:09.749148 [routed] DEBUG: cpcl_recv: deleting peer task 0x984d13c

Dec 29 17:34:09.749148 [routed] DEBUG: peer_remove(130): Entering !!!!
Dec 29 17:34:59.681893 [routed] NOTICE: Exit routed[132332] version routed-09.25.2020-01:19:13
Dec 29 17:35:00 routed_syslog_on: Routed syslog to "/var/log/routed_messages_vs1" started

These are followed by a bunch of "KRT REMNANT <routing prefix>: ignored" entries.

The issue lasts several minutes, after which connectivity is restored without any intervention.

The start of the issue can't be linked to any specific architectural changes on the FW or the network to which it's connected.
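The correlation described above can be sketched in a few lines of Python: scan the log for a routed exit followed by the "routed_syslog_on" line and report each pair of timestamps. This is only a rough sketch based on the excerpt in this post. The function names and the `year` parameter are made up for illustration (the log lines carry no year), and the matching strings are taken from the excerpt, not from any documented log format.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the excerpt; the microsecond part is optional.
TS = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2})(?:\.\d+)?\s")


def parse_ts(line, year=2020):
    """Return the line's timestamp, or None if it has no timestamp prefix."""
    m = TS.match(line)
    if not m:
        return None
    # The log has no year, so the caller supplies one (assumption).
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")


def routed_restarts(lines, year=2020):
    """Pair each 'Exit routed[' line with the next 'routed_syslog_on' line."""
    events = []
    exit_ts = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if "Exit routed[" in line:
            exit_ts = ts
        elif "routed_syslog_on" in line and exit_ts is not None:
            events.append((exit_ts, ts))
            exit_ts = None
    return events


sample = [
    "Dec 29 17:34:09.749148 [routed] ERROR: recv(header) returns 0",
    "Dec 29 17:34:59.681893 [routed] NOTICE: Exit routed[132332] version routed-09.25.2020-01:19:13",
    'Dec 29 17:35:00 routed_syslog_on: Routed syslog to "/var/log/routed_messages_vs1" started',
]

for stopped, started in routed_restarts(sample):
    print(f"routed exited at {stopped}, logging restarted at {started}")
```

Comparing the resulting timestamps against the reported outage windows would show whether every loss of connectivity lines up with a routed exit.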

PhoneBoy · ‎2020-12-30

Messages about routed and the behavior do suggest that routes appear to disappear for a period of time.
Why, I can't say, and it's good you have a TAC case open 🙂

Alex- · ‎2020-12-30

Indeed, I see core dumps that match the times of the outages exactly. I will follow up with TAC.
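For anyone wanting to do the same check, here is a minimal Python sketch that lists files in a dump directory whose modification time falls close to a given outage time. The function name, the five-minute window, and the directory argument are all assumptions for illustration; pass in wherever your system actually writes core dumps.

```python
import os
from datetime import datetime, timedelta


def dumps_near(dump_dir, event_time, window=timedelta(minutes=5)):
    """Return (name, mtime) for files modified within `window` of event_time."""
    hits = []
    for entry in os.scandir(dump_dir):
        if not entry.is_file():
            continue
        # File modification time, interpreted in local time.
        mtime = datetime.fromtimestamp(entry.stat().st_mtime)
        if abs(mtime - event_time) <= window:
            hits.append((entry.name, mtime))
    # Oldest first, so the output reads like a timeline.
    return sorted(hits, key=lambda hit: hit[1])
```

Calling it once per outage timestamp gives a quick list of which dumps, if any, belong to which outage.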

the_rock · ‎2020-12-30

I'm not a VSX expert by any means, but I recall that in the past (mind you, this was R77 and earlier) the easiest way to fix issues like that was to either restart the routed process or simply soft-reboot the box over SSH. Not sure if you've tried that, but technically, cprestart on the master FW, if it's a cluster, should suffice too.

Best,
Andy
"Have a great day and if its not, change it"

Alex- · ‎2020-12-31

The issue is self-fixing: routed gets its act together after a while and starts working again. The main question is why it happens and how to fix it permanently, because now there's obviously a big target on the FW every time something happens in the network.

the_rock · ‎2020-12-31

OK, understood. In that case, keep the TAC case open and hopefully have it worked by a routing expert. I recall a couple of tickets like that in the past that went to R&D, and it took months to get any logical solution, so just be prepared in case you are expecting a quick resolution.

Happy New Year!

Best,
Andy
"Have a great day and if its not, change it"
