Routing bug

Amir_Arama · ‎2019-11-19

We have an R80.20 Gaia cluster with FW, VPN and IA enabled. CoreXL and SecureXL are also enabled.

A couple of days ago I added a new VLAN on an unused interface as a point-to-point link to a remote site FW, which is connected over a layer 2 line. So far so good. The FWs have a site-to-site VPN with each other. There are no static routes on that line, only encrypted traffic.
This GW actually connects HQ with all the branches through the main ISP line on another interface.

Today connectivity between HQ and all the branches went down at least 7 times. Each outage lasted about 10-20 seconds and then recovered by itself. After checking with fw monitor I discovered that instead of routing packets destined for the branches through the main ISP line, the FW routed them through the new VLAN interface that I mentioned above, which is why the packets never arrived at their destination.
At first I thought I might have duplicate routes, so I checked: there is not a single route on this VLAN interface except, of course, the directly connected point-to-point network, which is in a completely different subnet.
Things that occurred today before it started:
Staff went to the remote site to install PCs, printers, etc., which I don't believe is relevant, and I ran fwaccel off and then back on on this GW.

In messages I got a lot of:
kernel: [fw4_1];fwconn_recover_old_conn: connection is accelerated - cannot set handler.
kernel: [fw4_1];fwconn_recover_old_conn: handler (322) VERIFICATION_HANDLER. dropping packet

and also a lot of these: kernel: dst_release: dst:ffff8808147852c0 refcnt:-2
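
For anyone triaging similar log floods, here is a minimal Python sketch that tallies kernel messages by firewall instance and function name, so it is easier to see which message dominates and when. The sample lines are the ones quoted above; the regular expression and the `tally` helper are illustrative assumptions, not a Check Point tool.

```python
import re
from collections import Counter

# Sample lines copied from the post. On a real gateway these would be read
# from the messages log instead (file location not stated in the thread).
SAMPLE = """\
kernel: [fw4_1];fwconn_recover_old_conn: connection is accelerated - cannot set handler.
kernel: [fw4_1];fwconn_recover_old_conn: handler (322) VERIFICATION_HANDLER. dropping packet
kernel: dst_release: dst:ffff8808147852c0 refcnt:-2
"""

# "kernel: ", then an optional "[instance];" prefix, then the function name.
PATTERN = re.compile(r"kernel: (?:\[(?P<inst>[^\]]+)\];)?(?P<func>\w+):")

def tally(lines):
    """Count log lines per (instance, function); '-' means no instance prefix."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(m.group("inst") or "-", m.group("func"))] += 1
    return counts

if __name__ == "__main__":
    for (inst, func), n in tally(SAMPLE.splitlines()).most_common():
        print(f"{n:5d}  {inst:8s} {func}")
```

Comparing the counts per time window against the outage times could show whether these messages line up with the 10-20 second drops or are just background noise.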

I have no idea what these messages mean.

It happened randomly for around 2 hours and stopped about when they left the remote site, which again I don't believe is related.

To me it looks very much like a bug, but I'm not sure why it happens only now and why with this new VLAN specifically.

fwaccel off didn't solve the issue right away, but I just read that in R80.20 it does not take effect on all connections as it did before.

Timothy_Hall · ‎2019-11-19

You said that you created a new VLAN interface at "Layer 2" and you are having intermittent outages that last 10-20 seconds. Sure sounds like Spanning Tree Protocol (STP) going through Listening/Learning on your switches over and over due to a rogue root bridge, a flap on a trunk port causing a constantly shifting root bridge, or an actual legit bridging loop forming that STP is breaking for you (and briefly most of the rest of the network with it). This topic was covered in my "Max Power" book and is a real bear to figure out...


Amir_Arama · ‎2019-11-20

Hi,

That doesn't seem to be the case.
As I mentioned, fw monitor shows the GW routing the packet to the wrong interface, and I don't see how a switch can affect the GW's routing decision. Also, the VLAN interface is L3, not L2. What I meant is that this interface is point-to-point to the remote site GW, and between the sites there is a layer 2 line.

Maarten_Sjouw · ‎2019-11-20

When layer 2 goes down, the next hop is no longer reachable, so the route is no longer available and traffic will be sent to the next best hop from the routing table.
So yes STP can still be the issue.
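
The fallback described here can be sketched as a longest-prefix-match lookup that skips routes whose interface is down. This is a simplified model only: the prefixes and interface names below are hypothetical (none are given in the thread), and it does not model VPN domains or SecureXL behavior.

```python
import ipaddress

# Hypothetical routing table: (prefix, outgoing interface).
# All values are made up for illustration; they are not from the thread.
ROUTES = [
    ("0.0.0.0/0", "eth0"),             # default route
    ("10.20.0.0/16", "eth1"),          # branches via the main ISP line
    ("192.168.99.0/30", "eth2.100"),   # directly connected point-to-point VLAN
]

def lookup(dst, routes, down=frozenset()):
    """Return the interface of the longest matching prefix whose interface is up."""
    addr = ipaddress.ip_address(dst)
    matches = []
    for prefix, ifname in routes:
        if ifname in down:
            continue  # route is unusable while its interface is down
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net.prefixlen, ifname))
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches)[1]

if __name__ == "__main__":
    print(lookup("10.20.5.1", ROUTES))                  # normal case
    print(lookup("10.20.5.1", ROUTES, down={"eth1"}))   # main line down
```

In this model a branch address falls back to the next shorter matching prefix when its interface is down. It can only land on an interface whose prefix actually covers the destination.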

Regards, Maarten

Amir_Arama · ‎2019-11-20

1. OK, but nothing went down. I was pinging the main router from the GW the whole time and got replies throughout, while it was directing packets through the wrong interface.

2. Even if the main link had gone down, it should have routed the traffic through other interfaces. The default GW, for example, is on another interface. Instead it routed packets through an interface that has no routes at all, in contrast to the other interfaces.

So I don't believe this is it.

PhoneBoy · ‎2019-11-22

Highly recommend getting the TAC involved.

Amir_Arama · ‎2019-11-23

I did.

Unfortunately it doesn't reproduce anymore.
