Solved: Re: Radius auth failover issue

the_rock · ‎2023-12-21

Hey guys,

Happy holidays! I wanted to see if someone could provide some thoughts/suggestions on this. Our customer has 2 RADIUS servers, onprem and Azure. All this works fine, BUT, after 2 years and multiple TAC cases, we still can't solve the failover problem.

Btw, management is S1C and gateways are 6400s, R81.20 jumbo 41 (the latest)

What I mean is this: say onprem is priority 1 and Azure is priority 2, and you shut down the onprem server. One would think Azure would take over, but no, auth requests still go to the onprem server, as we can clearly see by running tcpdump on port 1812. By the way, the same issue happens if Azure is the main auth server. One quick workaround when it happens is to simply swap the priorities of the RADIUS servers; everything works fine after installing policy.

Also tested with both servers as priority 1, no luck.

We even set global auth to RADIUS, made sure the generic object in the legacy dashboard was set to RADIUS, and tried both "any" and the RADIUS group that contains both servers. No luck in any of the scenarios.

TAC confirmed more than once that the config is right, so it truly raises the question...WHY does the failover scenario not work? I'm not sure if anyone out there is using 2 RADIUS servers, but if you are, PLEASE let us know how you made this work (if you did lol)

Thanks again for all the suggestions!

Best,

Andy

the_rock · ‎2023-12-29

This is what TAC confirmed is fine (see below). Personally, I want to be positive this will make a difference, but having tried who knows how many different values in the last 2 years, I doubt it. Let's see...hopefully, it can be tested again next week.

Best,

Andy

the_rock · ‎2023-12-27

@PhoneBoy Happy holidays mate! I wanted to pick your brain on this and see if you had any suggestions. Honestly, it's nothing urgent, as the issue has been there for more than 2 years now, so the customer does not expect it to be fixed magically lol

Just wanted to see if you have anything on your mind that may help, that's all.

Happy New Year.

Best,

Andy

PhoneBoy · ‎2023-12-27

I believe it merely tries the RADIUS servers in priority order, rather than "failing over" and marking one as active.
At least that's how I remember this feature working back in the day.

the_rock · ‎2023-12-27

Really? Hm, interesting...so I guess it sort of defeats the purpose of having 2 RADIUS servers for authentication. Any way to make it work with 2 of them in a group if, say, one is priority 1 and the other is 2? We even tested both at the same priority the other night and saw the exact same issue.

Best,

Andy

PhoneBoy · ‎2023-12-27

It's supposed to try priority 1 first, then priority 2.
If it's not doing that, the TAC may need to investigate.

Chris_Atkinson · ‎2023-12-27

Have you tried recreating the RADIUS server group in SmartConsole? Does the current group name include special characters?

CCSM R77/R80/ELITE

the_rock · ‎2023-12-27

Yes we did, a while back actually. The name is Radius_group, but there was a time it was called simply Radius and that made no difference either.

Best,

Andy

Chris_Atkinson · ‎2023-12-27

And to confirm: when you say the RADIUS server is shut down, it is not providing any response at all, correct?

(Generally, auth fail and timeout are not the same from a liveness perspective)

CCSM R77/R80/ELITE

the_rock · ‎2023-12-27

That's right, as if you shut down a Windows PC, not just rebooted it. Check out below: 192.168.x.x was the one that was shut down and 10.x.x.x is the Azure one that was up and running.

Best,

Andy

[Expert@FW-1:0]# tcpdump -enni any host 192.168.32.210 and port 1812
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
20:46:20.795393 Out 00:1c:7f:a1:42:47 ethertype IPv4 (0x0800), length 116: 10.240.0.3.55059 > 192.168.32.210.1812: RADIUS, Access-Request (1), id: 0x29 length: 72
20:46:20.795396 Out 00:1c:7f:a1:42:47 ethertype 802.1Q (0x8100), length 120: vlan 20, p 0, ethertype IPv4, 10.240.0.3.55059 > 192.168.32.210.1812: RADIUS, Access-Request (1), id: 0x29 length: 72
20:46:25.795160 Out 00:1c:7f:a1:42:47 ethertype IPv4 (0x0800), length 116: 10.240.0.3.55059 > 192.168.32.210.1812: RADIUS, Access-Request (1), id: 0x29 length: 72
20:46:25.795163 Out 00:1c:7f:a1:42:47 ethertype 802.1Q (0x8100), length 120: vlan 20, p 0, ethertype IPv4, 10.240.0.3.55059 > 192.168.32.210.1812: RADIUS, Access-Request (1), id: 0x29 length: 72
^C
4 packets captured
16 packets received by filter
0 packets dropped by kernel
[Expert@FW-1:0]#

Azure radius was not responding, so I changed priority to 1 for both and tested

[Expert@FW-1:0]# tcpdump -enni any host 10.200.11.14 and port 1812
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
^C
0 packets captured
66 packets received by filter
2 packets dropped by kernel
[Expert@FW-1:0]#

the_rock · ‎2023-12-27

Just verified now, group is called RadiusGroup, so no special characters anywhere.

Best,

Andy

the_rock · ‎2024-01-10

Hey guys,

Thanks to everyone who helped and responded. We were finally able to get this working with the help of TAC on a remote session. Below are the settings in global properties for RADIUS that worked 100%: when the onprem RADIUS (the primary) was shut down, auth worked flawlessly against the Azure one, and it kept working when onprem was powered back on.

Customer was very happy it's finally fixed after 2 years.

Thanks again!

Best,

Andy

Grateful to @SenpaiNoticed_U @Chris_Atkinson @PhoneBoy @mccabe for all the advice and guidance ✌️

mccabe · ‎2024-01-11

🙂

the_rock · ‎2024-01-11

All I will say is this...maybe there are not too many customers out there using 2 RADIUS servers for authentication (just my educated guess), but for those who are, it would be nice to update the sk you gave initially with the calculation that was mentioned in this post and that the DTAC esc. guy gave us as well. That way, there is no guessing what those values should be...now we know that's why it took 2 years to fix this permanently.

Anyway, not a huge huge deal, considering we had workaround when it would happen, but still...just annoying : - )

Best,

Andy

the_rock · ‎2024-01-11

This is what I was referring to @mccabe

Straight from the TAC case, by the T3 guy in Dallas. After I read it a few times, it makes total sense to me. I really hope the sk is updated with this info.

Andy

*********************

120 seconds (for auth, radius_user_timeout)
2 re-attempts per server (radius_retrant_num)
40 seconds total, for the whole auth attempt (radius_connect_timeout)
5 seconds per server (radius_retrant_timeout)

This gives the gateway 15 seconds to try the first RADIUS server (1 initial attempt and 2 re-attempts at 5 seconds each), and then it will go to the second RADIUS server for another 15 seconds (1 initial attempt and 2 re-attempts at 5 seconds each). The window for all RADIUS server attempts is 40 seconds, which gives the gateway enough time for the 30 seconds it needs to reach out to the two RADIUS servers.

********************
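The arithmetic in the TAC note can be checked with a short Python sketch. It only models the simple rule described above (each server gets one initial attempt plus the configured re-attempts, and the sum for all servers has to fit inside the overall window); it is not how the gateway is implemented. The function name and arguments are hypothetical and are named after the global properties only for readability.

```python
def radius_failover_budget(retrant_num, retrant_timeout, connect_timeout, servers=2):
    """Rough timing model taken from the TAC explanation (an assumption,
    not the gateway's real logic).

    retrant_num      -- re-attempts per server (radius_retrant_num)
    retrant_timeout  -- seconds to wait per attempt (radius_retrant_timeout)
    connect_timeout  -- seconds allowed for the whole auth attempt
                        (radius_connect_timeout)
    servers          -- number of RADIUS servers tried in priority order
    """
    # One initial attempt plus the re-attempts, each waiting the full timeout.
    per_server = (1 + retrant_num) * retrant_timeout
    # Time needed to work through every server in the group.
    needed = per_server * servers
    # Failover can only complete if that fits in the overall window.
    return per_server, needed, needed <= connect_timeout


# Values from the TAC note: 2 re-attempts, 5 s each, 40 s overall window.
print(radius_failover_budget(2, 5, 40))   # (15, 30, True)

# Same model with a 30 s per-attempt timeout and everything else unchanged.
print(radius_failover_budget(2, 30, 40))  # (90, 180, False)
```

Under this model, a 30-second per-attempt timeout would use up the whole 40-second window on the first server before the second one is ever tried, which is consistent with the advice later in the thread to leave that value at 5.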

mccabe · ‎2024-01-11

I'll chase the owner internally, Andy, and ask for something to be added as an 'example', using what you had above. Many thanks for your persistence on this.

the_rock · ‎2024-01-11

No worries mate, no rush. It's always a team effort, so thank you and the other guys who helped, along with great help from TAC, of course.

Best,

Andy

JozkoMrkvicka · ‎2024-01-23

User authentication to RADIUS server times out

Kind regards,
Jozko Mrkvicka

the_rock · ‎2024-01-24

Ironically enough, we followed that sk 2 years ago, and when I mentioned that to the esc. engineer, he told me specifically NOT to change the value from 5 to 30 seconds, like the sk states. Anyway, the issue is fixed now, that's all I really care about : - )

Best,

Andy
