Solved: Re: Radius auth failover issue

the_rock · ‎2023-12-21

Hey guys,

Happy holidays! I wanted to see if someone could provide some thoughts/suggestions on this. Our customer has 2 RADIUS servers, one on-prem and one in Azure. All of this works fine, BUT after 2 years and multiple TAC cases, we still can't solve the failover problem.

Btw, management is S1C and gateways are 6400s, R81.20 jumbo 41 (the latest)

What I mean is this: say on-prem is priority 1 and Azure is priority 2, and you shut down the on-prem server. One would think that Azure would take over, but no, auth requests still seem to go to the on-prem server, as we can clearly see by running tcpdump on port 1812. The same issue happens if Azure is the main auth server. One way to quickly work around the issue when it happens is to swap the priorities of the RADIUS servers; everything works fine after installing policy.

Also tested with both servers as priority 1, no luck.

We even set global auth to RADIUS, made sure the generic object in the legacy dashboard was set to RADIUS, and tried both "any" and a RADIUS group that contains both servers. No luck in any of these scenarios.

TAC confirmed more than once that the config is right, so it truly begs the question... WHY does the failover scenario not work? I'm not sure if anyone out there is using 2 RADIUS servers, but if you are, PLEASE let us know how you made this work (if you did lol)

Thanks again for all the suggestions!

Best,

Andy

the_rock · ‎2024-01-10

Hey guys,

Thanks to everyone who helped and responded. We were finally able to get this working with the help of TAC on a remote session. Below are the RADIUS settings in Global Properties that worked 100%: when the on-prem RADIUS server (the primary) was shut down, auth failed over flawlessly to the Azure one, and it kept working when on-prem was powered back on.

The customer was very happy it's finally fixed after 2 years.

Thanks again!

Best,

Andy

Grateful to @SenpaiNoticed_U @Chris_Atkinson @PhoneBoy @mccabe for all the advice and guidance ✌️

JozkoMrkvicka · ‎2024-01-23

User authentication to RADIUS server times out

Kind regards,
Jozko Mrkvicka

mccabe · ‎2023-12-22

Hi Andy,

I'm not sure from your post, but have you tweaked the settings for "radius_retrant_num" and "radius_retrant_timeout" as yet?

There's a long-standing SK here:

https://support.checkpoint.com/results/sk/sk42449

the_rock · ‎2023-12-22

@mccabe Thanks for the reply. As a matter of fact, that was one of the very first things we did and it did not change anything. TAC was even on the phone when it was done.

Best,

Andy

the_rock · ‎2023-12-22

Just for reference, we even tried setting both values to 1. Below are the values TAC asked us to configure: exact same issue, no change.

Best,

Andy

SenpaiNoticed_U · ‎2023-12-27

Try with radius_connect_timeout at 20 seconds,
and keep the rest of the settings the same.

Let me know what the results are.

the_rock · ‎2023-12-27

We did that a long time ago and it made absolutely no difference. We tried 5, 10, 15, 20, 30 and so on, exact same issue.

Best,

Andy

SenpaiNoticed_U · ‎2023-12-27

What about the IKED files?
Do we see it stopping and claiming all servers are down in the IKED files due to timeout?

I read that configuration as:
Attempt each server 2 times,
with 5 seconds between attempts,
and 10 seconds for the whole authentication attempt before claiming all servers are down.

That never leaves enough time to attempt a 2nd server, because the 1st server takes up (5+5) = 10 seconds.

the_rock · ‎2023-12-27

Absolutely nothing... TAC asked us for literally every log file you can imagine before and there was no solution. I think what @PhoneBoy said would explain why this does not work, but in all honesty, I find it shocking, because in my mind, it totally defeats the purpose of having 2 RADIUS servers at all.

Best,

Andy

SenpaiNoticed_U · ‎2023-12-27

Seems to work with my Test Lab,
Radius servers in a group
Radius priority 1 = 10.250.250.1
Radius priority 2 = 10.150.150.2

Here are my settings in Global Properties

SenpaiNoticed_U · ‎2023-12-27

Here is the auth page example

the_rock · ‎2023-12-27

I hear ya, lots of things work in my lab too that don't work in production lol

Andy

SenpaiNoticed_U · ‎2023-12-27

Then I suggest working on your open TAC case: showcase the issue and the traffic, and provide debugs/captures.

the_rock · ‎2023-12-27

That's the plan, yeah. But if you are willing to attach a Word doc with screenshots of your RADIUS lab servers and Global Properties settings, I am happy to suggest those to the customer next time they approve a maintenance window for this.

Best,

Andy

Chris_Atkinson · ‎2023-12-27

Just to clarify, are you authenticating VPN users or SmartConsole admins etc?

At this stage going deeper with TAC seems the likely path...

CCSM R77/R80/ELITE

the_rock · ‎2023-12-27

VPN users Chris, correct.

Andy

the_rock · ‎2023-12-27

What priorities did you give those 2 radius servers?

Andy

SenpaiNoticed_U · ‎2023-12-28

I posted my server priority in my previous post.

Radius servers in a group
Radius priority 1 = 10.250.250.1
Radius priority 2 = 10.150.150.2

It is unsupported to have 2 RADIUS servers in a group with the same priority.
Different priorities are recommended.

the_rock · ‎2023-12-28

Yeah, sorry about that, I noticed it right after I responded, my bad. Well, not sure what else to say, because that's exactly how we had it too, no difference. Btw, the Azure RADIUS server works fine, there are no issues with it, it is pingable and 100% reachable from both cluster members via BGP ExpressRoute.

Best,

Andy

SenpaiNoticed_U · ‎2023-12-28

Are any of the RADIUS servers reached over a VPN tunnel,
or are they both reachable without VPN?

If a VPN tunnel is involved, verify that RADIUS traffic is not being handled by the implied rules before VPN is considered. That means disabling the implied rules for RADIUS traffic and creating a policy rule that accepts RADIUS traffic.

the_rock · ‎2023-12-28

Glad you asked, that's a 100% valid question and totally relevant in this case. The answer is no, neither RADIUS server communicates over VPN. As mentioned before, the Azure one is reached via ExpressRoute and on-prem is reachable from their office.

Best,

Andy

the_rock · ‎2023-12-28

See, another super challenging part here is that we obviously can't expect TAC, or even ask them, to try to replicate this, because it involves an Azure RADIUS server. So as much as it's greatly appreciated that you tested this in the lab, and I'm happy it worked for you, it sadly does not represent the actual config the customer uses.

Anyway, I reached out offline to Ilya Yusupov, as he helped us big time last year for the same customer with the ISP redundancy script. I don't think we would have solved that issue any time soon without his help. Let's see what he finds out...

Best,

Andy

the_rock · ‎2023-12-28

It's worth mentioning that we also changed the generic object in the legacy dashboard under auth to use the RADIUS group rather than the single server that had been there since the beginning, but it was the exact same behavior. I was really hopeful that would make a difference, but sadly no.

Best,

Andy

SenpaiNoticed_U · ‎2023-12-28

I ran a test using your RADIUS Global Properties settings and yes, it only attempted the priority 1 server and not the 2nd,
because the settings do not allow it enough time to attempt the 2nd.

Your settings, based on your screenshot:

radius_user_timeout - Timeout interval for the user to respond to a RADIUS challenge (in seconds)
radius_retrant_num - Maximum number of connection attempts to the RADIUS server
radius_connect_timeout - Timeout interval until all RADIUS servers are considered down for this authentication attempt. (in seconds)
radius_retrant_timeout - Timeout interval for each RADIUS server connection attempt (in seconds)
radius_ignore - When handling RADIUS authentication, FireWall-1 verifies that the RADIUS attributes are RFC compliant. If your system uses non-standard RADIUS attributes, you can force FireWall-1 to ignore these attributes

Thus, you only allowed your user:

120 seconds (for auth, radius_user_timeout)
2 attempts per server (radius_retrant_num)
10 seconds total, for the whole auth attempt (radius_connect_timeout)
5 seconds per attempt (radius_retrant_timeout)

This means the RADIUS attempts can run for a total of 10 seconds, which is exactly 2 attempts of 5 seconds each.
The 1st server gets 2 attempts of 5 seconds each = 10 seconds.
10 seconds is the max, so your auth attempt ends there.
If you want to reach the other servers, you need to adjust your timers to allow enough time for all attempts on all servers.

For 2 servers at 2 attempts each with 5 seconds per attempt (20 seconds in total),
I would recommend 25 seconds for radius_connect_timeout.
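
To make the timer reasoning above concrete, here is a minimal Python sketch of that reading of the settings. It is only a model of the interpretation in this post, not the gateway's actual logic: the function name and parameters are made up for illustration, and it assumes servers are tried strictly in priority order, that every attempt uses its full timeout, and that nothing is sent once radius_connect_timeout is used up.

```python
def servers_attempted(num_servers, attempts_per_server, attempt_timeout, connect_timeout):
    """Return the 1-based positions of the servers that get at least one attempt.

    Assumption (not confirmed gateway behavior): servers are tried in
    priority order, each attempt consumes the full attempt_timeout, and no
    request is sent once the connect_timeout budget is exhausted.
    """
    reached = []
    elapsed = 0
    for server in range(1, num_servers + 1):
        for _ in range(attempts_per_server):
            if elapsed >= connect_timeout:
                return reached  # budget used up, remaining servers are skipped
            if server not in reached:
                reached.append(server)
            elapsed += attempt_timeout
    return reached


print(servers_attempted(2, 2, 5, 10))  # 10 second budget
print(servers_attempted(2, 2, 5, 25))  # 25 second budget
```

Under these assumptions the 10 second budget is spent entirely on the priority 1 server, while 25 seconds leaves room for both.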

the_rock · ‎2023-12-28

As mentioned yesterday, we tried multiple settings there and it made no difference at all.

TAC was even on the phone when we did it before.

Best,

Andy

SenpaiNoticed_U · ‎2023-12-28

I am looking at your active case ending in 6-000xxxx753.
You have not provided debugs or an active session for TAC to review the behavior.
What is your plan for next steps?

I have labbed out both my settings and your settings, and shown that it works as expected.
I would advise arranging a meeting or collecting debugs per the TAC case request.
If needed, your case owner can arrange a session with you and me so that I can showcase my lab setup.

the_rock · ‎2023-12-28

That's right, as the customer has to approve a maintenance window for this, so proper troubleshooting can be done.

You are welcome to attach your lab setup in a Word doc, just take the relevant screenshots, that's what I always do.

Best,

Andy

the_rock · ‎2023-12-28

I emailed a T3 guy in DTAC, let's see if he can do a quick Zoom remote today so I can show him the config. Here is the way I look at all this... just me personally, but in my mind, I don't think it makes any difference what those timeout values are, because at the end of the day, they would simply prolong OR shorten the time it takes for things to fail or time out. The main issue here is that failover to the working RADIUS server never happens, while logic would indicate that it simply should happen without any issues.

UNLESS, as @PhoneBoy indicated, this only takes into consideration whichever server has the higher priority, but then again, that would totally defeat the purpose of even having 2 RADIUS servers for auth to begin with.

Anyway, let's see what TAC comes back with.

Best,

Andy

the_rock · ‎2023-12-28

For what it's worth, we even had a DTAC escalation guy tell us to set radius_retrant_timeout to 5 seconds and the issue remained the same.

Best,

Andy

the_rock · ‎2023-12-28

I spoke with T3 guy Andrew and his escalation buddy Zack from DTAC on the case and they asked us to change the values below as per the screenshot, which I did, so let's see if it helps in the next maintenance window. Considering we have changed these values who knows how many times, I want to be positive it will make a difference, so let's see : - )

Best,

Andy

SenpaiNoticed_U · ‎2023-12-29

I would not set radius_retrant_timeout to 15 seconds if you have radius_connect_timeout at 40, given the number of servers and server attempts you have set.

I would do this.

120 seconds (for auth, radius_user_timeout)
2 re-attempts per server (radius_retrant_num)
40 seconds total, for the whole auth attempt (radius_connect_timeout)
5 seconds per attempt (radius_retrant_timeout)

This would give each server 3 communication attempts, each 5 seconds apart,
meaning server 1 would get 15 seconds of attempt time before moving on to the 2nd server.
The 2nd server would get its 3 attempts over another 15 seconds,
totaling 30 seconds out of the 40 seconds that are permitted (radius_connect_timeout).

So you would see a tcpdump like this if both servers are failing
(times in seconds):
00s source >>> destination_server1
05s source >>> destination_server1
10s source >>> destination_server1
15s source >>> destination_server2
20s source >>> destination_server2
25s source >>> destination_server2

Note that 5 seconds per attempt may need adjusting based on the needs of your environment.
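
The expected capture above can be reproduced with a short Python sketch. This is only an illustration of the reasoning in this post, under the assumption that the gateway sends 1 + radius_retrant_num requests to each server in priority order, spaced radius_retrant_timeout apart, and stops once radius_connect_timeout is reached. The function name is hypothetical and the output imitates the simplified listing above, not real tcpdump output.

```python
def expected_timeline(servers, retries, attempt_timeout, connect_timeout):
    """Return (seconds, server) pairs for every request that fits in the budget.

    Assumption: each server gets 1 + retries requests in priority order,
    and no request is sent once connect_timeout seconds have elapsed.
    """
    timeline = []
    elapsed = 0
    for server in servers:
        for _ in range(1 + retries):  # first attempt plus re-attempts
            if elapsed >= connect_timeout:
                return timeline
            timeline.append((elapsed, server))
            elapsed += attempt_timeout
    return timeline


for seconds, server in expected_timeline(
    ["destination_server1", "destination_server2"],
    retries=2,
    attempt_timeout=5,
    connect_timeout=40,
):
    print(f"{seconds:02d}s source >>> {server}")
```

Under the same assumptions, raising attempt_timeout to 15 while keeping 2 retries would spend the whole 40 second budget on the first server (requests at 0, 15 and 30 seconds), so the second server would never be tried.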

Follow this train of thought:
Number of RADIUS servers x (1 + radius_retrant_num) x radius_retrant_timeout = X, then radius_connect_timeout = X + 10 extra seconds
Example:
2 servers x (1 + 2) x 5 = X
2 x 3 x 5 = X
X = 30
radius_connect_timeout = 30 + 10 extra seconds = 40

*Note on radius_retrant_num:
you can set this to zero and the gateway will still attempt once.
radius_retrant_num is really the number of re-attempts, so the total is 1 + the number of retries.
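
The rule of thumb above can be written as a small Python helper. This is a sketch of this post's sizing advice only, not an official Check Point formula. The worked example arrives at 30 seconds for 2 servers, 3 attempts each and 5 seconds per attempt, so the helper multiplies the three terms and then adds the 10 second margin. The function and parameter names are made up for illustration.

```python
def recommended_connect_timeout(num_servers, retries, attempt_timeout, margin=10):
    """Worst-case time to try every server, plus a safety margin (seconds).

    Assumption: each server gets 1 + retries attempts and every attempt
    runs for the full attempt_timeout before the next one starts.
    """
    attempts_per_server = 1 + retries  # first attempt plus re-attempts
    return num_servers * attempts_per_server * attempt_timeout + margin


# 2 servers, radius_retrant_num = 2, radius_retrant_timeout = 5
print(recommended_connect_timeout(num_servers=2, retries=2, attempt_timeout=5))  # 40
```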
