Re: Domain object failure in R80.10 (sk120558)

Kaspars_Zibarts · ‎2018-01-29

Just running this past everyone in case anyone else has come across it.

DNS resolves OK manually on the CLI, but the problem is that both the SK and the SR engineer want a kernel debug that could potentially overload the FW. Considering that this is a business-critical firewall, potentially causing a $100k loss for every minute it is dead, and that it's remote, I'm very reluctant to run debugs. I asked the support engineer to come up with something else but hit a stone wall (that's a topic I really want to start: when will CP come up with better debugging?).

Doing a graceful cluster reboot (standby reboot > failover > new standby reboot) seems to have "fixed" it for now.

Anyone with better ideas regarding domain object "checks" / tricks in R80.10 before diving into kernel debug?

This is a plain firewall cluster (5900) with only the firewall and IA blades, nothing fancy.

PhoneBoy · ‎2018-01-29

Moving this to General Product Topics.

It's possible the kernel DNS lookup timed out somehow before it got a response.

Which would explain why it worked when you checked on the appliance.

I'm guessing that would show on the debugs.

Kaspars_Zibarts · ‎2018-01-29

I guess my description was not good enough: it wasn't just a temporary issue but a full stop on domain-object-based rules. They didn't work and the logs were full of those alerts, whilst manual lookup worked just fine.

PhoneBoy · ‎2018-01-30

That sounds like the DNS resolution process got hung/crashed somehow.

And yeah, we'd probably need some detailed debugs to see what's going on.

Kaspars_Zibarts · ‎2018-01-30

It turned out that we had it all over the place, on VSX firewalls running R80.10 as well as regular ones. Since I saw it on standby cluster members too, I collected a debug from one of them, and it turned out to be the same bug as in the SK.

But I would like a bit more info on the actual root cause, as the SK is very short on it. How come most firewalls have the same problem now? Is it DNS-specific? Is it specific to the actual object?

"Internal failure in DNS health check state of Domain Objects"
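To get a quick picture of how widespread the alert is, something like the rough sketch below could count it per gateway in exported log text. The export layout is purely an assumption here (each line starting with the gateway name followed by `;`), and `count_alerts` and `origin_of` are hypothetical names, so adjust the parsing to whatever your export really looks like.

```python
from collections import Counter

# Alert text exactly as quoted in this thread.
ALERT = "Internal failure in DNS health check state of Domain Objects"


def count_alerts(lines, origin_of=lambda line: line.split(";", 1)[0].strip()):
    """Count alert lines per originating gateway.

    ASSUMPTION: each exported line starts with the gateway name followed
    by ';'. Pass a different origin_of callable for other formats.
    """
    hits = Counter()
    for line in lines:
        if ALERT in line:
            hits[origin_of(line)] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample lines, only to show the shape of the input.
    sample = [
        "fw-a; alert; " + ALERT,
        "fw-b; accept; unrelated entry",
        "fw-a; alert; " + ALERT,
        "vsx-1; alert; " + ALERT,
    ]
    for gateway, count in count_alerts(sample).most_common():
        print(gateway, count)
```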

PhoneBoy · ‎2018-01-30

Looking at the various internal information I have access to, there's not much more than is in the SK.

The good news is that there is a hotfix for the issue.

Kaspars_Zibarts · ‎2018-01-30

Thanks for trying, Dameon!

Kaspars_Zibarts · ‎2018-01-31

I couldn't just give up on it, as I wanted an explanation. Eventually I pulled together information from our DNS team, who did some planned work over the weekend. DNS was not totally down, but it could apparently have had "slow" responses. It looks like that's enough to kill the DNS cache for domain objects on the majority of our firewalls.
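For anyone wanting to catch "slow" DNS from a monitoring host, here is a minimal sketch that times a lookup and classifies it. It only exercises the ordinary userspace resolver of the machine it runs on, not the gateway's own domain-object resolution, and the 2-second threshold and the name `timed_lookup` are my assumptions, not anything from the SK.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_lookup(name, timeout=2.0, resolver=socket.gethostbyname):
    """Resolve name and return (status, elapsed_seconds).

    status is 'ok', 'slow' (no answer within timeout) or 'failed'
    (the resolver raised an OSError such as socket.gaierror).
    The timeout value is an arbitrary assumption.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(resolver, name)
    try:
        future.result(timeout=timeout)
        status = "ok"
    except FutureTimeout:
        status = "slow"
    except OSError:
        status = "failed"
    finally:
        # Do not block on a lookup that is still hanging.
        pool.shutdown(wait=False)
    return status, time.monotonic() - start
```

Calling `timed_lookup("example.com")` in a loop for the FQDNs behind your domain objects, and alerting on anything other than `ok`, would at least tell you when the conditions for this bug are present.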

I was able to replicate it in the lab (sort of) by temporarily changing the DNS IP to a dummy IP address. To accelerate the process I did a cpstop/cpstart so the FW started using the new IPs, which in turn would not respond. It didn't take long before I got the same alerts there. And it seems it never recovers until you cpstop/cpstart the gateway again.

In a nutshell, if you:

- use domain objects and
- have a DNS hiccup in the network and
- start seeing DNS alerts in the logs (like the one above) and
- don't have the hotfix

then cpstop/cpstart on the gateway seems to restore the functionality, as long as DNS is functioning correctly again. In a cluster you can do a graceful cpstop/cpstart on each member (standby / fail over / new standby).
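The checklist and the cluster order above can be written down as a small sketch. This only prints a plan and runs nothing on a gateway; the function names are hypothetical, and how you trigger the failover is deliberately left as a plain-language step because it isn't specified here.

```python
def workaround_applies(uses_domain_objects, dns_hiccup_seen,
                       dns_alerts_in_logs, hotfix_installed):
    """True when all four conditions from the checklist are met."""
    return (uses_domain_objects and dns_hiccup_seen
            and dns_alerts_in_logs and not hotfix_installed)


def restart_plan(active, standby, dns_healthy):
    """Return the ordered steps for a graceful cluster restart.

    The restart only seems to help once DNS is answering properly again,
    so an unhealthy DNS yields a single 'fix DNS first' step.
    """
    if not dns_healthy:
        return ["fix DNS first; a restart will not help while DNS is broken"]
    return [
        f"{standby}: cpstop; cpstart",
        f"fail over from {active} to {standby}",
        f"{active}: cpstop; cpstart (it is the standby now)",
    ]


if __name__ == "__main__":
    # fw1/fw2 are placeholder member names.
    if workaround_applies(True, True, True, False):
        for step in restart_plan("fw1", "fw2", dns_healthy=True):
            print(step)
```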

PhoneBoy · ‎2018-01-31

Appreciate your diligence in tracking this down.

phlrnnr · ‎2018-11-15

Kaspars Zibarts, was TAC able to tell you what the hotfix actually did? Does it monitor the service and restart it if it crashes?

Kaspars_Zibarts · ‎2018-11-15

Unfortunately I couldn't find an explanation in the SR, even though I asked for it. Just hotfixes.

phlrnnr · ‎2018-11-15

Does this bug exist in R80.20 as well? The SK only shows R80.10 as being affected, but I'm seeing similar symptoms in R80.20.

More specifically, I'm seeing something similar with R80.20 Mgmt and R80.10 + Jumbo 112 VSX GWs.

PhoneBoy · ‎2018-11-15

The gateway in your example is still R80.10, which is where the name resolution takes place.

phlrnnr · ‎2018-11-15

OK. Well, the SK says it is fixed in R80.10, Take 42. Since this is running Take 112, I guess I'll reach out to TAC. Thanks.

Ray_Xiao · ‎2019-06-13

Did anyone try doing a DNS flush?
