Kaspars_Zibarts
Employee

Domain object failure in R80.10 (sk120558)

Just running this past everyone in case someone else has come across it.

DNS resolves OK manually on the CLI, but the problem is that both the SK and the SR engineer want a kernel debug that could potentially overload the FW. Considering that this is a business-critical firewall that could cause a $100k loss for every minute it is down, and that it's remote, I'm very reluctant to run debugs. I asked the support engineer to come up with something else but hit a stone wall (that's a topic I really want to start: when will CP come up with better debugging :) )

Doing a graceful cluster reboot (standby reboot > failover > new standby reboot) seems to have "fixed" it for now.

Anyone with better ideas regarding domain object "checks" / tricks in R80.10 before diving into a kernel debug?

This is a plain firewall cluster (5900) with only the firewall and IA blades, nothing fancy.

14 Replies
PhoneBoy
Admin

Moving this to General Product Topics

It's possible the kernel DNS lookup somehow timed out before it got a response, which would explain why it worked when you checked on the appliance.

I'm guessing that would show on the debugs.

Kaspars_Zibarts
Employee

I guess my description was not good enough: it wasn't just a temporary issue but a full stop on rules based on domain objects. They didn't work, and the logs were full of those alerts, whilst manual lookup worked just fine.
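
A quick way to gauge how widespread the alerts are is to search exported log text for the message. This is a minimal sketch, assuming the logs have been exported to plain text with one entry per line. The export format, the file name `exported_logs.txt`, and the function names are assumptions for illustration; the alert text is the one quoted later in this thread.

```python
# Alert text as quoted in this thread; matching is case-insensitive.
ALERT_TEXT = "Internal failure in DNS health check state of Domain Objects"


def find_alerts(lines, needle=ALERT_TEXT):
    """Return the lines that contain the alert text."""
    needle = needle.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def summarize(path):
    """Print and return how many matching entries a plain-text export contains."""
    # Hypothetical input: one log entry per line, exported by whatever means you use.
    with open(path, encoding="utf-8", errors="replace") as handle:
        matches = find_alerts(handle)
    print(f"{len(matches)} matching entries in {path}")
    if matches:
        print("first:", matches[0])
        print("last: ", matches[-1])
    return len(matches)
```

Calling `summarize("exported_logs.txt")` per gateway gives a rough count to compare across firewalls; it says nothing about the root cause.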

PhoneBoy
Admin

That sounds like the DNS resolution process got hung/crashed somehow.

And yeah, we'd probably need some detailed debugs to see what's going on.

Kaspars_Zibarts
Employee

It turned out that we had it all over the place, including VSX firewalls running R80.10 and regular ones. Since I saw it on standby cluster members too, I collected a debug from one of them, and it turned out to be the same bug as in the SK.

But I would like a bit more info on the actual root cause, as the SK is very short on it. How come most firewalls have the same problem now? Is it DNS specific? Is it specific to the actual object?

"Internal failure in DNS health check state of Domain Objects"

PhoneBoy
Admin

Looking at the various internal information I have access to, there's not much more than is in the SK.

The good news is that there is a hotfix for the issue.

Kaspars_Zibarts
Employee

Thanks for trying Dameon!

Kaspars_Zibarts
Employee

Couldn't just give up on it as I wanted the explanation. Eventually I pulled together information from our DNS team, who did some planned work at the weekend, which meant that DNS was not totally down but could apparently have had "slow" responses. And it looks like that's enough to kill the DNS cache for domain objects on the majority of our firewalls.

I was able to replicate it in the lab (sort of) by temporarily changing the DNS IP to a dummy IP address. To accelerate the process I did a cpstop/cpstart so the FW started using the new IPs, which in turn would not respond. It didn't take long before I got the same alerts there. And it seems like it never recovers from it until you cpstop/cpstart the gateway again.

In a nutshell, if you

  • use domain objects and
  • have a DNS hiccup in the network and
  • start seeing DNS alerts in the logs (like the one above) and
  • don't have the hotfix

then a cpstop/cpstart on the gateway seems to restore the functionality, as long as DNS is functioning correctly again. In a cluster you can do a graceful cpstop/cpstart on each member (standby / failover / standby).
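
Before restarting anything, it may help to confirm that DNS really is answering promptly again. This is a minimal sketch, assuming it runs on a host that uses the same DNS servers as the gateway. It does not reproduce the gateway's own resolution of domain objects, and the 1-second threshold and the function names are arbitrary choices for illustration.

```python
import socket
import time


def check_dns(names, resolver=socket.gethostbyname, slow_after_s=1.0):
    """Resolve each name once and classify it as 'ok', 'slow' or 'failed'.

    Returns a dict mapping name -> (status, elapsed_seconds).
    """
    results = {}
    for name in names:
        start = time.monotonic()
        try:
            resolver(name)
            resolved = True
        except OSError:  # socket.gaierror is a subclass of OSError
            resolved = False
        elapsed = time.monotonic() - start
        if not resolved:
            status = "failed"
        elif elapsed > slow_after_s:
            status = "slow"
        else:
            status = "ok"
        results[name] = (status, elapsed)
    return results


def dns_looks_healthy(results):
    """True only if every name resolved within the threshold."""
    return all(status == "ok" for status, _ in results.values())
```

For example, `dns_looks_healthy(check_dns(["example.com"]))` with the names used in your own domain objects in place of `example.com`. A "slow" or "failed" result suggests waiting for DNS to settle before restarting, since the restart only seems to help once DNS is working again.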

PhoneBoy
Admin

Appreciate your diligence in tracking this down. :)

phlrnnr
Advisor

Kaspars Zibarts, was TAC able to tell you what the hotfix actually did?  Does it monitor the service and restart it if it crashes?

Kaspars_Zibarts
Employee

Unfortunately I couldn't find an explanation in the SR, even though I asked for it. Just hotfixes.

phlrnnr
Advisor

Does this bug exist in R80.20 as well?  The SK only shows R80.10 as being affected, but I'm seeing similar symptoms in R80.20.

More specifically, I'm seeing something similar with R80.20 Mgmt and R80.10 + Jumbo 112 VSX GWs.

PhoneBoy
Admin

The gateway in your example is still R80.10, which is where the name resolution takes place.

phlrnnr
Advisor

OK. Well, the SK says it is fixed in R80.10, Take 42. Since this is running Take 112, I guess I'll reach out to TAC. Thanks.

Ray_Xiao
Explorer

Did anyone try a DNS flush?
