Jan_Kleinhans
Collaborator

Virtual Standby member cannot reach internal DNS or Internet


Hello,

After upgrading to R80.40 HFA 48 we encountered the following problem.

If the standby member of a VS (VS2, for example) tries to reach a system, for example the internal DNS server, it doesn't work.

In the log we can see that the packet is not sent from the interface of the Virtual System but instead goes out through an interface of VS0 (in this example the Mgmt interface).

As a result the standby member gets a Threat Emulation error because it cannot reach the DNS server or anything else.

When we change the standby member to the active state, we get "Firewall - Domain resolving error. Check DNS configuration on the gateway (0)" errors in the log and Internet access is disrupted. The new standby member then has the same issue as the old standby member and cannot reach any address. We did not test whether the new standby member recovers after some time, as this was in production.

 

 

 

Has anybody else such a problem?

 

Best regards,

 

Jan

2 Solutions

Accepted Solutions
Jan_Kleinhans
Collaborator

After hours of investigation a Check Point engineer disabled the new R80.40 routing behaviour with the fwkern entry:

fwha_cluster_hide_active_only=0

This works as a workaround. Check Point is also working on a hotfix.
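For reference, a sketch of applying this workaround both at runtime and persistently. The fwkern.conf path is the standard one mentioned later in this thread; note that a reboot may still be required for it to take effect:

```shell
# Apply at runtime on the gateway (does not survive a reboot)
fw ctl set int fwha_cluster_hide_active_only 0

# Persist the value across reboots via fwkern.conf
echo 'fwha_cluster_hide_active_only=0' >> $FWDIR/boot/modules/fwkern.conf

# Verify the current value
fw ctl get int fwha_cluster_hide_active_only
```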

 

 


genisis__
Advisor

I've just installed R80.40 with JHFA 91, and it appears this is still an issue. The default value is fwha_cluster_hide_active_only=1. After changing it to 0, access to the Internet from the gateway works, so the standby member can now get its AV/ABOT updates.

Also for reference sk169154.


20 Replies
PhoneBoy
Admin
Jan_Kleinhans
Collaborator
Hi,

yes, it didn't work. And the problem got bigger: if we fail over to the standby member, traffic stops working because of thousands of RAD errors. Only if we stop the now-standby member does traffic flow as expected.
We opened a TAC case, as we are now working with one member only.

Regards,

Jan
Chris_Atkinson
Employee

Were you able to resolve this? What was the solution?

Jan_Kleinhans
Collaborator

No, we didn't.

The case is open but there is no real progress at the moment.

We found the following SK:

sk168075 (created 5 days ago). It says that reaching the Internet or anything else hasn't been possible since R80.10. But it worked up to R80.30; we have another cluster on R80.20 where there is connectivity to DNS etc.

At the moment we have to debug TED on the standby member, but the normal ted.elg already shows that name resolution fails:

 

gethostbyname() failed for: threat-emulation.checkpoint.com

 

 

Chris_Atkinson
Employee

Kindly PM the SR number and I will take a look, thanks.

Jan_Kleinhans
Collaborator

Thank you for your offer.

It's already under investigation. But I think we will have to revert to R80.30.

 

Henrik_Noerr1
Collaborator

I'm sure you checked, but just to stress it: the standby node is not covered by the implied rules.

So you have to explicitly allow the other member in the active member's rulebase - which is of course the same 🙂

If not, you will see a drop on the active node with the standby member as source.


Attiq786
Participant

Hi,

I have exactly the same issue: one of my VSs (R80.40 Take 89) has no Internet access on the standby member. I have added the fwkern entry as per the attached, but no change. Do we have to reboot the cluster member? I tried cpstop;cpstart as well.

Attiq786
Participant

OK, sorted. It needed a reboot 🙂

Attiq786
Participant

It was strange though. I have 3 other VSs and none of them has this issue on the standby member, except this one, which was newly created.

Chris_Wilson
Contributor

So, I had this same problem on two clusters of 23800s (R80.40), and adding fwha_cluster_hide_active_only=0 fixed it. The odd thing is that I also have two clusters of 5200s on R80.40, and they worked like normal and didn't need the fix. One difference was that the 5200s weren't running TE or TX, whereas the others were. Not sure why the inconsistency.

Dale_Lobb
Collaborator

We also ran into this problem on upgrade from R80.20 to R80.40. The upgrade release notes do not tell you that the parameter fwha_cluster_hide_active_only is now set to 1 by default. The issue is that while the standby cluster members now forward packets to the active member, there are no implied rules to allow this traffic. Adding explicit access policy rules to allow the cluster members to accept and forward packets for each other fixed the issue for us and left the parameter turned on, as the R80.40 upgrade intends.

The new behaviour is sort of documented, in an implied way, through a chain of SKs: sk169154, sk167874 & sk169975.

Dale_Lobb
Collaborator

In addition, we ran into a slight twist on this issue. We used the Multi-Version Cluster (MVC) upgrade option. While MVC was on, the parameter fwha_cluster_hide_active_only was set to "1" on all FW worker cores except one, where it was set to "0". Apparently it was being reset to "0" on that FW worker core after initial boot, but before the boot process ended. Adding "fwha_forw_packet_to_not_active=1" to $FWDIR/boot/modules/fwkern.conf did not resolve the problem. TAC gave us an update to the startup script "/opt/CPsuite-R80.40/fw1/bin/fwstart" to force it back to "1" later in the boot process so that all FW workers behave the same. An email I have from TAC says this bug is labelled "PRJ-20491" and will be fixed in a future HFA for R80.40.

To check whether you have run into this bug in your own R80.40 upgrade, use the "fw -i" option to query the value of the parameter on each FW worker core:

# fw -i <fw_worker_number> ctl get int fwha_cluster_hide_active_only

Example: fw -i 0 ctl get int fwha_cluster_hide_active_only (to see the value set for FW Worker 0)
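With several workers, the per-worker check above can be wrapped in a small loop; a sketch, where the worker count of 4 is an assumption (check your own with `fw ctl multik stat`):

```shell
# Query the parameter on each FW worker instance so mismatches stand out.
WORKERS=4   # assumption: number of CoreXL FW instances on this member
for i in $(seq 0 $((WORKERS - 1))); do
    echo -n "worker $i: "
    fw -i "$i" ctl get int fwha_cluster_hide_active_only
done
```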

Darina2019
Explorer

Hi Dale,

In case you want to set that parameter on the fly (which will not survive a reboot), what is the exact command?

Getting it is clear:

# fw -i <fw_worker_number> ctl get int fwha_cluster_hide_active_only

Example: fw -i 0 ctl get int fwha_cluster_hide_active_only (to see the value set for FW Worker 0)

What is the command when you want to set it?

I cannot find anything about setting it in the SKs.

Thanks!

Thanks!

Dale_Lobb
Collaborator

Hi Darina,

  You can set the parameter for all cores with the command:

    fw ctl set int fwha_cluster_hide_active_only 1

  Or for individual cores (workers) via:

    fw -i <worker #> ctl set int fwha_cluster_hide_active_only 1
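In the PRJ-20491 scenario described earlier in this thread, only one worker drifts from the desired value, so a conditional per-worker re-set can help; a sketch, where the worker count and the output parsing are assumptions (check `fw ctl multik stat`, and the exact `fw ctl get int` output format on your version):

```shell
# Re-apply the desired value only on workers where it has drifted.
WORKERS=4    # assumption: number of CoreXL FW instances
DESIRED=1
for i in $(seq 0 $((WORKERS - 1))); do
    # assumption: the value is the last field of the "ctl get int" output
    cur=$(fw -i "$i" ctl get int fwha_cluster_hide_active_only | awk '{print $NF}')
    if [ "$cur" != "$DESIRED" ]; then
        fw -i "$i" ctl set int fwha_cluster_hide_active_only "$DESIRED"
        echo "worker $i: reset to $DESIRED"
    fi
done
```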

Best Regards,

Dale

Chris_Wilson
Contributor

So, I thought I would post this. After my problem I had a case open with Check Point, and the engineer gave me the following info:

In previous versions a workaround was done by disabling cluster NAT for local connections; with fwha_cluster_hide_active_only=1, this workaround should be deleted.

For example, the workaround from this CheckMates thread (which was advised for R80.30) is not good for R80.40.

I will explain to make the designs clearer:

1) New R80.40 design, fwha_cluster_hide_active_only=1:

Standby -> Sync -> Active member -> goes out with cluster VIP (source) -> Peer gets the packet -> Peer responds to cluster VIP, i.e. the Active member -> forwards to Standby over Sync

2) Old design, fwha_cluster_hide_active_only=0:

Standby -> goes out with cluster VIP (source) -> Peer gets the packet -> Peer responds to cluster VIP, i.e. the Active member -> forwards to Standby over Sync

3) Old design, fwha_cluster_hide_active_only=0 + special cases like disabling cluster NAT:

Standby -> goes out with the physical Standby IP (cluster NAT disabled) -> Peer gets the packet -> Peer responds to the Standby physical IP -> Standby

We moved to case 1 as the default because it works for all topologies.
Cases 2 and 3 have problems with some topologies, as explained in sk169154 -> 3.4.

Dale_Lobb
Collaborator

That is essentially what I was told in my TAC case as well.

The bigger issue is that this became the default in R80.40 without being mentioned in the release notes. It also requires some sort of rulebase support to allow the active firewall to forward packets for the passive nodes; there does not appear to be any implied rule to allow the traffic.

Then there is also the PRJ-20491 issue, with fwha_cluster_hide_active_only getting set back to "0" for one or more firewall workers if you use the Multi-Version Cluster upgrade option, which, as far as I know, has not yet been resolved. At least, it is not yet listed in the R80.40 HFA list of fixes.

 

Jan_Kleinhans
Collaborator

Hello,

does anybody have the "new way" running with URLF/AV/AB on VSX VSLS? I tried to revert to the default fwha_cluster_hide_active_only=1, but we always have RAD problems when we do a failover.

Our VS0 has three interfaces:
Sync
Mgmt: only DNS and management are reachable from this interface
Internet: interface for reaching the Internet without a proxy.

If we move VS2 (which has URLF etc.) to standby member2, we run into RAD timeouts. If we do a cpstop on member1's VS0 so that all virtual systems move to member2, everything works without problems.

Regards,

Jan

 
