Solved: Re: Virtual Firewalls running R80.10 take 154 Hang...

Tom_Stala · ‎2018-12-26

Virtual Firewalls running R80.10 take 154

We are having issues with the firewall where we need to run zdebug drop to see whether traffic is actually being dropped.

It seems that when we run these commands, the firewall becomes unresponsive:

fw ctl debug 0

fw ctl debug -buf 32000 -v 4

We can still run top or cpstop.

We can look at directories.

But cpview and cpinfo do not work.

Nothing looks stuck at 100% in top.

The boxes are 15600s and they are barely loaded.

There are no crash or dump files created.

Even fw monitor does not show anything, and traffic that should be passing can no longer pass.

When we issue a cpstop, the firewall fails over to one of the other two VMs and everything starts to work.

We have even had this happen on a troubleshooting call with a Check Point tech running the commands; there is just no way to get any information from the failing firewall.

I guess we are wondering if anyone else has had this issue.

Yes, we do have tickets open.

PhoneBoy · ‎2018-12-26

The secondary gateway doesn't exhibit the same behavior after failing over?

Does issuing the debug commands on the secondary cause the same symptoms as on the primary?

Tom_Stala · ‎2018-12-26

After giving up on debugging the issue, we leave it alone and let business happen.

But we have had all three VMs exhibit this same issue, so I am thinking that if we were to run it on the other box after the move, it would do the same.

It has us at a point where we don't even want to log in to the devices lol.

We ran into this issue while trying to troubleshoot our Stealth rule dropping proxy traffic that is heading out to the internet.

I have checked, and the logs say the traffic is going out but zdebug says it is being dropped, so Check Point wants to run a debug. Since we have had so many strange issues, we asked Check Point to assist.

As soon as they ran the buffer command, it stopped responding to commands:

fw ctl debug 0

fw ctl debug -buf 32000 -v 4

PhoneBoy · ‎2018-12-26

Rather than assuming what would happen on the other node on a fail over, you should actually do it and confirm what the behavior is.

Also, please send me the case # in a PM.

Hugo_vd_Kooij · ‎2018-12-28

If increasing the buffer size causes issues, I would start by looking at things like:

fw ctl pstat

It seems to indicate the unit is running out of some kind of memory when you increase the debug buffers.

Tom_Stala · ‎2018-12-28

It would seem to be that, but we have 64 GB total and 55 GB free.

These are model 15600 appliances.

PhoneBoy · ‎2018-12-28

It's also possible it may be some sort of memory-related hardware issue.

This is one of the reasons I suggest actually failing over and seeing if the problem reproduces.

Tom_Stala · ‎2019-01-02

It has happened on two of the three machines.

We are in a change freeze right now; we are looking at a date next week.

We have a two-hour window to troubleshoot the issue. I have someone from Check Point who is going to be on the call.

We are going to reboot all three machines at the same time.

So I am looking for ideas on how to troubleshoot this issue.

I am looking at getting some tools, like cpview and top, up and running before we set the buffer size.

But we have seen this on nodes one and three.

Thanks for any suggestions; they are all welcome.

Tom

Tom_Stala · ‎2019-01-02

I was asked to check vs_bits -stat.

Houston, I think we have a problem lol ;?)

[Expert@sef-cp-vsxn1:0]# vs_bits -stat

All VSs are at 32 bits

[Expert@sef-cp-vsxn1:0]# vsenv 4

Context is set to Virtual Device sef-cp-vsxn1_SEF-CP-VS-Internet (ID 4).

[Expert@sef-cp-vsxn1:4]# vs_bits -stat

All VSs are at 32 bits

[Expert@sef-cp-vsxn1:4]#

PhoneBoy · ‎2019-01-02

I don't think it would hurt anything to change VS Bits to 64 (though it requires a reboot, so you'd have to do it in an outage window).

Tom_Stala · ‎2019-01-03

Yeah, next week we will find out. We are going to change to 64-bit, reboot, and then run the debugs.

I will keep the -vs option in mind; if we have an issue, we will change that.

Tom_Stala · ‎2019-02-18

So we are going to R80.20 for the Microsoft Office 365 stuff.

We are also hoping that the SecureXL changes fix the 'connection not found' drops we are seeing (sk101134).

We still have to shuffle the firewalls around. If we do not, we start having issues down the road, as if there is a memory leak of some kind.

Or maybe some table is not being cleaned up and it runs out of room. We are not sure, but it will run without any issues for a few days and then it just starts dropping traffic.

fw ctl zdebug drop | grep ' connection not found'

[kern];[tid_29];[fw4_0];fw_log_drop_conn: Packet <dir 1, someIP -> SomeIP>, dropped by handle_outbound_pac, Reason: connection not found;
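If the zdebug drop output is saved to a file, the drop reasons can be tallied offline instead of watched live. Below is a minimal Python sketch, assuming every drop line has the same shape as the single sample line quoted above; the regular expression and the function name `tally_drops` are illustrative and not part of any Check Point tool.

```python
import re
from collections import Counter

# Pattern modeled only on the one sample line in this thread (assumption):
# "... dropped by <function>, Reason: <reason>;"
DROP_RE = re.compile(r"dropped by (?P<func>\w+), Reason: (?P<reason>[^;]+);")


def tally_drops(lines):
    """Count (dropping function, reason) pairs in saved zdebug drop output."""
    counts = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            key = (match.group("func"), match.group("reason").strip())
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[kern];[tid_29];[fw4_0];fw_log_drop_conn: Packet <dir 1, someIP -> SomeIP>, "
        "dropped by handle_outbound_pac, Reason: connection not found;",
    ]
    for (func, reason), count in tally_drops(sample).most_common():
        print(f"{count:6d}  {func}: {reason}")
```

Counting per reason over a few days of captures may help show whether 'connection not found' grows over time, which is what a leaking or filling table would look like.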

SSlater · ‎2019-01-02

Hey @Tom Stala,

A debug buffer command should not cause the lack of response that you mention.

No additional flags have been enabled and no debug output is being generated, so there is no increased load.

If we saw that the load increased, or the device became unresponsive after adding debug flags, I would agree with the above conclusions.

I would suggest that the command syntax is not correct:
"-v 4"
I've seen in several SKs that we need "-vs <VSID>" rather than "-v".

- Verify this with TAC; they may have provided bad syntax.

I wouldn't worry about your 32-bit VSs, as this is the default for the systems.
(Provided you're not running higher than 4GB of memory, or 2TB of disk space per VM, it should never be an issue in VSX.)

- Additionally, the majority of our user-mode processes are still 32-bit.
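For reference, the 4GB figure lines up with the size of a 32-bit address space (2^32 bytes). The short Python sketch below only does that arithmetic; how much of that space a 32-bit virtual system can actually use is not stated in this thread, so treat it as an upper bound, not a measured limit.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def max_addressable_gib(bits):
    """Upper bound on the address space for a given pointer width, in GiB."""
    return 2 ** bits / GIB


if __name__ == "__main__":
    installed_gib = 64  # total memory reported for the 15600s in this thread
    limit = max_addressable_gib(32)
    print(f"32-bit address space: {limit:.0f} GiB")
    print(f"Installed memory is {installed_gib / limit:.0f}x that size")
```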

Tom_Stala · ‎2019-01-03

We are running a lot more memory on the firewalls, so I am thinking this might be an issue for our config.

Tom_Stala · ‎2019-01-03

The device did become unresponsive after running the buffer size command, twice.

This was run by Check Point, and it failed right after that.

Timothy_Hall · ‎2019-01-03

The behavior you are describing sure sounds like a shortage of memory to me: either the kernel itself can't get enough, or the kernel is utilizing a large percentage of RAM, leaving very little left over for processes to use. If the latter is occurring, you should see a high percentage of wio in top as the system transfers process memory pages to and from disk in an attempt to free up memory.

Either way, setting for 64-bit should help. A lot.
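If top itself becomes awkward to use during the hang, the same I/O-wait percentage can be derived from two samples of the aggregate "cpu" line in /proc/stat. This is a minimal Python sketch, assuming a standard Linux /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal) is readable on the gateway; the function names are illustrative.

```python
def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in parts[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


if __name__ == "__main__":
    import time

    def read_cpu_line():
        with open("/proc/stat") as stat_file:
            return stat_file.readline()

    first = read_cpu_line()
    time.sleep(5)  # sampling interval chosen arbitrarily
    second = read_cpu_line()
    print(f"iowait over interval: {iowait_percent(first, second):.1f}%")
```

A value that stays high while the box looks otherwise idle would be consistent with the paging scenario described above; a value near zero would point elsewhere.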

Tom_Stala · ‎2019-01-10

So this morning we changed VS bits to 64, and it still locks up when we set the buffer size.

We ran cpstop and then cpstart.

When it came back up, it locked up again.

We then set the buffer size to 0 and it all started working again.

This is the third time we have had Check Point on a call to work with us, and we have finally gotten agreement that this needs to go to R&D.

Support ain't what it used to be.

Tom_Stala · ‎2019-01-14

More testing and more failures trying to run debug.

We set the buffer size to 9086.

Then we tried to set some flags, and that is where it actually fails: when plugging the flag in.

The tech tried to run just the bare-minimum debug, and that locks it up to the point where we have to run cpstop and cpstart before the firewall will respond to Check Point commands again.

cpview and cpinfo give nothing back; they act like they are waiting for a response.

Interestingly enough, we can run cpstop and cpstart.

Tom_Stala · ‎2019-01-29

Still no answer as to why running even the most minimal debug locks the OS up so that we have to run cpstop and cpstart.

We have added a line to the fwkern.conf file to keep the firewall from dropping the reverse lookups.

Also, we removed all of the objects that are not FQDN.

We are looking to go to R80.20 for the Microsoft 365 (cough) stuff.

Tom_Stala · ‎2019-03-06

Upgrading the management server and then the firewalls has fixed the issue with debug. We are now able to run debug on the firewalls.

Dmitry_Krupnik · ‎2019-01-30

Hello Tom,

Please tell me the status of the TAC ticket that was opened. Could you provide its number?

Regards,

Dmitry

Tom_Stala · ‎2019-02-18

Sorry for the delay in the reply. <6-0001071745>

We are still unable to run debug of any kind. It pretty much locks up, but we are able to run cpstop and cpstart and get it back to running.

We have discovered this error:

sk101134

SecureXL drops traffic with "... dropped by handle_outbound_pac, Reason: connection not found"

Tom_Stala · ‎2019-03-06

Upgrading from R80.10 to R80.20 has fixed this issue. It was not the primary reason for the upgrade; Office 365 was.

But this seems to have solved the issue with running debug, and the firewalls seem stable now. It has been a couple of weeks with no real issues.

For some reason we had to exclude a subnet from SecureXL, but we are not concerned with that.
