1100 resource exhaustion after moving to R80.40?

Ted_Serreyn · ‎2020-10-28

So I wanted to post this to let people know I had a customer experience this after upgrading to R80.40.

Starting point:

Management on R80.30 with a recent jumbo hotfix at the time of upgrade.

1100 with R77.20.80 in the field as a remote office gateway. Centrally managed, with only the FW, VPN and IPS blades enabled. IPS was tuned for SMB/older firewall versions: high-CPU-impact protections disabled, others adjusted to minimize impact.

VPN, local connectivity, and remote connectivity to the 1100 appliance all worked.

Upgraded management to R80.40 JHF 77 (may have even been one GA patch earlier).

Started experiencing odd problems including but not limited to:

VPN drops permanently; a reboot of the box sometimes fixed it.

Loss of local connectivity, including ICMP ping and web management.

Inability to login remotely via web.

Inability to login remotely via ssh.

Login succeeded, but with an error that no role was assigned to the user; the appliance then stated that it needed to run the initial configuration again.

Most commands would fail in this state:

vpn11-test1100 login: admin

Password:

Role is not assigned to user

vpn11-test1100> top

Unexpected error: /usr/local/share/lua/5.1/sys/permissions.lua:0: attempt to index upvalue '' (a nil value)

vpn11-test1100> top

Unexpected error: /usr/local/share/lua/5.1/sys/permissions.lua:0: attempt to index upvalue '' (a nil value)

vpn11-test1100> expert

Unexpected error: /usr/local/share/lua/5.1/sys/permissions.lua:0: attempt to index upvalue '' (a nil value)

Reboots would sometimes fix the problem, so we limped along.

So my initial conclusion after doing some troubleshooting was that maybe this was a hardware issue, so we swapped it with a spare 1100 appliance.

Initially looked like it was going to work, but within a few hours saw the same issues.

Finally decided to swap (AGAIN?!) for a 1400 series appliance, same policy same network configuration, same blades enabled.

Worked perfectly!

Had customer ship me both boxes to put online locally where I could see serial consoles.

Wiped boxes to factory default with newer firmware, so newer firmware was the factory default. Configured box with basic ip connectivity and connected to the internet.

No issues.

Connected box to central management, got policy.

On the console I started seeing things like this:

vpn11-test1100 login: Out of memory: kill process 3263 (fw) score 3475 or a child

Killed process 3264 (fw)

Out of memory: kill process 3263 (fw) score 3056 or a child

Killed process 3263 (fw)

Out of memory: kill process 1832 (fw) score 1528 or a child

Killed process 1832 (fw)

Out of memory: kill process 2256 (fw) score 1528 or a child

Killed process 2256 (fw)
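For anyone who wants to tally this kind of output from a saved serial console capture, here is a minimal sketch in Python. It assumes the console text has been copied to a workstation (it is not meant to run on the appliance) and that the lines follow the same format as the ones pasted above. The function name `summarize_oom` and the `SAMPLE` text are my own, purely for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Out of memory: kill process 3263 (fw) score 3475 or a child"
OOM_RE = re.compile(r"Out of memory: kill process (\d+) \(([^)]+)\) score (\d+)")
# Matches lines such as "Killed process 3264 (fw)"
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def summarize_oom(console_text):
    """Return (selected, killed) from captured console text.

    selected: list of (pid, process name, score) for each OOM selection line.
    killed:   Counter of process name -> number of "Killed process" lines.
    """
    selected = []
    killed = Counter()
    for line in console_text.splitlines():
        match = OOM_RE.search(line)
        if match:
            selected.append((int(match.group(1)), match.group(2), int(match.group(3))))
            continue
        match = KILLED_RE.search(line)
        if match:
            killed[match.group(2)] += 1
    return selected, killed


# Sample input: the console lines quoted in this post.
SAMPLE = """\
vpn11-test1100 login: Out of memory: kill process 3263 (fw) score 3475 or a child
Killed process 3264 (fw)
Out of memory: kill process 3263 (fw) score 3056 or a child
Killed process 3263 (fw)
Out of memory: kill process 1832 (fw) score 1528 or a child
Killed process 1832 (fw)
Out of memory: kill process 2256 (fw) score 1528 or a child
Killed process 2256 (fw)
"""

if __name__ == "__main__":
    selected, killed = summarize_oom(SAMPLE)
    for name, count in killed.items():
        print(f"{name}: {count} kills")
    for pid, name, score in selected:
        print(f"selected pid={pid} name={name} score={score}")
```

In the capture above every kill hits an fw process, which at least shows where the memory pressure lands right after the policy arrives.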

At this time I opened a case with TAC. (I can provide SR if interested).

After some basic diagnostics (do we not have any tools other than memtest.sh?) it was decided to RMA both boxes.

At the same time, we also ordered new 1530 appliances to see if they would have the same issue.

New boxes arrived. The 1530 came up just fine and has had no stability issues so far.

Plugged in the new RMA 1100 box and it experienced the same memory issues.

So the questions are:

1. Why are there no tools to better diagnose problems on the SMB (sometimes used even in enterprise) firewalls?

2. Why is there an increase in resource usage on 1100 appliances when centrally managed?

3. Is there any reason one should even run the 1100 appliance with R80.40?

4. Is there a way out of this situation? Or is hardware upgrade the only solution?

5. Is there a problem with my testing methodology?

PhoneBoy · ‎2020-10-28

One question: did you push policy to the 1100 AFTER upgrading the management to R80.40?

Ted_Serreyn · ‎2020-10-28

Yes, and that seems to be when the problems start.

PhoneBoy · ‎2020-10-28

That seems to point to an issue with the policy compiled by the backward compatibility package for R80.40.
The 1400 and 1500 have a bit more memory than the 1100 series appliances also (and use different BC packages for compilation).
TAC will definitely have to investigate this.

G_W_Albrecht · ‎2020-10-29

1. Why are there no tools to better diagnose problems on the SMB (sometimes used even in enterprise) firewalls?

There is a variety of SMB tools, from monitor/spike scripts to debug firmware!

2. Why is there an increase in resource usage on 1100 appliances when centrally managed?

This depends on the rulebase used and on the kind and number of objects, among other things. Locally managed, the rulebase will usually be much less complicated...

3. Is there any reason one should even run the 1100 appliance with R80.40.

You cannot install R80.40 on SMB appliances - but R80.40 Management should work. You can also stay with R80.30 if you do not need additional features...

4. Is there a way out of this situation? Or is hardware upgrade the only solution?

I would try to streamline the rules used on the SMB install targets to save memory - but the 1100's memory footprint is rather small, as it was released in May 2013.

Engineering Support for the 1100 ended last June, and in June 22 all support will end - this is a good reason for a hardware upgrade.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist
