I wanted to post this to let people know about a problem one of my customers ran into after upgrading to R80.40.
Starting point:
R80.30 with a recent jumbo hotfix at the time of the upgrade.
1100 appliance running R77.20.80 in the field at a remote office. Centrally managed, with only the FW, VPN and IPS blades enabled. IPS was tuned for SMB/older firewall versions: high CPU impact protections were disabled and the others adjusted to minimize impact.
VPN, local connectivity and remote connectivity to the 1100 appliance all worked.
Upgraded to R80.40 JHF 77 (may have even been one GA patch earlier).
Started experiencing odd problems including but not limited to:
VPN drops permanently; a reboot of the box sometimes fixed it.
Loss of local connectivity, including ICMP ping and web management.
Inability to login remotely via web.
Inability to login remotely via ssh.
Login succeeded, but with an error about a role not being assigned to the user; the appliance then stated that it needed to run the initial configuration again.
Most commands would fail in this state:
vpn11-test1100 login: admin
Password:
Role is not assigned to user
Role is not assigned to user
vpn11-test1100> top
Unexpected error: /usr/local/share/lua/5.1/sys/permissions.lua:0: attempt to index upvalue '' (a nil value)
vpn11-test1100> top
Unexpected error: /usr/local/share/lua/5.1/sys/permissions.lua:0: attempt to index upvalue '' (a nil value)
vpn11-test1100> expert
Unexpected error: /usr/local/share/lua/5.1/sys/permissions.lua:0: attempt to index upvalue '' (a nil value)
Reboots would sometimes fix the problem, so we limped along.
After some troubleshooting, my initial conclusion was that this might be a hardware issue, so we swapped the box for a spare 1100 appliance.
Initially it looked like it was going to work, but within a few hours we saw the same issues.
Finally we decided to swap (AGAIN?!) for a 1400 series appliance: same policy, same network configuration, same blades enabled.
Worked perfectly!
I had the customer ship me both boxes so I could put them online locally, where I could watch the serial consoles.
Wiped the boxes to factory default using newer firmware, so the newer firmware became the factory default image. Configured a box with basic IP connectivity and connected it to the internet.
No issues.
Connected the box to central management and it got policy.
On the console I started seeing things like this:
vpn11-test1100 login: Out of memory: kill process 3263 (fw) score 3475 or a child
Killed process 3264 (fw)
Out of memory: kill process 3263 (fw) score 3056 or a child
Killed process 3263 (fw)
Out of memory: kill process 1832 (fw) score 1528 or a child
Killed process 1832 (fw)
Out of memory: kill process 2256 (fw) score 1528 or a child
Killed process 2256 (fw)
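For anyone capturing a serial console log like the one above, here is a rough Python sketch for tallying which processes the kernel OOM killer selects and kills. It is meant to run on a workstation against saved console text, not on the appliance, and it is not a Check Point tool. The function name `summarize_oom` and the regular expressions are my own illustration, written only against the line formats shown above; other firmware may log differently.

```python
import re
from collections import Counter

# Patterns are assumptions based solely on the console lines quoted above, e.g.
#   "Out of memory: kill process 3263 (fw) score 3475 or a child"
#   "Killed process 3264 (fw)"
OOM_RE = re.compile(r"Out of memory: kill process (\d+) \(([^)]+)\) score (\d+)")
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def summarize_oom(log_text):
    """Return (selected, killed): Counters of process names per event type."""
    selected = Counter()  # processes the OOM killer picked as a target
    killed = Counter()    # processes (target or child) actually killed
    for line in log_text.splitlines():
        m = OOM_RE.search(line)
        if m:
            selected[m.group(2)] += 1
            continue
        m = KILLED_RE.search(line)
        if m:
            killed[m.group(2)] += 1
    return selected, killed


# Sample input: the console output from this post, verbatim.
SAMPLE = """\
vpn11-test1100 login: Out of memory: kill process 3263 (fw) score 3475 or a child
Killed process 3264 (fw)
Out of memory: kill process 3263 (fw) score 3056 or a child
Killed process 3263 (fw)
Out of memory: kill process 1832 (fw) score 1528 or a child
Killed process 1832 (fw)
Out of memory: kill process 2256 (fw) score 1528 or a child
Killed process 2256 (fw)
"""

if __name__ == "__main__":
    selected, killed = summarize_oom(SAMPLE)
    for name, count in killed.most_common():
        print(f"{name}: killed {count} time(s)")
```

Run over a longer capture, a count like this makes it easier to show TAC that the kills keep landing on the same process after policy install rather than being a one-off.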
At this time I opened a case with TAC. (I can provide SR if interested).
After some basic diagnostics (do we not have any tools other than memtest.sh?) it was decided to RMA both boxes.
At the same time, we also ordered new 1530 appliances to see if they would have the same issue.
The new boxes arrived. The 1530 came up just fine and has had no stability issues so far.
Plugged in the new RMA 1100 box and it experienced the same memory issues.
So the questions are:
1. Why are there no tools to better diagnose problems on the SMB (sometimes used even in enterprise) firewalls?
2. Why is there an increase in resource usage on 1100 appliances when centrally managed?
3. Is there any reason one should even run the 1100 appliance with R80.40?
4. Is there a way out of this situation? Or is hardware upgrade the only solution?
5. Is there a problem with my testing methodology?