BLD
Contributor

1550 Appliance unexpected reboots

Hi.

We have had the appliance for a few weeks.

In the past 5 days our notification logs show 3 "unexpected reboot" notices. We have had no power or other issues at our site. How can we get more information to find the cause of these reboots? We have found nothing in the logs. Do logs survive a reboot?

Firmware version is R80.20 (992000668)

Thanks.

 

119 Replies
BLD
Contributor

We have 15 days up with R80.20.02.

Probably solved.

TAC never answered.... Ticket still open... 

BLD
Contributor

Magic!

 

They answered 2 seconds ago with the solution I was offered here 2 weeks ago.

 

Maarten_Sjouw
Champion
The only thing is that mine is also running Version: R80.20.02 (992000936) and I'm still seeing these crashes, but only at moments of heavy load: lots of smaller files that are combined into a bigger file.
Regards, Maarten
0 Kudos
HristoGrigorov

Reminds me of the "good" old days when my 1470 had similar issues.

If this were happening nowadays, I would be long dead, killed by our employees working from home... 😁

0 Kudos
John_Fleming
Advisor

Mind posting a vmcore? The last messages I was seeing were out of memory and killing random procs until kernel panic. I didn't report this because I do unthinkable things with my 1550 and didn't want to waste someone's time with something that could be related to something unsupported.
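
To check whether a saved kernel log really shows the OOM killer at work before a panic, a small off-box script can pull out the kill lines. This is only a sketch: it assumes the log uses the usual `[uptime] message` dmesg layout and that the kernel prints a "Killed process PID (name)" line, whose exact wording differs between kernel versions, so the pattern may need adjusting against your own logs.

```python
import re

# Assumed wording of the OOM-killer line; verify it against your own kernel log.
OOM_KILL_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*?Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(log_text):
    """Return (uptime_seconds, pid, process_name) for each OOM kill line found."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((float(match["ts"]), int(match["pid"]), match["name"]))
    return kills
```

Feed it the text of a saved log (for example the contents of a hypothetical `saved_dmesg.txt`). An empty result suggests the panic was not preceded by OOM kills, at least not in the part of the log that was captured.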

0 Kudos
HristoGrigorov

John, could you please paste the output of the following command on your 1550:

# sysctl -a | grep panic_on_oom

0 Kudos
Maarten_Sjouw
Champion
last part of the dump:
[292234.925919] [fw4_1];ws_mux_perform_resume: ERROR: Session id (67104) is different from the one that was provided at hold request (67105).
[292234.925923] [fw4_1];ws_mux_host_only_perform_resume: ERROR: Failed to prepare resume.
[330394.930735] [fw4_0];ws_mux_perform_resume: ERROR: Session id (77796) is different from the one that was provided at hold request (77799).
[330394.930739] [fw4_0];ws_mux_host_only_perform_resume: ERROR: Failed to prepare resume.
[368788.642844] Unable to handle kernel paging request at virtual address 77b07c47d9
[368788.650370] Mem abort info:
[368788.653265] Exception class = DABT (current EL), IL = 32 bits
[368788.659301] SET = 0, FnV = 0
[368788.662462] EA = 0, S1PTW = 0
[368788.665703] Data abort info:
[368788.668684] ISV = 0, ISS = 0x00000005
[368788.672624] CM = 0, WnR = 0
[368788.675694] user pgtable: 4k pages, 48-bit VAs, pgd = ffff800008a52000
[368788.682340] [00000077b07c47d9] *pgd=000000000e41b003, *pud=0000000000000000
[368788.689429] Internal error: Oops: 96000005 [#1] SMP
[368788.694413] Modules linked in: qca_ol(O) qca_da(O) smart_antenna(PO) ath_dev(PO) tm(PO) hst_tx99(PO) ath_rate_atheros(PO) ath_pktlog(PO) ath_hal(PO) umac(O) mem_manager(PO) ath_spectral(PO) ath_dfs(PO) qdf(O) asf(PO) fResetmod(O) vpntmod(PO) fw_3(PO) fw_2(PO) fw_1(PO) fw_0(PO) simmod(PO) umi(PO) marvellmod(PO)
[368788.722103] CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: P O 4.14.76-release-1.3.0 #1
[368788.730838] Hardware name: Marvell Armada 8040 Sunspear V1_dvt Software 0.0.3 (DT)
[368788.738525] task: ffff800069354e00 task.stack: ffff800069380000
[368788.748168] PC is at skbuff_packet_get_inner_protocol+0x0/0x40 [fw_0]
[368788.758321] LR is at fwmultik_process_packet_kernel+0x80/0x160 [fw_0]
[368788.764875] pc : [<ffff000001a2aa68>] lr : [<ffff0000019d14a0>] pstate: 20000145
[368788.772387] sp : ffff800069383330
[368788.775800] x29: ffff800069383330 x28: 0000000000000006
[368788.781222] x27: 00000000ffffffff x26: 0000000000000000
[368788.786642] x25: 00000000ffffffff x24: ffff00001ecbf8a0
[368788.792062] x23: 0000000000000000 x22: 0000000000000000
[368788.797483] x21: 00000000b07c4721 x20: 00000077b07c4721
[368788.802903] x19: ffff000002a5e3c8 x18: 0000000000000000
[368788.808324] x17: 0000000000000000 x16: ffff0000108edf18
[368788.813744] x15: 0000000053806002 x14: ffff80004163f420
[368788.819165] x13: ffff0000109cc020 x12: 0000000000000000
[368788.824585] x11: 0000000000000000 x10: 0000000053806002
[368788.830006] x9 : 0000000000000000 x8 : 000000001a6b3c6b
[368788.835426] x7 : 0000000000000000 x6 : ffff00001ecbf8a0
[368788.840846] x5 : 00000000ffffffff x4 : 0000000000000000
[368788.846267] x3 : 0000000000000006 x2 : 00000000b07c4721
[368788.851688] x1 : 00000000b07c4721 x0 : 00000077b07c4721
[368788.857110] Process ksoftirqd/0 (pid: 7, stack limit = 0xffff800069380000)
[368788.864099] Call trace:
[368788.866642] Exception stack(0xffff8000693831f0 to 0xffff800069383330)
[368788.873195] 31e0: 00000077b07c4721 00000000b07c4721
[368788.881144] 3200: 00000000b07c4721 0000000000000006 0000000000000000 00000000ffffffff
[368788.889092] 3220: ffff00001ecbf8a0 0000000000000000 000000001a6b3c6b 0000000000000000
[368788.897039] 3240: 0000000053806002 0000000000000000 0000000000000000 ffff0000109cc020
[368788.904987] 3260: ffff80004163f420 0000000053806002 ffff0000108edf18 0000000000000000
[368788.912935] 3280: 0000000000000000 ffff000002a5e3c8 00000077b07c4721 00000000b07c4721
[368788.920882] 32a0: 0000000000000000 0000000000000000 ffff00001ecbf8a0 00000000ffffffff
[368788.928831] 32c0: 0000000000000000 00000000ffffffff 0000000000000006 ffff800069383330
[368788.936780] 32e0: ffff0000019d14a0 ffff800069383330 ffff000001a2aa68 0000000020000145
[368788.944728] 3300: 00002823538055f6 0000000000000006 ffffffffffffffff 00000077b07c4721
[368788.952676] 3320: ffff800069383330 ffff000001a2aa68
[368788.961285] [<ffff000001a2aa68>] skbuff_packet_get_inner_protocol+0x0/0x40 [fw_0]
[368788.972499] [<ffff0000019d5a84>] fwmultik_process_entry+0x50c/0x8d0 [fw_0]
[368788.983108] [<ffff0000019d5f0c>] fwmultik_queue_async_dequeue_cb+0x6c/0x2a0 [fw_0]
[368788.994419] [<ffff000001a1f3a0>] kiss_kqueue_async_dequeue_entry+0xc0/0x528 [fw_0]
[368789.005717] [<ffff0000019cd5a4>] fwmultik_sync_dequeue+0x64/0xc0 [fw_0]
[368789.016119] [<ffff0000019d6690>] fwmultik_process_synchronous_inbound_ex+0x48/0xd8 [fw_0]
[368789.028054] [<ffff0000019d8274>] fwmultik_process_synchronous_inbound+0xc/0x18 [fw_0]
[368789.036779] [<ffff0000009ffaa4>] handle_inbound_packet+0x9ac/0x1ffc0 [simmod]
[368789.044802] [<ffff0000009caa78>] sim_fromlinux+0x238/0x808 [simmod]
[368789.051191] [<ffff0000107854e0>] __netif_receive_skb_core+0x288/0x8c0
[368789.057745] [<ffff000010787d14>] __netif_receive_skb+0x14/0x60
[368789.063689] [<ffff00001078b584>] netif_receive_skb_internal+0x24/0xc8
[368789.070242] [<ffff00001078c04c>] napi_gro_receive+0xa4/0xc8
[368789.075926] [<ffff0000105bb4b4>] mvpp2_poll+0x584/0xc58
[368789.081259] [<ffff00001078b984>] net_rx_action+0xf4/0x2b0
[368789.086767] [<ffff000010081a2c>] __do_softirq+0x12c/0x228
[368789.092274] [<ffff0000100ded98>] run_ksoftirqd+0x40/0x58
[368789.097695] [<ffff0000100fce20>] smpboot_thread_fn+0x178/0x1a0
[368789.103639] [<ffff0000100f8f0c>] kthread+0x12c/0x130
[368789.108711] [<ffff000010084d18>] ret_from_fork+0x10/0x18
[368789.114134] Code: 942588d6 a8c17bfd d65f03c0 d503201f (79417002)
[368789.120345] SMP: stopping secondary CPUs
[368789.124446] Starting crashdump kernel...
[368789.128471] Bye!
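
The dump above is an Oops from an invalid memory access ("Unable to handle kernel paging request"), with the PC inside the fw_0 module. When comparing several such dumps, it can help to reduce each one to its key fields. The following is a minimal sketch that assumes the arm64 oops layout shown above; the function names are made up for illustration, and other kernels word these lines differently.

```python
import re
from collections import Counter

# Patterns written against the oops layout in the dump above (an assumption
# for any other kernel version or architecture).
FAULT_RE = re.compile(r"paging request at virtual address ([0-9a-f]+)")
TASK_RE = re.compile(r"PID: (\d+) Comm: (\S+)")
PC_LR_RE = re.compile(r"\b(PC|LR) is at (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def summarize_oops(text):
    """Collect the fault address, task, PC/LR symbols and call-trace frames."""
    summary = {"fault_address": None, "pid": None, "comm": None,
               "pc": None, "lr": None, "frames": []}
    for line in text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            summary["fault_address"] = fault.group(1)
        task = TASK_RE.search(line)
        if task:
            summary["pid"] = int(task.group(1))
            summary["comm"] = task.group(2)
        reg = PC_LR_RE.search(line)
        if reg:
            # A symbol without a [module] suffix belongs to the core kernel image.
            summary[reg.group(1).lower()] = (reg.group(2), reg.group(3) or "kernel")
        frame = FRAME_RE.search(line)
        if frame:
            summary["frames"].append((frame.group(1), frame.group(2) or "kernel"))
    return summary


def frames_per_module(summary):
    """Count call-trace frames by owning module."""
    return Counter(module for _symbol, module in summary["frames"])
```

Running `summarize_oops` over each saved dump and comparing the `pc` entries shows quickly whether every crash lands in the same function or whether the faulting location moves around.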
Regards, Maarten
0 Kudos
Maarten_Sjouw
Champion
sysctl -a | grep panic_on_oom
sysctl: error reading key 'net.ipv6.conf.DMZ.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN1.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN2.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN3.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN4.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN5.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN6.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN7.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN8.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN8.10.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN8.15.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.LAN8.25.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.WAN.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.all.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.default.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.eth0.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.eth1.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.lo.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.wifi0.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.wifi1.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.wlan0.stable_secret': Input/output error
sysctl: error reading key 'net.ipv6.conf.wlan1.stable_secret': Input/output error
vm.panic_on_oom = 0
Regards, Maarten
0 Kudos
John_Fleming
Advisor

[Expert@fw]# sysctl vm.panic_on_oom
vm.panic_on_oom = 0
[Expert@fw]#

 

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/...

 

Looks like a value of 0 doesn't mean take a dump; it means start killing random processes to try to free memory.

 

If I had to guess, I'd say the kernel panic was caused by running out of free kernel memory.
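
For reference, the three values that `vm.panic_on_oom` can take can be spelled out in a small helper. This is a sketch based on the Linux kernel's vm sysctl documentation, not on anything Check Point specific; the function names are made up for illustration. Reading the single key from `/proc` also avoids the `stable_secret` read errors that `sysctl -a` prints while walking every key.

```python
from pathlib import Path

# Meanings paraphrased from the Linux kernel's vm sysctl documentation.
PANIC_ON_OOM_MEANINGS = {
    0: "no panic: the OOM killer picks a process to kill and the system keeps running",
    1: "panic on OOM, except when the shortage is confined by a mempolicy or cpuset",
    2: "always panic on OOM",
}


def parse_sysctl_line(line):
    """Split a 'key = value' line as printed by sysctl into (key, value)."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()


def describe_panic_on_oom(value):
    """Translate the numeric setting into a short description."""
    return PANIC_ON_OOM_MEANINGS.get(int(value), "unrecognized value")


def read_panic_on_oom(path="/proc/sys/vm/panic_on_oom"):
    """Read the one key directly instead of listing every key."""
    return int(Path(path).read_text().strip())
```

With the output shown in this thread, `describe_panic_on_oom(parse_sysctl_line("vm.panic_on_oom = 0")[1])` yields the "no panic" description, which matches the reading above.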

0 Kudos
John_Fleming
Advisor

The last few kernel panics I had seemed to happen at strange times, like midnight or something. I'm wondering if it's a bug with sig updates somehow.

0 Kudos
HristoGrigorov

Definitely a case for R&D to investigate. Although, in my experience, it can be a hardware issue (corrupt memory) as well.

I wonder why the kernel is not set to panic on OOM on some systems. As if the system will be left in a stable state once the OOM killer is invoked. No way! Better to reboot...

0 Kudos
John_Fleming
Advisor

You say that like there has never been a case of sfwd eating all memory.

 

SFWD is just jelly of CPD and FWD.

0 Kudos
HristoGrigorov

SFWD has always been limited by fw_sfwd_max_rss_enforce to, I think, 300MB.

0 Kudos
John_Fleming
Advisor

Yeah, I have a customer that rolled out 1200Rs when they came out. That for sure didn't exist back then... I'm pretty sure. Then again, that was back in the old R75 days.

0 Kudos
BorisL
Collaborator

Most of our unexpected reboots happened at midnight GMT
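
One way to test the midnight pattern is to bucket the reboot notices by hour of day. The sketch below assumes the notice times have already been copied out of the notification log as `YYYY-MM-DD HH:MM:SS` strings in GMT; that format is an assumption, not the appliance's actual log format.

```python
from collections import Counter
from datetime import datetime


def reboots_by_hour(timestamps):
    """Count reboot events per hour of day (0-23)."""
    counts = Counter()
    for stamp in timestamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[moment.hour] += 1
    return counts
```

If most events fall into hour 0 or 23, a scheduled job such as an update becomes a more likely suspect than traffic load.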

0 Kudos
Maarten_Sjouw
Champion
The ones I know about happen the moment I start a few bigger downloads, but it does not happen with every larger download.
Regards, Maarten
0 Kudos
John_Fleming
Advisor

I just pushed 135 gig through mine without issue (replicating my lab ISO folder). Granted, this was an rsync and not, say, HTTP, which might have more inspection bits hitting it.

0 Kudos
Kevin_Zeitler
Contributor

Hi All,

How is the latest Gaia OS working out for the 1500? Are you still seeing random reboots?

Anyone up over 30 days without a reboot?

0 Kudos
BorisL
Collaborator

For us, it has been 60 days without reboots.

Very normal loads, except for the occasional operating system update in the LAN.

We have 600meg fiber in both directions and get the full throughput.

0 Kudos
HristoGrigorov

Can you tell what the OS load average was right before it rebooted? Was it HTTP or HTTPS traffic, and have you tried disabling HTTPS Inspection to see if it makes any difference?
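
Since the load average is lost when the box reboots, it has to be recorded beforehand. A minimal sketch of a sampler follows; it assumes a Python interpreter is available wherever it runs and that the output path sits on storage that survives a reboot, neither of which is confirmed for the appliance.

```python
import os
import time


def format_sample(now, load):
    """Render one log line from a Unix timestamp and a (1, 5, 15 minute) load tuple."""
    one, five, fifteen = load
    return f"{now:.0f} load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}"


def append_sample(path):
    """Append the current load average to path and force it to disk."""
    line = format_sample(time.time(), os.getloadavg())
    with open(path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        # fsync so the last samples are not lost in the page cache on a crash.
        os.fsync(handle.fileno())
    return line
```

Calling `append_sample` every few seconds from a loop or a scheduler leaves a trail whose last lines show the load just before the reboot.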

0 Kudos
Maarten_Sjouw
Champion
For me, I have not yet been able to identify where it fails. I was downloading some stuff when it all of a sudden rebooted, and at that point I remembered that this had happened before. I do not have HTTPS scanning on. The next time I download a bigger bunch through NNTP, I will check the load.
Regards, Maarten
0 Kudos
Maarten_Sjouw
Champion
I have been issued 3 consecutive versions after the official version and I'm still having issues under load.
The type of load makes no difference: CIFS, NNTP, HTTPS or FTP, inbound or outbound.
Mostly it crashes during a period of load, but sometimes it survives a short period and then crashes after 5 minutes.
Regards, Maarten
0 Kudos
Amir_Ayalon
Employee

Hi Maarten

can you please send more details? 

amiray@checkpoint.com

Thanks

 

0 Kudos
BlueGrass
Contributor

 

We use a 1590 and have also been suffering from the same issue these days.

After we upgraded to Version: R80.20.05 (992001169), the system has been up for 6 days now.

Anyway, we are quite disappointed with this.

 

 

0 Kudos
paulossa
Explorer

1550 here. The appliance keeps rebooting, and we are already on the 3rd firmware. I was hoping the latest one, R80.20.05 (992001179), was stable, but after 4 days it rebooted again unexpectedly and kept rebooting minute after minute until I had to turn it off and on again.

I already tried turning off SSL inspection and anti-spam to test whether the cause is high load, but the problem persists. The internet connection is 100/100 Mbps and the company has about 10 people, half of them working from home due to COVID, so this is the worst time for it to happen. It's frustrating.

I'm really disappointed with this 1500 series. The first post regarding this issue is from January and there is no sign of a resolution. I have 2 more appliances to install at another customer, and I'm already afraid that this will happen there too and I will have another headache with the customer complaining every day.

0 Kudos
HristoGrigorov

Tagging @Amir_Ayalon 

0 Kudos
Amir_Ayalon
Employee

Thanks for tagging.

We are already in contact with the customer.

We will update once we get to the bottom of it.

 

thanks

 

Anubis
Explorer

We are still having the same unexpected reboots at my office. Any fix, @Amir_Ayalon?

We have a 1550 R80.20.05 build 992001169

Best regards

0 Kudos
Amir_Ayalon
Employee

Hi

Thanks again for raising this issue.

We have several fixes on top of 1169 and, in addition, we have identified an issue with memory utilization during signature updates that may cause the issue. An image including this fix will be released as an HF toward the end of next week.

If you would like to use it before the official release, you are welcome to.

If you encounter any issues, we would love to hear about them.

amiray@checkpoint.com

 

ftp://rndftp:QJxkj1Vf@ftp.checkpoint.com/outgoing/Zachis/EA/Firmware/fw1_vx_dep_R80_992001229_20.img

 

thanks

 

0 Kudos
Maarten_Sjouw
Champion
Hey @Amir_Ayalon, we are still struggling with the 1590 that keeps crashing on Plex streams...
I have been running memory scripts as requested, and I'm now more than 3 months down the road and still no closer to a solution.
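
The memory scripts mentioned here are not shown in the thread. As a rough illustration of what such sampling can look like on any Linux system, and not a reconstruction of those scripts, the sketch below parses the text of `/proc/meminfo` and computes the share of memory still available. It assumes the kernel exposes the `MemAvailable` field.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer value} (most fields are in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory the kernel estimates is still available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]
```

Sampling this at intervals and logging it alongside the time gives a curve that shows whether available memory drains steadily before each crash or stays flat.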
Regards, Maarten
0 Kudos
