Mistakenly cold-swapped an RMA drive and I'm seeing the messages described in this SK: https://support.checkpoint.com/results/sk/sk181269
The solution doesn't explain whether the RAID will eventually self-heal and go healthy, or whether the spare drive is no good now. I'm not seeing much activity on the drives that would suggest a rebuild, and if I reboot with the drives in this status the appliance fails to boot until I remove the 'spare/DISC_FAILED' drive.
Hm... I see it says this, but I'm not sure whether that means it will technically self-heal...
Andy
When Storage Devices are configured in a RAID, it is mandatory to replace Storage Devices when the appliance is up and running.
After you replace a Storage Device, it can take several hours for the RAID State to become "ONLINE" and "Flags" to become "NONE"
See, that's funny because hot-swapping a failed disk also doesn't work. You have to take manual steps to get the system to recognize the new drive, as described in sk157874.
But I guess doing this would be easier...
The following workaround is also available:
Reboot the appliance. A reboot will also "wake up" the SATA port that shut down after you swapped the failed disk with a new one.
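For reference, the generic Linux way to re-probe a SATA link without rebooting is the sysfs SCSI host rescan. This is only a sketch: host0 is just an example, and it may or may not match what sk157874 actually prescribes.
# List the SCSI/SATA host adapters the kernel knows about
ls /sys/class/scsi_host/
# Ask a specific host (host0 is only an example) to rescan its links
echo "- - -" > /sys/class/scsi_host/host0/scan
# Confirm the kernel now sees the new disk and check the array state
fdisk -l
cat /proc/mdstat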
What's odd is I did try rebooting with the drive in the server and it fails to boot. If I remove the spare drive and reboot, it boots fine.
insmod: error inserting '/lib/crct10dif_common.ko': -1 File exists
insmod: error inserting '/lib/crc-t10dif.ko': -1 File exists
insmod: error inserting '/lib/sd_mod.ko': -1 File exists
mdadm: /dev/md/2 has been started with 1 drive (out of 2).
mdadm: /dev/md0 has been started with 2 drives.
mdadm: /dev/md1 has been started with 2 drives.
mdadm: /dev/md2 is already in use.
mdadm: /dev/md2 is already in use.
Reading all physical volumes. This may take a while...
Found volume group "vg_splat" using metadata type lvm2
4 logical volume(s) in volume group "vg_splat" now active
mount: error mounting /dev/root on /sysroot as ext3: Invalid argument
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /proc: No such file or directory
setuproot: error mounting /sys: No such file or directory
switchroot: mount failed: No such file or directory
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
CPU: 0 PID: 1 Comm: init Not tainted 3.10.0-1160.15.2cpx86_64 #1
Hardware name: CheckPoint PH-30-00/To be filled by O.E.M., BIOS 5.6.5 01/13/2016
Call Trace:
[<ffffffff817b02ee>] dump_stack+0x1e/0x20
[<ffffffff817adc6f>] panic+0xe5/0x20a
[<ffffffff810984bd>] do_exit+0xb2d/0xb30
[<ffffffff81098543>] do_group_exit+0x43/0xc0
[<ffffffff810985d4>] SyS_exit_group+0x14/0x20
[<ffffffff817c8361>] sysenter_dispatch+0x7/0x25
Kernel Offset: disabled
Rebooting in 15 seconds..
Maybe get in touch with TAC and see what they say.
Boot it with the single drive, insert the replacement drive, then use the process in sk157874 to get it to recognize the replacement drive and start resilvering the set.
I don't think that's my issue as when I run fdisk -l I see two 1TB drives listed. I am working with support and will update the post with what I find out.
Sounds good, let us know.
That's normal. Here's some relevant command output from a 15600 upgraded from R81.10 (maybe R80.40, I forget) to R81.20 with a healthy RAID:
[Expert@SomeCluster1 STANDBY]# fw ver
This is Check Point's software version R81.20 - Build 012
[Expert@SomeCluster1 STANDBY]# raid_diagnostic
Raid status:
VolumeID:0 RaidLevel: RAID-1 NumberOfDisks:2 RaidSize:447GB State:OPTIMAL Flags:ENABLED
DiskID:0 DiskNumber:0 Vendor:ATA ProductID:SAMSUNG MZ7KM480 Revision:104Q Size:447GB State:ONLINE Flags:NONE
DiskID:1 DiskNumber:1 Vendor:ATA ProductID:SAMSUNG MZ7KM480 Revision:104Q Size:447GB State:ONLINE Flags:NONE
[Expert@SomeCluster1 STANDBY]# fdisk -l
Disk /dev/sda: 480.1 GB, 480103981056 bytes, 937703088 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0008f6de
Device Boot Start End Blocks Id System
/dev/sda1 * 63 610469 305203+ fd Linux raid autodetect
/dev/sda2 610470 67713974 33551752+ fd Linux raid autodetect
/dev/sda3 67713975 937697984 434992005 fd Linux raid autodetect
Disk /dev/sdb: 480.1 GB, 480103981056 bytes, 937703088 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0008f6de
Device Boot Start End Blocks Id System
/dev/sdb1 * 63 610469 305203+ fd Linux raid autodetect
/dev/sdb2 610470 67713974 33551752+ fd Linux raid autodetect
/dev/sdb3 67713975 937697984 434992005 fd Linux raid autodetect
Disk /dev/md0: 312 MB, 312410112 bytes, 610176 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/md1: 34.4 GB, 34356920320 bytes, 67103360 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/md2: 445.4 GB, 445431742464 bytes, 869983872 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[Expert@SomeCluster1 STANDBY]# mdadm --misc -Q --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Wed Aug 8 06:10:37 2018
Raid Level : raid1
Array Size : 305088 (297.94 MiB 312.41 MB)
Used Dev Size : 305088 (297.94 MiB 312.41 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu May 2 10:36:18 2024
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
UUID : 00112233:44556677:8899aabb:ccddeefd
Events : 0.36
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
[Expert@SomeCluster1 STANDBY]# mdadm --misc -Q --detail /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Wed Aug 8 06:10:32 2018
Raid Level : raid1
Array Size : 33551680 (32.00 GiB 34.36 GB)
Used Dev Size : 33551680 (32.00 GiB 34.36 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Tue Mar 19 00:23:22 2024
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
UUID : 00112233:44556677:8899aabb:ccddeefe
Events : 0.8
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
[Expert@SomeCluster1 STANDBY]# mdadm --misc -Q --detail /dev/md2
/dev/md2:
Version : 0.90
Creation Time : Wed Aug 8 06:10:32 2018
Raid Level : raid1
Array Size : 434991936 (414.84 GiB 445.43 GB)
Used Dev Size : 434991936 (414.84 GiB 445.43 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Fri May 3 09:31:11 2024
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
UUID : 00112233:44556677:8899aabb:ccddeeff
Events : 0.766
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
[Expert@SomeCluster1 STANDBY]#
On boot, the system tries to identify all disks. Any disks which are part of an existing md(4) set are attached to that set. Any disks which aren't part of an existing set instead result in a new set being created with the new disk attached to it. This is the problem you have hit.
It's possible to fix this live, but risky. It's far simpler to shut down, remove the new drive, boot the system, attach the new drive (after you can log in), then probe the SATA links using the command in sk157874.
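If you want to see this from the running system, here is a quick read-only sketch; sda (surviving disk) and sdb (RMA disk) are assumed device names:
# Which arrays the kernel assembled, and which member each one actually has
cat /proc/mdstat
# Read the md superblock on the large data partition of each disk; the array
# UUID reported for the RMA disk will differ if it was previously a member of
# another appliance's RAID set
mdadm --examine /dev/sda3
mdadm --examine /dev/sdb3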
Thanks. I'm working through our VAR, who opened a case on our behalf, so I definitely don't want to waste the forum's time. My theory is that Check Point shipped us an entire chassis instead of just one spare drive, so I'm guessing there's an OS on those drives that is causing issues with the RAID rebuild. Waiting for feedback to confirm, but that's where I'm at currently. Will update with the outcome.
That sounds logical to me as well.
In that case, the drive from the spare box will definitely have its own md(4) set definition stored on it. At boot, the system is seeing both existing sets. Each has two configured members, and each has only one member present.
The fix is still to boot without the new drive (so the system only has one set), insert the new drive after you can log in, then probe the SATA links to convince the system to take the new drive and stick it in the existing set.
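If probing the links alone isn't enough because the foreign superblocks are still on the new drive, the underlying md(4) clean-up would look something like the sketch below. The device names (sda = surviving disk, sdb = RMA disk) are assumptions, the commands are destructive to the new drive, and this should only be run once TAC has confirmed which disk is which:
# Make absolutely sure sdb is the new/foreign drive before touching it
mdadm --examine /dev/sdb1 /dev/sdb2 /dev/sdb3
# Erase the foreign md superblocks so the drive no longer claims membership
# in the old appliance's arrays
mdadm --zero-superblock /dev/sdb1
mdadm --zero-superblock /dev/sdb2
mdadm --zero-superblock /dev/sdb3
# Add the partitions into the existing arrays and let them resync
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
mdadm --manage /dev/md2 --add /dev/sdb3
cat /proc/mdstat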
Maybe verify with TAC first, but here is my logic...
1) ONLY boot with existing hdd
2) Make sure you can log in
3) If yes, shut down the appliance by running halt from expert mode
4) Unplug the power cable
5) insert the hdd that was sent
6) power on the appliance
7) verify raid status -> cpstat os -f raidInfo from expert mode (see the sketch after this list)
8) if still failing, reboot
9) check again
10) if good, great, if not, I would call TAC
Andy
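For step 7, a couple of read-only commands to keep an eye on things while the mirror catches up; raid_diagnostic should eventually look like the healthy example posted earlier in the thread:
# Check Point's view of the RAID volume and member disks
cpstat os -f raidInfo
# The kernel's view, including any resync/rebuild progress
cat /proc/mdstat
# Poll every minute until the arrays show as clean / ONLINE
while true; do cat /proc/mdstat; sleep 60; done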
Just a final update on this. Worked with our VAR and TAC. It ended up being what we theorized: we had to wipe the spare drive, clear its partition layout, and then remap it with the good drive's partition layout, since it already had an OS installed on it. Once this happened, the drive automatically began to rebuild in the RAID. If we had received just a regular spare drive this would have been avoided, but it was a good learning experience.
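For anyone who lands here with the same problem, cloning the good disk's dos partition layout onto a wiped spare is typically done with sfdisk on Linux. This is just a sketch with assumed device names (sda = good disk, sdb = spare), it is destructive to the spare, and the exact procedure TAC used here may have differed:
# Dump the good disk's partition table and write it onto the spare
sfdisk -d /dev/sda > /tmp/good_disk_layout
sfdisk /dev/sdb < /tmp/good_disk_layout
# Confirm both disks now show identical partitioning
fdisk -l /dev/sda
fdisk -l /dev/sdb
# In this case the rebuild reportedly kicked off on its own once the
# layouts matched; progress shows up here
cat /proc/mdstat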
Thanks for letting us know.
Just had a similar situation on my 23500. The replacement drive would not rebuild no matter what, and the partitions on the replacement were missing. After escalation, TAC had me run echo 0 > /boot/SW_RAID and then activate_sw_raid, which miraculously rebuilt the partition structure to match (I used fdisk -l for comparison). The RAID still did not rebuild. We pulled the replacement drive back out and rebooted with just the good drive, reinserted the replacement, and it started rebuilding; we are now good.
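To recap that sequence as a sketch: the SW_RAID flag and activate_sw_raid are exactly the TAC-supplied steps described in the post above (not something to run without TAC), and sda/sdb are assumed device names:
# TAC-supplied steps as described above -- do not run these without TAC
echo 0 > /boot/SW_RAID
activate_sw_raid
# Compare the partition layout of the good disk and the replacement
fdisk -l /dev/sda
fdisk -l /dev/sdb
# Then confirm whether the mirror is actually resyncing
cat /proc/mdstat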