Solved: Gaia partition misalignment

nmelay · ‎2022-10-28

During Gaia installation (appliance/open server), partitions are NOT aligned on a 1MB boundary, but are instead cylinder-aligned, in a MS-DOS compatible way.

This alignment turns into a real performance problem with today's RAID, SSDs and AF HDDs.
Filesystem blocks being misaligned with storage blocks leads to read-before-write operations, which can incur a severe performance hit.
My own measurements showed storage performance being more than halved on some specific workloads.
(The worst case scenario probably is heavy SmartEvent activity.)
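
To illustrate the read-before-write effect, here is a minimal Python sketch. It assumes 512-byte logical sectors, 4 KiB physical and filesystem blocks, and a legacy first-partition start at sector 63. None of these figures come from the measurements above, and the function name is made up for illustration.

```python
# Sketch: count how many physical blocks one filesystem block touches.
# Assumptions (not from the thread): 512-byte logical sectors, 4 KiB
# physical blocks, 4 KiB filesystem blocks, legacy start sector 63.

SECTOR = 512
PHYS_BLOCK = 4096
FS_BLOCK = 4096


def physical_blocks_touched(part_start_sector, fs_block_index):
    """Return the number of physical blocks spanned by one filesystem block."""
    first_byte = part_start_sector * SECTOR + fs_block_index * FS_BLOCK
    last_byte = first_byte + FS_BLOCK - 1
    return last_byte // PHYS_BLOCK - first_byte // PHYS_BLOCK + 1


if __name__ == "__main__":
    # Compare a legacy start sector with a 1 MiB-aligned one.
    for start in (63, 2048):
        print(start, physical_blocks_touched(start, 0))
```

Under these assumptions, a partition starting at sector 63 makes every 4 KiB filesystem block straddle two physical blocks, so a single-block write forces the device to read and rewrite both. A start at sector 2048 keeps a one-to-one mapping.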

This issue was fixed in WS 2008 and RHEL 6, when the performance hit first became obvious.
Gaia should have inherited the fix from RHEL, but this did not happen due to the use of a custom installer.
The packaged fdisk utility was fixed, the installer was not.

Fixing an installed Check Point system is almost impossible and requires LVM wizardry.

Please fix the installer and make sure partitions are 1MB-aligned at installation time.

nmelay · ‎2022-10-28

The above was posted as an RFE to https://usercenter.checkpoint.com/ucapps/rfe/ (reference number: lH5U9X43H), and I'm bringing it here to raise general awareness.

I know this has been reported before, and wrongly dismissed as an issue of the past.
The issue is very real, and very current on Check Point software.
I really want this to get ironed out now; there's no excuse for this 10-15 years after it was fixed by everyone else.
Sadly, this is probably too late for R81.20.

PhoneBoy · ‎2022-10-30

R81.20 has a newer installer—might be worth checking the Public EA to see if it has the same issue.

nmelay · ‎2022-10-31

I did not get to play with the R81.20 ISO, only the upgrade package.

_Val_ · ‎2022-10-31

When you are upgrading, you get stuck with your "old" file system anyway.

nmelay · ‎2022-10-31

Indeed.

The "easiest" way to fix the alignment on an existing system is to create a new LVM PV, move LVs to the new PV, recreate the original PV correctly aligned, and move everything back.
That's something you can manage on a VM with enough available storage.
(Also, the misalignment/realignment will break deduplication on smart NAS/SAN storage, so you won't benefit from this.)
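
As a rough sketch of that round trip, the Python snippet below only builds and prints the sequence of standard LVM2 commands (pvcreate, vgextend, pvmove, vgreduce, pvremove). It executes nothing. The volume group and device names are hypothetical placeholders, and the repartitioning step is left as a comment because it depends on the disk layout.

```python
# Dry-run sketch: builds a list of LVM commands, executes nothing.
# All names passed in are hypothetical placeholders, not values from the thread.


def realign_plan(vg, old_pv, temp_pv):
    """Return the ordered shell commands for the PV round trip."""

    def evacuate(src, dst):
        # Add dst to the volume group, move all extents off src, then drop src.
        return [
            f"pvcreate {dst}",
            f"vgextend {vg} {dst}",
            f"pvmove {src} {dst}",
            f"vgreduce {vg} {src}",
            f"pvremove {src}",
        ]

    repartition = (
        f"# recreate the partition behind {old_pv} "
        "with a start sector that is a multiple of 2048"
    )
    return evacuate(old_pv, temp_pv) + [repartition] + evacuate(temp_pv, old_pv)


if __name__ == "__main__":
    for line in realign_plan("vg_example", "/dev/sda3", "/dev/sdb1"):
        print(line)
```

The temporary PV has to be large enough to hold every allocated extent of the original one, and the plan should be reviewed by hand before any command is run.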

On a physical server or Smart-1 appliance, you need to shrink the existing PV, resize the hosting partition, create a new partition/PV, move LVs...

Anyway, you need to know what you're doing and make sure solid backups are not too far away.

nmelay · ‎2022-11-02

OK, I did so, and verified that R81.20 EA correctly aligns partitions. 👍

G_W_Albrecht · ‎2022-10-31

How can I check for that in a simple way?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

nmelay · ‎2022-10-31

fdisk -l /dev/sda

Every partition's first sector (Start) should be a multiple of 2048.
That's especially true for the LVM PVs.
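
The same check can be scripted. The sketch below is one possible approach, assuming a Python 3 interpreter is available and that the kernel exposes partition start sectors under /sys/block/<disk>/<partition>/start (in 512-byte units), as Linux sysfs normally does. The function names are made up for illustration.

```python
import glob
import os

ALIGN_SECTORS = 2048  # 2048 x 512-byte sectors = 1 MiB


def misaligned(starts):
    """Return the partitions whose start sector is not a multiple of 2048."""
    return sorted(name for name, start in starts.items() if start % ALIGN_SECTORS)


def read_starts(disk="sda"):
    """Read partition start sectors (512-byte units) from Linux sysfs."""
    starts = {}
    for path in glob.glob(f"/sys/block/{disk}/{disk}*/start"):
        with open(path) as fh:
            # The parent directory name is the partition name, e.g. "sda1".
            starts[os.path.basename(os.path.dirname(path))] = int(fh.read())
    return starts


if __name__ == "__main__":
    bad = misaligned(read_starts("sda"))
    print("misaligned:", ", ".join(bad) if bad else "none")
```

For example, misaligned({"sda1": 63, "sda2": 2048}) returns ["sda1"].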

_Val_ · ‎2022-10-31

I have asked R&D owners to comment here, please give them a bit of time to do that.

Spoiler alert: AFAIK, R81.20 clean install should resolve the issue.

nmelay · ‎2022-10-31

Thanks Val for the good news.
I'm delighted to hear this topic is getting the attention it deserves.

_Val_ · ‎2022-10-31

Sure thing. I can promise you, this particular issue is addressed very seriously.

itzhakd · ‎2022-10-31

Indeed as mentioned above, R81.20 release will have a new installer (with new fdisk of course) so clean installation will align the partitions to a 1MB boundary.

nmelay · ‎2022-11-02

Thanks Itzhak for your confirmation.

It will be a while before R81.20 is widely adopted.
Do you know if there will be an updated R81.10 ISO? And fix instructions for existing setups?

_Val_ · ‎2022-11-02

I hope you are happy now 🙂

nmelay · ‎2022-11-02

Yes I am!
Problem was fixed before I even got to report it, that's quite a feat!

Still I'd like to know where Check Point stands on this issue regarding current releases and upgrades.

_Val_ · ‎2022-11-02

I will let R&D answer the question about clean install, although I find it very unlikely that we will be able to change that, too many moving parts for the released version: production lines, install tools, Blink images, etc.

For the upgrades, you are stuck with your pre-existing file system, unless you perform a clean install.

_Val_ · ‎2022-11-02

Okay, after some internal chatter, it seems we will not be fixing the alignment with the previous versions.

Also, a personal note, in my 24 years with CP, I have never seen that being an actual issue. Any reason?

nmelay · ‎2022-11-02

Thanks for your time Val, I really appreciate this.

I take note that previous/existing versions won't be updated, I can understand this.
I'm really glad the fix is part of the R81.20 release, whether it was intentional or a side effect of the new installer. 😉

I wish the issue was clearly documented -- maybe this will happen once R81.20 is released.
Right now, Best practices for running SMS on VMware (sk104848) states you should "make sure the disk partitions within the guest are aligned"... but omits to say that the Gaia installer (up to R81.10) prevents you from doing so.

What brought the issue back to my attention recently was a customer with abysmal SmartEvent performance.
A colleague of mine spent days relocating the log server to a new hosting infrastructure and reindexing everything, only to get a minor performance improvement.
When I had a look at it, the misalignment issue was obvious to me. With a few decades of systems/storage experience, misaligned partitions bring back old memories of poorly performing database workloads, wrongly designed/poorly documented storage vendor fixes (I'm looking at you, NetApp!) and so on.
I knew I saw it before on Check Point installs, but only then realized ALL of them are affected: old and new appliances, open servers, even CloudGuard IaaS Azure instances.
We did NOT spend more days fixing this for this specific customer -- we probably won't until he brings it back and we get an officially supported procedure for this. But I did run some tests on my own and it did not look good.

_Val_ · ‎2022-11-02

Thanks for the thorough write-up.

However, I will ask again, what were the issues you experienced with Check Point, related to the cylinder-aligned partitions?

nmelay · ‎2022-11-02

Slow disk writes.
For example, according to Gaia installation logs, after aligning partitions, packages installation time went down from 4'32" to 1'52".
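
Put in numbers, those two timings work out as follows (a trivial Python sketch using only the figures quoted above):

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


before = to_seconds(4, 32)  # misaligned install: 272 s
after = to_seconds(1, 52)   # aligned install: 112 s

speedup = before / after
saved = (before - after) / before
print(f"{speedup:.2f}x faster, {saved:.0%} less time")
# prints "2.43x faster, 59% less time"
```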

Also, just to make things clear for everyone else reading this, "cylinder-aligned" partitions are not really cylinder-aligned, they're just misaligned -- apologies for using the term without quotes at the beginning of this thread.
Even with a single HDD and no RAID, the OS doesn't know about the actual physical cylinders/heads/sectors (CHS) arrangement within the disk drive, and real or virtual hardware will report totally made-up CHS values, maximizing usable disk space for... MS-DOS.
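
As a worked example of why "cylinder" boundaries rarely line up with physical blocks, the Python sketch below assumes the conventional made-up geometry of 255 heads and 63 sectors per track. That geometry is an assumption for illustration; this thread does not quote the values a Gaia system actually reports.

```python
# Assumed fake geometry (illustration only, not taken from the thread).
HEADS = 255
SECTORS_PER_TRACK = 63
SECTOR = 512

cylinder_sectors = HEADS * SECTORS_PER_TRACK  # 16065 sectors per "cylinder"


def aligned(start_sector, boundary_bytes):
    """True if the byte offset of start_sector is a multiple of boundary_bytes."""
    return (start_sector * SECTOR) % boundary_bytes == 0


def first_aligned_cylinder(boundary_bytes):
    """Return the lowest cylinder number k >= 1 whose boundary is aligned."""
    k = 1
    while not aligned(k * cylinder_sectors, boundary_bytes):
        k += 1
    return k


if __name__ == "__main__":
    print(first_aligned_cylinder(4096))         # 4 KiB physical blocks
    print(first_aligned_cylinder(1024 * 1024))  # 1 MiB boundary
```

Because 16065 is odd, only every 8th cylinder boundary lands on a 4 KiB boundary, and only every 2048th lands on a 1 MiB boundary.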

_Val_ · ‎2022-11-03

I do understand the new alignment is much more efficient. My question was different. Do you have any examples of when old alignment would affect the production functions of your security systems?

nmelay · ‎2022-11-08

Sorry about the delayed answer.
The most symptomatic case was the SmartEvent/SmartReporter customer issue I was referring to above.
I think their log volume was about 800G at that time, and re-indexing it on the new VM took about a month...
The hosting VM was graded by the service provider as a performance VM, whatever that means.

_Val_ · ‎2022-11-09

Thanks for this example.

Just to be sure, this issue can also be related to non-optimal settings of the VM itself. It is always worth using the best practices from sk104848 when it comes to large management deployments.

Bob_Zimmerman · ‎2022-11-11

On the topic of an updated R81.10 installer, I just tested something. I built a VM with the R81.20 ISO. I updated CPUSE, imported the R81.10 CPUSE package, and ran 'installer clean-install <R81.10 package>'. It left me with a working R81.10 system with the fixed R81.20 partition layout.

I accidentally installed a jumbo before running config_system, so I'll need to rebuild it before running performance tests. Still, if there's no fix for R81.10 directly, at least this should work for people who can't migrate to R81.20 yet.

Edit: Confirmed this also works for R80.40. I assume R81 as well.

PhoneBoy · ‎2022-11-11

That's a clever way to get the benefit of the newer installer (which sets partitions) while still running an older release.
Think I'll mark this as a solution for that reason 🙂

nmelay · ‎2022-11-11

Thanks Bob for sharing this.
It actually never occurred to me that you could manually import an older Gaia release in CPUSE and downgrade this way. 😄

Another benefit of the new installer for those running Gaia on VMware is the PVSCSI driver is now available at install time.
Using this driver and controller should bring a small performance boost.
Of course we'll need an official statement from Check Point on this one, as sk104848 currently explicitly states the paravirtual controller is not supported.
With PVSCSI being the default disk controller for RHEL7 guests, once R81.20 is released, we're probably going to see new deployments with a PVSCSI controller in the wild. That is, if nothing is done to stop it from happening.

PhoneBoy · ‎2022-11-12

My guess is the SK will be updated once R81.20 is released.

RamGuy239 · ‎2022-12-07

This is expected, as "clean-install" via CPUSE and Blink is a somewhat misleading term, in my opinion. It doesn't do anything with the partitions or disk. It doesn't format anything during the process. CPUSE since R80.40 seems to have adopted some Blink logic, as I have remaining items on /var/log/ after doing an installer clean-install to R80.40, R81 and R81.10. I don't recall precisely when this behaviour changed, but I'm pretty sure that in the past an installer clean-install would at least nuke all content on both lv_current and lv_log. This no longer seems to be the case.

Today it seems like the only difference between using CPUSE and Blink is that Blink will retain your admin user and password, your external interface and the default route, while CPUSE will clean the running Gaia configuration thoroughly. But they both seem to utilise the same underlying logic of swapping the lv_root partition and leaving lv_log as-is, whether you select clean installation or upgrade. When doing in-place upgrades, I can't see any reason why someone would opt for CPUSE over Blink when the underlying logic seems to be this similar. Blink only brings additional benefits, like not stopping services during the process and letting you utilise a package that already contains the recommended jumbo hotfix.

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME

genisis__ · ‎2023-03-23

In VMware Workstation, I built an R81.20 appliance and then ran 'fdisk -l /dev/sda':
# fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt

#         Start          End    Size  Type
 1         2048         4095      1M  BIOS boot parti
 2         4096       618495    300M  Microsoft basic
 3       618496     17395711      8G  Linux swap
 4     17395712    209713151   91.7G  Linux LVM

Then I ran the First Time Wizard and configured it as a log server.

Disk /dev/sda: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt

#         Start          End    Size  Type
 1         2048         4095      1M  BIOS boot parti
 2         4096       618495    300M  Microsoft basic
 3       618496     17395711      8G  Linux swap
 4     17395712    209713151   91.7G  Linux LVM

As we can see, nothing changed, as I would expect.

I then imported the R81.10 Fresh Install / Upgrade package, ran it using the "clean install" option, and ran the fdisk command again:
# fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt

#         Start          End    Size  Type
 1         4096       618495    300M  EFI System
 2       618496     17395711      8G  Linux swap
 3     17395712    209713151   91.7G  Linux LVM

Would anyone care to explain the above, and whether it is OK to continue with this in a live environment?
