Re: MDS backup too big and slow in R80.10

Kaspars_Zibarts · ‎2018-08-22

For those running an MDS management solution: what's your take on backup after R80.10? In our case, in R77.30 the backup was approx 3GB in size and it took less than half an hour to restore the MDS and have it up and running. With R80.10 the backup has grown to 18GB(!) within a year and the actual process takes well over an hour, if not closer to two. As an engineer I might accept the argument that R80.10 brought in so many new features, thus increasing backup size, but from a business and disaster recovery point of view it is a complete shambles.

Ironically, it even makes the support process painfully slow: I was asked to upload an MDS backup yesterday and, considering that CP FTP servers are over 50ms away from us, it will take a couple of hours to complete.
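A rough sketch of why latency matters for an upload like this: a single TCP stream can send at most one window of data per round trip, so throughput is bounded by window size divided by round-trip time. The 18GB size and 50ms figure come from the post above; treating 50ms as the round-trip time, the window sizes tried, and the function names are assumptions for illustration, not measurements.

```python
def max_tcp_throughput(window_bytes, rtt_seconds):
    """Upper bound on single-stream TCP throughput in bytes/second:
    at most one full window can be in flight per round trip."""
    return window_bytes / rtt_seconds


def transfer_hours(size_bytes, window_bytes, rtt_seconds):
    """Best-case transfer time in hours under the window/RTT bound."""
    return size_bytes / max_tcp_throughput(window_bytes, rtt_seconds) / 3600


if __name__ == "__main__":
    size = 18 * 10**9   # 18 GB backup (figure from the post)
    rtt = 0.050         # 50 ms, assumed here to be the round-trip time
    # Window sizes are assumptions; the real value depends on OS tuning
    # and on what the FTP server negotiates.
    for window_kib in (64, 256, 1024):
        hours = transfer_hours(size, window_kib * 1024, rtt)
        print(f"{window_kib:>5} KiB window: at least {hours:.1f} h")
```

This ignores packet loss, protocol overhead, and link capacity, so real transfers can only be slower than the bound it prints.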

I have been raising SRs trying to point out the inefficiency of the MDS backup process for years: the same MDS TGZ is archived and compressed 4 times... Seriously. In order to restore a backup now (official MDS Gaia backup) we would need nearly 100GB of free disk space. Not that it costs too much money, but it makes it so slow.

I'm not expecting many votes, as probably not that many run MDS, but it would still be good to hear opinions on the matter.

Tomer_Sole · ‎2018-08-23

Hi, please see https://community.checkpoint.com/thread/6312-how-can-i-control-the-size-of-my-r8010-security-managem...

Kaspars_Zibarts · ‎2018-08-23

I have already gone through it, Tomer, straight after updating to Take 42, and it made no difference. Postgres just keeps growing like nuts every month.

Tomer_Sole · ‎2018-08-23

How many revisions do you have, and how many IPS updates? The difference is that we didn't take history in our backups prior to R80.

Kaspars_Zibarts · ‎2018-08-23

When this info came out originally, I went and deleted every single revision on every single CMA that we had (we're talking hundreds), but it hardly made a dent in the backup size. I have already asked that we would rather not have revisions at all and instead rely on good old backup restore. But got nowhere. Is it possible to disable revisions across the whole MDS?

Tomer_Sole · ‎2018-08-23

First of all, if you deleted all your previous revisions and size was still big, then even if you could disable revisions that wouldn't have solved your particular problem.

So now I'm thinking it's worthy of a support ticket. Please send me your support ticket privately so that I can track its findings.

Brian_Deutmeyer · ‎2018-08-23

We have also seen our backup size grow. I've been running into a memory consumption issue while purging and haven't been able to purge for a while, so I assumed that was the cause in our case. We are working on getting that straightened out first. The other thing we've noticed is that while backups are occurring, SmartConsole becomes slow or unresponsive.

Kaspars_Zibarts · ‎2018-08-23

We have always run backups out of hours as it needs to halt CMAs. As for memory consumption, I didn't notice any yesterday when I purged over 10000 revisions. No problems there. But we have 128GB on that VM, so I believe that should suffice.

Brian_Deutmeyer · ‎2018-12-06

After upgrading memory, the memory consumption issue went away.

Kaspars_Zibarts · ‎2018-08-23

OK, morning update: after purging over 10000 revisions across all CMAs, the backup size remained the same.

Before

-rw-r--r--. 1 netbackup1 netbackup 17G Aug 22 04:10 backup_mds01_22_Aug_2018_02_30.tgz

After

-rw-r--r--. 1 netbackup1 netbackup 17G Aug 24 04:08 backup_mds01_24_Aug_2018_02_30.tgz

SR it is then

Tomer_Sole · ‎2018-08-30

I just want to say I'm tracking your support request and hope this comes to a resolution soon that might fit for other customers as well.

Kaspars_Zibarts · ‎2018-08-30

Much appreciated! thanks heaps

Sander_Zumbrink · ‎2018-08-30

Are you running the MDS on vmware or dedicated hardware?

I've also noticed the same, mostly caused by slow disk IO.

Running on dedicated hardware sped up the backups, and the performance of the MDS itself.

Kaspars_Zibarts · ‎2018-08-30

It's a VM with proper storage. So IO is not the issue. How big is your backup? How much did it grow from R77 to R80?

Sander_Zumbrink · ‎2018-08-30

Our backup is around 30 GB. On dedicated hardware it takes 1h15.

Before our transition to dedicated hardware it took around 5 hours.

We also didn't expect disk IO issues, but we saw a lot of improvement.

The dump of the postgresql database was causing a lot of disk IO and took some time.

We migrated a long time ago, so I no longer know the size of the R77.30 backup.

Kaspars_Zibarts · ‎2018-08-30

Did you mean 1hr to create backup? That's normal. Ours is 17GB and it takes approximately half the time to create it. The problem is restore time for me. Takes way too long imo

Sander_Zumbrink · ‎2018-08-30

The restore also made a big difference on the physical hardware. Restoring the 30 GB and reinitializing the solr database took approx 14 hours in vmware. On physical hardware it was 2 hours, I think. My opinion is that big MDS environments are not suitable for vmware.

Kaspars_Zibarts · ‎2018-08-30

Hard to say about 30GB, but our 17GB restores in ~2hrs in the lab VM, which is slightly under-powered. Still a long time compared to what it used to be in R77. And it keeps growing quite fast (~1GB per month). I just think the whole Gaia backup process for MDS can be improved dramatically, as the basic mds backup file that forms the biggest chunk in the archive is actually compressed and archived 4 times! Seems like overkill. They could have used simple tar without compression after it's been compressed once.
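The point about repeated compression can be illustrated with a minimal sketch: gzip the same data four times and look at the size and time of each layer. The payload below is a made-up stand-in for a database dump (random hex IDs plus a few repeated words), and `make_payload` and `gzip_layers` are hypothetical helper names; this does not reproduce the actual layout of a Gaia or mds_backup archive.

```python
import gzip
import random
import time


def make_payload(rows, seed=0):
    """Build fake dump-like text: a random 64-bit hex ID and a type per row."""
    rng = random.Random(seed)
    names = ["host", "network", "service", "rule"]
    lines = [f"{rng.getrandbits(64):016x},{rng.choice(names)}\n" for _ in range(rows)]
    return "".join(lines).encode()


def gzip_layers(data, layers):
    """Gzip `data` repeatedly, feeding each output into the next pass.
    Returns one (compressed_size_bytes, seconds) tuple per layer."""
    results = []
    for _ in range(layers):
        start = time.perf_counter()
        data = gzip.compress(data)
        results.append((len(data), time.perf_counter() - start))
    return results


if __name__ == "__main__":
    payload = make_payload(200_000)
    print(f"original: {len(payload)} bytes")
    for layer, (size, secs) in enumerate(gzip_layers(payload, 4), start=1):
        print(f"layer {layer}: {size} bytes in {secs:.3f}s")
```

On data like this, only the first pass should shrink anything noticeably; later passes spend CPU time on input that is already close to incompressible, and the same work has to be undone layer by layer on restore.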

Sander_Zumbrink · ‎2018-08-30

The restore itself is 2 hours. But when you start the MDS, it takes up to several hours rebuilding the SOLR.

And yes... why do they gzip the archive multiple times? That takes time and is not necessary.

Kaspars_Zibarts · ‎2018-08-30

That's my point: if we whinge enough here we might get some attention. Sorry Tomer, nothing against you personally; on the contrary, you have always gone the extra mile and it is really appreciated. This is more constructive feedback about the product. And yes, we will have another remote session next week; I had issues with my lab yesterday (the backup restore took too long, so I was not ready when the meeting time was up).

Mike_A · ‎2018-08-30

Are you just simply running an mds_backup?

1.) Are you by chance using the -l flag to not include your logs as well?

2.) What dir are you running the mds_backup from?

I recently upgraded from R77.30 MDS to R80.10 and my backups grew by maybe 300-400MB. Nothing crazy.

Kaspars_Zibarts · ‎2018-08-30

Hi, no, this is a bog-standard Gaia backup that wraps mds_backup. Logs are excluded. The biggest increase is the postgres DB dump (part of mds_backup), which has grown from 3GB originally to nearly 10GB in one year. So keep an eye on yours. Or it could be that ours is "broken" somewhere, as we went from R77.30 to R80 and then R80.10. Who knows. I'll keep this thread posted about my SR progress.

Olavi_Lentso · ‎2018-08-31

Our case is the opposite, we had almost 50GB backup files under R77.30, probably because of many historical database revisions we had kept in each CMA. Under R80.10 the backup is about 22GB, because older revisions got lost during the migration. During the last 3-4 months the backup has grown 1GB, which is not terrible and usually it takes approximately 1h20-1h30 to create the backup file. Restoration time is about 1 hour in a vmware lab, but solr rebuild takes additional time indeed.

From time to time pg_dump processes stay in the process table even after the backup file has been created, sometimes dbedit locks are not removed, etc. I would say that the older pre-R80 backup was more bulletproof than the current one. No positive comments about R80.10: no quick cross-domain search, the SmartConsole is buggy, it has short freeze moments all the time (which can even be observed on CP demo streams), full text search is not working properly, etc.
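For the lingering pg_dump processes mentioned above, a small check after the backup window could look like the sketch below. It assumes a Linux `ps` that accepts `-eo pid,etime,args` and Python 3.7 or later, which may not match what ships on a given Gaia release; the function names are illustrative only.

```python
import subprocess


def find_processes(ps_output, name):
    """Return (pid, elapsed, command) for each ps line whose command mentions `name`."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # first line is the ps header
        parts = line.split(None, 2)          # pid, elapsed time, full command line
        if len(parts) == 3 and name in parts[2]:
            matches.append((int(parts[0]), parts[1], parts[2]))
    return matches


def lingering(name="pg_dump"):
    """Run ps and list processes whose command line contains `name`.
    Note: this is a plain substring match, so it also catches pg_dumpall
    and any script whose own command line contains the name."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etime,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_processes(out, name)


if __name__ == "__main__":
    for pid, elapsed, cmd in lingering():
        print(f"{pid}\t{elapsed}\t{cmd}")
```

The elapsed-time column makes it easy to tell a dump that is still legitimately running from one left over from a backup that finished hours ago.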

Kaspars_Zibarts · ‎2018-08-31

Yep, that's what I meant. Full restore including the solr DB restore, which may extend the total time by an hour or even more.

Otherwise I actually liked R80.10. No doubt we raised a fair few cases, but that was expected with such a major update to the SW architecture. We use Tufin quite extensively, and that normally covers CP shortcomings like cross-CMA search. SmartConsole has been rather stable since the last JHF, I must say.

I wanted to keep this thread purely for backup restore time but feedback is much appreciated!

Tomer_Noy · ‎2018-09-12

Hi Kaspars,

Thank you for raising this item for discussion.

Eran Habad and I (in Management R&D) would like to further investigate this issue in order to better understand the reason for the backup growth.

We would greatly appreciate it if you can use your SR and ask Support to open a CFG task to be assigned to Eran's group. On this SR, please attach a recent backup file + ask support to run the CPM Doctor utility and provide the output. Both items should give us an overall view of your system and what is backed up.

I hope that you can share this info with us.

Thanks!

Kaspars_Zibarts · ‎2018-09-12

Hi Tomer, the SR was created on 24th August. I'll pass the message above directly into the case. You guys have the full backup there.

Mike_A · ‎2018-12-03

Kaspars,

Was there ever any resolution to this? I'm interested to know what TAC found, if you can share, as to the reasoning behind the extremely large backups.

Thanks!

- Mike

Kaspars_Zibarts · ‎2018-12-03

Apparently some things have been discovered in our backup, and we're not the only ones. We have not received a full fix, just a partial one. So in short, still waiting, but it looks like there's light at the end of the tunnel. I will definitely post here if we eventually get noticeable results.

Mike_A · ‎2018-12-03

Great, thank you!

Kaspars_Zibarts · ‎2019-01-14

Some good news from R&D, Mike Andretta! Our current backup is going from 17GB down to 7GB! Back to R77.30 size.

I just wanted to let you know that I have tested the private hotfix on our replication with your database and managed to generate an mds_backup file which is 6.6GB.

I will update you once the fix is inside the R80.20 Jumbo hotfix.
