wanartisan
Contributor

Azure Data Center Objects - Inaccessible

Scenario

  • Smart-1 Cloud
  • CloudGuard Azure HA cluster R81.20

 

We configured integration with Azure as a data center object. Today it stopped working, giving

  • validation errors in SmartConsole on objects derived from this integration (inaccessible/doesn't exist)
  • can't browse objects in the Object Explorer (after Import in the rulebase)
  • Connection test works
  • curl_cli --verbose https://management.azure.com --cacert $CPDIR/conf/ca-bundle-public-cloud.crt is ok
  • azure_ha_test.py is ok
  • Azure side seems ok

I found an sk referencing an HTTP/1.1 429 error and a Microsoft Learn article, "Understand how Azure Resource Manager throttles requests - Azure Resource Manager | Microsoft Learn". I can't find either of them now!

Anyway, I found the azure_had.elg file, which has contained loads of errors for a long time. But only today did we see any manifestation in SmartConsole. A couple of questions:

  1. We have not configured any objects in the rulebase yet. If we had, would this be service-affecting?
  2. Does the cloud_proxy.elg file exist in R81.20? Is it on the management server?
  3. Anywhere else I can look for clues?

Thanks in advance

avivs
Employee

Hello,

1. Data Center Objects are cached on the gateway.
How long this cache is kept depends on the configuration. By default it is 1 week (10080 minutes), and it can be extended up to 1 month.
The configuration is done in the file $FWDIR/conf/vsec.conf and uses the following values:
# TTL (mins) for objects expiration on GW in case there are no updates
# from the Controller
# min value=5
# max value=43200
# Default value: 10080
enforcementSessionTimeoutInMinutes=10080
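
Something like this should show the current value on a gateway (a generic sketch assuming the default file location; check the admin guide before changing anything):

# Show the currently configured TTL on the gateway (value is in minutes)
grep enforcementSessionTimeoutInMinutes $FWDIR/conf/vsec.conf

# Edit the value if needed (keep it between 5 and 43200 minutes)
vi $FWDIR/conf/vsec.conf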

See the CloudGuard Controller configuration parameters documentation for additional information:
R81.20 CloudGuard Controller Administration Guide

This is a security feature that aims to prevent cases of obsolete data being used in firewall rule enforcement.

2. The $FWDIR/log/cloud_proxy.elg file exists in R81.20 and is the first log to look at when facing issues with Data Center Objects.
If you wish to attach it here (or send it privately), we will be happy to have a look and share our input on it.
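
If you have shell access to the management, something like this is usually enough for a first look (a generic sketch; the exact error strings will vary):

# Follow the CloudGuard Controller proxy log live
tail -f $FWDIR/log/cloud_proxy.elg

# Or search the current and rotated logs for errors around the failure time
grep -i error $FWDIR/log/cloud_proxy.elg*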

3. If you have already opened an SR, I will be happy to take a look if you can share the SR # in a DM.

 

Thanks,

 

Aviv Shabo

CloudGuard Network R&D

wanartisan
Contributor

Thanks for the reply, Aviv.

I have opened a case but it is with our Collaboration Support partner just now. 

So cloud_proxy.elg is on the CloudGuard Controller, i.e. the management server? (We have Smart-1 Cloud.)

I just found the CloudGuard Controller ATRG (sk115657) and checked the logs (blade:CloudGuard IaaS). They show the mapping as OK, with failures every few minutes.

I'll attach a few screenshots that might help. 

AaronCP
Advisor

Your first point is relevant if the SMS/MDS loses connectivity to the Data Centre Object. My understanding is that if the SMS/MDS loses trust with the Generic Data Centre object (the remote certificate is changed, or the certificate in the local certificate store gets deleted/corrupted), then by design the gateway will clear the object cache, impacting traffic until trust is re-established - so worth bearing in mind.

avivs
Employee

So there are indeed 2 scenarios here:

  1.  Mgmt is no longer able to complete data center scanning
  2. Mgmt is no longer able to communicate with the gateway

In the first case, as long as communication between the management and the gateway is working, the Data Center Objects' (DCOs) time to live (TTL) is extended. This is because the CloudGuard Controller running on the management understands that this is a scanning issue, so enforcement should continue working with the existing information.

 

In the second case, the management is no longer able to send updates to the gateway. On the gateway side, we cannot assume the reason for this, so once the TTL expires these DCOs will no longer be enforced.

 

For this reason, our best practice is to use DCOs for whitelisting (allow rules) rather than blacklisting (blocking rules).

The validation errors you are getting suggest that the access you provided for scanning your Azure data center was enough to properly establish a connection, but not enough to scan any supported Data Center Object.

 

Our best practice for providing Azure access to the CloudGuard Controller is to create a service principal.
The minimum recommended permission is Reader.
You can assign the Reader permission in one of these ways (see the sketch below for an Azure CLI example):
  • Assign it to all Resource Groups from which you want to pull items
  • Add the permission at the subscription level
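
Purely as an illustration (the service principal name is a placeholder and your scoping may differ), the Azure CLI can create a Reader-scoped service principal roughly like this:

# Create a service principal with the Reader role at subscription scope
# (replace <SUBSCRIPTION_ID> and the display name with your own values)
az ad sp create-for-rbac --name "cloudguard-controller-scan" --role Reader \
    --scopes /subscriptions/<SUBSCRIPTION_ID>

# Or scope it to a single Resource Group instead
az ad sp create-for-rbac --name "cloudguard-controller-scan" --role Reader \
    --scopes /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>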

If you haven't had a chance to look at the CloudGuard Controller for Azure section of the CloudGuard Controller admin guide, it might be worth your while to do so now.

 

wanartisan
Contributor

So I logged in this morning and the Azure integration is working as expected again. The logs I posted showing failures on the afternoon of 14th March are the only ones, and they correlate with the object outage.

I will follow up with the support ticket and try to get a reason.

wanartisan
Contributor

Seems like the issue started at the time of the first mapping failure (13:14:29).

critical_24hrs.png

wanartisan
Contributor

Another query I have is about the log entries for these mapping failures. On the day of the issue there were lots that say "Mapping of Data Center [Azure-DC_Integration] failed. Next mapping is in 300 seconds." (usually the time is 32 seconds). However, most of these are High-severity alerts and only a few were Critical.

I want to set up a SmartTask to alert us about issues with CloudGuard (as per the documentation). I understand the built-in trigger responds to critical alerts only. Can anyone advise on the difference between High and Critical in what looks like the same alert?

tomlev
Employee

Hi @wanartisan, you can open the Data Center object in SmartConsole and click 'Test Connection' to see a possible reason for the failure, or look in cloud_proxy.elg for more information on the failure.

As for the log levels, there is no difference in the error type. The CloudGuard Controller sends High-level errors on cloud mapping failures or gateway update failures, and if they fail a number of times in a row, it sends a Critical log.

The reason is that some failures may happen once in a while due to the network or even the cloud vendor.

The reason you got "next scan in 300 seconds" is a backoff mechanism that delays the next scan after a failure; 300 is the default maximum value for the backoff.
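
Just to illustrate the behaviour (this is a toy sketch of a capped backoff, not the Controller's actual code, and the growth factor is made up), repeated failures push the delay up until it hits the cap:

# Toy illustration: the delay grows after each consecutive failure
# until it reaches the 300-second maximum
delay=32
max=300
for attempt in 1 2 3 4 5
do
    echo "attempt $attempt failed, next scan in ${delay}s"
    delay=$(( delay * 2 ))
    [ "$delay" -gt "$max" ] && delay=$max
done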

wanartisan
Contributor

Thanks for the clarification, Tom.

Check Point support are looking at the cloud_proxy.elg file now (Smart-1 Cloud; I have no access). 

On the bright side, I have learned a lot from the forum on this. I've checked the TTL on our gateways and it is 7 days, as Aviv suggested (I just rewatched a CloudGuard Controller Unleashed webinar and I'm sure it said 3 days...).

I can proceed now with more confidence that production traffic would be unaffected by these events, and that we will have more assurance around catching them when they happen. All that is missing is a reason why...

I'll let you know how that goes. 

AaronCP
Advisor

Hey @wanartisan,

As I mentioned previously, if there is a loss of trust between your management platform and the remote server (an HTTPS cert renewal on the remote server, for example), there will be an impact (this was confirmed to us by our Diamond Engineer). In the event of a loss of trust, the cache for the cloud object on the gateway will be cleared until trust is re-established. This is by design. You'll need a way of monitoring the renewal of the HTTPS certificates.
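
One lightweight way to spot-check when a remote endpoint's certificate expires (a generic sketch, using the ARM endpoint from earlier in the thread as the example host):

# Print the expiry date of the certificate presented by the remote endpoint
echo | openssl s_client -connect management.azure.com:443 -servername management.azure.com 2>/dev/null | openssl x509 -noout -enddate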

wanartisan
Contributor

TAC are looking at this issue with another case and I will be having a remote session with them this week. 

The issue appeared again on Monday, but this time without any errors (?), so my SmartTask didn't let me know... It is intermittent, so it's not due to HTTPS certs as far as I can tell.

