Duane_Toler
MVP Silver

CloudGuard Controller Azure API outage

Hey all, 

One of my customers endured an outage across their firewalls when using the Azure API dynamic objects.  Best we can tell, Azure had an "issue" that they have yet to admit or explain.

I had a Critical TAC case yesterday and a group call with Check Point TAC, Microsoft Azure support (Sev A case), and more than enough customer managers, group directors, and team leaders.  

When "it" happened, the management server got a poll response from the Azure API that said "you have no resources", and a mass "delete-identity" IA API command was sent to all CloudGuard gateways, zapping hundreds upon hundreds of mapped identities.  "Oops".  Even in SmartConsole, the Data Center browser wasn't showing the Azure subscriptions!  I eventually restarted the vSEC controller and they all came pouring in again, and the identities were added back to the gateways!  Yet, 3 minutes later, they all were stripped out.  However, this second time, the identities weren't deleted from the gateways; they just weren't visible in SmartConsole anymore.

Later, in the late afternoon, a second vSEC controller restart was done, this time with debugging enabled, and everything has been stable since.

We pored over the cloud_proxy.elg debugs during all of this.  After stripping out the Bearer token strings, we uploaded this debug to Microsoft Azure support, who will relay it to their API people.
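For anyone who needs to scrub a debug file before handing it to a vendor, here is a minimal sketch of the token-stripping step in Python. It assumes the tokens appear in the usual "Bearer <token>" form; I can't vouch for the exact layout of cloud_proxy.elg, and the function name and placeholder are made up for illustration, so check the result by eye before uploading anything.

```python
import re

# Assumption: tokens appear as "Bearer <token>", using the token characters
# allowed by RFC 6750 (letters, digits, - . _ ~ + /, optional trailing "=").
BEARER_RE = re.compile(r"(Bearer\s+)[A-Za-z0-9\-._~+/]+=*", re.IGNORECASE)


def redact_bearer_tokens(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace the token after each 'Bearer ' with a placeholder.

    The 'Bearer ' prefix is kept so the reader can still see where a token
    was sent; only the secret itself is removed.
    """
    return BEARER_RE.sub(lambda m: m.group(1) + placeholder, text)
```

Run it over each line of a copy of the log (never the original) and write the output to a new file.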

This morning, Check Point TAC came back saying they had multiple cases for this issue from other customers, but I haven't gotten any concrete info on what happened.  Best estimates at this point are "Azure API people did something".

Good luck to everyone and I hope you all were mostly spared!

 

--
Ansible for Check Point APIs series: https://www.youtube.com/@EdgeCaseScenario and Substack
9 Replies
Duane_Toler
MVP Silver

The briefest TAC update ever, but still pending more details:

I have been provided the following information:
The symptoms as a result of a glitch on Microsoft Azure side.

 TBD....

PhoneBoy
Admin

When I hear the word glitch in this context:

image.png

Duane_Toler
MVP Silver

Hah for real!  

My customer is eager for updates, because we have instituted a temporary Change Freeze, and we'd like to get back into action as soon as we're comfortable this issue either won't recur or is resolved. 😨🤞

 

noyerez
Employee

Hi @Duane_Toler,

We are aware of the issue and are currently investigating it as a top priority. My team is working with our Microsoft Azure peers to identify the root cause.

We will keep you updated on our progress.

Duane_Toler
MVP Silver

By chance, any update available on this?  I got a message from TAC that a hotfix is being developed to compensate for the Azure API behavior, but I don't have any details about what that will involve.

 

Jeff_Engel
Employee

Hi @Duane_Toler Checking...

Duane_Toler
MVP Silver

I got a message from TAC requesting a meeting with R&D to discuss an RCA for this issue.  Only once before have I had such a thing!  I do have some additional information, but I'm withholding that here until we have the meeting; I don't want to get ahead of any official messaging and add more noise to the signal.

 

Duane_Toler
MVP Silver

After the call with TAC and R&D this morning, the answer was that this issue was indeed an Azure API error.  The result in the cloud_proxy.elg log was very obvious:  Azure returned 0 objects/results.  The API call itself did not fail, either, which was also clear in the logs.  What we wanted to know was "did something go wrong in the API call, and did something go wrong in the management-to-gateway connections for CPRID?".  The answer to both was "No".  Even without vSEC controller debugs, the logs showed a rather clear order of operations.
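To show why "successful call, zero results" is so nasty, here is a rough Python sketch of the kind of sanity check a poller could apply before acting on a result. To be clear, this is not how the CloudGuard Controller works and not what any hotfix does; I have no details on either. The names (PollDecision, evaluate_poll) and the 50% threshold are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PollDecision:
    apply: bool   # True: act on the new result; False: keep last known state
    reason: str


def evaluate_poll(previous_count: int, current_count: int,
                  max_drop_ratio: float = 0.5) -> PollDecision:
    """Decide whether a *successful* poll result is plausible enough to act on.

    Hypothetical rule: if the previous poll saw objects and the new one sees
    none, or far fewer than before, hold the last known state instead of
    propagating deletions.
    """
    if previous_count == 0:
        # Nothing to lose, so there is no baseline to protect.
        return PollDecision(True, "no baseline to compare against")
    if current_count == 0:
        return PollDecision(False, "successful response with zero objects; holding last known state")
    drop = (previous_count - current_count) / previous_count
    if drop > max_drop_ratio:
        return PollDecision(False, f"object count dropped by {drop:.0%}; holding last known state")
    return PollDecision(True, "within expected change")
```

With a baseline of a few hundred objects, evaluate_poll(400, 0) refuses to apply the result, while a small change like evaluate_poll(400, 390) goes through. The obvious trade-off is that a legitimate mass removal would also be held until someone confirms it.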

Presumably, Check Point and Azure folks have had some back-and-forth, but it seems Microsoft is generally stonewalling their part of this and being obstinate about it (color me surprised!).  While on our initial group call, Microsoft support folks were friendly, helpful, and very specific about the logs they needed, which required the vSEC debug trace to be on.  Logs were uploaded to the Microsoft support case portal and... crickets.

Even after 10 days, Microsoft has yet to say anything.  The customer is pleased that TAC has at least made efforts to update us on things, even though there hasn't been much to say.  The customer is not-so-pleased with Microsoft, especially given the cash they spend on Azure.

We're going to schedule another joint session with TAC/R&D and Microsoft to see if that will make the Azure people get moving.

We've been in a stable state since last Monday afternoon, when I think Azure folks "fixed" their API controller response issue.  Who knows...maybe they were scaling out their controller instances and the new instances started serving API requests without having connected to the backend provisioning service... or maybe they moved their instance to a MANA-enabled hypervisor. 😂

 

Duane_Toler
MVP Silver

Ok, Microsoft Azure support confirmed what we already knew.  It was an issue on the Azure API backend.

Screenshot 2026-06-02 at 10.25.23 AM.png

They sent a follow-up message reiterating that it was indeed a problem on their side.  Microsoft Azure folks are still doing some additional review work to find out what happened internally.

If anyone else had issues during this original time window, this is why.  Just remember: "Cloud" doesn't mean "completely redundant", nor does it mean "fail-proof".  It's just someone else's computer far away with the same problems as yours.

Good luck to everyone!

 

