
Azure NIC issues - possibly waagent related

Hi all,

I've noticed recurring issues with our Check Point R80.20 cluster in Azure and was wondering whether anyone else has seen this behavior.

Basically, the interfaces related to Azure Accelerated Networking unregister and may come back up with a different name, which breaks traffic completely.

Although this was supposed to be solved by Jumbo HF Take 17, it occurred again.

I believe it may be related to the outdated, buggy version of the Microsoft Azure Linux Agent (waagent) installed on the VM: v2.2.11, whereas the latest available version is v2.2.42.

Now waiting for my SR to be picked up...

Two other issues with the agent that are resolved in newer versions:

- The agent's logs fill up the Azure Serial Console, making it unusable

- The agent does not use the configured proxy server

Entries in /var/log/messages:

 kernel: kernel: hv_netvsc 000d3a25-c27e-000d-3a25-c27e000d3a25 eth0: Data path switched from VF: enP1p0s2

 kernel: kernel: hv_netvsc 000d3a25-c27e-000d-3a25-c27e000d3a25 eth0: VF unregistering: enP1p0s2

 kernel: kernel: [SIM4];cphwd_api_forward_packet: sim_mgr_prepare_packet failed

 kernel: kernel: [SIM4];simlinux_br_port: dev == NULL !!!!!
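
For anyone who wants to check whether their own gateway has hit the same events, here is a minimal Python sketch that scans log lines for the two hv_netvsc messages quoted above. The regular expression is built only from those two sample lines, so it is an assumption that other kernel builds word the messages the same way; the function name `find_vf_events` is made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Pattern derived from the two hv_netvsc lines quoted in this post.
# Assumption: other kernel versions may phrase these messages differently.
VF_EVENT = re.compile(
    r"hv_netvsc \S+ (?P<nic>\S+): "
    r"(?P<event>Data path switched from VF|VF unregistering): (?P<vf>\S+)"
)


def find_vf_events(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (synthetic NIC, event text, VF name) for each matching line."""
    events = []
    for line in lines:
        match = VF_EVENT.search(line)
        if match:
            events.append(
                (match.group("nic"), match.group("event"), match.group("vf"))
            )
    return events


# The four log lines from this post, used as sample input.
SAMPLE = [
    "kernel: kernel: hv_netvsc 000d3a25-c27e-000d-3a25-c27e000d3a25 eth0: "
    "Data path switched from VF: enP1p0s2",
    "kernel: kernel: hv_netvsc 000d3a25-c27e-000d-3a25-c27e000d3a25 eth0: "
    "VF unregistering: enP1p0s2",
    "kernel: kernel: [SIM4];cphwd_api_forward_packet: "
    "sim_mgr_prepare_packet failed",
    "kernel: kernel: [SIM4];simlinux_br_port: dev == NULL !!!!!",
]

if __name__ == "__main__":
    for nic, event, vf in find_vf_events(SAMPLE):
        print(f"{nic}: {event} ({vf})")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `find_vf_events(open("/var/log/messages"))`.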

 

 


Accepted Solutions
Employee+

In-place upgrades are not supported for public cloud.

The recommended way to upgrade is by side-by-side deployment. You can have a look at the following SK for official CloudGuard IaaS Upgrading documentation:

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

You can also find here documentation on upgrading Management, Cluster, HA, increasing management server disk size etc.

Although it's possible to receive the R80.20 hotfixes by opening a support ticket, we highly recommend upgrading. Note that R80.10 and R80.20 will be delisted from the Azure marketplace during November:

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

 

 

Regards,

Dmitry

9 Replies
Admin
From what I can tell, you will need to get a hotfix from TAC for this.
Employee+

Hi,

 

The issues you describe are similar to known issues related to Azure maintenance operations, which are currently under investigation. sk160992 contains additional information. Note that they are not related to the version of the Linux agent that we deploy.

Our latest available version in Azure, R80.30, contains the necessary fixes for the issue, and we recommend upgrading. If upgrading is not possible, please contact TAC.

 

Thanks,

Dmitry

 

Hi Dmitry,
thanks for your feedback. The issues do look similar, but the kernel messages differ from those in sk160992.
After the first incident we installed the suggested Jumbo HF from sk146212 to address the GAIA-5479 issue described below:

“Azure maintenance operations on the Azure Hosts can cause the NIC driver to be reloaded. Our SW did not correctly handle all the use cases and configurations in the event of a reload operation when the gateway VM is in "started" state in Azure. This fix (introduced in Take_17) fixes this issue and makes sure that even if the driver is reloaded during regular operation, the NIC and the Security Gateway will be configured correctly.”

However, the issue recurred, and on both occasions Azure support confirmed that there had been no Azure maintenance operations, no issues, and no changes on the hosting servers.

As for the outdated waagent, I do think it might be relevant and should be updated, considering this agent is responsible for “Linux provisioning and VM interaction with the Azure Fabric Controller. Ensures the stability of the network interface name”
src: https://github.com/Azure/WALinuxAgent

Even the latest R80.30 image includes the old agent version with existing bugs.
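
Since the failure mode in this thread is a VF interface coming back under a different name, a small check that compares the names the configuration expects with the names actually present can make a rename obvious. The Python sketch below is only an illustration: `diff_interfaces` and `list_interfaces` are made-up helper names, and reading `/sys/class/net` assumes a standard Linux sysfs layout (not verified on Gaia).

```python
import os
from typing import Iterable, List, Set, Tuple


def diff_interfaces(
    expected: Iterable[str], present: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) interface names."""
    expected_set, present_set = set(expected), set(present)
    return expected_set - present_set, present_set - expected_set


def list_interfaces(sysfs_dir: str = "/sys/class/net") -> List[str]:
    """List interface names from Linux sysfs (assumed layout).

    Only directory entries are kept, which skips plain files such as
    bonding_masters that can also appear in this directory.
    """
    return sorted(
        name
        for name in os.listdir(sysfs_dir)
        if os.path.isdir(os.path.join(sysfs_dir, name))
    )


if __name__ == "__main__":
    # Hypothetical expected names; replace with the ones your gateway uses.
    expected = ["eth0", "eth1", "enP1p0s2"]
    missing, unexpected = diff_interfaces(expected, list_interfaces())
    print("missing:", sorted(missing))
    print("unexpected:", sorted(unexpected))
```

If the VF had been re-registered under another name, the old name would show up under `missing` and the new one under `unexpected`.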
Employee+

There were actually two separate issues. One was resolved in JHF Take 17; the second is fixed in R80.30 and is related to a component of our operating system. According to the output you sent, this second issue seems to be the one you are hitting. For R80.20, it is possible to obtain a hotfix for the second issue by contacting TAC (and it will be included in a future JHF take).

This specific issue does not seem related to WALinuxAgent, but we do have plans to update it to a newer version in the future.

 

Please let me know if the issue is resolved by upgrade/hotfix deployment.

 

Thanks,

Dmitry

Thanks again, Dmitry, this is good news. I searched through the release notes, but nothing similar came up.
I opened a new SR for this problem last week, but it is still pending...
Is it possible to do an in-place upgrade of CloudGuard for Azure? I'm having the same issue as noted here, but I honestly cannot find a good way to recreate our security gateway in Azure as a new image without a lot of changes to existing objects we've created (NSGs, Route tables, public interfaces, etc.).

Please tell me there is a better way to do this. Otherwise, is there an SK we can reference to get this hotfix from support, so we can just live and die on R80.20?

Nope, in-place upgrade is not supported for cloud deployments (CloudGuard versions for ESX do support upgrades).

I deleted the resource group and redeployed the cluster. Since the deployment template itself sometimes changes, this might be the best way to do it. In my case the same IP addresses were assigned.

You can open an SR to get the hotfix for these issues and stay on R80.20, but IMHO it's worth moving to the new version.

 

As a side note, on this second outage, MS support finally found the related maintenance that triggered the issue:

"a routine Azure host side networking update was applied to your VM"

Thanks. I think what I'll do in this case is retain the previous resource group with the VNET/subnets, routes, etc. in it, and then create a new resource group solely for the Check Point VM objects, to which I can still assign network properties from the existing resource group's VNET and subnets. I can then decide whether to sunset the old box and go through the process of trying to reclaim the IP addresses (public and private), or just modify my NSGs and routes in the original resource group to reflect the new IP addresses.

I have another Check Point machine, however, where that approach isn't going to work: it is a log server with 4 TB of disk, and I can't simply move the disks because the log partition is defined partly on the original VM disk and partly on an extension of the drive. In this case, I'll have to start up a new log server to get to R80.30 and have it handle logs in parallel with the old server for a period of time, so that I don't lose the massive amount of data I have.
