I've encountered some strange issues on centrally managed Spark Gateways.
I'm not a fan of the default "Internet" object, so I create a network object for each RFC 1918 range, add them to a group, then create a group with exclusion of that RFC group, which becomes my "Internet" object (all addresses excluding the private ranges). I've used this approach in many places with no issues. I also create an object for each local network and a group called ALLNETS that contains them all, which I guess is pretty standard anyway.
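For clarity, the logic behind that "Internet" object can be sketched as follows (a minimal illustration only, assuming the three standard RFC 1918 ranges; the function name is hypothetical, not a Check Point API):

```python
import ipaddress

# Stand-ins for the three network objects in the RFC group (RFC 1918 ranges).
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_internet(ip: str) -> bool:
    """True if the address falls outside every RFC 1918 range,
    i.e. it would match the 'group with exclusion' Internet object."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in RFC1918)

print(is_internet("8.8.8.8"))       # public address -> True
print(is_internet("192.168.60.1"))  # private address -> False
```

The key property, which matters later, is that any private address (including the firewall's own internal IPs) is *not* part of this Internet object.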
I have a customer with several remote Spark 1535/1555 gateways and a central 9100 cluster with an SMS on the main site. All Sparks are centrally managed by the main SMS; the policies aren't huge, but I use the above objects with no issues. Most of these sites have a central Windows server that handles DHCP/DNS, and the Spark is the default gateway. One of them has an additional network connected to LAN3 where the firewall handles DHCP/DNS for that network. Again, this has been in place for over a year now with no issues.
I have another customer with a pair of Spark 1600s in a cluster, locally managed. The policy is fairly basic but has been in place for years, and there are a lot of obsolete rules and objects. They wanted more visibility of logs and reporting capabilities, so we have moved the management to Smart-1 Cloud. (This was not without its issues, especially as the admin guide is missing important info, like the fact that you have to break the cluster to achieve this!) It was also decided to create a new policy to ensure it is clean going forward. Whilst the 1600 has lots of LAN ports, for unrelated reasons there is only a single connection to the network from LAN1, which has a Windows server controlling DHCP/DNS. There are also a couple of VLANs on that port: VLAN30, which is managed from an internal device, and VLAN60, for which the firewall handles DHCP/DNS.
I created the new policy using my normal (tried and tested) methodology. The ALLNETS group does not contain VLAN60, as it is used for guests and needs to be isolated from the LAN. I then add the following rules:
Allow traffic between the main LAN and VLAN30
Block all traffic between VLAN60 and any other local network (ALLNETS)
Allow main LAN to my Internet object
Allow VLAN60 to my Internet object
Cleanup rule is Any Any Drop
There are other rules, obviously, like blocking main LAN to Critical Risk applications, but nothing really relevant to the issue.
HTTPS Inspection is not enabled; Threat Prevention is enabled using the default Optimized profile.
When I started testing, the main LAN was fine and VLAN30 was fine, but VLAN60 was not: it couldn't get out to the internet.
Testing showed that I could ping a website by IP but not by name, which pointed to a DNS issue. I checked the logs and found traffic being blocked to the firewall's IP in VLAN60 (the internal VIP). I tried pinging that IP with no response, then tried the internal IPs of the individual cluster members, again with no response. Pinging another device on the VLAN was fine. So the firewall is clearly blocking traffic to its own IPs in the VLAN. (There is no stealth rule to hide the firewall internally.)
I added a rule to allow traffic from VLAN60 to VLAN60 and that resolved the issue, but I don't understand why it was an issue in the first place, especially as this is not a problem on the first site!
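To illustrate why the DNS query fell through to the cleanup rule, here is a toy first-match simulation of the rulebase above. The addressing is entirely made up for illustration (I'm assuming VLAN60 is 192.168.60.0/24 and the firewall's VIP in it is 192.168.60.1); this is a sketch of the matching logic, not anything Check Point-specific:

```python
import ipaddress

# Illustrative networks only - assumptions, not the customer's real addressing.
LAN     = ipaddress.ip_network("192.168.1.0/24")
VLAN30  = ipaddress.ip_network("192.168.30.0/24")
VLAN60  = ipaddress.ip_network("192.168.60.0/24")
ALLNETS = [LAN, VLAN30]  # VLAN60 deliberately excluded (guest isolation)
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def in_any(ip, nets):
    return any(ipaddress.ip_address(ip) in n for n in nets)

def internet(ip):
    # The "everything except RFC 1918" Internet object.
    return not in_any(ip, RFC1918)

def match(src, dst):
    """First-match evaluation mirroring the rules listed above."""
    if in_any(src, [LAN]) and in_any(dst, [VLAN30]):    return "Allow LAN<->VLAN30"
    if in_any(src, [VLAN30]) and in_any(dst, [LAN]):    return "Allow LAN<->VLAN30"
    if in_any(src, [VLAN60]) and in_any(dst, ALLNETS):  return "Block VLAN60->ALLNETS"
    if in_any(src, [LAN]) and internet(dst):            return "Allow LAN->Internet"
    if in_any(src, [VLAN60]) and internet(dst):         return "Allow VLAN60->Internet"
    return "Cleanup drop"

# A guest's DNS query to the firewall's own VIP in VLAN60: the VIP is
# neither in ALLNETS nor in the Internet object, so nothing matches
# and the packet hits the cleanup rule.
print(match("192.168.60.50", "192.168.60.1"))  # -> Cleanup drop
print(match("192.168.60.50", "8.8.8.8"))       # -> Allow VLAN60->Internet
```

This at least shows that, on this rulebase, a VLAN60-to-VIP flow has no rule to match, so the VLAN60-to-VLAN60 allow rule giving it a match before cleanup is consistent with the fix. The open question remains why the first site doesn't need the equivalent rule.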
The three differences are that this site uses Smart-1 Cloud management whilst the other uses an SMS server, this site uses Spark 1600s whereas the other uses Spark 1500s, and this site has a cluster. I don't believe any of these should make a real difference.
The other abnormalities that we're seeing are:
A couple of websites simply don't work without a specific allow rule for their domain. When I get time I'm going to try to grab some fw monitor and tcpdump captures, but I need to wait until it happens again and hope the customer will give me time to do so rather than just adding another rule! It feels like a Threat Prevention issue, but before adding the rule for the domain I tried adding the domain as a TP exception set to detect only; it didn't help, and nothing was logged against the exception rule.
We're also seeing a very high number of CPNotEnoughDataForRuleMatch logs. I understand what this is, but I've never seen such a high number in any other environment, which makes me wonder if something is wrong somewhere. I spoke to TAC about this one and they said to ignore it, which isn't sitting right with me. I suppose it's possible that this is related to the customer's environment and has always been happening, but the local management wasn't logging it. The logs show this is mostly TCP or UDP high port numbers, although there are some for HTTPS too. One very obvious thing is that a huge number have a destination IP of 239.255.255.250, which is the SSDP/UPnP discovery multicast address, but I'm not sure if this is relevant.
It's not an easy one to test as it requires a Spark device. I do have access to a couple (a 1555 and a 2000), but it'll take a couple of days to get them into my lab for testing.