israelfds95
MVP Gold

SIC Redundancy in Deployments with Internet-Based Management – Design Question

Hi Mates,

I’m currently working on a large-scale deployment and would like to get insights from the community regarding SIC behavior and redundancy in a fully Internet-based management scenario. I’ve opened a TAC case, but I’d also appreciate hearing from the community; any insights or real-world experiences are welcome. Let’s open this for discussion.


Architecture Overview

Headquarters (DC):

  • Cluster Check Point Quantum 9700 - R82 JH 60

  • 7 Internet links (all with fixed public IPs)

Remote Sites:

  • 37x Check Point Quantum Spark 2550 - R82.00.05

  • Each site with 2 Internet links (both with fixed public IPs)

Management R82 JH 60:

  • Single SMS exposed to the Internet via NAT

  • “Accept control connections” enabled

  • SIC communication over the Internet

Connectivity:

  • Using Check Point SD-WAN to build Overlay VPN (Star topology)


Key Questions

This design raises some important questions about SIC resiliency:

1. SMS NAT Dependency

If the Internet link used for NAT on the SMS goes down:

  • Will all gateways lose SIC connectivity?

  • Is there any mechanism for redundancy on the management side?


2. Remote Site Link Failover

Each site has dual Internet links:

  • If SIC is initially established over Link A, and this link fails:

    • Will SIC drop completely?

    • Will it automatically re-establish over Link B?

    • Or is SIC bound to the original IP/path?


3. SD-WAN Interaction

Since SD-WAN is managing overlay paths:

  • Does SIC benefit from SD-WAN failover capabilities?

  • Can the Overlay VPN be affected or go down if SIC connectivity is lost?

⚠️ Design Concern

From a design perspective, this seems like a potential single point of failure if:

  • The SMS and remote sites depend on a single NAT IP

  • SIC does not dynamically failover between links


Looking for Insights

I’d really appreciate input on:

  • How SIC actually behaves in multi-link environments

  • Known limitations in this type of architecture

  • Recommended best practices for ensuring high availability of SIC management connectivity


Context

This will be a production environment with high availability requirements, and ensuring stable SIC across all sites is critical for operations.

1 Solution

Accepted Solutions
israelfds95
MVP Gold

TAC answer: 

SR Step Type: Solution

Solution Subject: SIC Redundancy Behavior with Internet-Based Connectivity
Solution Description: There is no such thing as SIC redundancy; you can't have SIC to two different IPs at one time (currently not supported, and no roadmap for this). Therefore, if the NAT link of the SMS goes down, you will lose SIC, and you may need to make the change manually.

If you require a solution tailored for your customer environment, you can reach out to your sales or account manager.

Let me know if you have any other questions regarding this service request.

11 Replies
Lesley
MVP Gold

SIC is certificate-based, so it should not matter if a firewall switches over to the other ISP. You should still be able to send firewall logs, push policy, etc. You can even have SIC with DAIP gateways, which do not have a static IP.

-------
Please press "Accept as Solution" if my post solved it 🙂
israelfds95
MVP Gold

Thanks for the input, that makes sense from a certificate perspective.

SIC is indeed certificate-based and not inherently tied to a specific IP, which allows scenarios like DAIP gateways. However, in practice, there is an important operational limitation in this type of design.

Even though the trust is based on certificates, the communication still relies on reachability to the management server. In SmartConsole, we define a single IPv4 address per gateway object, and the SMS is also exposed through a single NAT IP.

Because of that, in real-world deployments, SIC effectively becomes dependent on that specific NAT path.

I have another environment with a similar design, and we consistently observe SIC instability when the link used for NAT between the gateways and the SMS goes down. During these events, SmartConsole shows multiple gateways as disconnected/alarmed, even though the gateways themselves are still up and passing traffic normally.

So while SIC is not logically tied to an IP (due to certificates), operationally it behaves as if it is bound to a single reachable management IP/NAT, which introduces a potential single point of failure.

From this perspective, the key concern is not certificate validity, but management reachability.
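
To make the reachability point concrete, here is a minimal sketch of a TCP probe that could be run from a host outside the HQ against each candidate public address of the Management Server. The function names are hypothetical, and the address and port are placeholders: 18191 is simply the port probed by a script later in this thread, so adjust it to whatever your management connections actually use.

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False

def probe_management(candidates, port, timeout=3.0):
    """Map each candidate management address to its reachability result."""
    return {host: tcp_reachable(host, port, timeout) for host in candidates}
```

For example, `probe_management(["203.0.113.10"], port=18191)` (a documentation address, not a real SMS) returns a dict such as `{"203.0.113.10": False}` when that NAT path is down. A successful TCP connect only shows that the path is open; it says nothing about certificate trust.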

This leads to a follow-up question: What are the recommended designs or best practices to provide high availability for SMS connectivity in Internet-based deployments, avoiding this single NAT dependency?

Chris_Atkinson
MVP Platinum CHKP

SIC communication should not be via IPSec/SDWAN tunnels terminating on the gateways.

Using provider independent public IP addressing for the NAT of the management server is advisable where possible.

This way it is not bound to a specific ISP or link, and routing can do its thing to ensure availability.

On the gateway side this may also be helpful:

PRJ-41083.png

(A similar option exists for SD-WAN on R82 and higher; refer to PMTR-88398.)

CCSM R77/R80/ELITE
israelfds95
MVP Gold

Thanks, and I agree: SIC should not rely on IPSec/SD-WAN tunnels terminating on the gateways, and that is not what I am referring to here. We also don't use ISP Redundancy, because once SIC is established over the underlay through the ISPs' Internet links, we will use SD-WAN for the VPN overlay and for specific local breakout.

My concern is specifically about direct SIC communication over the Internet, where the Management Server is exposed through NAT.

In this design, SIC is not using the Overlay VPN path. The point is that, even being certificate-based, SIC still depends on reachability to the Management Server through the configured/public NAT path.

The main question is: what are the best-practice designs to provide true management reachability redundancy for SIC in Internet-based deployments?

Lesley
MVP Gold

You can build some NAT rules if you don't want to work with ISP redundancy and the dynamic objects in the NAT rules.

Here is some reading:

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_SecurityManagement_AdminGuide/Cont...

https://sc1.checkpoint.com/documents/R82/WebAdminGuides/EN/CP_R82_SecurityManagement_AdminGuide/Cont...

https://support.checkpoint.com/results/sk/sk171055

-------
Please press "Accept as Solution" if my post solved it 🙂
Chris_Atkinson
MVP Platinum CHKP

Are you attempting to guard against transient / short term failures or something more significant?

Use a subnet not directly associated with the interface for the NAT.

If you have portable address space you can use dynamic routing to ensure the subnet is accessible via multiple ISP links.

Alternatively, if two or more of the links are from the same ISP, they might be able to delegate you addresses that achieve at least some level of redundancy across that subset of links from different POPs/POIs.

You might also opt to deploy an HA SMS with a separate NAT that you can promote to active as needed.

CCSM R77/R80/ELITE
israelfds95
MVP Gold

Hi @Lesley @Chris_Atkinson, the documentation helps clarify how SMS behind NAT works, but it also highlights an important architectural concern in real-world deployments.

While SIC is certificate-based and NAT is fully supported, all communication still relies on a single defined reachable IP (original or translated) for the Management Server.

In this specific scenario, the customer is not an autonomous system (AS) and does not own portable public address space. Instead, they have multiple Internet links from two different providers, with several public IP ranges at the headquarters — but all of them are provider-dependent.

Because of that, even though there are multiple links and IPs available, each public IP is still tied to a specific ISP. This means we cannot achieve true reachability redundancy for a single management IP across providers.

Additionally, the environment has some important constraints:

  • There is only a single on-premises SMS

  • There is no SMS High Availability

  • There is no Smart-1 Cloud or secondary management/log server available

From a design perspective, this significantly increases the dependency on a single management path via one Static NAT.

If the SMS is exposed through a single static NAT IP, and the link associated with that IP fails, all remote gateways will lose reachability to the Management Server.

Operationally, this leads to:

  • Loss of SIC communication (from a reachability standpoint)

  • Gateways appearing disconnected/alarmed in SmartConsole

  • Inability to install policy or properly send logs

  • Significant operational impact in large distributed environments

In practice, this becomes especially critical when dealing with many remote sites, where restoring SIC communication can be complex and time-consuming.

This is the problem I'm trying to solve: how does Check Point design for SIC resilience when an SMS behind NAT connects to multiple external gateways over the Internet?

Lesley
MVP Gold

Look at the $FWDIR/conf/masters file on the gateway; with some creativity you can build some redundancy there.

-------
Please press "Accept as Solution" if my post solved it 🙂
israelfds95
MVP Gold

In this scenario and topology, the issue is always on the SMS side: exposing the SMS to the Internet is typically done with a single static NAT, which becomes the main point of failure. When a gateway is added in SmartConsole for SIC establishment, a single IPv4 address is always defined for communication, so SIC is effectively built between one IP on the gateway side and a single NAT IP on the SMS side. The masters file does not address that lack of redundancy on the management side.

leonarit
Contributor

I have the same setup: I'm currently deploying 144 SG1535 gateways with dual WAN and fixed public IPs. DAIP usually solves this issue, but with DAIP I lose identity sharing, so I changed from DAIP to a standard gateway.

I built custom logic with Python and systemd on my SMS that does the following:

1 - Using the management API, the script fetches the gateway description (read from the object's comments field in the script below), which follows a specific format containing the public IPs of both WAN links of the gateway.

Description format: ID:SITEIDNAME;WANMON:ENABLED;WAN1IP:X.X.X.1;WAN2IP:X.X.X.2

2 - The script parses the description and checks whether WANMON is enabled; it then uses the WAN IPs for the next steps.
3 - The script checks whether SIC is initialized; if not, it skips the gateway.
4 - The script periodically tests both WAN links with a TCP probe on port 18191.
5 - It always treats WAN1 as primary. If WAN1 fails and WAN2 is up, it uses the management API to change the gateway's IP from WAN1 to WAN2; once WAN1 is back up, it changes the IP from WAN2 back to WAN1.
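
The parsing and WAN1-preferred selection in the steps above can be sketched in isolation as follows. This is a simplified illustration, not the script itself: the helper names `parse_description`, `valid_ip`, and `choose_ip` are hypothetical, the addresses are documentation ranges, and the link states are passed in as booleans instead of being probed.

```python
import ipaddress

def parse_description(text):
    """Parse 'KEY:VALUE;KEY:VALUE' pairs into a dict with upper-case keys."""
    fields = {}
    for part in text.split(";"):
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip().upper()] = value.strip()
    return fields

def valid_ip(value):
    """Return the value as a string if it is a valid IPv4 address, else None."""
    if not value:
        return None
    try:
        return str(ipaddress.IPv4Address(value))
    except ValueError:
        return None

def choose_ip(wan1, wan2, wan1_up, wan2_up):
    """Prefer WAN1 when it is up, fall back to WAN2, else None (change nothing)."""
    if wan1 and wan1_up:
        return wan1
    if wan2 and wan2_up:
        return wan2
    return None

# Example with placeholder values: WAN1 down, WAN2 up.
fields = parse_description(
    "ID:SITE01;WANMON:ENABLED;WAN1IP:198.51.100.1;WAN2IP:203.0.113.1"
)
if fields.get("WANMON", "").upper() == "ENABLED":
    wan1 = valid_ip(fields.get("WAN1IP"))
    wan2 = valid_ip(fields.get("WAN2IP"))
    print(choose_ip(wan1, wan2, wan1_up=False, wan2_up=True))  # prints 203.0.113.1
```

Validating the addresses with `ipaddress` before use means a malformed description is skipped rather than pushed into the gateway object.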

The script saves a log in the filesystem and rotates the log file when it reaches 50MB.
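
As an alternative to hand-written rotation, the same 50MB limit could be sketched with Python's standard `logging.handlers.RotatingFileHandler`. The logger name, the `build_logger` helper, and the backup count are assumptions for illustration. Unlike the script in this post, this handler renames old files (`.1`, `.2`, ...) and does not gzip them.

```python
import logging
from logging.handlers import RotatingFileHandler

def build_logger(path, max_bytes=50 * 1024 * 1024, backups=5):
    """Return a logger that writes to `path` and rotates at `max_bytes`."""
    logger = logging.getLogger("gw_monitor_sketch")
    logger.setLevel(logging.INFO)
    # Drop handlers from a previous call so lines are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("[%(asctime)s] %(message)s", "%Y-%m-%d %H:%M:%S")
    )
    logger.addHandler(handler)
    return logger
```

Usage would look like `log = build_logger("/path/to/gw_monitor.log")` followed by `log.info("...")`; the path here is a placeholder.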

Systemd service

[Unit]
Description=Check Point Gateway WAN Monitor
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash -l -c "exec /var/log/gw_monitor/gw_monitor.py"
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

 

Python script

#!/usr/bin/env python3
import os
import json
import time
import socket
import re
import concurrent.futures
import subprocess
import gzip
import shutil
import datetime

# --- Variables ---
VERSION = "1.1.0"
EXEC_DIR = os.path.dirname(os.path.abspath(__file__))
OUTPUT_DIR = os.path.join(EXEC_DIR, "data", "output")
LOG_FILE = os.path.join(OUTPUT_DIR, "gw_monitor.log")
SESSION_FILE = os.path.join(OUTPUT_DIR, "session.txt")
MAX_LOG_SIZE = 50 * 1024 * 1024  # 50MB

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Regex to parse the comments field
REGEX_WANMON = re.compile(r"WANMON:\s*ENABLED", re.IGNORECASE)
REGEX_WAN1 = re.compile(r"WAN1IP:\s*([0-9\.]+)", re.IGNORECASE)
REGEX_WAN2 = re.compile(r"WAN2IP:\s*([0-9\.]+)", re.IGNORECASE)

def log_msg(msg):
    """Log messages to both console and a log file with rotation and compression."""
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
    formatted = f"[{timestamp}] {msg}"
    print(formatted)
    
    # Check if rotation is needed
    try:
        if os.path.exists(LOG_FILE) and os.path.getsize(LOG_FILE) > MAX_LOG_SIZE:
            rotate_logs()
    except Exception as e:
        print(f"Error checking log size: {e}")

    with open(LOG_FILE, "a") as f:
        f.write(formatted + "\n")

def rotate_logs():
    """Compress the current log file and rename it using the timestamp of the oldest log found within."""
    if not os.path.exists(LOG_FILE):
        return

    try:
        # Default timestamp (current) in case file is empty or unparseable
        oldest_ts = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
        
        # Read the first line to find the oldest timestamp
        with open(LOG_FILE, 'r') as f:
            first_line = f.readline()
            match = re.search(r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]', first_line)
            if match:
                # Format to be filename friendly
                oldest_ts = match.group(1).replace(" ", "_").replace(":", "-")
        
        # Construction of the archive filename
        # Strip only the trailing extension; str.replace would also hit ".log" elsewhere in the path
        archive_name = os.path.splitext(LOG_FILE)[0] + f"_{oldest_ts}.log.gz"
        
        # Compress the file
        with open(LOG_FILE, 'rb') as f_in:
            with gzip.open(archive_name, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        
        # Clear the original log file
        with open(LOG_FILE, 'w') as f:
            f.truncate()
            
        print(f"[Log Rotated: {archive_name}]")
    except Exception as e:
        print(f"CRITICAL: Failed to rotate logs: {e}")

def check_tcp_port(gw_dict, port=18191, timeout=3, retries=3, delay=2):
    """
    Checks the active IP for a single gateway with robust retry logic.
    Returns the gateway dictionary and the reachable IP (or None if down).
    """
    wan1 = gw_dict.get("wan1_ip")
    wan2 = gw_dict.get("wan2_ip")
    
    def is_reachable(ip):
        if not ip:
            return False
        for attempt in range(retries):
            try:
                with socket.create_connection((ip, port), timeout=timeout):
                    return True
            except OSError:  # covers timeouts, refused connections and unreachable hosts
                if attempt < retries - 1:
                    time.sleep(delay)
        return False

    if is_reachable(wan1):
        return gw_dict, wan1
            
    if is_reachable(wan2):
        return gw_dict, wan2
            
    return gw_dict, None

def run_mgmt_cli(command_list):
    """Run a mgmt_cli command using subprocess and return status code and parsed JSON."""
    try:
        full_cmd = ["mgmt_cli"] + command_list + ["-s", SESSION_FILE, "-f", "json"]
        result = subprocess.run(full_cmd, capture_output=True, text=True)
        
        if result.returncode == 0:
            return 200, json.loads(result.stdout)
        else:
            try:
                # mgmt_cli often outputs valid JSON errors when -f json is used
                return result.returncode, json.loads(result.stdout)
            except json.JSONDecodeError:
                return result.returncode, {"error": result.stderr or result.stdout}
    except Exception as e:
        return -1, {"error": str(e)}

def run_sic_check(gw):
    """Check SIC status for a gateway object."""
    status, res = run_mgmt_cli(["test-sic-status", "name", gw["name"]])
    if status == 200:
        return gw, res.get("sic-status", "unknown")
    return gw, "error"

def discover_gateways():
    """Query SMS to find gateways with WANMON:ENABLED and verified SIC."""
    all_candidates = []
    offset = 0
    limit = 500
    
    while True:
        cmd = ["show", "gateways-and-servers", "limit", str(limit), "offset", str(offset), "details-level", "full"]
        status, res = run_mgmt_cli(cmd)
        
        if status != 200:
            log_msg(f"ERROR: Failed to discover gateways: {json.dumps(res)}")
            break
            
        objects = res.get("objects", [])
        if not objects:
            break
            
        for obj in objects:
            if obj.get("type") not in ("simple-gateway", "CpmiGatewayPlain"):
                continue
                
            comments = obj.get("comments", "")
            if comments and REGEX_WANMON.search(comments):
                w1_match = REGEX_WAN1.search(comments)
                w2_match = REGEX_WAN2.search(comments)
                
                gw = {
                    "uid": obj.get("uid"),
                    "name": obj.get("name"),
                    "current_ip": obj.get("ipv4-address"),
                    "wan1_ip": w1_match.group(1) if w1_match else None,
                    "wan2_ip": w2_match.group(1) if w2_match else None
                }
                
                if gw["wan1_ip"]:
                    all_candidates.append(gw)
                else:
                    log_msg(f"WARNING: Gateway '{gw['name']}' has WANMON:ENABLED but no WAN1IP found in comments.")
                    
        total = res.get("total", 0)
        offset += limit
        if offset >= total:
            break
            
    if not all_candidates:
        return []

    log_msg(f"Performing SIC verification for {len(all_candidates)} candidates...")
    monitored_gateways = []
    # Use parallel threads to speed up SIC checks
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_gw = {executor.submit(run_sic_check, c): c for c in all_candidates}
        for future in concurrent.futures.as_completed(future_to_gw):
            gw, sic_status = future.result()
            
            # We allow 'communicating' and 'not communicating'. 
            # 'not communicating' often means the link is down, which is exactly when we want to failover.
            # But 'uninitialized' means the gateway isn't ready at all.
            if sic_status in ["communicating", "not communicating", "not-communicating"]:
                monitored_gateways.append(gw)
            else:
                log_msg(f"INFO: Gateway '{gw['name']}' skipped (SIC: {sic_status}).")
                
    return monitored_gateways

def main():
    log_msg("--- Gateways Monitor Started ---")
    
    # 1. Login as root (-r true) and store the session for the later mgmt_cli calls
    login_cmd = [
        "mgmt_cli", "login", "-r", "true", 
        "session-name", "WAN Monitor Failover System",
        "session-description", "Automated IP route updates"
    ]
    login_result = subprocess.run(login_cmd, capture_output=True, text=True)
    
    if login_result.returncode != 0:
        log_msg(f"ERROR: mgmt_cli login failed: {login_result.stderr or login_result.stdout}")
        return
        
    # Write the standard text stdout directly to the session file
    # `mgmt_cli` expects this specific multi-line format to parse the session correctly!
    with open(SESSION_FILE, "w") as f:
        f.write(login_result.stdout)
        
    log_msg("Login successful. Root session established.")
    
    try:
        # 2. Dynamic Discovery
        log_msg("Discovering gateways with WANMON:ENABLED...")
        gateways = discover_gateways()
        
        if not gateways:
            log_msg("No gateways found with WANMON:ENABLED in comments. Exiting.")
            return
            
        log_msg(f"Found {len(gateways)} gateways to monitor.")
        
        # 3. Concurrent TCP Reachability Check
        log_msg("Starting concurrent reachability checks...")
        to_update = []
        
        with concurrent.futures.ThreadPoolExecutor(max_workers=150) as executor:
            future_to_gw = {executor.submit(check_tcp_port, gw): gw for gw in gateways}
            
            for future in concurrent.futures.as_completed(future_to_gw):
                gw, active_ip = future.result()
                name = gw["name"]
                
                if not active_ip:
                    log_msg(f"[{name}] UNREACHABLE on both WAN1IP and WAN2IP.")
                    continue
                    
                if active_ip == gw["current_ip"]:
                    pass # Already matching
                else:
                    log_msg(f"[{name}] IP change detected: {gw['current_ip']} -> {active_ip}")
                    gw["new_ip"] = active_ip
                    to_update.append(gw)
        
        # 4. Check and Update Gateways via mgmt_cli
        if to_update:
            log_msg(f"Applying {len(to_update)} updates to Management Server...")
            changes_made = False
            for gw in to_update:
                set_cmd = [
                    "set", "simple-gateway", 
                    "name", gw["name"], 
                    "ipv4-address", gw["new_ip"], 
                    "ignore-warnings", "true"
                ]
                set_status, set_res = run_mgmt_cli(set_cmd)
                if set_status == 200:
                    log_msg(f"Successfully updated '{gw['name']}'.")
                    changes_made = True
                else:
                    log_msg(f"ERROR: Failed to update '{gw['name']}': {json.dumps(set_res)}")
            
            # 5. Publish Changes
            if changes_made:
                log_msg("Publishing changes to SMS database...")
                pub_status, pub_res = run_mgmt_cli(["publish"])
                if pub_status == 200:
                    task_id = pub_res.get("task-id")
                    log_msg(f"Publish task started (Task ID: {task_id}). Changes will be active shortly.")
                else:
                    log_msg(f"ERROR: Publish failed: {json.dumps(pub_res)}")
        else:
            log_msg("No updates required. All gateways on active IP.")
            
    finally:
        # 6. Logout
        log_msg("Logging out from Check Point API...")
        subprocess.run(["mgmt_cli", "logout", "-s", SESSION_FILE, "-f", "json"], capture_output=True)
        # Clean up the session file
        if os.path.exists(SESSION_FILE):
            os.remove(SESSION_FILE)
        log_msg("--- Gateways Monitor Finished ---\n")

if __name__ == "__main__":
    log_msg(f"Starting Gateway Monitor v{VERSION} - Service Mode (systemd).")
    while True:
        try:
            main()
        except Exception as e:
            log_msg(f"FATAL ERROR in main loop: {e}")
        time.sleep(60)

 

 
