|Elephant Flow (Heavy Connections)|
In computer networking, an elephant flow (heavy connection) is an extremely large continuous flow, measured in total bytes, set up by TCP or another protocol over a network link. Elephant flows, though not numerous, can occupy a disproportionate share of the total bandwidth over a period of time. The term goes back to the observation that a small number of flows carry the majority of Internet traffic, while the remainder consists of a large number of flows that carry very little traffic (mice flows).
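To make the elephant/mice split concrete, here is a minimal Python sketch. The flow names and byte counts are invented for illustration and are not measurements from a gateway; `top_flow_share` is a hypothetical helper, not a Check Point tool.

```python
def top_flow_share(flow_bytes, top_n=1):
    """Return the fraction of total bytes carried by the top_n largest flows."""
    total = sum(flow_bytes.values())
    if total == 0:
        return 0.0
    largest = sorted(flow_bytes.values(), reverse=True)[:top_n]
    return sum(largest) / total


# Hypothetical sample: one backup stream (elephant) and 1000 small flows (mice).
flows = {"backup": 9_000_000_000}
flows.update({f"web-{i}": 1_000_000 for i in range(1000)})

share = top_flow_share(flows, top_n=1)
print(f"{len(flows)} flows, the largest carries {share:.0%} of all bytes")
# prints: 1001 flows, the largest carries 90% of all bytes
```

In this made-up sample a single connection out of 1001 carries nine tenths of the bytes, which is the pattern the term "elephant flow" describes.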
All packets associated with an elephant flow must be handled by the same firewall worker core (CoreXL instance). The firewall can drop packets when the CPU cores it runs on are fully utilized, and this packet loss can occur regardless of the connection's type.
What typically produces heavy connections:
|Evaluation of heavy connections|
The big question is: how do you find elephant flows on an R80 gateway?
Evaluation of heavy connections (elephant flows)
A first indication is a high CPU load on one core while all other cores show a normal CPU load. This is easy to see with "top". So now one core is at 100% CPU usage. What can we do? sk105762 describes how to activate "Firewall Priority Queues". This feature allows the administrator to monitor the heavy connections that consume the most CPU resources without interrupting the normal operation of the Firewall. After enabling this feature, the relevant information is available in the CPView Utility. The system saves heavy connection data for the last 24 hours, and CPDiag has a matching collector that uploads this data for diagnosis purposes.
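The "one hot core, all others normal" pattern can be sketched as a small check. This is only an illustration: the per-core percentages are typed in by hand (for example, read off "top"), the thresholds of 95% and 60% are arbitrary assumptions, and `find_hot_cores` is a hypothetical helper, not part of any Check Point tool.

```python
def find_hot_cores(core_load, hot=95.0, normal=60.0):
    """Return the cores at or above `hot` percent, but only when every
    other core is below `normal` percent (the elephant flow pattern)."""
    hot_cores = [core for core, load in core_load.items() if load >= hot]
    others = [load for core, load in core_load.items() if core not in hot_cores]
    if hot_cores and others and all(load < normal for load in others):
        return hot_cores
    return []


# Hypothetical readings: core 1 is saturated, the rest are quiet.
print(find_hot_cores({0: 12.0, 1: 100.0, 2: 18.0, 3: 9.0}))    # [1]
# Evenly loaded gateway: no single core stands out, so nothing is flagged.
print(find_hot_cores({0: 97.0, 1: 98.0, 2: 99.0, 3: 96.0}))    # []
```

A flagged core is only a hint that a heavy connection may be pinned to that instance. The commands that follow are what actually identify the connection.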
Heavy connection flow system definition on Check Point gateways:
Enable the monitoring of heavy connections.
To enable the monitoring of heavy connections that consume high CPU resources:
# fw ctl multik prioq 1
Finding heavy connections on the gateway with "print_heavy_conn"
On the system itself, heavy connection data is accessible using the command:
# fw ctl multik print_heavy_conn
Finding heavy connections on the gateway with cpview
# cpview
Then navigate to: CPU > Top-Connection > InstancesX
Thank you for all the interesting articles about Performance Tuning you wrote.
You could write a book out of this link collection 😀.
This article has helped me very well.
I followed the steps and actually found a database backup connection. The connection caused about 70% CPU load on one core. We have now limited the bandwidth of the connection via QoS.
So glad you asked this question. 🙂
I will be speaking at CPX New Orleans and Vienna on the CheckMates track with a presentation called "Big Game Hunting: Elephant Flows" that will go through how to track down elephant flows (a.k.a. heavy connections), all the different remediation options, and the pros and cons of each. PhoneBoy will be delivering this presentation for me at CPX Bangkok because I'll be very busy that week, with, uh, something else...
This is an interesting approach to detecting heavy connections. I tried it after reading this article and was able to identify some systems that were causing problems. We have now created QoS rules to limit the bandwidth. That worked well.
Priority Queues must be in mode 1 (Evaluator-only) to use that command; mode 1 is the default on a firewall that does not have USFW enabled. I'll be speaking about this very topic in detail at CPX New Orleans and Vienna.
Support for fw ctl multik print_heavy_conn was added in R80.20; I doubt it can be backported into earlier releases since I'm pretty sure it relies on the major changes introduced to SecureXL in R80.20.
Could someone explain why the firewall was moved from kernel space to user space by default? What is the benefit, other than being able to allocate more memory when you have more cores? What is impacted, and what is behind this change? Thanks
That was discussed here in several posts, I think.
In a nutshell, with more than 48 cores, kernel mode cannot utilise them all. To allow CoreXL to use more cores on high-performance boxes, User Mode is the only option. Plus, user mode adds stability: if an FWK instance crashes, it does not affect the whole machine.
VSX has been running User Mode FWK instances for ages, actually.
In "Kernel Mode Firewall" (KMFW), the maximum number of running cores is limited to 40 because of the Linux/Intel limitation of 2GB kernel memory, and because the CoreXL architecture needs to load a large driver (~42MB) dozens of times (depending on the number of CPUs, up to 40 times). Newer platforms that contain more than 40 cores, e.g. the 23900 or open servers, are therefore not fully utilized.
The solution to this problem is to run the firewall in the user mode of the Linux operating system.
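The memory arithmetic behind the 40-instance limit can be worked through in a few lines of Python. The figures are taken from the text above; the only assumption added here is that 2GB means 2048 MB, and the driver size is approximate ("~42MB"), so treat the result as a rough estimate.

```python
DRIVER_MB = 42           # approximate size of one driver copy ("~42MB")
KERNEL_LIMIT_MB = 2048   # the 2GB kernel memory limit, assumed to be 2048 MB
MAX_INSTANCES = 40       # the KMFW limit stated in the text

driver_total_mb = DRIVER_MB * MAX_INSTANCES
share = driver_total_mb / KERNEL_LIMIT_MB
print(f"{MAX_INSTANCES} copies x {DRIVER_MB} MB = {driver_total_mb} MB "
      f"({share:.0%} of {KERNEL_LIMIT_MB} MB)")
# prints: 40 copies x 42 MB = 1680 MB (82% of 2048 MB)
```

With 40 driver copies loaded, roughly four fifths of the 2GB kernel memory is already taken by the driver alone, which illustrates why kernel mode does not scale to more instances.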
| Gaia version / kernel / cores | Firewall mode | Checked on |
|---|---|---|
| R80.30, kernel 3.10, more than 35* cores | UMFW is enabled | HP DL 380 G10, 2 x Platinum 8180M processors with 28 cores each = 56 cores |
| R80.30, kernel 3.10, fewer than 35* cores | KMFW is enabled | HP DL 380 G10, 1 x Platinum 8180M processor with 28 cores |
| R80.30, kernel 2.6 | KMFW is enabled | VMware with 30 cores and with 46 cores |
| R80.40 (default 3.10 kernel) | UMFW is enabled by default | VMware with 4 cores |
To make sure that UMFW is activated, run the following command:
# cpprod_util FwIsUsermode
1 = User Mode Firewall
0 = Kernel Mode Firewall
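If you collect this value from several gateways, a tiny helper can turn the raw output into a readable label. This is a sketch only: `describe_fw_mode` is a hypothetical function, and it assumes the command prints nothing but the single digit described above.

```python
def describe_fw_mode(output):
    """Map the output of `cpprod_util FwIsUsermode` to a readable mode name."""
    value = output.strip()
    if value == "1":
        return "User Mode Firewall (UMFW)"
    if value == "0":
        return "Kernel Mode Firewall (KMFW)"
    # Anything else is unexpected, so fail loudly instead of guessing.
    raise ValueError(f"unexpected output: {output!r}")


print(describe_fw_mode("1\n"))  # User Mode Firewall (UMFW)
print(describe_fw_mode("0\n"))  # Kernel Mode Firewall (KMFW)
```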
For more information or to change the mode, read more in my article here:
Kernel mode - faster, direct access to hardware but in case of crash everything goes down
User mode - slower, limited access to hardware but in case of crash only app crashes
Also, writing and maintaining code in kernel mode is often a pure nightmare compared to user mode. With current hardware, performance really does not suffer that much if you do it well in user mode.
True, but there are much better tools for detection and remediation of elephant flows when in kernel mode. With USFW enabled, the detection and remediation tools for elephant flows are quite limited, but based on a recent conversation I learned that Check Point is working on closing that capability gap as we speak. My CPX 2020 presentation summarizes all this here:
Also the Solution Center has a new feature available that allows the processing of a single elephant flow to be spread across multiple Firewall Worker instances, but this capability is not mainlined yet. This feature was alluded to at the end of my CPX presentation above.
Is there any way to detect elephant flows in fast path in R77.20 or earlier?
I have made the following summary from reading your posts, but I am missing how to capture elephant flows in the fast path in R77.20 or earlier.
Is this summary below correct? Am I missing anything?
- In R77.20 or earlier, you can detect elephant flows with:
* F2F traffic: with /proc/cpkstats/fw_worker_x_stats with or without cpview
* Any traffic: enabling accounting in a number of rules and looking at smartlog.
- Between R77.30 and R80.40:
* you can still use the above options
* Any traffic: priority queues and connection load tracking - cpview and smartlog
"fw ctl multik prioq 1"
- Between R80.20 Take 47 and R83.X
* you can still use all the above
* Any traffic: there is a new elephant flow detection mechanism for kernel mode
"fw ctl multik print_heavy_conn"
I don't have an R77.20 gateway handy to test, but if the elephant flows are in the fastpath, the fw_worker stats will not show them.
Accounting is supported directly by SecureXL/fastpath and should work.
I don't think the fwaccel conns command will help much for finding elephant flows in the fastpath but give it a shot. To my knowledge there are no direct elephant flow detection mechanisms in R77.20.
I can't remember if cpview has these screens and whether they will show elephant flows in the fastpath in R77.20, but look for these screens in cpview:
You can also try using the CPMonitor (sk103212: Traffic analysis using the 'CPMonitor' tool) and connstat (sk85780: How to use the 'connstat' utility) tools as described in my CPX 2020 presentation here: