Sean_Van_Loon
Contributor

Elephant flow/stream

Dear all,

 

I've heard second-hand that the elephant stream/flow problem would be resolved in the new 3.10 kernel.

However, I couldn't find any information to support this claim.

AFAIK, solving this would require a major architectural change.

Does anybody know something more?

And if it's not fixed yet, is there an ETA?

 

Thanks in advance!

 

Kind regards,

 

Sean

5 Replies
Chris_Atkinson
Employee

sk156672 is likely planned for a future R80.30 JHF and will help in some scenarios.

 

I'm sure others will also share anything relevant further to the above.

 

Cheers,

Chris

CCSM R77/R80/ELITE
Timothy_Hall
Champion

Check Point calls elephant flows "heavy connections" and introduced many new tools and techniques to help deal with them in R80.20; if you search for "heavy connections" in SecureKnowledge you can read about them. I don't think the Gaia kernel version (3.10 vs. 2.6.18) will make a difference in this particular area (unless there are some new NIC offload capabilities or something), but I could be wrong.

All vendors have issues with elephant flows due to a more or less immutable rule:  All the packets of a single connection (TCP or UDP) can only be handled by one core (whether it is a SND/IRQ or Firewall Worker), so in the case of an elephant flow all those packets will hit a single core and cannot be spread around to different cores to relieve the load. Connections already using that saturated Firewall worker are trapped on the same core with the elephant flow and cannot be shed to another core (this may change with USFW though), but new connections will "run away" from the saturated core via the Dynamic Dispatcher. 
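As a rough illustration of that rule, here is a toy dispatcher that hashes a connection's 5-tuple to pick a core. This is not Check Point's actual dispatcher algorithm; the function name and the hash are invented purely to show why every packet of one flow lands on the same core:

```python
import hashlib

NUM_CORES = 4

def core_for_packet(src_ip, dst_ip, src_port, dst_port, proto):
    """Map a packet to a core by hashing its 5-tuple (RSS-style).

    The hash input is identical for every packet of the same
    connection, so the whole flow is pinned to one core regardless
    of its volume.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}/{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_CORES

# Every packet of one elephant flow hits the same core:
flow = ("10.0.0.5", "10.0.0.9", 51515, 443, "tcp")
cores = {core_for_packet(*flow) for _ in range(1000)}
print(cores)  # a single core index, no matter how many packets arrive
```

Spreading the load would require changing the hash input per packet, which is exactly what breaks in-order delivery, as described below.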

Trying to spread the packets of a single elephant flow across multiple cores raises the specter of out-of-order delivery. If that happens, TCP will sharply curtail its send rate due to the congestion it perceives in the network path, which is an unmitigated disaster from a performance perspective.

Should this single-core limitation per connection ever be lifted, there would need to be some kind of reordering mechanism on the firewall to ensure packets are delivered in order. However, doing so would bottleneck a connection's packets down to the speed of the slowest or most congested core handling that connection, as packets that have already been inspected and are ready for transmission get held/queued waiting for earlier packets to clear inspection on that core.
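A minimal sketch of such a reordering mechanism, showing the head-of-line blocking described above (illustrative only; the `reorder` function and the sequence-number scheme are invented for this example):

```python
import heapq

def reorder(completion_order):
    """Release packets strictly in sequence order.

    `completion_order` lists sequence numbers in the order their
    inspection finished across cores.  A packet that finished early
    must sit in the heap until every earlier sequence number has
    cleared the slowest core.
    """
    heap, next_seq, released = [], 0, []
    for seq in completion_order:
        heapq.heappush(heap, seq)
        # Drain the heap only while the next expected packet is ready.
        while heap and heap[0] == next_seq:
            released.append(heapq.heappop(heap))
            next_seq += 1
    return released

# Packet 0 is stuck on a congested core; packets 1-4 finished first
# but are all held until 0 clears inspection:
print(reorder([1, 2, 3, 4, 0]))  # [0, 1, 2, 3, 4]
```

Note that nothing is released until the straggler arrives, which is precisely why the whole flow degrades to the speed of the slowest core.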

By default, starting in R80.20, Check Point tried to reorder UDP packets being held by the firewall (simi_reorder_hold_udp_on_f2v) and would simply drop the entire contents of the queue if it overflowed. This caused problems at several sites after they upgraded and is a cautionary tale about attempting something like this. You can read about it here:

https://community.checkpoint.com/t5/Enterprise-Appliances-and-Gaia/Message-seen-on-var-log-messages-...

This is a very tough problem and not specific only to Check Point.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Sean_Van_Loon
Contributor

Thank you for your elaborate explanation, Timothy!
I'm curious as well if R&D has any feedback regarding this.
idants
Employee

Hi,

 

As Timothy mentioned, elephant flow is indeed a tough problem.

Linux kernel 3.10 won't solve this problem.

Fast Accel is a possible solution when we are dealing with trusted connections (from the administrator's PoV).
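For reference, the Fast Accel CLI described in sk156672 looks roughly like the following. The exact syntax is from memory and may differ by version or JHF level, so verify against the SK before use:

```shell
# Enable the SecureXL Fast Accelerator feature (sk156672)
fw ctl fast_accel enable

# Offload a trusted elephant flow past deep inspection
# (example rule: source network, destination host, dest port, protocol 6 = TCP)
fw ctl fast_accel add 10.1.1.0/24 10.2.2.10 443 6

# List the currently configured fast_accel rules
fw ctl fast_accel show_table
```

Because accelerated connections bypass most inspection, this should only be applied to connections the administrator explicitly trusts, as noted above.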

 

We are currently designing a long-term solution for elephant connections.

If you have a customer who wants to try it, please contact me by email.

 

Thanks,

Idan Tsarfati.

 

 

Duane_Toler
Advisor

Is this something that can be addressed with ECN (ToS and TCP flags)? Could the fw worker itself set ECN on connections that have been identified as heavy, and additionally apply WRED-like behavior, assuming a kernel-based queueing mechanism were in place? Could this also be addressed with marking from Floodgate?

Yes, this can cause delays for some flows, but if a flow has already been identified as a 'heavy connection', then it should get differentiated treatment due to its specific behavior. This is what the AF PHB is designed to address (RFC 2597), and long TCP flows would be assigned AF1 as "bulk data" anyway (RFC 4594).

Likewise, on upstream/downstream network devices outside the firewall, applying WRED and queueing should be able to help manage traffic flows as well.  This may not be entirely a solved-by-the-firewall issue.   If FG1 would allow an upstream/downstream device to mark the flows, and *trust them* (with a default no-trust, but allow a config option to trust incoming DSCP), then a lot of this could be solved elsewhere in the network. 
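For illustration, the classic WRED drop curve alluded to above can be sketched as follows (parameter names are generic textbook ones, not from any Check Point or vendor feature):

```python
def wred_drop_probability(avg_queue, min_th, max_th, max_p):
    """Classic WRED curve: no drops below min_th, a linear ramp up
    to max_p between min_th and max_th, and forced drop above max_th.

    AF-style differentiation (RFC 2597) comes from giving each drop
    precedence its own (min_th, max_th, max_p) profile, so a heavy
    connection marked AF1 sees earlier, more aggressive drops than
    latency-sensitive traffic sharing the queue.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# Halfway up the ramp of an aggressive profile for bulk traffic:
print(wred_drop_probability(30, min_th=20, max_th=40, max_p=0.1))  # 0.05
```

The early, probabilistic drops nudge the elephant flow's TCP sender to back off gradually instead of stalling everything behind it in the queue.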

 

