The best a single connection can do here is to be handled on a single CPU for inspection, and perhaps another one for encryption/decryption (see sk118097: MultiCore Support for IPsec VPN). HyperFlow only helps with certain types of Threat Prevention inspection and does not truly spread all inspection duties across multiple cores for an elephant flow.
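If you want to confirm how the gateway sees one of these connections today, a couple of read-only checks from expert mode will show the CoreXL core layout and any flows already flagged as "heavy"; this is a minimal sketch assuming an R80.20+ gateway where these commands are available:

    # Show which cores are SNDs and which are CoreXL firewall worker instances
    fw ctl affinity -l -r

    # List connections the gateway has flagged as heavy (elephant flows),
    # including the firewall instance currently handling each one
    fw ctl multik print_heavy_conn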
The keys to maximum performance for this traffic inside a VPN tunnel will be:
1) Get the connection into the fastpath, so that all inspection and encryption/decryption is handled completely in the fastpath on a single SND. While one of these connections is actively up and running, run fw tab -t connections -z. If you don't see the connection there at all, it is not in the slowpath and is therefore eligible to be forced into the fastpath via fast_accel: sk156672: SecureXL Fast Accelerator (fw fast_accel) for R80.20 and above. A quick sketch of the commands appears after this list.
2) If both of your firewalls support the AES-NI processor extension (they almost certainly do; use fw ctl get int AESNI_is_supported to check, or see the sketch after this list), use the AES-GCM-128 variant of AES for IKE Phase 2/IPsec. GCM combines the encryption and hashing into a single operation that AES-NI can accelerate 4-10X compared to running it on the main CPU in software. If the processor architecture does not support AES-NI (unlikely), AES-128 will be slightly more efficient. GCM is not supported for IKE/Phase 1 until R82, but the vast majority of traffic sent through a VPN tunnel rides inside the IPsec/Phase 2 tunnel, so using GCM in IKE/Phase 1 won't make much of a difference performance-wise.
3) Leaving PFS disabled avoids an expensive Diffie-Hellman calculation every time the Phase 2 tunnel expires, but by default that only happens once every 60 minutes, so it won't make much of a difference either way.
4) It is assumed that the MTU is 1500 between all of these systems, so you won't have to deal with fragmentation or TCP MSS clamping; if not, get that fixed ASAP. It is also assumed the network is clean and that netstat -ni shows no RX-ERR/OVR/DRP on the relevant interfaces (a quick check is sketched after this list).
5) One final thing to try is disabling SMT/Hyperthreading, so that each SND instance is assigned a "full" physical core rather than sharing one with another instance. This might get you another 20-30% boost under high load (see the sketch after this list for checking the current SMT state).
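For step 1, a minimal sketch of checking the current path breakdown and then forcing the flow into the fastpath with fast_accel, assuming an R80.20+ gateway; the addresses, port and protocol number are placeholders you would replace with your real elephant-flow endpoints (see sk156672 for the full syntax):

    # How much traffic is currently in the accelerated, medium (PXL) and slow (F2F) paths
    fwaccel stats -s

    # Enable the SecureXL Fast Accelerator feature
    fw ctl fast_accel enable

    # Force traffic matching source, destination, destination port and protocol into the fastpath
    # (10.1.1.10, 10.2.2.20, 443 and 6/TCP are placeholders)
    fw ctl fast_accel add 10.1.1.10 10.2.2.20 443 6

    # Verify the rule is present
    fw ctl fast_accel show_table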
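For step 2, a quick way to verify AES-NI support from expert mode, using the kernel parameter mentioned above plus a generic Linux cross-check of the CPU flags:

    # Check Point's kernel parameter referenced in step 2
    fw ctl get int AESNI_is_supported

    # Generic cross-check: the 'aes' CPU flag indicates AES-NI capable hardware
    grep -m1 -ow aes /proc/cpuinfo || echo "AES-NI flag not found"

The algorithm change itself is made in SmartConsole in the VPN Community object's Encryption settings, followed by policy installation.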
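For step 4, two quick sanity checks; the peer address 203.0.113.1 is a placeholder, and 1472 bytes of ICMP payload plus 28 bytes of IP/ICMP headers equals a 1500-byte packet:

    # RX-ERR / RX-DRP / RX-OVR should be zero, or at least not climbing
    netstat -ni

    # Confirm a full 1500-byte path MTU toward the peer with the Don't Fragment bit set
    ping -M do -s 1472 -c 4 203.0.113.1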
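For step 5, SMT/Hyperthreading is toggled from cpconfig on appliances that support it (a reboot is required), but you can confirm its current state first; lscpu is a standard Linux tool available in expert mode:

    # "Thread(s) per core: 2" means SMT is currently on; 1 means it is off
    lscpu | grep -i "thread(s) per core"

    # Logical CPU count; this should drop by half after SMT is disabled
    grep -c ^processor /proc/cpuinfo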
Once all of the above is in place, when one of these heavy connections is running in the tunnel you should see one of your SNDs climb to near 100% CPU utilization; that performance level is pretty much all you are going to get unless you move to a bigger box with faster individual CPUs. If the connection seems to be topping out while the SND is not near 100%, something else may be bottlenecking it that could be rectified to pick up some more speed. Multi-Queue might be able to spread the processing across multiple SNDs automatically, but for a single connection I'd say that is pretty unlikely; make sure MQ is enabled on the relevant interfaces anyway.
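To see whether you have actually hit that single-SND ceiling, watch per-core utilization while the big transfer is running. top and cpview are both available in expert mode; the Multi-Queue status command below is an assumption that applies to R80.30 and later where mq_mng is present (exact command names vary by version):

    # Live per-core utilization; press 1 inside top for the per-CPU breakdown,
    # or use cpview and navigate to the CPU section instead
    top

    # Confirm Multi-Queue is active on the relevant interfaces (assumed R80.30+ syntax)
    mq_mng --show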
I doubt GRE would be any faster than 100% fastpath handling of the IPsec tunnel, even though GRE avoids the encryption/decryption overhead.
Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com