I'll try to share some information that might be helpful.
The fwd process runs on both the gateway and the log server / CLM. On the gateway side it is responsible for formatting the logs and sending them over the network. At high log rates this can be quite compute intensive, and since the process is single-threaded, it can become a bottleneck once it saturates one CPU core. In light of growing demand for more logs and more powerful gateways, we are working on a project that will allow multi-process fwd. Instead of making fwd multi-threaded, we will run multiple instances of it, which should greatly increase the maximum log throughput. This project will probably be targeted for R81.30, but we might have something earlier if someone is interested in an alpha.
On the log server side, fwd is responsible for accepting the logs and writing them into log files. On this end, fwd is actually simpler and less compute intensive. Most performance issues that we see on the log server are related to indexing and querying, since those operations have to "crunch" a lot of data. R81 brought significant performance improvements to both indexing and queries with the updated Solr indexing engine. R81.10 brought another enhancement that lets you distribute logs from a single gateway to multiple log servers / CLMs. This greatly helps with the scenario of a very powerful gateway with a lot of logging that exceeds the indexing capacity of a single log server. It also handles redundancy in a much more elegant way.
In your case, it indeed looks like fwd is maxing out on CPU. Beyond a stronger CPU or splitting to multiple log servers, I can suggest a few more things to check:
1) Since you are running in a virtualized environment, verify that CPU and memory resources are reserved for your MLM and not shared with other VMs.
2) Check whether disk I/O on the log server is high (for example, a high I/O wait percentage).
3) Check if you have many logs of type "Alert". You can check the rulebase "Track" definition, or run a query on the logs. Handling alerts in fwd adds extra processing, and in large numbers they might increase the load on it.
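
For check 2, here is a rough sketch of one way to estimate I/O wait on a Linux host. It is generic Linux, not a Check Point tool: it assumes Python 3 is available on the machine and that /proc/stat has the usual layout of the aggregate "cpu" line. The function names are my own, chosen just for this example. Treat the result as an indication only; a tool like iostat gives a fuller picture.

```python
import time

def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line in /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user, so only the first 8 are summed.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total

def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and return iowait %."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)

# Example use on the log server:
#   print(f"iowait: {sample_iowait():.1f}%")
```

If the value stays high while fwd is busy, the disk rather than the CPU may be what is holding things back, and that is worth looking at before adding cores.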