Re: Physical memory vs FW memory. Explanation needed!

belteto · ‎2023-03-12

Hi all!

I'm trying to understand two parameters on our VSX VSLS gateways.

What are the differences and similarities between Physical memory and FW memory in cpview?

It's a bit foggy to me, and we are currently investigating some odd behaviour.

Could someone explain it to me? It seems to me that FW memory is more important to monitor than physical memory.

I attached a picture from cpview, taken when FW memory was fully utilised but physical memory was still at 50%.

In this state the gateway stops processing traffic and logs a lot of 'internal rule base error' drops, but the gateway itself remains available.

All input on this is highly welcome.

Thanks in advance.

PhoneBoy · ‎2023-03-13

Physical memory refers to the entire appliance.
Firewall memory refers to the memory allocated to the various processes and such related to firewall functions.
More information is definitely required to assist in troubleshooting this (for example version/JHF level, precise error messages and such).

belteto · ‎2023-03-13

Thanks for the explanation and the offer to help.

As I understand it, and correct me if I'm wrong:

Physical memory usage is always higher than FW memory usage, because:

Physical memory usage = FW memory usage + OS base memory usage

And physical memory usage increases as FW memory usage increases.

This is what we see on our other VSXs (each has 10 VSs on it): physical memory usage is ~3 GB higher than FW memory usage.

In this particular case, after the reboot and the latest hotfix (R81.10 T87), FW memory usage is still higher than physical memory usage, and it keeps rising very slowly.

Pic attached.

This is a 3-node cp26000 VSX cluster with 96 GB RAM, running R81.10 T87 (VSLS, with 39 VSs and 5 switches on it).

No error message is visible now. Only when FW memory was 100% full did we get 'internal rule base error' drop messages in the logs, nothing more.

TAC is already on it, and R&D will possibly be involved.

I'd like to pick your and the community's brains; maybe you have seen something similar.

PhoneBoy · ‎2023-03-13

It's possible there is a memory leak somewhere.
I recommend getting the TAC involved.

belteto · ‎2023-03-13

Yes, we (and TAC as well) suspect a memory leak, which is why they recommended applying T87, which has memory fixes (as they told us).

Maybe not all memory issues were fixed, so they are still investigating.

the_rock · ‎2023-03-13

Yeah, you got that right, but as @PhoneBoy said, it's possible you have a memory leak going on here. To be able to help you properly, can you send us the outputs of the commands below:

top

free -m

ps -auxw

cpview (look at initial screen for memory usage)

cpwd_admin list

enabled_blades

cat /proc/cpuinfo

cat /proc/meminfo

cpstat fw -f all
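
If it helps, the outputs requested above can be gathered into a single report file. This is only a minimal sketch under a few assumptions: collect_diag and the report path are made-up names, not Check Point tools; top is run in batch mode (-b -n 1) so that it exits on its own; and cpview is left out because it is interactive.

```shell
#!/bin/bash
# Sketch: run each diagnostic command and append its output to one report.
# collect_diag is a hypothetical helper, not part of Gaia.
collect_diag() {
    out="$1"
    shift
    : > "$out"                                  # create or truncate the report
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        bin=${cmd%% *}                          # first word is the binary name
        if command -v "$bin" >/dev/null 2>&1; then
            # Word splitting on $cmd is intentional: each entry is "binary args".
            $cmd >> "$out" 2>&1 || echo "(exit status $?)" >> "$out"
        else
            echo "(command not found: $bin)" >> "$out"
        fi
    done
}
```

Possible usage in expert mode (the output path is an assumption):

collect_diag /var/log/memdiag.txt "top -b -n 1" "free -m" "ps auxw" "cpwd_admin list" "enabled_blades" "cat /proc/cpuinfo" "cat /proc/meminfo" "cpstat fw -f all"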

Cheers,

Andy

belteto · ‎2023-03-13

Hi!

Attached the outputs.

Of the 39 VSs, 3 have additional blades enabled.

All the others run firewall only, and around 90-99% of all connections are accelerated.

Thx

Balint

Teddy_Brewski

Hello @belteto . Did you find anything with the TAC?

We've recently experienced the same issue with R81.20 Take 90 on open servers. The FW memory got consumed and we ended up with 'internal rule base error' drops.

The case with TAC went nowhere. They provided a huge list of kernel settings that need to be enabled at the moment the memory is saturated, which hasn't happened again so far.

Which brings me to another question: does anyone know how to monitor (SNMP) FW memory? I can get values for RAM - Real Active and RAM - Real Free, but it's no use.

Thank you.

the_rock

Hey Teddy,

Can you send what you see when running cpview? You can also check the history by running cpview -t and then pressing t to jump to the desired date. By the way, do you see anything consuming a lot of memory in top or ps -auxw? What does free -m show?

Andy

my lab:

[Expert@CP-GW:0]# free -m
              total        used        free      shared  buff/cache   available
Mem:          23309        6555        8494          32        8259       15303
Swap:          8191           0        8191
[Expert@CP-GW:0]#
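
When reading free -m, the "free" column tends to understate what is usable, because buff/cache can largely be reclaimed; "available" is the kernel's own estimate. A small sketch for turning the Mem: line into percentages follows. mem_pct is a made-up helper name, and the column order is assumed to match the header shown in the lab output above.

```shell
# mem_pct reads "free -m" output on stdin and prints used and available
# memory as a percentage of total. Hypothetical helper, not a Gaia command.
mem_pct() {
    awk '/^Mem:/ {
        # Assumed columns: total=$2 used=$3 free=$4 shared=$5 buff/cache=$6 available=$7
        printf "used=%.1f%% available=%.1f%%\n", $3 * 100 / $2, $7 * 100 / $2
    }'
}
```

Usage: free -m | mem_pct. For the lab figures above this prints used=28.1% available=65.7%.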

Teddy_Brewski

Hello @the_rock

We didn't have enough patience and time to identify what was consuming high memory from top. It was "Mad Max" emergency troubleshooting in the middle of the night. Even the initial troubleshooting went in the wrong direction: FW memory values were overlooked and everybody was focused on the state of the cluster, which was perfectly fine and healthy. The failover fixed the issue, and only in the morning did we notice the 'internal rule base' errors and start replaying cpview, which revealed FW memory exhaustion:

And this is how it looks now:

For some weeks we didn't experience any memory increase, so it's still under observation.

the_rock

Ok, fair enough. So, at this point, do commands top and ps -auxw show any process consuming high memory?

Andy

Teddy_Brewski

Nothing high:

As per 'ps -auxw', the heaviest consumer (2.2%) is:

admin 19372 1.7 2.2 2646548 1438580 ? S<Ll Nov16 638:14 fwk
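
For anyone repeating this check, the ps output can be ranked by resident memory instead of scanned by eye. The following is a sketch only: top_rss is a made-up helper name, and it assumes the usual "ps aux" column order (RSS in KiB in column 6, %MEM in column 4, command name in column 11).

```shell
# top_rss reads "ps auxw" output on stdin and prints the N largest processes
# by resident memory as: RSS_MiB %MEM COMMAND. Hypothetical helper.
top_rss() {
    n="${1:-5}"                                 # default to the top 5
    # NR > 1 skips the header line; RSS is converted from KiB to whole MiB.
    awk 'NR > 1 { printf "%d %s %s\n", $6 / 1024, $4, $11 }' |
        sort -rn | head -n "$n"
}
```

Usage: ps auxw | top_rss 10. The fwk line above would come out as "1404 2.2 fwk", that is, roughly 1.4 GB resident.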

the_rock

That looks normal.

Teddy_Brewski

@the_rock, @belteto I think I found what causes the memory increase. Adding these two DoS rules increased the memory usage by ~1 GB instantly, and it has kept growing ever since. The counter never goes down, always up.

fwaccel dos rate add destination range:192.168.100.100 pkt-rate 1000

fwaccel dos rate add destination range:192.168.100.100 concurrent-conns 10000

Where 192.168.100.100 is one of our quite busy, publicly exposed web servers.

I have ~10 similar rules for other servers, but it seems that only these two cause a continuous and noticeable memory rise. Deleting those two rules stabilizes the memory usage.

We had these rules active for quite some time (around a year) in R80.40, so I think it's linked to R81.20.

the_rock

Excellent work @Teddy_Brewski. I'm actually glad I saw your response, because I have a call with a customer later today and they asked me this exact question about the rules you added, so I will probably tell them not to do it if it caused all these issues.

Andy

PhoneBoy

Have you opened a TAC case on this?
While I can see memory increasing somewhat, that amount doesn't seem reasonable.

Teddy_Brewski

Going to! Now I can reproduce it live.

I think it's somehow related to the load or the protocols (http/s) used with that particular server. I don't see memory increasing with the other 10 rules.

the_rock

Hey Teddy,

Just curious, what's different about those other rules, if you don't mind expanding on it further?

Andy

Teddy_Brewski

Hi @the_rock , the syntax is exactly the same, it's just that the values are smaller:

fwaccel dos rate add destination range:xxx.xxx.xxx.x pkt-rate 500

fwaccel dos rate add destination range:xxx.xxx.xxx.x concurrent-conns 1000

I have a feeling it's linked to the nature of the traffic. The web server has http/s ports open and is extremely busy, and is perhaps being DDoSed continuously as we speak.

The rules above are applied to all publicly exposed authoritative DNS servers to mitigate fast flood DNS attacks. Could it be that they are not under attack at the moment, and that this explains why the memory isn't rising?

the_rock

Got it, makes sense, thanks a lot. Yes, I would agree with your guess: most likely it's because those are not under attack. By "those" I mean the IP addresses in the rules.

Andy

Teddy_Brewski

I think I will be able to confirm this very soon, since we're 'fast flooded' every 4-5 days.

the_rock

Please keep us posted.

Andy

Teddy_Brewski

I think my assumption is correct; it seems it affects all rules. Actually it's even worse, since the memory doesn't seem to be released either. Since my last post, according to 'fwaccel dos stats get', there has been an increase in SecureXL packets dropped due to the rate limit, so the current rules were indeed used:

Rate Limit: 1725575

According to cpview, the FW memory has increased too:

The counter never goes down. Since there are no ongoing attacks it flaps between 5,115 and 5,119, but never significantly lower.
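
To back up "never goes down" with data rather than by watching cpview, the readings can be logged and summarised. This is only a sketch: trend is a made-up helper name, and it assumes the samples have already been collected somehow, one numeric reading per line, oldest first.

```shell
# trend reads one numeric sample per line on stdin (oldest first) and prints
# the minimum, maximum, last value, and the largest drop between two
# consecutive samples. Hypothetical helper, not a Check Point tool.
trend() {
    awk 'NF {
        v = $1 + 0
        if (n == 0) { min = v; max = v }
        else if (prev - v > drop) { drop = prev - v }
        if (v < min) min = v
        if (v > max) max = v
        prev = v; n++
    }
    END { if (n) printf "min=%d max=%d last=%d largest_drop=%d\n", min, max, prev, drop }'
}
```

For example, printf '5115\n5119\n5116\n5119\n' | trend prints min=5115 max=5119 last=5119 largest_drop=3. A largest_drop that stays tiny while max keeps climbing over days would fit the "allocated but never released" picture.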

Anyone using 'fwaccel dos' rules under R81.20?

belteto

Hi Teddy!

In our case, a memory leak was identified, caused by the process of the dynamic_balancing feature (dsd).
The next jumbo hotfix solved the issue.

There is no way to monitor FW memory directly; no specific OID is assigned to that parameter.

To get the data via SNMP, we created custom OIDs that run a script which queries the counters with fw ctl pstat.

Added to /etc/snmp/userDefinedSettings.conf:

pass .1.3.555.1 /usr/local/bin/mem_pass.sh max

pass .1.3.555.2 /usr/local/bin/mem_pass.sh used

/usr/local/bin/mem_pass.sh:

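#!/bin/bash
# SNMP "pass" helper: prints OID, type, and value for the FW memory figures
# parsed from the "Physical" line of "fw ctl pstat".
# $1 is "max" or "used", as configured in userDefinedSettings.conf.

# Query the counters once and reuse the line for both values.
line=$(fw ctl pstat | grep Physical)
max=$(echo "$line" | awk '{print $9}')               # field 9
used=$(echo "$line" | awk '{print substr($5,2)}')    # field 5 without its first character

if [[ $1 =~ max ]]
then
    echo .1.3.555.1.0
    echo integer
    echo "$max"
fi

if [[ $1 =~ used ]]
then
    echo .1.3.555.2.0
    echo integer
    echo "$used"
fi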
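
Before wiring the script into snmpd, the parsing can be checked by hand. The sketch below uses the same field positions as the script above; pstat_mem is a made-up helper name, and the shape of the "Physical" line used in the example is an assumption inferred from those field positions, so compare it with the real fw ctl pstat output on your version.

```shell
# pstat_mem reads "fw ctl pstat" output on stdin and prints "USED MAX PCT".
# Hypothetical helper; field positions follow the script in this thread.
pstat_mem() {
    awk '/Physical/ {
        used = substr($5, 2) + 0    # drop the first character of field 5
        max  = $9 + 0
        if (max > 0) printf "%d %d %d\n", used, max, used * 100 / max
    }'
}
```

Usage: fw ctl pstat | pstat_mem. With an assumed line such as "Physical memory used: 26% (15845 MB out of 60217 MB)", this prints 15845 60217 26.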

Hope it helps!

Balint

Teddy_Brewski

Thanks a lot!
