pce17
Explorer

R81.10 and BGP

I have upgraded from R80.20 to R81.10. I currently have 2 eBGP peers and 1 iBGP peer.

When switching from active to standby, the old active (now standby) cluster member briefly goes into a down state. ROUTED on the now-standby member then runs at high CPU (65% of one CPU) for over 60 minutes.

Status so far:

- Lots of debugs and cpinfo collected.

- Checkpoint TAC's solution (ticket open 2 weeks) was to remove graceful restart, which only causes all connections to be dropped and high CPU. I will continue to work with TAC.

FYI: in R80.20 the cluster lost all connections for 30 seconds when going from active to standby. Checkpoint said the solution was to turn on graceful restart, and enabling it did resolve the 30-second connection drop in R80.20.

But now Checkpoint TAC claims removing graceful restart will fix the issue.

 

Is anyone else using iBGP on R81.10? Do you have any ideas?

 

Leo


37 Replies
Chris_Atkinson
Employee

How many routes are in the BGP table and do the adjacent peer/s have GR configured on their side?

Which JHF take is used on this gateway/cluster?

pce17
Explorer

 

400,000+ routes; GR is on both sides (see below). Members are at JHF take 30.

PeerID             AS    Routes  ActRts  State        InUpds  OutUpds  Uptime
12.122.NNN.NNN   7018     46809   40356  Established   11888        3  06:57:37
50.220.NNN.NNN   7922      7222    5110  Established    1936        3  06:57:01
4.53.NNN.NNN    21NNN    408564  392414  Established  126974        2  06:56:33

 

----- Peer 12.122
State Established (Uptime: 07:00:38)
Peer Type eBGP Peer
Remote AS 7018
Peer Capabilities IPv4 Unicast,Route Refresh,Cisco Route Refresh,Graceful Restart,4-Byte AS Extension
Our Capabilities IPv4 Unicast,Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh

----- Peer 50.220
State Established (Uptime: 07:00:02)
Peer Type eBGP Peer
Remote AS 7922
Peer Capabilities IPv4 Unicast,Route Refresh,Cisco Route Refresh,Graceful Restart,4-Byte AS Extension
Our Capabilities IPv4 Unicast,Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh

----- Peer 4.53
State Established (Uptime: 06:59:40)
Peer Type iBGP Peer
Remote AS 21NNN
Peer Capabilities IPv4 Unicast,Route Refresh,Cisco Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh
Our Capabilities IPv4 Unicast,Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh

Chris_Atkinson
Employee

From an external viewpoint, 400,000 routes in iBGP seems high for most environments.

Has TAC provided guidance on whether the situation would be improved by reducing this, e.g. by employing route optimization/summarization strategies downstream?
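To illustrate what that summarization buys (hypothetical prefixes here, not Gaia syntax), Python's stdlib `ipaddress` module can collapse contiguous networks into the smallest equivalent set, which is exactly the idea behind summarizing advertisements downstream:

```python
import ipaddress

# Hypothetical prefixes from a downstream peer: two contiguous /25s
# plus an unrelated /24.
prefixes = [
    ipaddress.ip_network("203.0.113.0/25"),
    ipaddress.ip_network("203.0.113.128/25"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# collapse_addresses merges contiguous/overlapping networks, so the
# two /25s become a single 203.0.113.0/24 and the table shrinks.
summary = list(ipaddress.collapse_addresses(prefixes))
print(summary)
```

Fewer prefixes received means less work for ROUTED on every update and failover, which is the point of pushing summarization toward the peers.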

Out of interest, which gateway appliance models are used here?

pce17
Explorer

In R80.20 I demonstrated to TAC that the issue went away when I filtered the iBGP routes. I mentioned the iBGP route size to TAC, but TAC did not seem interested; I think TAC believes it is a configuration issue. In R80.20 a custom ROUTED was created to fix the iBGP route issue. We are using open hardware.
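Conceptually, the filter that made the R80.20 symptom disappear just kept the default route and dropped everything else the iBGP peer sent. In Python terms (illustrative sketch only, with hypothetical prefixes; the real filter is Gaia routing configuration):

```python
import ipaddress

def default_only(received):
    """Keep only the default route from a peer's advertisements --
    conceptually what the R80.20 workaround filter did."""
    default = ipaddress.ip_network("0.0.0.0/0")
    return [p for p in received if p == default]

# Hypothetical sample of what the iBGP peer might send
received = [ipaddress.ip_network(p)
            for p in ("0.0.0.0/0", "198.51.100.0/24", "203.0.113.0/24")]
print(default_only(received))  # only the default route survives
```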

pce17
Explorer

My issue has been open with Sirius since February and with TAC for two weeks. You have been asking some very good questions. I can try adding the route filtering tomorrow at 6-7pm ET; that is our slow time during the week. From the beginning I have assumed it is an iBGP issue tied to the number of routes. TAC keeps saying that was fixed in R80.

Chris_Atkinson
Employee

If you have the SR number for the same issue under R80.20 you should be able to request a portfix via TAC if a hotfix was provided.

Where possible I would suggest both strategies are employed to ensure stability.

pce17
Explorer

Checkpoint R&D now claims that the standby cluster member sitting at high CPU (ROUTED) for hours is caused by having only a 1 Gb heartbeat interface. They said I need to upgrade to a 10 Gb heartbeat connection. Very interesting: Cisco says "Cisco typically recommends a minimum of 512 MB of RAM in the router to store a complete global BGP routing table from one BGP peer." 512 MB needs a 10 Gb connection?
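A rough back-of-the-envelope check of that claim (assuming, generously, that a full sync burst is on the order of the 512 MB Cisco sizing figure quoted above; that is an assumption about sync volume, not a measurement):

```python
# Time to push a 512 MB payload across the heartbeat/sync link,
# ignoring protocol overhead. The 512 MB figure is the Cisco RAM
# sizing quote above, used here as a generous stand-in for the
# amount of state that would need to cross the link.
payload_megabits = 512 * 8            # 512 MB = 4096 megabits

for link_mbps in (1000, 10000):       # 1 Gb link vs the suggested 10 Gb link
    seconds = payload_megabits / link_mbps
    print(f"{link_mbps} Mbps link: {seconds:.2f} s")
```

Either way the transfer is a few seconds at most, which is hard to reconcile with ROUTED sitting at high CPU for hours.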

pce17
Explorer

My switch says the heartbeat interface maxed out at 141 Mbps. 10 Gb?

Chris_Atkinson
Employee

Can you please share your SR number for the TAC case with me in private?

 

(P.S. How did you go with the route filtering / summarization?)

John_Fleming
Advisor

holy zoinks bat scoob!

That is an impressive amount of routes. I'm assuming those aren't all RFC 1918 prefixes?

pce17
Explorer

They are all Internet routes. I have been filtering out the RFC 1918 routes since R77.30 (the good old days).

pce17
Explorer

Today's update is that ROUTED crashed on the active cluster member (HA1), and the now-standby member (HA1) CUL'ed (Cluster Under Load) non-stop for 4 hours and 20 minutes.

Chris_Atkinson
Employee

Despite this occurrence, I want to come back to your original statement briefly.

"When switching from active to standby, the old active (now standby) cluster member briefly goes into a down state. ROUTED on the now-standby member then runs at high CPU (65% of one CPU) for over 60 minutes."

What operational problem is this creating for you, and how many cores/CPUs are assigned to the machine?

the_rock
Champion

Message me privately if you need more help with this... I have BGP running in my lab with R81.10 and I have not seen these issues at all. Personally, I don't see the logic in why you were asked to remove the graceful restart option; that can only help in a situation like yours.

pce17
Explorer

Can you set up your lab to have one peer configured as iBGP and then send in the full BGP route table? In R80.20, if I restricted the routes from my iBGP peer (default only), the issue went away. I never had any issues with R77.30.

the_rock
Champion

I can, but it might take some time, since I gave lab access to lots of my colleagues, as it's a very good setup. I will try to do it some time this week. In the meantime, feel free to message me privately and we can do a remote session tomorrow if you have time. I'm in the EST time zone (GMT-4 currently).

the_rock
Champion

I see what you are saying... tested it in R81.10, same issue. I wonder if it's some kind of bug...

Chris_Atkinson
Employee

Out of interest what about your internal topology needs the full routing table versus fewer summarized routes?

Perhaps cBit is an alternative to GR that may assist, per sk175923.

pce17
Explorer

We need the full routing table or else the data is not routed correctly. Due to the usage of BGP, the same subnets are used by many carriers.

pce17
Explorer

My ROUTED crashed again, and after over 30 days Checkpoint R&D has stated: "We are suspect large route update creates a bottle neck, we are working on confirming this possibility." No kidding; it was the same issue in R80.20. How long does it take to compare an R80.20 ticket to an R81.10 ticket?

the_rock
Champion

Personally, I doubt that's the issue myself, just my 2 cents. I have seen people in the past advertise way more routes than you and never have any problems.

pce17
Explorer

Were the routes in the cases you have seen iBGP or eBGP?

pce17
Explorer

The issue is the receiving of the iBGP routes, not the advertising.

the_rock
Champion

iBGP... as a matter of fact, I saw someone use 900,000+ with no issues. If I were you, I would ask R&D for more details on this, because unless there is clear-cut proof of what they told you, I can't see that being a real reason for your issue. Of course, needless to say, it would be much better if you advertised 10 rather than 400K routes, but I would definitely inquire more.

pce17
Explorer

Is there a way to get in touch with the other person who is receiving over 900,000 routes from an iBGP peer, to compare with?

I had the same issues in R80.20 and it required a custom ROUTED from Checkpoint R&D.

Chris_Atkinson
Employee

As I explained before, if you have the SR number for your previous case, you should be able to request that the fix be ported to a newer version where applicable.

pce17
Explorer

Already did that.

Chris_Atkinson
Employee

So is the issue different, or is the fix still being prepared?

pce17
Explorer

R&D is still working on it; the response is: "We are suspect large route update creates a bottle neck, we are working on confirming this possibility."
