pce17
Explorer

R81.10 and BGP

I have upgraded from R80.20 to R81.10. I currently have 2 eBGP peers and 1 iBGP peer.

When switching from active to standby, the old active (now standby) cluster member briefly goes into a down state. ROUTED on the now-standby member then runs at high CPU (65% of one CPU) for over 60 minutes.

Status so far:

- Lots of debugs and cpinfo collected.

- Checkpoint TAC's solution (ticket open 2 weeks) was to remove graceful restart, which only causes all connections to be dropped and high CPU. I will continue to work with TAC.

FYI: in R80.20 the cluster lost all connections for 30 seconds when going from active to standby. Checkpoint said the solution was to turn on graceful restart, and enabling it did resolve the 30-second connection drop in R80.20.

But now Checkpoint TAC claims removing graceful restart will fix the issue.

 

Is anyone else using iBGP on R81.10? Do you have any ideas?

 

Leo


37 Replies
Chris_Atkinson
Employee

How many routes are in the BGP table and do the adjacent peer/s have GR configured on their side?

Which JHF take is used on this gateway/cluster?

pce17
Explorer

 

400,000+ routes; GR is on both sides (see below). Members are at JHF take 30.

PeerID             AS    Routes  ActRts  State        InUpds  OutUpds  Uptime
12.122.NNN.NNN   7018     46809   40356  Established   11888        3  06:57:37
50.220.NNN.NNN   7922      7222    5110  Established    1936        3  06:57:01
4.53.NNN.NNN    21NNN    408564  392414  Established  126974        2  06:56:33

 

----- Peer 12.122
State Established (Uptime: 07:00:38)
Peer Type eBGP Peer
Remote AS 7018
Peer Capabilities IPv4 Unicast,Route Refresh,Cisco Route Refresh,Graceful Restart,4-Byte AS Extension
Our Capabilities IPv4 Unicast,Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh

----- Peer 50.220
State Established (Uptime: 07:00:02)
Peer Type eBGP Peer
Remote AS 7922
Peer Capabilities IPv4 Unicast,Route Refresh,Cisco Route Refresh,Graceful Restart,4-Byte AS Extension
Our Capabilities IPv4 Unicast,Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh

----- Peer 4.53
State Established (Uptime: 06:59:40)
Peer Type iBGP Peer
Remote AS 21NNN
Peer Capabilities IPv4 Unicast,Route Refresh,Cisco Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh
Our Capabilities IPv4 Unicast,Route Refresh,Graceful Restart,4-Byte AS Extension,Enhanced Route Refresh

Chris_Atkinson
Employee

From an external viewpoint, 400,000 routes in iBGP seems high for most environments.

Has TAC provided guidance on whether the situation would be improved by reducing this, e.g. by employing route optimization/summarization strategies downstream?
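To illustrate what that summarization buys (hypothetical prefixes here, not Gaia syntax), Python's stdlib `ipaddress` module can collapse contiguous networks into the smallest equivalent set, which is exactly the idea behind summarizing advertisements downstream:

```python
import ipaddress

# Hypothetical prefixes from a downstream peer: two contiguous /25s
# plus an unrelated /24.
prefixes = [
    ipaddress.ip_network("203.0.113.0/25"),
    ipaddress.ip_network("203.0.113.128/25"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# collapse_addresses merges contiguous/overlapping networks, so the
# two /25s become a single 203.0.113.0/24 and the table shrinks.
summary = list(ipaddress.collapse_addresses(prefixes))
print(summary)
```

Fewer prefixes received means less work for ROUTED on every update and failover, which is the point of pushing summarization toward the peers.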

Out of interest, which gateway appliance models are used here?

pce17
Explorer

In R80.20 I demonstrated to TAC that the issue went away when I filtered the iBGP routes. I mentioned the iBGP route size to TAC, but TAC did not seem interested; I think TAC believes it is a configuration issue. In R80.20 a custom ROUTED was created to fix the iBGP route issue. We are using open hardware.
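Conceptually, the filter that made the R80.20 symptom disappear just kept the default route and dropped everything else the iBGP peer sent. In Python terms (illustrative sketch only, with hypothetical prefixes; the real filter is Gaia routing configuration):

```python
import ipaddress

def default_only(received):
    """Keep only the default route from a peer's advertisements --
    conceptually what the R80.20 workaround filter did."""
    default = ipaddress.ip_network("0.0.0.0/0")
    return [p for p in received if p == default]

# Hypothetical sample of what the iBGP peer might send
received = [ipaddress.ip_network(p)
            for p in ("0.0.0.0/0", "198.51.100.0/24", "203.0.113.0/24")]
print(default_only(received))  # only the default route survives
```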

pce17
Explorer

My issue has been open with Sirius since February and with TAC for two weeks. You have been asking some very good questions. I can try adding the route filtering tomorrow at 6-7pm ET; that is our slow time during the week. From the beginning I have assumed it is an iBGP issue tied to the number of routes. TAC keeps saying that was fixed in R80.

Chris_Atkinson
Employee

If you have the SR number for the same issue under R80.20 you should be able to request a portfix via TAC if a hotfix was provided.

Where possible I would suggest both strategies are employed to ensure stability.

pce17
Explorer

Checkpoint R&D now claims that the standby cluster member sitting at high CPU (ROUTED) for hours is caused by having only a 1 Gb heartbeat interface. They said I need to upgrade to a 10 Gb heartbeat connection. Very interesting: Cisco says "Cisco typically recommends a minimum of 512 MB of RAM in the router to store a complete global BGP routing table from one BGP peer." 512 MB needs a 10 Gb connection?
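A rough back-of-the-envelope check of that claim (assuming, generously, that a full sync burst is on the order of the 512 MB Cisco sizing figure quoted above; that is an assumption about sync volume, not a measurement):

```python
# Time to push a 512 MB payload across the heartbeat/sync link,
# ignoring protocol overhead. The 512 MB figure is the Cisco RAM
# sizing quote above, used here as a generous stand-in for the
# amount of state that would need to cross the link.
payload_megabits = 512 * 8            # 512 MB = 4096 megabits

for link_mbps in (1000, 10000):       # 1 Gb link vs the suggested 10 Gb link
    seconds = payload_megabits / link_mbps
    print(f"{link_mbps} Mbps link: {seconds:.2f} s")
```

Either way the transfer is a few seconds at most, which is hard to reconcile with ROUTED sitting at high CPU for hours.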

pce17
Explorer

My switch says the heartbeat interface maxed out at 141 Mbps. 10 Gb?

Chris_Atkinson
Employee

Can you please share your SR number for the TAC case with me in private?

 

(P.S. How did you go with the route filtering / summarization?)

John_Fleming
Advisor

holy zoinks bat scoob!

That is an impressive amount of routes. I'm assuming those aren't all RFC 1918 prefixes?

pce17
Explorer

They are all Internet routes. I have been filtering out the RFC 1918 routes since R77.30 (the good old days).

pce17
Explorer

Today's update is that ROUTED crashed on the active cluster member (HA1), and the now-standby member (HA1) CUL'ed (Cluster Under Load) non-stop for 4 hours and 20 minutes.

Chris_Atkinson
Employee

Despite this occurrence, I want to come back to your original statement briefly.

"When switching from active to standby, the old active (now standby) cluster member briefly goes into a down state. ROUTED on the now-standby member then runs at high CPU (65% of one CPU) for over 60 minutes."

What operational problem is this creating for you, and how many cores/CPUs are assigned to the machine?

the_rock
Champion

Message me privately if you need more help with this... I have BGP running in my lab with R81.10 and I have not seen these issues at all. Personally, I don't see the logic in why you were asked to remove the graceful restart option; that can only help in a situation like yours.

pce17
Explorer

Can you set up your lab to have one peer configured as iBGP and then send in the full BGP route table? In R80.20, if I restricted the routes from my iBGP peer (default only), the issue went away. I never had any issues with R77.30.

the_rock
Champion

I can, but it might take some time, since I gave lab access to lots of my colleagues, as it's a very good setup. I will try to do it some time this week. In the meantime, feel free to message me privately and we can do a remote session tomorrow if you have time. I'm in the EST time zone (GMT-4 currently).

the_rock
Champion

I see what you are saying... tested it in R81.10, same issue. I wonder if it's some kind of bug...

Chris_Atkinson
Employee

Out of interest what about your internal topology needs the full routing table versus fewer summarized routes?

Perhaps cBit is an alternative to GR that may assist, per sk175923.

pce17
Explorer

We need the full routing table or else the data is not routed correctly. Due to the usage of BGP, the same subnets are used by many carriers.

pce17
Explorer

My ROUTED crashed again, and after over 30 days Checkpoint R&D has stated: "We are suspect large route update creates a bottle neck, we are working on confirming this possibility." No kidding; it was the same issue in R80.20. How long does it take to compare an R80.20 ticket to an R81.10 ticket?

the_rock
Champion

Personally, I doubt that's the issue myself, just my 2 cents. I have seen people in the past advertise way more routes than you and never have any problems.

pce17
Explorer

Were the routes in the cases you have seen iBGP or eBGP?

pce17
Explorer

The issue is the receiving of the iBGP routes, not the advertising.

the_rock
Champion

iBGP... as a matter of fact, I saw someone use 900,000+ with no issues. If I were you, I would ask R&D for more details on this, because unless there is clear-cut proof of what they told you, I can't see that being a real reason for your issue. Of course, needless to say, it would be much better if you advertised 10 rather than 400K routes, but I would definitely inquire more.

pce17
Explorer

Is there a way to get in touch with the other person who is receiving over 900,000 routes from an iBGP peer, to compare with?

I had the same issues in R80.20 and it required a custom ROUTED from Checkpoint R&D.

Chris_Atkinson
Employee

As I explained before, if you have the SR number for your previous case, you should be able to request that the fix be ported to a newer version where applicable.

pce17
Explorer

Already did that.

Chris_Atkinson
Employee

So is the issue different, or is the fix still being prepared?

pce17
Explorer

R&D is still working on it; the response is: "We are suspect large route update creates a bottle neck, we are working on confirming this possibility."
