Thursday, 11 May 2023

BGP is the answer, what is the question ?

In my talk at ITNOG7 I presented "BGP FlowSpec Services beyond DDOS Mitigation", with the intention of proposing other uses of FlowSpec, which is too often cataloged exclusively as a tool for managing DDoS attacks. I built two services to achieve egress engineering and bidirectional traffic steering, using a combination of BGP FlowSpec and MPLS L3VPN. Finally, I described a framework for creating NFV services that can scale on service provider architectures.
The slide with the requirements and the proposed solution speaks for itself:


In the conclusions I try to make people think about how I organized the service:


Ivan Pepelnjak, whom I had the honor of having as a reviewer and listener, told me that I should have called my presentation "BGP is the answer, what is the question ?", and I find it hard to disagree with him.


This year's #ITNOG was a huge edition. Over the years the event keeps growing in form and content, while maintaining the spirit of independence and informality that makes it unique.

A big thank you to all the ITNOG - Italian Network Operators Group staff for the work done and for allowing me to present again this year, and to all the people I met who wanted to renew their friendship and had the patience to listen to me.

The presentation is available on my GitHub repository


An extended edition from WWT 2023 is also available


Wednesday, 27 April 2022

modern bgp design

The Wholesale Winery Tour 2022 was an opportunity to meet old and new friends, and to present something new.

"modern bgp design" is a talk on how to overcome the stereotypes of traditional bgp design and combine new features using BGP as a real control-plane protocol.

The presentation targets small and medium-sized service providers, and the aim is to rationalize network resources and simplify infrastructure and configurations. It relies on what I like to call a "dual stage lookup" for upstream traffic: only the default route is distributed within the POPs and DCs, while the Full Internet Routing Table is kept only on the core devices.
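The "dual stage lookup" can be illustrated with a minimal Python sketch. The table contents, router names and prefixes below are hypothetical and only serve to show the idea: the POP router resolves upstream traffic through the default route, and the second, more specific lookup happens on the core device that holds the full table.

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Return the next hop of the most specific route covering destination."""
    address = ipaddress.ip_address(destination)
    candidates = [
        (network, next_hop)
        for network, next_hop in table.items()
        if address in network
    ]
    if not candidates:
        return None
    # The most specific route is the one with the longest prefix length.
    _, next_hop = max(candidates, key=lambda item: item[0].prefixlen)
    return next_hop

# Hypothetical tables: the POP router carries only local routes and a default,
# while the core router carries the (here tiny) full Internet routing table.
pop_table = {
    ipaddress.ip_network("0.0.0.0/0"): "core-1",
    ipaddress.ip_network("10.1.0.0/16"): "local",
}
core_table = {
    ipaddress.ip_network("192.0.2.0/24"): "transit-A",
    ipaddress.ip_network("198.51.100.0/24"): "transit-B",
}

def dual_stage_lookup(destination):
    """First stage on the POP router, second stage on the core if needed."""
    first_hop = longest_prefix_match(pop_table, destination)
    if first_hop != "core-1":
        return [first_hop]
    return [first_hop, longest_prefix_match(core_table, destination)]

print(dual_stage_lookup("198.51.100.7"))
print(dual_stage_lookup("10.1.2.3"))
```

In this toy model the POP table stays at two entries no matter how large the core table grows, which is the resource saving the design is after.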

A single pair of route reflectors (RRs) is used for the entire infrastructure, exploiting a combination of ADD-PATH [RFC 7911] and ORR (Optimal Route Reflection) [RFC 9107] to distribute, in a targeted manner, only the information necessary to maintain optimal routing and guarantee load balancing, or even just high availability based on local convergence.
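To see why ORR matters with a centralized RR pair, here is a small Python sketch. It assumes that every earlier BGP tie-breaker is equal, so the choice comes down to the IGP cost toward the exit router; the router names and costs are invented for the example.

```python
def best_exit(paths, igp_cost, vantage_point):
    """Pick the exit with the lowest IGP cost as seen from vantage_point."""
    return min(paths, key=lambda exit_router: igp_cost[vantage_point][exit_router])

# Hypothetical IGP costs from the route reflector and from one POP.
igp_cost = {
    "rr": {"border-east": 10, "border-west": 50},
    "pop-west": {"border-east": 60, "border-west": 5},
}
paths = ["border-east", "border-west"]

# A plain RR decides from its own position in the topology.
plain_rr_choice = best_exit(paths, igp_cost, "rr")
# With ORR the RR runs the decision from the client's position instead.
orr_choice = best_exit(paths, igp_cost, "pop-west")

print(plain_rr_choice, orr_choice)
```

The plain RR would send the western POP to the eastern border simply because the RR itself sits closer to it; computing the decision from the client's vantage point avoids that.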

This is the full presentation



Friday, 29 May 2020

Switch as Internet Border Router - FIRT with selective FIB install

I had the opportunity to present at ITNOG on the web the use of a switch as an Internet border router, and how to set up a distribution strategy within the backbone that reduces the routing information while maintaining “almost optimal” routing.

For the border router, I used selective FIB installation, loading into TCAM only the significant destinations. The wide availability of RAM on recent switches makes it possible to hold the FIRT (Full Internet Routing Table) and to tag with a BGP community the relevant destinations to be loaded into the FIB. For the remaining prefixes, a “hot-potato” strategy can be applied using a default route to the transit provider.
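A minimal Python sketch of the selective FIB installation follows. The community value `65000:100`, the prefixes and the next-hop names are assumptions made for the example; on a real switch the selection is expressed with the vendor's native policy language, not with code like this.

```python
# Hypothetical community used to mark the significant prefixes.
SIGNIFICANT = "65000:100"

def build_fib(rib, default_next_hop):
    """Install the default route plus only the routes tagged as significant."""
    fib = {"::/0": default_next_hop}
    for prefix, route in rib.items():
        if SIGNIFICANT in route["communities"]:
            fib[prefix] = route["next_hop"]
    return fib

# The RIB (in RAM) holds every route; only tagged ones reach the FIB (TCAM).
rib = {
    "2001:db8:a::/48": {"next_hop": "peer-1", "communities": {"65000:100"}},
    "2001:db8:b::/48": {"next_hop": "peer-2", "communities": set()},
    "2001:db8:c::/48": {"next_hop": "peer-2", "communities": {"65000:100", "65000:200"}},
}

fib = build_fib(rib, default_next_hop="transit")
for prefix, next_hop in fib.items():
    print(prefix, "->", next_hop)
```

Traffic toward the untagged prefix still leaves the network, but through the default route toward the transit provider rather than through its own FIB entry.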



The optimal border selection with a “cold-potato” approach is then realized in the backbone, using intelligent external route reflection on the route reflector.

The border routers send the FIRT to the route reflector with the significant prefixes already tagged with BGP communities. It therefore becomes the task of the RR to reflect only these significant prefixes or to make a further selection, for example by combining the destinations with a NetFlow analysis.
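The selection performed on the route reflector can be sketched in Python as well. The community value, the traffic counters and the `top_n` cut-off are hypothetical; the text only says that the tagged prefixes may be combined with a NetFlow analysis, so ranking by byte count is just one possible way to do it.

```python
# Hypothetical community carried by the prefixes the border routers tagged.
SIGNIFICANT = "65000:100"

def routes_to_reflect(received, traffic_bytes=None, top_n=None):
    """Reflect tagged routes, optionally keeping only the top_n by traffic."""
    tagged = [route for route in received if SIGNIFICANT in route["communities"]]
    if traffic_bytes is None or top_n is None:
        return tagged
    ranked = sorted(
        tagged,
        key=lambda route: traffic_bytes.get(route["prefix"], 0),
        reverse=True,
    )
    return ranked[:top_n]

received = [
    {"prefix": "2001:db8:a::/48", "communities": {"65000:100"}},
    {"prefix": "2001:db8:b::/48", "communities": set()},
    {"prefix": "2001:db8:c::/48", "communities": {"65000:100"}},
    {"prefix": "2001:db8:d::/48", "communities": {"65000:100"}},
]
# Invented per-prefix byte counters standing in for a flow export.
traffic_bytes = {"2001:db8:a::/48": 900, "2001:db8:c::/48": 50, "2001:db8:d::/48": 400}

all_tagged = routes_to_reflect(received)
busiest = routes_to_reflect(received, traffic_bytes, top_n=2)
print([route["prefix"] for route in busiest])
```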

The idea is inspired by the work of David Barroso and Paolo Lucente with their SIR (SDN Internet Router).

The main differences of my solution:

-       It’s a combination of selective FIB installation on the border router and selective route distribution in the backbone.
-       On the border router, the approach is completely based on BGP and native policies, and not on external programming or on the loading of the FIB by an external controller.
-       Further route selection, driven by a NetFlow analysis, is performed only on the route reflector.
-       All the backbone routers operate without the FIRT and with a traditional aligned RIB/FIB.

The solution is therefore extremely simple: it requires peering devices capable of holding the FIRT in RAM and, in TCAM, a number of prefixes much smaller than the size of the current FIRT.

This is the complete presentation

Monday, 3 December 2018

The need for simplicity and standardization, at least in networking (The wheel has already been invented)

As my knowledge and experience in networking evolved, I came to the conclusion that too much freedom and too many features can be very dangerous, especially in the wrong hands. After all, who would give a Ferrari to a young driver?

The truth is that getting to the essence, removing the superfluous and using the right tools in the right way is a precious skill, to be developed with continuous study, dedication and preferably under the right guidance. Often this could simply be the result of identifying requirements and turning them into a reference architecture, but too often design becomes an exercise in creativity and a search for originality, with a continuous effort to use every last available feature.

In recent years I have combined the search for simplicity with that of prototyping to make everything replicable and automatable. During my participation at the #NFD19, I had the opportunity to learn more about Apstra and meet a group of people for whom this mantra has been realized in the creation of their AOS "Apstra Operating System".

They started from a few considerations:


AOS is designed to create and manage multivendor Clos IP fabrics, to which a standard configuration is applied that makes extensive use of BGP and EVPN. The model used is not mandatory and can be modified to adapt to different customer needs. But in that case we would be calling the starting idea back into question; in fact, talking with the Apstra team, they confirmed that customers eventually adopt the proposed configuration without modifications.

The advantages of this approach are such that they discourage any change. The configurations are in fact:
  • Made by experts and based on their greater experience
  • Born to guarantee interoperability between different vendors
  • Continuously verified to guarantee compatibility with the new releases of AOS and vendors.


This is my tweet during the event, with the nice slide in which the CEO presents the results of the EVPN interoperability between Cisco and Arista.

Intent-based Design
This abstraction and simplified management therefore translate into a set of applications in which the resources available are defined in terms of devices, connections, addresses, AS numbers, and the intentions (or blueprints), namely the specificity of connectivity, visibility, security etc. AOS is concerned with translating them into configuration statements and applying them to the devices.

Automation of Automation
Automation is not just delegated to configuration management but is a fundamental component of the platform. In fact, a common platform is provided to customize and further automate the various devices. The thing that struck me is the possibility to create probes, which allow a pro-active monitoring of the infrastructure. On GitHub there is already a large collection of probes and users are encouraged to contribute. The automation is therefore automatable, as AOS itself. I leave to you to determine the level of recursion to which we have arrived :-).

Intent-based Analytics
AOS integrates sophisticated telemetry management, which here too normalizes the different vendors. The results of the active probes are integrated, and here too it is possible to plug in automated tools. The preventive management of GBIC failures was presented as an example: the entire process has been automated, from taking the component offline when the established threshold values are exceeded, to opening a replacement request and handling the subsequent replacement and return.

It is not my goal to review the platform; you can draw your own conclusions from the NFD videos available at the Tech Field Day portal. But I would like to share some thoughts:

Is it possible to simplify everything?
The reality is that in order to manage very large and intrinsically complex systems, simplicity is a requirement. But simplifying does not mean renouncing any of the necessary functionalities, rather rationalizing and reducing the features used to achieve the prefixed objectives.

What makes me think that Apstra is on the right track?
It comes down to the maturity and scalability of the architecture and technology used: Clos fabric, BGP and EVPN are now the building blocks of every datacenter. If once they were the prerogative of large data centers only, now they are also applicable on a small scale. I have already discussed my beliefs with highly qualified people here (Using EVPN in Very Small Data Center Fabrics), and the data presented by Apstra also show how their solution is spreading in the enterprise market and not just among the big cloud providers.

The use of an abstraction layer (or intent), can eliminate the main obstacle represented by BGP knowledge and experience.

Are we going to meet a world that is all the same?
In reality, isn't that what we have always looked for with the various IETF, ISO, etc.? Unfortunately, the transition from what has been defined to what has been implemented has sometimes introduced insurmountable incompatibilities or led to completely different implementations (TRILL, anyone?).
But moving from protocols to architectures, we introduce the "user" variable that too often ventures into original solutions "because I have unique needs" or simply as a result of a disorderly growth that miraculously works (Layer-2 + Any form of spanning tree).

Having a guide and a reference model which must be followed (especially for those without the skills) is certainly very useful.

Can I keep my uniqueness?
The wheel has already been invented! Just think that Clos network theory was defined in 1953, and only in the last few years has it been applied in our networks. I believe that only now is networking beginning to come out of its stone age and to start the transition from craftsmanship to industrialization. It is important to invest your resources intelligently, in the search for simplification and automation of work rather than in solving problems for which ready and tested solutions exist. This means working at a higher level, exploiting automation and, in this case with Apstra, being able to do it transparently across multiple vendors.

Right now the solution is confined to the datacenter, but I'm curious to see how it evolves and, if it expands further into the enterprise world, how it will address the Campus and WAN themes.

Monday, 7 November 2016

evpn control-plane for overlay networks

I had the opportunity to talk about datacenters during ITNOG2. Thank you guys!


 
I talked about the use of EVPN as the control plane for overlay networks, and how to exploit it to create distributed services between different datacenters.
I also mentioned the use of EVPN type-5 with proxy-arp to reduce distribution of mac-address routes and completely eliminate layer-2, while maintaining compatibility with current clustering and HA solutions based on layer-2 but now distributed in multiple datacenters.


 

This is the full presentation

Monday, 20 August 2012

BGP Diverse-Path for a faster convergence


The BGP implementation in Junos is event-driven, while in IOS it is timer-based and requires the scan process to go through the BGP table and select the best path to put into the RIB. The bgp scan-time command controls this interval, with a default value of 60 sec.
In a large-scale BGP scenario, where route reflectors are usually involved, this means that in the worst case the convergence time can be up to 120 sec, because the route reflector must converge and send its BGP update before the client can have a consistent BGP table, compute the new best path and update the RIB.
This is because route reflectors distribute only the best path to their clients.

In a layer-3 MPLS VPN scenario this problem is solved by using different Route Distinguishers, which create non-comparable entries on the route reflectors and allow the reflection of both routes to all clients. This moves the best-path selection process to the clients, eliminating the intermediate convergence step of the route reflectors.
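The effect of the Route Distinguisher can be shown with a simplified Python model of a route reflector that keeps one best path per NLRI. The RD values, prefix and PE names are invented, and the best-path comparison is reduced to local preference only, which is an assumption made to keep the sketch short.

```python
def reflect(routes):
    """Keep one best path per NLRI key, as a plain route reflector would."""
    best = {}
    for route in routes:
        key = route["nlri"]
        if key not in best or route["local_pref"] > best[key]["local_pref"]:
            best[key] = route
    return list(best.values())

# Global table: both exits announce the very same NLRI, so they compete.
global_routes = [
    {"nlri": "203.0.113.0/24", "next_hop": "PE1", "local_pref": 100},
    {"nlri": "203.0.113.0/24", "next_hop": "PE2", "local_pref": 100},
]
# L3VPN: a different RD per PE makes the two NLRIs distinct entries.
vpn_routes = [
    {"nlri": "65000:1:203.0.113.0/24", "next_hop": "PE1", "local_pref": 100},
    {"nlri": "65000:2:203.0.113.0/24", "next_hop": "PE2", "local_pref": 100},
]

reflected_global = reflect(global_routes)
reflected_vpn = reflect(vpn_routes)
print(len(reflected_global), len(reflected_vpn))
```

With distinct RDs the clients receive both next hops and can fail over locally; in the global table only one survives the reflector.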
But how can the problem be solved in the global routing table?
Different approaches have been proposed, and a wonderful discussion can be found here:

http://blog.ine.com/2010/11/22/understanding-bgp-convergence/

I already use the Add-Path ( http://tools.ietf.org/html/draft-ietf-idr-add-paths-07 ) extension, which permits multiple next hops for the same prefix. This allows load balancing in addition to fast convergence thanks to direct next-hop tracking, but the approach requires support for this new BGP capability and usually MPLS encapsulation on the backbone, to prevent IP lookups and possible routing loops on transit nodes.
BGP Diverse-Path ( http://tools.ietf.org/html/draft-ietf-grow-diverse-bgp-path-dist-08 ) is not a new capability: it relies on knowledge of the topology and uses the existing attributes of a typical RR BGP cluster. One cluster member is selected as a "shadow" route reflector and, instead of reflecting the best path ( which is reflected by the other route reflectors in the cluster ), it announces the backup path to its clients. It is also important to note that, like all other routers in the backbone, it still installs the best path into its own RIB for traffic forwarding.

Now all backbone routers have at least two iBGP peering sessions to the RR cluster: the first to the regular route reflector and the other to the shadow RR.
The BGP table on the RR clients now contains both the best and the backup path, allowing a local calculation of the best path. This removes the need to wait for the route reflector to converge, halving the total convergence time.
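The division of labor between the regular and the shadow route reflector can be modeled with a short Python sketch. The ranking function is a deliberately simplified stand-in for the real BGP decision process (local preference, then the next-hop string as a deterministic tie-break), and the two next hops are borrowed from the lab addressing used later in this post.

```python
def rank(paths):
    """Order paths best first; a simplified stand-in for BGP best-path."""
    return sorted(paths, key=lambda path: (-path["local_pref"], path["next_hop"]))

def regular_rr_advertises(paths):
    """The regular route reflector reflects the best path."""
    return rank(paths)[0]

def shadow_rr_advertises(paths):
    """The shadow route reflector reflects the second best (backup) path."""
    ranked = rank(paths)
    return ranked[1] if len(ranked) > 1 else None

def client_best(received, reachable_next_hops):
    """The client picks locally among the paths it already holds."""
    usable = [p for p in rank(received) if p["next_hop"] in reachable_next_hops]
    return usable[0] if usable else None

paths = [
    {"next_hop": "FD00::4", "local_pref": 100},
    {"next_hop": "FD00::2", "local_pref": 100},
]
received = [regular_rr_advertises(paths), shadow_rr_advertises(paths)]

before_failure = client_best(received, {"FD00::2", "FD00::4"})
after_failure = client_best(received, {"FD00::4"})
print(before_failure["next_hop"], after_failure["next_hop"])
```

The point of the model is the last call: when the primary next hop disappears, the client already holds the alternative and does not need a new update from any route reflector.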

This behavior is not new: in the past it was achieved with IGP metric manipulation on the shadow route reflector ( because in these cases the tie-break for the best-path selection process is the IGP metric ), but now some IOS images support building this architecture in a simple manner.
The last step to speed up the convergence process is to eliminate the scan time and tie the reconvergence process to next-hop availability. This can be done with the next-hop tracking feature, which tracks the IGP for next-hop reachability and triggers an immediate reconvergence. In recent IOS versions this function is enabled by default.
Take care: having such different convergence times ( from a few ms to 120 sec ) in different parts of the backbone can lead to traffic loops and a high sensitivity to flapping links. Developing a fast-converging, high-capacity backbone requires a careful analysis of all components ( and the possible involvement of MPLS, LFA and TE ), not just enabling some fancy feature.
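As a back-of-the-envelope check of the numbers above, the following Python sketch multiplies the number of BGP speakers that must react one after the other by the delay each of them may add. It is a rough upper bound only: it ignores IGP convergence, update propagation and processing time, and the 1 sec next-hop trigger delay is an assumed value for the example.

```python
SCAN_TIME_S = 60   # default BGP scan interval quoted in the text
NHT_DELAY_S = 1    # next-hop trigger delay, an assumed value for this example

def worst_case_convergence(sequential_stages, per_stage_delay_s):
    """Upper bound when each stage may wait a full interval before reacting."""
    return sequential_stages * per_stage_delay_s

scenarios = {
    # The RR must rescan and update the client, then the client rescans.
    "timer-based, best path only from the RR": worst_case_convergence(2, SCAN_TIME_S),
    # The client already holds the backup path, so only its own scan counts.
    "timer-based, backup path on the client": worst_case_convergence(1, SCAN_TIME_S),
    # Next-hop tracking replaces the scan interval with the trigger delay.
    "next-hop tracking, backup path on the client": worst_case_convergence(1, NHT_DELAY_S),
}

for name, seconds in scenarios.items():
    print(f"{name}: up to {seconds} s")
```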

Testing Lab

This is the complete lab scenario to test this capability:

In the lab only IPv6 addresses from the ULA ( Unique Local Address ) range are used. There is a single level-2 IS-IS area, with all the internal point-to-point links left to the automatic link-local addresses. Loopbacks are numbered as /128 IPv6 addresses and an aggregate prefix is generated on the peering point.
The complete addressing and IGP configuration of R2 looks like:

!
interface Loopback0
no ip address
ipv6 address FD00::2/128
!
interface FastEthernet0/0
no ip address
!
interface FastEthernet0/0.201
description ---- to R1 ----
encapsulation dot1Q 102
ipv6 enable
ipv6 router isis
isis network point-to-point
!
interface FastEthernet0/0.203
description ---- to R3 ----
encapsulation dot1Q 203
ipv6 enable
ipv6 router isis
isis network point-to-point
!
interface FastEthernet0/0.205
description ---- to R5 - ASN2 ----
encapsulation dot1Q 205
ipv6 address FD00:25::2/64
!
router isis
net 49.0000.0000.0002.00
is-type level-2-only
metric-style wide
no hello padding
passive-interface Loopback0
!


R1 is chosen as the shadow route reflector.
To configure BGP Diverse Path on the shadow router, 4 steps are required:


1) Disable the IGP metric tie-break in the best-path selection ( optional and topology dependent )
bgp bestpath igp-metric ignore
2) Allow the identification of the backup path
bgp additional-paths select backup
3) Permit the backup path announcement
bgp additional-paths send
4) Select the route-reflection clients enabled for the update ( the Clients peer-group )
neighbor Clients advertise diverse-path backup


The complete BGP configuration of the shadow RR ( R1 ):


!
router bgp 1
bgp router-id 100.0.0.1
bgp cluster-id 1
bgp log-neighbor-changes
no bgp default ipv4-unicast
neighbor Clients peer-group
neighbor Clients remote-as 1
neighbor Clients update-source Loopback0
neighbor FD00::2 peer-group Clients
neighbor FD00::3 peer-group Clients
neighbor FD00::4 peer-group Clients
!
address-family ipv4
exit-address-family
!
address-family ipv6
bgp additional-paths select backup
bgp additional-paths send
bgp bestpath igp-metric ignore
neighbor Clients route-reflector-client
neighbor Clients advertise diverse-path backup
neighbor FD00::2 activate
neighbor FD00::3 activate
neighbor FD00::4 activate
exit-address-family
!


Check the BGP status:


R1#sh bgp all summary
...
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
FD00::2 4 1 20 24 5 0 0 00:14:47 1
FD00::3 4 1 22 28 5 0 0 00:18:56 0
FD00::4 4 1 24 27 5 0 0 00:18:55 2


The BGP table identifies the best path for "FD00:5::/64" through R2 ( and installs it into the RIB ) and the possible "backup-path" through R4:


R1#sh bgp ipv6 unicast
BGP table version is 5, local router ID is 100.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i FD00::/64 FD00::4 0 100 0 i
*>i FD00:5::/64 FD00::2 0 100 0 2 i
*bi FD00::4 0 100 0 2 i


This backup path is now sent to R3:

R1#sh bgp ipv6 unicast neighbors FD00::3 advertised-routes
BGP table version is 5, local router ID is 100.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i FD00::/64 FD00::4 0 100 0 i
*biaFD00:5::/64 FD00::4 0 100 0 2 i

Total number of prefixes 2


On R3 the next-hop trigger is enabled with a delay of 1 sec for the IPv6 address-family:


router bgp 1
bgp router-id 100.0.0.3
no bgp default ipv4-unicast
bgp log-neighbor-changes
neighbor FD00::1 remote-as 1
neighbor FD00::1 update-source Loopback0
neighbor FD00::2 remote-as 1
neighbor FD00::2 update-source Loopback0
!
address-family ipv6
bgp nexthop trigger enable
bgp nexthop trigger delay 1
neighbor FD00::1 activate
neighbor FD00::2 activate
exit-address-family
!


On R3 two exit points for FD00:5::/64 are now present: the best path still selects R2 as the primary, but the backup path is already present in the BGP table:

R3#sh bgp ipv6 unicast
BGP table version is 8, local router ID is 100.0.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric LocPrf Weight Path
* iFD00::/64 FD00::4 0 100 0 i
*>i FD00::4 0 100 0 i
*>iFD00:5::/64 FD00::2 0 100 0 2 i
* i FD00::4 0 100 0 2 i


A traceroute confirms that the complete path is correct:

R3#traceroute ipv6 fd00:5::5

Type escape sequence to abort.
Tracing the route to FD00:5::5

1 FD00::2 12 msec 8 msec 8 msec
2 FD00:5::5 24 msec 84 msec 84 msec


As a simple test, during a continuous ping to R5 from R3, the R2 loopback was forced down, triggering the backup path selection without any packet loss.


R3#ping fd00:5::5 repeat 1000

Type escape sequence to abort.
Sending 10000, 100-byte ICMP Echos to FD00:5::5, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

*Mar 1 01:00:05.675: %BGP-3-NOTIFICATION: received from neighbor FD00::2 4/0 (hold time expired) 0 bytes
*Mar 1 01:00:05.675: %BGP-5-ADJCHANGE: neighbor FD00::2 Down BGP Notification received
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!



Conclusion

With this solution the convergence time of BGP is now comparable to that of the IGP, even with route reflectors.
BGP add-path is obviously the more powerful option, but it requires the specific capability on most of the BGP speakers and is therefore recommended for new solutions, while diverse-path can help improve the overall convergence time without requiring any new capability on legacy devices. MPLS is not strictly required by either solution, but take my advice and always adopt it.

Feature availability:

This feature is primarily available in IOS XR and was recently implemented in IOS 15.2(3)T and 15.2(4)S.

The complete lab configurations are here