Complex Routing Symmetry in BGP
Thu, 07 Dec 2023
By Zachary Cayou, SMS (https://www.sms.com)

Enforcing network traffic symmetry across stateful devices has long been the biggest routing challenge faced in our network. It is unquestionably the single problem to which I have dedicated the most brainstorming, whiteboarding, and testing. In this blog I share the problem, how it relates to our network, and the theory and solution that we use to solve it. This blog assumes the reader has a strong foundation in BGP design and behavior for a full understanding of the theory and solution.

Traffic Symmetry

In many networks, asymmetry of traffic can be acceptable, provided it does not negatively affect the reliability of the traffic flow. Routers generally do not function as stateful devices — they do not require establishing and maintaining a connection that must handle both sides of a bidirectional traffic flow. However, firewalls, load balancers, and many security devices generally are stateful — they must observe the traffic flow in both directions. It should be intuitive that any device responsible for inspecting traffic, and not just forwarding traffic, would have this requirement. In a large, dispersed network that handles hundreds of stateful devices, the challenge rests on engineering the routing to ensure that any traffic flow traverses the same stateful device(s) bidirectionally. This challenge may be simple when stateful devices are placed at points in the network where traffic is naturally funneled, such as the ingress point to a local network, but may be very difficult when those devices are instead distributed across centralized locations.

Our Network

To better frame the problem as it relates to our network, it is helpful to provide an overview of our network design. Our network consolidates all network security and inspection into approximately a dozen centralized locations across the United States, while processing traffic originating at hundreds of other locations. We refer to these centralized security/inspection locations as stacks. They are a collection of routers, firewalls, switches, load balancers, passive and active inspection devices, and monitoring tools. These stacks also serve as the ingress/egress point for all traffic entering or exiting our network perimeter. Security is layered into tiers, where traffic flows often require traversing several sets of security devices when communicating between different organizations, each of which controls the security policy in their routing domain. Each stack provides redundancy within itself, as well as redundancy in predefined failover orders, such that any traffic pattern should have N+1 redundancy within a single stack, and N+2 redundancy across different stacks. In this model, every stack and edge location can be said to have a distinct primary, secondary, and tertiary stack for processing traffic.

All edge locations on the network are typically served by a single pair of edge routers with an L2 campus infrastructure.

[Figure 1]

All locations on the network connect to an MPLS network and exchange L3 routing information in BGP via VPNv4/VPNv6. Every security zone on the network exists as an isolated forwarding domain that must traverse one or more security tiers to communicate outside of that zone. These are enforced by strictly segregated L3VPNs across the MPLS core, VRFs on routing devices, VLANs on switches, and ultimately correspond to a unique zone on a firewall, the sum of which we refer to as a routing-domain.

The routing-domains form the discrete components of the tiered architecture, where an upper tier may belong to a large entity who broadly controls the security policy for the entire organization at the network perimeter, while a lower tier may belong to a subdivision of that organization, responsible only for their piece of the network, but still nested within the organization as a whole.

[Figure 2]

All intra-zone traffic routes directly while all inter-zone traffic must route through some portion of a stack — at a minimum across the FW within the zone’s routing domain. Routing information between zones within a stack is exchanged via eBGP, directly from each VRF on the edge routers to a control-plane-only router that we call the firewall-router. This router effectively shadows each firewall context and provides the routing functionality that could otherwise be provided by the firewall itself.

[Figure 3]

Requirements

The ultimate goal in developing our solution is that any traffic flow between hosts in different zones takes a symmetric path through the stacks, subject to the following requirements:

  1. Signaled in BGP — the routing logic must be enforced and signaled by BGP and without maintaining knowledge of specific prefixes (e.g. no policy-based-routing based on prefix lists).
  2. Redundancy — minimum N+2 redundancy across stacks must be provided for any traffic flow.
  3. Predictable/Optimal pathing — a traffic flow should always take a predictable and optimal path. These paths should be predetermined based on the geographic proximity of a network to a stack.

Theory

To better understand the logic behind how we solve this problem, I present the theory we use on how to classify and frame routing decisions within this environment.

Note: A number of terms unique to this post, often italicized, are defined and used in this section to describe certain routing behaviors; they should not be mistaken as belonging to the general network engineering lexicon nor any industry standard.

Source/Destination-Following

Source-following and destination-following are a pair of terms we coined to describe the two fundamental options for choosing a stack to route traffic through. In source-following, traffic to a destination follows the stack nearest to the source — but more accurately, the stack nearest the device forwarding the traffic. In destination-following, traffic to a destination follows the stack nearest to the destination. It should be apparent that using the same option bidirectionally results in asymmetry through the stacks. However, using the opposite technique on each side of the flow results in symmetry through a single stack.

[Figure 4]
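The two options can be sketched with a toy model; the network-to-nearest-stack mapping below is hypothetical, not taken from our network:

```python
# Toy model of source- vs destination-following (illustrative only).
# nearest_stack maps each network to its geographically nearest stack.
nearest_stack = {"A": "stack-1", "D": "stack-4"}

def source_following(src, dst):
    """Traffic follows the stack nearest the forwarding side (the source)."""
    return nearest_stack[src]

def destination_following(src, dst):
    """Traffic follows the stack nearest the destination."""
    return nearest_stack[dst]

# The same option used bidirectionally yields asymmetry through the stacks:
assert source_following("A", "D") != source_following("D", "A")

# Opposite options on each side of the flow meet at a single stack:
forward = source_following("A", "D")       # A -> D follows A's stack
reverse = destination_following("D", "A")  # D -> A also follows A's stack
assert forward == reverse == "stack-1"
```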

This logic is implemented by tagging prefixes with communities indicating both their proximity to, and advertisement through, a particular stack. We then use these communities in policy to set appropriate metrics.

Consider for the following examples a network composed of 4 stacks with N+1 redundancy, where 1/2 and 3/4 form redundancy groups (i.e. 1 is secondary to 2, and 2 is secondary to 1). Each Rx represents a specific forwarding table (i.e. VRF) on the edge routers at the stack.

[Figure 5]

In destination-following, each prefix is tagged with the dst-x community when originated into BGP, where x is an identifier for the stack. Policies on the north/south eBGP neighbors within the stack set a local-preference of 400 for paths tagged with the dst-x community that corresponds to its local stack, and a local-preference of 300 for paths that correspond to its secondary stack. The preferences are unmodified as they are advertised east/west across the VPNv4 mesh, and thus all tables on that horizontal plane share the same perspective on any prefix — each agrees on the same best path for any prefix.

In source-following, each path is tagged with the src-x community when advertised north/south through the stack. Policies on the north/south eBGP neighbors set a local-preference of 400 for all paths. The east/west VPNv4 neighbors set a local-preference of 400 for paths corresponding to its local stack, and 300 for paths corresponding to its secondary stack, as indicated by the src-x community. In this case, all tables on the horizontal plane have a different perspective on the best path for any prefix.
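Both policies can be sketched together; communities are modeled as plain strings, and the 1/2 and 3/4 redundancy pairing follows the example above:

```python
# Sketch of the community-driven local-preference logic just described.
# Communities are modeled as strings like "dst-1" / "src-2"; stack IDs
# and the secondary mapping are taken from the 4-stack example.
SECONDARY = {1: 2, 2: 1, 3: 4, 4: 3}  # 1/2 and 3/4 form redundancy groups

def dst_following_lp(communities, local_stack):
    """North/south eBGP policy within a stack (destination-following)."""
    if f"dst-{local_stack}" in communities:
        return 400
    if f"dst-{SECONDARY[local_stack]}" in communities:
        return 300
    return None  # path not preferenced by this policy

def src_following_lp(communities, local_stack):
    """East/west VPNv4 import policy (source-following)."""
    if f"src-{local_stack}" in communities:
        return 400
    if f"src-{SECONDARY[local_stack]}" in communities:
        return 300
    return None

# A prefix originated behind stack-1, seen at stack-1 and its secondary:
assert dst_following_lp({"dst-1"}, 1) == 400
assert dst_following_lp({"dst-1"}, 2) == 300
# A path advertised through stack-4, seen laterally at stack-4 and stack-3:
assert src_following_lp({"src-4"}, 4) == 400
assert src_following_lp({"src-4"}, 3) == 300
```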

[Figure 6]

In this scenario, the best path at R12 to network D1 is the LP 400 path learned through the stack from R11, and the best path at R11 is the LP 400 path learned laterally from R41. The best path at R41 to prefix A2 is the LP 400 path learned laterally from R11, and at R11 is the LP 400 path learned through the stack from R12.

Source/Destination-Following Control-Plane Overhead

It may be intuitive that following an anycasted prefix (e.g. a default route), where that prefix may originate equally at all stacks, is consistent with the method of source-following. Indeed we are required to follow the stack of the source in this case — an anycast route must never be originated on the side that uses destination-following, as this would result in bidirectional source-following. In a similar fashion, source-following is conducive to the use of expansive aggregate routes or supernets, without concern for the originating stack of a contributor. This allows for a notable reduction in control-plane overhead. For destination-following, aggregation is only possible to the extent that the contributing (suppressed) prefixes match the designated origin stack of the aggregate route.

Tiered Source/Destination-Following

When expanding the source/destination-following concept to multiple tiers, additional logic is required to continue to enforce symmetry during a failure of individual tiers within a stack. Consider the below scenario where there is a failure at the lower tier at stack-1. Traffic flows asymmetrically through the upper tier due to a shift in perspective when traffic from network A2 fails over to stack-2. The network A2 prefix continues to be shared laterally across the Rx2 tables above the failure, and then advertised through the upper tier at all stacks. The destination-following logic would dictate that the best path for prefix A2 on all Rx1 tables is the LP 400 path learned laterally from R11 on stack-1. However, during the failure event, traffic from network A2 would move laterally around the failure up through stack-2 to R22. At this point, the traffic must continue to follow stack-2’s perspective on how to reach network D1, which is to follow the path from R21 up the stack.

[Figure 7]

This problem introduces a pair of techniques that we call source-shifting and destination-shifting, where the src-x and dst-x communities are rewritten as a path traverses a tier. In destination-shifting, every stack in the prefix’s failover order rewrites the path’s existing dst-x community with the local dst-x community as the path is advertised across a tier. In source-shifting every stack rewrites the existing src-x community with the local src-x community as the path is advertised across a tier.

The destination-shifting technique relies on a concept that we call lateral-blocking, where paths learned laterally may block paths learned with different attributes across the stack. This, of course, is just standard BGP best-path selection behavior, but we highlight it to stress that we utilize path-selection for more than the basic purpose of influencing the forwarding table. We use it to signal information (e.g. network failures) carried in path-attributes, even in cases where the selection of one path over another has zero effect on the forwarding table. This is shown more clearly in the following sections.

In the non-failure scenario, the path for network A2 is rewritten with the local dst-x community across each stack in the failover order. The path learned laterally from R12 remains the best path on all Rx2, and is the one subsequently advertised to the next tier. In the failure scenario, the path advertised onto R22 with the rewritten dst-2 community becomes the new best path. As that path is advertised through the upper tier, all Rx1 tables now see the path from R21 as the best path, and we achieve symmetry. The effect of this shifting behavior is that lower tier failures propagate up, causing traffic to shift to another stack across both tiers. However, upper tier failures are masked, and traffic does not shift until reaching the tier that has failed.

[Figure 8]

Zone Directionality

These concepts apply well when there is a clear choice for determining which side utilizes source-following and which side utilizes destination-following. The natural direction in our network would simply be up/down the stack, where down points towards lower tiers closer to the edge of user networks, while up points to the upper tiers towards the egress of our network perimeter. For many traffic flows, that distinction cannot easily be made, for example, between two zones that are neither up nor down from each other. This largely applies to traffic between different DMZs within the same routing-domain, and forms what we refer to as zone-to-zone routing. Moreover, as there may be many zones in the same routing-domain, they all must be consistent in determining a routing option.

The directionality problem can be solved by determining a predictable ordering of the stacks (e.g. stack-1 is of a lower order than stack-2, and stack-2 is of a lower order than stack-3, etc) where routing between two zones utilizes source-following from a zone at a lower ordered stack, and utilizes destination-following from a zone at the higher ordered stack. For the below example: traffic from network A2 to B1 should traverse stack-1, from A2 to D1 should traverse stack-1, from B1 to C2 should traverse stack-2, and from C2 to D1 should traverse stack-3.

[Figure 9]
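The directionality rule above reduces to a one-line comparison; the zone-to-stack homing below is assumed from the example:

```python
# Routing between two zones traverses the lower-ordered of the two
# zones' home stacks (home_stack mapping assumed from the example).
home_stack = {"A2": 1, "B1": 2, "C2": 3, "D1": 4}

def zone_to_zone_stack(zone_a, zone_b):
    """The zone at the lower-ordered stack source-follows while the zone
    at the higher-ordered stack destination-follows, so both directions
    of the flow meet at the same stack."""
    return min(home_stack[zone_a], home_stack[zone_b])

assert zone_to_zone_stack("A2", "B1") == 1
assert zone_to_zone_stack("A2", "D1") == 1
assert zone_to_zone_stack("B1", "C2") == 2
assert zone_to_zone_stack("C2", "D1") == 3
```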

This logic, however, is slightly more nuanced in that, similar to the tiering problem, the perspective of each stack must be accounted for. In the following examples, note that stacks 2 and 3 have been flipped, such that 1/3 and 2/4 now form each redundancy group, but where the stack ordering of 1 < 2 < 3 < 4 remains the same.

Traffic between B1 and C2 would normally traverse stack-2, due to stack-2 being of a lower order than stack-3. During a failure at stack-2, R22 could not choose to traverse its secondary stack-4, since R42 will have a different (and unchanged by this failure) perspective that the best path to reach B1 is via the lower ordered stack-3 on R32.

[Figure 10]

This likewise applies in a dual-failure where both stack-2 and stack-3 have failed. Both R22 and R31 are forced to consider the perspective of their redundant stacks with respect to the best path, despite the fact that the traffic now follows the failover-order of the higher ordered stack as opposed to the failover-order of the lower ordered stack.

[Figure 11]

To achieve this, we preference paths as the combination of both the stack ordering and the advertising stack’s position in the path’s failover order (i.e. primary vs secondary). The rule being: when routing between two networks, choose the highest failover position (e.g. pri is better than sec) and then the lowest ordered stack between them.

To implement, we use a third helper failover community for each level of redundancy: failover-pri, failover-sec, etc. This community is applied when a path is advertised through any stack within the path’s failover order. For example, when network A2 is advertised through stack-1, it is tagged with failover-pri. When that prefix is advertised through stack-3 (keeping with the failover pairs from the previous example), it is tagged with failover-sec. When advertised through stack-2 or stack-4, no failover community is applied. This community is not fundamental, in that it can be derived from an existing pair of dst-x and src-x values; however, the abstraction substantially reduces the verbosity of the policies that use it.
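Since the failover community is derivable from existing values, the derivation can be sketched directly; the 1/3 and 2/4 failover pairs below follow the flipped example, and the lookup table itself is hypothetical:

```python
# Failover order per home stack (1/3 and 2/4 pair up, per the flipped
# example; a real deployment would extend each list for N+2).
FAILOVER_ORDER = {1: [1, 3], 2: [2, 4], 3: [3, 1], 4: [4, 2]}
POSITION_NAME = {0: "failover-pri", 1: "failover-sec"}

def failover_community(dst_stack, advertising_stack):
    """Derive the helper community from the path's dst-x value and the
    stack advertising it; None when outside the failover order."""
    order = FAILOVER_ORDER[dst_stack]
    if advertising_stack in order:
        return POSITION_NAME[order.index(advertising_stack)]
    return None

# Network A2 is homed to stack-1:
assert failover_community(1, 1) == "failover-pri"  # advertised through stack-1
assert failover_community(1, 3) == "failover-sec"  # advertised through stack-3
assert failover_community(1, 2) is None            # stack-2: no community applied
assert failover_community(1, 4) is None            # stack-4: no community applied
```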

The failover community in conjunction with the src-x community signals the information needed to preference any path: which stack the path is being advertised from, and where in the failover order (pri/sec/etc) the stack belongs for that path. Illustrating the logic directly on the diagrams would be impractical, but a complete implementation is provided in the following sections. This preferencing logic is applied on both the eBGP peers through the stack and the lateral VPNv4 peers, where the paths are ordered from most preferred to least preferred as follows:

  1. From a stack lower or equal in order to the primary, and which is the failover-primary for the path.
  2. From the primary stack.
  3. From a stack higher in order than the primary, and which is the failover-primary for the path.
  4. From a stack lower or equal in order to the secondary, and which is the failover-secondary for the path.
  5. From the secondary stack.
  6. From a stack higher in order than the secondary, and which is the failover-secondary for the path.

… Each set of three terms repeats for the degree of stack redundancy required.

Zone-to-Zone Control-Plane Overhead

The zone-to-zone model, while flexible, comes at the worst-case expense in control-plane utilization. Generally, every prefix originating in a zone must be installed in the forwarding table for every other zone. Unlike source/destination-following, the option to at least reduce the overhead from the source-following side, with a default route or broad aggregation, does not apply.

Tiered Zone-to-Zone

The zone-to-zone model nearly breaks down when it is attempted across multiple tiers. The same rules from the tiering section apply with the need for destination-shifting and source-shifting. The latter is by nature easy, as the only requirement is to implement a local rewrite of src-x. Destination-shifting is far more difficult, because it relies on lateral-blocking to force all stacks to agree on the same best path. With zone-to-zone, the same lateral-blocking is not achieved — all routers on the same horizontal plane do not converge to the same best path. Without an agreement on the same best path, a dst-x rewrite would signal non-existent failures in accordance with the tiering logic.

If there is a solution to this problem purely in BGP policy, I have not discovered it; however, a simple solution takes advantage of another BGP feature: add-path. The BGP add-path capability allows advertising and/or receiving multiple paths for the same prefix to/from a BGP neighbor. By implementing add-path send-only from R11 and R12 and add-path receive-only at the FW between them, we can finalize the necessary lateral-blocking at the mid-point, without affecting the forwarding decision made by the zone-to-zone logic. After lateral-blocking converges to a singular best path (i.e. from the stack with the highest failover position) at the FW, destination-shifting is implemented, and the paths are exported to R11/R12 with the desired attributes.
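The FW-side step can be sketched as follows; paths are modeled as community sets, and best-path selection is reduced to comparing failover positions (a simplification of full BGP selection):

```python
# Among the add-paths received from R11/R12 for one prefix, converge on a
# single best path by failover position, then destination-shift it by
# rewriting dst-x to the local stack. Assumes every path carries a
# failover community (guaranteed by the ER-to-FR export policy).
RANK = {"failover-pri": 0, "failover-sec": 1, "failover-ter": 2}

def lateral_block_and_shift(paths, local_stack):
    """paths: community sets for the same prefix, received via add-path.
    Returns the surviving path's communities after the dst-x rewrite."""
    best = min(paths, key=lambda cs: min(RANK[c] for c in cs if c in RANK))
    shifted = {c for c in best if not c.startswith("dst-")}
    shifted.add(f"dst-{local_stack}")
    return shifted

# Two add-paths reach the FW at stack-2; the failover-pri path survives
# and is destination-shifted before export back to the edge routers:
paths = [{"dst-1", "failover-sec"}, {"dst-1", "failover-pri"}]
assert lateral_block_and_shift(paths, 2) == {"dst-2", "failover-pri"}
```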

A hypothetical BGP feature that decoupled the selection of paths for advertisement from the selection of paths for RIB installation would allow for the cleanest solution to this.

Our Hybrid Solution

Our network had been through several iterations of solving this traffic problem. A variety of new requirements and unrealized failure scenarios caused us to rethink and redesign our solution. We ultimately settled on a hybrid model based on the principles described in the previous section.

While implementing a complete tiered zone-to-zone model across the entire network would provide the ultimate in flexibility and uniformity — the ideal one-size-fits-all approach — in practice and at our scale this would immediately crush and exhaust the control-plane resources on our network. The exponential growth of routes propagated in a complete zone-to-zone model would far exceed the resource capacity of our edge routers, which are hefty carrier-grade routers as it is. Coupled with the fact that we do have some natural directionality to the network, we use the following hybrid approach, where we partition each routing-domain on the network. This approach uses both the source/destination-following and tiered zone-to-zone models. The tiered portion of zone-to-zone is implemented in policy, but only used by exception.

[Figure 12]

For every routing-domain, we implement source/destination-following between the untrust zone and all other zones. The untrust zone is that which peers to any upper-tiers, and towards the network perimeter. Between the non-untrust zones, we implement zone-to-zone logic. When source-following to the untrust zone, we limit the advertised prefix to only the default route, and subsequently constrain the exponential growth of routes at each routing-domain boundary.

Recall from the first section that the peering design within a stack is represented as follows, where each zone in a particular routing-domain on the edge routers (coded as ER) peers to its corresponding table on the firewall-router (coded as an FR).

[Figure 13]

The following pseudo-code policies outline the logic used to implement this solution.

ER Redistribute — Unicast Prefix

Implemented during simple redistribution or origination of a network into BGP. We add only the dst-x community and failover community, and set our default maximum local preference.

add community dst-<stack-id>
add community failover-pri
set local-pref 400
pass

ER Redistribute — Anycast Prefix

Implemented during redistribution or origination of an anycasted network into BGP (e.g. a default route). We add only a special community anycast to signal it is an anycast network.

add community anycast
set local-pref 400
pass

ER-Untrust to FR Export

Implemented on the edge router when exporting paths from an untrust zone to the firewall-router. We pass only paths tagged with the anycast community, which should normally be only the default route.

if community matches anycast then
  pass

ER to FR Export

Implemented on the edge router when exporting paths from a non-untrust zone to the firewall-router. We pass only paths tagged with a failover community. In tiered zone-to-zone this limits the add-paths to only those necessary for lateral-blocking.

if community matches failover-all then
  pass

FR to ER Import

Implemented on the firewall-router when importing paths from any zone on the edge router. The policy completes the lateral-blocking needed in the tiered zone-to-zone model by preferencing the add-paths, which only applies to the paths from non-untrust zones.

if community matches failover-pri then
  set local-pref 400
elseif community matches failover-sec then
  set local-pref 300
elseif community matches failover-ter then
  set local-pref 200

pass

FR to ER Export

Implemented on the firewall-router when exporting paths to any zone on the edge router. The first section of the policy rewrites the src-x community needed for source-shifting. The next section simultaneously rewrites the dst-x community needed for destination-shifting, and rewrites the failover community needed for zone-to-zone logic.

delete community in src-all
add community src-<pri stack-id>

delete community in failover-all
if community matches <stack-id>-pri-dst then
  delete community in dst-all
  add community dst-<pri stack-id>
  add community failover-pri
elseif community matches <stack-id>-sec-dst then
  delete community in dst-all
  add community dst-<pri stack-id>
  add community failover-sec
elseif community matches <stack-id>-ter-dst then
  delete community in dst-all
  add community dst-<pri stack-id>
  add community failover-ter

pass

ER-Untrust to FR Import

Implemented on the edge router when importing paths into an untrust zone from the firewall-router. The policy first deletes the src-x community set by the firewall-router in order to force the zone to bypass any zone-to-zone preferencing set by the VPNv4 policy and instead fall through to destination-shifting. The preferences are set consistent with destination-shifting by matching on failover communities.

delete community in src-all

if community matches failover-pri then
  set local-pref 180
elseif community matches failover-sec then
  set local-pref 150
elseif community matches failover-ter then
  set local-pref 120

ER to FR Import

Implemented on the edge router when importing paths to a non-untrust zone from the firewall-router. The logic needs only consider the top two preferences from the zone-to-zone logic, as they encompass any possible path learned in this direction: it is either a failover-pri path learned from its primary stack, or it is any other path learned from its primary stack.

if community matches failover-pri then
  set local-pref 200
else
  set local-pref 190

VPNv4 Import

Implemented on the edge router when importing paths from all other stacks over VPNv4. The policy preferences paths consistent with the ordering specified by the zone-to-zone logic. To implement the “from any stack equal or lower than” qualifier in those rules, we utilize regex-based communities to classify those stacks. For example, a src-3-minus community would use the pattern ^[1-3] on the left-hand side of the community value.

if community matches src-<pri stack-id>-minus and failover-pri then
  set local-pref 200
elseif community matches src-<pri stack-id> then
  set local-pref 190
elseif community matches failover-pri then
  set local-pref 180
elseif community matches src-<sec stack-id>-minus and failover-sec then
  set local-pref 170
elseif community matches src-<sec stack-id> then
  set local-pref 160
elseif community matches failover-sec then
  set local-pref 150
elseif community matches src-<ter stack-id>-minus and failover-ter then
  set local-pref 140
elseif community matches src-<ter stack-id> then
  set local-pref 130
elseif community matches failover-ter then
  set local-pref 120
else
  drop

pass
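A rough Python model of this import policy, with the src-x-minus match implemented as a regex (communities are modeled as strings like src-3 rather than real community values, and the failover order used below is hypothetical):

```python
import re

# Sketch of the VPNv4 import preferencing. A src-<n>-minus match is
# implemented as the regex ^src-[1-n]$, mirroring the ^[1-n] pattern
# described above.
def src_minus(communities, stack_id):
    pattern = re.compile(r"^src-[1-%d]$" % stack_id)
    return any(pattern.match(c) for c in communities)

def vpnv4_import_lp(communities, pri, sec, ter):
    """Return the local-preference per the zone-to-zone ordering, or
    None to model the final drop."""
    rules = [(pri, "failover-pri", 200, 190, 180),
             (sec, "failover-sec", 170, 160, 150),
             (ter, "failover-ter", 140, 130, 120)]
    for stack, failover, lp_minus, lp_exact, lp_any in rules:
        if failover in communities and src_minus(communities, stack):
            return lp_minus
        if f"src-{stack}" in communities:
            return lp_exact
        if failover in communities:
            return lp_any
    return None

# At a table whose failover order is stack-2 (pri), stack-4 (sec), stack-1 (ter):
assert vpnv4_import_lp({"src-1", "failover-pri"}, 2, 4, 1) == 200
assert vpnv4_import_lp({"src-2"}, 2, 4, 1) == 190
assert vpnv4_import_lp({"src-3", "failover-pri"}, 2, 4, 1) == 180
assert vpnv4_import_lp({"src-3", "failover-sec"}, 2, 4, 1) == 170
```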

Hybrid Control-Plane Overhead

For a quick example of the control-plane savings on the hybrid approach: assume a network with 5 route-domains, each with 10 other route-domains tiered underneath, and where every route-domain contains 5 unique zones, and every zone originates 50 prefixes. In total: 55 route-domains, 275 zones, and 13,750 prefixes.

Some quick math can demonstrate that a complete tiered zone-to-zone model, absent any aggregation, would come at an expense of roughly 3.8M routes inserted into the RIB.

By partitioning the zone-to-zone logic to exist only within a route-domain, and enforcing that every route-domain uses a default route from the untrust to exit, this expense is reduced to roughly 130K routes inserted into the RIB — roughly a 97% reduction.
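The complete-model figure can be checked with quick arithmetic (the ~130K hybrid figure depends on aggregation and tiering details not fully specified here, so it is not reproduced):

```python
# Reproducing the control-plane arithmetic for the complete model, where
# every zone's prefixes land in every other zone's table.
upper_rds = 5
lower_rds_per_upper = 10
zones_per_rd = 5
prefixes_per_zone = 50

rds = upper_rds * (1 + lower_rds_per_upper)   # 55 route-domains
zones = rds * zones_per_rd                    # 275 zones
prefixes = zones * prefixes_per_zone          # 13,750 prefixes
assert (rds, zones, prefixes) == (55, 275, 13750)

# Complete tiered zone-to-zone: each prefix in every other zone's RIB.
complete_rib = prefixes * (zones - 1)
assert round(complete_rib / 1e6, 1) == 3.8    # ~3.8M routes
```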

Conclusion

Implementing a complete tiered zone-to-zone model follows the same policy logic as above, only without applying the unique policies on the untrust zone. There are a number of other policies involved to handle other unique requirements, for example interfacing with external autonomous systems, redundant redistribution points, and route aggregation, all of which follow the same general routing principles outlined.

EVPN+IRB Over MPLS With JUNOS and IOS-XR
Thu, 27 Oct 2022
By Zachary Cayou, Network Engineer, SMS

Introduction
I was given a project to implement EVPN+IRB over MPLS in our network, with the bonus to make it interoperable between JUNOS and IOS-XR routers. At the time, the depths of Google revealed precisely zero examples, guides, or blog posts of anyone attempting to do this. In addition, vendor documentation on the subject tends to assume a particular network design, which we do not follow. As a result, it became an interesting experiment of trial/error, dissecting documentation and RFCs, and a fair amount of head scratching.

The purpose of this post is to outline the compatibility, limitations, and tweaks necessary to implement EVPN+IRB in a multivendor environment with JUNOS and IOS-XR.

Primer
Ethernet VPN (EVPN) is a next-generation VPN protocol for building both L2 and L3 VPNs. EVPN attempts to address many of the challenges faced by traditional L2VPN protocols such as VPLS, while also providing L3VPN capabilities.

Adding integrated routing and bridging (IRB) into EVPN enables both L2 forwarding of intra-subnet traffic and L3 forwarding of inter-subnet traffic within the L3VPN. This facilitates stretching a L2 domain across the core when L2 reachability is needed, while providing optimal forwarding of L3 traffic, and enabling VM mobility support with distributed anycast gateways.

The topic of EVPN, associated protocols, and their applications are far too broad to be covered in depth here, and thus the details and challenges discussed hereafter assume a fair understanding of EVPN already. Details on EVPN implementations may be found in the respective vendor’s documentation.

The specific platforms and OS versions referenced:

  • Juniper MX80s running JUNOS 20.4R1.12
  • Cisco ASR9010s running IOS-XR 6.7.3

Interoperability
Is EVPN+IRB over MPLS interoperable between JUNOS and IOS-XR? At the time of writing… no, strictly speaking they are not, due to reasons I’ll outline below. The two platforms implement incompatible EVPN+IRB behavior. That said, with a certain degree of workarounds, loose interoperability can be achieved.

While EVPN is a mature technology, EVPN+IRB is less so. Vendors began adding IRB features of EVPN in advance of governing RFCs. RFC 9135 (Integrated Routing and Bridging in Ethernet VPN (EVPN)) was in draft until October 2021. This RFC largely outlines the two IRB models that IOS-XR and JUNOS follow, symmetric and asymmetric, respectively.

Asymmetric IRB


In the asymmetric IRB model, the lookup operation for inter-subnet routing is asymmetric on the ingress and egress PE. When H1 sends a packet destined for H4, PE1 interface X receives the frame and conducts an IP lookup for the destination in its VRF table, where the longest match resolves to the network on interface Y. PE1 then does a lookup for H4’s MAC, which resolves as an EVPN-learned adjacency. The packet is then encapsulated with a source MAC of interface Y, a destination MAC of H4, and then forwarded as a L2 payload across the core to PE2. At PE2, a single MAC lookup is performed to switch the traffic to H4. While this model achieves some simplicity in the control plane, and provides for centralized routing, it introduces limitations in scalability and flexibility. Since the ingress PE must be able to do the MAC lookup for the destination, it follows that it must also contain every IRB interface and install adjacency rewrites in the forwarding plane for every host in the routing domain, regardless of whether the PE has any local hosts in that network.


Symmetric IRB


In the symmetric model, the lookup operation for inter-subnet routing is symmetric on the ingress and egress PE. When host H1 sends a packet destined for H4, PE1 interface X receives the frame and conducts an IP lookup for the destination in its VRF table, where the longest match resolves to a host route learned from PE2. The traffic is forwarded as a L3 payload across the core to PE2. PE2 then does a lookup for H4’s MAC, which resolves as a local adjacency on interface Y, and the packet can be forwarded to H4.

In the symmetric model, it should be clear that PE1 need not store a L2 adjacency for H4 in the forwarding plane, nor does interface Y need to exist on PE1 at all. This is achieved by including an additional label (aka Label2) and an additional VRF route-target into EVPN Type-2 MAC/IP advertisements. The additional route-target and label are used the same as they would be in a VPNv4 advertisement: to import the host-route into the correct VRF and to provide a forwarding label.
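The mode distinction boils down to whether the egress PE attaches Label2 and the IP-VRF route-target to its Type-2 routes, which can be sketched as follows (field names are illustrative, not any real BGP library’s API):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model of an EVPN Type-2 MAC/IP advertisement.
@dataclass
class Type2Route:
    mac: str
    ip: str
    label1: int                   # MAC-VRF label (always present)
    label2: Optional[int] = None  # IP-VRF label (symmetric mode only)
    vrf_rt: Optional[str] = None  # IP-VRF route-target (symmetric mode only)

def irb_mode(route: Type2Route) -> str:
    """The egress PE signals symmetric IRB by attaching Label2 and the
    IP-VRF route-target to its Type-2 advertisement."""
    return "symmetric" if route.label2 is not None and route.vrf_rt else "asymmetric"

# IOS-XR-style (symmetric) vs JUNOS-style (asymmetric) advertisements:
xr = Type2Route("00:00:00:00:00:04", "10.0.0.4", 24001, label2=24002, vrf_rt="65000:1")
junos = Type2Route("00:00:00:00:00:01", "10.0.0.1", 30001)
assert irb_mode(xr) == "symmetric"
assert irb_mode(junos) == "asymmetric"
```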

Compatibility
RFC 9135 says that asymmetric and symmetric IRB modes may coexist on the same network, and that the egress PE indirectly dictates the mode by the presence or absence of Label2 and the VRF Route-Target in its EVPN Type-2 advertisements. In other words, this coexistence only means that PEs can prefer to operate in different modes, but they must be capable of both modes. As it turns out, IOS-XR operates exclusively in symmetric mode, and JUNOS operates exclusively in asymmetric mode.

[Figure: ENVP3.png]

Let’s look at where this breaks down on the data plane. Where PE1 is JUNOS and PE2 is IOS-XR, PE1 sends an EVPN Type-2 MAC/IP advertisement without Label2/VRF RT for H1, and PE2 sends an EVPN Type-2 MAC/IP advertisement with Label2/VRF RT for H4. PE1 does not recognize the Label2/VRF RT attributes, so they are ignored, and it installs an adjacency for the H4 MAC/IP in the forwarding table. PE1 performs an IP lookup for H4, finds interface Y as the longest match, does a MAC lookup for H4, and forwards the L2 payload to PE2, which successfully switches the packet to H4. In the return direction from H4, PE2 performs an IP lookup for H1 and finds interface X as the longest match, but fails to resolve a MAC for H1. Even though PE2 is aware of the MAC/IP binding from EVPN in the control plane, the binding is not installed as an adjacency in the data plane. PE2 operates purely in symmetric mode and expects to see Label2/VRF RT in the Type-2 advertisement for H1 if inter-subnet routing is desired.

The ultimate problem is that bidirectional traffic fails: there is no way to properly route traffic from a PE that only supports symmetric mode to a PE that only supports asymmetric mode. We can therefore conclude that EVPN+IRB in isolation is not presently interoperable between JUNOS and IOS-XR. However, since the problem is purely one of routing, there are other ways to approach it and still make it work.

Solution
Our network consists of hundreds of sites, each typically with a pair of PEs connected to a L2 campus infrastructure. Each subnet’s gateway lives as an FHRP VIP shared between the PEs. Routing for each site is distributed over VPNv4. The introduction of EVPN+IRB in our network was intended to support stretching subnets across sites for services that require L2 connectivity, as well as supporting VM mobility for failover events. Our requirements included that a host stretched to any site must always be able to route on the local PE.

[Figure: ENVP4.png]

The interoperability problem outlined above breaks down specifically due to the lack of host-routes advertised from the JUNOS PE to IOS-XR PE, but in our network we are already running another protocol that we could use to solve this problem: VPNv4. The solution is to make the JUNOS PEs advertise the local EVPN host-routes inside VPNv4. While this does introduce additional overhead on the control-plane, as every EVPN Type-2 MAC/IP advertisement from the JUNOS PE will also have a corresponding VPNv4 advertisement, this expense is trivial for our use cases. By only injecting the EVPN host routes into VPNv4 on the JUNOS PE, we end up running EVPN in the asymmetric model on the JUNOS to IOS-XR path, and in the symmetric model on the IOS-XR to JUNOS path. At this point we’re still bound by the limitations of the asymmetric model on the JUNOS PEs. When we take the same solution one step further by also advertising the host-routes in VPNv4 from the IOS-XR PE, then we finally replicate the symmetric model bidirectionally.
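A rough model of the workaround, with invented prefixes and next-hops: from the perspective of a symmetric-only PE, an asymmetric Type-2 route contributes nothing to the VRF table, but a VPNv4 host route for the same address fills the gap.

```python
def build_vrf_table(evpn_routes, vpnv4_routes):
    """Symmetric-only PE: EVPN routes are usable for L3 only with Label2;
    VPNv4 host routes are always usable."""
    table = {}
    for r in evpn_routes:
        if r.get("label2"):                 # symmetric Type-2 routes only
            table[r["prefix"]] = ("evpn", r["nexthop"])
    for r in vpnv4_routes:                  # the interoperability workaround
        table.setdefault(r["prefix"], ("vpnv4", r["nexthop"]))
    return table

# H1's Type-2 route from the JUNOS PE carries no Label2 (asymmetric mode).
h1_evpn = {"prefix": "172.16.0.10/32", "nexthop": "PE1"}

without_fix = build_vrf_table([h1_evpn], [])
with_fix = build_vrf_table(
    [h1_evpn], [{"prefix": "172.16.0.10/32", "nexthop": "PE1"}])
print(without_fix, with_fix)
```

Without the VPNv4 advertisement the table has no route toward H1 and return traffic blackholes; with it, the symmetric-only PE can route toward PE1 as if the route were a normal VPNv4 host route.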

Configuration Steps
The following are the configuration steps utilized to achieve loose EVPN+IRB interoperability on our network. For brevity, this assumes the control-plane is already configured with VPNv4 and EVPN address families enabled, and with relevant VRFs already created.

1.  Create the attachment circuits and set the Ethernet circuit parameters. In our design, each PE in a pair connects to the L2 campus but not as a LAG, so we configure the Ethernet segment in single-active mode.

IOS-XR:

evpn interface Bundle-Ether1
evpn interface Bundle-Ether1 ethernet-segment
evpn interface Bundle-Ether1 ethernet-segment identifier type 0 00.00.00.00.00.00.00.00.01
evpn interface Bundle-Ether1 ethernet-segment load-balancing-mode single-active

interface Bundle-Ether1.1000 l2transport
interface Bundle-Ether1.1000 l2transport description v1000;VRF-A;172.16.0.0/24
interface Bundle-Ether1.1000 l2transport encapsulation dot1q 1000
interface Bundle-Ether1.1000 l2transport rewrite ingress tag pop 1 symmetric

JUNOS:

set interfaces ae1 flexible-vlan-tagging
set interfaces ae1 encapsulation flexible-ethernet-services
set interfaces ae1 esi 00:00:00:00:00:00:00:00:00:02
set interfaces ae1 esi single-active

set interfaces ae1 unit 1000 description “v1000;VRF-A;172.16.0.0/24”
set interfaces ae1 unit 1000 encapsulation vlan-bridge
set interfaces ae1 unit 1000 vlan-id 1000

2.  Create the IRB interfaces in the respective VRF. To provide gateway ARP consistency as a distributed anycast gateway, the MAC address must be statically assigned and replicated on each PE.   

IOS-XR PE: 

The “host-routing” knob enables the symmetric behavior in the control plane.

interface BVI1000 description VRF-A;172.16.0.0/24
interface BVI1000 host-routing
interface BVI1000 vrf VRF-A
interface BVI1000 ipv4 address 172.16.0.1 255.255.255.0
interface BVI1000 mac-address 0.0.1000

JUNOS: 

set interfaces irb unit 1000 description “VRF-A;172.16.0.0/24”
set interfaces irb unit 1000 family inet address 172.16.0.1/24
set interfaces irb unit 1000 mac 00:00:00:00:10:00

set routing-instances VRF-A interface irb.1000

3.  Create the EVPN instance.   

IOS-XR:

The binding of the attachment circuit(s) and IRB interface is done inside of an L2VPN configuration, which is then associated with the EVPN instance.

evpn evi 1000
evpn evi 1000 bgp
evpn evi 1000 bgp rd 1000:1000
evpn evi 1000 route-target import 1000:1000
evpn evi 1000 route-target export 1000:1000
evpn evi 1000 description VRF-A;172.16.0.0/24
 
l2vpn bridge group VRF-A
l2vpn bridge group VRF-A bridge-domain 1000 
l2vpn bridge group VRF-A bridge-domain 1000 interface Bundle-Ether1.1000
l2vpn bridge group VRF-A bridge-domain 1000 routed interface BVI1000
l2vpn bridge group VRF-A bridge-domain 1000 evi 1000

JUNOS: 

The EVPN instance is created as a routing-instance of type EVPN.  

By default, JUNOS will suppress both ingress/egress ARPs across the core, and instead proxy the response utilizing information known from the Type-2 MAC/IP advertisements. IOS-XR does not implement this feature, so ARP requests/replies must be allowed across the core. The hidden command ‘no-arp-suppression’ is necessary to disable this behavior on JUNOS.  

JUNOS also implements default-gateway MAC synchronization by default. In our use case with distributed anycast gateways, this feature is not necessary since all gateway MACs are statically set, and should be disabled with the “default-gateway do-not-advertise” knob.  

Finally, JUNOS by default does not insert a control word in front of the payload for egress traffic, while IOS-XR by default does. A control word is a 4-byte field inserted between the bottom MPLS label and the payload, and its first nibble is always zero. The purpose is to ensure a “dumb” transit device does not mistake an L2 payload whose destination MAC starts with a 4 or 6 for an L3 IPv4 or IPv6 payload. This must be consistent across all PEs, otherwise the payload offset on received traffic will be inconsistent.
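The disambiguation problem can be illustrated with a short sketch: a transit device guessing the payload type from the first nibble after the label stack would misread an Ethernet frame whose destination MAC begins with 0x4, while a control word starting with a zero nibble removes the ambiguity.

```python
# A control word begins with a zero nibble, so it can never look like IP.
CONTROL_WORD = b"\x00\x00\x00\x00"

def guess_payload(first_byte):
    """How a naive transit device classifies the bytes after the label stack."""
    nibble = first_byte >> 4
    return {4: "IPv4", 6: "IPv6"}.get(nibble, "not IP")

# An Ethernet frame whose destination MAC starts with 0x42... (invented frame)
ethernet_frame = bytes.fromhex("42" + "00" * 13)

print(guess_payload(ethernet_frame[0]))                 # misclassified as IPv4
print(guess_payload((CONTROL_WORD + ethernet_frame)[0]))  # correctly "not IP"
```

This also shows why the setting must match on both ends: a receiver expecting the 4-byte control word would strip the first 4 bytes of a frame that has none, corrupting the payload offset.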

set routing-instances VRF-A-evpn-1000 protocols evpn interface ae1.1000
set routing-instances VRF-A-evpn-1000 protocols evpn no-arp-suppression
set routing-instances VRF-A-evpn-1000 protocols evpn default-gateway do-not-advertise
set routing-instances VRF-A-evpn-1000 protocols evpn control-word
set routing-instances VRF-A-evpn-1000 instance-type evpn
set routing-instances VRF-A-evpn-1000 vlan-id none
set routing-instances VRF-A-evpn-1000 routing-interface irb.1000
set routing-instances VRF-A-evpn-1000 interface ae1.1000
set routing-instances VRF-A-evpn-1000 route-distinguisher 1000:1000
set routing-instances VRF-A-evpn-1000 vrf-target target:1000:1000

4.  Advertise host routes in VPNv4, i.e., the interoperability workaround.

IOS-XR: 

This command imports the local EVPN IRB adjacencies as host-routes into the VRF table, allowing for advertisement in VPNv4 or other protocols. 

vrf VRF-A address-family ipv4 unicast import from bridge-domain advertise-as-vpn

JUNOS: 

In JUNOS the local EVPN IRB adjacencies already exist in the VRF table, and advertising them requires nothing other than allowing routes from protocol evpn in the export policy.

JUNOS will also advertise both Type-2 MAC/IP advertisements and Type-2 MAC-only advertisements. Here we only require the MAC/IP advertisements, so to reduce control-plane overhead, the MAC-only advertisements should be filtered.

set policy-options policy-statement VRF-A-export term EVPN from protocol evpn
set policy-options policy-statement VRF-A-export term EVPN then accept

set policy-options policy-statement rr-bgp-export term EVPN from family evpn
set policy-options policy-statement rr-bgp-export term EVPN from evpn-mac-route mac-only
set policy-options policy-statement rr-bgp-export term EVPN then reject

Future
The solution in place in our network is not optimal with regard to control-plane utilization, as we are required to double the necessary BGP advertisements for any given host in an EVPN. However, it remains more than viable at the scale we plan to deploy EVPN for the foreseeable future. JUNOS has recently implemented support for the symmetric IRB model for EVPN+IRB over VXLAN, so presumably support over MPLS is on the horizon. At that point, transitioning to the native symmetric model in EVPN would be desirable for both overhead and protocol simplicity.
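As a back-of-the-envelope illustration (host and EVI counts below are assumed for the example, not our actual figures), the workaround simply doubles the per-host route count:

```python
# Assumed scale figures for illustration only.
hosts_per_evi, evis = 250, 40

native = hosts_per_evi * evis   # one EVPN Type-2 MAC/IP route per host
workaround = native * 2         # plus one VPNv4 /32 per host

print(native, workaround)  # 10000 native routes become 20000 with the workaround
```

The multiplier is constant at 2x regardless of scale, which is why the overhead stays manageable until the host count itself becomes the limiting factor.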
