We build our network in order to simultaneously achieve high availability and maximum utilisation of available bandwidth. To that end, we are using Multi-Chassis Link Aggregation (MLAG) between our data centre switches running Cumulus Linux and our firewall cluster «on a stick»:

MLAG topology
MLAG topology

fw1 is tricked into believing it is dual connected to a single device that speaks the IEEE 802.3ad Link Aggregation Control Protocol.

This works great for layer 2 and provides active/active load balancing of fw1’s outbound traffic.

Bringing layer 3 routing into the mix complicates matters. While an MLAG switch pair behaves as a single device from a layer 2 perspective, the switches remain independent entities on layer 3. For this reason, fw1 is required to maintain two BGP sessions – one session to each switch. We facilitate this by creating a link VLAN where the firewall and both switches are connected, as shown below. Each switch enables a layer 3 switch virtual interface (SVI) in the link VLAN.

MLAG BGP topology

MLAG + ECMP considered harmful

If fw1 had been connected with two pure layer 3 up-links to the switches (i.e., no MLAG), then we would normally have used ECMP to load balance its outgoing traffic. While that approach does ostensibly work with MLAG too, it causes problems further down the line. The two load balancing methods in play here – 802.3ad and ECMP – are totally unaware of each other. A packet that ECMP decided to load balance towards sw1’s SVI might actually end up being transmitted to sw2 first, because that is the LAG member 802.3ad load balanced it to (or vice versa).

We would therefore end up in a situation where about 50% of the outgoing traffic from fw1 is sent to the wrong switch and has to cross the peer-link between the two switches, as shown below. Although it will work, is of course not very efficient.

MLAG and ECMP

Unfortunately, that is not the only issue. Cumulus Linux’s MLAG implementation disables MAC learning on the peer-link. All the misdirected traffic will be flooded as unknown unicast. If either of the switches has another unrelated VLAN trunk with all VLANs active (i.e., no pruning), fw1’s misdirected traffic will be flooded to that link as well, possibly saturating it and/or causing high CPU usage on one or more unrelated devices:

MLAG causing unicast flood

Using ECMP for layer 3 load balancing on an MLAG link is simply out of the question. What now?

VRR and BGP to the rescue

Fortunately, Cumulus Linux implements another layer 3 load balancing method that saves the day. This is called Virtual Router Redundancy, or VRR for short.

It is a simple and elegant solution. It works by having both switches create a secondary VRR SVI on the link VLAN. This secondary SVI has the same IP and MAC addresses on both switches. Thus, when fw1 transmits a packet destined for the VRR SVI, it will never have to be routed across the MLAG peer-link because the destination is local to both switches.

The VRR SVI can not be used to terminate a BGP session, because a TCP session requires both endpoints to be unique. Therefore, fw1 will still need to speak BGP directly with the primary SVIs on sw1 and sw2.

The remaining piece of the puzzle is to use the built-in capability of the BGP protocol to signal a non-default next-hop for each of the advertised routes: the VRR SVI. This is easily done in Cumulus Linux / FRRouting using a route-map, and results in fw1’s outbound traffic taking the ideal path through the network:

MLAG VRR topology

The resulting routing can be seen by queering the advertisements received by the BIRD BGP daemon running on fw1:

root@fw1:~# birdc show route
BIRD 1.5.0 ready.
0.0.0.0/0          via 192.0.2.3 on bond0 [sw1 11:46:56 from 192.0.2.1] * (100) [AS39039i]
                   via 192.0.2.3 on bond0 [sw2 11:46:57 from 192.0.2.2] (100) [AS39039i]
root@fw1:~# birdc6 show route
BIRD 1.5.0 ready.
::/0               via 2001:db8::3 on bond0 [sw1 11:46:56 from 2001:db8::1] * (100) [AS39039i]
                   via 2001:db8::3 on bond0 [sw2 11:46:57 from 2001:db8::2] (100) [AS39039i]

Success! It has learned a redundant pair of IPv4/IPv6 default routes from the switches, one from each switch. Since the next-hop is the VRR SVI that is shared between both switches, the traffic destined for the backbone (from which the default routes originate in the first place) will never have to cross the peer-link – exactly what we set out to achieve in the first place.

(Packets that are part of the two BGP sessions themselves might still be forwarded across the peer-link and cause unknown unicast flooding, as discussed earlier. Fortunately, a BGP session only causes a minute amount of traffic, so this is not really a problem worth worrying about.)

Configuration files

I’m including the basic configuration files necessary to set up the topology in the figures, in case you want to try this out for yourself. It can be done in a virtual lab environment using Cumulus VX too, so there is no need to buy any lab hardware.)

sw1

sw1: /etc/network/interfaces
# Unrelated inferfaces removed
auto swp1
iface swp1
    address 192.0.2.253/31
    address 2001:db8:a::1/127
    alias uplink to backbone

auto swp42
iface swp42
    alias downlink to fw1 port eth0

auto swp47
iface swp47
    alias peerlink LAG member 1

auto swp48
iface swp48
    alias peerlink LAG member 2

auto bridge
iface bridge
    bridge-ports mlag-fw1 peerlink
    bridge-vids 42
    bridge-vlan-aware yes

auto mlag-fw1
iface mlag-fw1
    bond-slaves swp42
    bridge-pvid 42
    clag-id 42
    mstpctl-bpduguard yes
    mstpctl-portadminedge yes

auto peerlink
iface peerlink
    bond-slaves swp47 swp48

auto peerlink.4094
iface peerlink.4094
    address 169.254.1.1/30
    alias MLAG state synchronisation interface
    clagd-peer-ip 169.254.1.2
    clagd-priority 16384
    clagd-sys-mac 44:38:39:ff:10:ff

auto vlan42
iface vlan42
    address 192.0.2.1/29
    address 2001:db8::1/64
    address-virtual 00:00:5E:00:01:2A 192.0.2.3/29 2001:db8::3/64
    alias Layer 3 link network between sw1/sw2/fw1
    vlan-id 42
    vlan-raw-device bridge
sw1: /etc/frr/frr.conf
frr version 3.2+cl3u3
frr defaults datacenter
hostname sw1
username cumulus nopassword
!
service integrated-vtysh-config
!
log syslog informational
!
router bgp 65001
 bgp router-id 192.0.2.1
 no bgp default ipv4-unicast
 coalesce-time 1000
 neighbor 192.0.2.6 remote-as 65006
 neighbor 192.0.2.6 description downlink to fw1
 neighbor 192.0.2.252 remote-as 39039
 neighbor 192.0.2.252 description uplink to backbone
 neighbor 2001:db8::6 remote-as 65006
 neighbor 2001:db8::6 description downlink to fw1
 neighbor 2001:db8:a:: remote-as 39039
 neighbor 2001:db8:a:: description uplink to backbone
 !
 address-family ipv4 unicast
  neighbor 192.0.2.6 activate
  neighbor 192.0.2.6 route-map next-hop-vrr out
  neighbor 192.0.2.252 activate
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor 2001:db8::6 activate
  neighbor 2001:db8::6 route-map next-hop-vrr out
  neighbor 2001:db8:a:: activate
 exit-address-family
!
route-map next-hop-vrr permit 10
 set ip next-hop 192.0.2.3
 set ipv6 next-hop global 2001:db8::3
!
line vty
!

sw2

sw2: /etc/network/interfaces
# Unrelated interfaces removed
auto swp1
iface swp1
    address 192.0.2.255/31
    address 2001:db8:b::1/127
    alias uplink to backbone

auto swp42
iface swp42
    alias downlink to fw1 port eth1

auto swp47
iface swp47
    alias peerlink LAG member 1

auto swp48
iface swp48
    alias peerlink LAG member 2

auto bridge
iface bridge
    bridge-ports mlag-fw1 peerlink
    bridge-vids 42
    bridge-vlan-aware yes

auto mlag-fw1
iface mlag-fw1
    bond-slaves swp42
    bridge-pvid 42
    clag-id 42
    mstpctl-bpduguard yes
    mstpctl-portadminedge yes

auto peerlink
iface peerlink
    bond-slaves swp47 swp48

auto peerlink.4094
iface peerlink.4094
    address 169.254.1.2/30
    alias MLAG state synchronisation interface
    clagd-peer-ip 169.254.1.1
    clagd-priority 32768
    clagd-sys-mac 44:38:39:ff:10:ff

auto vlan42
iface vlan42
    address 192.0.2.2/29
    address 2001:db8::2/64
    address-virtual 00:00:5E:00:01:2A 192.0.2.3/29 2001:db8::3/64
    alias Layer 3 link network between sw1/sw2/fw1
    vlan-id 42
    vlan-raw-device bridge
sw2: /etc/frr/frr.conf
frr version 3.2+cl3u3
frr defaults datacenter
hostname sw2
username cumulus nopassword
!
service integrated-vtysh-config
!
log syslog informational
!
router bgp 65002
 bgp router-id 192.0.2.2
 no bgp default ipv4-unicast
 coalesce-time 1000
 neighbor 192.0.2.6 remote-as 65006
 neighbor 192.0.2.6 description downlink to fw1
 neighbor 192.0.2.254 remote-as 39039
 neighbor 192.0.2.254 description uplink to backbone
 neighbor 2001:db8::6 remote-as 65006
 neighbor 2001:db8::6 description downlink to fw1
 neighbor 2001:db8:b:: remote-as 39039
 neighbor 2001:db8:b:: description uplink to backbone
 !
 address-family ipv4 unicast
  neighbor 192.0.2.6 activate
  neighbor 192.0.2.6 route-map next-hop-vrr out
  neighbor 192.0.2.254 activate
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor 2001:db8::6 activate
  neighbor 2001:db8::6 route-map next-hop-vrr out
  neighbor 2001:db8:b:: activate
 exit-address-family
!
route-map next-hop-vrr permit 10
 set ip next-hop 192.0.2.3
 set ipv6 next-hop global 2001:db8::3
!
line vty
!

fw1

fw1: /etc/network/interfaces
# unrelated interfaces removed
auto eth0
iface eth0 inet manual
     bond-master bond0

auto eth1
iface eth1 inet manual
     bond-master bond0

auto bond0
iface bond0 inet static
     address 192.0.2.6
     netmask 255.255.255.248
     bond-miimon 100
     bond-mode 802.3ad
     bond-xmit-hash-policy encap3+4
     bond-slaves none # Ubuntu sure is weird
iface bond0 inet6 static
     address 2001:db8::6
     netmask 64
fw1: /etc/bird/bird.conf
router id 192.0.2.6;

protocol kernel {
    scan time 60;
    import none;
    export all;
}

protocol device {
    scan time 60;
}

protocol bgp sw1 {
    local as 65006;
    neighbor 192.0.2.1 as 65001;
}

protocol bgp sw2 {
    local as 65006;
    neighbor 192.0.2.2 as 65002;
}
fw01: /etc/bird/bird6.conf
router id 192.0.2.6;

protocol kernel {
    scan time 60;
    import none;
    export all;
}

protocol device {
    scan time 60;
}

protocol bgp sw1 {
    local as 65006;
    neighbor 2001:db8::1 as 65001;
}

protocol bgp sw2 {
    local as 65006;
    neighbor 2001:db8::2 as 65002;
}

Update

  • 2024-01-29 Updated dead links.

Tore Anderson

Senior Systems Consultant at Redpill Linpro

Tore works with infrastructure at Redpill Linpro. Joining us more than a decade ago as a trainee, Tore is now responsible for our network architecture and operations.

Just-Make-toolbox

make is a utility for automating builds. You specify the source and the build file and make will determine which file(s) have to be re-built. Using this functionality in make as an all-round tool for command running as well, is considered common practice. Yes, you could write Shell scripts for this instead and they would be probably equally good. But using make has its own charm (and gets you karma points).

Even this ... [continue reading]

Containerized Development Environment

Published on February 28, 2024

Ansible-runner

Published on February 27, 2024