This post originally appeared in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.

When an ISP or Autonomous System (AS) such as Redpill Linpro acquires a block of globally unique IP addresses (called a prefix), it must advertise it to the global Internet routing table. This advertisement causes all other ASes in the world to find out that the new prefix is now alive, and also how and where to send any IP packets destined for it. Connectivity is established, and everybody is happy. Right?

Except there is a problem. The number of prefixes in the global Internet routing table is increasing at an alarming rate. At the time of writing, the global Internet routing table consists of approx. 635,000 IPv4 prefixes and 33,000 IPv6 prefixes. These are being advertised by approx. 55,000 different ASes from around the world.

[Figure, two panels: IPv4 prefixes and IPv6 prefixes. The global IPv4 and IPv6 routing table sizes over time (source: CIDR Report)]

The growth can of course be explained by the fact that the Internet is growing, but that’s only half the story. It is also an economic problem; it costs an AS next to nothing to advertise a new prefix (or smaller subset of an already existing prefix). This cost is instead borne by all the other ASes, which need to carry this prefix in the routing tables in their routers. Simply put: on the Internet, the polluter does not pay.

Most organisations, Redpill Linpro included, have dealt with this problem by throwing money at it. They go to vendors such as Cisco Systems or Juniper Networks and buy their routers. Those are purpose-built to forward network traffic at high speed even if the routing table should grow past a million routes. They make some truly terrific boxes, but make no mistake, they certainly do not come cheap. We would therefore like to see if there is a smarter and more cost-effective way of dealing with this problem going forward.

Using a switch as a router

In the years since we purchased our current Juniper MX Internet routers, the IP routing capabilities of commodity data centre switches have improved greatly, to the point where they can be used for serious IP routing tasks. Most of them run an advanced Network Operating System (NOS) which supports all the necessary protocols, in particular the Border Gateway Protocol (BGP) that’s used to exchange routing information across the Internet.

As described in a previous post, we’re testing an HPE Altoline 6920 in our lab. The Altoline 6920 is, like other switches based on the Broadcom Trident II chipset, able to handle up to 720 Gbps of throughput, packing 48x10GbE + 6x40GbE ports in a compact 1RU chassis. Its price is in all likelihood a single-digit percentage of the price of a traditional Internet router with a comparable throughput rating.

In order to attain these impressive performance and density properties while maintaining an attractive low price, certain compromises obviously have to be made. One of them is the routing table scalability; the Trident II chipset can simply not be programmed with more than 131,072 routes. That is not nearly enough to handle the full Internet routing table of today. This is true for most other data centre switch chipsets as well.
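To put the limit in perspective, here is a quick back-of-the-envelope check using only the figures quoted in this post. The variable names are made up for illustration.

```python
# Figures quoted in this post; adjust them to the current table sizes as needed.
FIB_CAPACITY = 131_072          # maximum number of routes in the Trident II (2**17)
GLOBAL_IPV4_PREFIXES = 635_000  # approximate size of the global IPv4 table

shortfall = GLOBAL_IPV4_PREFIXES - FIB_CAPACITY
coverage = FIB_CAPACITY / GLOBAL_IPV4_PREFIXES

print(f"IPv4 routes that would not fit: {shortfall}")
print(f"Share of the IPv4 table that fits: {coverage:.0%}")
```

In other words, only about a fifth of the IPv4 table alone would fit, before a single IPv6 route is counted.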

We could therefore at this point have concluded that the routing table limitation makes it impossible to use an inexpensive switch as an Internet router. But instead of giving up so easily and continuing to throw money at the problem, we decided to spend some time thinking a bit «outside the box».

Do we really need a full Internet routing table?

The first thing we realised was that from our point of view, the vast majority of Internet routes are redundant. For example: even though the routes advertised by an Italian ISP are distinct from the routes advertised by a Spanish ISP, it is very likely that we would route packets destined for both of them exactly the same way. The paths to those destinations might not diverge until well after the packets have reached continental Europe, long after they have left our network.

The second realisation is that even if the routes in question weren’t redundant, i.e., if we’d normally use different IP transit providers to reach them, differentiating between them brings no real benefit. Redpill Linpro have two main IP transit providers. In most cases, it makes no discernible difference which of them is used to deliver traffic destined for networks outside of Scandinavia.

The third realisation is that the majority of the routes in the global Internet routing table will remain completely unused. The vast majority of our traffic never leaves Scandinavia. Yet a route advertised by, say, a North Korean ISP will take up just as much space in our routers as a route advertised by one of the largest ISPs in Scandinavia. That’s the case even if we’ve never sent a single packet to that North Korean ISP.

The conclusion is that we don’t really need a full Internet table. We can instead make a distinction between «important» and «non-important» routes. A route could be classified as «important» based on criteria such as the amount of traffic we send to it, whether it belongs to a network in our geographic vicinity, whether it is advertised by a direct peer, or whether it hosts services that are of particular importance to our customers.

The set of «important» routes will fit comfortably into the chipset of a modern data centre switch, even if we set the bar for «important» classification very low.
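A minimal sketch of how such a classification could look in code. Every field name, the country list and the traffic threshold are hypothetical and picked purely for illustration; this is not the classification we ended up using in the proof of concept.

```python
from dataclasses import dataclass

# All names and thresholds below are hypothetical, for illustration only.
@dataclass
class Route:
    prefix: str
    monthly_traffic_bytes: int  # traffic we have sent towards this prefix
    country: str                # country of the network behind the prefix
    learned_from_peer: bool     # True if advertised by a direct peer
    hosts_key_service: bool     # True if our customers depend on a service here

NEARBY_COUNTRIES = {"NO", "SE", "DK"}
TRAFFIC_THRESHOLD_BYTES = 10**9  # arbitrary cut-off for this sketch

def is_important(route: Route) -> bool:
    """A route is «important» if it meets at least one of the criteria."""
    return (
        route.monthly_traffic_bytes >= TRAFFIC_THRESHOLD_BYTES
        or route.country in NEARBY_COUNTRIES
        or route.learned_from_peer
        or route.hosts_key_service
    )

routes = [
    Route("192.0.2.0/24", 5 * 10**9, "US", False, False),  # lots of traffic
    Route("198.51.100.0/24", 0, "NO", False, False),       # nearby network
    Route("203.0.113.0/24", 0, "KP", False, False),        # never used
]
important = [r.prefix for r in routes if is_important(r)]
print(important)
```

Because the criteria are combined with a plain «or», lowering the bar for any one of them only ever grows the set of «important» routes.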

The RIB and the FIB

In order to actually classify a route as «important» or not, we first need to learn of its existence. That means that we will still need to receive a full Internet routing table from our IP transit providers. So does that mean our switch-as-a-router project is doomed to fail? No, not necessarily!

As it happens, any modern network device (router or switch) actually contains (at least) two distinct routing tables.

The first table is called the Forwarding Information Base, or FIB for short. The FIB is located inside the packet forwarding chipset itself, and is what is being used to perform routing of each individual packet that traverses the switch. The FIB is the routing table I’ve talked about so far in this post, because this is where the size limitations are found.

The second table is called the Routing Information Base (RIB). The RIB is located inside the NOS. Its size is really only limited by the amount of memory available to the NOS. The Altoline 6920 has 8 GiB of memory, which is enough to hold tens of millions of routes - easily enough to hold multiple copies of the Internet routing table.

The relationship between the RIB and the FIB is that information from the former is being used to program the latter. For example, an Internet router connected to two different IP transit providers will receive two copies of the Internet routing table (one from each provider). Both copies are stored in the RIB (which at that point will contain well over 1.2 million routes).

For each unique prefix found in the RIB, the NOS will next decide which of the two alternate routes is the best one. It then proceeds to program the best route into the FIB. The not-best route will however remain in the RIB, so that if the current best route is changed or withdrawn by the IP transit provider advertising it, the NOS is in a position to immediately reprogram the FIB accordingly.
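The relationship can be sketched roughly as follows. Note that the `best()` function is a heavily simplified stand-in for the real BGP best-path selection, which has many more tie-breakers, and that all names, addresses and AS numbers are made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    next_hop: str
    local_pref: int
    as_path: tuple

def best(candidates):
    # Simplified stand-in for BGP best-path selection:
    # prefer the highest local preference, then the shortest AS path.
    return min(candidates, key=lambda c: (-c.local_pref, len(c.as_path)))

def program_fib(rib):
    # The FIB gets exactly one next hop per prefix: that of the best candidate.
    return {prefix: best(cands).next_hop for prefix, cands in rib.items() if cands}

# The RIB keeps every candidate route, one per transit provider in this example.
rib = {
    "198.51.100.0/24": [
        Candidate("192.0.2.1", 100, (64500, 64510)),
        Candidate("192.0.2.2", 100, (64501, 64502, 64510)),
    ],
}
fib = program_fib(rib)
print(fib)

# The first provider withdraws its route. The alternative is still in the RIB,
# so the FIB can be reprogrammed right away.
rib["198.51.100.0/24"].pop(0)
fib = program_fib(rib)
print(fib)
```

The point of the sketch is that the FIB never holds more than one entry per prefix, regardless of how many alternatives the RIB is keeping in reserve.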

The final plan: reserve the FIB for important routes only

We intend to insert another criterion in the flow of information from the RIB to the FIB. In order for a RIB route to propagate to the FIB, it must not only be the «best», but it must also be on a list of «important» routes. If it is not, it is not programmed into the FIB. The RIB will however still contain every route, since the exact set of routes that are «important» is dynamic and subject to change.

Of course, even though we’re keeping the «non-important» routes away from the FIB, this does not mean that we will not have connectivity to those networks. To ensure we have connectivity to the entire Internet, not just the «important» parts, we intend to install a «route of last resort» (also known as a «default route») into the FIB that is directed at our primary IP transit provider.

The route of last resort is really no different than any other «important» route, except for the fact that it covers the entire IP address space. The RIB will also contain a separate route of last resort pointing to our secondary IP transit provider, ready to be automatically promoted to the FIB should a failure bring down the primary IP transit connection.

Getting a packet to a «non-important» destination then becomes the IP transit provider’s job, no matter where in the world that is. This is essentially no different from today’s situation - except that we’re achieving this by using a single route entry in the FIB instead of hundreds of thousands.
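To illustrate why a single route of last resort is enough, here is a small sketch of longest-prefix matching against a slimmed-down FIB. The prefixes and next-hop names are hypothetical, and a real forwarding chipset performs this lookup in hardware rather than by scanning a list.

```python
import ipaddress

# Hypothetical slimmed-down FIB: a few «important» routes plus the default route.
fib = {
    ipaddress.ip_network("0.0.0.0/0"): "primary-transit",
    ipaddress.ip_network("198.51.100.0/24"): "peer-a",
    ipaddress.ip_network("198.51.100.128/25"): "peer-b",
}

def lookup(destination: str) -> str:
    """Return the next hop of the most specific FIB route covering destination."""
    address = ipaddress.ip_address(destination)
    matches = [net for net in fib if address in net]
    # The default route covers every address, so there is always a match.
    return fib[max(matches, key=lambda net: net.prefixlen)]

print(lookup("198.51.100.200"))  # covered by the /25, the /24 and the default
print(lookup("198.51.100.10"))   # covered by the /24 and the default
print(lookup("203.0.113.7"))     # covered by the default route only
```

Destinations without an «important» route simply fall through to the default route, which is the behaviour described above.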

Proof of concept implementation with Cumulus Linux NOS

It is of course necessary to make sure that the plan will actually work in practice, so we built a proof of concept in our lab. We installed Cumulus Linux on the Altoline 6920 switch, and configured it to establish IPv4 and IPv6 BGP sessions to two of our current border routers.

router bgp 39029
 no bgp default ipv4-unicast
 neighbor 192.0.2.1 remote-as internal
 neighbor 192.0.2.2 remote-as internal
 neighbor 2001:db8::1 remote-as internal
 neighbor 2001:db8::2 remote-as internal
 !
 address-family ipv4 unicast
  neighbor 192.0.2.1 activate
  neighbor 192.0.2.2 activate
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor 2001:db8::1 activate
  neighbor 2001:db8::2 activate
 exit-address-family

The two border routers advertise the entire IPv4 and IPv6 Internet routing tables to the switch. As we expected, this did not work. The ASIC simply can’t cope with that many routes, and the FIB fills up completely:

cumulus@cumulus:mgmt-vrf:~$ cl-resource-query | grep -i route
IPv4 route entries:    131072, 100% of maximum value 131072
IPv6 route entries:         0,   0% of maximum value  20480
IPv4 Routes:           131072
IPv6 Routes:                0
Total Routes:          131072, 100% of maximum value 131072
cumulus@cumulus:mgmt-vrf:~$ sudo journalctl -o cat -u switchd -n 5
hal_bcm.c:5295 CRIT bcm_l3_route_add failed for hal route 109.65.192.0/20 num_nh 1 neigh_flag: 0x0 flags: 0x0 metric: 20 table: 0 via swp1 44:1e:a1:44:af:18 87.238.46.65, vrf 0: Table full
hal_bcm.c:5295 CRIT bcm_l3_route_add failed for hal route 109.65.208.0/20 num_nh 1 neigh_flag: 0x0 flags: 0x0 metric: 20 table: 0 via swp1 44:1e:a1:44:af:18 87.238.46.65, vrf 0: Table full
hal_bcm.c:5295 CRIT bcm_l3_route_add failed for hal route 109.65.224.0/20 num_nh 1 neigh_flag: 0x0 flags: 0x0 metric: 20 table: 0 via swp1 44:1e:a1:44:af:18 87.238.46.65, vrf 0: Table full
hal_bcm.c:5295 CRIT bcm_l3_route_add failed for hal route 109.65.240.0/20 num_nh 1 neigh_flag: 0x0 flags: 0x0 metric: 20 table: 0 via swp1 44:1e:a1:44:af:18 87.238.46.65, vrf 0: Table full
sync.c:4625 ERR 512813 routes were ignored due to total capacity.

The next step is to determine which routes are «important». While only the imagination limits how one might go about making this classification, for the proof of concept we decided on the following static list:

  • The route of last resort (the default route)
  • All routes not received from an IP transit provider (i.e., routes received from peers)
  • All routes to Scandinavian customers of our primary IP transit provider

All other routes are classified as «non-important» and we’ll keep them away from the FIB. We can now go on to construct a route-map that matches these routes:

!
! These prefix-lists match the IPv4 and IPv6 routes of last resort.
!
ip prefix-list defaultroute seq 5 permit 0.0.0.0/0
ipv6 prefix-list defaultroute seq 5 permit ::/0
!
! These AS path access lists match routes advertised to us from our IP
! transit providers Global Crossing (AS3549) and Level 3 (AS3356).
!
ip as-path access-list gblx permit ^3549_
ip as-path access-list level3 permit ^3356_
!
! Our primary transit provider Level 3 tags all their customer routes with
! the BGP community 3356:123. They also tag them with country-specific
! communities, allowing us to identify all their Danish, Norwegian and
! Swedish customer routes using the community lists below.
!
ip community-list standard level3-cust-denmark permit 3356:123 3356:510
ip community-list standard level3-cust-norway permit 3356:123 3356:517
ip community-list standard level3-cust-sweden permit 3356:123 3356:507
!
! We most definitely need the routes of last resort, as they ensure there
! will be connectivity to networks/routes we keep out of the FIB.
!
route-map important-routes-only permit 5
 match ip address prefix-list defaultroute
!
! The following three matches will make sure that routes belonging to
! Scandinavian customers of Level 3 will make it into the FIB.
!
route-map important-routes-only permit 10
 match community level3-cust-denmark
route-map important-routes-only permit 15
 match community level3-cust-norway
route-map important-routes-only permit 20
 match community level3-cust-sweden
!
! The next two matches ensure that any other routes learned from our
! transit providers are kept out of the FIB.
!
route-map important-routes-only deny 25
 match as-path gblx
route-map important-routes-only deny 30
 match as-path level3
!
! Finally, allow all other routes. This is typically routes learned from
! our peering partners.
!
route-map important-routes-only permit 35
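To make the evaluation order easier to follow, here is a rough Python model of the route-map above. It is only a sketch under a few assumptions: it covers IPv4 routes only, the AS path regular expressions are modelled as «the path starts with this AS», and the example routes (including the AS numbers 64500-64502) are invented.

```python
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    as_path: tuple = ()
    communities: frozenset = frozenset()

def has_all(*wanted):
    # A standard community-list entry matches when the route carries
    # every community listed in that entry.
    return lambda r: set(wanted) <= r.communities

def first_as(asn):
    # Models ^3549_ and ^3356_: the AS path starts with the given AS.
    return lambda r: bool(r.as_path) and r.as_path[0] == asn

# (sequence number, action, match condition), mirroring the route-map above.
ROUTE_MAP = [
    (5,  "permit", lambda r: r.prefix == "0.0.0.0/0"),
    (10, "permit", has_all("3356:123", "3356:510")),
    (15, "permit", has_all("3356:123", "3356:517")),
    (20, "permit", has_all("3356:123", "3356:507")),
    (25, "deny",   first_as(3549)),
    (30, "deny",   first_as(3356)),
    (35, "permit", lambda r: True),  # no match clause: matches everything
]

def installed_in_fib(route):
    # Entries are tried in sequence order and the first match decides.
    for _seq, action, matches in ROUTE_MAP:
        if matches(route):
            return action == "permit"
    return False  # implicit deny when nothing matches

examples = [
    Route("0.0.0.0/0", (3356,)),                                              # default route
    Route("192.0.2.0/24", (3356, 64500), frozenset({"3356:123", "3356:517"})),  # Norwegian customer
    Route("198.51.100.0/24", (3356, 64501), frozenset({"3356:123"})),          # other transit route
    Route("203.0.113.0/24", (64502,)),                                        # learned from a peer
]
results = {r.prefix: installed_in_fib(r) for r in examples}
print(results)
```

The ordering matters: the Scandinavian customer routes must be permitted before the two deny entries, since they too have an AS path that starts with the transit provider's AS.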

The last piece of the puzzle is to make the BGP daemon filter all received BGP routes through the important-routes-only route-map before installing them to the FIB. This is done using the table-map parameter, making the final BGP configuration look like this:

router bgp 39029
 no bgp default ipv4-unicast
 neighbor 192.0.2.1 remote-as internal
 neighbor 192.0.2.2 remote-as internal
 neighbor 2001:db8::1 remote-as internal
 neighbor 2001:db8::2 remote-as internal
 !
 address-family ipv4 unicast
  neighbor 192.0.2.1 activate
  neighbor 192.0.2.2 activate
  table-map important-routes-only
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor 2001:db8::1 activate
  neighbor 2001:db8::2 activate
  table-map important-routes-only
 exit-address-family

As expected, this brings the FIB consumption down to a much more reasonable level, well within the capabilities of the Altoline 6920 switch:

cumulus@cumulus:mgmt-vrf:~$ cl-resource-query | grep -i route
IPv4 route entries:     25367,  19% of maximum value 131072
IPv6 route entries:      1506,   7% of maximum value  20480
IPv4 Routes:            25367
IPv6 Routes:             1506
Total Routes:           26873,  20% of maximum value 131072
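The numbers from the two `cl-resource-query` runs can be compared with a bit of arithmetic. The sketch below assumes that the 131072 installed routes and the 512813 ignored routes from the first attempt together approximate everything the border routers offered; that is an estimate, not an exact count.

```python
# Figures taken from the command output shown in this post.
FIB_CAPACITY = 131_072
installed_unfiltered = 131_072  # first attempt: FIB completely full
ignored_unfiltered = 512_813    # "routes were ignored due to total capacity"
installed_filtered = 25_367 + 1_506  # IPv4 + IPv6 after applying the table-map

# Assumption: installed + ignored approximates the total number of routes offered.
offered = installed_unfiltered + ignored_unfiltered
share_installed = installed_filtered / offered
headroom = FIB_CAPACITY - installed_filtered

print(f"Routes offered (estimate): {offered}")
print(f"Share that ends up in the FIB: {share_installed:.1%}")
print(f"Free FIB entries remaining: {headroom}")
```

Under that assumption, only a few percent of the offered routes end up in the FIB, leaving most of its capacity free for future growth.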

Success! We’ve slimmed down the Internet routing table to a size that will fit comfortably in the FIB of a commodity data centre switch, while at the same time maintaining full connectivity to the entire global Internet.

Summary

Slimming down the FIB will allow us to build our network using commodity data centre switches rather than traditional Internet routers. The new network will be able to handle significantly higher traffic volumes at a fraction of the cost, enabling us to offer competitively priced Internet bandwidth as part of our Managed Services offerings.

Tore Anderson

Senior Systems Consultant at Redpill Linpro

Tore works with infrastructure at Redpill Linpro. Joining us more than a decade ago as a trainee, Tore is now responsible for our network architecture and operations.
