Our Routing Problems Have Not Yet Begun
Bruce M. Maggs
Computer Science Department
Carnegie Mellon University
and
Akamai Technologies
Abstract
The current routing protocols have many well-known deficiencies. Yet the Internet as a
whole has proven to be remarkably stable, and core capacity has been scaling with demand,
indeed has perhaps outpaced demand, so that end users are seeing better performance today
than ever. This paper argues that because of this spare capacity, the consequences of the
flaws in the protocols have not yet been truly experienced. It then argues that short-term
reversals in the ratio of capacity to demand are plausible, and that these reversals might
engender serious routing problems.
Successes and failures of BGP
The most commonly cited problems with the routing protocols involve BGP. (See [HM00] for a
thorough introduction.) BGP, which governs the routes taken by datagrams that travel
between different autonomous systems, provides no effective mechanisms for guaranteeing
quality of service or optimizing performance (in terms of latency and throughput). Support
for load balancing, for adapting to rapid changes in traffic patterns, and for filtering
malicious traffic ranges from minimal to none. Furthermore, in practice, routing policies may be
influenced by financial considerations, and the manual entry of router configuration data
is common.
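To make the first of these deficiencies concrete, the following Python sketch (a
simplification, not any vendor's implementation) walks through the core of the standard BGP
best-path comparison. The point is structural: policy attributes such as local preference
and AS-path length decide the outcome, and no step consults latency, loss, throughput, or
link utilization. The prefixes and AS numbers are example values.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Route:
        prefix: str
        local_pref: int        # set by local policy, often for business reasons
        as_path: List[int]     # autonomous systems the route traverses
        med: int = 0           # multi-exit discriminator, a late tie-breaker

    def better(a: Route, b: Route) -> Route:
        """Prefer a route using (a simplified form of) the BGP decision steps."""
        if a.local_pref != b.local_pref:        # 1. highest local preference wins
            return a if a.local_pref > b.local_pref else b
        if len(a.as_path) != len(b.as_path):    # 2. then shortest AS path
            return a if len(a.as_path) < len(b.as_path) else b
        return a if a.med <= b.med else b       # 3. then lowest MED
                                                # ...nowhere: delay, loss, or load

    r1 = Route("198.51.100.0/24", local_pref=200, as_path=[65001, 65002])
    r2 = Route("198.51.100.0/24", local_pref=100, as_path=[65003])
    print(better(r1, r2).as_path)               # -> [65001, 65002], chosen on policy alone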
Perhaps it is surprising, then, that there have been only a few isolated incidents in which
major network outages have occurred. Human configuration errors have been to blame for several.
For example, in the summer of 2000, a BGP configuration error led routers at a Level3
facility in San Diego to advertise short routes to the rest of the Internet, temporarily
diverting an unsupportable traffic load to this facility. Later, in June 2001, Cable and
Wireless intentionally and abruptly refused to peer with PSINet (for financial reasons),
isolating many users on PSINet. (To those who advocate fully automatic configuration,
however, it would be wise to remember the adage, "To err is human, but to really foul
things up requires a computer." [E78])
But the success stories outweigh these incidents. Although it is too early to assess the impact
of the recent large-scale power outage in the Eastern portion of the United States, there
are few initial reports of core network outages or even web-site infrastructure outages. A
National Research Council report on the impact of the September 11, 2001, attacks in the
United States [NRC02] also showed that the routing protocols adjusted properly to the
physical destruction of network infrastructure in New York City, and the Internet as a
whole continued to perform well (although certain news-oriented web sites were unable to
satisfy demand). Addressing BGP more specifically, although certain worms such as "Code
Red" and "Slammer" (or "Sapphire") have generated enough malicious network traffic to
distract routers from their primary functions and disrupt BGP TCP connections (forcing
table retransmissions and resulting in "churn") [C03b,M03], none of these worms have caused
widespread route instability. Perhaps most interestingly, many BGP routers throughout the
world were patched and restarted one night in the spring of 2002, after the release of a
Cisco security patch, and yet network routing was not disrupted.
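The churn mentioned above follows directly from how BGP distributes routes, so a toy sketch
may help; it is an illustration with an assumed table size, not a protocol implementation.
BGP peers exchange only incremental updates over a long-lived TCP session, so a session
that is reset (for instance, because worm traffic starves the router and keepalives are
missed) must be followed by a re-advertisement of the entire table.

    FULL_TABLE_PREFIXES = 120_000   # assumed table size; rough order of magnitude only

    def updates_sent(events):
        """Count update messages for a sequence of routine changes and session resets."""
        total = 0
        for event in events:
            if event == "incremental":
                total += 1                    # steady state: one changed prefix per update
            elif event == "reset":
                total += FULL_TABLE_PREFIXES  # session re-established: whole table resent
        return total

    print(updates_sent(["incremental"] * 50))               # a quiet day: 50 updates
    print(updates_sent(["incremental"] * 50 + ["reset"]))   # one reset dwarfs everything else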
Operating in the dark
Before arguing that circumstances may soon arise in which the weaknesses of BGP may begin
to have more serious consequences, however, it is important to first observe that our
ability to predict future network traffic and routing behavior is limited.
Indeed, it may be fair to say that we cannot even accurately characterize the behavior of
network traffic today, except to say that it has been known to change rapidly and
drastically. Examples include the rapid rise in http traffic after the introduction of the
Mosaic and Netscape browsers, the recent boom in "SPAM" email, the explosion of
file-sharing services, and the heavy traffic loads generated by worms and viruses and the
corresponding patches. (More than half of Akamai's top-ten customers, ranked by total
number of bytes served, use Akamai to deliver virus and worm signatures and software
updates.) This is not to say that there have not been effective approaches to
understanding network traffic. The study by Saroiu et al. [SGGL02], for example, paints a
detailed picture of the types of flows entering and exiting the University of Washington's
network, and points out recent growth in traffic attributed to file-sharing services. But
this study may not be representative of Internet traffic at large. For example, it fails
to capture VPN traffic between enterprise office facilities (and many other sorts of
traffic). BGP behavior has also been studied extensively. As one well-executed example of
this line of work, Maennel and Feldmann [MF02] study BGP routing instabilities.
But it would be difficult to find consensus among networking experts on answers to the
following sorts of questions.
1. Where (if anywhere) is the congestion in the Internet?
2. How much capacity does the Internet have, and how fast is it growing?
3. How much traffic does the core of the Internet carry today, and what does it look like?
4. How fast is network traffic growing?
5. What will traffic patterns look like five years from now?
6. Can we scale the network to support the demands of users five years from now?
7. How much does it cost, and how much will it cost, to increase network capacity?
8. Will stub networks soon be employing sophisticated traffic engineering mechanisms on
their own, e.g., those based on multihoming and overlay routing? What impact might these
techniques have?
9. What about content delivery networks? What fraction of the traffic are they carrying?
What is the impact of the trick of using DNS to route traffic?
These questions have, of course, been studied. Regarding the first question, the
"conventional wisdom" has been that congestion occurs primarily in "last mile" connections
to homes and enterprises. Cheriton [C03] and others have argued that the abundance of
"dark fiber" in the United States will provide enough transmission capacity for some time
to come. A recent study by Akella, et al. [ASS03], however, found that up to 40% of the
paths between all pairs of a diverse set of hosts on the Internet had at most 50 Mbps of spare
capacity. These "bottlenecks" were most commonly seen on tier-two, -three, and -four
networks, but 15% appeared on tier-one networks. The study indicates that regardless of
fiber capacity, there is already congestion in the core. Perhaps router capacity is a more
limited resource.
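As a small illustration of how such headline figures are derived from per-path
measurements, the sketch below classifies a handful of invented path measurements against
the 50 Mbps threshold. The numbers are placeholders for illustration, not data from [ASS03].

    # Hypothetical (path, tier of bottleneck link, spare capacity in Mbps) measurements.
    measurements = [
        ("path-1", 1, 35.0), ("path-2", 2, 48.0), ("path-3", 3, 12.0),
        ("path-4", 1, 900.0), ("path-5", 4, 40.0), ("path-6", 2, 600.0),
    ]

    THRESHOLD_MBPS = 50.0
    bottlenecked = [(p, tier) for p, tier, spare in measurements if spare <= THRESHOLD_MBPS]

    fraction = len(bottlenecked) / len(measurements)
    tier_one_share = sum(1 for _, tier in bottlenecked if tier == 1) / len(bottlenecked)
    print(f"{fraction:.0%} of paths have <= {THRESHOLD_MBPS:.0f} Mbps of spare capacity")
    print(f"{tier_one_share:.0%} of those bottlenecks sit on tier-one networks")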
The second question has been addressed by Danzig, who has periodically estimated network
capacity and traffic load. His estimates of cross-continental capacity are surprisingly
low.
The coming crises?
Despite the caveats about our understanding of the state of the network today, let us make
the assumptions that the core of the Internet is, in many places, running at close to
capacity, and (more easily supported) that the last-mile remains a bottleneck for many end
users. How might a routing crisis ensue? Suppose that there is a rapid increase (perhaps
two orders of magnitude) in the traffic generated by end users. Such a scenario would be
driven by end user demand and greatly improved last-mile connectivity. As we have argued,
new applications (the web, file-sharing services, etc.) have in the past periodically
created large new traffic demands. Furthermore, these demands have arisen without abrupt
technology changes. What the new applications might be is difficult to predict. There are
many possible applications that could utilize high-quality video, but we have yet to see
enough last-mile connectivity to support them. In South Korea, where the penetration of
"broadband" to the home is more widespread than in the United States, networking gaming
applications have become a significant driver of traffic. Whatever the source, it is seems
plausible that great increases in demand will continue to punctuate the future. On the
last-mile front, upgrading capacity is likely to prove expensive, but is certainly
technically feasible. Let us assume there is great demand from end users for improved
connectivity (two orders of magnitude), and that end users are willing to pay for this
access connectivity into their homes and businesses. Increasing end-user bursting capacity
will increase the potential for drastic changes in traffic patterns.
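A back-of-the-envelope calculation makes the scale of this scenario concrete. All of the
figures in the sketch below are invented for illustration; the point is only that if the
average duty cycle of end users stays fixed, a hundredfold access upgrade implies a
hundredfold jump in the sustained demand the core must carry.

    subscribers     = 10_000_000   # end users behind one hypothetical carrier
    old_access_mbps = 0.5          # assumed legacy access burst rate
    new_access_mbps = 50.0         # roughly two orders of magnitude better
    duty_cycle      = 0.02         # assumed fraction of peak rate sustained on average

    def core_demand_gbps(users, access_mbps, utilization):
        """Aggregate sustained demand on the carrier's core, in Gb/s."""
        return users * access_mbps * utilization / 1000.0

    before = core_demand_gbps(subscribers, old_access_mbps, duty_cycle)   # 100 Gb/s
    after  = core_demand_gbps(subscribers, new_access_mbps, duty_cycle)   # 10,000 Gb/s
    print(f"before: {before:,.0f} Gb/s  after: {after:,.0f} Gb/s  ratio: {after / before:.0f}x")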
Should such a scenario take place, carriers will be faced with the task of scaling their
networks, requiring increases in both transmission capacity and switching capacity.
Predominant traffic patterns may also shift, requiring capacity in new places. The
carriers will presumably price data services to cover the expense of this new
infrastructure, and will make an effort to match increases in traffic demand with
increases in capacity.
So what might go wrong? As the carriers attempt to increase capacity, they will (as they
have in the past) try to avoid building in excessive margins of spare capacity. But
predictions about where capacity is needed, and how much, may prove difficult. There are
many unknown variables, and they have the potential to swing rapidly. How quickly will
traffic demand grow? How will traffic patterns change? Will new applications behave
responsibly? How will the ratio of capacity and demand at the edge to the capacity required
in the core change? How much will it cost to increase capacity in the core? As our
scenario unfolds, let us assume that, due to the difficulty of predicting these variables,
growth in demand and growth in core capacity occasionally fall out of kilter, so that
demand bumps up against capacity and large parts of the core of the Internet operate for
weeks or perhaps months at a time at or near capacity.
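The kind of mismatch assumed here is easy to model. The toy calculation below uses invented
quarterly growth figures and an annual provisioning cycle sized to a fixed forecast; when a
burst of demand arrives between upgrades, utilization climbs above 1.0 and stays there for
several quarters, until a larger catch-up upgrade finally lands.

    quarterly_growth = [1.05, 1.10, 1.08, 1.12,   # a quiet year
                        1.15, 1.45, 1.40, 1.20,   # a new application takes off
                        1.10, 1.12, 1.15, 1.10]   # growth settles down again
    annual_upgrade   = {4: 1.5, 8: 1.5, 12: 2.0}  # capacity added once a year

    demand, capacity = 100.0, 160.0               # arbitrary starting units
    for quarter, growth in enumerate(quarterly_growth, start=1):
        demand *= growth
        capacity *= annual_upgrade.get(quarter, 1.0)
        print(f"Q{quarter:2d}  demand={demand:6.1f}  capacity={capacity:6.1f}  "
              f"utilization={demand / capacity:4.2f}")
    # In this run, utilization exceeds 1.0 from Q7 through Q11: more than a year in
    # which the modeled core operates above its nominal capacity.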
Now the routing problems set in
Imagine the problems a largely saturated core would cause. BGP provides no mechanism for
routing around congestion. Networks might find themselves effectively isolated from each
other, even if, through proper load balancing, congestion-free routes are available.
High-priority traffic would fare no better. BGP itself might have difficulty functioning.
Manual attempts to reduce congestion through BGP configuration changes would increase the
risk of routing outages.
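A small follow-on to the earlier decision-process sketch shows the isolation effect
directly; the links and utilization figures are hypothetical. Because BGP installs a single
best path per prefix, chosen without regard to load, traffic keeps flowing onto a saturated
link even when an idle alternative exists and a load-aware selection would take it.

    candidate_paths = [
        {"via_as": [65010],        "utilization": 0.98},  # shorter AS path, saturated
        {"via_as": [65020, 65030], "utilization": 0.30},  # longer AS path, mostly idle
    ]

    bgp_choice        = min(candidate_paths, key=lambda p: len(p["via_as"]))
    load_aware_choice = min(candidate_paths, key=lambda p: p["utilization"])

    print("BGP forwards via", bgp_choice["via_as"],
          "running at", int(bgp_choice["utilization"] * 100), "% utilization")
    print("A congestion-aware selection would instead use", load_aware_choice["via_as"])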
Directions for future research
The above discussion suggests a number of directions for future research. To ward off
problems in the short-to-medium term, we should further improve our understanding of how
the Internet currently operates so that we can make better short-term predictions. We
should analyze the behavior of the Internet with a saturated core, and determine what can
be done using the current protocols and practices to alleviate the problems that would
arise. Longer term, we need to replace BGP and most likely the interior protocols as well,
and consider modifying the Internet architecture too (as suggested by Zhang et al. [MZ03],
and surely many others). Of course replacing a universally adopted protocol like BGP is no
easy task, but it seems risky to continue with a protocol that is not designed to perform
well in extreme situations. Performance optimizations must be integral to such a protocol.
It is difficult to design, tune, or improve protocols or build networks, however, without a
good understanding of how networks operate in practice. Hence measurability should be a
goal as well (as suggested by Varghese and others). Most importantly, we should decide how
we want the Internet to behave in the future, and build accordingly.
References
[ASS03] A. Akella, S. Seshan, and A. Shaikh, An Empirical Evaluation of Wide-Area Internet
Bottlenecks, in Proceedings of the First ACM Internet Measurement Conference, October 2003,
to appear.
[C03] D. Cheriton, The Future of the Internet: Why it Matters, Keynote Address (SIGCOMM
2003 Award Winner), SIGCOMM 2003 Conference on Applications, Technologies, Architectures
and Protocols for Computer Communication, September, 2003.
[C03b] G. Cybenko, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.
[HM00] S. Halabi and D. McPherson, Internet Routing Architectures, second edition, Cisco Press, 2000.
[M03] B. M. Maggs, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.
[NRC02] The Internet Under Crisis Conditions: Learning from September 11, National Research
Council, Washington, DC, 2002.
[SGGL02] S. Saroiu, K. Gummadi, S. D. Gribble, and H. M. Levy, An Analysis of Internet
Content Delivery Systems, in Proceedings of the Fifth Symposium on Operating Systems Design
and Implementation, December, 2002.
[MF02] O. Maennel and A. Feldmann, Realistic BGP Traffic for Test Labs, in Proceedings of
the 2002 SIGCOMM Conference on Communications Architectures and Protocols, August 2002.
[E78] P. Ehrlich, Farmers' Almanac, 1978.