Our Routing Problems Have Not Yet Begun

Bruce M. Maggs
Computer Science Department, Carnegie Mellon University
and Akamai Technologies

Abstract

The current routing protocols have many well-known deficiencies. Yet the Internet as a whole has proven to be remarkably stable, and core capacity has been scaling with demand, indeed has perhaps outpaced demand, so that end users are seeing better performance today than ever. This paper argues that because of this spare capacity, the consequences of the flaws in the protocols have not yet been truly experienced. It then argues that short-term reversals in the ratio of capacity to demand are plausible, and that these reversals might engender serious routing problems.

Successes and failures of BGP

The most commonly cited problems with the routing protocols involve BGP. (See [HM00] for a thorough introduction.) BGP, which governs the routes taken by datagrams that travel between different autonomous systems, provides no effective mechanisms for guaranteeing quality of service or optimizing performance (in terms of latency and throughput). Support for load balancing, adapting to rapid changes in traffic patterns, and filtering malicious traffic ranges from minimal to none. Furthermore, in practice, routing policies may be influenced by financial considerations, and the manual entry of router configuration data is common.

Perhaps it is surprising, then, that there have been only a few isolated incidents in which major network outages have occurred. Human configuration has been to blame for several. For example, in the summer of 2000, a BGP configuration error led routers at a Level3 facility in San Diego to advertise short routes to the rest of the Internet, temporarily diverting an unsupportable traffic load to this facility. (A small sketch at the end of this section illustrates why short routes attract traffic.) Later, in June 2001, Cable and Wireless intentionally and abruptly refused to peer with PSINet (for financial reasons), isolating many users on PSINet. (Those who advocate fully automatic configuration, however, would be wise to remember the adage, "To err is human, but to really foul things up requires a computer." [E78])

But the success stories outweigh these incidents. Although it is too early to assess the impact of the recent large-scale power outage in the Eastern portion of the United States, there are few initial reports of core network outages or even web-site infrastructure outages. A National Research Council report on the impact of the September 11, 2001, attacks in the United States [NRC02] also showed that the routing protocols adjusted properly to the physical destruction of network infrastructure in New York City, and that the Internet as a whole continued to perform well (although certain news-oriented web sites were unable to satisfy demand). Addressing BGP more specifically, although certain worms such as "Code Red" and "Slammer" (or "Sapphire") have generated enough malicious network traffic to distract routers from their primary functions and disrupt BGP TCP connections (forcing table retransmissions and resulting in "churn") [C03b,M03], none of these worms has caused widespread route instability. Perhaps most interestingly, many BGP routers throughout the world were patched and restarted one night in the spring of 2002, after the release of a Cisco security patch, and yet network routing was not disrupted.
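To make the path-length point concrete, the following is a minimal sketch, in Python, of the AS-path step of BGP best-path selection. It is an illustration under assumed data: the Route class, the prefixes, and the AS paths below are hypothetical and are not drawn from the Level3 incident or from any real router implementation.

# Minimal sketch of BGP best-path selection, restricted to two steps:
# highest LOCAL_PREF, then shortest AS path. Hypothetical data only.
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    as_path: list          # sequence of AS numbers the advertisement traversed
    local_pref: int = 100  # operator-assigned preference

def best_route(candidates):
    # Prefer the highest LOCAL_PREF; break ties by the shortest AS path.
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))

# Two advertisements for the same prefix: a normal route, and one whose
# AS path has been mistakenly shortened by a configuration error.
normal        = Route("192.0.2.0/24", as_path=[3356, 701, 7018])
misconfigured = Route("192.0.2.0/24", as_path=[3356])

print(best_route([normal, misconfigured]).as_path)  # -> [3356]

Because the comparison considers path length but not load or capacity, a mistakenly short advertisement wins wherever it is heard, which is how a single facility can attract an unsupportable share of traffic.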
Operating in the dark

Before arguing that circumstances may soon arise in which the weaknesses of BGP may begin to have more serious consequences, however, it is important to first observe that our ability to predict the future behavior of network traffic and routing is limited. Indeed, it may be fair to say that we cannot even accurately characterize the behavior of network traffic today, except to say that it has been known to change rapidly and drastically. Examples include the rapid rise in http traffic after the introduction of the Mosaic and Netscape browsers, the recent boom in "SPAM" email, the explosion of file-sharing services, and the heavy traffic loads generated by worms and viruses and the corresponding patches. (More than half of Akamai's top-ten customers, ranked by total number of bytes served, use Akamai to deliver virus and worm signatures and software updates.)

This is not to say that there have not been effective approaches to understanding network traffic. The study by Saroiu et al. [SGGL02], for example, paints a detailed picture of the types of flows entering and exiting the University of Washington's network, and points out recent growth in traffic attributed to file-sharing services. But this study may not be representative of Internet traffic at large. For example, it fails to capture VPN traffic between enterprise office facilities (and many other sorts of traffic). BGP behavior has also been studied extensively. As a well-executed representative of this type of work, Maennel and Feldmann [MF02] study BGP routing instabilities. But it would be difficult to find consensus among networking experts on answers to the following sorts of questions.

1. Where (if anywhere) is the congestion in the Internet?
2. How much capacity does the Internet have, and how fast is it growing?
3. How much traffic does the core of the Internet carry today, and what does it look like?
4. How fast is network traffic growing?
5. What will traffic patterns look like five years from now?
6. Can we scale the network to support the demands of users five years from now?
7. How much does it cost, and how much will it cost, to increase network capacity?
8. Will stub networks soon be employing sophisticated traffic engineering mechanisms on their own, e.g., those based on multihoming and overlay routing? What impact might these techniques have?
9. What about content delivery networks? What fraction of the traffic are they carrying? What is the impact of the trick of using DNS to route traffic?

These questions have, of course, been studied. Regarding the first question, the "conventional wisdom" has been that congestion occurs primarily in "last mile" connections to homes and enterprises. Cheriton [C03] and others have argued that the abundance of "dark fiber" in the United States will provide enough transmission capacity for some time to come. A recent study by Akella et al. [ASS03], however, found that up to 40% of the paths between all pairs of a diverse set of hosts on the Internet had at most 50 Mbps of spare capacity. These "bottlenecks" were most commonly seen on tier-two, -three, and -four networks, but 15% appeared on tier-one networks. The study indicates that regardless of fiber capacity, there is already congestion in the core. Perhaps router capacity is a more limited resource. The second question has been addressed by Danzig, who has periodically estimated network capacity and traffic load. His estimates of cross-continental capacity are surprisingly low.
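As a purely illustrative view of what a bottleneck count like the one in [ASS03] involves, the toy sketch below tallies the fraction of measured paths whose spare capacity falls at or below a 50 Mbps threshold. The path list, tier assignments, and numbers are invented for illustration; they are not data or methodology from the study.

# Toy tally of "bottleneck" paths: a path counts as a bottleneck here if its
# spare (available) capacity is at most 50 Mbps. All values are invented.

paths = [
    # (path id, tier of the constraining network, spare capacity in Mbps)
    ("A->B", 3, 12.5),
    ("A->C", 2, 48.0),
    ("B->C", 1, 35.0),
    ("B->D", 1, 400.0),
    ("C->D", 4, 9.0),
]

THRESHOLD_MBPS = 50.0

bottlenecks = [p for p in paths if p[2] <= THRESHOLD_MBPS]
print(f"{len(bottlenecks) / len(paths):.0%} of paths have at most {THRESHOLD_MBPS:.0f} Mbps spare")

tier_one = sum(1 for p in bottlenecks if p[1] == 1)
print(f"{tier_one / len(bottlenecks):.0%} of those bottlenecks lie on tier-one networks")

Even this simple accounting presumes per-path spare-capacity measurements, which are exactly the kind of data that are hard to obtain at Internet scale; hence the first question above.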
The coming crises?

Despite the caveats about our understanding of the state of the network today, let us make the assumptions that the core of the Internet is, in many places, running at close to capacity, and (more easily supported) that the last mile remains a bottleneck for many end users. How might a routing crisis ensue?

Suppose that there is a rapid increase (perhaps two orders of magnitude) in the traffic generated by end users. Such a scenario would be driven by end-user demand and greatly improved last-mile connectivity. As we have argued, new applications (the web, file-sharing services, etc.) have in the past periodically created large new traffic demands. Furthermore, these demands have arisen without abrupt technology changes. What the new applications might be is difficult to predict. There are many possible applications that could utilize high-quality video, but we have yet to see enough last-mile connectivity to support them. In South Korea, where the penetration of "broadband" to the home is more widespread than in the United States, network gaming applications have become a significant driver of traffic. Whatever the source, it seems plausible that great increases in demand will continue to punctuate the future. On the last-mile front, upgrading capacity is likely to prove expensive, but is certainly technically feasible. Let us assume there is great demand from end users for improved connectivity (two orders of magnitude), and that end users are willing to pay for this access connectivity into their homes and businesses. Increasing end-user bursting capacity will increase the potential for drastic changes in traffic patterns.

If such a scenario should take place, carriers will be faced with the task of scaling their networks, requiring increases in both transmission capacity and switching capacity. Predominant traffic patterns may also shift, requiring capacity in new places. The carriers will presumably price data services to cover the expense of this new infrastructure, and will make an effort to match increases in traffic demand with increases in capacity.

So what might go wrong? As the carriers attempt to increase capacity, they will (as they have in the past) try to avoid building in excessive margins of spare capacity. But predictions about where capacity is needed, and how much, may prove difficult. There are many unknown variables, and they have the potential to swing rapidly. How quickly will traffic demand grow? How will traffic patterns change? Will new applications behave responsibly? How will the ratio of capacity-and-demand-at-the-edge to capacity-required-in-the-core change? How much will it cost to increase capacity in the core? As our scenario unfolds, let us assume that, due to the difficulties in predicting these variables, growth in demand and growth in core capacity occasionally fall out of kilter, so that demand bumps up against capacity, and large parts of the core of the Internet operate for weeks or perhaps months at a time at or near capacity.

Now the routing problems set in

Imagine the problems a largely saturated core would cause. BGP provides no mechanism for routing around congestion. Networks might find themselves effectively isolated from each other, even if, through proper load balancing, congestion-free routes are available. High-priority traffic would fare no better. BGP itself might have difficulty functioning. Manual attempts to reduce congestion through BGP configuration would increase the risk of routing outages.
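The load-balancing point can be made concrete with a toy model. Under assumed capacities and demand (nothing here is drawn from a real topology or protocol implementation), BGP-style forwarding installs a single best path and ignores load, saturating one route while an idle alternative goes unused.

# Toy model: one traffic demand, two feasible routes of equal capacity.
# BGP installs a single best path and does not react to load, so the whole
# demand lands on one route. Capacities and demand are hypothetical.

def utilization(load_gbps, capacity_gbps):
    return load_gbps / capacity_gbps

demand_gbps      = 12.0   # offered traffic between two networks
route_a_capacity = 10.0   # the BGP best path
route_b_capacity = 10.0   # a feasible but unused alternative

# Single-path (BGP-style) forwarding: route A is driven past capacity.
print("single path :", utilization(demand_gbps, route_a_capacity))   # 1.2

# Idealized load balancing: splitting the demand keeps both routes healthy.
half = demand_gbps / 2
print("balanced    :", utilization(half, route_a_capacity),
      utilization(half, route_b_capacity))                           # 0.6 0.6

With the core running near capacity, the difference between these two cases is the difference between sustained overload on the best path and uneventful operation.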
Directions for future research

The above discussion suggests a number of directions for future research. To ward off problems in the short-to-medium term, we should further improve our understanding of how the Internet currently operates so that we can make better short-term predictions. We should analyze the behavior of the Internet with a saturated core, and determine what can be done using the current protocols and practices to alleviate the problems that would arise. Longer term, we need to replace BGP, and most likely the interior protocols as well, and consider modifying the Internet architecture too (as suggested by Zhang et al. [MZ03], and surely many others). Of course, replacing a universally adopted protocol like BGP is no easy task, but it seems risky to continue with a protocol that is not designed to perform well in extreme situations. Performance optimizations must be integral to such a protocol. It is difficult to design, tune, or improve protocols or build networks, however, without a good understanding of how networks operate in practice. Hence measurability should be a goal as well (as suggested by Varghese and others). Most importantly, we should decide how we want the Internet to behave in the future, and build accordingly.

References

[ASS03] A. Akella, S. Seshan, and A. Shaikh, An Empirical Evaluation of Wide-Area Internet Bottlenecks, in Proceedings of the First ACM Internet Measurement Conference, October 2003, to appear.
[C03] D. Cheriton, The Future of the Internet: Why it Matters, Keynote Address (SIGCOMM 2003 Award Winner), SIGCOMM 2003 Conference on Applications, Technologies, Architectures and Protocols for Computer Communication, September 2003.
[C03b] G. Cybenko, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.
[E78] P. Ehrlich, Farmers' Almanac, 1978.
[HM00] S. Halabi and D. McPherson, Internet Routing Architectures, second edition, Cisco Press, 2000.
[M03] B. M. Maggs, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.
[MF02] O. Maennel and A. Feldmann, Realistic BGP Traffic for Test Labs, in Proceedings of the 2002 SIGCOMM Conference on Communications Architectures and Protocols, August 2002.
[NRC02] The Internet Under Crisis Conditions: Learning from September 11, National Research Council, Washington, DC, 2002.
[SGGL02] S. Saroiu, K. Gummadi, S. D. Gribble, and H. M. Levy, An Analysis of Internet Content Delivery Systems, in Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, December 2002.