Workshop on Internet Routing Evolution and Design (WIRED)

October 7-8, 2003
Timberline Lodge, Mount Hood, Oregon, USA

Position statement of

Bruce Maggs

(CMU/Akamai)






         Our Routing Problems Have Not Yet Begun
         
         Bruce M. Maggs
         Computer Science Department
         Carnegie Mellon University
         and
         Akamai Technologies
         
         Abstract
         
	 The current routing protocols have many well-known deficiencies.  Yet the Internet as a
	 whole has proven remarkably stable, and core capacity has been scaling with demand,
	 indeed has perhaps outpaced it, so that end users are seeing better performance today
	 than ever.  This paper argues that because of this spare capacity, the consequences of
	 the flaws in the protocols have not yet been truly experienced.  It then argues that
	 short-term reversals in the ratio of capacity to demand are plausible, and that these
	 reversals might engender serious routing problems.
         
         Successes and failures of BGP
         
	 The most commonly cited problems with the routing protocols involve BGP.  (See [HM00] for
	 a thorough introduction.)  BGP, which governs the routes taken by datagrams that travel
	 between different autonomous systems, provides no effective mechanisms for guaranteeing
	 quality of service or optimizing performance (in terms of latency and throughput).
	 Support for load balancing, adapting to rapid changes in traffic patterns, and filtering
	 malicious traffic ranges from minimal to nonexistent.  Furthermore, in practice, routing
	 policies may be influenced by financial considerations, and the manual entry of router
	 configuration data is common.
         
	 Perhaps it is surprising, then, that there have been only a few isolated incidents in which
	 major network outages have occurred.  Human configuration errors have been to blame for
	 several.
	 For example, in the summer of 2000, a BGP configuration error led routers at a Level3
	 facility in San Diego to advertise short routes to the rest of the Internet, temporarily
	 diverting an unsupportable traffic load to this facility.  Later, in June 2001, Cable and
	 Wireless intentionally and abruptly refused to peer with PSINet (for financial reasons),
	 isolating many users on PSINet.  (To those who advocate fully automatic configuration,
	 however, it would be wise to remember the adage, "To err is human, but to really foul
	 things up requires a computer." [E78])
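The effect of such a misconfiguration can be sketched in a few lines.  BGP (simplified here to just its shortest-AS-path preference, which is only one step of the full decision process) favors shorter AS paths, so a router that mistakenly advertises a prefix with a stripped path attracts traffic from everywhere.  The AS numbers and the single-criterion decision function below are illustrative assumptions, not a faithful model of the San Diego incident.

```python
# Illustrative sketch (not the full BGP decision process): routers prefer
# the advertisement with the shortest AS path, so a bogus short route
# wins everywhere at once.  AS numbers are made up.

def best_route(routes):
    """Pick the advertisement with the shortest AS path, a simplified
    stand-in for BGP's best-path selection."""
    return min(routes, key=lambda as_path: len(as_path))

# Legitimate route to some prefix: four AS hops ending at the true origin.
legitimate = [701, 1239, 3356, 7018]

# A misconfigured router re-advertises the prefix as if it originated
# locally -- a much "shorter" route.
misconfigured = [3356]

# Every BGP speaker comparing the two prefers the bogus short route, so
# traffic for the prefix converges on the misconfigured facility.
chosen = best_route([legitimate, misconfigured])
print(chosen)  # [3356]
```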
         
	 But the success stories outweigh these incidents.  Although it is too early to assess the
	 impact of the recent large-scale power outage in the Eastern portion of the United States,
	 there are few initial reports of core network outages or even web-site infrastructure
	 outages.  A National Research Council report on the impact of the September 11, 2001,
	 attacks in the United States [NRC02] also showed that the routing protocols adjusted
	 properly to the physical destruction of network infrastructure in New York City, and the
	 Internet as a whole continued to perform well (although certain news-oriented web sites
	 were unable to satisfy demand).  Addressing BGP more specifically, although certain worms
	 such as "Code Red" and "Slammer" (or "Sapphire") have generated enough malicious network
	 traffic to distract routers from their primary functions and disrupt BGP TCP connections
	 (forcing table retransmissions and resulting in "churn") [C03b,M03], none of these worms
	 has caused widespread route instability.  Perhaps most interestingly, many BGP routers
	 throughout the world were patched and restarted one night in the spring of 2002, after
	 the release of a Cisco security patch, and yet network routing was not disrupted.
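The churn mechanism mentioned above can be illustrated with a toy model.  BGP normally sends only incremental updates over a long-lived TCP session, so when congestion drops the session, the peers must re-exchange their entire tables on reconnect.  The table size used below is a rough assumption for the circa-2003 global table, and the event-stream model is a deliberate simplification.

```python
# Toy model of BGP update load: incremental updates are cheap, but each
# TCP session reset forces a full-table re-advertisement ("churn").

TABLE_SIZE = 120_000  # assumed global routing table size, circa 2003

def updates_sent(events):
    """Count route updates a router sends to one peer, given a stream of
    'change' events (incremental) and 'reset' events (full table)."""
    total = 0
    for event in events:
        if event == "change":
            total += 1           # normal incremental update
        elif event == "reset":
            total += TABLE_SIZE  # session re-established: resend everything
    return total

# Steady state: a handful of incremental changes.
print(updates_sent(["change"] * 5))                       # 5

# A worm-induced congestion spike drops the TCP session twice.
print(updates_sent(["change"] * 5 + ["reset", "reset"]))  # 240005
```

The point of the sketch is the ratio: two resets cost tens of thousands of times more update traffic than steady-state operation, which is why worm traffic that merely disrupts TCP sessions can still load routers heavily.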
         
         Operating in the dark
         
	 Before arguing that circumstances may soon arise in which the weaknesses of BGP begin to
	 have more serious consequences, however, it is important to first observe that our
	 ability to predict the future behavior of network traffic and routing is limited.
	 Indeed, it may be fair to say that we cannot even accurately characterize the behavior of
	 network traffic today, except to say that it has been known to change rapidly and
	 drastically.  Examples include the rapid rise in HTTP traffic after the introduction of
	 the Mosaic and Netscape browsers, the recent boom in "spam" email, the explosion of
	 file-sharing services, and the heavy traffic loads generated by worms and viruses and the
	 corresponding patches.  (More than half of Akamai's top-ten customers, ranked by total
	 number of bytes served, use Akamai to deliver virus and worm signatures and software
	 updates.)  This is not to say that there have been no effective approaches to
	 understanding network traffic.  The study by Saroiu et al. [SGGL02], for example, paints
	 a detailed picture of the types of flows entering and exiting the University of
	 Washington's network, and points out recent growth in traffic attributed to file-sharing
	 services.  But this study may not be representative of Internet traffic at large.  For
	 example, it fails to capture VPN traffic between enterprise office facilities (and many
	 other sorts of traffic).  BGP behavior has also been studied extensively; as a
	 representative example of this line of work, Maennel and Feldmann [MF02] study BGP
	 routing instabilities.
         
	 But it would be difficult to find consensus among networking experts on answers to the
	 following sorts of questions.
         
         1. Where (if anywhere) is the congestion in the Internet?
         
         2. How much capacity does the Internet have, and how fast is it growing?
         
         3. How much traffic does the core of the Internet carry today, and what does it look like?
         
         4. How fast is network traffic growing?
         
         5. What will traffic patterns look like five years from now?
         
         6. Can we scale the network to support the demands of users five years from now?
         
         7. How much does it cost today, and how much will it cost in the future, to increase network capacity?
         
	 8. Will stub networks soon be employing sophisticated traffic engineering mechanisms on
	 their own, e.g., those based on multihoming and overlay routing? What impact might these
	 techniques have?
         
	 9. What about content delivery networks?  What fraction of the traffic are they carrying?
	 What is the impact of the trick of using DNS to route traffic?  
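The "DNS trick" of question 9 can be sketched concretely: a CDN's authoritative name server answers a query for a customer's hostname with the address of an edge server chosen for the querying resolver, using a short TTL so the mapping can change quickly.  All names, addresses, regions, and the resolver-to-region table below are hypothetical; real request-routing systems use far richer network measurements.

```python
# Hypothetical sketch of DNS-based request routing: the answer to a DNS
# query depends on where the querying resolver appears to be.

EDGE_SERVERS = {
    "us-west": "192.0.2.10",
    "us-east": "192.0.2.20",
    "europe":  "192.0.2.30",
}

# Assumed mapping from a resolver's network to its nearest region.
RESOLVER_REGION = {
    "198.51.100.0/24": "us-west",
    "203.0.113.0/24":  "europe",
}

def resolve(resolver_prefix):
    """Answer a query for the CDN hostname: return the edge server for
    the resolver's region (falling back to a default) plus a short TTL,
    so clients can be re-mapped within seconds."""
    region = RESOLVER_REGION.get(resolver_prefix, "us-east")
    return EDGE_SERVERS[region], 20  # (A record, TTL in seconds)

print(resolve("203.0.113.0/24"))  # ('192.0.2.30', 20)
print(resolve("192.0.2.0/24"))    # unknown resolver -> ('192.0.2.20', 20)
```

The routing-relevant consequence is that traffic is steered at the application layer, invisibly to BGP, which is one reason question 9 is hard to answer from routing data alone.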
         
	 These questions have, of course, been studied.  Regarding the first question, the
	 "conventional wisdom" has been that congestion occurs primarily in "last mile" connections
	 to homes and enterprises.  Cheriton [C03] and others have argued that the abundance of
	 "dark fiber" in the United States will provide enough transmission capacity for some time
	 to come.  A recent study by Akella et al. [ASS03], however, found that up to 40% of the
	 paths between all pairs of a diverse set of hosts on the Internet had at most 50 Mbps of
	 spare
	 capacity.  These "bottlenecks" were most commonly seen on tier-two, -three, and -four
	 networks, but 15% appeared on tier-one networks.  The study indicates that regardless of
	 fiber capacity, there is already congestion in the core.  Perhaps router capacity is a more
	 limited resource.
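The kind of aggregate reported in [ASS03] can be reproduced, in spirit, by simple post-processing of per-path available-bandwidth measurements: count the paths whose tightest link has at most 50 Mbps to spare, and break the bottlenecks out by network tier.  The measurements below are invented purely for illustration.

```python
# Illustrative post-processing of (tier, available bandwidth in Mbps)
# measurements, in the spirit of a bottleneck study.  Data is made up.

measurements = [
    (1, 45), (1, 300), (2, 30), (2, 48), (3, 20), (3, 90), (4, 10), (2, 200),
]

# A path is "bottlenecked" if its tightest link has <= 50 Mbps to spare.
bottlenecks = [(tier, bw) for tier, bw in measurements if bw <= 50]
fraction = len(bottlenecks) / len(measurements)

# Count bottlenecks by the tier of the network where the tight link sits.
by_tier = {}
for tier, _ in bottlenecks:
    by_tier[tier] = by_tier.get(tier, 0) + 1

print(fraction)  # 0.625
print(by_tier)   # {1: 1, 2: 2, 3: 1, 4: 1}
```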
         
	 The second question has been addressed by Danzig, who has periodically estimated network
	 capacity and traffic load.  His estimates of cross-continental capacity are surprisingly
	 low.
         
         
         The coming crises?
         
	 Despite the caveats about our understanding of the state of the network today, let us
	 assume that the core of the Internet is, in many places, running at close to capacity,
	 and (more easily supported) that the last mile remains a bottleneck for many end users.
	 How might a routing crisis ensue?  Suppose that there is a rapid increase (perhaps two
	 orders of magnitude) in the traffic generated by end users.  Such a scenario would be
	 driven by end-user demand and greatly improved last-mile connectivity.  As we have
	 argued, new applications (the web, file-sharing services, etc.) have in the past
	 periodically created large new traffic demands.  Furthermore, these demands have arisen
	 without abrupt technology changes.  What the new applications might be is difficult to
	 predict.  There are many possible applications that could utilize high-quality video,
	 but we have yet to see enough last-mile connectivity to support them.  In South Korea,
	 where the penetration of "broadband" to the home is more widespread than in the United
	 States, networked gaming applications have become a significant driver of traffic.
	 Whatever the source, it seems plausible that great increases in demand will continue to
	 punctuate the future.  On the last-mile front, upgrading capacity is likely to prove
	 expensive, but is certainly technically feasible.  Let us assume there is great demand
	 from end users for improved connectivity (two orders of magnitude), and that end users
	 are willing to pay for this access connectivity into their homes and businesses.
	 Increasing end-user bursting capacity will increase the potential for drastic changes in
	 traffic patterns.
         
	 If such a scenario should take place, carriers will be faced with the task of scaling
	 their networks, requiring increases in both transmission capacity and switching capacity.
	 Predominant traffic patterns may also shift, requiring capacity in new places.  The
	 carriers will presumably price data services to cover the expense of this new
	 infrastructure, and will make an effort to match increases in traffic demand with
	 increases in capacity.
         
	 So what might go wrong?  As the carriers attempt to increase capacity, they will (as they
	 have in the past) try to avoid building in excessive margins of spare capacity.  But
	 predicting where capacity is needed, and how much, may prove difficult.  There are many
	 unknown variables, and they have the potential to swing rapidly.  How quickly will
	 traffic demand grow?  How will traffic patterns change?  Will new applications behave
	 responsibly?  How will the ratio of capacity-and-demand-at-the-edge to
	 capacity-required-in-the-core change?  How much will it cost to increase capacity in the
	 core?  As our scenario unfolds, let us assume that, because these variables are so hard
	 to predict, growth in demand and growth in core capacity occasionally fall out of kilter,
	 so that demand bumps up against capacity, and large parts of the core of the Internet
	 operate for weeks or perhaps months at a time at or near capacity.
         
         
         Now the routing problems set in
         
	 Imagine the problems a largely saturated core would cause.  BGP provides no mechanism for
	 routing around congestion.  Networks might find themselves effectively isolated from each
	 other, even if, through proper load balancing, congestion-free routes would be available.
	 High-priority traffic would fare no better.  BGP itself might have difficulty
	 functioning.  Manual attempts to reduce congestion through BGP configuration changes
	 would increase the risk of routing outages.
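The congestion-blindness described above amounts to the following: BGP's selection considers attributes such as AS-path length but never link load, so it keeps a saturated short path even when an idle longer one exists.  The two-route comparison below is a deliberately minimal sketch with hypothetical utilization figures, not a model of any real topology.

```python
# Sketch of congestion-oblivious route selection: BGP-style choice by
# AS-path length vs. a hypothetical congestion-aware choice.

routes = [
    # AS-path length, and (invisible to BGP) utilization of the path's
    # tightest link.  Both values are hypothetical.
    {"as_path_len": 2, "utilization": 0.99},  # short but saturated
    {"as_path_len": 4, "utilization": 0.20},  # longer but nearly idle
]

def bgp_choice(routes):
    """BGP-style selection: the shortest AS path wins; load is invisible."""
    return min(routes, key=lambda r: r["as_path_len"])

def congestion_aware_choice(routes):
    """What an overlay network, or a future protocol, might do instead."""
    return min(routes, key=lambda r: r["utilization"])

print(bgp_choice(routes)["utilization"])               # 0.99
print(congestion_aware_choice(routes)["utilization"])  # 0.2
```

With spare capacity everywhere, the two policies perform about the same; only when the short path saturates does the difference matter, which is exactly the argument of this paper.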
         
         
         Directions for future research
         
	 The above discussion suggests a number of directions for future research.  To ward off
	 problems in the short to medium term, we should further improve our understanding of how
	 the Internet currently operates so that we can make better short-term predictions.  We
	 should analyze the behavior of the Internet with a saturated core, and determine what can
	 be done using the current protocols and practices to alleviate the problems that would
	 arise.  Longer term, we need to replace BGP and most likely the interior protocols as
	 well, and consider modifying the Internet architecture too (as suggested by Zhang et al.
	 [MZ03], and surely many others).  Of course, replacing a universally adopted protocol
	 like BGP is no easy task, but it seems risky to continue with a protocol that is not
	 designed to perform well in extreme situations.  Performance optimizations must be
	 integral to such a protocol.  It is difficult to design, tune, or improve protocols or
	 build networks, however, without a good understanding of how networks operate in
	 practice.  Hence measurability should be a goal as well (as suggested by Varghese and
	 others).  Most importantly, we should decide how we want the Internet to behave in the
	 future, and build accordingly.
          
         
         References
         
	 [ASS03] A. Akella, S. Seshan, and A. Shaikh, An Empirical Evaluation of Wide-Area Internet
	 Bottlenecks, in Proceedings of the First ACM Internet Measurement Conference, October 2003,
	 to appear.
         
	 [C03] D. Cheriton, The Future of the Internet: Why it Matters, Keynote Address (SIGCOMM
	 2003 Award Winner), SIGCOMM 2003 Conference on Applications, Technologies, Architectures
	 and Protocols for Computer Communication, September, 2003.
         
         [C03b] G. Cybenko, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.
         
         [HM00] S. Halabi and D. McPherson, Internet Routing Architectures, second edition, Cisco Press, 2000.
         
         [M03] B. M. Maggs, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.
         
	 [NRC02] The Internet Under Crisis Conditions: Learning from September 11, National
	 Research Council, Washington, DC, 2002.
         
	 [SGGL02] S. Saroiu, K. Gummadi, S. D. Gribble, and H. M. Levy, An Analysis of Internet
	 Content Delivery Systems, in Proceedings of the Fifth Symposium on Operating Systems Design
	 and Implementation, December, 2002.
         
	 [MF02] O. Maennel and A. Feldmann, Realistic BGP Traffic for Test Labs, in Proceedings of
	 the 2002 SIGCOMM Conference on Communications Architectures and Protocols, August, 2002.
         
         [E78] P. Ehrlich, Farmers' Almanac, 1978.