Workshop on Internet Routing Evolution and Design (WIRED)

October 7-8, 2003
Timberline Lodge, Mount Hood, Oregon, USA

PDF position statements:
Olivier Bonaventure (UCL Belgium)
Nick Feamster (MIT) (slides)
Vijay Gill (AOL) slides
Ratul Mahajan (U. Washington) (slides)
Aman Shaikh (UCSC/AT&T) (slides position statement, slides panel)
Lakshmi Subramanian (slides)
Lan Wang (UCLA) (slides)

Position statement of

Aditya Akella

(CMU)





On the Effects of the Wide-Spread Deployment of Route Control Products
and Overlay Routing Services (Aditya Akella, CMU)
===============================================================

Recent years have seen route control and overlay routing products that
allow users and end-networks to select wide-area paths for their
transfers in a more informed manner. For example, multihomed
subscribers at the edge of the network are increasingly employing
route control products (e.g., RouteScience's "Path
Control"). Similarly, customers of Akamai's SureRoute service receive
access to a large, diverse overlay network to route traffic on. The
primary motivation for these products is to provide end-network-based
mechanisms for optimizing wide-area performance and reliability. While
the deployment of these products and services is not very widespread
today, we expect it to grow rapidly over the coming years.

At the same time, the deployment of such route control mechanisms has
given rise to concerns about their impact on the general well-being
(e.g. the stability of routing and network load) of the network. For
this reason, the questions below are critical to our understanding of
where the state-of-the-art in end-to-end routing and route selection
lies today and where it is headed in the foreseeable future:

- What is the impact of the deployment of route control mechanisms and
  services on the operation of ISP networks and on the efficient
  functioning of the Internet as a whole?

- Would these products cause route or traffic instability in the
  Internet and if so, to what extent?

- What new mechanisms do we need to put in place to counter the
  potential ill-effects?

These questions can be addressed via a combination of measurement and
analysis. The first step here is to accurately measure prevalent
end-network practices for achieving intelligent route control and then
build models for such end-network behavior.  It is also crucial to
understand, in general, what the best end-network strategies are for
improving performance and resilience. This may help influence (and
possibly provide models for) future product design too.

The next step is to study the impact that both limited and wide-spread
deployment of route control products can have on network
operation. Since these products are not overly popular amongst
end-networks today, this question cannot be answered using traditional
measurement-based approaches. However, modeling, simulation and
analysis could give us the answers we are looking for. One useful tool
is game theory. The interaction between various intelligent
end-networks and the Internet can be modeled as a game in which the
end-networks are selfish agents trying to individually maximize a
local goal, such as observed performance. The models for end-network
behavior constructed above could prove very useful in such an
analysis.
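
To make the game-theoretic framing concrete, the toy sketch below (not part
of the original statement; every parameter is invented) simulates multihomed
end-networks that repeatedly move their traffic to whichever provider link
currently offers the lowest latency, with latency assumed to grow linearly
with load. Iterating this greedy best-response dynamic shows whether selfish
route selection settles into an equilibrium or keeps shifting traffic around.

    # Toy best-response dynamics for selfish route control (illustrative only).
    # Each end-network repeatedly moves its traffic to the provider link that
    # currently looks best; link latency is assumed to grow linearly with load.
    import random

    NUM_NETWORKS = 20                               # multihomed end-networks
    LINKS = {"A": 10.0, "B": 12.0, "C": 15.0}       # base latency per link (ms)
    LOAD_PENALTY = 1.5                              # extra ms per unit of load

    def latency(link, load):
        """Latency observed on a link carrying 'load' units of traffic."""
        return LINKS[link] + LOAD_PENALTY * load

    def best_response(choices):
        """One round of sequential best responses by all end-networks."""
        load = {l: sum(1 for c in choices if c == l) for l in LINKS}
        new_choices = []
        for current in choices:
            load[current] -= 1              # pretend our traffic is not placed yet
            best = min(LINKS, key=lambda l: latency(l, load[l] + 1))
            load[best] += 1
            new_choices.append(best)
        return new_choices

    choices = [random.choice(list(LINKS)) for _ in range(NUM_NETWORKS)]
    for step in range(10):
        new = best_response(choices)
        moved = sum(1 for a, b in zip(choices, new) if a != b)
        print("round %d: %d end-networks shifted their traffic" % (step, moved))
        if moved == 0:
            break                           # a (pure) equilibrium has been reached
        choices = new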

If the above analysis shows that the deployment of route control does
not negatively impact the stability of routes, network traffic, or the
efficient operation of the network as a whole, then we need
not be too concerned about the proliferation of route control
products. If, on the other hand, the analysis shows that these
products can have a negative impact on how well the network functions,
then we may have to work on measures to counter the
ill-effects. Stated otherwise, "aggressive" end-network behavior must
be sufficiently penalized and thereby discouraged.

One way to achieve the negative incentives described above is to
design novel pricing schemes (which may involve rewriting SLAs) to
ensure that end-networks offer somewhat fixed, predictable load to
their provider networks. The SLAs could be coupled with policing
schemes at the ingresses of ISP networks which could, for example,
rate-limit traffic or drop packets to discourage a particular choice
of routes made by the end-network. Such schemes could help strike the
right balance between the end-networks' attempts to improve
performance and resilience, and the carriers' goal of ensuring stable
traffic and routes, by factoring in economic benefit as the key
incentive for socially conformant behavior.
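
As one purely illustrative form of the ingress policing mentioned above, the
sketch below implements a token-bucket policer whose rate and burst size
would come from a hypothetical SLA; out-of-profile packets are dropped or
remarked, making bursty, route-shifting behavior economically unattractive.
It is a sketch of a possible mechanism, not a description of any deployed
scheme or vendor feature.

    # Minimal token-bucket policer for an SLA-based ingress (illustrative only;
    # the rate and burst values below are invented, not taken from a real SLA).
    class TokenBucket:
        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0       # refill rate in bytes per second
            self.capacity = burst_bytes      # largest burst the SLA tolerates
            self.tokens = float(burst_bytes)
            self.last = 0.0

        def conforms(self, now, packet_bytes):
            """True if the packet fits the contracted profile, else drop/remark."""
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if packet_bytes <= self.tokens:
                self.tokens -= packet_bytes
                return True
            return False

    # Police a hypothetical 2 Mbit/s contract with a 100 kB burst allowance.
    policer = TokenBucket(rate_bps=2000000, burst_bytes=100000)
    dropped, t = 0, 0.0
    for pkt_bytes in [1500] * 200:           # a burst of 200 full-size packets
        if not policer.conforms(t, pkt_bytes):
            dropped += 1                     # out of profile: drop or remark it
        t += 0.0001                          # packets arrive 100 microseconds apart
    print("dropped/remarked %d of 200 packets in the burst" % dropped)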

To summarize, it is unclear yet whether widespread use of route
control products will disturb the stability of the operation of the
Internet. This issue should be further explored by first identifying
the various ways in which end networks can impact stability, then
understanding the extent of the ill-effects, and finally designing
pricing-based mechanisms to contain the ill-effects.


Position statement of

Anja Feldmann

(TU Munich)






Internet Routing: What is the Problem and how do we evaluate it?

Most people seem to believe that Internet routing is to some degree
broken. Still, pinpointing what is broken seems to be rather difficult.
One key contributing characteristic of routing is that there are lots
of scenarios, each with its own individual challenges. For example:

    - All major ISPs have to determine their routing policies,
      implement these in their operations environment, and then
      realize them using the various components of their network and
      management infrastructure, including routers and databases.
    - Any multihomed customer has to worry about what is
      necessary to be able to do multihoming.
    - An Internet researcher may not want to be bound by the
      commercial relationships between the providers and may
      therefore establish an overlay network.
    - ....

Internet routing is a global optimization problem within the control
plane of the Internet. So far it has been solved in a piecemeal
fashion. Folks are deriving solutions that are "optimal"/"workable"
for their specific situations. Nevertheless the current Internet
routing architecture works rather well. It is just questionable if it
is an optimal solution to the global optimization problem. But where
are the shortcomings of this solution, and do they actually matter?
Judging from the interest of ISPs, vendors, and researchers, routing-related
questions do matter.  On the other hand most users may very well
think that the problem does not matter.  Most packets reach their
destination most of the time in a reasonable time. The complexity is
hidden from the user and most of the time the users are reasonably
happy.

But what happens if packets are not properly delivered? At this point
we realize, among other problems, that the Internet control plane is not
capable of debugging itself, nor is it providing humans with sufficient
details for debugging it. This can only be changed with a measurement
infrastructure embedded within the routing architecture.  Furthermore,
some of the problems are due to human errors, bad configuration
practices, and missing consistency checks within the control plane. These
call not just for local checks but for global controls built into
the architecture.

But there are also more fundamental questions regarding the Internet
routing architecture:
  - Should routing be static or dynamic based on the amount of traffic?
  - Should routing be hierarchical or support virtual overlays?
  - Should the routing be controlled by the end-user or the ISP?
If there is more than one routing layer, what are their interactions
and how do they interact with the user workload and the user
performance requirements?
 
This last question points out that one has to consider many sometimes
conflicting tradeoffs in order to design an alternative routing
strategy:
   - local control vs. global fault analysis
     (which information is provided, where is it processed, who can
      access it, and who can use it?)
   - fast updates vs. stability
     (This question has two aspects:
       - for static/dynamic routing: when is a link faulty?
       - for dynamic routing: what timescale is appropriate given a workload?)
   - how can we enable choices of algorithm and parameters while
     maintaining simplicity?
   - how does one find a tradeoff between performance/resilience/cost?  
   - how does one formulate policies and check them, both for validity and
     against intruders, across various databases?
   - simplicity vs. features
   - end-user vs. operator control

In summary, I believe that we neither understand the basic design space
in which to develop an Internet routing architecture nor the
evaluation criteria with which to judge our success or failure nor the
process of realizing the architecture in a controllable manner nor a
process for checking its operation.


Position statement of

Bruce Maggs

(CMU/Akamai)





Our Routing Problems Have Not Yet Begun

Bruce M. Maggs
Computer Science Department
Carnegie Mellon University
and
Akamai Technologies

Abstract

The current routing protocols have many well-known deficiencies.  Yet the Internet as a
whole has proven to be remarkably stable, and core capacity has been scaling with demand,
indeed has perhaps outpaced demand, so that end users are seeing better performance today
than ever.  This paper argues that because of this spare capacity, the consequences of the
flaws in the protocols have not yet been truly experienced.  It then argues that short-term
reversals in the ratio of capacity to demand are plausible, and that these reversals might
engender serious routing problems.

Successes and failures of BGP

The most commonly cited problems with the routing protocols involve BGP.  (See [HM00] for a
thorough introduction.) BGP, which governs the routes taken by datagrams that travel
between different autonomous systems, provides no effective mechanisms for guaranteeing
quality of service or optimizing performance (in terms of latency and throughput).  Support
for load-balancing, adapting to rapid changes in traffic patterns, and filtering malicious
traffic range from minimal to none.  Furthermore, in practice, routing policies may be
influenced by financial considerations, and the manual entry of router configuration data
is common.

Perhaps it is surprising, then, that there have been only a few isolated incidents in which
major network outages have occurred.  Human configuration has been to blame for several.
For example, in the summer of 2000, a BGP configuration error led routers at a Level3
facility in San Diego to advertise short routes to the rest of the Internet, temporarily
diverting an unsupportable traffic load to this facility.  Later, in June 2001, Cable and
Wireless intentionally and abruptly refused to peer with PSINet (for financial reasons),
isolating many users on PSINet.  (To those who advocate fully automatic configuration,
however, it would be wise to remember the adage, "To err is human, but to really foul
things up requires a computer." [E78])

But the success stories outweigh these incidents. Although it is early to assess the impact
of the recent large-scale power outage in the Eastern portion of the United States, there
are few initial reports of core network outages or even web-site infrastructure outages.  A
National Research Council report on the impact of the September 11, 2001, attacks in the
United States [NRC02] also showed that the routing protocols adjusted properly to the
physical destruction of network infrastructure in New York City, and the Internet as a
whole continued to perform well (although certain news-oriented web sites were unable to
satisfy demand).  Addressing BGP more specifically, although certain worms such as "Code
Red" and "Slammer" (or "Sapphire") have generated enough malicious network traffic to
distract routers from their primary functions and disrupt BGP TCP connections (forcing
table retransmissions and resulting in "churn") [C03b,M03], none of these worms have caused
widespread route instability.  Perhaps most interestingly, many BGP routers throughout the
world were patched and restarted one night in the spring of 2002, after the release of a
Cisco security patch, and yet network routing was not disrupted.

Operating in the dark

Before arguing that circumstances may soon arise in which the weaknesses of BGP may begin
to have more serious consequences, however, it is important to first observe that our
ability to predict the future behavior of network traffic and routing behavior is limited.
Indeed, it may be fair to say that we cannot even accurately characterize the behavior of
network traffic today, except to say that it has been known to change rapidly and
drastically.  Examples include the rapid rise in http traffic after the introduction of the
Mosaic and Netscape browsers, the recent boom in "SPAM" email, the explosion of
file-sharing services, and the heavy traffic loads generated by worms and viruses and the
corresponding patches.  (More than half of Akamai's top-ten customers, ranked by total
number of bytes served, use Akamai to deliver virus and worm signatures and software
updates.)  This is not to say that there have not been effective approaches to
understanding network traffic.  The study by Saroiu et al. [SGGL02], for example, paints a
detailed picture of the types of flows entering and exiting the University of Washington's
network, and points out recent growth in traffic attributed to file-sharing services.  But
this study may not be representative of Internet traffic at large.  For example, it fails
to capture VPN traffic between enterprise office facilities (and many other sorts of
traffic).  BGP behavior has also been studied extensively.  As a well-done representative
of this type of work, Maennel and Feldmann [MF02] study BGP routing instabilities.

But it would be difficult to find consensus among networking experts on answers to the
following sorts of questions.

1. Where (if anywhere) is the congestion in the Internet?

2. How much capacity does the Internet have, and how fast is it growing?

3. How much traffic does the core of the Internet carry today, and what does it look like?

4. How fast is network traffic growing?

5. What will traffic patterns look like five years from now?

6. Can we scale the network to support the demands of users five years from now?

7. How much does it and will it cost to increase network capacity?

8. Will stub networks soon be employing sophisticated traffic engineering mechanisms on
their own, e.g., those based on multihoming and overlay routing? What impact might these
techniques have?

9. What about content delivery networks?  What fraction of the traffic are they carrying?
What is the impact of the trick of using DNS to route traffic?  

These questions have, of course, been studied.  Regarding the first question, the
"conventional wisdom" has been that congestion occurs primarily in "last mile" connections
to homes and enterprises.  Cheriton [C03] and others have argued that the abundance of
"dark fiber" in the United States will provide enough transmission capacity for some time
to come.  A recent study by Akella et al. [ASS03], however, found that up to 40% of the
paths between all pairs of a diverse set of hosts on the Internet had at most 50Mbps spare
capacity.  These "bottlenecks" were most commonly seen on tier-two, -three, and -four
networks, but 15% appeared on tier-one networks.  The study indicates that regardless of
fiber capacity, there is already congestion in the core.  Perhaps router capacity is a more
limited resource.

The second question has been addressed by Danzig, who has periodically estimated network
capacity and traffic load.  His estimates of cross-continental capacity are surprisingly
low.


The coming crises?

Despite the caveats about our understanding of the state of the network today, let us make
the assumptions that the core of the Internet is, in many places, running at close to
capacity, and (more easily supported) that the last-mile remains a bottleneck for many end
users.  How might a routing crisis ensue?  Suppose that there is a rapid increase (perhaps
two orders of magnitude) in the traffic generated by end users.  Such a scenario would be
driven by end user demand and greatly improved last-mile connectivity.  As we have argued,
new applications (the web, file-sharing services, etc.) have in the past periodically
created large new traffic demands.  Furthermore, these demands have arisen without abrupt
technology changes.  What the new applications might be is difficult to predict.  There are
many possible applications that could utilize high-quality video, but we have yet to see
enough last-mile connectivity to support them.  In South Korea, where the penetration of
"broadband" to the home is more widespread than in the United States, networking gaming
applications have become a significant driver of traffic.  Whatever the source, it is seems
plausible that great increases in demand will continue to punctuate the future.  On the
last-mile front, upgrading capacity is likely to prove expensive, but is certainly
technically feasible.  Let us assume there is great demand from end users for improved
connectivity (two orders of magnitude), and that end users are willing to pay for this
access connectivity into their homes and businesses.  Increasing end-user bursting capacity
will increase the potential for drastic changes in traffic patterns.

If such a scenario should take place, carriers will be faced with the task of scaling
their networks, requiring increases in both transmission capacity and switching capacity.
Predominant traffic patterns may also shift, requiring capacity in new places.  The
carriers will presumably price data services to cover the expense of this new
infrastructure, and will make an effort to match increases in traffic demand with
increases in capacity.

So what might go wrong?  As the carriers attempt to increase capacity, they will (as they
have in the past) try to avoid building-in excessive margins of spare capacity.  But
predictions about where capacity is needed, and how much, may prove difficult.  There are
many unknown variables, and they have the potential to swing rapidly.  How quickly will
traffic demand grow?  How will traffic patterns change?  Will new applications behave
responsibly?  How will the ratio of capacity-and-demand-at-the-edge to
capacity-required-in-the-core change?  How much will it cost to increase capacity in the
core?  As our scenario unfolds, let us assume that, due to the difficulties in predicting
these variables, occasionally growth-in-demand versus growth-in-core-capacity become
out-of-kilter, so that demand bumps up against capacity, and large parts of the core of the
Internet operate for weeks or perhaps months at a time at or near to capacity.


Now the routing problems set in

Imagine the problems a largely saturated core would cause.  BGP provides no mechanism for
routing around congestion.  Networks might find themselves effectively isolated from each
other, even if, through proper load balancing, congestion-free routes are available.
High-priority traffic would fare no better.  BGP itself might have difficulty functioning.
Manual attempts to reduce congestion through BGP configuration would increase the risk
of routing outages.


Directions for future research

The above discussion suggests a number of directions for future research.  To ward off
problems in the short-to-medium term, we should further improve our understanding of how
the Internet currently operates so that we can make better short-term predictions.  We
should analyze the behavior of the Internet with a saturated core, and determine what can
be done using the current protocols and practices to alleviate the problems that would
arise.  Longer term, we need to replace BGP and most likely the interior protocols as well,
and consider modifying the Internet architecture too (as suggested by Zhang et al. [MZ03],
and surely many others).  Of course replacing a universally adopted protocol like BGP is no
easy task, but it seems risky to continue with a protocol that is not designed to perform
well in extreme situations.  Performance optimizations must be integral to such a protocol.
It is difficult to design, tune, or improve protocols or build networks, however, without a
good understanding of how networks operate in practice.  Hence measurability should be a
goal as well (as suggested by Varghese and others).  Most importantly, we should decide how
we want the Internet to behave in the future, and build accordingly.
 

References

[ASS03] A. Akella, S. Seshan, and A. Shaikh, An Empirical Evaluation of Wide-Area Internet
Bottlenecks, in Proceedings of the First ACM Internet Measurement Conference, October 2003,
to appear.

[C03] D. Cheriton, The Future of the Internet: Why it Matters, Keynote Address (SIGCOMM
2003 Award Winner), SIGCOMM 2003 Conference on Applications, Technologies, Architectures
and Protocols for Computer Communication, September, 2003.

[C03b] G. Cybenko, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.

[HM00] S. Halabi and D. McPherson, Internet Routing Architectures, second edition, Cisco Press, 2000.

[M03] B. M. Maggs, Presentation at DARPA Dynamic Quarantine Industry Day, March 2003.

[NRC02] The Internet Under Crisis Conditions: Learning from September 11, National Research
Council, Washington, DC, 2002.

[SGGL02] S. Saroiu, K. Gummadi, S. D. Gribble and H. M. Levy, An Analysis of Internet
Content Delivery Systems, in Proceedings of the Fifth Symposium on Operating Systems Design
and Implementation, December, 2002.

[MF02] O. Maennel and A. Feldmann, Realistic BGP Traffic for Test Labs, in Proceedings of
the SIGCOMM 2002 Conference on Applications, Technologies, Architectures and Protocols for Computer Communication, August 2002.

[E78] Paul Ehrlich, Farmers' Almanac, 1978.


Position statement of

Cengiz Alaettinoglu

(Packet Design)





Link-state routing convergence and stability: is there a trade off?
Cengiz Alaettinoglu

New service-sensitive applications require increasing levels of network
availability. Current IGP restoration times are in seconds, much better
than 10s of seconds a few years ago. However, this is still not acceptable
for many service-sensitive applications such as VoIP or online gaming.

In theory, link-state routing restoration times can be as fast as a single
SPF computation time (100s of microseconds to a few milliseconds) plus some
scheduling delay. However, such an implementation may not be
practical. Instead, implementations which achieve restoration within
propagation-delay time frames (10s to a few 100s of milliseconds) are within
reach today.

Why is it, then, that current IGP deployments cannot achieve such
convergence times? Because, and for very good reasons, there is a
misconception of a trade-off between IGP convergence times and
stability. In order to ensure stability, there are timers that limit the
effect of external instability on the system. These timers definitely
stand in the way of fast convergence. However, while trying to tune down these
timers to achieve fast convergence in the past, several ISPs have
experienced network-wide meltdowns.

If so, why is this trade-off a misconception? Because it is not a trade-off
between convergence and stability in general; it only exists for the
current IGP implementations. It is possible to avoid instability by
slowing down the convergence only during link recovery. Further protection
can also be provided by damping the SPF process.

Vendors have attempted to provide such protection by implementing adaptive
timers that limit how often the SPF process can be run. However, since
these algorithms were implemented without having realistic IGP
measurements, for the IGP deployments we studied, they always delayed the
routing convergence.

Thus, what is needed to achieve fast convergence without sacrificing
stability is good damping algorithms which can separate unstable
components from the stable components and tune themselves to the
conditions of the network. This can only be done with careful measurement
and analysis of IGP routing protocols. What will be harder to come by is winning
back the trust of ISPs once such algorithms have been implemented.
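
One plausible shape for such a damping algorithm, sketched below with
entirely invented thresholds, is an exponentially decaying per-link penalty:
the first failure of a stable link triggers an immediate SPF, while a link
that keeps flapping accumulates penalty and is suppressed until it calms
down. A real algorithm would have to tune these parameters from exactly the
kind of IGP measurements argued for above.

    # Sketch of per-link event damping for an IGP (illustrative; not a vendor
    # algorithm). A stable link's first failure triggers an immediate SPF run,
    # while a flapping link accumulates penalty and its events are suppressed.
    import math

    HALF_LIFE = 30.0           # seconds for the penalty to decay by half (assumed)
    PENALTY_PER_EVENT = 1000.0
    SUPPRESS_THRESHOLD = 2500.0
    REUSE_THRESHOLD = 800.0

    class LinkDamping:
        def __init__(self):
            self.penalty = 0.0
            self.last_event = 0.0
            self.suppressed = False

        def event(self, now):
            """Record a link up/down event; return True if SPF should run now."""
            decay = math.exp(-math.log(2.0) * (now - self.last_event) / HALF_LIFE)
            self.penalty = self.penalty * decay + PENALTY_PER_EVENT
            self.last_event = now
            if self.penalty >= SUPPRESS_THRESHOLD:
                self.suppressed = True
            elif self.penalty <= REUSE_THRESHOLD:
                self.suppressed = False
            return not self.suppressed

    link = LinkDamping()
    print("single failure, react at once:", link.event(now=0.0))
    for t in range(1, 6):                  # the same link now flaps every second
        print("flap at t=%d s, run SPF:" % t, link.event(now=float(t)))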
          


Position statement of

Christophe Diot

(Intel)





Service availability in IP networks

One of the main challenges facing networking researchers is to improve
Internet service availability. Availability has a different meaning
in different contexts. For example, for VoIP, it means that outages or
loss bursts should not last more than 250ms. For a web user, it means
that the server is available to serve a request (with few constraints on
the lossiness of links).

Internet research has not made much effort to understand this issue. To
start with, there is not even a decent definition of availability for
Internet services. Using the POTS definition is wrong, and its notion
of "5 nines" does not mean anything in a packet network. Since
restoration is now done at the IP level, routing protocols have a huge
impact on service availability:

- IGP restoration of service is in the order of seconds. Convergence is
even longer. Most link failures are very short, yet they significantly
impact service availability.
- Interconnection provided by EGPs is flaky. There is no way to
guarantee that packets will be forwarded between two ASes if their ISPs
don't have commercial agreements. For example, a 911 VoIP call from a GBX
customer would not be forwarded by UUNET if the call receiver is not a
UUNET customer.

In addition, availability is limited by practices such as NAT, and
relies on services such as the DNS.

In summary, routing protocols should be designed in such a way that they
maximize service availability in order to allow mission-critical
applications to be deployed on packet networks. At the IGP level,
reducing the convergence time together with simple traffic engineering
should help. More significant changes are required at the EGP level, and
evolving BGP might not be an acceptable option. Instead, we have to
imagine now a routing architecture that will maximize service
availability in the Internet of the future. Therefore, the first step is
to agree on a definition of service availability.

Details are available in S. Bhattacharyya, G. Iannaccone, A. Markopoulou, C.-N. Chuah, C. Diot, "Service availability in IP networks", rejected from HotNets II; also Sprintlabs Research Report RR03-ATL-071888, July 2003.

Position statement of

David Ward

(Cisco)





***************************
There is only gold.

The increase of competition between IP Service Providers (SPs)
together with the heightened importance of IP to business operations
has led to an increased demand and consequent supply of IP services
with tighter Service Level Agreements (SLAs) for IP performance.

The IP technical community has developed a set of technologies that
enable IP networks to be engineered to support tight SLA commitments:

- Differentiated Services.  The Differentiated Services Architecture
allows differentiated delay, jitter and loss commitments to be
supported on the same IP backbone for different types or classes of
service.
- Faster IGP convergence.  New developments in Interior Gateway
Protocols (IGPs) allow for faster convergence upon link or
node failure, hence enabling higher service availability to be
offered.
- MPLS Traffic Engineering.  MPLS Traffic Engineering
(Diffserv-aware or not) introduces constraint-based routing and
admission control to IP backbones.  This allows optimum use to be
made of the installed backbone bandwidth capacity, or conversely
allows the same level of service to be offered for less capacity.
It can also be used to ensure that the amount of low-jitter traffic
per link does not exceed a specified maximum.
- MPLS Traffic Engineering Fast Reroute.  MPLS Traffic Engineering
Fast Reroute is an IP protection technique that enables connectivity
to be restored around link and node failures in a few tens of
milliseconds.

For an SP IP service, the SLA commitments are generally based on
delay, jitter, packet loss rate, throughput and availability.

What has been seen is that to deploy new IP-based services using the
tools that are available, an SP must build the entire network with
the highest level of service in mind, or build complex and expensive
multiple topologies or overlay networks. As a result, SPs cannot
offer degraded services for several reasons:

- The network is only engineered for the highest level of service;
there often is no degraded service to offer
- Customers have been trained and become accustomed to the highest
level of service and have become intolerant of outages or 'issues.'
- The tools to separate topologies and have different failure
domains and characteristics are just emerging in the protocol
specifications
- Building overlay networks or redundant networks is too expensive
- Stockpiling spares that have different capabilities is too
expensive and the devices that can 'run all the service levels' are
also expensive


Therefore the question arises: is there a demand for degraded
services, or just for cheaper service with the same SLA requirements? It
appears that building a bigger and better Internet means that the
deployment model is engineered toward 'gold service for all.'


Position statement of

David Meyer

(Cisco and Routeviews)





	Does the Complexity of the Internet Routing System Matter?
	                   (and if so, why)
	----------------------------------------------------------


	The advent of the MP_REACH_NLRI and MP_UNREACH_NLRI
	attributes, combined with the resulting generalization to
	the BGP framework (i.e.,  consider the use of extended
	communities [EXTCOMM] to provide route distinguishers
	and/or route targets [RFC2547BIS]) has created the
	opportunity to use BGP to transport a wide variety of
	features and their associated signaling (the combination
	of a BGP feature and its associated signaling is
	sometimes called an "application"). Examples include flow
	specification rules [FLOW], auto-discovery mechanisms for
	Layer 3 VPNs [BGPVPN], and virtual private LAN services
	[VPLS]. However, the use of BGP as a generalized
	feature transport infrastructure has generated a great
	deal of discussion in the IETF community [IETFOL].

	This debate has focused on the potential trade-offs
	between the stability and scalability of the Internet
	routing system, and the desire on the part of service
	providers to rapidly deploy new services such as IP VPNs
	[RFC2547BIS]. The debate has recently intensified due to
	the emergence of a new class of services that use the BGP
	infrastructure to distribute what may be considered
	"non-routing information". Examples of such services
	include the use of the BGP infrastructure as an
	auto-discovery mechanism for Layer 3 VPNs [BGPVPN] and
	the virtual private LAN services mentioned above.

	The problem, then, can be framed in terms of how we think
	about the deployed BGP infrastructure. In particular, the
	various positions can be summarized as follows:

	o BGP is a General Purpose Transport Infrastructure

	  The General Purpose Transport Infrastructure position
	  asserts that BGP is a general purpose feature and
	  signaling transport infrastructure, and that new
	  services can be thought of as applications built on
	  this generic transport. Proponents of this position see
	  the issue as not whether the attributes (features and
	  signaling) that need to be distributed are part of some
	  particular class (routing, in this case), but rather
	  whether the requirements for the distribution of these
	  attributes are similar enough to the requirements for
	  the distribution of inter-domain routing
	  information. Hence, BGP is a logical candidate for such
	  a transport infrastructure, not because of the
	  ("non-routing") information distributed, but rather due
	  to the similarity in the transport requirements. There
	  are other operational considerations that make BGP a
	  logical candidate, including its close to ubiquitous
	  deployment in the Internet (as well as in intranets),
	  its policy capabilities, and operator comfort levels
	  with the technology.

	o BGP is a Special Purpose Transport Infrastructure

	  The proponents of the other position, namely, that the
	  BGP infrastructure was designed specifically and
	  implemented  to transport "routing information", are
	  concerned that the addition of various other
	  non-routing applications to BGP will destabilize the
	  global routing system. The argument here is two-fold:
	  First, there is the concern that the plethora of new
	  features being added to BGP will cause software quality
	  degrade, hence destabilizing the global routing
	  system. This position is based upon well understood
	  software engineering principles, and is strengthened
	  long-standing experience that there is a direct
	  correlation between software features and bugs
	  [MULLER1999]. This concern is augmented by the fact
	  that in many cases, the existence of the code for these
	  features, even if unused, can also cause
	  destabilization in the routing system, since in many
	  cases these bugs cannot be isolated. 

	  A second concern is based on complexity arguments,
	  notably that the increase in complexity of BGP and the
	  types of data that it carries will inherently
	  destabilize the global routing system. This is based on
	  several different lines of reasoning, including the
	  Simplicity Principle [RFC3439], and the concern that
	  the interaction of the dynamics and deployment
	  practices surrounding the simplest form of BGP, IPv4
	  BGP, is poorly understood. Finally, a related concern
	  is that the addition of these non-routing data types
	  will affect convergence and other scaling properties of
	  the global routing system.


	The question is, then, what is the effect on the global
	routing system of using the BGP distribution protocol to
	transport arbitrary data types, versus the effect in
	terms of the additional cost (e.g., in protocol
	development, code, and operational expense) associated
	with not utilizing the mechanisms already present in BGP?
	More importantly, does it matter, and if so, why?



[BGPVPN]        Ould-Brahim, H., E. Rosen, and Y. Rekhter, "Using
                BGP as an Auto-Discovery Mechanism for
                Provider-provisioned VPNs",
                draft-ietf-l3vpn-bgpvpn-auto-00.txt, July,
                2003. Work in Progress.

[EXTCOMM]       Sangli, S., D. Tappan, and Y. Rekhter, "BGP
                Extended Communities Attribute",
                draft-ietf-idr-bgp-ext-communities-06.txt. Work
                in Progress.

[FLOW]          Marques, P, et. al., "Dissemination of flow
                specification rules", 
                draft-marques-idr-flow-spec-00.txt, June,
                2003. Work in Progress.  

[IETFOL]        https://www1.ietf.org/mailman/listinfo/routing-discussion

[MULLER1999]    Muller, R., et al., "Control System Reliability
                Requires Careful Software Installation
                Procedures", International Conference on
                Accelerator and Large Experimental Physics
                Control Systems, 1999, Trieste, Italy.

[RFC2547BIS]    Rosen, E., et. al., "BGP/MPLS IP VPNs", 
                draft-ietf-l3vpn-rfc2547bis-00.txt, May, 2003, 
                Work in Progress.

[RFC3439]       Bush, R. and D. Meyer, "Some Internet
                Architectural Guidelines and Philosophy", RFC
                3439, December, 2002.

Position statement of

Tim Griffin

(Intel)





Tim Griffin 
----------------------------------------------------------
ROUTING POLICY LANGUAGES MUST BE DESIGNED AND STANDARDIZED
----------------------------------------------------------

The following scenario MUST take place within the next few years: 
The Interdomain routing system will enter a state of non-convergence 
that is so disruptive as to effectively bring down large portions of 
the Internet. The problem will be due to unforeseen global interactions of 
locally defined routing policies. Furthermore, no one ISP will have enough 
knowledge to identify and debug the problem.  It will take nearly a week 
to fix and cost the world economy billions of dollars. The world press will 
learn that the internet engineering community had known about this lurking
problem all along.... 

So, we better have a solution! I'll argue that the only way to effectively
solve this problem is to define routing policy languages that are
guaranteed to be globally sane, no matter what local policies
are defined. Then these languages need to be standardized and 
BGP speakers MUST be forced to use them. 

This raises many interesting research problems. Is it possible to 
design such languages? How can we find the right balance between 
local policy expressiveness and global sanity? What exactly do we mean 
by "autonomy" of routing policy?  Do we need additional protocols 
to enforce global sanity conditions? How can we enforce compliance of 
policy language usage? 
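
One known starting point for languages that are globally sane by
construction is to enforce conditions already proven sufficient for
convergence, such as the Gao-Rexford guidelines (prefer customer-learned
routes, and export only customer routes to peers and providers). The sketch
below is a hypothetical checker over a toy policy representation; it
illustrates the kind of constraint a standardized policy language could
enforce, and is not a proposal for the language itself.

    # Hypothetical checker for a toy per-AS policy description (illustrative).
    # It verifies two Gao-Rexford-style conditions that are known to be
    # sufficient for convergence: customer routes get the highest preference,
    # and only customer routes are exported to peers and providers.
    def check_policy(policy):
        """policy = {'local_pref': {rel: int}, 'export': {rel: set of rels}}"""
        errors = []
        lp = policy["local_pref"]
        if not (lp["customer"] > lp["peer"] and lp["customer"] > lp["provider"]):
            errors.append("customer routes must carry the highest local preference")
        for to_rel in ("peer", "provider"):
            leaked = policy["export"].get(to_rel, set()) - {"customer"}
            if leaked:
                errors.append("routes learned from %s must not be exported to a %s"
                              % (", ".join(sorted(leaked)), to_rel))
        return errors

    # Example: a policy that leaks peer-learned routes to a provider is flagged.
    bad = {"local_pref": {"customer": 200, "peer": 100, "provider": 90},
           "export": {"peer": {"customer"}, "provider": {"customer", "peer"}}}
    for problem in check_policy(bad):
        print("policy violation:", problem)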


Position statement of

Jennifer Rexford

(AT&T)





          
Routing Problems are Too Easy to Cause, and Too Hard to Diagnose
================================================================

IP routing protocols, such as OSPF or BGP, form a complex,
highly-configurable distributed system underlying the end-to-end
delivery of data packets.  "Highly configurable" is a nice way of
saying "hard to configure" or "easy to misconfigure," and "distributed
system" is a nice way of saying "hard to understand" or "hard to
debug."  As such, we have a routing system today where a single
typographical error by a human operator can easily disconnect parts of
the Internet, and diagnosing and fixing routing problems remains an
elusive black art.  This is unacceptable for any technology that would
be considered a core communication infrastructure.  I believe that 
the networking research community should devote significant attention
to improving the state of the art in router configuration and network
troubleshooting.

Several factors conspire to make IP router configuration extremely challenging:

- Vendor configuration languages are primitive and low-level, like
assembly language (e.g., a typical router may have ten thousand lines
of configuration commands)

- Routers implement numerous complex protocols (e.g., static routes,
RIP, EIGRP, IS-IS, OSPF, BGP, MPLS, and various multicast protocols)
that have many tunable parameters (e.g., timers, link weights/areas,
and BGP routing policies)

- The routing protocols interact with each other (e.g., "hot-potato"
routing in BGP based on the underlying IGP, use of static routes to
reach the remote BGP end-point, and route injection between protocols)

- Scalability often requires even more complex configuration to limit
the scope of routing information (e.g., OSPF areas and summarization,
BGP route reflectors and confederations, and route aggregation)

- Networks are configured at the element (or router) level, rather than
as a single cohesive unit with well-defined policies and constraints

- Key network operations goals, such as traffic engineering and
security, are not directly supported, requiring operators to tweak the
router configuration in the hope of having the right (indirect) effect
on the network and its traffic

Addressing these complicated problems will require research work in
configuration languages, protocol modeling, and network modeling, and
would hopefully lead to a higher level of abstraction for managing the
configuration of the network as well as tools for configuration
checking and, better yet, automation of configuration from a
higher-level specification of the network goals.  Extensions (or
replacements!) of the routing protocols may also be necessary to
rectify some of these problems.
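
As a small, hypothetical illustration of what network-wide (rather than
per-router) configuration checking could look like, the sketch below takes
an invented, highly simplified description of iBGP sessions and flags any
session configured on only one side; real configurations are of course far
messier than this.

    # Toy network-wide configuration check (the config format is invented).
    # It flags iBGP sessions declared on one router but not on its neighbor,
    # i.e. a check that only makes sense across the network, not per device.
    configs = {
        # router name -> set of iBGP neighbors it is configured to peer with
        "r1": {"r2", "r3"},
        "r2": {"r1"},
        "r3": {"r1", "r4"},   # r4 never configures the session back: a typo?
        "r4": set(),
    }

    def check_ibgp_symmetry(configs):
        problems = []
        for router, neighbors in configs.items():
            for nbr in neighbors:
                if router not in configs.get(nbr, set()):
                    problems.append("%s peers with %s, but %s has no matching session"
                                    % (router, nbr, nbr))
        return sorted(problems)

    for warning in check_ibgp_symmetry(configs):
        print("warning:", warning)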

Detecting, diagnosing, and fixing routing problems are also very
complicated because:

- Routing protocols are hard to configure, making configuration
mistakes very common (see above!)

- Routing protocols do not convey enough information to explain why a
route has changed (or disappeared entirely)

- No authoritative record exists that can identify which routes are
valid (e.g., whether the originating AS is entitled to advertise the
prefix, or whether one AS should be providing transit service from one
AS to another)

- Failures, configuration errors, or malicious acts in remote
locations can affect the path between two hosts

- Reachability problems can arise for other reasons, unrelated to the
routing protocols (e.g., packet filtering or firewalls, MTU mismatches, 
network congestion, and overloaded or faulty end hosts)

- The end-to-end forwarding path depends on the complex interaction between 
multiple routing protocols running in a large collection of networks

- Route filtering and route aggregation (often necessary for scalability) can 
lead to subtle reachability problems, including persistent forwarding loops

- The network does not have much support for active measurement tools
for measuring the forwarding path (i.e., traceroute is very primitive,
and limited in its accuracy and potential uses)

- The Internet topology is not fully known, at the router or the AS
levels (or in terms of AS relationships and policies), and may be
inherently unknowable

Like router configuration, network troubleshooting has received little
attention from the research community, despite its importance to
network practitioners.  Research work in network support for
measurement, extensions to routing protocols to facilitate diagnosis,
and new diagnostic tools would be extremely valuable for improving the
state of the art.


Position statement of

Lixin Gao

(UMass)






Interaction between Control and Data Planes


The widespread use of the Internet as well as its potential for
disruptive effects on both business and society has made the Internet
one of the most important communication infrastructures.
All IP services (including domain name service or DNS, web hosting, and email) 
depend on connectivity that is built on the routing infrastructure. 
Internet traffic has exhibited increasing variability in both the
data plane and the control plane. In the data plane, malicious attacks
such as worm or virus scans can impact the dynamics of
the traffic, while in the control plane, both malicious attacks
and unintentional misconfigurations can impact the stability and 
reliability of the Internet. Router performance under 
variable data and control traffic load is critical for understanding
the robustness and performance of the Internet infrastructure. 


Variability of data traffic:

How does the variability of data traffic impact the performance of routers?
The variability includes packet size, packet interarrival time, and
packet type. Systematic studies of potential variability can further
facilitate the understanding of the variability of the control plane as described
below.


Variability of control plane traffic:

The variability of control plane traffic (generated by both interdomain and
intradomain routing protocols) can be caused by attacks
on either the data plane or the control plane. For example, evidence has shown
that worm traffic has caused BGP session resets or router reboots, which
in turn lead to large variability on the control plane.
Further, unintentional human errors or intentional misconfigurations
can lead to persistent variability on the control plane.
Although large-scale exploitation of routers has not been reported yet,
the potential impact of these attacks can be so large that preventive
measures must be taken in the near future.


Interaction between control and data plane traffic:

  The instability of the control plane can further impact the performance of
the data plane. Updating routing and forwarding tables consumes a large
number of CPU cycles when there is a significant amount of routing update
traffic. This might lead to significant performance degradation of data
forwarding, in particular on low-end routers. Further, to what extent does
the instability of the control plane impact delay and loss on the data plane?
Are loops caused by transient behavior of routing protocols or by misconfiguration?



Position statement of

Z. Morley Mao

(UC Berkeley)





Position statement for WIRED
Z. Morley Mao
UC Berkeley
zmao@eecs.berkeley.edu

How to debug the routing system?  
================================
Problem: 
--------

Today, network operators have very limited tools to debug routing
problems. Only primitive tools such as traceroute and ping are
commonly used to identify existing routing behavior. There is very
little visibility into the routing behavior of other ISPs' networks
from a given ISP's perspective, making it even more difficult to
identify the culprit of any routing anomalies. This also means that it
is difficult to predict the impact any routing policy change has on
the global routing behavior. Oftentimes, routing problems are noticed
only after a customer complains about reachability or severe
degradation of performance. There is a lack of proactive, automated
analysis that detects routing problems at an early stage, as certain
routing problems may initially not be very obvious and may result only
in suboptimal and unintended routes. Diagnosing Internet
routing problems often requires analysis of data from multiple vantage
points.

Proposed solutions:
-------------------
(1) Build routing assertions, so that nothing fails silently.  When a
network operator configures a network, it is important to create a set
of assertions, equivalent to integrity constraints in databases or
assertions in software programs. These capture the expected behavior
of the routing protocols in terms of which routes are allowed, the
resulting attributes of the routes, etc. These constraints can be
checked dynamically by a route monitor.
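
A minimal sketch of what such assertions and the accompanying route monitor
might look like is given below; the assertion format, prefixes, and AS
numbers are invented purely for illustration.

    # Sketch of a route monitor checking operator-written assertions
    # (illustrative). The assertions act like integrity constraints: any
    # received route that violates one is reported instead of failing silently.
    assertions = {
        # prefix -> expected origin AS and a bound on AS-path length (made up)
        "192.0.2.0/24":    {"origin": 64500, "max_path_len": 5},
        "198.51.100.0/24": {"origin": 64501, "max_path_len": 4},
    }

    def check_route(prefix, as_path):
        """Return the list of violated assertions for one received BGP route."""
        expected = assertions.get(prefix)
        if expected is None:
            return []                     # no assertion written for this prefix
        violations = []
        if as_path[-1] != expected["origin"]:
            violations.append("unexpected origin AS %d" % as_path[-1])
        if len(as_path) > expected["max_path_len"]:
            violations.append("AS path longer than %d" % expected["max_path_len"])
        return violations

    # A possible hijack (wrong origin) and an unusually long path are flagged.
    received = [("192.0.2.0/24", [64510, 64502, 64999]),
                ("198.51.100.0/24", [64510, 64503, 64502, 64505, 64501])]
    for prefix, path in received:
        for v in check_route(prefix, path):
            print("assertion failed for %s: %s" % (prefix, v))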

(2) Cooperation among networks
Each network builds a measurement repository to collect data from
multiple locations. It builds a profile of the expected routing
behavior to quickly identify any deviations using statistical
techniques.  Cooperation across networks is absolutely necessary to
diagnose global Internet routing problems. It is a challenge to
provide summaries of measurement data at a sufficiently detailed level
to be useful but without revealing sensitive information about the
internals of ISPs' networks. A complementary approach is to allow
special distributed queries of the detailed network data from multiple
vantage points without direct access to the data.

(3) Scalable distributed measurement interpretation and measurement
    calibration
Routing measurements (e.g., BGP) can result in significant data volume,
and it may be infeasible to perform real-time or online interpretation
of such measurement data by combining all the data from multiple
locations in distinct networks at a centralized location. Distributed
algorithms are useful to interpret measurement results locally and
then aggregate them intelligently to identify routing anomalies.
Interpreting measurements can be challenging, as there is a lack of
global knowledge of topologies and policies which can arbitrarily
translate a given measurement input signal to observed output
signals.  We propose the use of calibration points to help identify
expected or normal routing behavior and correlate the output with the
input. Calibration points are well-controlled active measurement
probes with known measurement input. The BGP Beacons work is one such
example of an attempt to understand the patterns of output for a known
input routing change. 
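
As a tiny illustration of using such a calibration point, the sketch below
compares observed updates for a beacon prefix against an assumed schedule
(announce on even hours, withdraw on odd hours) and flags anything outside a
tolerance window; the schedule, tolerance, and timestamps are all invented.

    # Sketch: compare observed updates for a beacon prefix with its known
    # schedule (illustrative; a real beacon publishes its own timetable).
    TOLERANCE = 120.0          # seconds of propagation/processing slack allowed

    def expected_events(day_start, hours=24):
        """Beacon assumed to announce on even hours and withdraw on odd hours."""
        return [(day_start + h * 3600.0, "announce" if h % 2 == 0 else "withdraw")
                for h in range(hours)]

    def classify(observed, schedule):
        """Tag each observed (time, type) update as expected or unexpected."""
        tagged = []
        for t, kind in observed:
            ok = any(abs(t - st) <= TOLERANCE and kind == sk for st, sk in schedule)
            tagged.append((t, kind, "expected" if ok else "UNEXPECTED"))
        return tagged

    schedule = expected_events(day_start=0.0)
    observed = [(35.0, "announce"), (3642.0, "withdraw"), (5000.0, "withdraw")]
    for t, kind, verdict in classify(observed, schedule):
        print("%8.1f  %-9s %s" % (t, kind, verdict))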

(4) Internet-wide emulation for network configurations
The impact of a single routing configuration change, caused for example
by a policy change, could be global; thus, it is important to emulate
the behavior in advance to study its impact.  It is useful to abstract
the routing behavior in a single network at a higher level to study
the perturbation on the global routing system.  Currently, the routing
configuration is done at a device level.  Higher-level programming
support is needed to provide semantically more meaningful
configuration of networks.  Predicting the output of a routing
configuration implicitly assumes that routing is deterministic.
However, nondeterministic routing may be more stable by preferring
routes that have been in the routing tables the longest. Such
tradeoffs are important to study.

(5) Understanding the interaction of multiple routing protocols and
    implementation variants
Internet routing consists of multiple protocols, e.g., interdomain,
intradomain routing protocols, and MPLS label distribution protocol.
All these protocols interact to achieve end-to-end routing behavior
from an application's point of view. It is critical to understand
their dependency on each other.  For instance, in BGP/MPLS IP VPNs,
the label distribution protocol is needed to set up label switched
paths across the network and if that is unsuccessful, BGP cannot find
a route.  There is similar dependence of BGP on OSPF or IS-IS.
Implementation variants among router vendors determine routing
dynamics, which are poorly understood. The interaction among the
variants may result in unexpected behavior and needs to be studied.

(6) Understanding routing "politics"
When a customer complains about routing problems either in terms of
reachability or poor performance, it typically is in the context of
some applications. Network operators install route filters in the
routers to determine which routes to accept in calculating the best
path to forward traffic. Packet filters at the routers are much more
flexible in the sense that they determine which packets are accepted
for forwarding based on attributes of the packets, e.g., port numbers,
protocol types.  Given a route in one's routing table received by
one's upstream provider, there is no guarantee that all application
traffic can reach the destination due to the presence of packet
filters. Some networks, for instance, perform port-based filtering to
protect against known worm traffic. When debugging routing problems,
one needs to take the application's perspective to understand which
type of application traffic is correctly forwarded.

How to improve the application performance?
-------------------------------------------
Problem:
--------
Today, the Internet has no performance guarantees for real-time or
delay-sensitive applications, such as VoIP and gaming, especially if
traffic goes across multiple networks. To obtain flexible routing in
terms of control over cost and performance of network paths, end users
resort to either multihoming to multiple networks or overlay routing.
However, studies have shown that there may be potential adverse
interaction between application routing and traffic engineering at the
IP layer. Multihoming, similarly, is not a perfect solution as it does
not directly translate to paths with performance guarantees, has
little impact on how incoming traffic reaches the customers, and may
further amplify the amount of routing traffic during convergence.

Proposed solution:
------------------
Application is the king: correlate routing with forwarding plane,
evaluate and improve in the context of application performance
metrics: delay, loss rate, and jitter.

When studying routing protocol performance, researchers often use
convergence delay as a universal metric.  However it does not
translate directly to metrics applications care about, e.g., delay,
loss rate, and jitter. Understanding the stability of such
measurements as a function of the network topology and time provides a
way for overlay routing algorithms to intelligently route around
network problems. Application performance measurements also expose the
detailed interaction between the dynamics of forwarding plane and
control plane.

How to protect the routing system?
----------------------------------
Problem:
--------
There have been relatively few studies on protecting the Internet
routing infrastructure against attacks. Vulnerabilities in router
architectures are relatively unknown and have not been widely
exploited. The routing system can also be indirectly affected by
enormous traffic volumes. Recently, there has been a large number of
worms exploiting end host OS vulnerabilities. Significant attack traffic
volume causes router sessions to time out. Session resets result in
exchange of entire routing tables and disruption of routing.  Cascaded
failures can occur if the session reset traffic subsequently causes
router overload and other peering sessions to be affected.

Proposed solution:
------------------
(1) Understanding vendor implementation of routing protocols
Through detailed black-box testing and support from vendors, one can
better understand the obscure behavior of routers that is not
documented in RFCs and its implications for router security.

(2) Understanding vulnerability points on the Internet
Network topology and policy information are more widely known through
various Internet mapping efforts. Such mapping efforts help us discover
vulnerability points by analyzing failure scenarios.

(3) Higher priority for routing traffic
The delay and loss of routing traffic, especially keepalive HELLO
messages, can cause sessions to reset.  This can occur when there is
significant data traffic. Increasing the queuing and processing
priority of routing packets in the routers is one possibility to
reduce the impact of bandwidth attacks on the routing system.

(4) Automated dynamic installation of packet and route filters
The attack against windowsupdate.com was prevented just in time by
invalidating the relevant DNS entry in the DNS system, which takes at
least 24 hours to propagate any change globally.  To react to any
attacks in real time, there needs to be a faster and automated
way.  One possibility is to dynamically install relevant packet and route
filters across a selected set of networks to eliminate/reduce the
impact of the attacks. Routers have limited memory for such filters
and the order of the filters determines the actual routes or packets
permitted. We need to study efficient algorithms to compute such
filters on the fly.
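
One hypothetical way to frame the on-the-fly filter computation is sketched
below: each candidate filter comes with an estimated blocking benefit and a
cost in filter-table entries, and a simple greedy pass installs filters until
an assumed memory budget is exhausted. Real routers also make the outcome
order-dependent, which this toy version deliberately ignores.

    # Toy greedy selection of packet/route filters under a memory budget
    # (illustrative; descriptions, benefits and costs are invented).
    candidates = [
        # (filter description, estimated Mbit/s of attack traffic blocked, entries)
        ("drop udp/1434 from any",         400.0, 1),
        ("drop tcp/80 to 203.0.113.0/24",  250.0, 1),
        ("drop source 198.51.100.0/22",    150.0, 4),
        ("drop oversized icmp echo",        60.0, 2),
    ]

    def choose_filters(candidates, budget_entries):
        """Greedy by blocked traffic per entry; a knapsack solver would do better."""
        ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
        chosen, used = [], 0
        for desc, blocked, entries in ranked:
            if used + entries <= budget_entries:
                chosen.append(desc)
                used += entries
        return chosen, used

    selected, used = choose_filters(candidates, budget_entries=4)
    print("installing %d filter entries:" % used)
    for f in selected:
        print("  " + f)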
          

Position statement of

Olaf Maennel

(TU Munich)






   ON BGP MUTATIONS
   ================


The collapse of the Internet has already been predicted lots of times.
Some researchers and practitioners have damned BGP, and proposals for
finding a replacement for BGP are resounding throughout the community. 

But before proposing changes to existing protocols, we should understand
the origins of today's problems. We have to grasp the design decisions,
the interactions, and the scalability limitations of the current
implementations. Regarding BGP, this in-depth understanding is clearly
not present.

In the following I would like to envisage three areas in which BGP may/should
evolve in the next few years. Those areas can be viewed as short-, mid-,
and long-term goals. 


      1. Vendor implementation issues
         (or: convergence and scalability questions)
   
   Convergence times in the Internet are still in the order of several
   minutes. Given the Internet's critical importance, and compared to telephone
   networks, this is no longer acceptable!
 
   But the protocol does not have to be changed to improve convergence.
   The limiting factors are vendor-specific implementation details,
   settings of timers and parameters, as well as overloaded routers
   [see Appendix A]. 
   
   Just to pick one example, consider the propagation of updates in
   I-BGP through a series of route reflectors (RR): Updates will be
   delayed by approximately 10 seconds per RR by MRAI. Changing this
   timer setting or changing the network design (reducing the number of
   cascaded RRs that the update has to pass) will speed up convergence
   without protocol modifications. This example leads to the second
   area:
   
   
      2. Human-factor issues
         (or: misconfiguration questions)
  
   Network design as well as router configuration is not a trivial task
   (e.g., [Caldwell03]). Therefore human error in router configuration
   and network design happens every day [Mahajan02].

   Various homegrown tools and approaches exist (e.g., see presentations
   at operator forums such as [NANOG]). Still, research needs to focus
   more on solutions to minimize the error potential.

   Here, tools and accurate databases are desperately needed, but no
   changes to the protocols are necessary to minimize human errors. 

   On the other hand, it is known that certain configuration mistakes can
   lead to BGP oscillations (e.g., [RFC3345]). The current approach is
   patchwork that fixes bugs as they occur. This is not acceptable,
   and we need some protocol enhancements, which leads to the third
   area:
   
   
      3. Protocol-design issues
         (or: protocol divergence, inter-domain TE, etc. questions)

   One beloved feature of BGP is that it is completely configurable
   through policies, but Tim Griffin has shown that today's MED
   oscillations are just the tip of the iceberg and that BGP can lead to
   diverging states on a much larger scale (e.g., [Griffin99]). 

   There are further demands from the market that cannot be satisfied
   with our current version of BGP. These include inter-domain
   equal-cost multipath, "online" inter-domain traffic engineering (a la
   RouteScience), etc. None of this will be possible as long as the
   best-path decision process of a router selects only one best route. 
   
   Furthermore, additional information about the causes and origins of
   routing instabilities would help operators locate and debug routing
   problems. 
 
   Even though the list above does not claim to be exhaustive, it is
   clear that some enhancements to BGP will be unavoidable!


How will BGP evolve?

Quite logically, vendors mainly implement those features that the
market is expected to buy (e.g., MPLS/VPNs). From my perspective, all
three areas mentioned above are not very attractive to vendors (e.g.,
low cost-benefit ratio), but they are important for the future of the
Internet. That is why those areas need support from research in order
to evolve. 

To approach those problems, we need an in-depth understanding of protocol
details, router limitations, and interactions between protocols, as well
as propagation patterns through the topology. Research should start by
answering questions from the following categories:


      1. Protocol analysis 
  
   Identify the root causes and the location of triggering events.
   Investigate interactions between routing protocols and topology. 

   Example questions here could be: How can we identify the AS that
   originated an update? How many updates are due to what kinds of
   events?
   
   
      2. Equipment scalability tests 

   Understand the scaling limitations of today's equipment before judging
   the deployment of additional features. 

   Example questions here could be: How long does an update spend
   inside a router (under certain load conditions)? How much more load
   can inter-domain traffic engineering or a lower MRAI value impose on
   a router?


      3. Simulation

   Use network simulation to understand how routing updates traverse
   the network. Investigate interactions of various timers, of policies,
   between IGPs and BGP, etc.

   Example questions here could be: How can BGP be implemented in a way
   that limits the number of "dispensable" updates (caused by
   interconnectivity and timers)? ...



BGP is a protocol that has evolved over more than 15 years now. The most
important aspect is that network operators have full control over all
settings and their route distribution. 

My conclusion regarding the future of BGP is that many of the problems
that we have with today's routing are fixable within BGP and should be
fixed soon. Furthermore, enrichments (e.g., optional add-ons) to BGP
are not only necessary, but unavoidable! On the other hand, a replacement
protocol will have a hard time in the market. 

Therefore "mutations" are possible, but a replacement will be crushed by
"natural selection". That is the part of evolutionary theory that BGP is
subject to - from my point of view.



References
----------

[Griffin99]   T. G. Griffin and G. Wilfong, "An analysis of BGP
              convergence properties," in Proc. ACM SIGCOMM,
              September 1999.

[RFC3345]     D. McPherson, V. Gill, D. Walton, and A. Retana, "Border
              Gateway Protocol (BGP) Persistent Route Oscillation
              Condition," RFC 3345, August 2002.

[Mahajan02]   R. Mahajan, D. Wetherall, and T. Anderson, "Understanding
              BGP Misconfiguration," in Proc. ACM SIGCOMM, August 2002.

[NANOG]       The North American Network Operators' Group,
              http://www.nanog.org/

[Caldwell03]  D. Caldwell, A. Gilbert, J. Gottlieb, A. Greenberg,
              G. Hjalmtysson, and J. Rexford, "The cutting EDGE of IP
              router configuration," unpublished report, July 2003.

------------------------------------------------------------------------


Appendix A: Example, "the MRAI fight"
-------------------------------------

A critical factor in BGP update distribution is the Minimum Route
Advertisement Interval (MRAI) and the way it is implemented in router
software. The basic idea behind this timer is to first collect all
updates arriving from different peers and then pass on only one "best"
update. The RFC suggests that after one update for a prefix has been
sent to a peer, there should be a (jittered) delay of 30 seconds before
another update for the same prefix can be sent to the same peer. Indeed,
this limits the number of BGP messages that need to be exchanged. We
note that certain vendor-specific implementations differ a lot from the
recommendation in the RFC and therefore produce a significantly
different propagation picture. Here are two examples:

From our current understanding of Cisco's MRAI implementation, there are
two major differences with regard to the RFC. The first difference is
that the timer is implemented on a per-peer basis instead of a per-prefix
basis. Scalability reasons do not allow an implementation per peer and
prefix, but as a result almost ALL outgoing updates will be delayed - not
just two consecutive updates (close in time and belonging to one
prefix)! That means that each and every update will be queued and only
propagated when the timer expires. The second difference is that MRAI
holds back withdrawals as well as announcements. This is a major
cause of the observed BGP path exploration phenomenon.

Our current understanding is that MRAI on Junipers is called
"out-delay" [https://www.juniper.net/techpubs/software/junos/junos57/
swconfig57-routing/html/bgp-summary32.html] and is disabled by default.
That means Juniper does not hold back any BGP update messages. Indeed,
this speeds up convergence, but at the risk that many more updates will
be sent - which in turn triggers more damping.

The trade-off in this fight is between faster propagation and more
protocol messages. It is clear that in today's Internet more protocol
messages would lead to more damping, which does not improve convergence.
Even in a fictional Internet without damping, more protocol messages
would burn more CPU time. Therefore, future research has to show whether
this is desirable (consider today's CPU speeds) or not (because of
scalability considerations). 
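
To make the trade-off concrete, here is a toy sketch (the timer value,
the batching rule, and the flap pattern are simplifying assumptions, not
a model of any vendor's code) contrasting immediate propagation with a
per-peer timer:

    # Toy illustration of the MRAI trade-off: a router sees a burst of
    # best-path changes for one prefix and must propagate them to one peer.
    def no_mrai(changes):
        # changes: list of (time, state); every change becomes a message
        return [(t, s) for t, s in changes]

    def per_peer_mrai(changes, interval=30.0):
        # At each timer expiry, send the latest state if it changed.
        msgs, last_sent, current, idx = [], None, None, 0
        expiry = interval
        while idx < len(changes) or current != last_sent:
            while idx < len(changes) and changes[idx][0] <= expiry:
                current = changes[idx][1]
                idx += 1
            if current != last_sent:
                msgs.append((expiry, current))
                last_sent = current
            expiry += interval
        return msgs

    flaps = [(1.0, "via AS1"), (2.0, "withdrawn"), (3.0, "via AS2")]
    print(len(no_mrai(flaps)), no_mrai(flaps)[-1])    # 3 messages, last at t=3.0
    print(per_peer_mrai(flaps))                       # 1 message, at t=30.0

With this made-up flap pattern, the timer-based variant sends a single
message instead of three, but delivers the final state roughly 27
seconds later.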



Position statement of

Phil Karn

(Qualcomm)






          
WIRED position Statement
October 2003
Phil Karn, KA9Q

In my opinion, the most crucial omission in the present Internet
routing protocols is the lack of mechanisms to automate the detection
of and reaction to denial of service attacks and worms. These attacks
have become endemic on the Internet in recent years, largely due to the
astonishing insecurity of Microsoft software that gives rise to the
ability of some worms (such as Blaster) to propagate throughout the
entire Internet in a matter of minutes.

There is also the widespread deployment, often by worm or virus, of
distributed DoS tools that can conduct a coordinated attack against a
single target by thousands of hijacked computers.

All these attacks threaten to destroy what is left of the end-to-end
model responsible for the Internet's success. They stand as a major
impediment to the deployment of more distributed services such as
end-to-end VoIP, especially on relatively slow and expensive
communication channels such as cellular telephony and satellites.

Even when an attack fails to affect legitimate traffic, the excess
bandwidth charges resulting from the DoS traffic can eventually drive
a service out of business.

Since computers and local area networks have increased greatly in
speed in recent years, DoS attacks are generally more successful when
they target a customer's Internet access links rather than his
computers.  Defenses at the computers are therefore useless; any
effective defenses must be deployed within the Internet itself.

I envision adding mechanisms to routers that perform the following
functions within the Internet:

1. Block specified packets addressed to a specific IP address UNDER
THE DIRECT CONTROL OF THE USER OF THAT IP ADDRESS.

Static, ISP-configured firewalls are almost always over-broad, clumsy
and inadequate.  Desired traffic is often blocked, and it may be too
time-consuming during an attack to call an ISP operator to manually
block specified traffic, which may change rapidly specifically to
evade such filtering. It is therefore essential that the firewalls
within the network be under the direct control of the user of the
target IP address, without the need for human intervention by the
network operators.

A good start would be an open standard for the secure remote control
of a generic packet-filtering firewall. Security on this control path
is obviously important, but it need not be extreme to be effective.
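
As one illustration of what such a control path could look like - every
field, key, and check below is hypothetical - a request might be
authenticated with a shared key and accepted only for addresses
delegated to that key:

    import hmac, hashlib, json, ipaddress

    # Hypothetical control message for a user-controlled in-network firewall.
    # The message format and HMAC scheme are illustrative assumptions; the
    # only addresses a key may filter are those delegated to it.
    def make_request(key, target, action, match):
        body = {"target": target, "action": action, "match": match}
        payload = json.dumps(body, sort_keys=True).encode()
        mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
        return body, mac

    def verify_request(key, delegated_prefix, body, mac):
        payload = json.dumps(body, sort_keys=True).encode()
        if not hmac.compare_digest(
                mac, hmac.new(key, payload, hashlib.sha256).hexdigest()):
            return False                               # bad signature
        return ipaddress.ip_address(body["target"]) in \
               ipaddress.ip_network(delegated_prefix)  # only your own address

    key = b"shared-secret-with-ISP"                    # placeholder
    body, mac = make_request(key, "192.0.2.7", "drop",
                             {"proto": "udp", "dst_port": 53})
    assert verify_request(key, "192.0.2.0/24", body, mac)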

2. Automatically detect a DoS attack and coordinate a response among
the affected routers by filtering the attack packets as close as
possible to their source(s). The detection may be performed by
mechanisms similar to those used to implement quality of service;
e.g., by sustained output queue overflows. Messages could be exchanged
between neighboring routers to cooperatively block or limit traffic
that would be discarded downstream anyway. Care must be taken to avoid
dynamic responses that might allow an attacker to use a relatively
small amount of traffic to trigger the packet-dropping mechanisms in
such a way that legitimate traffic is unduly affected.
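
A rough sketch of the detection side, with invented thresholds and an
invented "rate-limit" notification standing in for the inter-router
messages described above:

    from collections import Counter, deque

    # Toy DoS detector: sustained output-queue overflow triggers a request
    # toward the upstream neighbor to limit the heaviest source prefixes.
    # Thresholds, window size, and the notification format are placeholders.
    class DosDetector:
        def __init__(self, drop_threshold=1000, sustained_intervals=5):
            self.drop_threshold = drop_threshold
            self.window = deque(maxlen=sustained_intervals)

        def observe_interval(self, drops, bytes_by_src_prefix):
            # Called once per measurement interval on an output queue.
            self.window.append(drops)
            sustained = (len(self.window) == self.window.maxlen and
                         all(d > self.drop_threshold for d in self.window))
            if not sustained:
                return None
            # rank offending source prefixes by offered load
            top = Counter(bytes_by_src_prefix).most_common(3)
            return {"action": "rate-limit", "prefixes": [p for p, _ in top]}

    det = DosDetector()
    for _ in range(5):
        msg = det.observe_interval(
            drops=5000,
            bytes_by_src_prefix={"203.0.113.0/24": 9e8, "198.51.100.0/24": 1e7})
    print(msg)   # after 5 bad intervals: suggest limiting the top prefixes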

I note that these mechanisms may prove useful in blocking spam from
known sources. E.g., user-controlled firewalls could implement IP
blacklists, keeping said traffic off the user's own access link.  But
as annoying as spam is to the humans who receive it, it does not yet
present quite the threat to the Internet routing and transmission
infrastructure that worms and DoS attacks do.

          

Position statement of

Randy Bush

(IIJ)






			   Happy Packets
		      Randy Bush / 2003.09.30

As routing researchers, we frequently hear comments such as
  o internet routing is fragile, collapsing, ...,
  o bgp is broken or is not working well,
  o yesterday was a bad routing day on the internet,
  o change X to protocol Y will improve routing,
  o etc.
And we often measure routing dynamics and say that some measurement
is better or worse than another.

But what is 'good' routing?  How can we say one measurement shows
routing is better than another unless we have metrics for routing
quality?  We often work on the assumption that number of prefixes,
speed or completeness of convergence, etc. are measures of routing
quality.  But are these real measures of quality?

Perhaps because I am an operator, I think the measure which counts
is whether the customers' packets reach their intended
destinations.  If the customers' packets are happy, the routing
system (and other components) are doing their job.

Therefore, I contend that, for the most part, we should be judging
control plane quality by measuring the data plane.  And we have
well-defined metrics for the data plane: delay, drop, jitter,
reordering, etc.  And we have tools with which to measure them.
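
For instance, a few lines suffice to turn raw probe records into these
metrics (the probe format, clock synchronization, and the simple
reordering count are assumptions of this sketch):

    # Minimal sketch: judging the control plane by what the data plane
    # delivers.  Input: probes as (seq, send_time, recv_time-or-None),
    # listed in send order.
    def data_plane_metrics(probes):
        got = [(seq, tx, rx) for seq, tx, rx in probes if rx is not None]
        loss = 1.0 - len(got) / len(probes)
        delays = [rx - tx for _, tx, rx in got]
        delay = sum(delays) / len(delays) if delays else None
        jitter = (sum(abs(b - a) for a, b in zip(delays, delays[1:]))
                  / max(len(delays) - 1, 1)) if delays else None
        by_arrival = sorted(got, key=lambda p: p[2])      # order of arrival
        reordered = sum(1 for a, b in zip(by_arrival, by_arrival[1:])
                        if b[0] < a[0])                   # seq went backwards
        return {"loss": loss, "delay": delay, "jitter": jitter,
                "reordered": reordered}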

It is not clear that happy packets require routing convergence as
we speak of it today.  If there is better routing information near
the destination than at the source, maybe there is sufficient
information near the source to get the packets to the better
informed space.  This is not unlike routing proposals, such as
Nimrod, where more detail is hidden the further you get from the
announcer.

If the routing system is noisy, i.e. there is a lot of routing
traffic, that may not really be a bad thing.  We know convergence
time can be reduced if announcement throttling (MRAI) is lessened.
As long as network growth increases the load on routers more slowly
than Moore's law, it is not clear we are in danger.

So, happy packets to you.



Position statement of

Renata Teixeira

(UCSD)






What is in a router's mind?

Network operators and researchers are often faced with the question of
understanding the behavior of routing in IP networks. Network
administrators need to know which paths are being used and why in
order to efficiently perform traffic engineering and to be able  to
detect a routing anomaly, identify the root cause, and fix the
problem. An accurate analysis of routing dynamics is also important
for obtaining realistic routing models that researchers and developers
can use.

Despite recent advances in monitoring and measurement techniques and
the resulting increase in the amount of information available about IP
networks,  identifying the  root cause of a routing change is still a
challenging task. Currently, most ISPs have monitors and taps that
collect a large volume of data. One can construct the routing state by
putting together routing messages, table dumps, router logs and
configurations, etc. Traffic flow and performance can also be
inferred either by using active measurements or by collecting passive
traffic measurements (Netflow, Gigascope, IPMon, SNMP, RMON).

Why is it so hard to put all these pieces together to precisely
determine the cause of a routing change?

(i) There is too much data.

The underlying event (such as a fiber cut, a policy change, or a
misconfiguration) isn't directly visible. A single event can manifest
itself in a number of sources of data, and sometimes even multiple
times in a single data source.  For instance, a single fiber cut may
generate a number of link state advertisements and BGP update
messages, and a shift in traffic. Identifying the underlying event may
require combining all these data sources.

(ii) There is not enough data.

Routing information is distributed and understanding the network-wide
behavior depends on the interaction of a number of routers. Moreover,
the forwarding table in each router is constructed based on the
complex interaction of IGP, iBGP, eBGP, vendor-specific
implementations, and domain-specific policies. It is not  feasible to
instrument all  nodes in a large operational network (Could we
eventually do that?), so we don't have complete information even for a
single domain. None of the data sources available today have enough
information to determine the event that triggered a particular routing
change. Multiple events could  cause a similar stream of routing
messages. Routing messages carry enough information to route, but do
not explain the reason for choosing a particular route.

Thus, one piece of information that is missing in this puzzle is: "why
did the router change its mind?" In order to understand routing
behavior, we need to be able to pinpoint the event that triggered a
routing change. For networks using link-state protocols, this task is
made easier by the flood of link state advertisements to all nodes in
the network. However, most traffic that transits ISP networks is
routed using BGP. Unfortunately, there are various types of events
that  can trigger BGP routing changes and no single network
administrator has complete information to determine its root
cause. For instance, a policy change, a failure, or a BGP
misconfiguration may all be reported as a withdrawal of a set of
prefixes. The IGP   area structure and the iBGP route reflector
hierarchy introduce  further complexities to reasoning about routing
behavior. (Is this  extra complexity really necessary?)

Can routers help us understand their behavior?  How can routers
explicitly report the event that triggered a change in behavior? One
could envision at least two ways of obtaining this information:

(i) Annotate routing messages

BGP could allow the establishment of monitoring sessions. In these
sessions, a BGP speaker could send all alternative routes, so that the
monitor is aware of all the possible choices. In addition, BGP updates
could have extra attributes that contain all the information that is
used in the decision process (such as IGP distances and router IDs).
Then, a monitoring box would have all the information the router had
for deciding which routes to pick.
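
Purely as an illustration, such an annotated update might look like the
following record, with a simplified ranking standing in for the full
decision process (all field names are invented):

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical "monitoring session" record: the update plus every input
    # the router used to pick it.  The rank() below is a simplified stand-in
    # for the real BGP decision process.
    @dataclass
    class CandidateRoute:
        next_hop: str
        as_path: List[int]
        local_pref: int
        med: int
        igp_distance: int
        router_id: str

        def rank(self):
            # higher local-pref first, then shorter AS path, lower MED,
            # closer IGP next hop, lowest router ID as the final tie-break
            return (-self.local_pref, len(self.as_path), self.med,
                    self.igp_distance, self.router_id)

    @dataclass
    class AnnotatedUpdate:
        prefix: str
        candidates: List[CandidateRoute] = field(default_factory=list)

        def chosen(self):
            # a monitor can reproduce the router's pick from the logged inputs
            return min(self.candidates, key=CandidateRoute.rank)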

(ii) Expose state in routers

There are a number of factors that can trigger a routing
change. Router  logs could be extended to store the event that
triggered  a particular change. Examples of events could be an IGP
message, a BGP update, a configuration change, or missing
hellos. Event logs could then be used to trace back to the root cause
of a routing change. Would such logging be feasible? Would it provide
all the information needed to determine the root cause?

After collecting data from a number of routers, one would need to join
all these datasets in order to determine the root cause of a routing
change. Combining this information may still be  challenging due to
timing issues. Routing monitors record data  remotely and an event may
take some time to propagate through the network.  This factor has two
consequences: we cannot assume that all routers will react to  the same
event at the same time, and there may be a delay between the time  the
router changes its state and the time the monitor records the change.
Should the timestamps recorded by routing monitors be defined by the
router itself?
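
As a toy illustration of such a join (the record format, the timestamps,
and the fixed correlation window are assumptions):

    # Toy correlation of router event logs: events from different routers
    # are grouped if their monitor-recorded timestamps fall within a window,
    # and the earliest event in a group is reported as the candidate trigger.
    def correlate(events, window=5.0):
        events = sorted(events, key=lambda e: e["time"])
        groups, current = [], []
        for ev in events:
            if current and ev["time"] - current[0]["time"] > window:
                groups.append(current)
                current = []
            current.append(ev)
        if current:
            groups.append(current)
        # candidate root cause per group = earliest event in it
        return [(g[0], g) for g in groups]

    logs = [
        {"router": "r1", "time": 100.0, "event": "IGP cost change"},
        {"router": "r2", "time": 101.2, "event": "BGP best path change"},
        {"router": "r3", "time": 300.0, "event": "missing hellos"},
    ]
    for trigger, group in correlate(logs):
        print(trigger["event"], "->", len(group), "correlated events")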

Since understanding routing behavior is such an essential part of
effectively managing a network, why not take it into consideration
when building routing protocols?


Position statement of

Sharad Agarwal

(Berkeley)






How Finely Do We Need to Control Internet Traffic?
==================================================

The Internet has grown tremendously both in the capacity of traffic
that it can carry and in the actual traffic that it does carry. At the
present, the cost of this capacity, measured either in the rates that
ISPs charge or in the cost of leasing dark fiber, is at the lowest
that it has ever been. With this increase in capacity and drop in
cost, one would expect to see a corresponding lack of interest in
controlling small aggregates of traffic. Yet, a desire for higher
performance, increased reliability and new services is driving a
curious trend toward controlling finer and finer amounts of traffic on
the Internet.

One such example is the almost religious debate between MPLS and
traditional IP routing. MPLS offers fine grained control over traffic,
with the ability to dictate the specific path through a network for
traffic from an ingress interface on one end to an egress interface on
another end of the network. IP routing proponents often cite the
common practice of over-provisioning networks given current market
conditions, which seems counter to the need for fine-grained traffic
control. A second such example is the common practice of multihoming,
which has led to de-aggregation. As more and more stub networks
purchase connectivity from more than one ISP, they find they have a
choice of multiple paths for sending and receiving traffic. Many
companies such as NetVMG, Opnix, Proficient Networks, Routescience,
and Sockeye provide devices that control the paths of egress traffic
to individual IP addresses. It is commonly believed that stub networks
are purposely de-aggregating their network block announcements to
split ingress traffic between inter-AS paths. This has brought out the
worries of associated routing table growth and protocol overhead. A
more direct example of this phenomenon is the current topic in
networking research of overlay networks. Many overlay networks rely on
application-layer forwarding to provide better, customized paths for
individual traffic flows than the underlying Internet can
provide. However, given all the feverish research and industrial
activity in all these areas of networking, their advantages in terms
of performance, reliability and enabling of new applications are still
debated.

Future thrusts into finer control of Internet traffic will undoubtedly
be influenced by the structure of the Internet. Certainly a possible
future is one where nothing is different - since the introduction of
SS7, the PSTN has not changed for over 20 years. Alternatively,
overlay networking may become common, and/or the current ISP model of
the Internet may change.

Two core research issues need to be pursued in this area. The first
will determine how overlay networks should evolve to the future
Internet. Many overlay networks cannot scale to all the hosts on the
current Internet. However, what happens when all the hosts on the
Internet are part of multiple disjoint overlay networks? Peering
agreements between nodes at the overlay level may become
commonplace. Routing decisions by different overlay networks may
interfere with each other by changing traffic patterns in the
underlying network. This can lead to an unstable system or one that is
not much better than the underlying Internet. Can overlay networks
co-exist or will measurements and routing decisions have to be
coordinated to still promise improvements over the current Internet?
Instead, should we abandon overlay networking but use the techniques
developed for it to improve routing in the underlying IP network? A
global, distributed measurement infrastructure can be built to detect
the capacities and utilization of various Internet paths. A control
network that reconfigures IP routes dynamically based on these
measurements can then be put in place. Such an approach can improve on
what IP routing offers today but less than what overlay networks
promise. However, this approach can be more scalable than overlay
network forwarding.
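
As a toy illustration of this measurement-driven control loop (paths,
metrics, and the hysteresis threshold are invented; real route
reconfiguration faces many more constraints):

    # Toy controller in the spirit of the proposal above: pick routes from
    # measured path quality, with hysteresis so that small differences do
    # not cause constant flapping.
    def path_score(m):
        # lower is better: latency in ms plus a heavy penalty for loss
        return m["latency_ms"] + 1000.0 * m["loss"]

    def choose_route(current, measurements, improvement=0.2):
        best = min(measurements, key=lambda p: path_score(measurements[p]))
        if best == current:
            return current
        # switch only if the best path is clearly better than the current one
        if path_score(measurements[best]) < \
           (1.0 - improvement) * path_score(measurements[current]):
            return best
        return current

    paths = {"via ISP-A": {"latency_ms": 80.0, "loss": 0.00},
             "via ISP-B": {"latency_ms": 60.0, "loss": 0.05}}
    print(choose_route("via ISP-A", paths))
    # stays on ISP-A: ISP-B's lower latency does not offset its loss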

The other long term research issue considers Internet routing if the
current hierarchical nature of the AS topology no longer holds in the
future. Given the turmoil that many ISPs are facing, one can imagine a
future Internet without a core consisting of a few large ISPs. In
order to send a packet from California to New York, a path traversing
several small networks may be employed, instead of a path through a
single continental ISP's network. This Internet may be composed of a
large number of small ASes that peer with each other for transit, with
no clear hierarchy. Peering may be dramatically different where ASes
no longer determine peering tactics based on size or AS hierarchy
position. No longer will most of the traffic traverse a few global
sized, well engineered ISPs. If instead the majority of traffic
traverses the same few paths, and if smaller, less-provisioned
networks comprise these paths, will congestion occur more rapidly? We
may need to rethink the fundamental decisions of the current wide-area
routing architecture. Will fast routing convergence or agility in
re-routing around congestion become even more critical? Will
multi-path routing become a necessity, and hot potato routing a more
common occurrence? Perhaps the current two level IGP/EGP hierarchy
will not be sufficient. We may have to consider a third level, or
current overlay networks may fill that need. A more traditional
peering hierarchy may then appear at the overlay network level.

Clearly other issues will also be involved in determining how finely
we need to control traffic in the future. New services may dictate
stringent performance or security requirements that demand fine
control. Without any current, compelling services with these
requirements, we should be careful not to dismiss technologies for
fine traffic control: we may have a chicken-and-egg problem.

Acknowledgements : I want to thank Supratik Bhattacharyya, Chen-Nee
Chuah, Adam Costello and Gianluca Iannaccone for their feedback.


Position statement of

Tom Anderson

(U. Washington)




          
          
A Case for RIP (Re-architecting the Internet Protocols)
Tom Anderson
University of Washington
September 2003

This position paper starts from the premise that we are not in
control.  The primary determining factors for how Internet
routing will evolve over the next decade are the long term
trends in the relative cost-performance of communication,
computation, and human brainpower.  Academic research can help
optimize solutions to match these trends, but it can't buck
them.  Even the tussles between competing vendors and interest
groups, issues that can have substantial impact in the short
term, are over the long term steamrollered by technology
trends.

What are these trends?  Averaged over the past 30 years, wide
area communication has improved in cost-performance at roughly
60% per year.  While prices are never simply a direct
reflection of costs, reflecting the ebb and flow of monopoly
positions, over the long term they track fairly closely.  And
it is this long term improvement in cost-performance, rather
than any intrinsic nature of the Internet, which drives the
long term trends in Internet usage and operations.  For
example, the transmission bandwidth for an hour-long
TV-quality teleconference would have cost $500 a decade ago,
while 10 years from now it will cost a nickel.  Of course this
difference will result in a vast increase in the amount of
multimedia content distributed over the Internet.
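
A quick back-of-the-envelope check of these figures under the stated
60%-per-year improvement:

    # A 60% per year improvement compounds to a factor of 1.6**20 over the
    # two decades between "a decade ago" and "10 years from now".
    factor = 1.6 ** 20                 # roughly 1.2e4
    print(round(500 / factor, 3))      # ~0.041 dollars, i.e. about a nickel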

While the long term improvement in WAN cost-performance seems
impressive, it pales compared to computing, local area
communication, and DRAM (each of which has improved at between
80-100% per year for the past 30 years).  Moore's Law gets the
publicity (the 60% per year improvement in circuit density),
but that figure misses a key factor - volume manufacturing.
Roughly ten billion microprocessors were manufactured last
year, compared to only a handful of wide area communication
line cards; thirty years ago, the numbers were closer to
parity.  High volume technologies have a significant long term
edge in cost-performance.  While a gap of 20-40% may not seem
like much in any given year, over the long term it adds up to
about an order of magnitude per decade.   (To the extent that
prices diverge from costs, the divergence accentuates this effect -
the Internet is a less efficient market than CPUs and DRAM,
and thus is scaling even less quickly in the near term.)

One consequence is that the Internet was designed for a far
different world than the one we have today or will have in ten
years.  Thirty years ago, human time was cheap, and
computation and communication were expensive.  Today's
Internet, and increasingly so in the future, is one where
humans are expensive, wide area communication is cheap, and
computation is virtually free.  Indeed, the Internet became
possible at the point that computation became cheap enough
that we could afford to put a computer at the end of every
wide area link - that is, at the point that computation and
communication reached parity.  The Internet would not have
been feasible, purely from a cost standpoint, in 1960.  Even
fifteen years ago, TCP congestion control was carefully
designed to minimize the cycles needed to process each packet;
few would claim that TCP packet processing overhead is the
limiting factor for practical wide area communication today.
Recall that firewalls were considered too slow a decade ago;
today, they still are, but only for LAN traffic.  These trends
will continue - activities such as routing overlays, link
compression, and traffic shaping, considered perhaps too slow
to be practical today, will eventually become commonplace.

This suggests that we should answer two questions.  How will
the Internet evolve in response to these trends, and what can
we do as researchers to leverage them to make the Internet
more efficient, more reliable, and more secure?  We make
several observations:

Ubiquitous optimization of backbone hardware.  BGP is
explicitly designed for scalability over performance, and thus
is ill-suited for the kinds of optimizations that are likely
in the future.  It is often impossible even to express optimal
policies in BGP.  Similar problems occur at the intradomain
level; it is idiotic to have an architecture that requires
humans in the back room to twiddle link weights for good
perform-ance.  The research challenge will be how to adapt our
routing protocols to accommodate ubiquitous op-timization.
Fortunately, networks will be run at the knee of the curve -
it makes no sense to run a network at high utilization if that
delays end users.   The control theory problems of managing
traffic flows over large, heterogeneous networks become much
simpler at low to moderate utilization.

Cooperation as the common case.  A widespread myth is that
Internet routing is dominated by competition - the "tussle"
between competing providers.  In the short term, the tussle
seems paramount, but over the long term, delivering good
performance to end users matters, and that is only possible
when providers cooperate.  Indeed, measurement studies have
shown that even today cooperation heavily influences the
selection of Internet routes.  Unfortunately, BGP is
ill-designed for cooperation - even something as simple as
picking the best exit, as opposed to the earliest or latest,
is a management nightmare in BGP.  How can we re-design our
protocols to make cooperation efficient, and unfriendly
behavior visible and penalized?

Accurate Internet weather.   Many ISPs like to think of their
operations as proprietary, but information necessarily leaks
out about those operations along a number of channels.  Recent
measurement work has shown that it is possible to infer almost
any property of interest, including latency, capacity,
workload, policy, etc.  We believe an accurate hour by hour
(or even minute-by-minute) picture of the Internet can be
cost-effectively gathered from a network of vantage points.
Leveraging this information in routing and congestion control
design is a major research challenge.

Sophisticated pricing models.  Pricing models will become much
more complex, both because we'll be able to measure and
monitor traffic cost-effectively at the edges of networks, and
because the character of traffic affects how efficiently we
can run a network.  Smoothed traffic will be charged less than
bursty traffic, since it allows for higher overall utilization
of expensive network hardware with less impact on other users.
Internet pricing already reflects these effects at a
coarse-grained level, as off-peak bandwidth is essentially
free.  The trend will be to do this at a much more
fine-grained level.  Smoother traffic makes routing
optimizations easier, but perhaps the more interesting
question is how traffic shapers interoperate across domains to
deliver the best performance to end users - in essence, how do
we take the lessons we've learned from interdomain policy
management in BGP and apply them to TCP?

Interoperable boundary devices.  Far from being "evil" and
contrary to the Internet architecture, they are a necessary
part of the evolution of the Internet, as the cost-performance
of computation scales better than that of wide area
communication.  Even today, sending a byte into the Internet
costs the same as 10000 instructions (at least in the US, the
ratio for foreign networks is even higher). The challenge is
making these edge devices interoperate and self-managing - the
only way to build a highly secure, highly reliable, and high
performance network is to get humans out of the loop.   The
end to end principle in particular is a catechism for a
particular technology age - instead of thinking of how a huge
number of poorly secured end devices can work together to
manage the Internet, we will instead ask how a smaller number
of edge devices can cooperate among themselves to provide
better Internet service to their end users.

High barriers to innovation.  As we help evolve the Internet
to better cope with the challenges of the future, it is
important to remember that routers are a low volume product.
As typical of any niche software system, this makes them
resistant to change, since engineering costs can dominate.  As
researchers, we can help by redesigning protocols so that they
are radically easier to implement, manage, and evolve.

These observations and research challenges are animating our
work on RIP at UW.