Workshop on Internet Routing Evolution and Design (WIRED)

October 7-8, 2003
Timberline Lodge, Mount Hood, Oregon, USA

Position statement of

Renata Teixeira

(UCSD)






          What is in a router's mind?
          
          Network operators and researchers are often faced with the question of
          understanding the behavior of routing in IP networks. Network
          administrators need to know which paths are being used and why in
          order to efficiently perform traffic engineering and to be able to
          detect a routing anomaly, identify the root cause, and fix the
          problem. An accurate analysis of routing dynamics is also important
          for obtaining realistic routing models that researchers and developers
          can use.
          
          Despite recent advances in monitoring and measurement techniques and
          the resulting increase in the amount of information available about IP
          networks, identifying the root cause of a routing change is still a
          challenging task. Currently, most ISPs have monitors and taps that
          collect a large volume of data. One can construct the routing state by
          putting together routing messages, table dumps, router logs and
          configurations, etc. Traffic flow and performance can also be
          inferred either by using active measurements or by collecting passive
          traffic measurements (Netflow, Gigascope, IPMon, SNMP, RMON).
          
          Why is it so hard to put all these pieces together to precisely
          determine the cause of a routing change?
          
          (i) There is too much data.
          
          The underlying event (such as a fiber cut, a policy change, or a
          misconfiguration) isn't directly visible. A single event can manifest
          itself in a number of sources of data, and sometimes even multiple
          times in a single data source.  For instance, a single fiber cut may
          generate a number of link state advertisements and BGP update
          messages, and a shift in traffic. Identifying the underlying event may
          require combining all these data sources.
          
          (ii) There is not enough data.
          
          Routing information is distributed and understanding the network-wide
          behavior depends on the interaction of a number of routers. Moreover,
          the forwarding table in each router is constructed based on the
          complex interaction of IGP, iBGP, eBGP, vendor-specific
          implementations, and domain-specific policies. It is not feasible to
          instrument all nodes in a large operational network (Could we
          eventually do that?), so we don't have complete information even for a
          single domain. None of the data sources available today have enough
          information to determine the event that triggered a particular routing
          change. Multiple events could cause a similar stream of routing
          messages. Routing messages carry enough information to route, but do
          not explain the reason for choosing a particular route.
          
          Thus, one piece of information that is missing in this puzzle is: "why
          did the router change its mind?" In order to understand routing
          behavior, we need to be able to pinpoint the event that triggered a
          routing change. For networks using link-state protocols, this task is
          made easier by the flood of link state advertisements to all nodes in
          the network. However, most traffic that transits ISP networks is
          routed using BGP. Unfortunately, various types of events can
          trigger a BGP routing change, and no single network administrator
          has complete information to determine its root cause. For
          instance, a policy change, a failure, or a BGP
          misconfiguration may all be reported as a withdrawal of a set of
          prefixes. The IGP area structure and the iBGP route reflector
          hierarchy introduce further complexities to reasoning about routing
          behavior. (Is this extra complexity really necessary?)
          
          Can routers help us understand their behavior?  How can routers
          explicitly report the event that triggered a change in behavior? One
          could envision at least two ways of obtaining this information:
          
          (i) Annotate routing messages
          
          BGP could allow the establishment of monitoring sessions. In these
          sessions, a BGP speaker could send all alternative routes, so that the
          monitor is aware of all the possible choices. In addition, BGP
          updates could carry extra attributes containing all the information
          used in the decision process (such as IGP distances and router
          IDs). A monitoring box would then have all the information the
          router had when deciding which routes to pick.
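
          As a rough illustration of what a monitor could do with such
          annotated routes, the sketch below replays a simplified subset of
          the BGP decision steps over all candidate routes and names the
          step that settled the choice. The Route fields igp_distance and
          router_id stand for the hypothetical extra attributes; the names
          Route, STEPS, and explain_choice are assumptions made for this
          example, and the real decision process has more steps (origin,
          MED, eBGP over iBGP) and vendor-specific variations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A candidate route plus the attributes the decision process consults."""
    next_hop: str
    local_pref: int
    as_path_len: int
    igp_distance: int  # hypothetical annotation: IGP cost to the next hop
    router_id: str     # hypothetical annotation: ID of the advertising router


def _router_id_key(route):
    # Compare dotted-quad router IDs numerically rather than as strings.
    return tuple(int(part) for part in route.router_id.split("."))


# Simplified tie-breaking steps, in order. Each key is "smaller wins".
STEPS = [
    ("highest local preference", lambda r: -r.local_pref),
    ("shortest AS path", lambda r: r.as_path_len),
    ("lowest IGP distance", lambda r: r.igp_distance),
    ("lowest router ID", _router_id_key),
]


def explain_choice(candidates):
    """Return (best_route, reason).

    reason names the last step that eliminated a candidate, which is the
    step that decided between the final contenders.
    """
    remaining = list(candidates)
    if not remaining:
        raise ValueError("no candidate routes")
    reason = "only one candidate" if len(remaining) == 1 else "tie on all steps"
    for name, key in STEPS:
        if len(remaining) == 1:
            break
        best = min(key(r) for r in remaining)
        survivors = [r for r in remaining if key(r) == best]
        if len(survivors) < len(remaining):
            reason = name
        remaining = survivors
    return remaining[0], reason


if __name__ == "__main__":
    a = Route("10.0.0.1", 100, 3, 20, "1.1.1.1")
    b = Route("10.0.0.2", 100, 3, 10, "2.2.2.2")
    best, why = explain_choice([a, b])
    print(best.next_hop, "chosen by:", why)
```

          With both alternatives and their IGP distances in hand, the
          monitor can tell that the route changed because of an IGP
          distance change rather than, say, a policy change, which a plain
          BGP update would not reveal.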
          
          (ii) Expose state in routers
          
          There are a number of factors that can trigger a routing
          change. Router logs could be extended to store the event that
          triggered a particular change. Examples of events could be an IGP
          message, a BGP update, a configuration change, or missing
          hellos. Event logs could then be used to trace back to the root cause
          of a routing change. Would such logging be feasible? Would it provide
          all the information needed to determine the root cause?
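
          One possible shape for such a log is sketched below, purely as an
          assumption: each entry records the event that triggered it, so a
          routing change can be traced back through a chain of entries.
          The LogEntry fields, the event kind strings, and the idea that a
          routing message carries the sender's log entry ID (so that links
          can cross routers) are all hypothetical and do not describe any
          existing router feature.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LogEntry:
    """One hypothetical extended router log record."""
    router: str
    entry_id: int
    kind: str    # e.g. "missing_hellos", "igp_message", "bgp_update"
    detail: str
    # (router, entry_id) of the entry that triggered this one, if known.
    caused_by: Optional[Tuple[str, int]] = None


def trace_root_cause(logs, router, entry_id):
    """Follow caused_by links back to an entry with no recorded trigger.

    logs maps (router, entry_id) to LogEntry. Returns the chain of entries
    from the observed change back to the earliest known trigger.
    """
    chain = []
    seen = set()
    key = (router, entry_id)
    while key is not None:
        if key in seen:
            raise ValueError("cycle in caused_by links at %r" % (key,))
        seen.add(key)
        entry = logs[key]
        chain.append(entry)
        key = entry.caused_by
    return chain


if __name__ == "__main__":
    entries = [
        LogEntry("r1", 1, "missing_hellos", "link r1-r2 declared down"),
        LogEntry("r1", 2, "igp_message", "LSA flooded", ("r1", 1)),
        LogEntry("r3", 7, "route_change", "new egress for prefix", ("r1", 2)),
    ]
    logs = {(e.router, e.entry_id): e for e in entries}
    for e in trace_root_cause(logs, "r3", 7):
        print(e.router, e.kind, e.detail)
```

          The chain ends at the first entry whose trigger was not logged,
          which is the root cause only if every router along the way
          recorded its triggers.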
          
          After collecting data from a number of routers, one would need to join
          all these datasets in order to determine the root cause of a routing
          change. Combining this information may still be challenging due to
          timing issues. Routing monitors record data remotely and an event may
          take some time to propagate through the network. This factor has two
          consequences: we cannot assume that all routers will react to the same
          event at the same time, and there may be a delay between the time the
          router changes its state and the time the monitor records the change.
          Should the timestamps recorded by routing monitors be defined by the
          router itself?
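
          A common heuristic for this kind of join, shown here only as a
          sketch, is to merge the records from all sources by timestamp and
          treat records separated by less than some gap as one candidate
          event. The function name group_by_time, the record layout
          (timestamp, source, message), and the gap value are assumptions
          for this example; because of the propagation and recording
          delays described above, any fixed gap can both split one event
          and merge unrelated ones.

```python
def group_by_time(observations, max_gap):
    """Group (timestamp, source, message) records into bursts.

    Records are sorted by timestamp; a new group starts whenever the gap
    to the previous record exceeds max_gap (in the timestamps' unit).
    """
    groups = []
    for obs in sorted(observations, key=lambda o: o[0]):
        if groups and obs[0] - groups[-1][-1][0] <= max_gap:
            groups[-1].append(obs)
        else:
            groups.append([obs])
    return groups


if __name__ == "__main__":
    observations = [
        (100.8, "bgp-monitor", "withdrawal of a set of prefixes"),
        (100.0, "igp-monitor", "link state advertisement"),
        (400.0, "bgp-monitor", "update"),
        (103.5, "netflow", "traffic shift"),
    ]
    for group in group_by_time(observations, max_gap=5.0):
        print([source for _, source, _ in group])
```

          If routers stamped records themselves, the gap would only need
          to absorb propagation delay and clock skew, not the additional
          delay of the monitor.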
          
          Since understanding routing behavior is such an essential part of
          effectively managing a network, why not take it into consideration
          when building routing protocols?