Workshop on Internet Routing Evolution and Design (WIRED)

October 7-8, 2003
Timberline Lodge, Mount Hood, Oregon, USA

Position statement of

Z. Morley Mao

(UC Berkeley)





          Position statement for WIRED
          Z. Morley Mao
          UC Berkeley
          zmao@eecs.berkeley.edu
          
          How to debug the routing system?  
          ================================
          Problem: 
          --------
          
          Today, network operators have very limited tools to debug routing
          problems. Only primitive tools such as traceroute and ping are
          commonly used to identify existing routing behavior. There is very
          little visibility into the routing behavior of other ISPs' networks
          from a given ISP's perspective, making it even more difficult to
          identify the culprit of any routing anomalies. This also means that it
          is difficult to predict the impact any routing policy change has on
          the global routing behavior. Oftentimes, routing problems are noticed
          only after a customer complains about reachability or severe
          degradation of performance. There is a lack of proactive, automated
          analysis that detects routing problems at an early stage. This
          matters because certain routing problems are not obvious at first
          and result only in suboptimal or unintended routes. Diagnosing
          Internet routing problems often requires analysis of data from
          multiple vantage points.
          
          Proposed solutions:
          -------------------
          (1) Build routing assertions, so that nothing fails silently.  When
          a network operator configures a network, it is important to create a
          set of assertions, equivalent to integrity constraints in databases
          or assertions in software programs. These assertions state the
          expected behavior of the routing protocols in terms of which routes
          are allowed, the resulting attributes of the routes, etc. These
          constraints can be checked dynamically by a route monitor.
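
          As a minimal sketch of what such assertions might look like, the
          Python fragment below checks routes against a small list of
          constraints. The Route fields, the two example assertions, and the
          AS numbers are illustrative assumptions, not an existing tool or a
          real configuration.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Callable, List, Tuple

# Hypothetical route record; a real monitor would carry more attributes.
@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]  # leftmost AS is the neighbor
    local_pref: int


@dataclass(frozen=True)
class Assertion:
    name: str
    holds: Callable[[Route], bool]


RFC1918 = [ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def no_rfc1918(route: Route) -> bool:
    # Assumed policy: private address space must never be accepted.
    net = ip_network(route.prefix)
    return not any(net.overlaps(private) for private in RFC1918)


def max_path_length(limit: int) -> Callable[[Route], bool]:
    # Assumed policy: unusually long AS paths are treated as suspicious.
    return lambda route: len(route.as_path) <= limit


def check(route: Route, assertions: List[Assertion]) -> List[str]:
    """Return the names of all assertions that the route violates."""
    return [a.name for a in assertions if not a.holds(route)]


ASSERTIONS = [
    Assertion("no-rfc1918-prefixes", no_rfc1918),
    Assertion("as-path-at-most-5", max_path_length(5)),
]

good = Route("198.51.100.0/24", (64500, 64501), 100)
bad = Route("10.0.0.0/8", (64500, 64501, 64502, 64503, 64504, 64505), 100)

for r in (good, bad):
    print(r.prefix, check(r, ASSERTIONS))
```

          A route monitor could run check() on every received update and
          raise an alarm whenever the returned list is non-empty.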
          
          (2) Cooperation among networks
          Each network builds a measurement repository to collect data from
          multiple locations. It builds a profile of the expected routing
          behavior to quickly identify any deviations using statistical
          techniques.  Cooperation across networks is absolutely necessary to
          diagnose global Internet routing problems. It is a challenge to
          provide summaries of measurement data at a sufficiently detailed
          level to be useful without revealing sensitive information about
          the internals of ISPs' networks. A complementary approach is to
          allow special distributed queries of the detailed network data from
          multiple vantage points without direct access to the data.
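
          One simple statistical technique for spotting deviations from a
          profile is a z-score test. The sketch below assumes the profile is
          a list of per-interval BGP update counts; the numbers and the
          threshold of 3 standard deviations are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def deviations(baseline: Sequence[float],
               observed: Sequence[float],
               threshold: float = 3.0) -> List[int]:
    """Return indices of observed values far from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Degenerate profile: any change at all counts as a deviation.
        return [i for i, x in enumerate(observed) if x != mu]
    return [i for i, x in enumerate(observed)
            if abs(x - mu) / sigma > threshold]


# Hypothetical update counts per five-minute interval.
baseline = [100, 110, 95, 105, 90, 100, 108, 92]
observed = [102, 97, 480, 101]

print(deviations(baseline, observed))
```

          Only the indices of anomalous intervals leave the function, which
          hints at how a network might share a summary rather than raw data.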
          
          (3) Scalable distributed measurement interpretation and measurement
              calibration
          Routing measurements (e.g., of BGP) can produce significant data
          volume, and it may be infeasible to interpret such data in real time
          or online by combining all the data from multiple locations in
          distinct networks at a centralized location. Distributed
          algorithms are useful to interpret measurement results locally and
          then aggregate them intelligently to identify routing anomalies.
          Interpreting measurements can be challenging because there is a lack
          of global knowledge of topologies and policies, which can arbitrarily
          translate a given measurement input signal into observed output
          signals.  We propose the use of calibration points to help identify
          expected or normal routing behavior and correlate the output with the
          input. Calibration points are well-controlled active measurement
          probes with known measurement input. The BGP Beacons work is one such
          example of an attempt to understand the patterns of output for a known
          input routing change. 
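
          To illustrate how a known input can be correlated with observed
          output, the sketch below matches the times of injected routing
          changes against update times seen at one vantage point. The
          timestamps, the 600-second window, and the function name are
          assumptions chosen for the example.

```python
from typing import Dict, List


def correlate(input_times: List[int],
              update_times: List[int],
              window: int) -> List[Dict[str, int]]:
    """For each known input event, summarize the updates that follow it."""
    results = []
    for t0 in input_times:
        # Updates inside [t0, t0 + window) are attributed to this input.
        seen = sorted(t for t in update_times if t0 <= t < t0 + window)
        summary = {"input": t0, "updates": len(seen)}
        if seen:
            summary["first_delay"] = seen[0] - t0   # propagation delay
            summary["last_delay"] = seen[-1] - t0   # time until quiet
        results.append(summary)
    return results


# Hypothetical data, in seconds since the start of the experiment.
input_times = [0, 7200]
update_times = [12, 40, 95, 7230]

for row in correlate(input_times, update_times, window=600):
    print(row)
```

          Updates that fall outside every window are not caused by the
          calibration input and are candidates for further analysis.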
          
          (4) Internet-wide emulation for network configurations
          The impact of a single routing configuration change, caused for
          example by a policy change, could be global; thus, it is important
          to emulate the change in advance to study its impact.  It is useful
          to abstract the routing behavior in a single network at a higher
          level to study the perturbation on the global routing system.
          Currently, routing configuration is done at the device level.
          Higher-level programming support is needed to provide semantically
          more meaningful configuration of networks.  Predicting the output of
          a routing configuration implicitly assumes that routing is
          deterministic.  However, nondeterministic routing, for example
          preferring the routes that have been in the routing tables the
          longest, may be more stable. Such tradeoffs are important to study.
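
          The tradeoff can be seen in a toy route selection. In the sketch
          below, both rules prefer the shorter AS path; they differ only in
          the tie-breaker. The Candidate fields and values are hypothetical
          and greatly simplify a real decision process.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    neighbor_id: int   # stands in for a router ID
    as_path_len: int
    installed_at: int  # time the route entered the routing table


def best_deterministic(cands: List[Candidate]) -> Candidate:
    # Tie-break on lowest neighbor ID: the outcome depends only on the
    # set of candidates, so it can be predicted from configuration.
    return min(cands, key=lambda c: (c.as_path_len, c.neighbor_id))


def best_prefer_oldest(cands: List[Candidate]) -> Candidate:
    # Tie-break on age: the outcome depends on arrival history, so it
    # is not predictable from configuration alone, but a newly arrived
    # equal route never displaces the current one.
    return min(cands, key=lambda c: (c.as_path_len, c.installed_at))


a = Candidate(neighbor_id=2, as_path_len=3, installed_at=100)
b = Candidate(neighbor_id=1, as_path_len=3, installed_at=500)

print(best_deterministic([a, b]).neighbor_id)
print(best_prefer_oldest([a, b]).neighbor_id)
```

          When b arrives at time 500, the deterministic rule switches away
          from a and triggers a route change; the age-based rule keeps a.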
          
          (5) Understanding the interaction of multiple routing protocols and
              implementation variants
          Internet routing involves multiple protocols, e.g., interdomain and
          intradomain routing protocols and MPLS label distribution protocols.
          All these protocols interact to achieve end-to-end routing behavior
          from an application's point of view. It is critical to understand
          their dependency on each other.  For instance, in BGP/MPLS IP VPNs,
          the label distribution protocol is needed to set up label switched
          paths across the network, and if that is unsuccessful, BGP cannot
          find a route.  BGP depends similarly on OSPF or IS-IS.
          Implementation variants among router vendors determine routing
          dynamics, which are poorly understood. The interaction among the
          variants may result in unexpected behavior and needs to be studied.
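
          The dependency can be sketched as a chain of checks: a BGP route
          is usable only if its next hop is reachable through the intradomain
          protocol and, in the VPN case, a label switched path to that next
          hop exists. The router names and prefixes below are hypothetical.

```python
from typing import Dict, Set


def usable_routes(bgp_routes: Dict[str, str],
                  igp_reachable: Set[str],
                  lsp_endpoints: Set[str],
                  require_lsp: bool) -> Dict[str, str]:
    """Keep only the BGP routes whose dependencies are satisfied."""
    usable = {}
    for prefix, next_hop in bgp_routes.items():
        if next_hop not in igp_reachable:
            continue  # OSPF or IS-IS has no path to the next hop
        if require_lsp and next_hop not in lsp_endpoints:
            continue  # label distribution did not set up a path
        usable[prefix] = next_hop
    return usable


# Hypothetical VPN routes learned over BGP, keyed by prefix.
bgp_routes = {"vpn-a:10.1.0.0/16": "pe2", "vpn-a:10.2.0.0/16": "pe3"}
igp_reachable = {"pe2", "pe3"}
lsp_endpoints = {"pe2"}  # assumed: label distribution toward pe3 failed

print(usable_routes(bgp_routes, igp_reachable, lsp_endpoints, True))
```

          Here the route via pe3 is present in BGP and reachable in the IGP,
          yet unusable for VPN traffic, which no single protocol reveals.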
          
          (6) Understanding routing "politics"
          When a customer complains about routing problems, either in terms of
          reachability or poor performance, it is typically in the context of
          some application. Network operators install route filters in the
          routers to determine which routes to accept in calculating the best
          path to forward traffic. Packet filters at the routers are much more
          flexible in the sense that they determine which packets are accepted
          for forwarding based on attributes of the packets, e.g., port numbers
          and protocol types.  Given a route in one's routing table received
          from one's upstream provider, there is no guarantee that all
          application traffic can reach the destination, due to the presence
          of packet filters. Some networks, for instance, perform port-based
          filtering to protect against known worm traffic. When debugging
          routing problems, one needs to take the application's perspective
          to understand which types of application traffic are correctly
          forwarded.
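
          The gap between having a route and delivering application traffic
          can be sketched as two separate checks. The route table, the
          blocked ports, and the application list are illustrative
          assumptions, not a description of any particular network.

```python
from ipaddress import ip_address, ip_network
from typing import Dict, Tuple

# Hypothetical routing table and packet filter along the path.
ROUTES = [ip_network("203.0.113.0/24")]
BLOCKED = {("tcp", 135), ("udp", 1434)}  # assumed worm-related filters


def has_route(dst: str) -> bool:
    """Control-plane view: is there any route covering dst?"""
    return any(ip_address(dst) in net for net in ROUTES)


def reachable(dst: str, proto: str, port: int) -> bool:
    """Application view: a route exists and no packet filter drops it."""
    return has_route(dst) and (proto, port) not in BLOCKED


def report(dst: str, apps: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    return {name: reachable(dst, proto, port)
            for name, (proto, port) in apps.items()}


apps = {"web": ("tcp", 80), "rpc": ("tcp", 135), "dns": ("udp", 53)}
print(has_route("203.0.113.10"))
print(report("203.0.113.10", apps))
```

          The route check succeeds for the destination, while the
          per-application report shows that one application is filtered.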
          
          How to improve application performance?
          =======================================
          Problem:
          --------
          Today, the Internet has no performance guarantees for real-time or
          delay-sensitive applications, such as VoIP and gaming, especially if
          traffic crosses multiple networks. To obtain flexible routing in
          terms of control over cost and performance of network paths, end
          users resort to either multihoming to multiple networks or overlay
          routing. However, studies have shown that there may be adverse
          interaction between application-layer (overlay) routing and traffic
          engineering at the IP layer. Multihoming, similarly, is not a
          perfect solution, as it does not directly translate to paths with
          performance guarantees, has little impact on how incoming traffic
          reaches the customer, and may further amplify the amount of routing
          traffic during convergence.
          
          Proposed solution:
          ------------------
          The application is king: correlate routing with the forwarding
          plane, and evaluate and improve routing in the context of
          application performance metrics: delay, loss rate, and jitter.
          
          When studying routing protocol performance, researchers often use
          convergence delay as a universal metric.  However, it does not
          translate directly to metrics applications care about, e.g., delay,
          loss rate, and jitter. Understanding the stability of such
          measurements as a function of the network topology and time provides
          a way for overlay routing algorithms to intelligently route around
          network problems. Application performance measurements also expose
          the detailed interaction between the dynamics of the forwarding
          plane and the control plane.
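
          As a sketch of how these three metrics might be computed from a
          series of probes, the fragment below takes one delay per probe,
          with None marking a lost probe. Jitter is taken here as the mean
          absolute difference between consecutive received delays, which is
          one simple definition among several; the sample values are made up.

```python
from typing import Dict, List, Optional


def summarize(probes: List[Optional[float]]) -> Dict[str, float]:
    """Compute loss rate, mean delay, and jitter from probe results."""
    received = [d for d in probes if d is not None]
    loss_rate = (len(probes) - len(received)) / len(probes)
    mean_delay = sum(received) / len(received)
    # Differences between consecutive delays that were actually received.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    return {"loss_rate": loss_rate, "delay": mean_delay, "jitter": jitter}


# Hypothetical one-way delays in milliseconds; None is a lost probe.
probes = [20.0, 22.0, None, 30.0, 28.0]
print(summarize(probes))
```

          Running the same summary before, during, and after a routing
          change shows what convergence means for an application. The sketch
          assumes that at least one probe is received.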
          
          How to protect the routing system?
          ==================================
          Problem:
          --------
          There have been relatively few studies on protecting the Internet
          routing infrastructure against attacks. Vulnerabilities in router
          architectures are relatively unknown and have not been widely
          exploited. The routing system can also be indirectly affected by
          enormous traffic volume. Recently, there have been a large number of
          worms exploiting end-host OS vulnerabilities. Significant attack
          traffic volume causes routing sessions to time out. Session resets
          result in exchange of entire routing tables and disruption of
          routing.  Cascaded failures can occur if the session reset traffic
          subsequently causes router overload and affects other peering
          sessions.
          
          Proposed solutions:
          -------------------
          (1) Understanding vendor implementation of routing protocols
          Through detailed black-box testing and support from vendors, one can
          better understand the obscure router behavior that is not documented
          in RFCs and its implications for router security.
          
          (2) Understanding vulnerability points on the Internet
          Network topology and policy information are more widely known
          through various Internet mapping efforts. Such mapping efforts help
          us discover vulnerability points by analyzing failure scenarios.
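
          One elementary failure analysis is to remove each node of a mapped
          topology in turn and test whether the rest stays connected. The
          sketch below does this by brute force on a hypothetical four-node
          graph; it ignores routing policy, which a realistic study would
          have to include.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def connected(adj: Dict[str, Set[str]], removed: Optional[str]) -> bool:
    """Breadth-first search over the graph with one node removed."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def single_points_of_failure(adj: Dict[str, Set[str]]) -> List[str]:
    """Nodes whose failure alone partitions the remaining topology."""
    return sorted(n for n in adj if not connected(adj, n))


# Hypothetical topology: A, B, C form a triangle; D hangs off C.
topology = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(single_points_of_failure(topology))
```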
          
          (3) Higher priority for routing traffic
          The delay and loss of routing traffic, especially keepalive and
          HELLO messages, can cause sessions to reset.  This can occur when
          there is significant data traffic. Increasing the queuing and
          processing priority of routing packets in the routers is one way
          to reduce the impact of bandwidth attacks on the routing system.
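
          The queuing half of this idea can be sketched as a strict-priority
          queue in which routing packets are always served before data
          packets. The class and packet names are hypothetical; real routers
          implement this in hardware or in the forwarding software.

```python
import heapq
from itertools import count
from typing import List, Tuple


class StrictPriorityQueue:
    """Serve routing packets first; keep FIFO order within each class."""

    ROUTING, DATA = 0, 1

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = count()  # arrival order, used to break ties

    def enqueue(self, packet: str, is_routing: bool) -> None:
        prio = self.ROUTING if is_routing else self.DATA
        heapq.heappush(self._heap, (prio, next(self._seq), packet))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


q = StrictPriorityQueue()
q.enqueue("data-1", is_routing=False)
q.enqueue("data-2", is_routing=False)
q.enqueue("keepalive", is_routing=True)
q.enqueue("data-3", is_routing=False)

order = [q.dequeue() for _ in range(len(q))]
print(order)
```

          The keepalive leaves the queue first even though it arrived behind
          two data packets, so a backlog of data does not delay it.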
          
          (4) Automated dynamic installation of packet and route filters
          The attack against windowsupdate.com was prevented just in time by
          invalidating the relevant entry in the DNS, a change that takes at
          least 24 hours to propagate globally.  To react to attacks in real
          time, a faster, automated mechanism is needed.  One possibility is
          to dynamically install relevant packet and route filters across a
          selected set of networks to eliminate or reduce the impact of the
          attacks. Routers have limited memory for such filters, and the order
          of the filters determines the actual routes or packets permitted. We
          need to study efficient algorithms to compute such filters on the
          fly.
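
          Why filter order matters can be shown with first-match evaluation,
          a common semantics for filter lists. In the sketch below the same
          two rules give opposite results depending on their order; the
          rules, prefixes, and default action are assumptions for the
          example.

```python
from ipaddress import ip_address, ip_network
from typing import List, Tuple

Rule = Tuple[str, str]  # (action, prefix)


def evaluate(rules: List[Rule], dst: str, default: str = "permit") -> str:
    """Return the action of the first rule whose prefix covers dst."""
    addr = ip_address(dst)
    for action, prefix in rules:
        if addr in ip_network(prefix):
            return action
    return default


# The same two hypothetical rules in two different orders.
specific_first = [("deny", "192.0.2.128/25"), ("permit", "192.0.2.0/24")]
general_first = list(reversed(specific_first))

print(evaluate(specific_first, "192.0.2.200"))
print(evaluate(general_first, "192.0.2.200"))
```

          With the general rule first, the deny rule is never reached, so an
          algorithm that computes filters on the fly must also decide where
          in the list each new rule is placed.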