Routing Problems Are Too Easy to Cause, and Too Hard to Diagnose
================================================================
IP routing protocols, such as OSPF or BGP, form a complex,
highly-configurable distributed system underlying the end-to-end
delivery of data packets. "Highly configurable" is a nice way of
saying "hard to configure" or "easy to misconfigure," and "distributed
system" is a nice way of saying "hard to understand" or "hard to
debug." As a result, we have a routing system today in which a single
typographical error by a human operator can easily disconnect parts of
the Internet, and diagnosing and fixing routing problems remains an
elusive black art. This is unacceptable for any technology that would
be considered a core communication infrastructure. I believe that
the networking research community should devote significant attention
to improving the state of the art in router configuration and network
troubleshooting.
Several factors conspire to make IP router configuration extremely challenging:
- Vendor configuration languages are primitive and low-level, much like
assembly language, so configurations grow very long (e.g., a typical
router may have ten thousand lines of configuration commands)
- Routers implement numerous complex protocols (e.g., static routes,
RIP, EIGRP, IS-IS, OSPF, BGP, MPLS, and various multicast protocols)
that have many tunable parameters (e.g., timers, link weights/areas,
and BGP routing policies)
- The routing protocols interact with each other (e.g., "hot-potato"
routing in BGP based on the underlying IGP, use of static routes to
reach the remote BGP end-point, and route injection between protocols)
- Scalability often requires even more complex configuration to limit
the scope of routing information (e.g., OSPF areas and summarization,
BGP route reflectors and confederations, and route aggregation)
- Networks are configured at the element (or router) level, rather than
as a single cohesive unit with well-defined policies and constraints
- Key network operations goals, such as traffic engineering and
security, are not directly supported, requiring operators to tweak the
router configuration in the hope of having the right (indirect) effect
on the network and its traffic
Addressing these problems will require research in configuration
languages, protocol modeling, and network modeling. Ideally, this work
would lead to a higher level of abstraction for managing the
configuration of the network, to tools for configuration checking and,
better yet, to automated generation of configuration from a
higher-level specification of the network goals. Extensions to (or
replacements for!) the routing protocols may also be necessary to
rectify some of these problems.
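
As a rough illustration of what network-wide configuration checking
might look like, the Python sketch below treats a set of router
configurations as a single unit and checks one simple invariant: every
BGP session must be configured on both ends, with consistent AS
numbers. The RouterConfig structure and the check_bgp_sessions function
are hypothetical names invented for this example, not part of any
vendor tool. A real checker would first have to parse vendor-specific
configuration files and would verify many more invariants.

```python
from dataclasses import dataclass, field


@dataclass
class RouterConfig:
    """Hypothetical, simplified view of one router's BGP configuration."""
    name: str
    asn: int
    # Maps neighbor router name -> AS number this router expects it to be in.
    bgp_neighbors: dict = field(default_factory=dict)


def check_bgp_sessions(configs):
    """Return human-readable problems found across all configurations."""
    by_name = {c.name: c for c in configs}
    problems = []
    for c in configs:
        for peer, expected_asn in sorted(c.bgp_neighbors.items()):
            other = by_name.get(peer)
            if other is None:
                problems.append(
                    f"{c.name}: neighbor {peer} has no known configuration")
            elif other.asn != expected_asn:
                problems.append(
                    f"{c.name}: expects {peer} in AS {expected_asn}, "
                    f"but it is in AS {other.asn}")
            elif c.name not in other.bgp_neighbors:
                problems.append(
                    f"{c.name}: session to {peer} is not configured on {peer}")
    return problems


# Example: r2 contains a typo in r1's AS number, and r3 never
# configures its side of the session with r1.
configs = [
    RouterConfig("r1", 65001, {"r2": 65001, "r3": 65002}),
    RouterConfig("r2", 65001, {"r1": 65010}),
    RouterConfig("r3", 65002, {}),
]
problems = check_bgp_sessions(configs)
for p in problems:
    print(p)
```

Neither error in the example is visible when each router's
configuration is inspected in isolation. Both appear only when the
network is checked as a whole, which is the argument for configuring
and verifying networks above the element level.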
Detecting, diagnosing, and fixing routing problems is also very
difficult, for several reasons:
- Routing protocols are hard to configure, making configuration
mistakes very common (see above!)
- Routing protocols do not convey enough information to explain why a
route has changed (or disappeared entirely)
- No authoritative record exists that can identify which routes are
valid (e.g., whether the originating AS is entitled to advertise the
prefix, or whether a given AS should be providing transit service
between two other ASes)
- Failures, configuration errors, or malicious acts in remote
locations can affect the path between two hosts
- Reachability problems can arise for other reasons, unrelated to the
routing protocols (e.g., packet filtering or firewalls, MTU mismatches,
network congestion, and overloaded or faulty end hosts)
- The end-to-end forwarding path depends on the complex interaction between
multiple routing protocols running in a large collection of networks
- Route filtering and route aggregation (often necessary for scalability) can
lead to subtle reachability problems, including persistent forwarding loops
- The network offers little support for active measurement of the
forwarding path (e.g., traceroute is very primitive, and limited in
its accuracy and potential uses)
- The Internet topology is not fully known, at the router or the AS
levels (or in terms of AS relationships and policies), and may be
inherently unknowable
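
To make one of these failure modes concrete, the Python sketch below
shows how a diagnostic tool might classify the forwarding path toward a
single destination if it could see every router's next hop. The
trace_path function and the next_hop table are hypothetical and exist
only for this example. The sketch assumes exactly the global view that,
as the list above argues, operators usually lack, particularly across
network boundaries.

```python
def trace_path(next_hop, source, destination):
    """Follow next hops from source toward destination.

    next_hop maps each router name to the neighbor it forwards to for
    this destination; a router with no entry drops the packet.
    Returns (outcome, path), where outcome is "delivered", "loop",
    or "blackhole".
    """
    path = [source]
    seen = {source}
    current = source
    while current != destination:
        nxt = next_hop.get(current)
        if nxt is None:
            # No route at this router: the packet is dropped here.
            return "blackhole", path
        path.append(nxt)
        if nxt in seen:
            # Revisiting a router means the packet would circulate.
            return "loop", path
        seen.add(nxt)
        current = nxt
    return "delivered", path


# Routers b and c each believe the other is the way to reach d,
# as can happen when filtering or aggregation hides the real route.
looping = {"a": "b", "b": "c", "c": "b"}
outcome, path = trace_path(looping, "a", "d")
print(outcome, path)
```

With a single vantage point and only traceroute-style probes, telling a
loop of this kind apart from a filtered or congested path takes far
more guesswork.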
Like router configuration, network troubleshooting has received little
attention from the research community, despite its importance to
network practitioners. Research work in network support for
measurement, extensions to routing protocols to facilitate diagnosis,
and new diagnostic tools would be extremely valuable for improving the
state of the art.