What is in a router's mind?
Network operators and researchers often need to understand the
behavior of routing in IP networks. Network
administrators need to know which paths are being used and why in
order to efficiently perform traffic engineering and to be able to
detect a routing anomaly, identify the root cause, and fix the
problem. An accurate analysis of routing dynamics is also important
for obtaining realistic routing models that researchers and developers
can use.
Despite recent advances in monitoring and measurement techniques and
the resulting increase in the amount of information available about IP
networks, identifying the root cause of a routing change is still a
challenging task. Currently, most ISPs have monitors and taps that
collect a large volume of data. One can construct the routing state by
putting together routing messages, table dumps, router logs and
configurations, etc. Traffic flow and performance can also be
inferred, either by using active measurements or by collecting passive
traffic measurements (NetFlow, Gigascope, IPMon, SNMP, RMON).
Why is it so hard to put all these pieces together to precisely
determine the cause of a routing change?
(i) There is too much data.
The underlying event (such as a fiber cut, a policy change, or a
misconfiguration) isn't directly visible. A single event can manifest
itself in a number of sources of data, and sometimes even multiple
times in a single data source. For instance, a single fiber cut may
generate a number of link state advertisements and BGP update
messages, and a shift in traffic. Identifying the underlying event may
require combining all these data sources.
(ii) There is not enough data.
Routing information is distributed and understanding the network-wide
behavior depends on the interaction of a number of routers. Moreover,
the forwarding table in each router is constructed based on the
complex interaction of IGP, iBGP, eBGP, vendor-specific
implementations, and domain-specific policies. It is not feasible to
instrument all nodes in a large operational network (Could we
eventually do that?), so we don't have complete information even for a
single domain. None of the data sources available today have enough
information to determine the event that triggered a particular routing
change. Multiple events could cause a similar stream of routing
messages. Routing messages carry enough information to route, but do
not explain the reason for choosing a particular route.
Thus, one piece of information that is missing in this puzzle is: "why
did the router change its mind?" In order to understand routing
behavior, we need to be able to pinpoint the event that triggered a
routing change. For networks using link-state protocols, this task is
made easier by the flood of link state advertisements to all nodes in
the network. However, most traffic that transits ISP networks is
routed using BGP. Unfortunately, many types of events can trigger a
BGP routing change, and no single network administrator has complete
information to determine its root cause. For instance, a policy
change, a failure, or a BGP
misconfiguration may all be reported as a withdrawal of a set of
prefixes. The IGP area structure and the iBGP route reflector
hierarchy further complicate reasoning about routing behavior. (Is
this extra complexity really necessary?)
Can routers help us understand their behavior? How can routers
explicitly report the event that triggered a change in behavior? One
could envision at least two ways of obtaining this information:
(i) Annotate routing messages
BGP could allow the establishment of monitoring sessions. In these
sessions, a BGP speaker could send all alternative routes, so that the
monitor is aware of all the possible choices. In addition, BGP updates
could carry extra attributes containing all the information used in
the decision process (such as IGP distances and router IDs). A
monitoring box would then have all the information the router had when
deciding which routes to pick.
(ii) Expose state in routers
There are a number of factors that can trigger a routing
change. Router logs could be extended to store the event that
triggered a particular change. Examples of events could be an IGP
message, a BGP update, a configuration change, or missing
hellos. Event logs could then be used to trace back to the root cause
of a routing change. Would such logging be feasible? Would it provide
all the information needed to determine the root cause?
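To make option (i) concrete, here is a minimal sketch of what a
monitoring box might do with annotated updates: replay a simplified
decision process over the candidate routes and report which step
settled the choice. This is an illustration under stated assumptions,
not the full BGP decision process. It considers only local preference,
AS path length, IGP distance, and router ID, and it omits steps such
as origin, MED, and the preference for eBGP over iBGP routes. The
Candidate class, its field names, and replay_decision are hypothetical
names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One alternative route as a monitor might see it (hypothetical fields)."""
    next_hop: str
    local_pref: int
    as_path_len: int
    igp_distance: int
    router_id: int


# Simplified, ordered tie-breaking steps. Each key is arranged so that
# a smaller value is preferred (local_pref is negated: higher wins).
STEPS = [
    ("local_pref", lambda c: -c.local_pref),
    ("as_path_len", lambda c: c.as_path_len),
    ("igp_distance", lambda c: c.igp_distance),
    ("router_id", lambda c: c.router_id),
]


def replay_decision(candidates):
    """Return (best_route, name_of_the_step_that_decided)."""
    if not candidates:
        raise ValueError("no candidate routes")
    if len(candidates) == 1:
        return candidates[0], "only_route"
    remaining = list(candidates)
    for name, key in STEPS:
        best_value = min(key(c) for c in remaining)
        # Keep only the routes that tie for best at this step.
        remaining = [c for c in remaining if key(c) == best_value]
        if len(remaining) == 1:
            return remaining[0], name
    # Routes identical on every attribute considered here.
    return remaining[0], "unresolved"
```

For example, given two candidates that tie on local preference and AS
path length, the replay names igp_distance as the deciding step. If
the IGP distance toward the first next hop then rises above the
second, the replay selects the other route and again names
igp_distance, which points to an intradomain change rather than a BGP
policy change as the likely trigger.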
After collecting data from a number of routers, one would need to join
all these datasets in order to determine the root cause of a routing
change. Combining this information may still be challenging due to
timing issues. Routing monitors record data remotely, and an event may
take some time to propagate through the network. This has two
consequences: we cannot assume that all routers will react to the same
event at the same time, and there may be a delay between the time the
router changes its state and the time the monitor records the change.
Should the timestamps recorded by routing monitors be defined by the
router itself?
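The join described above can be sketched as follows. This is a
minimal illustration, assuming that router event logs and monitor
records carry timestamps on a comparable clock (which is exactly what
the question above puts in doubt) and that a change is recorded by the
monitor at most max_delay seconds after the event that caused it. The
function name, the (timestamp, description) tuple layout, and the
default tolerance are all hypothetical.

```python
from bisect import bisect_right


def attribute_changes(router_events, monitor_times, max_delay=2.0):
    """Match each monitor record to the latest preceding router event.

    router_events: list of (timestamp, description) tuples.
    monitor_times: timestamps at which the monitor recorded a change.
    Returns one description per monitor record, or None when no event
    occurred within max_delay seconds before the record.
    """
    events = sorted(router_events)          # order by timestamp
    times = [t for t, _ in events]
    result = []
    for seen_at in monitor_times:
        # Index of the last event at or before the recorded time.
        i = bisect_right(times, seen_at) - 1
        if i >= 0 and seen_at - times[i] <= max_delay:
            result.append(events[i][1])
        else:
            result.append(None)
    return result
```

The tolerance makes the trade-off explicit: a small max_delay leaves
slowly propagating changes unexplained, while a large one risks
attributing a change to an unrelated earlier event.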
Since understanding routing behavior is such an essential part of
managing a network effectively, why not take it into consideration
when building routing protocols?