Position statement for WIRED Z. Morley Mao UC Berkeley zmao@eecs.berkeley.edu How to debug the routing system? ================================ Problem: -------- Today, network operators have very limited tools to debug routing problems. Only primitive tools such as traceroute and ping are commonly used to identify existing routing behavior. There is very little visibility into the routing behavior of other ISPs' networks from a given ISP's perspective, making it even more difficult to identify the culprit of any routing anomalies. This also means that it is difficult to predict the impact any routing policy change has on the global routing behavior. Oftentimes, routing problems are noticed only after a customer complains about reachability or severe degradation of performance. There is lack of proactive, automated analysis of routing problems that detect routing problems at early stages. As certain routing problems initially may not be very obvious and result in suboptimal and unintended routes. Diagnosing Internet routing problems often requires analysis of data from multiple vantage points. Proposed solutions: ------------------- (1) Build routing assertions, so that nothing fails silently. When network operator configures a network, it is important to create a set of assertions, equivalent to integrity constraints in database or assertions in software programs. This generates the expected behavior of the routing protocols in terms of which routes are allowed, the resulting attributes of the routes, etc. These constraints can be checked dynamically by a route monitor. (2) cooperation among networks Each network builds a measurement repository to collect data from multiple locations. It builds a profile of the expected routing behavior to quickly identify any deviations using statistic techniques. Cooperation across networks is absolutely necessary to diagnose global Internet routing problems. It is a challenge to provide summaries of measurement data at sufficiently detailed level to be useful but without revealing sensitive information about internals of ISP's networks. A complementary approach is to allow special distributed queries of the detailed network data from multiple vantage points without direct access to the data. (3) scalable distributed measurement interpretation and measurement calibrations Routing measurement (e.g., BGP) can result in significant data volume and it may be infeasible to perform real-time or online interpretation of such measurement data by combining all the data from multiple locations in distinct networks at a centralized location. Distributed algorithms are useful to interpret measurement results locally and then aggregate them intelligently to identify routing anomalies. Interpreting measurement can be challenging as there is a lack of global knowledge of topologies and policies which can arbitrarily translate a given measurement input signal to observed output signals. We propose the use of calibration points to help identify expected or normal routing behavior and correlate the output with the input. Calibration points are well-controlled active measurement probes with known measurement input. The BGP Beacons work is one such example of an attempt to understand the patterns of output for a known input routing change. (4) Internet-wide emulation for network configurations The impact of a single routing configuration change caused by a policy change for example could be global; thus, it is important to emulate the behavior in advance to study its impact. It is useful to abstract the routing behavior in a single network at a higher level to study the perturbation on the global routing system. Currently, the routing configuration is done at a device level. Higher-level programming support is needed to provide semantically more meaningful configuration of networks. Predicting the output of a routing configuration implicitly assumes that routing is deterministic. However, nondeterministic routing may be more stable by preferring routes that have been in the routing tables the longest. Such tradeoffs are important to study. (5) Understanding the interaction of multiple routing protocols and implementation variants Internet routing consists of multiple protocols, e.g., interdomain, intradomain routing protocols, and MPLS label distribution protocol. All these protocols interact to achieve end-to-end routing behavior from an application's point of view. It is critical to understand their dependency on each other. For instance, in BGP/MPLS IP VPNs, the label distribution protocol is needed to set up label switched paths across the network and if that is unsuccessful, BGP cannot find a route. There is similar dependence of BGP on OSPF or IS-IS. Implementation variants among router vendors determine routing dynamics which is poorly understood. The interaction among the variants may result in unexpected behavior and needs to be studied. (6) Understanding routing "politics" When a customer complains about routing problems either in terms of reachability or poor performance, it typically is in the context of some applications. Network operators install route filters in the routers to determine which routes to accept in calculating the best path to forward traffic. Packet filters at the routers are much more flexible in the sense that they determine which packets are accepted for forwarding based on attributes of the packets, e.g., port numbers, protocol types. Given a route in one's routing table received by one's upstream provider, there is no guarantee that all application traffic can reach the destination due to the presence of packet filters. Some networks, for instance, perform port-based filtering to protect against known worm traffic. When debugging routing problems, one needs to view from application's perspective to understand which type of application traffic is correctly forwarded. How to improve the application performance? ------------------------------------------- Problem: -------- Today, the Internet has no performance guarantees for real-time or delay-sensitive applications, such as VoIP, gaming, especially if traffic goes across multiple networks. To obtain flexible routing in terms of control over cost and performance of network paths, end users resort to either multihoming to multiple networks or overlay routing. However, studies have shown that there may be potential adverse interaction between application routing and traffic engineering at the IP layer. Multihoming, similarly, is not a perfect solution as it does not directly translate to paths with performance guarantees, has little impact on how incoming traffic reaches the customers, and may further amplify the amount of routing traffic during convergence. Proposed solution: ------------------ Application is the king: correlate routing with forwarding plane, evaluate and improve in the context of application performance metrics: delay, loss rate, and jitter. When studying routing protocol performance, researchers often use convergence delay as a universal metric. However it does not translate directly to metrics applications care about, e.g., delay, loss rate, and jitter. Understanding the stability of such measurements as a function of the network topology and time provides a way for overlay routing algorithms to intelligently route around network problems. Application performance measurements also expose the detailed interaction between the dynamics of forwarding plane and control plane. How to protect the routing system? ---------------------------------- Problem: -------- There has been relatively little studies on protecting the Internet routing infrastructure against attacks. Vulnerabilities in router architectures are relatively unknown and have not been widely exploited. The routing system can also be indirectly affected due to enormous traffic volume. Recently, there has been a large number of worms exploiting end host OS vulnerability. Significant attack traffic volume causes router sessions to time out. Session resets result in exchange of entire routing tables and disruption of routing. Cascaded failures can occur if the session reset traffic subsequently cause router overload and other peering sessions to be affected. Proposed solution: ------------------ (1) Understanding vendor implementation of routing protocols Through detailed black-box testing and support from vendors, one can better understand the obscure, undocumented behavior of routers that are not documented in RFCs and their implication on router security. (2) Understanding vulnerability points on the Internet Network topology and policy information are more widely known through various Internet mapping effort. Such mapping efforts help us discover vulnerability points by analyzing failure scenarios. (3) Higher priority for routing traffic The delay and loss of routing traffic, especially keepalive HELLO messages, can cause sessions to reset. This can occur when there is significant data traffic. Increasing the queuing and processing priority of routing packets in the routers is one possibility to reduce the impact of bandwidth attacks on the routing system. (4) Automated dynamic installation of packet and route filters The attack against windowsupate.com was prevented just in time by invalidating the relevant DNS entry in the DNS system, which takes at least 24 hours to propagate any change globally. To react to any attacks in real time, there needs to be a faster and automated way. One possibility is to dynamically install relevant packet and route filters across a selected set of networks to eliminate/reduce the impact of the attacks. Routers have limited memory for such filters and the order of the filters determine the actual routes or packets permitted. We need to study efficient algorithms to compute such filters on the fly.