
...

As part of its northbound API, ONOS provides applications with access to a global network topology view. Applications operate on this view to perform various network functions such as path computation and flow provisioning, among others.

Global Network Topology View

 

While distribution has its benefits, it also presents some interesting engineering problems. Foremost among them is maintaining the consistency of state deployed across various ONOS instances. This is especially evident in the case of global network topology state management, where each controller must expose a view of the entire network even though at any given point in time it only has direct visibility over a subset of the network.
 


There are some properties that a good network topology state management solution should have in a distributed setting:
 

  1. Completeness: Even though each controller only has direct visibility and influence over a subset of the network, they should all work together to ensure each controller’s network topology view reflects the state of the entire network.
  2. Accuracy: Switches, ports and links go up and down. Each controller’s network view should always expose the correct state for various network elements. This also means that each controller’s network view should quickly change to reflect any changes in the underlying network.
  3. Low latency access: The network topology view is a heavily consumed piece of state, so it is important that the chosen mechanism provide low-latency access to it.

Approach

...

In ONOS, the entire global network topology state is cached in memory on each instance. This provides applications with low-latency access to topology state. Before we go into the details of how the topology state is kept in sync across instances and with the state of the physical network, it is useful to define a few concepts:

 Masters and Terms 

As discussed in Device Subsystem, each network device (switch) may have direct TCP connections to one or more ONOS instances, and as described in the previous section, ONOS elects one of those controllers to serve as the master for the device. The system invariant is that at any given point in time a device can have one and only one master. If the current master dies, ONOS elects a new master from among the controllers that the device can talk to. ONOS ensures that the master is exclusively responsible for sending commands down to the device and for receiving various event notifications from it. This is a good time to define a new concept: the mastership term. A mastership term is a per-device monotonically increasing counter (starting at 0) that is bumped each time a new master is elected. The very first time a switch comes online and a master is elected, the mastership term value is 1. If that master dies and a new master is elected, the term is bumped to 2, and so on.
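The mastership-term rule above can be sketched as follows. This is an illustrative Python sketch only (ONOS itself is written in Java), and the class and method names are invented for this example, not part of any ONOS API:

```python
# Illustrative sketch of per-device mastership-term tracking.
# Names (MastershipTracker, elect_master) are hypothetical, not ONOS APIs.
class MastershipTracker:
    def __init__(self):
        self.terms = {}    # device_id -> current mastership term
        self.masters = {}  # device_id -> current master node

    def elect_master(self, device_id, node_id):
        """Record a new master for a device and bump its mastership term."""
        # The term starts at 0 and becomes 1 on the very first election.
        self.terms[device_id] = self.terms.get(device_id, 0) + 1
        self.masters[device_id] = node_id
        return self.terms[device_id]


tracker = MastershipTracker()
assert tracker.elect_master("of:0001", "controller-1") == 1  # first master
assert tracker.elect_master("of:0001", "controller-2") == 2  # master died, re-elected
```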

...

In a given mastership term, the elected master receives various topology events from the device, such as switch connected, port online, port offline, link down, etc. The current master maintains a per-switch counter to tag the topology events it receives from the device during its term. The sequence number is thus a monotonically increasing counter that is initialized to 0 at the start of the term and incremented each time the master detects a new topology event.
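Taken together, the mastership term and the per-term sequence number form a logical timestamp for each event. A minimal Python sketch, with invented names (ONOS itself is Java):

```python
# Hypothetical sketch: a master tags each topology event it detects with a
# logical timestamp of (mastership term, per-term sequence number).
class EventClock:
    def __init__(self, term):
        self.term = term
        self.seq = 0  # reset at the start of each mastership term

    def next_timestamp(self):
        """Return the logical timestamp for the next detected topology event."""
        self.seq += 1
        return (self.term, self.seq)


clock = EventClock(term=2)
assert clock.next_timestamp() == (2, 1)  # first event in term 2
assert clock.next_timestamp() == (2, 2)  # second event in term 2
```

Because the term is monotonic across master changes and the sequence number is monotonic within a term, these tuples order events correctly under Python's lexicographic tuple comparison.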

...

  1. The instance receives a topology event from a device that it is managing. The instance first assigns a new logical time stamp for this event. It then applies that event to its local topology state machine, and in parallel, broadcasts the event along with the time stamp to every other controller instance in the cluster. 
  2. The instance receives a topology event (and time stamp) broadcast by a peer. The instance first checks whether it already knows of a more recent update for this device, by comparing the time stamp of the most recent event it knows about for the device with the one it just received. If the just-received time stamp is older, it simply discards the event. Otherwise, it updates its local topology state machine by applying the received event.
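The peer-side "apply only if newer" check in step 2 can be sketched in Python (illustrative only; names are invented and the real ONOS topology store is more involved). With (term, sequence) timestamps, a plain lexicographic comparison suffices: a higher term always wins, and within a term a higher sequence number wins:

```python
# Hypothetical sketch of the receive path for broadcast topology events.
latest = {}    # device_id -> most recent timestamp seen for the device
topology = {}  # device_id -> last applied event (stand-in for the state machine)

def on_peer_event(device_id, timestamp, event):
    """Apply a broadcast event only if it is newer than what we already know."""
    if device_id in latest and timestamp <= latest[device_id]:
        return False  # stale event: discard it
    latest[device_id] = timestamp
    topology[device_id] = event
    return True


assert on_peer_event("of:0001", (2, 5), "PORT_DOWN") is True
assert on_peer_event("of:0001", (2, 3), "PORT_UP") is False  # older sequence, discarded
assert on_peer_event("of:0001", (3, 1), "PORT_UP") is True   # newer term wins
```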

...

Failure Handling and Anti-Entropy 

In the setup described above, it is possible that a controller that is temporarily partitioned away will stop receiving updates from its peers. Even under normal operation, there is no guarantee that every message that is broadcast will be received by all members of the cluster. Left to its own devices, a system based purely on an optimistic replication technique like the one described above will get progressively out of sync, and that is no good.

Another class of failures pertains to controller crashes that effectively result in a loss of topology updates. Consider a controller that, on receiving a topology event, promptly crashes before it could replicate that event to other controllers in the cluster. While ONOS automatically elects another controller as the new master for the device, the original topology event is still effectively lost. If that event was for a port going offline, the network view in each controller will continue to show the port as up if nothing else happens in the system. This is bad as well.
 

To detect and fix issues such as the ones described above, ONOS employs a couple of techniques:

  • An anti-entropy mechanism based on a gossip protocol, which works as follows: at fixed intervals (usually 3-5 seconds), a controller randomly picks another controller and they both synchronize their respective topology views. If one controller is aware of more recent information that the other controller does not have, they exchange that information, and at the end of that interaction their respective topology views are mutually consistent. Most of the time the anti-entropy interaction will be uneventful, as each controller already knows about every event that happened in the network. But when a controller's state drifts slightly, this mechanism quickly detects that and brings the controllers back in sync. This approach has the added benefit of quickly synchronizing a newly joining controller with the rest: the first anti-entropy interaction that a newly joining controller has with an existing controller will bring it up to speed, without the need for a separate backup/discovery mechanism.
     
  • For detecting and recovering from complete loss of topology updates, each controller periodically probes the devices for which it is the master. If it detects that the device state is different from the information it has, it promptly updates its local topology state and replicates that update to all other controllers in the cluster.
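A pairwise anti-entropy exchange of the kind described above can be sketched in Python. This is a simplified, hypothetical model (real gossip implementations typically exchange compact digests first rather than full views), using the (term, sequence) timestamps defined earlier:

```python
# Illustrative sketch of one anti-entropy round between two controllers.
# Each view maps device_id -> (timestamp, event); the newer entry wins.
def anti_entropy_round(view_a, view_b):
    """Merge two topology views so that both end up mutually consistent."""
    for device_id in set(view_a) | set(view_b):
        a = view_a.get(device_id)
        b = view_b.get(device_id)
        if a is None or (b is not None and b[0] > a[0]):
            view_a[device_id] = b  # A learns B's newer event
        elif b is None or a[0] > b[0]:
            view_b[device_id] = a  # B learns A's newer event


a = {"of:0001": ((2, 5), "PORT_DOWN")}
b = {"of:0001": ((2, 3), "PORT_UP"), "of:0002": ((1, 1), "DEVICE_ADDED")}
anti_entropy_round(a, b)
assert a == b  # views are mutually consistent after the exchange
```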

 

...


...