...

The goal of the High Availability (HA) test suite is to verify how ONOS performs when there are control plane failures. As HA is a key goal of ONOS, we would like to see ONOS gracefully handle failures and continue to manage the data plane while they occur. In cases where ONOS becomes unavailable, say when the entire cluster is rebooted, ONOS should quickly recover any persistent data and get back to running the network.

...

High Availability Test Scenarios

Each scenario below lists the failure being induced (Failure Scenario), the TestON test that exercises it (TestON Test name), and its roadmap status (Roadmap).

Test: HA Sanity Test
Failure Scenario: This test runs through all of the state and functionality checks of the HA test suite but waits 60 seconds instead of inducing a failure. It is run as a 7 node ONOS cluster.
TestON Test name: HAsanity
Roadmap: now

Test: Minority of ONOS Nodes shutdown
Failure Scenario: Restart 3 of 7 ONOS nodes by gracefully stopping the process once the system is running and stable.
TestON Test name: HAstopNodes
Roadmap: now

Test: Minority of ONOS Nodes continuous shutdown
Failure Scenario: Continuously (1000 times) restart 1 of 7 ONOS nodes iteratively by gracefully stopping the process once the system is running and stable. Then verify the node correctly restarts and joins the cluster.
TestON Test name: HAcontinuousStopNodes

Test: Minority of ONOS Nodes killed
Failure Scenario: Restart 3 of 7 ONOS nodes by killing the process once the system is running and stable.
TestON Test name: HAkillNodes
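
The only difference between the "shutdown" and "killed" scenarios is how the ONOS process is brought down. Below is a minimal sketch of the two operations, assuming the nodes are reachable over ssh as user sdn, that ONOS is managed as a systemd service, and that its JVM can be found by matching "karaf"; all of these are assumptions for illustration, not the exact commands the tests run.

```python
import subprocess

def graceful_stop(node):
    """Gracefully stop ONOS on a node (HAstopNodes / HAcontinuousStopNodes style).
    Assumes ONOS is managed as a systemd service named 'onos'."""
    subprocess.run(["ssh", f"sdn@{node}", "sudo systemctl stop onos"], check=True)

def kill_node(node):
    """Kill ONOS on a node without any cleanup (HAkillNodes / HAclusterRestart style).
    Assumes the ONOS process can be found by matching 'karaf' in its command line."""
    subprocess.run(["ssh", f"sdn@{node}", "sudo pkill -9 -f karaf"], check=True)
```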

Test: Entire ONOS cluster restart
Failure Scenario: Restart 7 of 7 ONOS nodes by killing the process once the system is running and stable.
TestON Test name: HAclusterRestart
Roadmap: now

Test: Single node cluster restart
Failure Scenario: Restart 1 of 1 ONOS nodes by killing the process once the system is running and stable.
TestON Test name: HAsingleInstanceRestart
Roadmap: now

Test: Control Network partition
Failure Scenario: Partition the Control Network by creating IP Table rules once the system is in a stable state.

During the partition:

  • Topology is replicated within the sub-cluster the event originated in
  • Get flows will only show flows within a sub-cluster
  • Intents should only be available in the majority cluster (Intent behavior is not fully defined for the Raft implementation)
    • ONOS nodes that cannot cluster with a quorum of the cluster will relinquish leadership of any topic and give up any device roles
    • Nodes in the majority partition should be able to take up all the work of the cluster and continue to function normally
  • Mastership behavior has not been defined for the split-brain scenario

After the partition is healed:

  • Topology is consistent across all nodes and should reflect the current state of the network (including updates during the partition)
  • Flows view should be consistent across all nodes
  • Intents should be available on all nodes, including any new intents pushed to the majority partition
  • Mastership should be consistent across all nodes

TestON Test name: HAfullNetPartition
Roadmap: now

Test: Partial control network partition
Failure Scenario: Partially partition the Control Network by creating IP Table rules once the system is in a stable state. (A and B can't talk, but both can talk to C.)

  • Topology should be consistent across all nodes
  • Flows view will show reachable controllers (A sees AC, B sees BC, and C sees ABC)
  • Intents (wait for the Raft implementation; behavior will depend on which node is the Raft leader)
  • Mastership (wait for the Raft implementation)

Roadmap: delayed
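
Both partition scenarios induce the failure by installing iptables rules on the controller hosts. The sketch below shows the general idea for the full partition; the node IPs, the ssh user, and the blanket DROP rules are assumptions for illustration, not the exact rules the tests install. For the partial partition, the same rules would only be installed between the two groups that must not communicate (A and B), leaving both able to reach C.

```python
import subprocess

# Hypothetical ONOS node IPs; the real tests read these from their params files.
GROUP_A = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
GROUP_B = ["10.0.0.4", "10.0.0.5", "10.0.0.6", "10.0.0.7"]

def block(node, peer):
    """On `node`, drop all IP traffic to and from `peer`."""
    subprocess.run(["ssh", f"sdn@{node}", "sudo", "iptables",
                    "-A", "INPUT", "-s", peer, "-j", "DROP"], check=True)
    subprocess.run(["ssh", f"sdn@{node}", "sudo", "iptables",
                    "-A", "OUTPUT", "-d", peer, "-j", "DROP"], check=True)

def partition(group_a, group_b):
    """Full partition: no node in group_a can reach any node in group_b."""
    for a in group_a:
        for b in group_b:
            block(a, b)
            block(b, a)

def heal(nodes):
    """Heal the partition by flushing the chains the rules were appended to
    (crude, but sufficient on a dedicated test host)."""
    for n in nodes:
        subprocess.run(["ssh", f"sdn@{n}", "sudo", "iptables", "-F", "INPUT"], check=True)
        subprocess.run(["ssh", f"sdn@{n}", "sudo", "iptables", "-F", "OUTPUT"], check=True)

partition(GROUP_A, GROUP_B)
```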

Test: Dynamic Clustering: Swap nodes
Failure Scenario: Change membership of an ONOS cluster at run time.

  • Start a five node ONOS cluster configured to use a remote metadata file
  • Run common state and functionality checks
  • Replace two of the five ONOS nodes in the metadata file with two new ONOS nodes
  • Check that the two new nodes join the cluster and the two replaced nodes leave
  • Run common state and functionality checks

TestON Test name: HAswapNodes
Roadmap: now

Test: Dynamic Clustering: Scale up/down
Failure Scenario: Change the size of an ONOS cluster at run time.

  • Start a single node ONOS cluster configured to use a remote metadata file
  • Run common state and functionality checks
  • Scale the cluster up to seven nodes, then back down to one node, in increments of two nodes
  • After changing the cluster size:
    • Check that the two new nodes join the cluster or the two old nodes leave
    • Run common state and functionality checks

TestON Test name: HAscaling
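
Both dynamic clustering tests drive membership changes through a cluster metadata file that every node is configured to fetch from a remote location. The sketch below shows the rough shape of such a file and how a test might rewrite it to swap members; the field names, the port, and the publishing path are assumptions for illustration, not the exact format a given ONOS release consumes.

```python
import json

def node(node_id, ip):
    # Field names here are assumptions about the cluster metadata format.
    return {"id": node_id, "ip": ip, "port": 9876}

metadata = {
    "name": "onos-ha-test",
    "nodes": [node(f"ONOS{i}", f"10.0.0.{i}") for i in range(1, 6)],
}

def swap_nodes(meta, old_ids, new_nodes):
    """Return updated metadata with `old_ids` removed and `new_nodes` added."""
    kept = [n for n in meta["nodes"] if n["id"] not in old_ids]
    return dict(meta, nodes=kept + new_nodes)

# Replace two of the five members, then publish the file where the running
# cluster can fetch it (hypothetical path served over HTTP).
metadata = swap_nodes(metadata, {"ONOS4", "ONOS5"},
                      [node("ONOS6", "10.0.0.6"), node("ONOS7", "10.0.0.7")])
with open("/var/www/html/cluster.json", "w") as f:
    json.dump(metadata, f, indent=2)
```
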
Test: Offline Backup Recovery
Failure Scenario: Take a backup of ONOS data and restore ONOS using the backup.

  • Take a backup of each ONOS node's data
  • Uninstall and reinstall ONOS without starting the service
  • Copy the backed up data over to the new installations
  • Start ONOS using the backed up data

TestON Test name: HAbackupRecover
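
A rough sketch of that flow, assuming ssh access to each node, a systemd-managed ONOS service, and a hypothetical data directory; the real test delegates the uninstall/reinstall step to the ONOS deployment tooling.

```python
import subprocess

NODES = [f"10.0.0.{i}" for i in range(1, 8)]        # hypothetical node IPs
DATA_DIR = "/opt/onos/apache-karaf-current/data"    # hypothetical data directory

def ssh(node, cmd):
    subprocess.run(["ssh", f"sdn@{node}", cmd], check=True)

# 1. Back up each node's data directory while ONOS is not running.
for n in NODES:
    ssh(n, f"tar czf /tmp/onos-backup.tar.gz -C {DATA_DIR} .")

# 2. Uninstall and reinstall ONOS without starting it (left to the deployment
#    tooling), then restore the backed up data over the fresh installation.
for n in NODES:
    ssh(n, f"tar xzf /tmp/onos-backup.tar.gz -C {DATA_DIR}")

# 3. Start ONOS everywhere and wait for the cluster to reform with the old state.
for n in NODES:
    ssh(n, "sudo systemctl start onos")             # assumption: systemd service
```
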
Test: ISSU
Failure Scenario: Perform an In-Service Software Upgrade (ISSU) of ONOS.

  • Start ONOS
  • Initiate the upgrade
  • Restart a minority of nodes with the new ONOS version
  • Verify the new nodes have come up
  • Transition the cluster from the old version to the new version
  • Restart the remaining nodes with the new version one at a time
  • Commit the upgrade

TestON Test name: HAupgrade

Test: ISSU - Rollback
Failure Scenario: Roll back an In-Service Software Upgrade (ISSU) of ONOS.

  • Start ONOS
  • Initiate the upgrade
  • Restart a minority of nodes with the new ONOS version
  • Verify the new nodes have come up
  • Transition the cluster from the old version to the new version
  • Roll back the upgrade
  • Restart the upgraded nodes with the old version one at a time

TestON Test name: HAupgradeRollback
Roadmap: now
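
A sketch of how those steps might be driven, assuming the cluster exposes its upgrade state machine through an issu-style CLI command with init/upgrade/commit/rollback phases; the command names, the CLI-over-ssh invocation, and the node redeployment helper are all assumptions for illustration.

```python
import subprocess

NODES = [f"10.0.0.{i}" for i in range(1, 8)]   # hypothetical node IPs
MINORITY = NODES[:3]                           # nodes moved to the new version first

def onos_cli(node, command):
    """Run one command on a node's ONOS CLI (karaf ssh shell on port 8101)."""
    subprocess.run(["ssh", "-p", "8101", f"onos@{node}", command], check=True)

def redeploy(node, version):
    """Hypothetical placeholder: reinstall and restart one node with `version`.
    In the real test this is handled by the deployment tooling."""
    print(f"redeploying {node} with the {version} ONOS version")

onos_cli(NODES[0], "issu init")        # command name is an assumption
for n in MINORITY:
    redeploy(n, "new")
onos_cli(NODES[0], "issu upgrade")     # transition the cluster to the new version
for n in NODES[3:]:
    redeploy(n, "new")
onos_cli(NODES[0], "issu commit")      # the rollback test issues "issu rollback" instead
```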

State and Functionality Checks in the HA Test Suite

Each check below lists a description of what is verified, the passing criteria, and its roadmap status (Roadmap).

Check: Topology Discovery
Passing Criteria:
  • All switches, links, and ports are discovered
  • All information (DPIDs, MACs, port numbers) is correct
  • ONOS correctly discovers any change in dataplane topology
  • Each node in an ONOS cluster has the same correct view of the topology
Roadmap: now
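
For example, the per-node topology view can be compared over the REST API along these lines; the node IPs and credentials are placeholders, and the default web port 8181 and the summary fields returned by /onos/v1/topology are assumptions about the release under test.

```python
import requests

NODES = [f"10.0.0.{i}" for i in range(1, 8)]   # hypothetical node IPs
AUTH = ("onos", "rocks")                       # default REST credentials (assumption)

def topology_summary(node):
    """Fetch the device and link counts that one node believes the topology has."""
    r = requests.get(f"http://{node}:8181/onos/v1/topology", auth=AUTH, timeout=10)
    r.raise_for_status()
    summary = r.json()
    return summary.get("devices"), summary.get("links")

views = {n: topology_summary(n) for n in NODES}
assert len(set(views.values())) == 1, f"nodes disagree on the topology: {views}"
```
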
Check: Device Mastership
Passing Criteria:
  • Devices have one and only one ONOS node as master
  • Mastership correctly changes when device roles are manually changed
  • Mastership fails over if the current master becomes unavailable
  • Each node in an ONOS cluster has the same view of device mastership
Roadmap: now

Check: Intents
Passing Criteria:
  • Intents can be added between hosts
  • Hosts connected by intents have dataplane connectivity
  • Intents remain in the cluster as long as some ONOS nodes are available
  • Connectivity is preserved during dataplane failures as long as at least one path exists between the hosts
Roadmap: now
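
As a concrete example, a host-to-host intent can be submitted through the REST API roughly as below; the controller IP, credentials, and host IDs are placeholders, and the exact JSON fields accepted by /onos/v1/intents depend on the ONOS release.

```python
import requests

CONTROLLER = "10.0.0.1"        # any reachable ONOS node (placeholder IP)
AUTH = ("onos", "rocks")       # default REST credentials (assumption)

# Host IDs as reported by ONOS host discovery (MAC/VLAN, placeholders here).
intent = {
    "type": "HostToHostIntent",
    "appId": "org.onosproject.cli",
    "one": "00:00:00:00:00:01/-1",
    "two": "00:00:00:00:00:02/-1",
}

r = requests.post(f"http://{CONTROLLER}:8181/onos/v1/intents",
                  json=intent, auth=AUTH, timeout=10)
r.raise_for_status()
# The test then pings between the two hosts and expects connectivity to survive
# dataplane failures as long as at least one path between them remains.
```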

Check: Switch Failure
Passing Criteria:
  • Topology is updated and intents are recompiled
Roadmap: now

Check: Link Failure
Passing Criteria:
  • Topology is updated and intents are recompiled
Roadmap: now

Check: Leadership Election
Description: Applications can run for leadership of topics. This service should be safe, stable, and fault tolerant.
Passing Criteria:
  • The service is functional before and after failures; nodes can withdraw and run for election
  • There is always only one leader per topic in an ONOS cluster
Roadmap: now

Check: Distributed Sets
Description: Call each of the following APIs and make sure they are functional and cluster wide:
  • get()
  • size()
  • add()
  • addAll()
  • contains()
  • containsAll()
  • remove()
  • removeAll()
  • clear()
  • retain()
In addition, we also check that sets are unaffected by ONOS failures.
Roadmap: now

Check: Distributed Atomic Counters
Description: Call each of the following APIs and make sure they are functional and cluster wide:
  • incrementAndGet()
In addition, we also check that counters are unaffected by ONOS failures.
Note: In-memory counters will not persist across cluster wide restarts.
Roadmap: now

Check: Cluster Service
Passing Criteria:
  • Every ONOS node should be clustered with every other node in the test (unless we specifically make one unavailable)
Roadmap: now
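
A sketch of how this might be checked over REST, assuming a /onos/v1/cluster endpoint that returns each node's view of its peers; the field names and the READY status value are assumptions for illustration.

```python
import requests

NODES = [f"10.0.0.{i}" for i in range(1, 8)]   # hypothetical node IPs
AUTH = ("onos", "rocks")                       # default REST credentials (assumption)

def active_peers(node):
    """Return the set of member IPs this node currently reports as ready."""
    r = requests.get(f"http://{node}:8181/onos/v1/cluster", auth=AUTH, timeout=10)
    r.raise_for_status()
    return {m["ip"] for m in r.json()["nodes"] if m.get("status") == "READY"}

expected = set(NODES)
for n in NODES:
    # Every node should see every other node, unless the test took one down on purpose.
    assert active_peers(n) == expected, f"{n} does not see the full cluster"
```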

Check: Application Service
Passing Criteria:
  • Application IDs are unique to an application
  • Application activation
  • Application deactivation
  • Active applications reactivate on restart
Roadmap: now
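
Activation and deactivation can be exercised through the applications REST endpoint, sketched below; the application name is only an example and the endpoint shape (POST/DELETE on .../active) is an assumption based on the /onos/v1/applications API.

```python
import requests

CONTROLLER = "10.0.0.1"            # any ONOS node (placeholder IP)
AUTH = ("onos", "rocks")           # default REST credentials (assumption)
BASE = f"http://{CONTROLLER}:8181/onos/v1/applications"
APP = "org.onosproject.openflow"   # example application

# Activate, confirm the reported state, then deactivate again.
requests.post(f"{BASE}/{APP}/active", auth=AUTH, timeout=10).raise_for_status()
state = requests.get(f"{BASE}/{APP}", auth=AUTH, timeout=10).json().get("state")
assert state == "ACTIVE", f"unexpected application state: {state}"
requests.delete(f"{BASE}/{APP}/active", auth=AUTH, timeout=10).raise_for_status()
```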

    ...