This is an archive of the ONOS 1.2 wiki.

Test Suite Description

The goal of the High Availability (HA) tests is to verify how ONOS performs when there are control plane failures. As HA is a key goal of ONOS, we would like to see ONOS gracefully handle failures and continue to manage the data plane while they occur. In cases where ONOS becomes unavailable, for example when the entire cluster is rebooted, ONOS should quickly recover any persistent data and get back to running the network.

The general structure of this test suite is to start an ONOS cluster and confirm that it has reached a stable working state. Once this state is reached, we trigger a failure scenario and verify that ONOS correctly recovers from it. Below, the first section describes the failure scenarios and the second describes the functionality and state checks that are performed in each of the HA tests.

High Availability Test Scenarios

HA Sanity Test
Failure scenario: This test runs through all of the state and functionality checks of the HA test suite, but waits 60 seconds instead of inducing a failure. It is run as a seven-node ONOS cluster.
TestON test name: HATestSanity
Roadmap: now

Minority of ONOS Nodes restart
Failure scenario: Restart 3 of 7 ONOS nodes by killing the process once the system is running and stable.
TestON test name: HATestMinorityRestart
Roadmap: now

Entire ONOS cluster restart
Failure scenario: Restart 7 of 7 ONOS nodes by killing the process once the system is running and stable.
TestON test name: HATestClusterRestart
Roadmap: now

Single node cluster restart
Failure scenario: Restart 1 of 1 ONOS nodes by killing the process once the system is running and stable.
TestON test name: SingleInstanceHATestRestart
Roadmap: now

Control Network partition
Failure scenario: Partition the control network by creating iptables rules once the system is in a stable state.
During the partition:
  • Topology is replicated within the sub-cluster in which the event originated
  • Getting flows will only show the flows within a sub-cluster
  • Intents should only be available in the majority sub-cluster (intent behavior is not fully defined for the Raft implementation)
  • Mastership behavior has not been defined for the split-brain scenario
After the partition is healed:
  • Topology is consistent across all nodes and should reflect the current state of the network (including updates made during the partition)
  • The flows view should be consistent across all nodes
  • Intents should be available on all nodes, including any new intents pushed to the majority sub-cluster
  • Mastership should be consistent across all nodes

Partial network partition
Failure scenario: Partially partition the control network by creating iptables rules once the system is in a stable state (nodes A and B cannot talk to each other, but both can talk to node C).
Expected behavior:
  • Topology should be consistent across all nodes
  • The flows view will show the reachable controllers (A sees A and C, B sees B and C, and C sees A, B, and C)
  • Intents: wait for the Raft implementation; behavior will depend on which node is the Raft leader
  • Mastership: wait for the Raft implementation

State and Functionality Checks in the HA Test Suite

Each check below lists its passing criteria and its roadmap status.

Topology Discovery
Passing criteria:
  • All switches, links, and ports are discovered
  • All information (DPIDs, MACs, port numbers) is correct
  • ONOS correctly discovers any change in the dataplane topology
  • Each node in an ONOS cluster has the same correct view of the topology
Roadmap: now
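
For illustration, here is a minimal Java sketch of the kind of per-node check involved, assuming an injected TopologyService reference (the class here is hypothetical; the actual tests compare the CLI/REST output of every node):

    import org.onosproject.net.topology.Topology;
    import org.onosproject.net.topology.TopologyService;

    // Illustrative check of one node's topology view; the HA tests run the
    // equivalent comparison against every node in the cluster.
    public class TopologyCheck {
        private final TopologyService topologyService;

        public TopologyCheck(TopologyService topologyService) {
            this.topologyService = topologyService;
        }

        // True if this node's current topology matches the expected counts.
        public boolean matches(int expectedDevices, int expectedLinks) {
            Topology topology = topologyService.currentTopology();
            return topology.deviceCount() == expectedDevices
                    && topology.linkCount() == expectedLinks;
        }
    }
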
Device Mastership
Passing criteria:
  • Devices have one and only one ONOS node as master
  • Mastership correctly changes when device roles are manually changed
  • Mastership fails over if the current master becomes unavailable
  • Each node in an ONOS cluster has the same view of device mastership
Roadmap: now
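
As a sketch (assuming an available MastershipService reference; the real tests drive role changes through the CLI), checking that a device has a master looks roughly like:

    import org.onosproject.cluster.NodeId;
    import org.onosproject.mastership.MastershipService;
    import org.onosproject.net.DeviceId;

    // Illustrative only: a device should always have exactly one elected
    // master, and every node should return the same answer for it.
    public final class MastershipCheck {
        private MastershipCheck() {}

        public static boolean hasMaster(MastershipService mastership, DeviceId deviceId) {
            NodeId master = mastership.getMasterFor(deviceId);
            return master != null; // null would mean no master is elected
        }
    }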

Intents
Passing criteria:
  • Intents can be added between hosts
  • Hosts connected by intents have dataplane connectivity
  • Intents remain in the cluster as long as some ONOS nodes are available
  • Connectivity is preserved during dataplane failures as long as at least one path exists between the hosts
Roadmap: now
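
For context, adding a host-to-host intent through the Java API looks roughly like the sketch below (the host IDs are illustrative; it assumes an application ID and an IntentService reference, while the tests themselves add intents via the CLI):

    import org.onosproject.core.ApplicationId;
    import org.onosproject.net.HostId;
    import org.onosproject.net.intent.HostToHostIntent;
    import org.onosproject.net.intent.IntentService;

    // Illustrative sketch: request connectivity between two hosts.
    public final class IntentExample {
        private IntentExample() {}

        public static void connect(IntentService intents, ApplicationId appId) {
            HostId one = HostId.hostId("00:00:00:00:00:01/-1"); // example host IDs
            HostId two = HostId.hostId("00:00:00:00:00:02/-1");
            HostToHostIntent intent = HostToHostIntent.builder()
                    .appId(appId)
                    .one(one)
                    .two(two)
                    .build();
            intents.submit(intent); // ONOS compiles this into flow rules
        }
    }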

Switch Failure
Passing criteria:
  • Topology is updated and intents are recompiled
Roadmap: now
Link Failure
Passing criteria:
  • Topology is updated and intents are recompiled
Roadmap: now

Leadership Election
Applications can run for leadership of topics. This service should be safe, stable, and fault tolerant.
Passing criteria:
  • The service is functional before and after failures; nodes can withdraw from and run for election
  • There is always only one leader per topic in an ONOS cluster
Roadmap: now
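
A hedged sketch of the service from an application's point of view (method names as in the LeadershipService API of this era; exact signatures vary between releases):

    import org.onosproject.cluster.LeadershipService;
    import org.onosproject.cluster.NodeId;

    // Illustrative: join an election for a topic, look up the current
    // leader, then withdraw. Only one node at a time should ever be
    // returned as the leader of a given topic.
    public final class LeadershipExample {
        private LeadershipExample() {}

        public static NodeId electAndInspect(LeadershipService leadership, String topic) {
            leadership.runForLeadership(topic);          // enter the election
            NodeId leader = leadership.getLeader(topic); // may be another node
            leadership.withdraw(topic);                  // leave the election
            return leader;
        }
    }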

Distributed Sets
Call each of the following APIs and make sure they are functional and cluster wide:
  • get()
  • size()
  • add()
  • addAll()
  • contains()
  • containsAll()
  • remove()
  • removeAll()
  • clear()
  • retainAll()
In addition, we also check that sets are unaffected by ONOS failures.
Roadmap: now
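
For illustration, building and exercising a distributed set through the StorageService looks roughly like this sketch (the set name and serializer choice are arbitrary, and the builder details may differ slightly between releases):

    import org.onosproject.store.serializers.KryoNamespaces;
    import org.onosproject.store.service.DistributedSet;
    import org.onosproject.store.service.Serializer;
    import org.onosproject.store.service.StorageService;

    // Illustrative: create a named, cluster-wide set and exercise a few of
    // the APIs listed above. Assumes an injected StorageService reference.
    public final class SetExample {
        private SetExample() {}

        public static boolean exercise(StorageService storage) {
            DistributedSet<String> set = storage.<String>setBuilder()
                    .withName("ha-test-set") // arbitrary name
                    .withSerializer(Serializer.using(KryoNamespaces.BASIC))
                    .build();
            set.add("a");                        // add()
            boolean present = set.contains("a"); // contains()
            set.remove("a");                     // remove()
            return present && set.isEmpty();     // the set should be empty again
        }
    }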

Distributed Atomic Counters
Call each of the following APIs and make sure they are functional and cluster wide:
  • incrementAndGet()
In addition, we also check that counters are unaffected by ONOS failures.
Note: in-memory counters will not persist across cluster-wide restarts.
Roadmap: now
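
A minimal sketch (the counter name is arbitrary, and the builder details are assumptions that may differ by release):

    import org.onosproject.store.service.AtomicCounter;
    import org.onosproject.store.service.StorageService;

    // Illustrative: create a named, cluster-wide counter and bump it.
    public final class CounterExample {
        private CounterExample() {}

        public static long bump(StorageService storage) {
            AtomicCounter counter = storage.atomicCounterBuilder()
                    .withName("ha-test-counter") // arbitrary name
                    .build();
            return counter.incrementAndGet();    // atomic, cluster wide
        }
    }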

Cluster Service
Passing criteria:
  • Every ONOS node should be clustered with every other node in the test (unless we specifically make one unavailable)
Roadmap: now
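
As a sketch of what 'clustered with every other node' means in API terms (assuming an injected ClusterService reference):

    import org.onosproject.cluster.ClusterService;
    import org.onosproject.cluster.ControllerNode;

    // Illustrative: every node known to the cluster should be ACTIVE
    // unless the test has deliberately taken it down.
    public final class ClusterCheck {
        private ClusterCheck() {}

        public static boolean allNodesActive(ClusterService cluster) {
            for (ControllerNode node : cluster.getNodes()) {
                if (cluster.getState(node.id()) != ControllerNode.State.ACTIVE) {
                    return false;
                }
            }
            return true;
        }
    }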

Application Service
Passing criteria:
  • Application IDs are unique to an application
  • Applications can be activated
  • Applications can be deactivated
  • Active applications are reactivated on restart
Roadmap: now
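
For illustration, an activation round trip through the admin API looks roughly like this (it assumes an ApplicationAdminService reference and a known application ID):

    import org.onosproject.app.ApplicationAdminService;
    import org.onosproject.core.ApplicationId;

    // Illustrative: activate an application cluster wide, then deactivate it.
    public final class AppExample {
        private AppExample() {}

        public static void cycle(ApplicationAdminService apps, ApplicationId appId) {
            apps.activate(appId);
            apps.deactivate(appId);
        }
    }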
