Test Suite Description
The goal of the High Availability (HA) tests is to verify how ONOS performs when there are control plane failures. As HA is a key goal in ONOS, we would like to see ONOS gracefully handle failures and continue to manage the data plane during these failures. In cases where ONOS becomes unavailable, say the entire cluster is rebooted, ONOS should quickly recover any persistent data and get back to running the network.
The general structure of this test suite is to start an ONOS cluster and confirm that it has reached a stable working state. Once this state is reached, we trigger a failure scenario and verify that ONOS correctly recovers from the failure. Below are two tables: the first describes the failure scenarios, and the second describes the functionality and state checks performed in each of the HA tests.
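A minimal sketch of this flow, in Python (the language TestON tests are written in), is shown below. The cluster object and helper names (start, wait_until_stable, induce_failure, and the individual checks) are hypothetical placeholders for illustration, not the actual TestON drivers.

```python
# Minimal sketch of the common HA test flow (hypothetical helpers, not the
# actual TestON drivers): bring up a cluster, confirm a stable baseline,
# induce a failure, then re-run the same state and functionality checks.

def run_ha_case(cluster, induce_failure, checks):
    """Run one HA scenario: baseline checks, failure, post-failure checks."""
    cluster.start()                 # install and start all ONOS nodes
    cluster.wait_until_stable()     # e.g. all nodes up, topology discovered

    baseline = {name: check(cluster) for name, check in checks.items()}
    assert all(baseline.values()), "cluster never reached a stable working state"

    induce_failure(cluster)         # e.g. kill 3 of 7 nodes, partition the network
    cluster.wait_until_stable()     # allow the cluster to detect and recover

    results = {name: check(cluster) for name, check in checks.items()}
    failed = [name for name, ok in results.items() if not ok]
    assert not failed, "checks failed after recovery: %s" % failed
```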
High Availability Test Scenarios
Test | Failure Scenario | TestON Test Name | Roadmap |
Sanity | This test runs through all the state and functionality checks of the HA test suite but waits 60 seconds instead of inducing a failure. It is run as a 7-node ONOS cluster. | HATestSanity | now |
Minority of cluster restart | Restart 3 of 7 ONOS nodes by killing the process once the system is running and stable. | HATestMinorityRestart | now |
Full cluster restart | Restart 7 of 7 ONOS nodes by killing the process once the system is running and stable. | HATestClusterRestart | now |
Single instance restart | Restart 1 of 1 ONOS nodes by killing the process once the system is running and stable. | SingleInstanceHATestRestart | now |
Control Network partition | Partition the Control Network by creating iptables rules once the system is in a stable state. Checks are run during the partition and again after the partition is healed (a sketch of inducing such a partition appears below this table). | | |
Partial network partition | Partially partition the Control Network by creating iptables rules once the system is in a stable state (A and B cannot talk to each other, but both can talk to C). | | |
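For the partition scenarios above, the control network is cut by installing iptables DROP rules on the ONOS nodes. The sketch below shows one way this could be scripted; the node names, IP addresses, SSH/root access, and the set_partition helper are assumptions for illustration, not the commands the TestON suite actually issues.

```python
import subprocess

# Hypothetical helper for inducing or healing a control-network partition
# with iptables. Assumes passwordless SSH to the node and sudo privileges;
# the real suite drives this through TestON rather than this helper.

def set_partition(node, peers, heal=False):
    """Block (or unblock) control-plane traffic between `node` and `peers`."""
    action = "-D" if heal else "-A"          # -A adds the DROP rules, -D removes them
    for peer in peers:
        for rule in (["INPUT", "-s", peer], ["OUTPUT", "-d", peer]):
            cmd = ["sudo", "iptables", action] + rule + ["-j", "DROP"]
            subprocess.run(["ssh", node] + cmd, check=True)

# Example: isolate onos1 from two peers, run the "during partition" checks,
# then heal the partition and run the checks again.
set_partition("onos1", ["10.0.0.2", "10.0.0.3"])
# ... run checks during the partition ...
set_partition("onos1", ["10.0.0.2", "10.0.0.3"], heal=True)
```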
State and Functionality Checks in the HA Test Suite
Description | Passing Criteria | Roadmap |
Topology Discovery | (an example REST-based check is sketched below this table) | now |
Device Mastership | | now |
Intents | | now |
Switch Failure | | now |
Link Failure | | now |
Leadership Election | Applications can run for leadership of topics. This service should be safe, stable, and fault tolerant. | now |
Distributed Sets | Call each of the distributed set APIs and make sure they are functional and cluster-wide. In addition, check that sets are unaffected by ONOS failures. | now |
Distributed Atomic Counters | Call each of the atomic counter APIs and make sure they are functional and cluster-wide. In addition, check that counters are unaffected by ONOS failures. Note: in-memory counters will not persist across cluster-wide restarts. | now |
Cluster Service | | now |
Application Service | | now |
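As a loose illustration of what a check such as Topology Discovery or Cluster Service might look like, the sketch below polls the ONOS REST API and compares the results against expected values. The port (8181), the default credentials (onos/rocks), and the node status strings can vary by deployment and ONOS version, so treat this as an assumption-laden example rather than the checks the suite actually performs.

```python
import requests

# Hypothetical state check: compare discovered devices/links and cluster
# membership against expected values via the ONOS REST API.
# Assumes default REST port 8181 and default credentials.

def check_topology_and_cluster(node_ip, expected_devices, expected_links,
                               expected_nodes, auth=("onos", "rocks")):
    base = "http://%s:8181/onos/v1" % node_ip

    devices = requests.get(base + "/devices", auth=auth).json()["devices"]
    links = requests.get(base + "/links", auth=auth).json()["links"]
    nodes = requests.get(base + "/cluster", auth=auth).json()["nodes"]

    assert len(devices) == expected_devices, "unexpected device count"
    assert len(links) == expected_links, "unexpected link count (links are directed)"
    assert len(nodes) == expected_nodes, "unexpected cluster size"
    # Status strings differ across ONOS versions (e.g. READY vs ACTIVE).
    assert all(n["status"] in ("READY", "ACTIVE") for n in nodes), "node not up"
    return True
```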