Team
Name | Organization |
---|---|
Damian O'Neill | BTI Systems |
Kieran McPeake | BTI Systems |
Hayden Shorter | BTI Systems |
Overview
This project adds Fault Management of Network Elements (NEs) to ONOS.
When a fault or event occurs, an NE will typically send a notification to the network operator via SNMP. An alarm is a persistent indication of a fault that clears only when the triggering condition has been resolved. This proposal outlines how ONOS will support such alarms.
Proposed work
In ONOS terms, Fault Management is a service, i.e. "a unit of functionality that is comprised of multiple components that create a vertical slice through the tiers as a software stack".
Three layers will be updated to provide the new service:
- Provider for strongly typed SNMP
- Core (Fault Management)
- Applications
For context, the following diagram, taken from the ONOS design wiki, illustrates the relationship between the three layers.
SNMP Provider
Note: none of the following is specific to Fault Management.
Provider Core – ONOS/providers/snmp
A new generic provider handles SNMP communication with NEs. It is a southbound plugin for bi-directional SNMPv2c interaction with NEs, packaged as a Java OSGi bundle whose Java APIs map to the definitions in MIB files.
The provider core is based on SNMP4J (Reference #1). At runtime, ONOS uses this core plus additional NE-specific, automatically generated JAR libraries to provide a type-safe, NE-specific programmatic interface based on the MIB set for that NE type.
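To illustrate what "type-safe" could mean here, the following is a minimal sketch of a class that a MIB-to-Java generator might emit for the standard IF-MIB ifTable. All class and method names (TypedMibSketch, IfEntry, fromVarbinds) are hypothetical and are not part of ONOS or SNMP4J; the sketch uses plain JDK types rather than SNMP4J so that it runs on its own. Only the column OIDs and the ifOperStatus values come from IF-MIB.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from ONOS or SNMP4J.
class TypedMibSketch {

    /** ifOperStatus values as defined in IF-MIB. */
    enum OperStatus {
        UP(1), DOWN(2), TESTING(3), UNKNOWN(4), DORMANT(5), NOT_PRESENT(6), LOWER_LAYER_DOWN(7);

        final int code;

        OperStatus(int code) {
            this.code = code;
        }

        static OperStatus fromCode(int code) {
            for (OperStatus s : values()) {
                if (s.code == code) {
                    return s;
                }
            }
            throw new IllegalArgumentException("Unknown ifOperStatus: " + code);
        }
    }

    /** Typed view of one ifTable row; a generator could emit one such class per MIB table. */
    static final class IfEntry {
        // Column OIDs under ifEntry (1.3.6.1.2.1.2.2.1) in IF-MIB.
        static final String IF_DESCR = "1.3.6.1.2.1.2.2.1.2";
        static final String IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8";

        final int ifIndex;
        final String ifDescr;
        final OperStatus ifOperStatus;

        IfEntry(int ifIndex, String ifDescr, OperStatus ifOperStatus) {
            this.ifIndex = ifIndex;
            this.ifDescr = ifDescr;
            this.ifOperStatus = ifOperStatus;
        }

        /** Builds a typed row from raw (OID -> value) pairs for the given row index. */
        static IfEntry fromVarbinds(int ifIndex, Map<String, String> varbinds) {
            String descr = varbinds.get(IF_DESCR + "." + ifIndex);
            String status = varbinds.get(IF_OPER_STATUS + "." + ifIndex);
            if (descr == null || status == null) {
                throw new IllegalArgumentException("Incomplete ifTable row " + ifIndex);
            }
            return new IfEntry(ifIndex, descr, OperStatus.fromCode(Integer.parseInt(status)));
        }
    }

    public static void main(String[] args) {
        // Raw values as they might come back from a GET/GETBULK on row 3.
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("1.3.6.1.2.1.2.2.1.2.3", "eth0");
        raw.put("1.3.6.1.2.1.2.2.1.8.3", "2");

        IfEntry row = IfEntry.fromVarbinds(3, raw);
        System.out.println(row.ifDescr + " is " + row.ifOperStatus);
    }
}
```

With a typed row like this, callers compare against OperStatus.DOWN instead of parsing OID strings and integer codes themselves.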
The SNMP provider maintains a locally cached, ‘in-memory’ representation of the tables deemed ‘interesting’ on the NE. This set of tables is defined in step #2 below.
The provider will:
- Poll tables on NEs (via SNMP GET/GETBULK). The polling may be either:
  - Scheduled
  - Upon request by an application. This may be:
    - Manual, e.g. in response to a human user request
    - Automatic, e.g. in response to an SNMP trap. [Some NEs use an SNMP trap-based mechanism to notify an SNMP manager of NE database changes, which reduces the amount of SNMP polling required.]
- Process incoming SNMP traps from NEs and send ‘callback’ notifications to registered ‘listener’ components.
- Send SNMP SETs to NEs; some NEs support SNMP SETs as a registration/deregistration mechanism for SNMP trap listeners.
- Send SNMP traps.
Note: trap notifications (whether fault or configuration related) are an optimization. With them, ONOS can learn of faults before its next poll interval, but polling is the only guaranteed mechanism for keeping a correct picture of the NEs’ faults, in particular when faults pre-date management by ONOS or when ONOS or the network goes offline temporarily.
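The relationship between the trap path and the poll path can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the names PollTrapSketch, NeFaultCache and FaultListener are hypothetical, faults are reduced to string keys, and no SNMP library is involved. The point it shows is that a trap updates the cache immediately, while a poll result is treated as authoritative and is diffed against the cache so that missed raises and clears still reach the listeners.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names and types are illustrative only.
class PollTrapSketch {

    /** Callback for components registered with the provider. */
    interface FaultListener {
        void faultChanged(String faultKey, boolean active);
    }

    /** In-memory view of the active faults on one NE. */
    static final class NeFaultCache {
        private final Set<String> active = new TreeSet<>();
        private final List<FaultListener> listeners = new ArrayList<>();

        void addListener(FaultListener listener) {
            listeners.add(listener);
        }

        /** Trap path: apply a single raise or clear as soon as it arrives. */
        void onTrap(String faultKey, boolean raised) {
            boolean changed = raised ? active.add(faultKey) : active.remove(faultKey);
            if (changed) {
                notifyListeners(faultKey, raised);
            }
        }

        /** Poll path: the polled table is authoritative, so diff the cache against it. */
        void onPoll(Set<String> polledActive) {
            for (String key : polledActive) {
                if (active.add(key)) {
                    notifyListeners(key, true);   // a raise that no trap reported
                }
            }
            for (String key : new TreeSet<>(active)) {
                if (!polledActive.contains(key)) {
                    active.remove(key);
                    notifyListeners(key, false);  // a clear that no trap reported
                }
            }
        }

        Set<String> activeFaults() {
            return new TreeSet<>(active);
        }

        private void notifyListeners(String faultKey, boolean nowActive) {
            for (FaultListener listener : listeners) {
                listener.faultChanged(faultKey, nowActive);
            }
        }
    }

    public static void main(String[] args) {
        NeFaultCache cache = new NeFaultCache();
        cache.addListener((key, nowActive) -> System.out.println(key + " active=" + nowActive));

        cache.onTrap("linkDown:port1", true);

        // The clearing trap for port1 was lost, and the port2 fault pre-dates management.
        Set<String> polled = new TreeSet<>();
        polled.add("linkDown:port2");
        cache.onPoll(polled);

        System.out.println(cache.activeFaults());
    }
}
```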
NE specific libraries
A set of standard MIB-specific libraries will be supplied by default (allowing SNMP interaction with e.g. MIB-II compliant NEs).
In addition, a mechanism will be provided to allow management of other NEs (with either standards-based or vendor-specific MIBs): a Java library for such NEs can be generated in an offline step.
The SNMP provider may be regarded as sitting side by side with the equivalent NETCONF provider. In future, an adapter layer may be used to abstract SNMP or NETCONF provider specifics from FM.
Core – Fault Management (FM)
A new ONOS core (Fault Management) will track the state of the NE.
It will register its interest in specific (Fault Management) tables and notifications with the SNMP provider described above, which will inform the core about relevant events. For a particular NE, this layer understands the MIB- and vendor-specific mappings for fault tables and fault notifications.
All communication between the SNMP provider and the ONOS core (Fault Management) uses the Provider Service interface.
The north-facing API of the FM core will provide a mechanism for applications to get current alarms, for either a specified single NE or for all NEs. It will also provide a way for an application to register to be notified about alarm-state changes.
The current alarms will include ‘recently cleared’ alarms, but these will be purged regularly. An NE's alarms will also be purged if the NE is deleted from ONOS, i.e. undiscovered.
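As a rough sketch of the north-facing behaviour described above, the following in-memory service supports querying alarms for all NEs or a single NE, listener registration, purging of cleared alarms after a retention window, and removal of an NE's alarms. Every name here (FmServiceSketch, InMemoryAlarmService, AlarmListener, purgeCleared, removeDevice) and the retention-window parameter are assumptions made for illustration; the proposal does not define the actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: not the ONOS API, only an illustration of the described behaviour.
class FmServiceSketch {

    /** Minimal alarm holder for this sketch; the persisted model has more fields. */
    static final class Alarm {
        final long id;
        final String deviceId;
        final long timeRaised;
        Long timeCleared; // null while the alarm is active

        Alarm(long id, String deviceId, long timeRaised) {
            this.id = id;
            this.deviceId = deviceId;
            this.timeRaised = timeRaised;
        }
    }

    /** Callback for applications that want to hear about alarm-state changes. */
    interface AlarmListener {
        void alarmChanged(Alarm alarm);
    }

    static final class InMemoryAlarmService {
        private final List<Alarm> alarms = new ArrayList<>();
        private final List<AlarmListener> listeners = new ArrayList<>();
        private long nextId = 1;

        void addListener(AlarmListener listener) {
            listeners.add(listener);
        }

        Alarm raise(String deviceId, long now) {
            Alarm alarm = new Alarm(nextId++, deviceId, now);
            alarms.add(alarm);
            fire(alarm);
            return alarm;
        }

        void clear(Alarm alarm, long now) {
            alarm.timeCleared = now;
            fire(alarm);
        }

        /** Alarms for all NEs, including recently cleared ones not yet purged. */
        List<Alarm> getAlarms() {
            return new ArrayList<>(alarms);
        }

        /** Alarms for a single NE. */
        List<Alarm> getAlarms(String deviceId) {
            List<Alarm> result = new ArrayList<>();
            for (Alarm alarm : alarms) {
                if (alarm.deviceId.equals(deviceId)) {
                    result.add(alarm);
                }
            }
            return result;
        }

        /** Drops cleared alarms at least retentionMillis old; returns how many were purged. */
        int purgeCleared(long now, long retentionMillis) {
            int purged = 0;
            for (Iterator<Alarm> it = alarms.iterator(); it.hasNext();) {
                Alarm alarm = it.next();
                if (alarm.timeCleared != null && now - alarm.timeCleared >= retentionMillis) {
                    it.remove();
                    purged++;
                }
            }
            return purged;
        }

        /** Drops every alarm of an NE that has been deleted from ONOS. */
        void removeDevice(String deviceId) {
            alarms.removeIf(alarm -> alarm.deviceId.equals(deviceId));
        }

        private void fire(Alarm alarm) {
            for (AlarmListener listener : listeners) {
                listener.alarmChanged(alarm);
            }
        }
    }

    public static void main(String[] args) {
        InMemoryAlarmService service = new InMemoryAlarmService();
        service.addListener(a -> System.out.println("alarm " + a.id + " changed on " + a.deviceId));
        Alarm alarm = service.raise("ne-1", 1000L);
        service.clear(alarm, 2000L);
        System.out.println("purged: " + service.purgeCleared(10000L, 5000L));
    }
}
```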
The persisted alarms model is as follows:
Field | Notes |
---|---|
Id | Identifier allocated by ONOS. |
Acknowledged? | Set to true if an ONOS-user has acknowledged this alarm. |
Description | From NE e.g. |
Device identity | e.g. NE address |
Is Service Affecting? | |
Key Value | e.g. |
OID | e.g. |
Severity | As specified in ITU recommendation X.733, i.e. |
Time Raised | |
Time Cleared | If applicable. |
Raising Notification Id | If applicable. Not applicable if discovered by poll. |
Clearing Notification Id | As above. |
User Assigned | ONOS-user (if any) to whom this alarm has been assigned. |
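The table above could translate into a Java class along the following lines. This is a sketch only: the field names and types, the choice of epoch milliseconds for timestamps, the use of null for "not applicable", and the isActive() helper are all assumptions, and AlarmModelSketch is a hypothetical name. The severity values are the perceived-severity values of X.733.

```java
// Hypothetical sketch of the persisted alarm model; types and names are assumptions.
class AlarmModelSketch {

    /** Perceived severity values from ITU-T X.733. */
    enum Severity { CLEARED, INDETERMINATE, CRITICAL, MAJOR, MINOR, WARNING }

    static final class Alarm {
        // Set when the alarm is raised and never changed afterwards.
        final long id;                 // identifier allocated by ONOS
        final String deviceId;         // device identity, e.g. NE address
        final String oid;
        final Severity severity;
        final long timeRaised;         // epoch milliseconds (assumed unit)

        // Filled in from the NE or by ONOS users over the alarm's lifetime.
        String description;
        String keyValue;
        boolean serviceAffecting;
        boolean acknowledged;
        String userAssigned;           // null if not assigned
        Long timeCleared;              // null while the alarm is active
        Long raisingNotificationId;    // null if discovered by poll
        Long clearingNotificationId;   // null if discovered by poll

        Alarm(long id, String deviceId, String oid, Severity severity, long timeRaised) {
            this.id = id;
            this.deviceId = deviceId;
            this.oid = oid;
            this.severity = severity;
            this.timeRaised = timeRaised;
        }

        /** An alarm is active once raised and until it is cleared. */
        boolean isActive() {
            return timeCleared == null;
        }
    }

    public static void main(String[] args) {
        // Illustrative values only; the OID is that of the standard linkDown notification.
        Alarm alarm = new Alarm(1L, "ne-1", "1.3.6.1.6.3.1.1.5.3", Severity.MAJOR, 1000L);
        alarm.description = "Link down";
        System.out.println(alarm.isActive());

        alarm.acknowledged = true;
        alarm.timeCleared = 5000L;
        System.out.println(alarm.isActive());
    }
}
```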
A future FM release may support persisting faults over longer timeframes (including those related to NEs that are no longer managed) so that historical data is made available. Support for historical data mining is excluded from this release.
Applications
Applications using FM will be supplied with the following functionality:
- New: SNMP notification generator. An ONOS application that notifies an external SNMP-based NMS about the occurrence of significant events. It will provide a mapping from events (e.g. NE faults) to ONOS-defined SNMP notifications.
- New: REST API for retrieving current alarms.
  - HTTP GET /alarms
  - HTTP GET /alarms/{deviceId}
  - HTTP GET
- Updates: Integration with existing applications, e.g. the existing ONOS Web GUI, to show current faults per NE (and potentially per service or port).
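How the two named REST paths might resolve to alarm data can be sketched as below. This is not an HTTP server and not the ONOS REST framework: AlarmRestSketch and handleGet are hypothetical names, alarms are reduced to description strings, and returning null stands in for an HTTP 404. Only the paths /alarms and /alarms/{deviceId} come from the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: resolves GET paths to alarm lists without any HTTP machinery.
class AlarmRestSketch {

    /**
     * Resolves a GET path to a list of alarm descriptions.
     * Returns null for a path this sketch does not know (a real endpoint would answer 404).
     */
    static List<String> handleGet(String path, Map<String, List<String>> alarmsByDevice) {
        if (path.equals("/alarms")) {
            // All NEs: concatenate the per-device lists.
            List<String> all = new ArrayList<>();
            for (List<String> perDevice : alarmsByDevice.values()) {
                all.addAll(perDevice);
            }
            return all;
        }
        String prefix = "/alarms/";
        if (path.startsWith(prefix) && path.length() > prefix.length()) {
            // Single NE: the remainder of the path is the device id.
            String deviceId = path.substring(prefix.length());
            List<String> perDevice = alarmsByDevice.get(deviceId);
            if (perDevice == null) {
                return new ArrayList<>();
            }
            return new ArrayList<>(perDevice);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> alarms = new TreeMap<>();
        alarms.put("ne-1", Arrays.asList("Link down on port 1"));
        alarms.put("ne-2", Arrays.asList("Fan failure", "High temperature"));

        System.out.println(handleGet("/alarms", alarms));
        System.out.println(handleGet("/alarms/ne-2", alarms));
    }
}
```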
Project Plan
- Road map - TBD.
- Release - Goals and Dates - TBD.
Terminology
Many online resources give overviews and definitions of fault management. We use the same definitions as the IETF Alarm MIB (RFC 3877). We do not want to repeat that document, but the following extract may be helpful.
- Error - A deviation of a system from normal operation.
- Fault - Lasting error or warning condition.
- Event - Something that happens which may be of interest. A fault, a change in status, crossing a threshold, or an external input to the system, for example.
- Notification - Unsolicited transmission of management information.
- Alarm - Persistent indication of a fault.
- Alarm State - A condition or stage in the existence of an alarm. As a minimum, alarm states are raise and clear. They could also include severity information such as defined by perceived severity in the International Telecommunications Union (ITU) model [M.3100] - cleared, indeterminate, critical, major, minor and warning.
- Alarm Raise - The initial detection of the fault indicated by an alarm or any number of alarm states later entered, except clear.
- Alarm Clear - The detection that the fault indicated by an alarm no longer exists.
- Active Alarm - An alarm which has an alarm state that has been raised, but not cleared.
- Alarm Detection Point - The entity that detected the alarm.
- Perceived Severity - The severity of the alarm as determined by the alarm detection point using the information it has available.
Other terminology used in this proposal:
- NE – Network Element, i.e. managed device.
References
1. http://www.snmp4j.org/ SNMP4J is an enterprise-class, free, open source SNMP implementation for Java™ SE 1.4 or later. SNMP4J supports command generation (managers) as well as command responding (agents). Its object-oriented design is inspired by SNMP++, a well-known SNMPv1/v2c/v3 API for C++.