Garbage Collection

ONOS relies on a variety of distributed systems protocols that use timeouts for failure detection. In production systems at scale, the various components in ONOS can generate enough garbage to cause multi-second GC pauses when the JVM is not tuned correctly. To reduce the chance of spurious timeouts (a healthy node wrongly detected as failed) during GC pauses, we recommend that deployments use the Garbage First (G1) garbage collector. G1 allows a maximum GC pause goal to be set, and the collector will attempt to avoid pauses longer than that goal. We recommend a GC pause goal of around 200ms.

The G1 garbage collector can be enabled by overriding the $JAVA_OPTS environment variable in ONOS deployments:

export JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseTimeMillis=200"

Storage Timeouts

As mentioned above, ONOS relies on a variety of time-dependent distributed systems protocols for cluster coordination and storage. Because they depend on time, these protocols can degrade when latency suffers in clusters placed under high stress. ONOS and Atomix provide options for tuning most time-dependent protocols.

ONOS <=1.13

-Donos.cluster.raft.electionTimeoutMillis: Sets the election timeout for Raft partitions. This value should be greater than the maximum estimated GC pause time for the cluster. Defaults to 2500.
-Donos.cluster.raft.heartbeatIntervalMillis: Sets the interval at which Raft leaders send heartbeats to followers. This value must be less than the election timeout and should be less than one second to reduce read latency. Defaults to 500.
-Donos.cluster.raft.storage.level: Sets the storage strategy for Raft logs. Must be either MAPPED or DISK. Defaults to MAPPED.
cfg:set org.onosproject.store.cluster.impl.DistributedLeadershipStore electionTimeoutMillis: Sets the election timeout for all leadership elections, including mastership elections. This value should be greater than the maximum estimated GC pause time for the cluster. Defaults to 2500.
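
The constraints on the two Raft timing options can be checked before a node is started. The following is a minimal sketch for ONOS <=1.13, assuming the -D options are passed through $JAVA_OPTS in the same way as the GC flags. The helper name check_raft_timeouts and the values used are illustrative only, not recommendations.

```shell
# check_raft_timeouts ELECTION_MS HEARTBEAT_MS MAX_GC_PAUSE_MS
# Hypothetical helper: returns 0 if the values satisfy the constraints
# described above, 1 otherwise.
check_raft_timeouts() {
    election_ms=$1
    heartbeat_ms=$2
    max_gc_pause_ms=$3
    # The election timeout should exceed the estimated maximum GC pause.
    if [ "$election_ms" -le "$max_gc_pause_ms" ]; then
        echo "election timeout should exceed the maximum GC pause" >&2
        return 1
    fi
    # The heartbeat interval must be less than the election timeout.
    if [ "$heartbeat_ms" -ge "$election_ms" ]; then
        echo "heartbeat interval must be less than the election timeout" >&2
        return 1
    fi
    return 0
}

# Example values: the documented defaults and a 200ms GC pause estimate.
if check_raft_timeouts 2500 500 200; then
    # Append to any existing JAVA_OPTS rather than replacing it.
    JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-Donos.cluster.raft.electionTimeoutMillis=2500"
    JAVA_OPTS="${JAVA_OPTS} -Donos.cluster.raft.heartbeatIntervalMillis=500"
    export JAVA_OPTS
fi
```

The check only compares the numbers supplied to it; the GC pause estimate still has to come from observing the cluster under load.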

ONOS >=1.14

cfg:set org.onosproject.store.cluster.impl.DistributedLeadershipStore electionTimeoutMillis: Sets the election timeout for all leadership elections, including mastership elections.

Atomix

Async Logging

The default logging configuration for both ONOS 1.x (log4j) and ONOS 2.x (log4j2) can in some environments block multiple threads for long periods of time due to a single lock shared by all loggers. These blocks can lead to timeouts, mastership changes, and other problems. In ONOS 2.x, to avoid extensive blocking due to loggers, we recommend using AsyncLoggers.

log4j.appender.async=org.apache.log4j.AsyncAppender
log4j.appender.async.appenders=rolling

See the Karaf logger documentation for more information on async loggers.

For ONOS 1.x, log4j does support asynchronous appenders, but we have not had success with them. It is possible to upgrade ONOS 1.x to log4j2, but the upgrade breaks other logging features to an extent that makes it unsuitable for production.

Karaf Lock Timeout

When starting up or recovering ONOS nodes in a cluster, some core ONOS components can take a while to start up. This can result in exceptions during startup:

2019-01-16T19:58:35,712 | ERROR | FelixDispatchQueue | onos-core-primitives             | 183 - org.onosproject.onos-core-primitives - 2.0.0.SNAPSHOT | FrameworkEvent ERROR - org.onosproject.onos-core-primitives
org.osgi.framework.ServiceException: Service factory exception: Could not obtain lock
        at org.apache.felix.framework.ServiceRegistrationImpl.getFactoryUnchecked(ServiceRegistrationImpl.java:352) ~[?:?]
        at org.apache.felix.framework.ServiceRegistrationImpl.getService(ServiceRegistrationImpl.java:247) ~[?:?]
        at org.apache.felix.framework.ServiceRegistry.getService(ServiceRegistry.java:350) ~[?:?]
        at org.apache.felix.framework.Felix.getService(Felix.java:3737) ~[?:?]
        at org.apache.felix.framework.BundleContextImpl.getService(BundleContextImpl.java:470) ~[?:?]
        at org.apache.felix.scr.impl.manager.SingleRefPair.getServiceObject(SingleRefPair.java:73) ~[?:?]
        at org.apache.felix.scr.impl.inject.BindParameters.getServiceObject(BindParameters.java:47) ~[?:?]
        at org.apache.felix.scr.impl.inject.field.FieldHandler$ReferenceMethodImpl.getServiceObject(FieldHandler.java:519) ~[?:?]
        at org.apache.felix.scr.impl.manager.DependencyManager.getServiceObject(DependencyManager.java:2308) ~[?:?]
        at org.apache.felix.scr.impl.manager.DependencyManager$SingleStaticCustomizer.prebind(DependencyManager.java:1162) ~[?:?]

By default, Karaf imposes a 5 second limit on the time it takes to activate a component. However, storage components in particular can take longer to start up, because they often must run a leader election protocol and replay logs to rebuild the cluster state. We recommend increasing the Karaf lock timeout to avoid exceptions when starting up ONOS nodes.

Karaf lock timeouts can be overridden via the ds.lock.timeout.milliseconds system property. This property can be set, again, by overriding the $JAVA_OPTS environment variable:

export JAVA_OPTS="-Dds.lock.timeout.milliseconds=15000"
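
Each export shown in this document assigns $JAVA_OPTS outright, so running one after another leaves only the last value in effect. The following is a minimal sketch of applying the G1 settings and the lock timeout together by appending to the variable. It uses only the flags already given above and assumes that any JAVA_OPTS value already present should be preserved.

```shell
# Keep any existing options and add the G1 settings recommended earlier.
JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-XX:+UseG1GC -XX:MaxGCPauseTimeMillis=200"
# Append the Karaf lock timeout instead of replacing the value.
JAVA_OPTS="${JAVA_OPTS} -Dds.lock.timeout.milliseconds=15000"
export JAVA_OPTS

# Print the combined options for inspection before starting ONOS.
echo "$JAVA_OPTS"
```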
