Fast Failure Detection & Notification Critical for NFV

Carlos Goncalves, R&D Engineer, NEC Europe, 4/8/2016


Cloud computing and software-defined networking (SDN) have undoubtedly been improving deployment, management and operation of systems at midsize to large-scale companies. Allied with the network function virtualization (NFV) paradigm, the networking industry is now looking toward leveraging these concepts to realize the on-going definition of the fifth generation of mobile telecommunications, or 5G networks.

5G networks are expected to deliver improved high-availability mechanisms for both the service and the platform. This means that the ability to provide fast detection and notification of physical and virtual resource failures to upper management layers is critical.

Telecom nodes, due to their stringent high-availability requirements, often come in an active-standby (ACT-SBY) redundant configuration. When such a node is virtualized, the manager of the virtualized node application needs to be notified of a fault in the ACT node application so that it can switch to the SBY node application immediately.

Doctor, a multi-vendor, telco operator-driven project within the Open Platform for NFV (OPNFV) Project, aims to build an NFV infrastructure (NFVi) fault management and maintenance framework and is leading this industry effort. Fault management is a management component that allows operations teams to monitor, detect and isolate faults and to automate their resolution. An efficient fault management system helps reduce unnecessary service downtime caused by unexpected circumstances, and thus improves overall quality of service.

While conceptually different, fault management and maintenance involve very similar work and messages. Doctor’s aim is to create the best open reference platform supporting high availability of network services, featuring immediate notification of a wide range of failure events from the NFVi and supporting orchestrated recovery of virtual network functions (VNFs). This active project collaborates with and contributes to the ETSI NFV ISG and upstream open source projects (e.g., OpenStack).

Recently, at the first annual OPNFV Summit held in San Francisco, Calif., Nov. 11-12, the Doctor team highlighted the results achieved within the initial OPNFV scope, including several contributions accepted into OpenStack. The multi-vendor team showcased "an orchestrated platform focused on fast failure detection and notification for NFV," NEC Software Engineer and Doctor Project Team Leader Ryota Mibu explained.

The proof-of-concept (PoC) was demonstrated at NTT DOCOMO’s OPNFV Service Provider PoC booth. NTT DOCOMO is one of the carriers active in the project and plays a key role by bringing requirements and discussing them openly in the broader standards and development communities.

In the solution, an external resource monitor watches the resource pool. As soon as the monitor detects a failure in any element of the resource pool, it notifies Ceilometer, the OpenStack Telemetry project responsible for data collection, monitoring and alarming. Ceilometer, with its modified notification function, instantly notifies the node application manager. The application manager then switches to the SBY node application. This reduces downtime to nearly zero for telecom node applications that host thousands of mobile subscribers’ connections.
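The flow described above (monitor detects a failure, the alarming layer pushes a notification, the application manager fails over) can be sketched in a few lines of Python. This is only an illustrative model, not Doctor or OpenStack code: the class names FailureEvent, Notifier, ResourceMonitor and ApplicationManager and the host names are hypothetical, and Notifier merely stands in for Ceilometer's notification function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FailureEvent:
    """Hypothetical failure record passed from the monitor to subscribers."""
    host: str
    reason: str


class Notifier:
    """Stand-in for the alarming layer: pushes each event to subscribers at once."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[FailureEvent], None]] = []

    def subscribe(self, callback: Callable[[FailureEvent], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: FailureEvent) -> None:
        # Push model: no polling interval sits between detection and delivery.
        for callback in self._subscribers:
            callback(event)


class ResourceMonitor:
    """External monitor that reports unhealthy resource pool elements."""

    def __init__(self, notifier: Notifier) -> None:
        self._notifier = notifier

    def check(self, health: Dict[str, bool]) -> None:
        for host, healthy in health.items():
            if not healthy:
                self._notifier.publish(FailureEvent(host, "health check failed"))


class ApplicationManager:
    """Manager of an ACT-SBY pair; switches to SBY when the ACT host fails."""

    def __init__(self, active_host: str, standby_host: Optional[str]) -> None:
        self.active_host = active_host
        self.standby_host = standby_host

    def on_failure(self, event: FailureEvent) -> None:
        # Ignore failures that do not affect the active node application.
        if event.host != self.active_host or self.standby_host is None:
            return
        # Promote the standby; a new standby would have to be provisioned later.
        self.active_host, self.standby_host = self.standby_host, None


if __name__ == "__main__":
    notifier = Notifier()
    manager = ApplicationManager("compute-1", "compute-2")
    notifier.subscribe(manager.on_failure)
    ResourceMonitor(notifier).check({"compute-1": False, "compute-2": True})
    print("active node application now runs on", manager.active_host)
```

The point of the sketch is the push path: the manager reacts inside the same call chain that detected the failure, rather than waiting for a periodic poll to notice it.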

Doctor! Doctor!

The Doctor functional block architecture.

In the Doctor project, the team developed this failure event collection and immediate notification feature, which shipped in OpenStack Liberty, released in October 2015. Anyone who needs immediate alarming can now use these contributions, which were submitted to and accepted in OpenStack.

Table 1:

Project	Blueprint	Spec Drafter	Developer
Ceilometer	Event Alarm Evaluator	Ryota Mibu (NEC)	Ryota Mibu (NEC)
Nova	New Nova API call to mark nova-compute down	Tomi Juvonen (Nokia)	Roman Dobosz (Intel)
Nova	Support forcing service down	Tomi Juvonen (Nokia)	Carlos Goncalves (NEC)
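To give a feel for how the blueprints in Table 1 are used, the sketch below builds the two request bodies involved: one marking a nova-compute service as down, and one registering an event alarm that calls back to the application manager. It only constructs JSON and sends nothing. The endpoint paths in the comments, the field names and the choice of event type reflect one reading of the upstream Nova and alarming APIs and should be checked against the official references; the helper names, host, instance ID and callback URL are hypothetical.

```python
import json
from typing import Any, Dict


def force_down_request(host: str) -> Dict[str, Any]:
    """Body for marking a compute service down.

    Assumed target: PUT /os-services/force-down on the Nova API
    (compute API microversion 2.11 or later).
    """
    return {"host": host, "binary": "nova-compute", "forced_down": True}


def event_alarm_request(instance_id: str, callback_url: str) -> Dict[str, Any]:
    """Body for an event alarm that fires when an instance enters an error state.

    Assumed target: POST /v2/alarms on the alarming API. The event type and
    the trait names below are illustrative assumptions, not values taken
    from the article.
    """
    return {
        "name": "vm-failure-" + instance_id,
        "type": "event",
        "event_rule": {
            "event_type": "compute.instance.update",
            "query": [
                {"field": "traits.instance_id", "op": "eq",
                 "type": "string", "value": instance_id},
                {"field": "traits.state", "op": "eq",
                 "type": "string", "value": "error"},
            ],
        },
        # The alarm action is the URL the application manager listens on.
        "alarm_actions": [callback_url],
    }


if __name__ == "__main__":
    print(json.dumps(force_down_request("compute-1"), indent=2))
    print(json.dumps(
        event_alarm_request("vm-1234", "http://manager.example/failure"),
        indent=2))
```

In a deployment along these lines, the monitor would issue the first request as soon as it detects a host failure, and the application manager would have registered the second in advance so that it is called back without polling.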

In OpenStack without Doctor’s contributions, the delay in notifying the application manager is on the order of several minutes. With such a long notification delay after a failure, thousands of mobile subscribers in a telecom scenario would be disconnected from their cellular network. By contrast, the developed solution lets cloud operators deliver the failure notification within one second. The demo showed how well Doctor performs in failure detection and notification, and thus that it can meet the high-availability requirements of telecom node applications in NFV-based virtualized network systems.

Every Second Counts

Notification time is reduced from several minutes to within one second.

Fast Reactor

The upper layer application manager receives notification quickly and reacts to failure events.

Since its formation in September 2014, the OPNFV project has accomplished several important milestones: it issued its first and second platform releases, Arno and Brahmaputra; hosted its inaugural OPNFV Summit; grew the developer community to over 150 developers from service providers and commercial suppliers; and fostered a thriving network of 10-plus community labs via the Pharos project, among others.

On the standardization side, Doctor has been making a significant impact on the definition of fault management topics in ETSI NFV, namely in the Interfaces and Architecture (IFA) and Reliability (REL) working groups, by synchronizing and aligning with ETSI and feeding results back to it. As if all of that were not remarkable enough for such a short time, Doctor keeps going at full power: more standards and software contributions are already in the pipeline.

Doctor’s goals for 2016 remain the same: to continue fostering openness and collaboration opportunities among all OPNFV projects and members, as well as upstream open source communities and standardization bodies. In the upcoming OPNFV release, Colorado, users can expect more features in various OpenStack projects (e.g., Nova, Congress, Neutron) contributed by the Doctor project to address additional, much-needed fault management and maintenance requirements.

Users can also expect greater coverage of functional testing scenarios, as well as improved installation support and documentation. The team has also accelerated collaboration planning with other existing OPNFV projects (e.g., the Software Fastpath Quality Metrics project), aiming to build up an integrated platform for NFV. These activities will demonstrate sound, open collaboration across OPNFV and upstream projects.

— Carlos Goncalves, R&D Engineer, 5G Networks Team, NEC Laboratories Europe, special to Telco Transformation
