In the 5G era, the ecosystem spans cloud, pipe, and device, and everything in between. Cross-generation advances in wireless terminals, base station air interfaces, and network transmission technologies are accelerating the evolution of device and pipe architectures. Cloud data centers, however, are the core hub of the 5G digital ecosystem and will play a pivotal role in this evolution. So the question remains: what kind of cloud data center can meet the network and service requirements of 5G?
The answer is simple: a distributed, full-stack cloud data center that is open, efficient, flexible, and intelligent.
Open: An open architecture means that cloud services at each layer are not locked in to a single vendor. Instead, cloud data centers in the 5G era adopt mainstream, open-source northbound service APIs and applications that comply with industry standards. The infrastructure layer adopts the mature OpenStack elastic computing, storage, and network service APIs; the data layer adopts the data operation and query APIs of industry-recognized Big Data and database standards such as Hadoop, Spark, MySQL, and Redis; and the platform layer uses the Kubernetes container service APIs, which have become the mainstream for application deployment and microservice frameworks. In addition, Huawei cloud data centers employ the standard service APIs of artificial intelligence (AI) platforms such as TensorFlow and MXNet, the industry benchmarks for machine learning and deep learning.
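As a minimal sketch of what this vendor neutrality means in practice, the snippet below builds a standard Kubernetes `apps/v1` Deployment manifest as plain data; because the object follows the open Kubernetes schema rather than a proprietary one, it could be submitted unchanged to any conformant cluster. The service name and image are hypothetical.

```python
import json

def make_deployment(name: str, image: str, replicas: int = 2) -> dict:
    """Build a vendor-neutral Kubernetes apps/v1 Deployment manifest.

    Because the northbound API follows the open Kubernetes schema,
    the same object can be submitted to any conformant cluster.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

# Hypothetical vEPC control-plane service, three replicas.
manifest = make_deployment("vepc-cp", "example.org/vepc-cp:1.0", replicas=3)
print(json.dumps(manifest, indent=2))
```

The same dictionary, serialized to JSON or YAML, is what any Kubernetes-compatible northbound API consumes, which is precisely the lock-in-avoidance point made above.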
Efficient: 5G networks require transmission rates and bandwidth more than 100 times higher than those of 4G networks. They also impose more stringent reliability and latency requirements in application scenarios such as virtual reality (VR), ultra-HD video, intelligent manufacturing, and autonomous driving. As a result, the 5G cloud data center platform must deliver ultra-high throughput and ultra-low latency for network-intensive workloads such as vEPC and for storage-intensive workloads such as CDN and 4K/8K video on demand, as well as optimal energy and cost efficiency for compute-intensive workloads such as machine learning, deep learning, and 3D rendering.
General-purpose x86 architectures clearly no longer meet these requirements. Data center vendors need to introduce heterogeneous computing architectures, such as ARM, SoC-based intelligent network adapters, GPUs/FPGAs, and neural-network processing unit (NPU) chips, to ensure that workloads run with the best possible energy efficiency and cost-effectiveness.
Network elements (NEs) on the data plane of 5G networks, such as virtualized evolved packet core (vEPC) and CloudRAN base stations, typically require the throughput of a single NE or server to increase from 10 Gbit/s to 100 Gbit/s. Conventional, software-only overlay cloud networks based on x86 CPUs face severe performance bottlenecks at this scale. It is therefore necessary to leverage SR-IOV direct pass-through or the DPDK user-space mechanism, and to offload overlay networking functions such as OVS, vRouter, and vFW from x86 CPUs onto heterogeneous hardware such as intelligent NICs and FPGAs with 100 Gbit/s port speeds.
For 5G Internet of Things (IoT) scenarios, the massive data collected from distributed IoT terminals and edge nodes across industries needs to be aggregated, stored, processed, analyzed, and managed in a centralized cloud data center, based on object storage services and Hadoop/Spark/stream Big Data pipeline services. Furthermore, a huge number of machine learning and deep learning jobs, characterized by highly parallel floating-point operations such as convolution, differentiation, logarithms, and matrix multiplication, are required to extract valuable business insight and train prediction models from the raw input data. Offloading these computations onto heterogeneous hardware such as GPU/FPGA clusters, or even NPUs, improves the cost-effectiveness and energy efficiency of high-density computing by a factor of 5 to 10 or more. In addition, more and more customers require data interconnection between heterogeneous computing GPUs, FPGAs, and NPUs in massively parallel computing scenarios, which has driven rapid development of ultra-high-speed interconnect technologies such as RDMA over Converged Ethernet (RoCE) and NVLink. This ensures that the performance advantages of heterogeneous computing clusters are brought fully into play.
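To make the kind of operation concrete, here is a deliberately naive pure-Python sliding dot product (a valid-mode 1-D convolution without kernel flipping, i.e. strictly a cross-correlation). Each output element is an independent multiply-accumulate, which is exactly why such loops map so well onto GPU/FPGA/NPU hardware; this sketch is for illustration only and makes no claim about any vendor's implementation.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product over a 1-D signal.

    Each output element is an independent multiply-accumulate, so all
    of them can be computed in parallel on an accelerator.
    (No kernel flip, so strictly this is cross-correlation.)
    """
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(n - k + 1)
    ]

# A simple difference (edge-detection) kernel over a ramp signal.
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))
```

On an accelerator, the list comprehension above becomes one wide parallel kernel launch rather than a sequential Python loop; the arithmetic is identical.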
Regarding ultra-high IOPS for storage-intensive scenarios, next-generation storage-class memory (SCM) is increasingly used as the default flash storage medium: it offers an acceptable price per unit of capacity while providing read/write speeds and latency comparable to memory. As a result, SCM-based storage nodes connected to compute nodes over RDMA/RoCE links are being adopted ever more widely in distributed storage architectures, with typical shared-storage latency below 100 µs and storage bandwidth even higher than that of a local PCIe-attached disk.
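As a back-of-the-envelope check on that sub-100 µs figure, the sketch below adds up an assumed latency budget for one remote SCM read over RDMA/RoCE. Every number here is an illustrative assumption chosen for the sketch, not a measured value.

```python
# Illustrative latency budget (microseconds) for one remote SCM read.
# All values are assumptions for this sketch, not measured data.
budget_us = {
    "scm_media_read": 10.0,   # storage-class memory access itself
    "rdma_round_trip": 8.0,   # RoCE round trip, compute <-> storage node
    "replica_ack": 12.0,      # waiting on a replica acknowledgement
    "software_stack": 25.0,   # driver, scheduling, protocol overhead
}

total_us = sum(budget_us.values())
print(f"estimated end-to-end read latency: {total_us:.0f} us")
```

Even with generous software overhead, the assumed budget lands well under the 100 µs bound, which is what makes remote SCM competitive with local disk.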
Flexible: Another challenge for 5G networks is to orchestrate and reassemble network slices in an agile, flexible way. Building network-slicing capabilities requires the evolution of NEs for 5G services and protocols, as well as the streamlining of the 5G IoT application data layer, core network layer, and wireless access layer on a cloud data center. These advancements will help implement end-to-end network functions on the management, control, and forwarding layers, and support dynamic, on-demand isolation of capacity and QoS.
Based on the specific requirements of vertical service scenarios, flexible 5G network slicing requires that network elements and service applications, as well as the network links between them, be deployed, dimensioned, and configured within the shortest possible timeframe. A template-based orchestration service is introduced so that 5G network elements, applications, and their dependencies can be automatically provisioned and configured from heterogeneous resource flavors of virtual and physical machines with a pre-defined topology. Beyond static resource topology, dynamic orchestration capabilities, such as sequential, conditional, and loop control of orchestrated services with guaranteed transaction integrity, are also a necessity for the PaaS service. Together they simplify and shorten 5G network construction, from complicated projects lasting weeks or months to one-click, automated, repeatable construction within hours or minutes.
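A toy illustration of such template-driven provisioning, using the standard-library `graphlib` to derive a valid deployment order from declared dependencies; the element names in this template are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical slice template: each element maps to the elements it
# depends on; the orchestrator must provision dependencies first.
template = {
    "overlay_network": set(),
    "vepc_upf": {"overlay_network"},
    "vepc_cp": {"overlay_network", "vepc_upf"},
    "iot_app": {"vepc_cp"},
}

# static_order() yields a provisioning sequence that respects every
# dependency edge, or raises CycleError on a malformed template.
order = list(TopologicalSorter(template).static_order())
print(order)
```

A real orchestrator adds the dynamic control flow described above (conditions, loops, transactional rollback) on top of exactly this kind of dependency resolution.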
Distributed: In the 5G era, most physical network access and routing functions, as well as innovative applications such as IoT core services, Big Data, and AI deep learning and analytics, are typically deployed as VMs in centralized, large-scale cloud data centers in a geo-redundant manner; these data centers support up to tens of thousands of hosts. By contrast, 5G IoT device access, the corresponding application platforms, and diversified third-party applications are located at distributed sites. The access devices of the 5G data plane, such as vEPC gateways, are generally deployed near the metro aggregation Points of Presence (POPs). This ensures that cloud services for backup, disaster recovery (DR), and video storage uploading, as well as other typical low-latency interactive services, can be accessed with guaranteed QoS/SLA over non-blocking dark-fiber/MPLS bandwidth. These services can run on dozens or hundreds of small-scale satellite cloud sites configured in the one-stop Cloud-in-a-Box mode.
The ubiquitous access and coverage of 5G networks demand ultra-low latency and ultra-high bandwidth. Various functions, including radio air-interface protocol processing, baseband control, wireless resource management and scheduling, network data tunneling, route forwarding, aggregation, and service processing, should therefore be moved to the satellite cloud.
In synergy with the centralized cloud regions, and besides the satellite clouds located at local city POPs, large numbers of "edge nodes" are still needed beneath the "satellite cloud", designed specifically to better support IoT services. For example, predefined AI-enabled recognition of surveillance video, stream filtering of raw IoT data, 3D content rendering for VR/augmented reality (AR) games, and real-time user interaction can be migrated from the centralized data center to the access network edge. In this way, centralized intelligent analysis, agile development iteration, and local, real-time access processing complement one another.
Moreover, edge nodes can convert the large volumes of high-bandwidth, unstructured, or multimedia data on the terminal side into high-value, low-bandwidth structured data that can be uploaded to the cloud data center for centralized analysis and processing, while delivering control commands back to the edge and terminals. This effectively improves the overall throughput of end-to-end 5G networks and cloud service applications, which in turn improves user experience. The seamless integration of edge computing with full-stack cloud services, such as the cloud PaaS platform, Big Data, and AI services, accelerates the development and rollout of innovative services such as IoT, video AI, and AR/VR games with ubiquitous access. One typical edge cloud reference architecture combines centralized Kubernetes master nodes with remote worker (minion) nodes running the kubelet agent. With northbound Kubernetes APIs, the edge cloud is compatible with the Kubernetes ecosystem's edge computing services and supports access registration and security certificate management for tens of thousands of distributed edge nodes. In addition, edge nodes also need to support parallel batch container instances as well as serverless instance deployment and lifecycle management.
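As a minimal sketch of this data-reduction role (the field names and threshold are hypothetical), the function below collapses a batch of raw edge sensor readings into one compact structured record suitable for upload to the central cloud:

```python
from statistics import mean

def summarize_readings(readings, alert_threshold):
    """Reduce a batch of raw edge readings to one structured record.

    Instead of shipping every raw sample upstream, the edge node
    uploads only this low-bandwidth, high-value summary.
    """
    return {
        "count": len(readings),
        "mean": round(mean(readings), 2),
        "peak": max(readings),
        "alerts": sum(1 for r in readings if r > alert_threshold),
    }

raw = [21.0, 21.5, 22.0, 35.5, 21.2]   # e.g. raw temperature samples
record = summarize_readings(raw, alert_threshold=30.0)
print(record)
```

Five raw samples become four structured fields; at real edge volumes (thousands of samples per second per node), this is where the bandwidth saving comes from.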
Intelligent: Enabling IoT applications is one of the main objectives of 5G network construction. IoT applications generate massive amounts of data, so data centers in the 5G era need to provide ultra-large elastic storage and computing capacity. In addition, an intelligent engine that is easy to configure and efficient to use, combined with diverse domain knowledge and data models, is needed to quickly learn and extract targeted, valuable information and strategies from this data. This facilitates closed-loop control of IoT terminals and edge devices.
Typical application scenarios include:
- Image and video recognition in wireless video surveillance scenarios
- Vehicle GPS location tracking and driving behavior preference analysis in Internet of Vehicles
- Traffic congestion and violation detection in intelligent traffic scenarios
- Population density and mobility prediction in smart city scenarios
- Power usage distribution and peak prediction in smart grid scenarios
The intelligent engine introduced above requires the cloud data center to rely on IoT data lakes built on the Big Data platform. This enables the cloud data center to provide rich platform services and APIs with pre-integrated machine learning, deep learning, graph engine, and search capabilities, as well as AI services and APIs in common fields such as vision, voice, natural language, and optical character recognition (OCR). These platform and general AI/machine learning services work closely with the heterogeneous computing hardware, including GPUs, FPGAs, and NPUs, and with the scheduling system of the 5G cloud platform to implement in-depth software optimization.
Considering that 5G cloud data centers will be deployed across multiple distributed geo-locations with multi-tenant network slices enabled, and that cloud regions of up to millions of resource nodes must be supported, a truly intelligent, self-healing maintenance mechanism is urgently required. Traditional local O&M and fault management must be replaced by proactive, predictive O&M and management based on the powerful Big Data/AI services of the 5G full-stack cloud. AI/ML algorithms deployed on the platform, using supervised, semi-supervised, and unsupervised learning, can help analyze the massive volumes of log information collected from software and hardware subsystems. These algorithms support root cause analysis of failures, automatic identification of abnormal behavior patterns, and prediction of network and hard disk faults, dramatically improving O&M efficiency: a single O&M engineer can maintain more than 1,000 servers on average.
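A minimal, unsupervised baseline for the kind of abnormal-pattern detection described here is to flag metric samples whose z-score deviates strongly from the series mean. Real predictive O&M would use far richer models and features; this sketch, with hypothetical latency data, only illustrates the idea.

```python
from statistics import mean, pstdev

def detect_anomalies(series, z_threshold=3.0):
    """Return indices of samples whose z-score exceeds the threshold.

    A simple unsupervised baseline: any metric sample far from the
    series mean (measured in standard deviations) is flagged.
    """
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:          # perfectly flat series: nothing to flag
        return []
    return [i for i, x in enumerate(series)
            if abs(x - mu) / sigma > z_threshold]

# Hypothetical per-minute disk I/O latencies (ms) with one spike.
latency_ms = [10.0] * 20 + [100.0]
print(detect_anomalies(latency_ms))
```

Fed with fault labels, the same pipeline can be upgraded to the supervised and semi-supervised models the text mentions; the z-score rule is simply the cheapest starting point.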
– Dennis Gu, Chief Architect of Cloud Computing Solutions, Huawei Technologies