By Kiran Bondalapati, Co-Founder and CTO of ZeroStack

Even with the most dependable engine, a car will be a lemon if the other critical systems (e.g. transmission, suspension, etc.) aren’t engineered with the equivalent level of rigor. The same is true for clouds built with OpenStack. OpenStack — an open source platform for provisioning and managing processing, storage and networking resources throughout a data center — has matured to a point where the underlying technology itself is “enterprise-ready.” However, it’s unrealistic to integrate modules off the shelf and expect an OpenStack-based private cloud to provide a sufficient level of resiliency. The issue is not the reliability of OpenStack itself, but the way services are architected for high availability (HA) on top of it.

To build a highly available OpenStack-based cloud, deliberate steps must be taken to harden the platform at each of three fundamental levels.

  1. Controller and core services: The traditional method of providing HA involves designing fail-over mechanisms into the control plane. There are two inherent issues with this approach. First, it introduces potential for error because it requires manual intervention from the administrator to recover from a failure. For instance, consider a redundant pair of controllers managed by an HA proxy (or load balancer in an active-active configuration). In the event of a controller failure, the HA proxy will redirect traffic to the remaining node, but that node becomes a single point of failure until the admin repairs or replaces its counterpart. Second, separating the control functionality from the compute and storage tiers creates silos of specialized nodes, which adds complexity and scaling challenges to the infrastructure.
  2. Virtual machine (VM): Ideally, a failed VM should be restarted with the same disks. However, identifying a dead VM and taking the proper action is a non-trivial task. Scanning for a network disconnect alone is unreliable, as the VM may still be executing I/O transactions on its disks; it’s necessary to disconnect storage I/O during VM failures to avoid data corruption.
  3. Application: To ensure service availability, it is a common best practice to run applications across replicated resources hosted in multiple availability zones (AZs). Often, though, there is no reliable way to guarantee locality within an AZ, so the application user will experience higher latency for inter-tier or inter-VM requests. In addition, failure events could take an application down if the associated VMs are concentrated within a single rack, power domain or host.
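To make the controller pairing in point 1 concrete, here is a minimal sketch of an HAProxy front end for a redundant pair of OpenStack controllers, with the second node configured as a passive backup. The service name, addresses and ports are hypothetical, not taken from any particular deployment:

```haproxy
# Sketch: HAProxy fronting two controllers serving the Keystone API.
# If controller1's health check fails, traffic shifts to controller2 —
# but until controller1 is repaired, controller2 is a single point of failure.
frontend openstack_identity
    bind 192.168.0.10:5000
    default_backend keystone_api

backend keystone_api
    option httpchk GET /v3
    server controller1 192.168.0.11:5000 check
    server controller2 192.168.0.12:5000 check backup
```

This illustrates the manual-recovery gap described above: the fail-over itself is automatic, but restoring redundancy is not.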
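The fence-before-restart sequence from point 2 can be sketched as follows. The helper callables (`is_alive`, `fence_storage`, `restart_with_same_disks`) are hypothetical placeholders standing in for whatever monitoring and storage-control hooks a given platform provides; they are not a real OpenStack API:

```python
def recover_vm(vm, is_alive, fence_storage, restart_with_same_disks):
    """Restart a failed VM only after its storage I/O has been fenced.

    is_alive: callable returning True if the VM still responds.
    fence_storage: callable that disconnects the VM's disk I/O paths.
    restart_with_same_disks: callable that boots a replacement VM
        attached to the original disks.
    """
    if is_alive(vm):
        return "healthy"  # nothing to do
    # A network disconnect alone is not proof of death: the VM may
    # still be writing to its disks. Fence storage I/O first so that
    # two writers can never corrupt the same volumes.
    fence_storage(vm)
    restart_with_same_disks(vm)
    return "recovered"
```

The ordering is the essential point: fencing must complete before the replacement VM attaches the original disks.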

The path forward is to incorporate battle-tested methods from web-scale infrastructures into the enterprise data center. The key, in my opinion, is to adopt techniques to build scalable and automated schemes for HA, for instance:

  • Consider symmetric, self-healing architectures built with a distributed control plane to obviate the need for special nodes and siloed deployments.
  • Consider a stronger means to detect and isolate failures to maintain system integrity.
  • Ensure better control over data placement (i.e. implement affinity rules for VMs across tiers along with anti-affinity rules for VMs within a tier) for improved application performance and higher reliability.
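The anti-affinity rule in the last bullet can be sketched as a simple placement filter that spreads the VMs of one tier across distinct failure domains (racks, in this example). The data structures and function names are hypothetical, chosen only to illustrate the idea:

```python
def pick_host(hosts, host_rack, used_racks):
    """Choose a host whose rack is not already used by this tier.

    hosts: candidate host names, in preference order.
    host_rack: mapping of host -> rack (failure domain) identifier.
    used_racks: racks already hosting a VM of this tier.
    """
    for host in hosts:
        if host_rack[host] not in used_racks:
            return host
    return None  # no candidate satisfies the anti-affinity rule


def place_tier(vm_count, hosts, host_rack):
    """Spread vm_count VMs of one application tier across distinct racks."""
    used_racks, placement = set(), []
    for _ in range(vm_count):
        host = pick_host(hosts, host_rack, used_racks)
        if host is None:
            raise RuntimeError("not enough failure domains for anti-affinity")
        used_racks.add(host_rack[host])
        placement.append(host)
    return placement
```

A real scheduler would combine this with affinity rules pulling cooperating tiers closer together, but the filter above captures why rack-aware placement prevents a single rack or power-domain failure from taking down an entire tier.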

For additional insights and details, check out our presentation on this topic at the recent OpenStack Summit in Tokyo – “Reliable OpenStack – Designing for Availability and Enterprise Readiness”.