The Datacenter as a Computer (Second Edition)

Introduction

While a datacenter can contain a heterogeneous mix of applications, some applications are large enough to fill an entire datacenter; that is, the entire datacenter is centered around delivering one single application. For example, Gmail, or AWS instances (from Amazon’s point of view, the application being offered is the hypervisor; Amazon doesn’t really care what runs inside it).

Effectively running one application inside a datacenter allows the designers of said Datacenter Application to reason about the effects of orchestrating hundreds or thousands of servers.

The hardware design of a server (CPU registers, cache levels, RAM, SSD) impacts how software is written; for example, we store frequently accessed data in RAM and we make sure data that must persist ends up on non-volatile storage. If we have access to a GPU, we only send parallel problems to it, and when we do send the data, we send it in bulk because of the transfer cost from main memory to the GPU’s memory.
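
A minimal sketch of that last point, with made-up cost numbers (the per-transfer overhead and per-byte cost below are assumptions, purely for illustration):

  # Toy model: every transfer to the device pays a fixed setup cost plus a
  # per-byte cost, so one bulk copy beats many small ones. Numbers are assumed.
  PER_TRANSFER_OVERHEAD_US = 10.0   # assumed fixed cost per transfer
  PER_BYTE_COST_US = 0.001          # assumed cost per byte moved

  def transfer_time_us(num_transfers, bytes_per_transfer):
      return num_transfers * (PER_TRANSFER_OVERHEAD_US
                              + bytes_per_transfer * PER_BYTE_COST_US)

  total_bytes = 1_000_000
  print(transfer_time_us(1, total_bytes))               # one bulk copy: ~1,010 us
  print(transfer_time_us(1_000, total_bytes // 1_000))  # 1,000 small copies: ~11,000 us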

In the same way, we can design software to take into consideration the properties of the servers available within a datacenter: instead of writing the software for the server, we write it for the datacenter. For example, if two servers are in the same rack, it’s faster (lower latency) for the first server to access data held in the second server’s RAM, over the rack’s network, than to read its own data from local disk. If we accept that servers, racks and the entire datacenter are all actually working together to deliver the same application, we can take advantage of such realizations.

datacenter_perf_stats.png
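
Rough, order-of-magnitude numbers behind that claim (the values below are assumptions for illustration, not the book's exact figures):

  # Approximate access latencies in microseconds (assumed values).
  LATENCY_US = {
      "local DRAM": 0.1,
      "rack neighbor's DRAM, over the rack network": 300.0,
      "local disk": 10_000.0,
      "rack neighbor's disk": 11_000.0,
  }
  # Reading a neighbor's RAM over the rack switch is still ~30x faster than
  # reading from the server's own local disk.
  print(LATENCY_US["local disk"] / LATENCY_US["rack neighbor's DRAM, over the rack network"])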

Workloads and Software Infrastructure

Datacenter-Sized Internet Application

Parallelism: the challenge is not to find parallelism but to harness it efficiently.

Workload Churn: behind stable APIs, code changes often, monthly if not weekly.

Platform (Server) Homogeneity: not much variation between servers, or between server configurations, within a datacenter.

No Fault-Free Operation: given the quantity of servers, failures are a daily occurrence.

General-Purpose Computing System:

Section 2.2 has a table of high-availability and scaling techniques (e.g. sharding, watchdogs, canaries, tail-tolerance)

Platform-Level Software
Basic server-level software (OS, firmware, drivers, libraries). Platform homogeneity and a known environment mean:

Cluster-Level Software
Software that makes the server a node in a cluster: cluster synchronization, communication, accessing cluster services, and offering services to the cluster.

Resource Management:

Basic Cluster Services:

Deployment and Maintenance:

Programming Frameworks:

Unstructured Storage:

Application-Level Software
Application-specific software, the layer that uses the other levels to perform work (Gmail, Maps, etc.).

Sections 2.5.1-2.5.3 Give a Nice Example of Application-Level Software

A Monitoring Infrastructure
Service-Level Dashboards:

Performance Debugging Tools:

Platform-Level Health Monitoring:

Buy vs Build
To use third-party solutions or build/modify in-house?

Tail-Tolerance
With enough scale, we cannot guarantee that all systems will perform well; for some requests, some arbitrary system will perform poorly, and that slow system becomes the bottleneck for the entire request. There are different strategies to deal with this, e.g. having several servers redundantly handle the same part of the request simultaneously, in case one of them is the slow one (and hopefully another performs well).
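
A minimal sketch of one such strategy (often called hedged requests); the replica behavior and latencies below are assumed for illustration:

  import concurrent.futures
  import random
  import time

  def query_replica(replica_id):
      """Stand-in for a backend call; occasionally a replica is very slow."""
      time.sleep(random.choice([0.01, 0.01, 0.01, 1.0]))  # assumed latency mix
      return "result from replica %d" % replica_id

  def hedged_request(replica_ids):
      """Send the same request to several replicas and return the first answer."""
      pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replica_ids))
      futures = [pool.submit(query_replica, r) for r in replica_ids]
      done, _ = concurrent.futures.wait(
          futures, return_when=concurrent.futures.FIRST_COMPLETED)
      pool.shutdown(wait=False)  # don't block on the stragglers
      return next(iter(done)).result()

  print(hedged_request([1, 2, 3]))  # a single slow replica no longer dominates latency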

Hardware Building Blocks

How to choose cost-efficient hardware for the datacenter?
There are different tradeoffs (financial, technical) between buying wimpy (low-end) and brawny (high-end) servers.

Using fewer Brawny vs many Wimpy

Hardware Design, Choices

Networking

Datacenter Basics

Datacenters generally belong to one of four tiers:

Theoretical availability estimates used in the industry:

Power Systems
Power to datacenter floor

  1. (Outside) Utility Substation: high voltage (>110 kV) stepped down to medium voltage (<50 kV)
  2. Primary Distribution Center (Unit Substations): medium voltage stepped down to low voltage (<1000 V)
  3. Low-voltage lines run to the UPS; the UPS is also connected to a backup generator
  4. (Inside the building) The UPS feeds the PDUs on the datacenter floor

UPS

PDU

High-Voltage DC (reducing AC-DC transforms)
AC-DC and DC-AC conversions happen at the UPS and at each server’s power supply.

Conversion Reduction
One (oversimplified) way to reduce conversions is to perform a single conversion from high-voltage AC to high-voltage DC and use only DC everywhere in the datacenter; after all, server components want DC.
This is complicated by the fact that HVDC-trained workers and HVDC equipment are still not mainstream.
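
A tiny sketch of why fewer conversion stages help; the per-stage efficiencies below are assumptions for illustration, not measured values:

  # Losses multiply along the chain, so removing a conversion stage raises
  # end-to-end efficiency. Per-stage efficiencies are assumed values.
  def chain_efficiency(stage_efficiencies):
      total = 1.0
      for eff in stage_efficiencies:
          total *= eff
      return total

  ac_path = chain_efficiency([0.94, 0.94, 0.95])  # assumed: UPS AC-DC, UPS DC-AC, server PSU
  dc_path = chain_efficiency([0.96, 0.95])        # assumed: one AC-DC rectifier, server DC-DC
  print(ac_path, dc_path)  # ~0.84 vs ~0.91 end-to-end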

Cooling
Fresh air cooling (open loop)

Simple Closed-Loop

Three-Loop Cooling (one example configuration)

Tradeoffs
“Each topology presents tradeoffs in complexity, efficiency, and cost. For example, fresh air cooling can be very efficient but does not work in all climates, does not protect from airborne particulates, and can introduce complex control problems. Two loop systems are easy to implement, are relatively inexpensive to construct, and offer protection from external contamination, but typically have lower operational efficiency. A three-loop system is the most expensive to construct and has moderately-complex controls, but offers contaminant protection and good efficiency when employing economizers.”

The mechanical and electrical components of cooling can add a lot (around 40%) to the power usage, and thus to the construction and operating cost, of the datacenter.

Airflow at Rack
The tiles of the raised floor, from under which the cool air originates, can have different perforation sizes; the perforations are chosen depending on how much airflow we want to stream upwards towards the servers in the rack. The upward cold airflow must match the warm horizontal airflow the servers are generating; if it does not, some servers will ingest warm air instead of cool air.

Poor airflow requires the cool air temperature to be lowered further, which is more costly than simply providing proper airflow. Cost-wise, airflow is a limiter of power density (and, in effect, of how many servers fit per unit of volume).
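
A toy check of the matching requirement (all airflow figures below are assumed, for illustration only):

  # Does the cold air delivered through the floor tile cover what the servers
  # in the rack are pulling through their fans? All CFM values are assumed.
  server_airflow_cfm = [120, 120, 150, 150]  # per-server fan airflow
  tile_airflow_cfm = 500                     # cold air delivered by the perforated tile
  deficit = sum(server_airflow_cfm) - tile_airflow_cfm
  print(deficit)  # 40 CFM short: some servers will ingest recirculated warm air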

In-Rack/In-Row Cooling
Cooling can happen directly at the rack or row-of-racks level: the hot server air is immediately cooled by cold water pipes running alongside the rack/row. In-rack/in-row cooling can complement the CRAC or entirely replace it (effectively bringing the CRAC next to the servers).

Local Server Cooling
Liquid-cooled heat sinks over heat-dissipating parts, like CPUs.

Container-Based Datacenters
Server racks are put inside a 20 ft-40 ft container; each container has its own cooling, power, PDUs, cabling, lighting, etc. Containers still need outside help from CRACs, UPSs, and generators. They provide higher densities due to better airflow control.

Energy and Power Efficiency

Power Usage Efficiency (PUE)

Problem with PUE

Efficiency Losses for PUE of 2
datacenter_pue_losses.png

Server PUE (SPUE)

Total PUE (TPUE)
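
A small worked example tying PUE, SPUE and TPUE together; the overhead values below are assumed for illustration:

  # PUE  = total facility power / power delivered to IT equipment
  # SPUE = server input power / power reaching useful components (CPU, DRAM, disk...)
  # TPUE = PUE * SPUE, the end-to-end overhead from utility feed to useful components
  pue = 1.5    # assumed facility overhead (cooling, power distribution, lighting)
  spue = 1.2   # assumed server overhead (power supply, voltage regulators, fans)
  tpue = pue * spue
  print(tpue)  # 1.8: for every watt of useful work, 1.8 W enter the facility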

Computation Energy Efficiency

Energy-Proportional Computing

Low-Power Modes

Software Role in Energy Efficiency
Software-wise, work can be distributed in energy-efficient ways.
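
One concrete example of such distribution (an illustrative sketch, not a technique taken from these notes): pack the current load onto as few servers as possible so the remaining servers can drop into low-power modes.

  # Toy consolidation: first-fit-decreasing packing of task loads onto servers
  # so unused servers can idle in a low-power state. Capacities/loads assumed.
  SERVER_CAPACITY = 1.0

  def consolidate(task_loads):
      servers = []  # each entry is the list of loads placed on one server
      for load in sorted(task_loads, reverse=True):
          for server in servers:
              if sum(server) + load <= SERVER_CAPACITY:
                  server.append(load)
                  break
          else:
              servers.append([load])  # have to power on one more server
      return servers

  print(len(consolidate([0.2, 0.5, 0.4, 0.3, 0.1])))  # 2 servers busy, the rest can idle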

Power Provisioning
Determining Power Budget for Servers

Oversubscribing Power
Study at Google datacenter showed:

Trends In Server Energy Usage
CPU Dynamic Voltage and Frequency Scaling (DVFS)

CPU Power Gating

Energy Storage for Power Management
Proposed ideas: use UPS power to

Power usage is a big, costly component of a datacenter; it is a complex problem at several different levels.

Modeling Costs

datacenter_cost_per_watt.png

Total Cost of Ownership (TCO) is composed of Capital Expenses (Capex) and Operational Expenses (Opex).

Capex: depreciated upfront costs, e.g. construction costs, server costs
Opex: recurring costs, e.g. electricity, repairs, labor

TCO = datacenter depreciation (amortization) + datacenter Opex + server depreciation (amortization) + server Opex
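
The same formula as a small sketch; every dollar figure and lifetime below is an assumed, illustrative input:

  # Monthly TCO = datacenter depreciation + datacenter Opex
  #             + server depreciation    + server Opex
  datacenter_capex = 12_000_000        # assumed construction cost, depreciated over 12 years
  datacenter_life_months = 12 * 12
  server_capex = 2_000_000             # assumed server purchases, depreciated over 3 years
  server_life_months = 3 * 12
  datacenter_opex_monthly = 40_000     # assumed facility labor, maintenance, etc.
  server_opex_monthly = 60_000         # assumed electricity, repairs, etc.

  monthly_tco = (datacenter_capex / datacenter_life_months
                 + datacenter_opex_monthly
                 + server_capex / server_life_months
                 + server_opex_monthly)
  print(round(monthly_tco))  # ~238,889 per month under these assumed inputs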

Capital Costs
Datacenter

Servers

Operation Costs
Datacenter

Servers (Hardware Maintenance, Electricity)

Case Study in Section 6.3

Real-World Datacenter Costs
Worse than the model predicts (many watts provisioned, not as many actually used).

Reserves

Partially Filled Datacenter
“To model a partially filled datacenter, we simply scale the datacenter Capex and Opex costs (excluding power) by the inverse of the occupancy factor (Figure 6.3). For example, a datacenter that is only two thirds full has a 50% higher Opex.”
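
The scaling rule from the quote, as a tiny sketch (the occupancy value is assumed):

  def occupancy_cost_multiplier(occupancy):
      # Non-power Capex/Opex are scaled by the inverse of the occupancy factor.
      return 1.0 / occupancy

  print(occupancy_cost_multiplier(2 / 3))  # 1.5 -> 50% higher cost per used server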

Cost of Public Cloud
“How can a public cloud provider (who must make a profit on these prices) compete with your in-house costs? In one word: scale. As discussed in this chapter, many of the operational expenses are relatively independent of the size of the datacenter: if you want a security guard or a facilities technician on-site 24x7, it’s the same cost whether your site is 1 MW or 5 MW. Furthermore, a cloud provider’s capital expenses for servers and buildings likely are lower than yours, since they buy (and build) in volume”

Dealing With Failures and Repairs

A system can be unavailable:

A system with no failures can still be less than 100% available.

Hardware Fails

Hardware no longer needs to run at all costs due to fault-tolerant software
Benefits:

Storage with Hardware Failure

Expectations from Hardware When Using Fault-Tolerant Software

Faults

Service-Level Failure Causes

Machine-Level Failures

Machine-Level Failure Causes

Predicting Faults

Repairs

Repair Diagnostics

Tolerating, Observing but Not Hiding Faults