The OVH outage is a reminder that flawless availability would cost too much

The OVH outage of October 13, 2021 drew a lot of attention because of its significant impact on the French web, among others. But before pillorying the French host, it is important to put things in context.

On Wednesday, October 13, 2021, OVH suffered a major outage that affected thousands of sites. Following a network configuration error, the French host took part of the web offline for a little over an hour. The blackout was highly visible because it affected a mix of media sites, institutional sites such as data.gouv.fr, and commercial sites such as Interflora.

That so many parts of the web fell because of a bug is both flattering and embarrassing for OVH. On the one hand, the company shows that it has become essential infrastructure for the French-speaking and global web; on the other, the incident damages its brand image a few days before its IPO.

Quickly repaired, this failure is nonetheless a reminder of an important truth: no computer system is infallible.

What is availability?

We tend to forget it sometimes, but the “web” is made up of thousands of routers and servers around the world. These machines, like ours, have their share of breakdowns and malfunctions. But when it’s the servers of a company like OVH that go down, the impact is felt more than when our phone reboots for no reason.

As a host, OVH must ensure near-flawless service availability. Availability, in computing parlance, is quite simply the ability of a host to ensure access to its servers, and therefore to your favorite sites. On its site, OVH promises “availability of […] 99.9%” for a basic offer, a fairly standard and expected figure in this industry.

[Image: an OVH data center. On a network like OVH’s, every little wire matters. Source: OVH]

On paper, that looks almost perfect. But even a 0.1% margin can have significant consequences, as this outage demonstrates. The reason why no mainstream web host advertises 100% uptime is simple: cost.

As network specialist Stéphane Bortzmeyer explains, “making sure a network works 99.999% of the time isn’t a bit more expensive than making sure it works 99.99% of the time, it’s much more expensive.” The first percentage allows about 5 minutes of outage per year, while the second raises the “allowed” outage time to about 52 minutes per year. The “99.9%” availability promised by OVH leaves room for roughly 8 hours and 45 minutes of failure per year. Every minute of difference represents significant human and technical effort.
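The downtime figures above follow from simple arithmetic. Here is a minimal sketch of that calculation, assuming a 365-day year and no excluded maintenance windows (real service agreements may count downtime differently); the function name is only illustrative.

```python
# Yearly downtime budget implied by an availability target.
# Assumption: a 365-day year, with every minute counted equally.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Each extra “nine” divides the budget by ten: roughly 526 minutes, then 53, then 5.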

Ensuring impeccable availability is expensive, “and this reliability is not essential for all uses,” adds the network engineer. An e-commerce site can tolerate an outage of a few hours a year, but for critical infrastructure such as hospitals or the police, it is more complicated. In such cases, a so-called “multi-cloud” approach (that is, with data stored at several hosts in parallel) is essential. But doubling the volume of data to be stored quickly inflates the bill.
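To give a rough sense of what a second host buys, the sketch below estimates the availability of a service replicated across several hosts. It rests on a strong assumption that does not come from the article: the hosts fail independently of one another, which is optimistic in practice. The function name is hypothetical.

```python
from typing import Sequence


def combined_availability(availabilities: Sequence[float]) -> float:
    """Probability that at least one host is up.

    Assumption: failures are independent, so the service is down
    only when every host is down at the same time.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


if __name__ == "__main__":
    # Two hosts, each promising 99.9% availability (as fractions).
    print(f"{combined_availability([0.999, 0.999]):.6f}")
```

Under that assumption, two hosts at 99.9% each come out at about 99.9999%, at the price of paying for storage twice.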

An inevitable mistake

For reasons of cost, many sites therefore rely on a single host. However, the infrastructure of a service like OVH is extremely complex, with servers spread all over the world and routers that redirect millions of connections per day.

Even when configuration changes are tested up front, it is difficult to identify all possible issues. “The lab will never be an exact replica of the real network […] a network the size of that of OVH is a very complex socio-technical object and […] it is very difficult to predict the consequences of an action,” explains Stéphane Bortzmeyer. This complexity is further accentuated when clients are spread all over the world: it is impossible to schedule a maintenance operation at a time that suits every time zone. With so many variables at play, even a tiny grain of sand in the machine can have huge consequences.

In its analysis of the failure, OVH gives some additional details: “OVHcloud operates a global backbone (core network, Editor’s note) which covers all continents. To ensure the best possible reach for its customers, the backbone is fully meshed. By nature, this mesh means that all routers […] are directly or indirectly connected to each other and constantly exchange routing information.” To put it simply, by seeking to improve the quality of its network, OVH has also created weaknesses. A simple copy-paste error in one router’s configuration quickly spread across the whole system, causing the outage.

Maintaining a network architecture like that of OVH therefore requires juggling efficiency, price and reliability, and the right balance is hard to find. The hour or so that the outage lasted will at least have reminded us that the Internet is an extremely complex network and that no one is immune to a bug, whether they are called OVH or Facebook.
