Why Does Big Iron Still Exist?


Abstract

In a world where many companies prefer to scale out instead of scaling up, buying many "small" machines rather than a few "big" ones, does "Big Iron" still make sense?

The Hacker's Dictionary defines "Big Iron" as «Large, expensive, ultra-fast computers. Used generally of number-crunching supercomputers such as Crays, but can include more conventional big commercial IBMish mainframes. Term of approval; compare heavy metal, oppose dinosaur». Nowadays, big iron is mainly used for running big databases such as Oracle, DB2 and, to a certain extent, Microsoft SQL Server, on top of Solaris on SPARC, AIX on POWER, and Windows Server on Xeon. These machines come at a considerable cost, starting at around $100K and going up to millions of dollars per machine.

One obvious — and interesting — question could be: wouldn't it be better to have a large number of smaller machines running, say, Linux and a free database such as MySQL or PostgreSQL?

There are several good reasons why these machines are still worth buying.

Hardware and software integration: Usually, these machines are sold by big vertical companies. A vertical company does not sell one product; it sells solution stacks. For instance, IBM — and Oracle after the acquisition of Sun Microsystems — sells everything: the hardware, the operating system, the database, the application frameworks and, not less important, an army of consultants to integrate their platform into another company's infrastructure. The fact that all the components live in the same place and are all gears of the same money-making machine makes it possible to integrate them tightly and, for instance, introduce features in the operating system that fit the needs of the database running on top of it. To a certain extent, this also works for the Windows/SQL Server integration, even though Microsoft is missing the hardware part of the stack.

Cost of ownership: By cost of ownership, we mean the cost of keeping one — or more — machines up and running. Dealing with many machines implies setting up systems to deploy and update software in a distributed environment, sharding the data to be served across that environment, and so on. In a distributed environment, especially one composed of many inexpensive machines, the failure of a single disk can compromise the functionality of an entire machine, and increasing the number of machines automatically increases the chance of having at least one machine of the group down. Having fewer large machines, while more expensive to acquire, keeps the cost of maintaining them lower. A single disk failure usually does not bring down such a machine, because they employ redundancy and disks can be hot-swapped. Having just two large machines working in master/slave — or mirror — mode, instead of many small ones running in a totally distributed way, simplifies deploying and updating software: for the duration of the change, one machine at a time can be removed from the pool, instead of running multiple waves of roll-outs across a fully distributed installation. Moreover, if the architectural requirements of the company or of the applications demand better isolation between applications, a single large machine can be divided into multiple virtual machines that all share the same hardware. The cost of running a virtual machine is lower than that of running a physical one, and the cost of running one large host machine is still lower than running many.
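To make the availability argument concrete, here is a small back-of-the-envelope sketch (not from the original article). It assumes each machine fails independently with the same probability p; under that assumption, the chance that at least one of n machines is down is 1 - (1 - p)^n, which grows quickly with n.

```python
# Back-of-the-envelope sketch: if each machine is independently unavailable
# with probability p, the chance that at least one machine in a fleet of n
# is down grows quickly with n.
def p_at_least_one_down(n: int, p: float) -> float:
    """Probability that at least one of n independent machines is down."""
    return 1.0 - (1.0 - p) ** n

# Illustrative numbers only: a 1% per-machine unavailability.
for n in (2, 10, 100, 1000):
    print(f"{n:>4} machines -> {p_at_least_one_down(n, 0.01):.1%} chance of >=1 failure")
```

With these illustrative figures, two machines give roughly a 2% chance of seeing a failure, while a hundred machines push it above 60%: the operational burden of "something is always broken" appears well before the fleet gets very large.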

Operations in distributed memory: With particular reference to the databases running on these machines, while many operations can be done in a distributed way across many machines, some of them cannot easily be done with good performance. The main operation that cannot be performed satisfactorily in a distributed way is the join. A join is essentially a multi-set multiplication. This means that if the data is sharded across multiple machines, rows will need to move across the network, especially if we are joining on an element that is not the one used to shard the data. While this operation is extremely expensive in a distributed environment, it can be performed decently in parallel by multiple threads with all the data in shared memory, or at least without paying the communication cost of moving all the data around the network.
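The shuffle cost can be illustrated with a toy sketch (not a real database engine; tables, column names and the sharding function are all hypothetical). Orders are sharded by user_id, so a join on product_id forces most order rows to travel to whichever node holds the matching product row.

```python
# Toy model of a sharded join: count how many rows would have to cross the
# network when joining on a column that is NOT the shard key.
from collections import defaultdict

NUM_NODES = 3
shard = lambda key: sum(str(key).encode()) % NUM_NODES  # deterministic toy sharding

# Hypothetical data: orders sharded by user_id, products sharded by product_id.
orders   = [(1, "A"), (2, "B"), (3, "A"), (4, "C")]   # (user_id, product_id)
products = [("A", 9.99), ("B", 5.00), ("C", 12.50)]   # (product_id, price)

order_nodes = defaultdict(list)
for user_id, product_id in orders:
    order_nodes[shard(user_id)].append((user_id, product_id))

# Joining orders with products on product_id: every order row whose matching
# product lives on a different node must be shuffled over the network.
moved = sum(
    1
    for node, rows in order_nodes.items()
    for _, product_id in rows
    if shard(product_id) != node
)
print(f"{moved} of {len(orders)} order rows must cross the network for this join")
```

On a single large machine the same join touches only local (shared) memory, which is why the argument above favors scaling up for join-heavy workloads.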

Fast communication among different applications/services: Even in a world now dominated by micro-services, such large machines can be beneficial simply because many different applications can run on the same hardware and, when they need to communicate with each other, they can do so faster: the kernel can turn the usual socket communication into cheaper local IPC without the application having to know. Moreover, if the same machine hosts both the applications and the database, those communications become inexpensive compared with the interactions between different machines.
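An application can also make this choice explicit. The sketch below assumes a hypothetical service that is reachable both over TCP ("service.internal", port 8080) and, when co-located, over a Unix domain socket at /tmp/svc.sock; none of these names come from the original article.

```python
# Sketch: prefer cheap same-host IPC (a Unix domain socket) when the peer
# service runs on the same machine, and fall back to ordinary TCP otherwise.
import socket

def connect_to_service(host: str = "service.internal", port: int = 8080,
                       local_socket: str = "/tmp/svc.sock") -> socket.socket:
    """Return a connected socket, using local IPC when available."""
    try:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(local_socket)      # same-host IPC, skips the TCP/IP stack
        return sock
    except OSError:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((host, port))      # normal TCP connection to a remote host
        return sock
```

The calling code stays the same either way, which is the point: co-location makes the cheap path available without changing the application's protocol.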

That being said, there are some cases in which having many machines can be a good choice, if not the best one.

These are the cases in which there is no need to perform expensive distributed operations, and all those in which data consistency among the machines is not a strict requirement: for instance, all the cases in which the data can be replicated on every machine and can safely be out of sync among them for a brief time. This is the usual case of front-end or application-layer machines that need to preserve very little state — if any at all — between subsequent requests, and where fault tolerance can be enforced horizontally across the architecture.

Front-end machines running just a webserver, a little logic to fit data into templates, and some static content (such as the company logo and other static assets) can benefit from being spread across a large number of equally capable machines. If one machine goes down, any other can continue serving users, limiting the cost of ownership of those boxes, and machines can easily be taken out for updates or maintenance without the fear of compromising the overall functionality of the system. Moreover, as those machines are usually I/O bound and do very little processing, spreading the load among many inexpensive machines can improve the usage of the inbound and outbound bandwidth.
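A minimal sketch of what "equally capable" means in practice: a handler that keeps nothing in process memory between requests, so any of the identical machines behind the load balancer can answer any request. The endpoint and template here are hypothetical, purely for illustration.

```python
# Toy stateless front-end handler: everything needed to build the response
# comes from the request itself (or from a shared backing store), never from
# local machine state, so any replica can serve any request.
from http.server import BaseHTTPRequestHandler, HTTPServer

TEMPLATE = "<html><body><h1>Hello, {user}!</h1></body></html>"

class StatelessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        user = self.path.lstrip("/") or "anonymous"
        body = TEMPLATE.format(user=user).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), StatelessHandler).serve_forever()
```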

Some system configurations leverage the best of both worlds. For instance, as reported by High Scalability, the architecture of Plenty of Fish (PoF for short), an online dating website, is based on a "big iron" running the expensive operations in SQL Server and a handful of smaller machines operating as webservers (11 servers plus 2 load balancers), also running Windows Server. The database machine is an HP ProLiant DL785 with 512 GB of RAM and 8 CPUs — i.e., 32 cores — running SQL Server on Windows Server 2008, as of the 2009 post on the PoF blog. While the data in the post is incomplete, the machine should cost between $100,000 and $150,000, plus the software licenses for both the OS and the database. Jeff Atwood, in Scaling Up vs. Scaling Out: Hidden Costs, does some math on the cost of scaling out vs. scaling up for the PoF case, showing that things are not as obvious as they may seem.
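The shape of that math can be sketched as follows. All figures below are hypothetical placeholders, not the real PoF or Atwood numbers; the point is only that license costs, admin effort, and machine count all enter the comparison, so neither option wins by default.

```python
# Hypothetical total-cost-of-ownership comparison (placeholder numbers only).
def total_cost(machines: int, hw_per_machine: float, sw_per_machine: float,
               admin_per_machine_per_year: float, years: int = 3) -> float:
    """Rough TCO: hardware + licenses + per-machine admin effort over N years."""
    return machines * (hw_per_machine + sw_per_machine
                       + admin_per_machine_per_year * years)

scale_up  = total_cost(machines=2,  hw_per_machine=120_000, sw_per_machine=40_000,
                       admin_per_machine_per_year=5_000)
scale_out = total_cost(machines=50, hw_per_machine=3_000,   sw_per_machine=0,
                       admin_per_machine_per_year=2_000)
print(f"scale up : ${scale_up:,.0f}")
print(f"scale out: ${scale_out:,.0f}")
```

Depending on the assumed per-machine admin cost, either column can come out ahead, which is exactly the "hidden costs" point the comparison is meant to illustrate.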

November 6th, 2015: Thanks to a reader, I fixed "Big Irons" → "Big Iron"