There has been a lively discussion on the Sanger’s internal e-mail list, inspired by this extract from Free: The Future of a Radical Price by Chris Anderson, editor in chief of Wired. The take-home message of the extract is that companies and institutions are stuck in a 20th-century way of thinking about computing resources: we plan for scarcity, assume that memory or CPU time is limited, and enforce strict rules to keep usage down. But these days computing is cheap; what is scarce is human time, ingenuity, and exploration, and heavily restricting computing use wastes and stifles these things.
The specific issue raised is that the Sanger’s computing maintainers often e-mail us to tell us that our shared hard drives are getting full, and that we each need to clear out a few gigs. Someone pointed out that it takes a good half an hour to free up that space, which translates to a few dozen person-hours across the Institute, and that the end result, a few hundred gigs of space, could be purchased from Amazon for 50 quid or less. This, the argument goes, gets the cost-benefit analysis the wrong way around: twenty person-hours are worth far more than 200GB of disk space.
The reply, and resulting discussion, hammered out something that I could call a consensus ensemble of computing resource philosophies, if I wasn’t worried about alienating everyone I’ve ever loved.
How We Think About Memory
Those of us with minds mildly attuned to computing resources tend to think of them as a single processor attached to a chunk of RAM (fast access, a few gigs, short term) and a hard disk (slow access, a terabyte or so, long term). All of these things last essentially forever, or at least until you buy a newer computer every few years. Programs are written to run on the processor, stick stuff in RAM and read/write to disk - the program is written and compiled on the computer, the data is there for processing, and the results stay on the computer at the end. We may encounter a computing cluster, which has lots of computers in it and requires special programs or code to handle, but we still think of it as a load of PCs, each with its own disk and RAM, connected together by cables.
But this model doesn’t really apply when you are doing massively parallel computing, on thousands of processors, with petabytes of memory. As I just said, the ‘thousands of processors’ part is something we can generally handle: we design programs that pass messages between processors, or structure our data into independent chunks that we can process separately. This is all pretty intuitive.
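To make that concrete, here is a toy sketch in Python (nothing Sanger-specific; the file name, chunk size, and GC-content calculation are invented for illustration): split the data into independent chunks and let a pool of worker processes chew through them in parallel.

    # A toy version of 'structure the data into independent chunks and
    # process them separately'. The input file and chunk size are invented.
    from multiprocessing import Pool

    def gc_fraction(chunk):
        """Fraction of G/C characters in one chunk of sequence."""
        if not chunk:
            return 0.0
        return sum(chunk.count(base) for base in "GCgc") / len(chunk)

    def chunks(path, size=1_000_000):
        """Yield fixed-size chunks of a text file."""
        with open(path) as handle:
            while True:
                block = handle.read(size)
                if not block:
                    break
                yield block

    if __name__ == "__main__":
        with Pool() as pool:
            # Each chunk is independent, so any free CPU can take the next one.
            results = pool.map(gc_fraction, chunks("reads.fa"))
        if results:
            print(sum(results) / len(results))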
However, memory is different. We have thousands of disks, each holding about a terabyte, but we still try to fit this into our heads as basically ‘a disk attached to a computer’. At the Sanger, we have a number of complex filesystems, like NFS and Lustre, which are designed to make the thousands of separate disks we have behave like one single, massive disk. These filesystems run on a load of special processors which monitor the disks, keep track of where everything is, and, most importantly, respond to individual computers as if they were one single hard disk sitting next to the computer; if I enter the directory ‘/lustre/scratch/’ I see one set of files, and if I ask to move a few of them, the filesystem shifts them around without me having to figure out which disk they are living on. People can access files from all across the Sanger’s massive databanks without seeming to do anything complicated.
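From our side it is just ordinary file handling. Something like this (the ‘aligned’ subdirectory and the ‘*.bam’ pattern are made up for the example) works without ever knowing, or caring, which physical disks hold the bytes:

    # Moving files around on /lustre/scratch with plain Python. To the script
    # this looks like one big local disk; Lustre decides (and may change)
    # where the data actually lives. The subdirectory and file pattern are
    # invented for this example.
    import shutil
    from pathlib import Path

    scratch = Path("/lustre/scratch")
    done = scratch / "aligned"
    done.mkdir(exist_ok=True)

    for bam in scratch.glob("*.bam"):
        shutil.move(str(bam), done / bam.name)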
The memory banks are something more than just a load of commodity disks from Amazon stuck together: you couldn’t connect 5000 commodity disks to a single computer; you need the file servers to keep track of everything. And even if you could, the unreliability of commodity disks means that one would fail every few days, losing all the data on it. Instead, we have expensive disks (about £500 a TB, I think) that are far more reliable, and they are connected together in RAID arrays: complex, redundant systems with lots of spare capacity that constantly move things around to make sure that if part of the system goes down, no data is lost. This is all expensive, and it becomes harder and harder to maintain as the system gets bigger and bigger; you need more reliable disks, more redundancy, and faster processors to stop data getting lost or taking ages to get hold of.
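The basic trick is easy to sketch, even though real RAID arrays do it in hardware and at vastly greater scale: store a parity block alongside the data blocks, and any single lost block can be rebuilt from the survivors. A toy version in Python, just to show the idea:

    # Toy illustration of RAID-style parity: XOR the data blocks together to
    # make a parity block, and any one missing block can be reconstructed by
    # XORing everything that survives. Real arrays add hot spares, rebuild
    # scheduling, and a lot more.
    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together, column by column."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"ACGTACGT", b"TTTTAAAA", b"GGGGCCCC"]   # blocks on three 'disks'
    parity = xor_blocks(data)                         # stored on a fourth 'disk'

    # Suppose the second 'disk' fails: rebuild its block from the others.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]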
How We Should Think About Memory
But we need more memory. Within five years we will be regularly dealing with thousands of genomes; the rate at which we produce DNA sequence grows about ten-fold every year. The question that was bouncing around the genome campus was: where do we go from here?
The systems elves at the Sanger have a policy that is usually characterised as ‘clean up after yourself’, but is actually pretty nuanced. The idea is the ‘scratch/archive’ approach; we have long-term archive space, which is robust but slow, and scratch space, which is fast to read and write, but not necessarily as stable. Whenever you want to process something, you move whatever you need from archive to scratch. You then run your programs there, which produce some files. Finally, you move just the files you want to store back to the archive, and the scratch data is all wiped. Nothing on scratch is used for storage, only for running things on. People often don’t keep to this policy, and this results in the ‘you need to clean up scratch103!’ e-mails that started this conversation in the first place.
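Written out as a script, the discipline looks something like the sketch below; the paths and the analysis command are hypothetical, but the shape of it (stage the input onto scratch, compute, copy the results back, wipe scratch) is the point.

    # A sketch of the scratch/archive workflow. ARCHIVE, SCRATCH and
    # 'my_aligner' are placeholders, not real Sanger paths or tools.
    import shutil
    import subprocess
    from pathlib import Path

    ARCHIVE = Path("/archive/myproject")      # robust but slow
    SCRATCH = Path("/lustre/scratch/myrun")   # fast, but not for storage

    SCRATCH.mkdir(parents=True, exist_ok=True)

    # 1. Stage the input from archive onto scratch.
    shutil.copy(ARCHIVE / "reads.fa", SCRATCH / "reads.fa")

    # 2. Run the analysis against the scratch copy (illustrative command).
    subprocess.run(
        ["my_aligner", str(SCRATCH / "reads.fa"), "-o", str(SCRATCH / "out.bam")],
        check=True,
    )

    # 3. Move only the results worth keeping back to the archive.
    shutil.move(str(SCRATCH / "out.bam"), str(ARCHIVE / "out.bam"))

    # 4. Clean up after yourself, so nobody has to send the scratch103 e-mail.
    shutil.rmtree(SCRATCH)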
A more extreme way of dealing with this is embodied by Google’s approach, implemented in the Google File System (GFS) and MapReduce framework, and more recently in the open-source system Hadoop, which we are currently testing out at the Sanger.
These systems don’t view memory as one big disk. They treat it as a network of connected drives and processors. Like RAID, they go in for a form of redundancy, but to a much greater extent, with every file stored on two or more different disks. When you write code for a Hadoop system, you explicitly write operations that copy files between disks, and then run various processes in different places in the network. The system then makes sure you get the storage you need, and makes sure that your processes run on processors near the files they need. This creates a cluster that treats disk space and files like processors: as resources to be allocated and carefully managed. This is also why files are kept in multiple locations; if two people want to access them, they won’t tread on each other’s resources.
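To give a flavour of what that code looks like, here is the canonical Hadoop Streaming word count (deliberately generic, not anything we actually run at the Sanger; the jar name and paths vary between installations). You explicitly copy the data into the distributed filesystem, submit the job, and Hadoop tries to run your mapper and reducer on nodes that already hold the relevant blocks.

    # Two separate scripts for Hadoop Streaming. The job is submitted with
    # something roughly like (details depend on the installation):
    #
    #   hadoop fs -put reads.txt input/
    #   hadoop jar hadoop-streaming.jar \
    #       -files mapper.py,reducer.py \
    #       -mapper mapper.py -reducer reducer.py \
    #       -input input/ -output output/

    # --- mapper.py: emit one "word<TAB>1" line per word on stdin ---
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # --- reducer.py: Hadoop sorts the mapper output by key, so all the
    # --- counts for a given word arrive together and can be summed ---
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")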
One of the big differences in this view is that individual files aren’t particularly protected: copies are kept all over the place, and if one copy goes down, there is another somewhere else. If all the copies go down, then somewhere there should be a process for regenerating them. Google uses cheap-as-chips commodity disks and accepts the high failure rate as worth the price cut.
New Thinking
I’ll wrap this up now. I think the real point of all this is that we have to change the way we think of disk space: not as one massive lump of memory, but as a parallel resource that is scattered all over the Institute, or all over the world. The real source of our memory problem is that people want to treat the Sanger’s memory resources, which are measured in petabytes, as if they were a desktop computer. Whether we think of it as scratch versus archive space, or as a fully parallel resource, we need to treat it as something fundamentally new.
I think I actually understood most of that! Over the last three weeks I’ve been doing some software and hardware programming, which has given me a much better understanding of what’s actually going on inside computers and how it all links up (especially the hardware programming; we got to play with Arduino boards). I’m still not sure what my computer means when it tells me it’s running low on “virtual memory” though…
Virtual memory is a way of combining RAM and disk space, basically letting a program use more RAM than is physically available by shunting stuff off to disk. If you are out of virtual memory, it means you need to reallocate some hard disk space to make the virtual memory bigger.
Technically, what you refer to should be called storage and not memory.
Also, after great thought, I have come to the conclusion that people need to think of storage like a nonrenewable resource. Even if you have a slow CPU, you will be able to perform an arbitrary number of operations given enough time. But if you use up a GB of HDD space, that GB is gone forever.