Google File System (GFS) is a Distributed File System (DFS) developed by Google Inc. to handle the enormous amounts of data they gather. GFS runs on commodity hardware, a cost-effective approach we also discussed in our blog on the MapReduce framework, and achieves reliability through software rather than specialized machines. Google no longer uses this original GFS, but at the time it was an effective solution for the huge volumes of data generated by their search engine.
Design Assumptions
Building a file system architecture that addresses the problems of big data processing requires several radical as well as traditional design decisions. GFS shares many goals with previous distributed file systems, such as performance, scalability, reliability, and availability. Let us walk through some of the design assumptions the Google engineers focused on while building this new architecture.
The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
The system stores a modest number of large files, expecting a few million files, each typically 100 MB or larger. Multi-GB files are the common case and should be managed efficiently. While small files must be supported, optimization for them is not a priority.
The workloads consist primarily of two kinds of reads: large streaming reads and small random reads. Large streaming reads typically fetch hundreds of KBs or more per operation, while small random reads fetch only a few KBs. Once written, files are seldom modified.
The system must efficiently handle concurrent append operations from multiple clients to the same file, with hundreds of producers appending data concurrently.
The system prioritizes high sustained bandwidth over low latency. Most target applications emphasize processing data in bulk at a high rate, with less emphasis on individual read or write response times.
The general idea is to divide the data into fixed-size chunks and store them across a cluster of commodity machines. That single sentence raises plenty of questions. What happens if these commodity machines fail? How do we divide the data into chunks? What is the chunk size? The GFS design engineers faced the same questions, and here is how they answered them.
Architecture
A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients. Note that these chunk servers are commodity-level Linux machines.
Each file is divided into fixed-size chunks. Each chunk is assigned a 'chunk handle', an immutable and globally unique 64-bit identifier created by the master at the time of chunk creation. The chunk servers store chunks on their local disks and, whenever they need to read or write data, locate it using the chunk handle and the byte range for the operation.
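To make this concrete, here is a minimal sketch of the idea in Python. The names (`ChunkServer`, `new_chunk_handle`) and the storage layout are hypothetical illustrations, not GFS's actual implementation; the real system adds chunk versions, checksums, and leases on top of this.

```python
import os
import secrets

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size used by GFS


def new_chunk_handle() -> int:
    """Hypothetical: the master mints an immutable, globally unique 64-bit handle."""
    return secrets.randbits(64)


class ChunkServer:
    """Illustrative chunk server that stores each chunk as a file on local disk."""

    def __init__(self, storage_dir: str):
        self.storage_dir = storage_dir
        os.makedirs(storage_dir, exist_ok=True)

    def _path(self, handle: int) -> str:
        # Chunks are addressed purely by their 64-bit handle.
        return os.path.join(self.storage_dir, f"{handle:016x}.chunk")

    def write(self, handle: int, offset: int, data: bytes) -> None:
        # The real system also tracks chunk versions and replica leases;
        # this sketch only writes the bytes at the requested offset.
        path = self._path(handle)
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(offset)
            f.write(data)

    def read(self, handle: int, offset: int, length: int) -> bytes:
        # The client specifies a chunk handle and a byte range within the chunk.
        with open(self._path(handle), "rb") as f:
            f.seek(offset)
            return f.read(length)
```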
What about fault tolerance? What if a chunk server fails, which is very common since they are commodity-level hardware? To address this, each chunk is replicated across several chunk servers, so if one fails the others can still serve the data. By default, the system creates three replicas and stores them on different chunk servers.
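As a rough sketch of the replication idea, the snippet below picks three distinct chunk servers for a new chunk. The placement policy here is invented purely for illustration; the real master also weighs disk utilization and spreads replicas across racks.

```python
import random

REPLICATION_FACTOR = 3  # GFS default


def choose_replica_servers(chunkservers: list[str],
                           n: int = REPLICATION_FACTOR) -> list[str]:
    """Pick n distinct chunk servers to hold copies of a new chunk.

    Illustrative only: the real master balances disk usage and rack placement
    rather than choosing uniformly at random.
    """
    if len(chunkservers) < n:
        raise ValueError("not enough chunk servers for the replication factor")
    return random.sample(chunkservers, n)


# Example: place a new chunk on 3 of 5 available servers.
print(choose_replica_servers(["cs1", "cs2", "cs3", "cs4", "cs5"]))
```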
The master maintains all of the file system's metadata. Its responsibilities can be broken down as follows (a minimal sketch of these metadata structures follows the list).
Namespace Management: The master maintains the namespace, which is the hierarchy of directories and files within the file system. It keeps track of file names, directory structures, and their corresponding identifiers.
Access Control Information: Access control information defines permissions and privileges associated with files and directories, determining which users or processes can perform specific actions such as reading, writing, or executing files.
Mapping from Files to Chunks: The master needs to know where the chunks of a particular file are situated across the chunkservers. This mapping allows the file system to locate and access the individual chunks constituting a file, facilitating data retrieval and storage operations.
Garbage Collection of Orphaned Chunks: The master identifies and removes orphaned chunks, which are no longer associated with any file or are redundant copies. Garbage collection helps reclaim storage space and maintain system efficiency.
Chunk Migration Between Chunk Servers: The master coordinates the migration of chunks between chunk servers to balance the system's load, optimize data placement, and ensure data availability and reliability.
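Here is a simplified sketch of the in-memory metadata the master might keep to support these responsibilities. The class and field names are hypothetical; the real master also persists an operation log and handles namespace locking, leases, and chunk versions.

```python
from dataclasses import dataclass, field


@dataclass
class MasterMetadata:
    """Illustrative in-memory metadata kept by the master (simplified)."""

    # Namespace: full file path -> ordered list of chunk handles making up the file.
    file_to_chunks: dict[str, list[int]] = field(default_factory=dict)

    # Chunk handle -> chunk servers currently holding a replica.
    chunk_locations: dict[int, list[str]] = field(default_factory=dict)

    # Path -> access control info (represented here as a simple permission string).
    permissions: dict[str, str] = field(default_factory=dict)

    def chunks_of(self, path: str) -> list[int]:
        """Mapping from files to chunks."""
        return self.file_to_chunks[path]

    def orphaned_chunks(self) -> set[int]:
        """Garbage collection: chunks with replicas but no referencing file."""
        referenced = {h for handles in self.file_to_chunks.values() for h in handles}
        return set(self.chunk_locations) - referenced
```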
So the master holds all of the file system's metadata, but no file data ever flows through it. This is a crucial design point to notice. If the master routed data through itself, it would quickly become a bottleneck, and the core principle of moving the processing to the data rather than moving the data to the processor would be defeated. Clients never read or write file data through the master. Keep that in mind.
How does a read operation work?
Remember that a file is divided into fixed-size data units called 'chunks'. When a read operation needs to be executed, the client translates the filename and byte range specified by the application into a chunk index. The chunk index refers to the position of a chunk within a file. The client calculates the chunk index based on the fixed chunk size and the byte offset provided. It determines which chunk within the file contains the requested data. This calculation involves dividing the byte offset by the chunk size and obtaining the integer part to determine the chunk index.
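In code, this translation is just integer division of the byte offset by the chunk size. The function names below are hypothetical, and the sketch assumes the 64 MB chunk size discussed later in the article.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


def byte_offset_to_chunk_index(offset: int) -> int:
    """Which chunk of the file contains this byte offset."""
    return offset // CHUNK_SIZE


def byte_range_to_chunk_indexes(offset: int, length: int) -> range:
    """All chunk indexes spanned by a read of `length` bytes starting at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)


# Example: a 1 MB read starting at byte 130,000,000 falls entirely within chunk 1.
print(byte_offset_to_chunk_index(130_000_000))                    # -> 1
print(list(byte_range_to_chunk_indexes(130_000_000, 1_000_000)))  # -> [1]
```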
The client then sends the file name and chunk index to the master, which replies with the corresponding chunk handle and the locations of the replicas of the requested chunks. Recall that a chunk handle is a 64-bit identifier carrying information about a chunk.
So now the client knows which chunk servers hold the requested chunks. Notably, the client caches this information from the master, keyed by the file name and chunk index. The client then sends a request to one of the replicas and retrieves the data. Until the cached information expires or the file is reopened, the client can read further data from the same chunk without contacting the master again. In most cases, the client asks for several chunks in a single request, keeping client-master interactions to a minimum.
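Putting the pieces together, a hedged sketch of the client-side read path might look like the following. The `master.lookup` and chunk server `read` calls are hypothetical stand-ins for the actual RPCs, and the cache expiry value is invented for illustration.

```python
import random
import time

CHUNK_SIZE = 64 * 1024 * 1024
CACHE_TTL_SECONDS = 60  # illustrative expiry for cached chunk locations


class GFSClient:
    """Illustrative client: caches (file name, chunk index) -> (handle, replicas)."""

    def __init__(self, master, chunkservers):
        self.master = master              # hypothetical master RPC stub
        self.chunkservers = chunkservers  # hypothetical name -> chunk server stub map
        self.cache = {}                   # (path, index) -> (handle, replicas, fetched_at)

    def _locate(self, path: str, chunk_index: int):
        key = (path, chunk_index)
        cached = self.cache.get(key)
        if cached and time.time() - cached[2] < CACHE_TTL_SECONDS:
            return cached[0], cached[1]    # cache hit: no master interaction needed
        handle, replicas = self.master.lookup(path, chunk_index)  # one master round trip
        self.cache[key] = (handle, replicas, time.time())
        return handle, replicas

    def read(self, path: str, offset: int, length: int) -> bytes:
        # Sketch assumes the requested range fits within a single chunk.
        chunk_index = offset // CHUNK_SIZE
        handle, replicas = self._locate(path, chunk_index)
        # Data flows directly from a chunk server replica, never through the master.
        replica = random.choice(replicas)
        return self.chunkservers[replica].read(handle, offset % CHUNK_SIZE, length)
```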
Chunks
Chunks being the building blocks of the Google File System, they deserve a closer look. GFS sets the chunk size at 64 MB, which is much larger than a typical file system block. Several advantages come bundled with choosing a larger chunk size:
Since sequential reads and writes from a client tend to fall within the same chunk, a large chunk size reduces the number of client-master interactions.
It reduces the amount of metadata stored on the master. A larger chunk size means fewer chunks per file, and fewer chunks mean less metadata (see the short calculation after this list).
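A quick back-of-the-envelope calculation illustrates the second point; the 4 MB figure below is just an arbitrary smaller block size chosen for comparison.

```python
def chunks_needed(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of chunks required to store a file of the given size."""
    return -(-file_size_bytes // chunk_size_bytes)  # ceiling division


ONE_TB = 10**12
MB = 1024 * 1024

# The same 1 TB file needs ~16x fewer chunks with 64 MB chunks than with 4 MB
# chunks, and therefore roughly 16x less chunk metadata on the master.
print(chunks_needed(ONE_TB, 64 * MB))  # -> 14,902 chunks
print(chunks_needed(ONE_TB, 4 * MB))   # -> 238,419 chunks
```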
On the other hand, a larger chunk size has a disadvantage: a small file may consist of only a single chunk, and if many clients need to access that file, the chunk servers holding it become hot spots. GFS did experience such hot spots when it was first used.
I will wrap up here, since this article has already run longer than I expected. The information above should be enough to give you a broad view of the Google File System. After 2010, GFS was replaced by Colossus, Google's next-generation approach to large-scale data storage. I hope to bring a detailed description of it in the future too. Till then, stay curious.