This is Part 2 in a series about distributed file systems.
Read Part 1 to learn more about the differences between distributed file systems and object storage.
Before we explore the different flavors of distributed file systems, here’s a quick recap of my last post just to make sure we’re all on the same page.
Industry research shows that the enterprise storage market is undergoing a major shift towards distributed file systems and object storage as enterprises look for efficient ways to cope with the explosion of unstructured data. Both approaches are scale-out designs: enterprises add nodes to grow performance and capacity roughly linearly, in a cost-effective manner.
There are three fundamental differences between distributed file systems and object storage:
- Arrangement – Files are arranged in a hierarchy of folders, while object storage arranges objects in flat buckets.
- Update semantics – File systems allow for random writes anywhere in the file, while object storage only allows atomic replacement of entire objects.
- Consistency model – Object storage typically offers eventual consistency, while distributed file systems can offer strong or eventual consistency, depending on the vendor.
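To make the first two differences concrete, here is a minimal in-memory sketch. The names `FileLike`, `Bucket`, and `patch_object` are hypothetical and chosen for illustration only; they do not correspond to any vendor's API. The sketch shows a file being patched in place at an offset, while changing the same bytes in an object means reading the whole object, modifying it, and replacing it.

```python
import io


class FileLike:
    """File semantics: bytes can be overwritten in place at any offset."""

    def __init__(self, data=b""):
        self._buf = io.BytesIO(data)

    def write_at(self, offset, chunk):
        # Random write: only the touched byte range changes.
        self._buf.seek(offset)
        self._buf.write(chunk)

    def read(self):
        return self._buf.getvalue()


class Bucket:
    """Object semantics: a flat key space where a put replaces the whole object."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # The previous object under this key is replaced in one step.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]


def patch_object(bucket, key, offset, chunk):
    """Emulate an in-place edit on an object store: read, modify, rewrite everything."""
    data = bytearray(bucket.get(key))
    data[offset:offset + len(chunk)] = chunk
    bucket.put(key, data)
```

Note that a key such as `"reports/2024/q1.txt"` only looks like a path. In this sketch, as in a flat bucket, the slashes are simply part of the key and there are no folders to traverse.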
The CAP Theorem and Distributed File Systems
Not all distributed file systems are created equal, and the reason for this is firmly rooted in computer science theory. The CAP Theorem states that a distributed data store can guarantee at most two of the following three properties at the same time:
- Consistency: Every read receives the most recent write or an error
- Availability: Every request receives a (non-error) response – without the guarantee that it contains the most recent write
- Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Because no system can guarantee all three, vendors have to choose which properties to prioritize. This trade-off gives rise to the two flavors of distributed file systems on the market today:
Clustered Distributed File System
Consisting of a strongly coupled cluster of nodes, a Clustered Distributed File System (DFS) is geared towards strict data consistency and is especially suitable for large-scale computing use cases (e.g., big data analytics) at the enterprise core.
Clustered DFS focuses on the Consistency and Availability properties of the CAP theorem, on the assumption that the nodes share a reliable, low-latency network where partitions are rare. Strong consistency guarantees do not come without a price: every update has to be coordinated across nodes before it is acknowledged, which creates fundamental limitations on system operation and performance, particularly when the nodes are separated by high-latency or unreliable links. Examples of Clustered DFS include products like Dell EMC Isilon and IBM Spectrum Scale.
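The price of strong consistency can be illustrated with a small sketch of majority-quorum replication. This is a generic textbook technique that I am using for illustration, not a description of how Isilon, Spectrum Scale, or any other product works internally. The names `Node`, `ClusteredStore`, and `QuorumUnavailable` are hypothetical, and the model is deliberately simplified: it is single-threaded, and a node is either reachable by everyone or by no one.

```python
class QuorumUnavailable(Exception):
    """Raised when fewer than a majority of nodes can be reached."""


class Node:
    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.data = {}  # path -> (version, content)


class ClusteredStore:
    """Toy strongly consistent store: every operation needs a majority of nodes."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.quorum = len(nodes) // 2 + 1

    def _live_quorum(self):
        live = [n for n in self.nodes if n.reachable]
        if len(live) < self.quorum:
            # Refuse to answer rather than risk returning stale data.
            raise QuorumUnavailable(
                f"{len(live)} of {len(self.nodes)} nodes reachable, "
                f"need {self.quorum}"
            )
        return live

    def write(self, path, content):
        live = self._live_quorum()
        # Any two majorities overlap, so the highest version is always visible.
        version = max(n.data.get(path, (0, None))[0] for n in live) + 1
        for node in live:
            node.data[path] = (version, content)

    def read(self, path):
        live = self._live_quorum()
        newest = max(
            (n.data.get(path, (0, None)) for n in live),
            key=lambda entry: entry[0],
        )
        return newest[1]
```

With three nodes, the store keeps returning the latest write while any two of them are reachable. Once only one node is left, both `read` and `write` raise `QuorumUnavailable`: consistency is preserved by giving up availability, which is exactly the kind of limitation that becomes painful over unreliable links.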
Federated Distributed File System
Federated Distributed File Systems are focused on making data available over long distances with partition tolerance. As such, Federated DFS is well-suited for weakly coupled edge-to-cloud use cases such as unstructured data storage and management for remote offices. Federated DFS focuses on the Availability and Partition tolerance properties of the CAP theorem and trades away the strict consistency guarantee.
In a Federated DFS, read and write operations on an open file are directed to a locally cached copy. When a modified file is closed, the changed portions are copied back from the edge to a central file service. Because two edge locations can modify the same file before either has synchronized, update conflicts may occur, and the system needs to resolve them automatically. It could be argued that Federated DFS combines the semantics of a file system with the eventual-consistency model of object storage.
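The open, edit locally, write back on close cycle can be sketched as follows. Everything here is an assumption made for illustration: the class names `CentralService` and `EdgeFiler`, the version counter used to detect conflicts, and the "keep both copies" resolution policy are hypothetical, and real products may resolve conflicts differently. For brevity, the sketch also uploads the whole file on close rather than only the changed portions.

```python
class CentralService:
    """Toy central file service that tracks one version number per path."""

    def __init__(self):
        self.files = {}  # path -> (version, content)

    def fetch(self, path):
        return self.files.get(path, (0, b""))

    def commit(self, path, base_version, content, origin):
        current_version, _ = self.fetch(path)
        if current_version != base_version:
            # Another edge committed first. Assumed policy: keep both copies
            # by storing the late edit next to the original.
            conflict_path = f"{path}.conflict-{origin}"
            self.files[conflict_path] = (1, content)
            return conflict_path
        self.files[path] = (current_version + 1, content)
        return path


class EdgeFiler:
    """Toy edge node: reads and writes hit a local copy until the file is closed."""

    def __init__(self, name, central):
        self.name = name
        self.central = central
        self.open_files = {}  # path -> {"base", "content", "dirty"}

    def open(self, path):
        version, content = self.central.fetch(path)
        self.open_files[path] = {
            "base": version,
            "content": bytearray(content),
            "dirty": False,
        }

    def write(self, path, offset, chunk):
        entry = self.open_files[path]
        entry["content"][offset:offset + len(chunk)] = chunk
        entry["dirty"] = True

    def read(self, path):
        return bytes(self.open_files[path]["content"])

    def close(self, path):
        entry = self.open_files.pop(path)
        if not entry["dirty"]:
            return path  # nothing to write back
        return self.central.commit(
            path, entry["base"], bytes(entry["content"]), self.name
        )
```

If two edge filers open the same file, edit it, and close it one after the other, the first close lands under the original path and the second is stored under a conflict path such as `/doc.conflict-lon`. Neither site was ever blocked from working, which is the availability benefit, and the central service converges once both have closed, which is the eventual-consistency part.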
The following comparison table sums it all up:
|Clustered DFS|Federated DFS|
|---|---|
|Strongly consistent|Partition tolerant, eventually consistent|
|Deployed in the core|Deployed at edge and core|
|Strongly coupled nodes|Weakly coupled edge nodes|
|Ideal for high performance computing (HPC), databases, and analytics|Ideal for archiving, backup, media libraries, mobile data access, content distribution to edge locations, content ingestion from edge to cloud, remote office/branch office (ROBO) storage, and hybrid cloud storage|
Clustered DFS and Federated DFS both have their places in the enterprise. To maximize benefits from a distributed file system, enterprises need to understand the differences between the two flavors and choose the option that best meets their application needs.