A Brief Introduction to HDFS
Have you got lots of data that you need to store and process in a fault-tolerant manner? There's a good chance you're considering HDFS as one of your options. So just what is HDFS?
The Hadoop Distributed Filesystem is designed to run on commodity hardware and is highly fault tolerant by design, so users don't lose their data if a server fails. HDFS also gives users access to files regardless of which server actually stores them, so there's no need to log in to a specific server to reach a file.
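To give a feel for what that looks like from a client's point of view, here is a minimal Java sketch that reads a file through Hadoop's FileSystem API. The NameNode address and file path are hypothetical; in a real deployment the address would come from your cluster's core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            // The client asks the NameNode to resolve the path, then streams
            // the blocks from whichever DataNodes hold them.
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }

Note the client never needs to know which machines the blocks live on; that resolution happens behind the fs.open() call.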
What runs an HDFS cluster?
Clearly, to run an HDFS cluster you need servers, but just what runs on those servers to keep all your data in sync?
NameNodes are the bosses of the filesystem. They manage the filesystem namespace and metadata, coordinate client access to files, and make sure security policies are adhered to. They run on commodity hardware, and you will usually have one per cluster, along with a hot standby or some other form of failover.
DataNodes hold the data; they are the workers and do the heavy lifting. Also running on commodity hardware, you run as many as your data requires. HDFS usually replicates your data three times, but this is configurable. Block size should also be taken into account: by default it is 64MB, and lots of small files become a problem, because while a small file doesn't consume a full block on disk, the NameNode still has to track a block's worth of metadata for every file, so millions of tiny files can exhaust its memory.
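The cluster-wide defaults for both settings live in hdfs-site.xml (dfs.replication for the replica count, and dfs.blocksize for the block size, named dfs.block.size on older releases), but they can also be overridden per file at write time. Here is a sketch of doing that through the Java API, again with a hypothetical NameNode address and path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteTuned {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

            try (FileSystem fs = FileSystem.get(conf)) {
                // create(path, overwrite, bufferSize, replication, blockSize):
                // two replicas instead of the default three, and a 128MB block size.
                short replication = 2;
                long blockSize = 128L * 1024 * 1024;
                try (FSDataOutputStream out =
                         fs.create(new Path("/data/output.bin"), true, 4096,
                                   replication, blockSize)) {
                    out.writeUTF("hello hdfs");
                }
            }
        }
    }

Dropping replication to two saves a third of the storage at the cost of some resilience, which can be a reasonable trade for data you can regenerate.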
What does HDFS offer?
Fault detection - Should a machine fail, its blocks are re-replicated amongst the rest of the cluster, so the data continues to be stored in a fault-tolerant fashion.
Scale - You scale out by adding nodes; it's not uncommon to have hundreds, if not thousands, of nodes in a cluster helping applications store their data effectively.
Effective processing - HDFS tells clients where each block of a file physically lives, so processing can be scheduled as close to the data as possible, rather than moving the data elsewhere prior to processing (see the sketch below).
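That locality information is exposed directly through the FileSystem API: you can ask which hosts hold each block of a file, which is exactly the information frameworks like MapReduce use to place tasks next to the data. A minimal sketch, once more with a hypothetical address and path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

            try (FileSystem fs = FileSystem.get(conf)) {
                FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
                // Ask the NameNode which DataNodes hold each block of the file.
                BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation block : blocks) {
                    System.out.printf("offset %d, length %d, hosts %s%n",
                            block.getOffset(), block.getLength(),
                            String.join(", ", block.getHosts()));
                }
            }
        }
    }

A scheduler armed with this mapping can run each task on (or near) a host that already holds the block it needs, keeping the bulk of reads off the network.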