When it comes to really big data, is commodity hardware for storage really the best option?

Commodity hardware is cheap, right? Well, yes, but when it comes to petabytes of data, it becomes more expensive.

Let’s think about how many servers you need to store 1 petabyte of data. It’s simple: you need 3 PB of raw storage, because HDFS, the native filesystem for Big Data frameworks, creates three copies of your data and spreads those pieces across the cluster randomly.

How many server nodes do you need? The biggest 3.5” NL-SAS HDD you can find nowadays is 12TB (actually 10.91 TiB), the biggest 2.5” SAS HDD is 2.4TB (2.18 TiB), and the biggest 2.5” SSD out there is 32TB (more like 30 TiB), but not all servers support that, and the nearest commonly supported drive is 1.6TB (1.46 TiB). So the 2.5” SSD has the most compact data footprint and is the most performant, but also the most expensive.

To get 1PiB of usable storage with HDFS, we will need 3PiB of raw capacity, which works out to (sorted from the highest to the lowest number of drives; a quick sketch of this arithmetic follows the list):

  • 3PiB/1.46TiB = 2104 drives with 1.6TB SSD
  • 3PiB/2.18TiB = 1409 drives with 2.4TB SAS
  • 3PiB/5.46TiB = 563 drives with 6TB NL-SAS
  • 3PiB/9.09TiB = 338 drives with 10TB SSD
  • 3PiB/10.91TiB = 281 drives with 12TB NL-SAS
  • 3PiB/30TiB = 102 drives with 32TB SSD (Not supported by most servers yet)
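
A quick back-of-the-envelope sketch of that arithmetic (my own helper, not a tool from any vendor), converting marketing TB to TiB and rounding up:

```python
# Sketch of the drive-count math above: convert decimal TB (10**12 bytes) to binary TiB
# (2**40 bytes), then divide the 3 PiB raw requirement by the usable size of each drive.
# Exact conversion plus ceil() makes a few counts differ slightly from the rounded
# figures in the list, which round the TiB values first.
from math import ceil

TIB = 2**40
RAW_NEEDED_TIB = 3 * 1024          # 3 PiB of raw capacity for 1 PiB usable at replication factor 3

drives_tb = {                      # marketing capacity in decimal TB
    "1.6TB SSD": 1.6,
    "2.4TB SAS": 2.4,
    "6TB NL-SAS": 6.0,
    "10TB SSD": 10.0,
    "12TB NL-SAS": 12.0,
    "32TB SSD": 32.0,
}

for name, tb in drives_tb.items():
    tib = tb * 10**12 / TIB                    # e.g. 12 TB -> ~10.91 TiB
    print(f"{name}: {tib:.2f} TiB -> {ceil(RAW_NEEDED_TIB / tib)} drives")
```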

How many 2.5” drives can you put into a rack server? About ten drives into a 1U rack server, or up to 24-26 drives into a 2U. With NL-SAS LFF drives, you can put a maximum of 12 drives into a 2U rack server. Having 10-26 SSD drives per server is a good way to fully utilize the performance potential of SSDs.

  1. In this case, you’ll need either 2104/26 ≈ 81 (2U) servers with 1.6TB SSDs or 1409/26 ≈ 54 (2U) servers with 2.4TB SAS SFF drives, and you might end up with too many servers & more computing power than you actually need in your Big Data server farm. Moreover, when it comes to more than 20 nodes, you usually also need more than a couple of network switches.
  2. Alternatively, you might need 281/12 ≈ 23 (2U) servers with 12TB NL-SAS or 563/12 ≈ 47 servers with 6TB NL-SAS LFF drives, and that number might not give you enough computing power, or, on the contrary, be too much for you (see the sketch after this list).
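
Here is the same server math as a tiny sketch (my own helper names), simply dividing the drive count by the number of drive slots per 2U chassis and rounding up; ceil() makes a couple of figures come out one higher than the rounded-down numbers above:

```python
# Servers needed = drives / slots per server, rounded up.
from math import ceil

def servers_needed(total_drives: int, slots_per_server: int) -> int:
    return ceil(total_drives / slots_per_server)

print(servers_needed(2104, 26))   # 1.6TB SSD, 26 SFF slots per 2U server
print(servers_needed(1409, 26))   # 2.4TB SAS, 26 SFF slots per 2U server
print(servers_needed(281, 12))    # 12TB NL-SAS, 12 LFF slots per 2U server
print(servers_needed(563, 12))    # 6TB NL-SAS, 12 LFF slots per 2U server
```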

And let me remind you that the drives used here are the best-case scenario at the time this article was written, since the biggest drives usually do not have the best $/TB price. Therefore, in real Big Data clusters you will normally find smaller drives, so the number of drives will be higher than in this article, and you will need even more servers.

There are high-density servers like:

  • Cisco S3260, which can contain up to 56 3.5” LFF drives (the maximum supported capacities are 6TB for NL-SAS and 10TB for SSD)
  • Alternatively, HPE disk enclosures, which can be attached to a server and hold 96 LFF or 200 SFF drives

However, if you put SSDs into a server with 56 drive slots, then with 32TB SSDs you theoretically need only 2 servers (102/56), but the practical minimum is three, and those three nodes might not have enough CPU & RAM to run your tasks; besides, the majority of servers still do not support 32TB drives. With 1.6TB SSDs, 38 servers (2104/56) might be a good ratio to utilize the full potential of the SSDs, but could be too much computing power for your Big Data farm. With only 6 servers (338/56) and 10TB SSDs, you will not be able to utilize the full potential of the drives themselves. And with 6TB NL-SAS drives (the maximum this chassis supports), you’d need about 10 servers (563/56), which still might not give you enough CPU & RAM for your Big Data cluster, and you get an extremely slow storage subsystem.

If you put SSDs into a disk enclosure with 200 SFF slots, then with 32TB SSDs you theoretically need only one enclosure, but the minimum is still three servers, and that number might not have enough CPU & RAM to run your tasks and fully utilize SSD performance; again, the majority of servers still do not support 32TB drives. With 1.6TB SSDs, 10 servers (2104/200) are also not enough to utilize the full potential of the SSDs themselves. Needless to say, the situation gets even worse for SSD performance utilization in the case of only 3 servers (338/200 ≈ 2, but 3 is the minimum) with 10TB SSDs, and 3 servers might not give you enough computing power. And with 12TB NL-SAS drives in the 96-slot LFF enclosures, there might not be enough CPU & RAM for your Big Data cluster if you have only 3 (281/96) servers, and you get an extremely slow storage subsystem.
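
The high-density cases follow the same pattern, with one extra wrinkle: the practical minimum of three nodes. A minimal sketch, assuming that three-node floor:

```python
# Nodes needed for high-density chassis: drives / slots, rounded up,
# but never below the assumed three-node minimum for an HDFS cluster.
from math import ceil

MIN_NODES = 3   # assumed practical minimum with replication factor 3

def nodes(drives: int, slots: int) -> int:
    return max(MIN_NODES, ceil(drives / slots))

print(nodes(102, 56))    # 32TB SSD, 56-slot chassis: 2 needed, clamped to 3
print(nodes(2104, 56))   # 1.6TB SSD, 56-slot chassis: 38
print(nodes(338, 56))    # 10TB SSD, 56-slot chassis: ceil() gives 7 (the text rounds down to 6)
print(nodes(338, 200))   # 10TB SSD, 200-slot SFF enclosure: 2 needed, clamped to 3
print(nodes(281, 96))    # 12TB NL-SAS, 96-slot LFF enclosure: 3
```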

Do you see how the storage medium and capacity dictate the shape of your Big Data server farm?

The idea of separating compute from storage comes naturally as a way to make Big Data more flexible.

Additional HDFS overheads

When your strategy is to reduce costs as much as you can, you might choose slow NL-SAS drives & high-density servers, and obviously you will try to choose a server that can hold a lot of CPU & memory. In this case, when it comes to expanding the cluster for storage, CPU, or memory, you will have to buy another big server with a lot of CPU, memory, and storage to keep the nodes in your HDFS farm more or less equal, whether you actually need those resources or not. In other words, high-density servers increase the granularity of your server farm expansion and force you to buy resources you might not need.

Also, 12TB or 6TB NL-SAS drives might seem like a good choice in terms of $/TB, but they consume far more electricity and are extremely slow compared to SSDs, so NL-SAS is not suitable for some workloads like Machine Learning & Deep Learning.

HDFS has a Checkpoint Node in its architecture, which copies metadata out of the NameNode master’s RAM hourly or even daily. This means that if the master (and the Backup Node, if you have one) collapses for any reason, you will lose all metadata changes made after the last checkpoint, even though your data is still there.

Probabilities

There is another, most annoying, thing that comes from the HDFS architecture. When you have 23 or 81 servers, HDFS creates three copies of your data and throws them onto the nodes of the cluster randomly. What does it mean that the cluster stores data randomly? Let’s calculate the probability of finding a single piece of information on a given server: in the best-case scenario the probability is 3/23, and in the worst case it is 3/81.

Of course, your cluster will try to run your tasks on nodes that have (almost) all the required data as part of its data locality strategy, but what is the probability that you will have all the data your task needs on a single server? The more data pieces a given task needs, the less likely it is that all of them sit on one server, making the probability even lower than 3/23 (or 3/81). You might say the situation is not as bad as I am painting it, because you run more than one task on more than one server, which increases the chance of having your data locally. However, the problem is that files bigger than the HDFS block size (historically 64MB) are broken up into pieces (blocks) and stored separately across all the nodes in the cluster, so there might not even be a single node that stores all the pieces of a file, which lowers the probability of local data access even further. Moreover, if the server that has the data needed for your task is currently running another task and is fully loaded, while other servers are idle but do not have the required pieces of information, you get inefficient cluster resource utilization. In other words, the HDFS architecture increases the probability of node-to-node network traffic: the more nodes you have in a cluster and the bigger your files are, the higher it gets.
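
To make the intuition concrete, here is a toy estimate (my own simplification, ignoring HDFS’s rack-aware placement) of how quickly the chance of full data locality drops as the number of blocks grows:

```python
# With replication factor r and purely random placement across n nodes, the chance that a
# given block has a replica on one specific node is roughly r/n; treating block placements
# as independent, the chance that all k blocks a task needs sit on that node is (r/n)**k.
def p_all_local(nodes: int, blocks: int, replicas: int = 3) -> float:
    return (replicas / nodes) ** blocks

print(p_all_local(23, 1))    # ~0.13   -> the 3/23 best case from the text
print(p_all_local(81, 1))    # ~0.037  -> the 3/81 worst case
print(p_all_local(23, 10))   # ~1.4e-9 -> a 10-block file is almost never fully local
```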

The more nodes you have and the bigger your files are, the higher the probability of requesting data from other nodes, which increases cross-switch network traffic.

Three copies

Keeping three copies of the data is not efficient because of:

  1. Network Congestion
  2. High levels of IO over server system bus
  3. Poor disk space utilization

Data replication causes additional memory consumption on the servers, and memory problems are a large part of support calls. When a server degrades, data rebalancing kicks in and degrades performance; the cluster also performs rebalancing when one storage node runs low on free space.

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems, but it is a feature that needs a lot of tuning and experience.

NFS with Big Data

With NAS storage like NetApp FAS/AFF systems, you will have only about 35% space overhead, compared to the 200% overhead of HDFS (replication factor three), and you will be able to scale storage space and computing power separately, reducing unneeded switches & server resources and allowing customers to choose servers based purely on CPU & memory characteristics, eliminating storage from the consideration.
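
A quick comparison of the raw capacity behind 1 PiB of usable data, assuming “35% overhead” means raw = usable x 1.35 (my interpretation):

```python
# Raw capacity required for 1 PiB of usable data under each scheme.
USABLE_PIB = 1.0
hdfs_raw = USABLE_PIB * 3.00   # replication factor 3 -> 200% overhead -> 3.00 PiB raw
nas_raw  = USABLE_PIB * 1.35   # ~35% NAS overhead                     -> 1.35 PiB raw
print(hdfs_raw, nas_raw)
```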

Moreover, yes, 30TiB drives are supported in AFF systems. With only 24x 32TB SSDs in a NetApp AFF system you can get ~1PiB of effective space, assuming 2:1 data reduction, which gives an extremely small physical footprint and power consumption in the data center.
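
A rough sanity check of that ~1PiB figure, under the same assumption about how the 35% overhead is counted and taking 32TB as roughly 30TiB per drive:

```python
# 24 drives x ~30 TiB each, minus ~35% overhead, times 2:1 data reduction.
raw_tib   = 24 * 30            # ~720 TiB raw
usable    = raw_tib / 1.35     # ~533 TiB after overhead
effective = usable * 2         # ~1067 TiB ~= 1.04 PiB with 2:1 data reduction
print(effective / 1024, "PiB")
```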

A dedicated NAS aggregates all the drives into a single pool and can expand flexible volumes on the fly without the need for cluster rebalancing.

NetApp In-Place Analytics Module is a plugin that allows using NAS as primary or secondary storage for Big Data solutions.

Differentiators

NetApp ONTAP systems can replicate your data set to a secondary site for disaster recovery purposes and then replicate only new, changed blocks of information as deltas, which is essential for big data sets. NFS, unlike HDFS, allows modifying files when needed; on the other hand, if you need to make sure your golden image of data has not been modified, you can use thin clones (FlexClone) to guarantee nothing happens to your original data.

A unique feature like FabricPool allows utilizing SSD drives as primary storage for frequently accessed data and transparently destaging cold data to cheap S3-compatible object storage (and back), further reducing storage costs on the one hand while still using SSDs for hot, frequently accessed data on the other. Data reduction capabilities like deduplication can significantly reduce the data footprint without losing performance, even on the smallest systems.

When a single solution couples two characteristics (computing & storage), it will always lead to inefficiency & compromise.

Summary

When it comes to really big data, HDFS simply kills the solution because of the replication factor of three and its underlying architecture. Storage must be separated from the Big Data cluster to make it more flexible and, surprisingly, even cheaper than commodity hardware, especially when it comes to petabytes of data.