Question-49: Select the correct statements regarding Cloudera Private Cloud Base and disk setup.

  1. Cloudera does not support more than 200 TB per data node.
  2. Cloudera does not support drives larger than 8 TB.
  3. Running CDP DC on storage platforms other than direct-attached physical disks can provide suboptimal performance.
  4. Cloudera Runtime and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O.

Answer:

Exp: Hard drives today come in many sizes. Popular drive sizes are 1-4 TB, although larger drives are becoming more common. When picking a drive size, the following points need to be considered.

  • Lower Cost Per TB – The larger the drive, the cheaper the cost per TB, which makes for lower TCO.
  • Replication Storms – Larger drives mean that a drive failure produces a larger re-replication storm, which can take longer to complete and can saturate the network while impacting in-flight workloads.
  • Cluster Performance – In general, drive size has little impact on cluster performance. The exception is when drives have different read/write speeds and a use case that leverages this gain. MapReduce is designed for long sequential reads and writes, so latency timings are generally not as important. HBase can potentially benefit from faster drives, but that is dependent on a variety of factors, such as HBase access patterns and schema design; this also implies acquisition of more nodes. Impala and Cloudera Search workloads can also potentially benefit from faster drives, but for those applications the ideal architecture is to maintain as much data in memory as possible.
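The replication-storm point can be made concrete with a back-of-envelope estimate. The sketch below is illustrative only: the function name, the 70% fill level, and the 500 MB/s aggregate re-replication throughput are assumptions chosen for the example, not Cloudera figures. It only shows that, for a fixed throughput, recovery time grows linearly with drive size.

```python
def rereplication_hours(drive_tb, fill_fraction, throughput_mb_s):
    """Rough hours needed to re-replicate the data held on one failed drive.

    drive_tb        -- raw drive size in TB (decimal: 1 TB = 1,000,000 MB)
    fill_fraction   -- assumed share of the drive that holds data (0..1)
    throughput_mb_s -- assumed aggregate re-replication throughput in MB/s
    """
    data_mb = drive_tb * fill_fraction * 1_000_000
    return data_mb / throughput_mb_s / 3600


# Hypothetical inputs: 70% full drives, 500 MB/s of cluster-wide copy bandwidth.
for size_tb in (4, 8):
    hours = rereplication_hours(size_tb, fill_fraction=0.7, throughput_mb_s=500)
    print(f"{size_tb} TB drive: about {hours:.1f} hours to re-replicate")
```

Doubling the drive size doubles the estimate, which is the trade-off against the lower cost per TB of larger drives.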

Cloudera does not support more than 100 TB per data node. You could use 12 x 8 TB spindles or 24 x 4 TB spindles, either of which gives 96 TB per node. Cloudera does not support drives larger than 8 TB.
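The two limits above can be checked with simple arithmetic. The following is a minimal sketch; the function and constant names are hypothetical, and only the 100 TB and 8 TB figures come from the explanation itself.

```python
MAX_NODE_TB = 100   # per-node limit stated in the explanation
MAX_DRIVE_TB = 8    # per-drive limit stated in the explanation


def check_disk_layout(num_drives, drive_tb):
    """Return (raw_tb, problems) for a proposed data node disk layout."""
    raw_tb = num_drives * drive_tb
    problems = []
    if drive_tb > MAX_DRIVE_TB:
        problems.append(f"drive size {drive_tb} TB exceeds {MAX_DRIVE_TB} TB")
    if raw_tb > MAX_NODE_TB:
        problems.append(f"node capacity {raw_tb} TB exceeds {MAX_NODE_TB} TB")
    return raw_tb, problems


# The first two layouts are the ones named in the text; the last two are
# made-up counterexamples that each break one limit.
for drives, size in [(12, 8), (24, 4), (24, 8), (8, 12)]:
    raw, problems = check_disk_layout(drives, size)
    status = "ok" if not problems else "; ".join(problems)
    print(f"{drives} x {size} TB = {raw} TB: {status}")
```

Both layouts from the text come to 96 TB and pass, which is why statement 1 (200 TB per data node) does not match the explanation.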

Running CDP DC on storage platforms other than direct-attached physical disks can provide suboptimal performance. Cloudera Runtime and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O. 

