Question-59: Please select the correct statements with regard to Cloudera HDFS data balancing.

  1. Hadoop can help mitigate an unbalanced data distribution by rebalancing data across the cluster using the balancer tool.
  2. Running the balancer is a manual process that can be executed from within Cloudera Manager as well as from the command line.
  3. Running the balancer is an automated process within Cloudera Manager.
  4. By default, the maximum bandwidth a DataNode uses for rebalancing is set to 1 MB/second.
  5. Running the balancer on an HBase cluster is not recommended.

Answer:

Exp: HDFS spreads data evenly across the cluster to optimize read access, MapReduce performance, and node utilization. Over time, the data distribution in the cluster can become unbalanced for a variety of reasons. Hadoop can help mitigate this by rebalancing data across the cluster using the balancer tool. Running the balancer is a manual process that can be executed from within Cloudera Manager or from the command line. By default, Cloudera Manager configures the balancer to rebalance a DataNode when its utilization is more than 10% above or below the average utilization across the cluster. Individual DataNode utilization can be viewed from within Cloudera Manager.
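The 10% threshold described above can be illustrated with a small sketch. This is not the balancer's actual implementation. The function name `classify_datanodes`, the node names, and the capacity figures are hypothetical. The sketch also assumes that the cluster average is total used space divided by total capacity.

```python
def classify_datanodes(nodes, threshold_pct=10.0):
    """Split DataNodes into over- and under-utilized groups.

    nodes: dict mapping a node name to (used_bytes, capacity_bytes).
    threshold_pct: allowed deviation from the cluster-wide utilization,
                   in percent (10 is the default mentioned in the text).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity

    over, under = [], []
    for name, (used, capacity) in sorted(nodes.items()):
        node_pct = 100.0 * used / capacity
        if node_pct > cluster_pct + threshold_pct:
            over.append(name)    # candidate source of block moves
        elif node_pct < cluster_pct - threshold_pct:
            under.append(name)   # candidate target of block moves
    return cluster_pct, over, under


# Hypothetical three-node cluster, capacities in arbitrary units.
example_nodes = {"dn1": (80, 100), "dn2": (50, 100), "dn3": (20, 100)}
print(classify_datanodes(example_nodes))
```

In this example the cluster-wide utilization is 50%, so `dn1` (80%) is above the 60% upper bound and `dn3` (20%) is below the 40% lower bound, while `dn2` is considered balanced.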

By default, the maximum bandwidth a DataNode uses for rebalancing is set to 1 MB/second (8 Mbit/second). This can be increased, but the network bandwidth used by rebalancing could affect the performance of production cluster applications. Changing the balancer bandwidth setting within Cloudera Manager requires a restart of the HDFS service. However, this setting can also be applied immediately across all nodes, without a configuration change, by running the command:

hdfs dfsadmin -setBalancerBandwidth <bytes_per_second> 

This command must be run as an HDFS superuser. This is a convenient way to change the setting without restarting the cluster, but because it is a dynamic change, it does not persist if the cluster is restarted. The "Recommended configurations for the Balancer" page provides more insight into scenarios and tunables, with suggested values.
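Because the command takes its argument in bytes per second, a value given in MB/second has to be converted first. The following is a minimal sketch of that conversion. The helper names are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes. It only builds the argument list and does not run anything against a cluster.

```python
def balancer_bandwidth_bytes(megabytes_per_second):
    """Convert MB/second to bytes/second (assuming 1 MB = 1024 * 1024 bytes)."""
    return int(megabytes_per_second * 1024 * 1024)


def set_bandwidth_command(megabytes_per_second):
    """Build the argument list for the dfsadmin command shown above."""
    return [
        "hdfs",
        "dfsadmin",
        "-setBalancerBandwidth",
        str(balancer_bandwidth_bytes(megabytes_per_second)),
    ]


# The 1 MB/second default corresponds to 1048576 bytes per second.
print(balancer_bandwidth_bytes(1))
# Raising the limit to 10 MB/second would use this command line.
print(" ".join(set_bandwidth_command(10)))
```

The resulting list could be passed to `subprocess.run` by a user with HDFS superuser rights, which is left out here so the sketch stays side-effect free.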

Cloudera does not recommend running the balancer on an HBase cluster because it affects data locality for the RegionServers, which can reduce performance. Unfortunately, when HBase and YARN services are colocated and heavy usage is expected on both, there is no good way to ensure the cluster is optimally balanced.

You can configure HDFS to distribute writes on each DataNode in a manner that balances out available storage among that DataNode's disk volumes. By default, a DataNode writes new block replicas to disk volumes solely on a round-robin basis. You can configure a volume-choosing policy that causes the DataNode to take into account how much space is available on each volume when deciding where to place a new replica.
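The difference between the two placement strategies can be sketched with a toy model. This is a deliberate simplification, not Hadoop's actual volume-choosing code: the space-aware variant simply picks the volume with the most free space each time. The function names, volume names, and sizes are hypothetical.

```python
def place_round_robin(free_space, block_size, n_blocks):
    """Place blocks on volumes in strict rotation, ignoring free space."""
    free = dict(free_space)
    names = sorted(free)
    for i in range(n_blocks):
        free[names[i % len(names)]] -= block_size
    return free


def place_most_free(free_space, block_size, n_blocks):
    """Place each block on the volume that currently has the most free space.

    Simplified stand-in for a space-aware policy; ties go to the
    alphabetically first volume name.
    """
    free = dict(free_space)
    for _ in range(n_blocks):
        target = max(sorted(free), key=free.get)
        free[target] -= block_size
    return free


# Hypothetical DataNode with one nearly full and one mostly empty volume.
volumes = {"disk1": 100, "disk2": 400}
print(place_round_robin(volumes, block_size=50, n_blocks=4))
print(place_most_free(volumes, block_size=50, n_blocks=4))
```

With four blocks of size 50, round-robin placement exhausts `disk1` entirely while `disk2` still has 300 units free, whereas the space-aware variant writes all four blocks to `disk2` and leaves `disk1` untouched.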
