Question-57: Which of the following tools ship with Hadoop to benchmark and baseline overall cluster performance?

  1. ZooKeeper
  2. BenchmarkHadoop
  3. Teragen
  4. Terasort

Answer: 3 and 4 (Teragen and Terasort)

Exp: The teragen and terasort benchmarking tools are part of the standard Apache Hadoop distribution and are included with the Cloudera distribution. In the course of a cluster installation or certification, Cloudera recommends running several teragen and terasort jobs to obtain a performance baseline for the cluster. The intention is not to demonstrate the maximum performance possible for the hardware, nor to compare with externally published results, as tuning the cluster for this may be at odds with actual customer operational workloads. Rather, the intention is to run a real workload through YARN to functionally test the cluster and to obtain baseline numbers that can be used for future comparison, such as when evaluating the performance overhead of enabling encryption features or determining whether operational workloads are limited by the I/O hardware.

Running the benchmarks provides an indication of cluster performance and may also identify and help diagnose hardware or software configuration problems by isolating hardware components, such as disks and network, and subjecting them to a higher-than-normal load.

The teragen job generates an arbitrary amount of data, formatted as 100-byte records of random data, and stores the result in HDFS. Each record has a random key and value. The terasort job sorts the data generated by teragen and writes the output to HDFS. During the first iteration of the teragen job, the goal is to obtain a performance baseline on the disk I/O subsystem. The HDFS replication factor should be overridden from the default value 3 and set to 1 so that the data generated by teragen is not replicated to additional data nodes. Replicating the data over the network would obscure the raw disk performance with potential network bandwidth constraints. Once the first teragen job has been run, a second iteration should be run with the HDFS replication factor set to the default value. This applies a high load on the network, and deltas between the first run and second run can provide an indication of network bottlenecks in the cluster.

While the teragen application can generate any amount of data, 1 TB is standard. For larger clusters, it may be useful to also run 10 TB or even 100 TB, as the time to write 1 TB may be negligible compared to the startup overhead of the YARN job. Another teragen job should be run to generate a dataset that is 3 times the RAM size of the entire cluster. This ensures you are not seeing page cache effects and are exercising the disk I/O subsystem.
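The two teragen iterations described above can be sketched as shell commands. This is a minimal sketch, not a verified recipe: the examples-jar path and the HDFS output directories are assumptions that vary by distribution, and the commands are echoed so the sketch runs without a cluster (drop the `echo` to actually submit the jobs).

```shell
#!/bin/sh
# 1 TB of teragen data = 10^10 rows, since each row is a 100-byte record.
ROWS=$((1000 * 1000 * 1000 * 10))
echo "rows: $ROWS"   # 10000000000

# Assumed jar location; adjust for your distribution (Apache, CDH, etc.).
EXAMPLES_JAR=/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar

# First iteration: replication factor 1 to isolate raw disk I/O.
echo hadoop jar "$EXAMPLES_JAR" teragen \
  -Ddfs.replication=1 "$ROWS" /benchmarks/teragen-rep1

# Second iteration: default replication (3) to also load the network.
echo hadoop jar "$EXAMPLES_JAR" teragen \
  "$ROWS" /benchmarks/teragen-rep3
```

Comparing the elapsed times of the two runs gives the network-overhead delta the explanation refers to.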

The number of mappers for the teragen and terasort jobs should be set to the maximum number of disks in the cluster. This is less than the total number of YARN vcores available, so it is advisable to temporarily lower the vcores available per YARN worker node to the number of disk spindles to ensure an even distribution of the workload. An additional vcore is needed for the YARN ApplicationMaster. The terasort job should also be run with the HDFS replication factor set to 1 as well as with the default replication factor. The terasort job hardcodes a DFS replication factor of 1, but it can be overridden or set explicitly by specifying the mapreduce.terasort.output.replication parameter.
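Following the mapper guidance above, the terasort runs might be sketched as below. The node and disk counts are assumptions for illustration; `mapreduce.terasort.output.replication` is the parameter named in the explanation, while the map-count hint and paths are hypothetical and depend on your cluster. The commands are echoed so the sketch runs without a cluster.

```shell
#!/bin/sh
# Assumed cluster shape: 20 worker nodes with 12 data disks each,
# giving the recommended mapper count of disks-in-cluster.
NODES=20
DISKS_PER_NODE=12
MAPPERS=$((NODES * DISKS_PER_NODE))
echo "mappers: $MAPPERS"   # 240

EXAMPLES_JAR=/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar

# Replication-1 run: terasort's output replication already defaults to 1.
echo hadoop jar "$EXAMPLES_JAR" terasort \
  -Dmapreduce.job.maps="$MAPPERS" \
  /benchmarks/teragen-rep1 /benchmarks/terasort-rep1

# Default-replication run: override the hardcoded output replication of 1.
echo hadoop jar "$EXAMPLES_JAR" terasort \
  -Dmapreduce.job.maps="$MAPPERS" \
  -Dmapreduce.terasort.output.replication=3 \
  /benchmarks/teragen-rep3 /benchmarks/terasort-rep3
```

Note that `-Dmapreduce.job.maps` is only a hint: terasort's actual map count is ultimately governed by input splits, which is why the explanation recommends temporarily capping YARN vcores per node instead.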
