HDFS Problems

  • HDFS can run clusters with thousands of nodes, store hundreds of petabytes of data, and serve thousands of concurrent clients. However, HDFS works best when most files are large, typically more than 100 MB. HDFS suffers from the well-known "small files limitation" and struggles beyond roughly 400 million files. There has long been demand for an HDFS-like storage system that can scale to billions of small files. Ozone is a distributed key-value store that can manage both small and large files alike. While HDFS provides POSIX-like semantics, Ozone looks and behaves like an object store.

Apache Ozone

Apache Ozone is an object store designed for Big Data applications. Object stores are more straightforward to build and use than standard file systems, and they are easier to scale. Cloud vendors provide their own object stores, such as AWS S3 from Amazon and Google Cloud Storage from GCP. Running your own object store on-premises can be more economical than paying a cloud vendor for equivalent storage.

Just as you access a file in HDFS with a URL starting with hdfs://, you access an object stored in Ozone with a URL starting with o3fs:// (or ofs://), and your Spark-based application can start consuming data from the Ozone object store.

Ozone is a scalable, redundant, and distributed object store for Hadoop. It can store billions of objects of varying sizes, and it can function effectively in containerized environments such as Kubernetes.

Apache Ozone and Big Data Applications

  • Applications like Apache Spark, Hive, and YARN work without any modifications when using Ozone.
  • Ozone comes with a Java client library, S3 protocol support, and a command-line interface, which make it easy to work with Ozone.

CDP Private Cloud and Ozone

  • Apache Ozone is available with Cloudera Data Platform Private Cloud. 
  • CDP Private Cloud uses Ozone to separate storage from compute, enabling it to handle billions of objects on-premises, similar to public cloud deployments that benefit from object stores like S3.
  • Ozone is fully compatible with the S3 API, making it a future-proof solution and enabling CDP Hybrid Cloud.
  • Apache Ozone allows the same data to be used interoperably across use cases. For example, a user can ingest data into Apache Ozone using the FileSystem API, and the same data can be accessed via the Ozone S3 API. This can improve the efficiency of a platform built on an on-premises object store.

Ozone and Objects

  • Ozone's namespace consists of volumes, buckets, and keys.
  • Volumes: Volumes are similar to user accounts. Only administrators can create or delete volumes. An administrator will typically create a volume for an organization or team.
  • Buckets: Buckets are similar to directories. A bucket can contain any number of keys, but buckets cannot contain other buckets. A volume can contain zero or more buckets. Ozone buckets are similar to AWS S3 buckets.
  • Keys: Keys are similar to files. Ozone stores data as keys inside these buckets. Keys are unique within a given bucket and are similar to S3 objects. Key names can be any string. Values are the data that you store under these keys; Ozone currently enforces no upper limit on the size of that data.
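
The volume/bucket/key hierarchy maps directly onto the Ozone Java client library. The following is a minimal, hedged sketch based on that client API; the OM address, volume, bucket, and key names are placeholders, and error handling is omitted.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.hdds.conf.OzoneConfiguration;
    import org.apache.hadoop.ozone.client.ObjectStore;
    import org.apache.hadoop.ozone.client.OzoneBucket;
    import org.apache.hadoop.ozone.client.OzoneClient;
    import org.apache.hadoop.ozone.client.OzoneClientFactory;
    import org.apache.hadoop.ozone.client.OzoneVolume;
    import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

    public class OzoneObjectsExample {
      public static void main(String[] args) throws Exception {
        // Point the client at the Ozone Manager; "om.example.com" is a placeholder host.
        OzoneConfiguration conf = new OzoneConfiguration();
        conf.set("ozone.om.address", "om.example.com:9862");

        try (OzoneClient client = OzoneClientFactory.getRpcClient(conf)) {
          ObjectStore store = client.getObjectStore();

          // Volumes are normally created by an administrator.
          store.createVolume("sales");
          OzoneVolume volume = store.getVolume("sales");

          // Buckets live inside volumes and cannot contain other buckets.
          volume.createBucket("q1-reports");
          OzoneBucket bucket = volume.getBucket("q1-reports");

          // Keys are similar to files; the value is whatever bytes you write.
          byte[] data = "hello ozone".getBytes(StandardCharsets.UTF_8);
          try (OzoneOutputStream out = bucket.createKey("greeting.txt", data.length)) {
            out.write(data);
          }
        }
      }
    }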

Ozone Introduction

  • Apache Ozone is a scalable, redundant, and distributed object store that is well optimized for Big Data workloads.
  • Apache Hive works natively on Ozone without any modifications.
  • Ozone is available in CDP Private Cloud Base deployment.

Keys and Data Nodes

  • When you write a key (or file/object), Ozone stores the associated data on Data Nodes in chunks called blocks. 
  • Each key is associated with one or more blocks. 
  • Within a Data Node, multiple unrelated blocks can reside in a storage container. 

Ozone Components

The following are the main components of Apache Ozone:

  1. Data Nodes
    • Note that Ozone DataNodes are not the same as HDFS DataNodes. 
    • Ozone DataNodes store the actual user data in the Ozone cluster. 
    • DataNodes store the blocks of data that clients write. 
    • Storage Container: A collection of these blocks is a storage container, built using the off-the-shelf key-value store RocksDB. This is the lowest layer; Ozone stores user data in storage containers. A container is a collection of key-value pairs of block names and their data. Keys are block names, which are locally unique within the container; block names need not be globally unique. Values are the block data and vary from 0 bytes to 256 MB.
    • Each container supports a small set of operations on the blocks it holds.
    • Containers are stored on disk using RocksDB, with some optimizations for large values.
    • Chunk File: The chunk files constitute what gets actually written to disk. 
    • The client streams data in the form of fixed-size chunk files (4 MB) for a block.
    • HeartBeat: The DataNode sends heartbeats to SCM at a fixed time interval. With every heartbeat, the DataNode sends container reports.
    • RocksDB Instance: Every storage container in the DataNode has its own RocksDB instance which stores the metadata for the blocks and individual chunk files. 
    • Pipeline: A pipeline of DataNodes is actually a Ratis Quorum with an elected leader and follower DataNodes accepting writes. 
    • Every DataNode can be a part of one or more active Pipelines. 
    • Reads: Reads from the client go directly to the DataNode and not over Ratis. 
    • SSD: Cloudera recommends an SSD on a DataNode to persist the Ratis logs for the active pipelines, which significantly boosts write throughput.
  2. Ozone Manager:
    • This is a namespace manager that implements the Ozone key-value namespace. 
    • The Ozone Manager (OM) is a highly available namespace manager for Ozone.
    • OM manages the metadata for volumes, buckets, and keys.
    • OM maintains the mappings between keys and their corresponding Block IDs.
    • When a client application requests keys to perform read and write operations, OM interacts with SCM for information about the blocks relevant to those operations and provides this information to the client.
    • OM uses Apache Ratis (Raft protocol) to replicate Ozone manager state.
    • While RocksDB (embedded storage engine) persists the metadata or keyspace, the flush of the Ratis transaction to the local disk ensures data durability in Ozone. Therefore, Cloudera recommends a low-latency local storage device like NVMe SSD on each OM node for maximum throughput.
    • A typical Ozone deployment has three OM nodes for redundancy.
  3. Storage Container Manager: This is a service which manages the lifecycle of replicated containers. 
    • The Storage Container Manager (SCM) is a master service in Ozone.
    • A storage container is the replication unit in Ozone. 
    • Unlike HDFS, which manages block-level replication, Ozone manages the replication of a collection of storage blocks called a storage container.
    • The default size of a container is 5 GB.
    • SCM manages DataNode pipelines and placement of containers on the pipelines. 
    • A pipeline is a collection of DataNodes based on the replication factor. For example, given the default replication factor of three, each pipeline contains three DataNodes.
    • SCM is responsible for creating and managing active write pipelines of DataNodes on which the block allocation happens.
    • The client directly writes blocks to open containers on the DataNode, and the SCM is not directly on the data path. 
    • A container is immutable after it is closed. 
    • SCM uses RocksDB to persist the pipeline metadata and the container metadata. 
    • The size of this metadata is much smaller compared to the keyspace managed by OM.
    • SCM is a highly available component which makes use of Apache Ratis.
    • A typical Ozone deployment has three SCM nodes for redundancy. 
    • SCM service instances can be collocated with OM instances; therefore, you can use the same master nodes for both SCM and OM.
  4. Recon Server: 
    • Recon is a centralized monitoring and management service within an Ozone cluster that provides information about the metadata maintained by different Ozone components such as OM and SCM.
    • Recon takes a snapshot of the OM and SCM metadata while also receiving heartbeats from the Ozone DataNode.
    • Recon asynchronously builds an offline copy of the full state of the cluster in an incremental manner depending on how busy the cluster is.
    • Recon usually trails the OM by a few transactions in terms of updating its snapshot of the OM metadata.
    • Cloudera recommends using an SSD for maintaining the snapshot information because an SSD helps Recon stay up to date with the OM.
  5. Ozone File System Interfaces: Ozone is a multi-protocol storage system with support for the following interfaces:
    • ofs: A Hadoop-compatible file system that allows any application expecting an HDFS-like interface to work against Ozone with no changes. Frameworks like Apache Spark, YARN, and Hive work against Ozone without the need for any change (see the sketch after this list).
    • s3: Amazon’s Simple Storage Service (S3) protocol. You can use S3 clients and S3 SDK-based applications without any modifications to Ozone.
    • o3fs: A bucket-rooted Hadoop Compatible file system interface.
    • o3: An object store interface that can be used from the Ozone shell.
  6. S3 Gateway: 
    • S3 gateway is a stateless component that provides REST access to Ozone over HTTP and supports the AWS-compatible s3 API.
    • S3 gateway supports multipart uploads and encryption zones.
    • In addition, the S3 gateway transforms the S3 API calls over HTTP into RPC calls to other Ozone components.
    • To scale your S3 access, Cloudera recommends deploying multiple gateways behind a load balancer such as HAProxy that supports Direct Server Return (DSR), so that the load balancer is not on the data path.
  7. HDDS (Hadoop Distributed Data Store): This is a generic distributed storage layer for blocks that does not provide a namespace. 
  8. Raft consensus protocol: Apache Ratis is an open-source Java implementation of Raft optimized for high throughput.
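
To make the multi-protocol interfaces above concrete, here is a hedged sketch of the ofs interface used through the standard Hadoop FileSystem API. It assumes the Ozone filesystem client jar is on the classpath and uses the ofs implementation class name from the Ozone documentation; the host, volume, bucket, and key names are placeholders. Anything written against Hadoop's FileSystem (including Spark, Hive, and YARN jobs) can address Ozone the same way.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OfsInterfaceExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The ofs scheme is usually wired up in core-site.xml; setting it here keeps the
        // sketch self-contained (the Ozone filesystem client jar must be on the classpath).
        conf.set("fs.ofs.impl", "org.apache.hadoop.fs.ozone.RootedOzoneFileSystem");

        // ofs:// paths are rooted at the Ozone Manager: ofs://<om-host>/<volume>/<bucket>/<key>.
        try (FileSystem fs = FileSystem.get(URI.create("ofs://om.example.com/"), conf)) {
          Path key = new Path("/sales/q1-reports/greeting.txt");
          try (FSDataOutputStream out = fs.create(key)) {
            out.writeUTF("written through the Hadoop-compatible interface");
          }
          System.out.println("exists: " + fs.exists(key));
        }
      }
    }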

Design considerations

  1. Strongly Consistent: Ozone is designed for Strong Consistency which helps in simplifying application design.
  2. Simple Architecture: Ozone has a simple architecture, as discussed above.
  3. Layered Architecture: Ozone has a layered architecture that separates namespace management from the block and node management layer, which allows users to scale independently on both axes.
  4. Painless Recovery: Ozone has a robust design, where it can recover from catastrophic events like cluster-wide power loss without losing data and without expensive recovery steps. 
  5. Works well with Hadoop Ecosystem: Ozone is usable by the existing Hadoop ecosystem and related applications.
    • Hadoop Compatible FileSystem API: OzoneFS is a Hadoop-compatible file system which allows Hive and Spark to use Ozone as a storage layer with zero modifications.
    • Data Locality: Data locality was key to the original HDFS/MapReduce architecture by allowing compute tasks to be scheduled on the same nodes as the data. Ozone will also support data locality for applications that choose to use it.
    • Side-by-side deployment with HDFS: Ozone can be installed in an existing Hadoop cluster and can share storage disks with HDFS.

HDFS and Scalability Issues

  1. Namespace Scalability: At large scale, the entire namespace can no longer be stored in the memory of a single node. The key insight is that the namespace has locality of reference, so only the working set needs to be kept in memory. In Ozone, the namespace is managed by a service called the Ozone Manager.
  2. Block Map Scalability: This is a harder problem to solve. Unlike the namespace, the block map does not have locality of reference since storage nodes (DataNodes) periodically send block reports about each block in the system. Ozone delegates this problem to a shared generic storage layer called Hadoop Distributed DataStore (HDDS).

 

What is Ozone?

Ozone is an object store designed for big data applications. Big data workloads tend to be very different from standard workloads and Ozone is born out of lessons learned from running Hadoop in thousands of clusters.

Why an Object Store?

Object stores are more straightforward to build and use than standard file systems. It is also easier to scale an object store. The fact that most big data applications and frameworks like Apache Spark, YARN and Hive can run both on cloud and on-prem, makes having an object store on-prem very attractive.

Can I use Ozone in the cloud?

While the technical answer is yes, there is always the question of why you are not happy with the storage provided by the cloud vendor. It is often cheaper to run against the cloud vendor's native object store, so in the public cloud you should generally prefer S3, Azure storage, or Google Cloud Storage.

Is there a cloud where I can use Ozone as the object store?

Not that we know of. Ozone is an Apache-licensed open-source product; nothing prevents someone from offering Ozone as a product in the cloud.

I have an Apache Spark Application. How do I use it with Ozone?

You have a Spark-based application and you want it to work with Ozone. If your current storage system is HDFS, you are passing the location of the data to your application with a URL that begins with hdfs://. If you replace that hdfs:// URL with an o3fs:// URL pointing at an Ozone bucket, the Spark application will start using data from Ozone.
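
As an illustration, here is a minimal sketch in Java of the only change such a Spark job needs: swapping the hdfs:// URL for an o3fs:// URL. The host names, volume, bucket, and paths are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkOnOzoneExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("spark-on-ozone").getOrCreate();

        // Before: the job read its input from HDFS.
        // Dataset<Row> events = spark.read().parquet("hdfs://namenode:8020/data/events");

        // After: the same read against an Ozone bucket via the o3fs scheme.
        // The layout is o3fs://<bucket>.<volume>.<om-host>/<path>.
        Dataset<Row> events = spark.read()
            .parquet("o3fs://q1-reports.sales.om.example.com/data/events");

        events.show(10);
        spark.stop();
      }
    }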

I have an application that is reading and writing to S3 buckets. How do I use it with Ozone?

Ozone supports the S3 protocol as a first-class interface, so you can take an existing S3-based application and simply change the S3 server address to point at Ozone's S3 gateway. That is it.
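
As a sketch of what "change the S3 server address" looks like in practice, the snippet below uses the AWS SDK for Java to point an S3 client at an Ozone S3 Gateway instead of AWS. The endpoint host, port, bucket name, and credentials are placeholders; on a secure cluster the key pair comes from "ozone s3 getsecret", as described in the next answer.

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class S3OnOzoneExample {
      public static void main(String[] args) {
        // Placeholder S3 gateway endpoint; the host and port depend on your deployment.
        String ozoneS3Endpoint = "http://s3g.example.com:9878";

        // On a secure cluster these values come from "ozone s3 getsecret"; placeholders here.
        BasicAWSCredentials creds = new BasicAWSCredentials("ACCESS_KEY_ID", "SECRET_ACCESS_KEY");

        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(
                new AwsClientBuilder.EndpointConfiguration(ozoneS3Endpoint, "us-east-1"))
            .withCredentials(new AWSStaticCredentialsProvider(creds))
            .withPathStyleAccessEnabled(true) // path-style addressing avoids virtual-host DNS setup
            .build();

        // The rest of the application code is unchanged from a regular S3 deployment.
        s3.createBucket("reports");
        s3.putObject("reports", "greeting.txt", "hello from the S3 interface");
        System.out.println(s3.getObjectAsString("reports", "greeting.txt"));
      }
    }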

You said I could change the S3 server address, but I need to put the access key id and secret access key into my .aws file; where do I find it?

If you have a secure Ozone cluster, log in by running "kinit" and request Ozone AWS credentials by running "ozone s3 getsecret". This command presents your Kerberos credentials to the Ozone Manager, which returns both an access key ID and a secret access key that you can put into your .aws file. If security is not turned on, you can put anything in these fields, because security will not be enforced on the server side.

You said Kerberos, but that is not how S3 security works. So?

Thank you for pointing that out. Your S3 credentials are created using your Kerberos identity. From the administrator’s point of view, this allows us to have a centralized identity and policy management for Ozone. Ozone supports the AWS Signature Version 4 protocol, so from the user’s point of view it just works.

Okay, I am convinced I would like to try out Ozone. How do I get started?

Please take a look at the getting started document. For the impatient, we have Docker containers, which allow you to start a fully functional cluster on your machine to explore and see if Ozone works for you.

There are instructions on how to start Ozone on conventional machines or VMs if you would like to go that way.

You can download it and try it out.

You mentioned K8s earlier. How do I deploy Ozone on K8s?

Please look at the getting started with K8s page. Ozone fully supports deployments to a K8s cluster.

I have an existing Apache Hadoop cluster, and I run things like HDFS and YARN. How do I use Ozone?

Ozone is designed to run concurrently with HDFS and YARN in a normal Hadoop cluster, so you can upgrade to the latest Hadoop, untar the Ozone bits in the same cluster, and have Ozone working alongside HDFS. Please see the getting started with HDFS page for more information.

How does Ozone scale, and how is it different from HDFS?

Ozone does some things differently from HDFS. First and foremost, Ozone took a long-standing idea from the HDFS community and created separate services for different operations. So Ozone has a namespace server, a block management server, and a management console. Pretty much all of these are ideas borrowed from HDFS.

Does it mean a new Hadoop server component?

Both Ozone and HDDS require one additional master component each: the Ozone Manager and the Storage Container Manager, respectively. The worker parts of Ozone/HDDS can be started as an HDFS DataNode plugin or standalone.

This sounds very interesting. How can I contribute? I have never contributed to Open source/I am a novice/I don’t know how to contribute/I am intimidated, what can I do?

As Lao Tzu said, "The journey of a thousand miles begins with one step." You might never have worked on distributed systems, but you will always be able to help out on Ozone. If you speak a language like Chinese, you can help us by translating documentation into Chinese. If you are a high school student just starting to learn Java, you are most welcome to pick up work items that fit your skills and schedule. If you know HTML, please feel free to work on our websites and the Ozone UI. If you are a graphics artist and feel that Ozone can be improved, please let us know.

Ask us, please don't feel shy. We are an open and welcoming community. Feel free to open a JIRA under HDDS if you know how. The easiest thing is to send us an email on any of the discussion channels. Someone will always be there to help you out.

You are also welcome to join our weekly community call, and just talk to us. There are several ways to start contributing to Ozone, please see the guideline for details.

 

 

