04 Aug '23

Calculating AWS DocumentDB Storage I/Os

Amazon DocumentDB is a fully managed, native JSON document database that is mostly compatible with MongoDB. Why mostly? Because it has a few functional differences from MongoDB, and some MongoDB features are not supported. Despite these limitations, customers benefit from a managed database service with built-in security, backup integration, scalability and fault tolerance. This relieves customers of many operational burdens. AWS DMS (Database Migration Service) supports the migration from MongoDB to DocumentDB.

Apart from the functional evaluation, pricing should of course be taken into consideration before migrating to DocumentDB as well.

DocumentDB’s pricing is based on four factors:

1. Instance type and number of instances: This is similar to other services such as EC2 or RDS.
2. Database storage: $0,119 per GB-month (region eu-central-1)
3. Backup storage: $0,023 per GB-month (region eu-central-1)
4. I/Os on the cluster’s storage volume: $0,22 per 1 million requests (region eu-central-1)

While factors 1 to 3 are fairly simple to estimate, especially when migrating from an existing MongoDB deployment, cost factor 4 can be tricky. How do we arrive at the number of I/Os that we will presumably consume before actually consuming and paying for them? Let’s look into the details.

How DocumentDB calculates I/Os

I/Os are billed when pushing transaction logs to the storage layer. What is the storage layer? Looking at DocumentDB’s architecture, we can see that compute resources are detached from storage resources. No matter how many instances you have in your DocumentDB cluster, all data will always be stored in six copies across three availability zones. This architecture is similar to that of Amazon Aurora databases and is a big advantage compared to other database solutions. In the context of I/O calculation, this is important to know because it means that I/Os are not counted multiple times for multiple instances inside a cluster.

DocumentDB Architecture

The second important thing to note is that I/Os are calculated in 4KB chunks for write operations and 8KB chunks for read operations. That means, if we write 1KB, we will be charged 1 I/O, if we write 4 KB, we will also be charged 1 I/O, and if we write 5 KB, we will be charged 2 I/Os. The same principle applies to the read operations.
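The rounding rule is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name `ios_per_operation` and the two constants are made up for illustration; only the 4 KB and 8 KB chunk sizes come from the text above.

```python
import math

WRITE_CHUNK_KB = 4  # write I/Os are counted in 4 KB chunks (from the text)
READ_CHUNK_KB = 8   # read I/Os are counted in 8 KB chunks (from the text)


def ios_per_operation(size_kb: float, chunk_kb: int) -> int:
    """Return the number of I/Os counted for one operation of the given size."""
    # Every started chunk counts as a full I/O, hence the rounding up.
    return math.ceil(size_kb / chunk_kb)


for size_kb in (1, 4, 5):
    ios = ios_per_operation(size_kb, WRITE_CHUNK_KB)
    print(f"{size_kb} KB write -> {ios} I/O(s)")
# prints:
# 1 KB write -> 1 I/O(s)
# 4 KB write -> 1 I/O(s)
# 5 KB write -> 2 I/O(s)
```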

Hence, there are a couple of things that we need to measure or estimate, either on average or per data category. For write operations:

Size: How large is the data written per operation?
Frequency: How often do we write it (e.g. per second)?

The monthly I/O consumption for write operations can then be calculated as:

ceil(Size/4) x Frequency per Second x 2628000 (the number of seconds in an average month)
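As a rough sketch, the write formula translates to Python as follows. The function name `monthly_write_ios` and the sample workload (5 KB documents written 10 times per second) are assumptions chosen for illustration, not figures from a real system.

```python
import math

# Seconds in an average month: 365 * 24 * 3600 / 12
SECONDS_PER_MONTH = 2_628_000


def monthly_write_ios(size_kb: float, writes_per_second: float) -> int:
    """Estimate monthly write I/Os, ignoring any batching of concurrent writes."""
    ios_per_write = math.ceil(size_kb / 4)  # writes are counted in 4 KB chunks
    return round(ios_per_write * writes_per_second * SECONDS_PER_MONTH)


# Hypothetical workload: 5 KB per write, 10 writes per second
print(f"{monthly_write_ios(5, 10):,}")  # prints 52,560,000
```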

For read operations, the same two questions apply:

Size: How large is the data read per operation?
Frequency: How often do we read it (e.g. per second)?

The monthly I/O consumption for read operations can then be calculated as:

ceil(Size/8) x Frequency per Second x 2628000

However, for read operations, there is another thing to factor in: the cache hit ratio. DocumentDB has a built-in cache. Cache utilisation depends on the number of repeated transactions, but also on the memory available on the instance, and hence on the instance type selection. To simplify, we will assume that the instance type is large enough to keep the entire database in memory. The third question to answer for read operations is then:

Cache Hit Ratio: What proportion of read operations are repetitions of previous read operations (e.g. 80%)?

Consequently, we need to adjust the above formula for read operations:

ceil(Size/8) x Frequency per Second x 2628000 x (1 - Cache Hit Ratio)
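The adjusted read formula can be sketched in Python like this. The function name `monthly_read_ios` and the sample workload (20 KB reads, 2 per second, 75% cache hits) are hypothetical, and the sketch relies on the same simplification as the text: a cache hit costs no storage I/O at all.

```python
import math

# Seconds in an average month: 365 * 24 * 3600 / 12
SECONDS_PER_MONTH = 2_628_000


def monthly_read_ios(size_kb: float, reads_per_second: float,
                     cache_hit_ratio: float) -> int:
    """Estimate monthly read I/Os; only cache misses reach the storage layer."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    ios_per_read = math.ceil(size_kb / 8)  # reads are counted in 8 KB chunks
    miss_ratio = 1.0 - cache_hit_ratio
    return round(ios_per_read * reads_per_second * SECONDS_PER_MONTH * miss_ratio)


# Hypothetical workload: 20 KB per read, 2 reads per second, 75% cache hits
print(f"{monthly_read_ios(20, 2, 0.75):,}")  # prints 3,942,000
```

Setting `cache_hit_ratio` to 0 gives the worst case in which every read goes to storage, which is a useful upper bound when the real ratio is unknown.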

Another particularity of write operations is that they can be batched together when they run concurrently, which reduces the overall number of I/Os consumed. We will ignore this for the I/O estimate, as the potential savings depend heavily on the usage pattern of the DocumentDB cluster. However, when executing lots of small write operations, batching can yield significant savings.

Example

Let’s work through an example with the following figures:

Write:

Size: 2 KB
Frequency: 3 per second

The number of monthly I/Os for write operations is:

1 x 3 x 2628000 = 7.884.000

Read:

Size: 6 KB
Frequency: 5 per second
Cache Hit Ratio: 75%

Since 6 KB fits into a single 8 KB chunk, each read counts as 1 I/O. The number of monthly I/Os for read operations is:

1 x 5 x 2628000 x (100% - 75%) = 3.285.000

The sum is roughly 11 million monthly I/Os. Multiplied by the price of 0,22$ per million I/Os, the result is about 2,42$.

11 x 0,22 = 2,42$
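When the estimate is done per data category rather than on average, the same formulas can be applied to each category and summed up. The following Python sketch shows one way to do that. The `Workload` class, the function names and the three sample categories are all made up for illustration; the price is the eu-central-1 figure quoted above and should be checked against current AWS pricing.

```python
import math
from dataclasses import dataclass

SECONDS_PER_MONTH = 2_628_000     # 365 * 24 * 3600 / 12
PRICE_PER_MILLION_IOS_USD = 0.22  # eu-central-1 price quoted in the text


@dataclass
class Workload:
    """One data category with its access pattern (hypothetical structure)."""
    name: str
    size_kb: float
    ops_per_second: float
    is_read: bool
    cache_hit_ratio: float = 0.0  # only applied to reads


def monthly_ios(w: Workload) -> float:
    """Apply the write or read formula from the text to one category."""
    chunk_kb = 8 if w.is_read else 4
    miss_ratio = (1.0 - w.cache_hit_ratio) if w.is_read else 1.0
    return math.ceil(w.size_kb / chunk_kb) * w.ops_per_second * SECONDS_PER_MONTH * miss_ratio


def monthly_io_cost_usd(workloads: list[Workload]) -> float:
    """Sum the I/Os of all categories and price them per million."""
    total_ios = sum(monthly_ios(w) for w in workloads)
    return total_ios / 1_000_000 * PRICE_PER_MILLION_IOS_USD


# Illustrative figures only, not measurements
workloads = [
    Workload("order writes", size_kb=3, ops_per_second=10, is_read=False),
    Workload("audit log writes", size_kb=10, ops_per_second=1, is_read=False),
    Workload("catalog reads", size_kb=12, ops_per_second=20, is_read=True,
             cache_hit_ratio=0.5),
]

for w in workloads:
    print(f"{w.name}: {monthly_ios(w):,.0f} I/Os per month")
print(f"Estimated I/O cost: ${monthly_io_cost_usd(workloads):.2f} per month")
```

Like the manual estimate, this ignores write batching, so it tends to overestimate the write side for workloads with many small concurrent writes.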

Title Photo by Towfiqu Barbhuiya on Unsplash

Benjamin Wagner

Benjamin is a former tecRacer Employee
