Modeling has_many Relationships with DynamoDB

Written by: Parth Modi

Recently our team had the requirement of storing data coming from sensors, and we chose DynamoDB for its ability to handle a massive amount of data efficiently. As a fully managed NoSQL database service with the promise of high performance and scalability, DynamoDB was a perfect choice.

Coming from a strong SQL background and being a fan of ORMs such as Active Record, adapting to the NoSQL mindset gave me an opportunity to rethink and explore the capabilities of NoSQL. But I soon realized that normalization is not always the right choice in the case of NoSQL databases. More importantly, handling relations is a bit different.

One To Many

Relationships in NoSQL databases are very different from, and more flexible than, those in relational databases. You can store one-to-many relations in a single table, or you can opt for the traditional approach and store them in different tables.

Whether you normalize your data to reduce redundancy, or denormalize it and handle the resulting database anomalies with application logic, depends upon the needs of your application. With NoSQL databases, it's better to know and understand which kinds of queries you are going to need frequently, and to constantly optimize your structure and indexes around them.

Store one-to-many in a single table

In many web applications, there is a requirement that a user can have multiple emails. You can handle this with Active Record by creating users and emails tables and then adding a has_many in the User model. A very simple and standard approach. But later, when I access a user, I often need their primary email as well, which means I either preload the primary email or join the relation.
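For reference, here's a minimal sketch of that standard Active Record setup (the model names are illustrative):

# Classic SQL-style modeling: two tables joined by a foreign key.
class User < ApplicationRecord
  has_many :emails
end

class Email < ApplicationRecord
  belongs_to :user
end

# Preload emails to avoid firing a separate query per user:
users = User.includes(:emails)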

Now, in DynamoDB, I wouldn't worry about maintaining two different tables, one for users and one for emails. The reason is that I mostly need emails together with users, and an email is of no use without the related user's information. Besides, a user will never add 100, 50, or even 10 emails to their account. On average, a user maintains two or three email addresses.

As a proof of concept, let's create a user model with name as the partition key. In a real-world scenario, you should use a proper, unique partition key, and a sort key wherever appropriate. You can read DynamoDB Best Practices to get a proper understanding of how to choose the right partition and sort keys. For convenience, the user model will have a primary key made up of just the partition key, with no sort key.

The user can have name, age and emails. The user model with migration would look like this:

# user.rb
class User
  include ActiveModel::Model, ActiveModel::Serialization

  TABLE_NAME = self.name.pluralize.downcase

  attr_accessor :name
  attr_accessor :age
  attr_accessor :emails

  # Drops the table if it already exists, then recreates it.
  # $ddb is assumed to be an Aws::DynamoDB::Client instance.
  def self.migrate_table
    if $ddb.list_tables.table_names.include?(TABLE_NAME)
      $ddb.delete_table(table_name: TABLE_NAME)
      # delete_table is asynchronous; wait for the deletion to finish,
      # or the create_table call below raises ResourceInUseException.
      $ddb.wait_until(:table_not_exists, {table_name: TABLE_NAME})
    end
    $ddb.create_table({
      # Declare the types of all attributes used in keys and indexes.
      attribute_definitions: [
        {
          attribute_name: "name",
          attribute_type: "S", # S = string
        }
      ],
      # name is the partition (HASH) key; there is no sort (RANGE) key.
      key_schema: [
        {
          attribute_name: "name",
          key_type: "HASH",
        }
      ],
      provisioned_throughput: {
        read_capacity_units: 1,
        write_capacity_units: 1,
      },
      table_name: TABLE_NAME
    })
    $ddb.wait_until(:table_exists, {table_name: TABLE_NAME})
  end
end

In $ddb.create_table({...}), the attribute_definitions array lists the name and type of every attribute that is used as a hash key, range key, or in a global or local secondary index. The key_schema specifies each attribute's role, that is, whether the attribute is the hash key or the range key.

Apart from this, you need to specify other information in the migration, such as provisioned throughput and local and global secondary indexes. Refer to the DynamoDB documentation for a better understanding.

Including ActiveModel::Model gives the User class a few methods, such as .new and #persisted?. You can further add methods such as .create, #save, and #update to make it easier to access items.

I prefer to put the migration inside the model's class method .migrate_table. This method recreates the table if it already exists. Additionally, declaring all attributes that need to be persisted with attr_accessor makes it easier to interact with DynamoDB records. It also makes the attributes available when defining other methods such as #save, #update, and .all.

After creating a table, it's time to create and persist new records in DynamoDB.

# user.rb
class User
...
  # Persist the user, emails included, as a single DynamoDB item.
  def save
    item_hash = {
      name: self.name,
      age: self.age,
      emails: self.emails.to_a
    }
    $ddb.put_item({
      table_name: TABLE_NAME,
      return_consumed_capacity: "TOTAL",
      item: item_hash
    })
  end
...
end

The emails array is passed with the user item hash, which allows us to store emails with user records. Now in the Rails console:

> user = User.new(name: "jon", age: 15, emails: ["jon@gmail.com"])
> user.save

This will persist the user along with their emails.

I know, this doesn't look like a has_many association at all to people coming from a SQL background. However, there are a few positives: you don't need to maintain another table just for emails, and you don't need to fire a separate query to fetch them.
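For instance, fetching a user brings the emails along in the same read. Here's a minimal sketch of a .find class method built on get_item, assuming the same $ddb client used earlier:

# user.rb
class User
...
  # Fetch a single user (emails included) with one get_item call.
  def self.find(name)
    item = $ddb.get_item({
      table_name: TABLE_NAME,
      key: { name: name }
    }).item
    item && User.new(item)
  end
...
end

> User.find('jon').emails
=> ["jon@gmail.com"]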

Changing the underlying email object is also easy. For example, if instead of just the email address you now want to store the email provider as well, you can wrap each email inside a hash and store that. No more PostgreSQL migrations or table schema updates!

{
  name: 'jon',
  age: 15,
  emails: [{ id: 'jon@gmail.com', provider: 'gmail' }]
}
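Persisting the richer structure is then just another save, with no migration involved. Assuming the #save and .find methods sketched earlier:

> user = User.find('jon')
> user.emails = [{ id: 'jon@gmail.com', provider: 'gmail' }]
> user.save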

However, there are certain limitations with this approach.

  • In DynamoDB, a single record (item) can store a maximum of 400 KB, attribute names included. The limit counts the binary (UTF-8) length of every attribute name as well as every attribute value, so each embedded email eats into it.

  • Storing a large number of related items inside a single item is not a maintainable approach. Various kinds of data anomalies tend to creep in, which can degrade database integrity.

  • You cannot query for specific emails without using filters, because Local Secondary Indexes and Global Secondary Indexes can be created on scalar attributes only (see the scan sketch after this list).

  • Read and write costs increase as the number of related items grows, because you must access the whole user item to perform CRUD operations on a single email. This is explained nicely in AWS DynamoDB Best Practices.
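To illustrate the filtering limitation, here's a rough sketch of what finding users by a given email would look like. The .by_email method is hypothetical, and because emails isn't a scalar attribute that can be indexed, it has to fall back to a full table scan with a filter expression:

# user.rb
class User
...
  # Hypothetical: find users that have a given email address.
  # A scan reads every item in the table; the filter expression only
  # trims results after read capacity has already been consumed.
  # contains() assumes emails is stored as a plain list of strings.
  def self.by_email(email)
    $ddb.scan({
      table_name: TABLE_NAME,
      filter_expression: 'contains(emails, :email)',
      expression_attribute_values: { ':email' => email }
    }).items.map { |item| User.new(item) }
  end
...
end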

So, for storing emails, which are almost always needed whenever we need the user, storing them inside the same table is a good choice. But never ever try to store something unbounded, like the articles created by a user, in the same table.

Store one-to-many in a different table

This is semantically similar to the classic one-to-many approach in SQL databases.

A movie can have many songs, and each song belongs to a movie. So storing the primary key of the movies table as movie_id in the songs table will let us query songs related to a movie.

Schema for the movie table:

The movie table has a composite primary key made up of a partition key and a sort key, with name as the partition key and year as the sort key.

Schema for the song table:

Here, movie_id stores the movie's partition key and is a Global Secondary Index. This is because we need to be able to query songs through movie_id, and in DynamoDB, it's only possible to query through the primary key or an index.

You can then create a migration for the Movie and Song models according to the schema structure explained above.
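Here is a minimal sketch of what those migrations could look like. The capacity units and the ALL projection are placeholder choices, and the GSI name song-movie-index is the one queried in Song.by_movie below (drop-if-exists handling is omitted for brevity, as in the User model):

# movie.rb -- composite primary key: name (partition) + year (sort)
class Movie
...
  def self.migrate_table
    $ddb.create_table({
      attribute_definitions: [
        { attribute_name: "name", attribute_type: "S" },
        { attribute_name: "year", attribute_type: "N" }
      ],
      key_schema: [
        { attribute_name: "name", key_type: "HASH" },
        { attribute_name: "year", key_type: "RANGE" }
      ],
      provisioned_throughput: {
        read_capacity_units: 1,
        write_capacity_units: 1,
      },
      table_name: TABLE_NAME
    })
    $ddb.wait_until(:table_exists, {table_name: TABLE_NAME})
  end
...
end

# song.rb -- name as partition key, plus a GSI on movie_id
class Song
...
  def self.migrate_table
    $ddb.create_table({
      attribute_definitions: [
        { attribute_name: "name", attribute_type: "S" },
        { attribute_name: "movie_id", attribute_type: "S" }
      ],
      key_schema: [
        { attribute_name: "name", key_type: "HASH" }
      ],
      global_secondary_indexes: [
        {
          index_name: "song-movie-index",
          key_schema: [
            { attribute_name: "movie_id", key_type: "HASH" }
          ],
          projection: { projection_type: "ALL" },
          provisioned_throughput: {
            read_capacity_units: 1,
            write_capacity_units: 1,
          }
        }
      ],
      provisioned_throughput: {
        read_capacity_units: 1,
        write_capacity_units: 1,
      },
      table_name: TABLE_NAME
    })
    $ddb.wait_until(:table_exists, {table_name: TABLE_NAME})
  end
...
end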


Query the songs for a movie

Now, to model a movie has_many songs relationship, you need to find the songs related to a movie by a foreign key. For that purpose, first create a class method, Song.by_movie, in the Song model, which fetches songs by movie.

# song.rb
class Song
  ...
  TABLE_NAME = self.name.pluralize.downcase

  def self.migrate_table
    ...
  end

  # Query the GSI on movie_id to fetch all songs for a movie.
  def self.by_movie(movie_id)
    q = {
      table_name: TABLE_NAME,
      index_name: 'song-movie-index',
      expression_attribute_names: {
        '#movie_id' => 'movie_id'
      },
      expression_attribute_values: {
        ':movie_id' => movie_id,
      },
      key_condition_expression: '#movie_id = :movie_id'
    }
    $ddb.query(q).items.map { |item| Song.new(item) }
  end
  ...
end

Let's talk in detail about the query hash used in the above method for fetching songs.

  • index_name - Since we are querying through an index, we need to specify which index the Song.by_movie method uses.

  • expression_attribute_names - This hash defines aliases for the attribute names used in the query. The common convention is to prefix the alias with # followed by any friendly name, which you can then use in the key condition expression. The value of each key-value pair is the real attribute name: a partition key, sort key, or index key.

  • expression_attribute_values - This hash defines aliases for the values against which you need to query the table. The common convention is to prepend : to the value alias.

  • key_condition_expression - This string specifies the query condition that is run against the given table_name, using the expression_attribute_names and expression_attribute_values, on the specified index, if present.

Finally, the data received from the query is wrapped in an array of Song objects and returned.

has_many in action

An instance method, Movie#songs, uses Song.by_movie to fetch the songs related to a movie.

class Movie
  ...
  def songs
    Song.by_movie(self.name)
  end
  ...
end
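The console session below also calls Movie.find, which isn't defined above. Since the movie table's primary key is composite (name plus year), get_item would require both key attributes; a minimal sketch that has only the name can instead query on the partition key and return the first match:

# movie.rb
class Movie
...
  # Hypothetical: fetch a movie by its partition key alone.
  # get_item would need the full composite key (name and year),
  # so we query on the partition key and take the first match.
  # name is a DynamoDB reserved word, hence the #name alias.
  def self.find(name)
    items = $ddb.query({
      table_name: TABLE_NAME,
      expression_attribute_names: { '#name' => 'name' },
      expression_attribute_values: { ':name' => name },
      key_condition_expression: '#name = :name'
    }).items
    items.first && Movie.new(items.first)
  end
...
end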

Let's see it all in action:

  • Create a movie:

> movie = Movie.new(name: 'Bahubali', year: 2017, released: true, actor: 'Prabhas', actress: 'Anushka', song_ids: [])
> movie.save
  • Create the songs:

130.times do |i|
  song = Song.new(name: "song#{i}", movie_id: movie.name)
  song.save
  movie.song_ids << song.name
  movie.save
end
  • Fetch the songs of a movie:

> movie = Movie.find('Bahubali')
> songs = movie.songs
> [{"name"=>"song71", "movie_id"=>"Bahubali"},
 {"name"=>"song65", "movie_id"=>"Bahubali"},
 {"name"=>"song39", "movie_id"=>"Bahubali"},
 ...
 ]

This approach eliminates a few limitations from the first approach:

  • The item size is small, and related items are spread across multiple tables. Additionally, there is no restriction on how many related items you can store. So a movie can theoretically have an infinite number of related songs.

  • You can use indexes on a related table; this allows you to query songs efficiently without applying filters.

  • Read and write operations on specific items consume read/write capacity only for the particular table involved.

Conclusion

While NoSQL databases such as DynamoDB provide flexibility and the ability to store schemaless data, it's still important to design your database schema carefully, with a proper selection of partition and sort keys and the required Global and Local Secondary Indexes.

Having a proper understanding of user access patterns enables you to predict frequent queries and data requirements. This ultimately helps you design proper database schemas with the right amount of database normalization/denormalization, based on how you model relationships between tables.
