Issue 08

Using ETS to speed up database imports by almost 300%

A guide for Elixir developers (and why you might want to)

Aug 8, 2022 · 7 minute read

What is ETS?

ETS stands for Erlang Term Storage - it's an Erlang module that lets us store any Erlang term in an in-memory table.

You can perform basic tasks like creating a table, adding records and checking for them by key look-up, and you can even create tables that are shared between the processes in your application - ETS is cool.

Think of it like Redis, only it comes built into Erlang (and therefore Elixir) as part of OTP.
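To make that concrete, here's a minimal sketch of the ETS API in action - the table name and keys are arbitrary examples:

```elixir
# Create a named table; :set means one tuple per key.
:ets.new(:my_cache, [:set, :named_table])

# Insert tuples keyed on their first element.
:ets.insert(:my_cache, {"user:1", %{name: "Ada"}})

# Lookups return a list of matching tuples - an empty list on a miss.
[{"user:1", user}] = :ets.lookup(:my_cache, "user:1")
:ets.lookup(:my_cache, "user:2")
# => []
```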

Ahh, Elixir 💖

In this post we're gonna talk about why you might want to reach for ETS to speed up some tasks in your applications or processes.

Premature caching warning

โš ๏ธ Never pre-maturely optimise your application - carefully analyse your application and figure out the bottlenecks and fix them after the system is working. That's not to say you should intentionally write your code in ways that are slow and cause long-running tasks, but don't do it too early - you never know how your system might change during implementation and ruin any pre-mature optimisations you may have done.

What are we backfilling?

As discussed in a previous post, at work, I've been working on backfilling ~1m database rows from a legacy system into our Elixir platform. We're taking over a feature and we need to bring in all the existing user data - each of these records has relations to two other tables in our database, and we need to transform the data as we go. The transformations are quite simple: converting epoch times into timestamps and so on.

To stop our system from grinding to a halt in production, we're consuming our data (from a CSV file exported via psql dump) by iterating over it and firing a RabbitMQ message to our topic, which our system picks up and processes.
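The producer side might look roughly like this - a sketch assuming the `nimble_csv`, `amqp` and `jason` libraries, with the file name, exchange/routing key, and row shape all made up for illustration:

```elixir
alias NimbleCSV.RFC4180, as: CSV

{:ok, conn} = AMQP.Connection.open()
{:ok, chan} = AMQP.Channel.open(conn)

# Stream the exported CSV lazily so we never hold 1m rows in memory,
# publishing one message per entry for the consumers to pick up.
"legacy_entries.csv"
|> File.stream!()
|> CSV.parse_stream()
|> Stream.each(fn [legacy_user_id, category_id, epoch | _rest] ->
  payload =
    Jason.encode!(%{
      legacy_user_id: legacy_user_id,
      category_id: category_id,
      epoch: epoch
    })

  AMQP.Basic.publish(chan, "imports", "legacy_entries", payload)
end)
|> Stream.run()
```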

For each entry, we need to fetch and build a few associations to other records: the account that the user is linked to, and a category stored in another table. The category is simple enough - we have a category_id in our entry which acts as a foreign key to the category table - easy.

DB fetch count: 1

The other association is a little trickier: our entry provides us with a legacy_user_id (from the old system), so we need to fetch the user associated with the legacy id and then grab the account record that the user correlates to - two more DB fetches.

DB fetch count: 3

With 1m entries to process, that's 3 million database reads 😬. It's also worth noting that each user has around 20 entries associated with them.

The un-optimised way

The original code to fetch the data looked like this:

  defp category_by_legacy_id(legacy_category_id) do
    Repo.get_by(LogCategory, legacy_id: legacy_category_id)
  end

  defp account_by_user_id(user_id) do
    case Users.by_user_id(user_id) do
      {:ok, %{uuid: user_uuid}} -> Repo.get_by(Account, user_id: user_uuid)
      _ -> nil
    end
  end

We've got two functions: one fetches the category by its legacy category_id, and the other uses our Users context to look up the user (who lives in a different database we don't have direct access to - hence why we can't just do a Repo.get), then fetches the associated account. The two records are then used to enrich the entry log changeset using put_assoc.
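The enrichment step might look something like this - a hedged sketch where `EntryLog`, the changeset function, and the association names are assumptions based on the description above:

```elixir
# Build the changeset for one imported entry, attaching the two
# records we fetched (hypothetical schema and association names).
defp build_entry_changeset(entry_params, category, account) do
  %EntryLog{}
  |> EntryLog.changeset(entry_params)
  |> Ecto.Changeset.put_assoc(:category, category)
  |> Ecto.Changeset.put_assoc(:account, account)
end
```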

This wasn't too bad - it worked, but a few benchmarks showed that importing the data took around 23 minutes. A long-running database import with 1m user records is scary: things can go wrong, and the faster they get sorted, the faster the entire thing is over. This import is also going to happen in the dead of night to ensure minimal user impact in case something bad does occur, and I don't want anyone to be awake longer than they have to be at 4 am. It's also a matter of pride - this code can definitely be improved upon.

Let's take a look at how ETS comes into play.

Optimisation Part I

Where to begin?

Well, considering the categories we're associating are just a few simple Elixir structs with a name and an ID, why should we hit the database every time to fetch them? Storing them in memory is perfect for what we want, and ETS has an option we can take advantage of: read_concurrency - a way to tell our tables to optimise for concurrent read operations (see the :ets.new/2 docs for more).

Let's start by creating our in-memory tables and priming that cache with the information it needs to be useful.

  defp fetch_and_cache_categories_in_ets do
    :ets.new(:categories, [:named_table, read_concurrency: true])

    # cache categories in ETS
    Repo.all(LogCategory)
    |> Enum.each(fn category ->
      :ets.insert(:categories, {category.legacy_id, category})
    end)
  end

Not bad, we've created a new ETS table, named it :categories, told it to optimise for concurrent read operations and then we've stuck our 10 or so categories in there right when the script for importing starts up.

We need to edit our code for actually fetching the data too now:

  defp category_by_legacy_id(legacy_category_id) do
    [{_, category}] = :ets.lookup(:categories, legacy_category_id)
    category
  end

A simple switch-up - we know the categories will always be there, because those are the only possible values, so we don't need a fallback to the database.

What was the result?

20 minutes - we shaved off 3 minutes. Not too bad for such a small fix, right? We've just eliminated 1 million database fetches from our import with a few minor tweaks.

But we can do better.

Optimisation Part II

The user lookup and account fetch are clearly what's slowing us down here: each call depends on the previous database fetch succeeding, and we're doing it across databases. That doesn't mean we can't use caching to make our lives easier, though.

Let's start by making a new ETS table:

  defp create_account_ets_cache do
    :ets.new(:otj_accounts, [:named_table, read_concurrency: true])
  end

We can't prime our cache before anything has started this time, sadly, but we can check our ETS cache for a record and fall back to a database fetch if we don't get a hit - when we fetch from the database, we can just pop the result into our cache for the next time we come across the same legacy_user_id. Since each user has around 20 entries, the vast majority of lookups should hit the cache.

  defp account_by_user_id(user_id, :db) do
    case Users.by_user_id(user_id) do
      {:ok, %{uuid: user_uuid}} ->
        account = Repo.get_by(Account, user_id: user_uuid)
        :ets.insert(:otj_accounts, {user_id, account})
        account

      _ ->
        nil
    end
  end

  defp account_by_user_id(user_id, :ets) do
    account = :ets.lookup(:otj_accounts, user_id)

    case account do
      [] ->
        account_by_user_id(user_id, :db)

      [{_, account}] ->
        account
    end
  end

In our import code, we use the :ets version of our function, which will call the :db version if it needs to.
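Tied together, the message handler might look roughly like this - `handle_entry/1` and `insert_entry/3` are hypothetical names for illustration:

```elixir
defp handle_entry(entry) do
  # Categories were primed at startup, so this is a pure ETS read.
  category = category_by_legacy_id(entry.category_id)

  # Checks :otj_accounts first and only falls through to the Users
  # context and the database on a cache miss.
  account = account_by_user_id(entry.legacy_user_id, :ets)

  insert_entry(entry, category, account)
end
```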

So what has all of this done to our import?

A slight tune-up - 278% faster

The new version of the code finishes importing all 1,200,000 records in 8 minutes 20 seconds - the average RabbitMQ message acknowledgement rate went from 870 ack/s to 3,200 ack/s.

We just increased the rate of our import by 278% with a git diff that comprises about 35 lines of code.

Ahh, Elixir 💖

I hope you found this post useful, subscribe to my Substack below for similar content and follow me on Twitter for more Elixir (and general programming) tips.

If you're building a side project I'd also encourage you to take a look at madepublic.io!

fin

Sign up to my substack to be emailed about new posts