Fight flakiness and speed up your Cypress tests. Rails 6.1 horizontal sharding & data cleaning mechanism (Part 2)

In Part 1 we covered how a rails application can become multi-tenant with a separate database per tenant for the purpose of running cypress in parallel groups in complete isolation of each other.

Data cleaning mechanism

If we zoom in on a single parallel test group we need to ensure that each of its tests starts from a certain database state(usually one represented by the initial data coming from seeds or fixtures), execute their scenario and at the beginning of the next test the initial database state is restored, guaranteeing that data created by one test won’t interefere with the execution of another.

We need some form of equivalent of Postgre’s pg_restore encapsulated in Rails dsl.

Before each parallel group is run we will:

Seed the corresponding cypress database with initial data.
Dump the database in the form of INSERT statements (so that the dump contains the initial data, in .sql file)

Then before each test scenario is run we will:

Restore the cypress database using the .sql dump

Restoring the database includes emptying it first. One way of doing that is emptying its tables using TRUNCATE. If so, before executing the ‘INSERT’ statements you may need to disable foreign key constraints and enable them afterwords so that the order of the INSERTs does not lead to errors. That may be a tricky job as it might involve additional privilages, workarounds and more raw SQL lines.

So the alternative I came up with is appending the inserts in structure.sql right before the constraints are defined and after table definitions and then simply realod the structure.sql. That way INSERTS will be inserted regardless of their order without disabling any constraints, and restoring the database becomes simply reloading the structure.sql.

Let’s look at the service that will help us manage that:

class CypressDumpManager
  EXCLUDE_TABLES = ['ar_internal_metadata', 'schema_migrations'].freeze

  class NonCypressShardError < StandardError
  end

  def initialize
    @connection = ActiveRecord::Base.connection
  end

  def generate_dump
    verify_cypress_shard
    create_inserts_file
    create_structure_sql_with_data
  end

  def restore_from_dump
    verify_cypress_shard

    @connection.execute('DROP SCHEMA public CASCADE;CREATE SCHEMA public;')
    @connection.execute(File.read(structure_with_data))
    @connection.reconnect!
    @connection.tables.each { |table| @connection.reset_pk_sequence!(table) }

    ActiveRecord::Base.connection.pool.disconnect!
  end

  private
    # to be defined later
    def create_inserts_file; end
    def create_structure_sql_with_data; end
    def transform_values; end

    def current_shard; end
    def inserts_file_name; end
    def structure_with_data; end
    def verify_cypress_shard; end
end

generate_dump calls two methods:

create_inserts - creates the the dump file (.sql) containing INSERT statements:

....
`INSERT INTO {table1} (col1, col2) VALUES (val1, val2)`
`INSERT INTO {table2} (col1, col2, col3) VALUES (val1, val2, val3)`
...

create_structure_sql_with_data - creates a copy of structure.sql and appends the content of the dump file(generated by create_inserts) in it after table definitions and right before any constraints or column alterations are made - indicated by ALTER TABLE ONLY. So, we append our dump before the first match of that statement.

restore_from_dump - when called it drops public schema and reloads it using the new structure.sql that also has the initial data via INSERT statments.

Let’s define the empty methods:

def create_inserts_file
  tables = @connection.tables - EXCLUDE_TABLES

  File.open(inserts_file_name, 'w+') do |file|
    tables.each do |table|
      model = table.singularize.classify.constantize
      columns = model.new.attributes.keys
      model.all.each do |record|
        values = transform_values(record)
        row = "INSERT INTO public.#{table} (#{columns.join(', ')}) VALUES (#{values});"
        file.puts(row)
      end
    end
  end
end

def create_structure_sql_with_data
  lines = File.readlines('db/structure.sql')
  first_match_detected = false
  File.open(structure_with_data, 'w+') do |file|
    lines.each do |line|
      if !first_match_detected && /^ALTER TABLE ONLY/.match?(line)
        File.readlines(inserts_file_name).each { |insert| file.puts insert }
        first_match_detected = true
      end
      file.puts line
    end
  end
end

To create INSERT rows based on AR records, we use transform_values to make column values insertable.

def transform_values(record)
  record.attributes.values.map do |value|
    case value
    when Time, Date, String
      @connection.quote(value)
    when NilClass
      'NULL'
    when ActiveSupport::HashWithIndifferentAccess, Hash
      "'#{value.to_json}'"
    when Array
      "'{#{value.join(',')}}'"
    else
      value
    end
  end.join(', ')
end

Let’s define the rest of the empty methods:

def current_shard
  @current_shard ||= ActiveRecord::Base.current_shard
end

def inserts_file_name
  "/tmp/#{current_shard}_dump.sql"
end

def structure_with_data
  "/tmp/#{current_shard}_structure_with_data.sql"
end

def verify_cypress_shard
  unless ApplicationRecord::CYPRESS_SHARDS[current_shard]
    raise NonCypressShardError, "Cannot restore a non-cypress database: #{current_shard}"
  end
end

ActiveRecord::Base.current_shard gives us the database we’re currently connected to so before each restore we verify that we’re actually working with a test database.

Both files that are generated per shard are saved in /tmp folder of the server hosting the Rails app. That way after restore file is generated, the database it relates to can simply be restored via endpoint call.

The set-up

With CypressDumpManager we can call:

CypressDumpManager.new.generate_dump
CypressDumpManager.new.restore_from_dump

to generate a database dump and later use it for restore.

To see how evrything we covered in Part 1 and Part 2 is put into practice let’s have the following context:

Rails application in api mode as backend.
React/Vue/Angular application as frontend.
Cypress as E2E testing framework.

Now, in Rails let’s:

Configure database.yml with 3 cypress shards that will be used for testing.
Add CypressConnection middleware so each requests opens a connection to the right test db.
Define /cypress/seeds endpoint which seeds with initial testing data and then creates a dump:

# let's pretend YourSeedGenerator creates the initial data
YourSeedGenerator.generate
CypressDumpManager.new.generate_dump

Define /cypress/restore endpoint which restores the corresponding cypress database via:

CypressDumpManager.new.restore_from_dump

Let’s configure Cypress to:

Run each parallel process(test group) using separate subdomain and before starting /cypress/seeds is called.
call /cypress/restore before each test (in a beforeEach block)

With those steps you should have an example of an multi-db architecture that allows running cypress in parallel with parallel processes being isolated between each other in terms of database operations and having a data cleaning mechanism that guarantees same starting point for each test.

⚠️ Caveats

Such set-up comes with some drawbacks:

Working with threads becomes trickier. When creating a new thread via Thread.new, it opens a new database connection to primary database by default. So if you want to have AR code in Thread.new you must ensure a connection to the right one is made. Let’s say a test from cypress-d2 is executing and so there’s a connection to cypress-db2, then if you have something like:

Thread.new { Student.first.update name: 'John' }

Student.first will reach in the primary database and not cypress-db2.

If Rails app communicates with third-party services, they might also need to be adapted to the multi-db setup. For example if those services call interal endpoints, calls may have to start including additional data in the headers or body of the requests to indicate which database should be used.
If there’re other processes alongside Rails server, they may also need to be adapted. For example, a background job processor.
In case of using sharding for optimizations like running cypress in parallel, increasing the parallel processes means adding databases manually.

Moving to a multiple database setup should be approached carefully after weighing pros and cons with respect to the application needs and long-term development.

Really hoping this example will be helpful to you and inspire ideas for applying horizontal sharding in other use cases 😉

Fight flakiness and speed up your Cypress tests. Rails 6.1 horizontal sharding & data cleaning mechanism (Part 2)

Data cleaning mechanism

The set-up

⚠️ Caveats

About Mihail Panayotov

Comments

Read Next

Fight flakiness and speed up your Cypress tests. Rails 6.1 horizontal sharding & data cleaning mechanism (Part 1)

Developer bites with Maarten Parmentier. What makes one a 10x engineer, his view on Ember.js and Vue.js (+ more)

Data cleaning mechanism

The set-up

⚠️ Caveats

About Mihail Panayotov

Comments

Subscribe to my newsletter

Read Next