As a project evolves, covering its behaviour with Cypress starts facing hurdles. As the number of tests grows, you want the whole suite to finish faster, you fight flakiness, and you deal with optimization problems from an infrastructural point of view.
Here I want to focus on a problem that already has well-established solutions for unit tests but not for E2E tests: cleaning and seeding your test database so that each of your Cypress tests starts from a clean state.
The set-up I'll show you involves running Cypress in parallel against a Rails application that uses multiple cypress/test databases, utilizing Rails 6.1's horizontal sharding functionality with Postgres as the database engine. I'll go over the decisions I made along the way to arrive at this set-up.
The set-up assumes a good understanding of how transactions, threads and processes work in the context of database connections within a Rails application, so before proceeding with the details of the actual set-up I want to go over some fundamentals.
Nested transactions in Rails
Since most database engines do not support nested transactions, when we place a transaction within another one, Rails realizes those via transaction savepoints. A savepoint marks a state of the database within a transaction that you can always roll back to. Hence, when you write nested transaction blocks in Rails, under the hood you actually get one SQL TRANSACTION with multiple SAVEPOINTs.
If you look at Rails core, when a transaction begins, it creates either a RealTransaction or a SavepointTransaction instance depending on whether you have a regular or a nested transaction.
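You can peek at those objects yourself. A quick sketch for a Rails console (current_transaction is internal API, so the exact class names may vary between Rails versions):

ActiveRecord::Base.transaction do
  # the outermost block is backed by a RealTransaction
  puts ActiveRecord::Base.connection.current_transaction.class
  ActiveRecord::Base.transaction(requires_new: true) do
    # a nested block with requires_new: true is backed by a SavepointTransaction
    puts ActiveRecord::Base.connection.current_transaction.class
  end
end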
Let's look at this example:
ActiveRecord::Base.transaction do
  Student.first.update(first_name: 'David')

  ActiveRecord::Base.transaction do
    Student.last.update(first_name: 'John')
    raise ActiveRecord::Rollback
  end
end
Now this might be confusing, but here, to our surprise, both the first and the last students' names will be updated and the rollback will be ignored. According to the official docs on nested transactions, all database statements in the nested transaction block become part of the parent transaction, handled by a single RealTransaction instance underneath, with no savepoints.
Hence, raising ActiveRecord::Rollback to trigger a ROLLBACK will not revert the operations within the parent transaction block. Since it's a special exception that is not re-raised, the nested transaction block will capture it; but because under the hood all database statements join the single SQL TRANSACTION (handled by RealTransaction) created for the parent transaction block, both updates will be committed.
In order to trigger a new transaction (think savepoint) for each nested block, there are two options:
# Option 1 - requires_new: true
ActiveRecord::Base.transaction do
  Student.first.update(first_name: 'David')

  ActiveRecord::Base.transaction(requires_new: true) do
    Student.last.update(first_name: 'John')
    raise ActiveRecord::Rollback
  end
end
requires_new: true forces a new savepoint for the transaction block.
or
# Option 2 - joinable: false
ActiveRecord::Base.transaction(joinable: false) do
  Student.first.update(first_name: 'David')

  ActiveRecord::Base.transaction do
    Student.last.update(first_name: 'John')
    raise ActiveRecord::Rollback
  end
end
joinable: false switches off the default behaviour of having database statements in nested transaction blocks join the parent transaction.
In both cases only the first student's name will be updated, as the nested transaction block results in a savepoint under the hood that is reverted as dictated by ActiveRecord::Rollback.
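If you want to see the savepoints with your own eyes, one way (a sketch for a Rails console) is to send the SQL log to stdout and re-run Option 1:

ActiveRecord::Base.logger = Logger.new($stdout)
# Re-running Option 1 should log roughly the following
# (exact statements vary by adapter and Rails version):
#   BEGIN
#   UPDATE "students" SET "first_name" = 'David' ...
#   SAVEPOINT active_record_1
#   UPDATE "students" SET "first_name" = 'John' ...
#   ROLLBACK TO SAVEPOINT active_record_1
#   COMMIT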
New threads and workers mean more database connections
In Rails each new thread obtains a database connection, so it's best to have your database connection pool limit equal to the number of threads you've configured Puma with. The default configuration for a new Rails project is 5 threads:
# puma.rb
max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
# workers ENV.fetch("WEB_CONCURRENCY") { 2 }
# database.yml
default: &default
  adapter: sqlite3
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  timeout: 5000
Apart from threads, in puma.rb you can also increase the number of workers. A worker is a separate OS process running its own instance of your Rails app, and in puma.rb it's controlled by WEB_CONCURRENCY by default.
Each worker uses threads, hence the maximum number of database connections that can be opened is WEB_CONCURRENCY * RAILS_MAX_THREADS. If you set 2 workers and 5 threads, you must ensure that your database engine can support 10 connections.
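Put together, a puma.rb with both knobs set could look like this minimal sketch, with the connection math spelled out:

# puma.rb
max_threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads max_threads_count, max_threads_count
workers ENV.fetch("WEB_CONCURRENCY") { 2 }
# Potential database connections: WEB_CONCURRENCY * RAILS_MAX_THREADS = 2 * 5 = 10.
# The pool in database.yml is per process, so 5 per worker is enough,
# but the database server itself must accept all 10.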
Alright, knowing all that, we can proceed with the actual set-up.
The Problem
Flakiness
To give more context, in a Rails project we had the following dilemma. We were using Cypress for E2E testing, and as the project evolved and the number of tests increased, we started running Cypress in parallel to cut the waiting time. Even though tests were written in a fairly well-isolated manner and tests in one parallel Cypress process were independent of tests in another, we did experience some flakiness - manageable at the time.
Data cleaning
The other constraint we had was related to cleaning up data created during a test. When two tests execute sequentially, the data created by the first test should be cleaned up/reverted so that it doesn't interfere with the second test. Over time it became obvious that this is hardly attainable when you have long scenarios that perform many interactions. So as the number of tests increased, that factor also contributed to the growing flakiness.
We needed a data cleaning mechanism.
Since each of our parallel Cypress processes was preceded by these two steps:
- Generate test seeds used as a base for Cypress tests
- Generate Cypress fixtures based on the test seeds, so that fixtures mirror test data in the database
the first approach we came up with for cleaning up data was comparing timestamps. Before each test we deleted all records created after the most recent created_at among all fixtures.
That was more or less a variation of what one would describe as a DELETE approach to restoring the database. It was a straightforward and quick solution which reduced flakiness, but only in the short run.
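For illustration, that cleanup boiled down to something like the following sketch (the method and variable names here are hypothetical, and it assumes models are eager-loaded so descendants sees them all):

def delete_records_created_after(latest_fixture_created_at)
  ApplicationRecord.descendants.reject(&:abstract_class?).each do |model|
    model.where('created_at > ?', latest_fixture_created_at).delete_all
  end
end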
Transactional approach
Over time, with flakiness becoming more frustrating, we felt the need for an equivalent of how DatabaseCleaner works for RSpec tests. That meant either a transaction or a truncation strategy.
The idea of keeping all interactions a Cypress test performs in a transaction that we can easily roll back at the beginning of each test was appealing. While researching whether it could actually work in our parallel set-up, we stumbled upon the cypress-rails gem, which made us more confident as it relies on a custom transactional mechanism as its data cleaning strategy.
Here's a huge constraint:
⚠️ Transactions assume one database connection, since database connections do not share transactions' state by default.
This Rails PR provides an option for making all threads that obtain connections from the same database connection pool share the same connection via connection_pool.lock_thread = true, which is also what cypress-rails relies on:
def begin_transaction
  @connections = gather_connections
  @connections.each do |connection|
    connection.begin_transaction joinable: false, _lazy: false
    connection.pool.lock_thread = true
  end

  @connection_subscriber = ActiveSupport::Notifications.subscribe("!connection.active_record") { |_, _, _, _, payload|
    ...
    if connection && !@connections.include?(connection)
      connection.begin_transaction joinable: false, _lazy: false
      connection.pool.lock_thread = true
      @connections << connection
    end
  }
  ...
end

def gather_connections
  ...
  # pool.connection retrieves the connection for the current thread
  ActiveRecord::Base.connection_handler.connection_pool_list.map(&:connection)
end
So when Cypress runs and tests start interacting with the Rails API, Puma spins up new threads, and when those threads try to establish database connections, they all get the same database connection.
Along with pool.lock_thread = true, a new transaction is started at the beginning of the Cypress launch and whenever a new connection is requested (the .subscribe part).
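To make the sharing concrete, here's an illustrative check relying on Rails 6.x pool internals: once the pool is locked to the current thread, other threads check out the very same connection object.

pool = ActiveRecord::Base.connection_pool
pool.lock_thread = true

main_connection = ActiveRecord::Base.connection
Thread.new do
  # without lock_thread this would typically be a different connection from the pool
  puts ActiveRecord::Base.connection.equal?(main_connection) # => true
end.join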
As a test runs and further transactions are begun on the same connection, each new transaction becomes a nested one, and because they're started with joinable: false, under the hood Rails builds one bulky SQL TRANSACTION with multiple SAVEPOINTs which can later be rolled back.
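The matching teardown then only has to roll everything back and unlock the pools, roughly like the following sketch (not the verbatim cypress-rails code):

def rollback_transaction
  ActiveSupport::Notifications.unsubscribe(@connection_subscriber) if @connection_subscriber
  @connections.each do |connection|
    connection.rollback_transaction if connection.transaction_open?
    connection.pool.lock_thread = false
  end
  @connections = []
end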
Transactional approach for cypress running in parallel
All good so far, but can this be applied when running Cypress in parallel groups?
A shared database connection obviously won't be sufficient, as parallel groups should not share each other's data. If you split Cypress tests into 3 groups and run them in parallel, you'll need 3 database connections, one dedicated to each group, plus logic that associates requests coming from a group with its dedicated connection.
Instead of caching database connections by Thread.current (which is what .lock_thread = true does), you'd likely have to cache them by a request's unique indicator, which could be a subdomain if, for example, you split test groups by subdomains.
Such an implementation seemed convoluted, error-prone and difficult to maintain, and a blocker for the whole transactional strategy, so we ended up not taking that route.
Even if we had built such an implementation, another hurdle would've been Puma workers. If you're running Cypress tests against an app with multiple Puma workers handling requests underneath, the workers' threads won't access the same database state, as they will have separate database connections.
While you're unlikely to run Puma with multiple workers in a test environment, that might not be the case in a staging or some other pre-production environment.
Hence, such an implementation would've also been resource-dependent.
Truncate approach
Ruling out transactions, we're left with cleaning up the database manually. Emptying all tables, essentially the whole database, is easy to picture when one process executes all tests, but not so much when running tests in parallel. We can't afford to reset the database at the beginning of a test in one parallel group while a test from another is still running.
Here's where the idea of multiple test databases serving Cypress test groups emerged. Each Cypress test group could use a dedicated test database in complete isolation from the other test groups and their databases. That way we can empty each test database at any time without worrying about the others, allowing true parallelism between test groups.
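Emptying one such database is cheap. A minimal sketch, assuming Rails 6.1, where the connection responds to truncate_tables:

connection = ActiveRecord::Base.connection
# keep Rails' bookkeeping tables intact, wipe everything else
tables = connection.tables - %w[schema_migrations ar_internal_metadata]
connection.truncate_tables(*tables)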
To link a Cypress test group with its dedicated database, we'll also run the groups under dedicated subdomains. Let's say you're executing your tests against https://myawesomeapp.com in 3 parallel processes; you can set up:
- https://cypress-db1.myawesomeapp.com
- https://cypress-db2.myawesomeapp.com
- https://cypress-db3.myawesomeapp.com
subdomains serving your Rails app (regardless of whether your app is server-side rendered HTML or completely client-side JS) and run a test group against each. You can think of each of those as a tenant in a multi-tenant application where each tenant's data is saved in its own database.
⚠️ This requires adapting your Cypress configuration, Rails configuration and the infrastructure in charge of running Cypress to such a multiple-subdomain set-up. Here we'll focus on the key points in the application's code only.
Horizontal sharding
Since 6.1, Rails supports horizontal sharding - database functionality that allows you to have multiple databases (shards) sharing the same structure:
# database.yml
default: &default
  adapter: postgresql
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  timeout: 5000

development:
  primary:
    <<: *default
    database: 'development'
  cypress-db1:
    <<: *default
    database: 'cypress-db1'
  cypress-db2:
    <<: *default
    database: 'cypress-db2'
  cypress-db3:
    <<: *default
    database: 'cypress-db3'
The cypress-db1, cypress-db2 and cypress-db3 databases have the same structure as the primary.
# config/environments/development.rb
config.x.cypress_shards = [
  'cypress-db1',
  'cypress-db2',
  'cypress-db3'
]
# application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  CYPRESS_SHARDS = Rails.application.config.x.cypress_shards.each_with_object({}) do |cypress_db, hash|
    hash[cypress_db.to_sym] = { writing: cypress_db.to_sym, reading: cypress_db.to_sym }
  end.freeze

  connects_to shards: {
    default: { writing: :primary, reading: :primary },
    **CYPRESS_SHARDS
  }
end
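Before wiring the switching into the request cycle, a quick console check of what connects_to enables (a sketch using the Student model from the earlier examples):

ActiveRecord::Base.connected_to(shard: :'cypress-db1', role: :writing) do
  Student.create!(first_name: 'Jane') # written to the cypress-db1 database
end
Student.count # back on the default shard (primary), where Jane is not counted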
Okay, with that database.yml and the changes in ApplicationRecord, we have enabled switching between shards in our application. Swapping happens via the connected_to method:
module Middlewares
  class CypressConnection
    def initialize(app)
      @app = app
    end

    def call(env)
      request = Rack::Request.new(env)
      subdomain = request.referer&.split('https://')&.second&.split('.')&.first
      is_cypress = Rails.application.config.x.cypress_shards.include?(subdomain)

      if is_cypress
        ActiveRecord::Base.connected_to(shard: subdomain.to_sym, role: :writing) do
          @app.call(env)
        end
      else
        @app.call(env)
      end
    end
  end
end
CypressConnection is a custom middleware responsible for connecting to the right Cypress database based on the subdomain the request comes from.
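The middleware still needs to be registered with the app. A minimal sketch, assuming the class above is loadable (e.g. required from lib or autoloaded):

# config/application.rb, inside the Application class
config.middleware.use Middlewares::CypressConnection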
With the configuration up to this point, you can split your Cypress tests into 3 groups and run the subdomain-isolated groups in parallel without worrying whether data created in one group will interfere with data created in another, as each group has a dedicated database to save its data in.
But there's still nothing in place to ensure the same database state before each test within a group. Let's go to Part 2.