Selecting users into A/B buckets is the core construct of any web-based experimentation platform. Yet building and designing an A/B selection algorithm is surprisingly difficult, and the process of bucketing users is far more complex than just flipping a coin or spinning a roulette wheel.

Ensuring a persistent experience across multiple devices and various user states is hard. Depending on your requirements, what may start out as a simple selection algorithm can quickly devolve into a complex multi-key user identity system with real-time operational requirements.

In this post, we’ll dive into the various selection schemes that we’ve used here at Simon as well as in other experimentation platforms that we’ve built in the past.

Table Stakes

At a minimum, a binning algorithm must bucket deterministically based on a user’s identifier. Given an experiment Elmo and user Urkel, a function bucket(Elmo, Urkel) must always evaluate to the same bin. Suppose Elmo tests changes to search result rankings: if Urkel is placed into variant B, he must remain in variant B when browsing subsequent search result pages; otherwise his experience will be inconsistent and the test will be invalid.

If you only need to experiment across logged-in users with known emails, then you can perform your A/B selection in an entirely stateless fashion using a basic hashing scheme. Here’s a sketch:

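import hashlib

def hash_bin(experiment_name, user_email):
    # Hash the experiment name and email into a deterministic token
    digest = hashlib.md5(f"{experiment_name}-{user_email}".encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and reduce it to one of 100 bins
    bin_number = int(digest, 16) % 100
    # Two variants with a 50/50 split
    bucket = "A" if bin_number < 50 else "B"
    return bucket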

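The function first combines the experiment name and the user’s email and hashes them, deterministically forming a token that drives the randomization process. It then converts the hash to an integer and reduces it modulo 100 to get a bin; in a language with fixed-width integers, the 128-bit MD5 digest will not fit in a native integer and must be truncated first. Finally, the bin is mapped to a bucket, assuming two variants with a 50/50 split. Because the experiment name is part of the hashed token, the same user can land in different bins in different experiments.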

The above scheme works equally well when applied to any set of users who are identified by a single, consistent type of identifier, e.g. emails, device ids, or cookies.
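
The same idea extends beyond a 50/50 split. Here is a minimal Python sketch, assuming a hypothetical weighted_bucket function and a weights list of (variant, percent) pairs; neither name comes from the scheme above. It hashes the experiment name together with any single identifier and walks the cumulative weights to pick a variant.

```python
import hashlib

def weighted_bucket(experiment_name, identifier, weights):
    """Deterministically map an identifier to a variant.

    `weights` is a list of (variant, percent) pairs summing to 100.
    This signature is an illustrative assumption.
    """
    if sum(percent for _, percent in weights) != 100:
        raise ValueError("weights must sum to 100")
    token = f"{experiment_name}-{identifier}".encode("utf-8")
    bin_number = int(hashlib.md5(token).hexdigest(), 16) % 100
    upper = 0
    for variant, percent in weights:
        upper += percent
        if bin_number < upper:
            return variant
    raise AssertionError("unreachable when weights sum to 100")

split = [("A", 50), ("B", 50)]

# The same inputs always produce the same bucket.
first = weighted_bucket("Elmo", "urkel@example.com", split)
second = weighted_bucket("Elmo", "urkel@example.com", split)

# Across many made-up cookie ids the split should land near 50/50.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[weighted_bucket("Elmo", f"cookie-{i}", split)] += 1
```

Passing [("A", 90), ("B", 10)] instead would put roughly one identifier in ten into variant B without changing anything else.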

If, however, your experimental system requires binning across multiple forms of identity - say, both logged-out users represented by cookies and logged-in users represented by user ids - then things get much more complicated.

The Impossibility of Perfect Consistency

Before diving into technical details on how to deal with multiple forms of identity, let’s dig into a simple example that shows the inherent limits of the problem we’re solving.

Day 1: Urkel surfs over to your homepage on his laptop, logs into the site, and is binned into variant B.

Urkel uses his laptop

Day 2: Urkel surfs over to your homepage from his iPad but doesn’t log in and is binned into variant A; he spends another 5 minutes browsing the site and then eventually logs in.

Urkel uses his iPad

In this case, we’re faced with two options on how to treat Urkel in his day 2 experience:

Option 1: Switch Urkel back to bin B on login. This will maintain consistency across his logged in experience, yet will result in a potentially awkward user experience when he abruptly switches buckets from one screen to the next.

Option 1 for bucketing Urkel

Option 2: Keep Urkel in bin A on login. This will ensure a consistent user experience across day 2, but the resulting statistics for the test may be skewed since Urkel saw both variants A and B in his logged in state.

Option 2 for bucketing Urkel

Of course, each option produces its own form of inconsistency; neither is ideal.
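
The two options amount to a policy applied at the moment of login. The short Python sketch below makes the trade-off concrete; resolve_on_login and the policy names "account_wins" and "session_wins" are hypothetical labels chosen for this illustration.

```python
def resolve_on_login(session_bucket, account_bucket, policy):
    """Pick the bucket a user sees right after logging in.

    session_bucket: bucket assigned to the logged-out session (cookie).
    account_bucket: bucket previously assigned to the logged-in account.
    policy: "account_wins" (Option 1) or "session_wins" (Option 2).
    All names here are illustrative assumptions.
    """
    if policy == "account_wins":
        return account_bucket
    if policy == "session_wins":
        return session_bucket
    raise ValueError(f"unknown policy: {policy}")

# Urkel: bucket B on his account (day 1), bucket A on his iPad cookie (day 2).
day2_before_login = "A"
for policy in ("account_wins", "session_wins"):
    after_login = resolve_on_login("A", "B", policy)
    # Did the experience change in the middle of the day 2 session?
    switched_mid_session = after_login != day2_before_login
    # Variants seen while logged in: day 1 plus day 2 after login.
    logged_in_buckets = sorted({"B", after_login})
    print(policy, after_login, switched_mid_session, logged_in_buckets)
```

Under "account_wins" Urkel switches variants mid-session but only ever sees B while logged in; under "session_wins" his session is stable but his logged-in exposure covers both A and B.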

Minimizing Inconsistencies

Option 1 at least limits the inconsistency going forward: if Urkel revisits the site on his iPad on day 3, he remains in variant B in both logged-in and logged-out contexts.

To implement this, we need to record state. When Urkel logs in on day 2, we need to explicitly record that we’re going to “override” the simple hash function that assigned him into bucket A during his logged out iPad browsing. When Urkel returns on day 3, we now need to check the database first to see if he’s a known subject with a pre-existing bin.

The logic should look something like this, where lookup_bin and assign_bin stand in for reads and writes against a persistent store:

def stateful_bin(experiment_name, user):
    # Look up a stored bin for the current identifier scope (email or cookie)
    bin = lookup_bin(experiment_name, user.identifier)

    if bin is None:
        # No stored bin yet: fall back to the stateless hash
        bin = hash_bin(experiment_name, user.identifier)
        # Save the bin across all available user identifiers
        assign_bin(experiment_name, user, bin)

    return bin

Unfortunately, the cost of implementing this can be quite high. First, the function interleaves reads and writes, so it needs a persistent, random-access, and ideally transactional data store, e.g. MySQL or Redis. Second, if you’re supporting both logged-in and logged-out users, this database can grow quite large, since in many contexts the number of logged-out users is an order of magnitude larger than the number of registered or logged-in users. And finally, whereas a deterministic hashing scheme requires only a handful of machine instructions and a fraction of a microsecond of compute time, a database lookup over a network connection can take milliseconds and can have a material impact on overall system performance.
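
To make the override on login concrete, here is a self-contained Python sketch that replays Urkel’s three days. It is only one possible reading of the stateful approach: BinStore is a hypothetical in-memory stand-in for the persistent store, and passing a list of identifiers with the strongest first (user id before cookie) is an assumption made for this example.

```python
import hashlib

def hash_bin(experiment_name, identifier):
    token = f"{experiment_name}-{identifier}".encode("utf-8")
    return "A" if int(hashlib.md5(token).hexdigest(), 16) % 100 < 50 else "B"

class BinStore:
    """In-memory stand-in for a persistent store (hypothetical)."""

    def __init__(self):
        self._bins = {}  # (experiment_name, identifier) -> bucket

    def lookup(self, experiment_name, identifier):
        return self._bins.get((experiment_name, identifier))

    def assign(self, experiment_name, identifiers, bucket):
        for identifier in identifiers:
            self._bins[(experiment_name, identifier)] = bucket

def stateful_bin(store, experiment_name, identifiers):
    """`identifiers` holds every identifier known for this request,
    strongest first (e.g. user id before cookie)."""
    bucket = None
    for identifier in identifiers:
        bucket = store.lookup(experiment_name, identifier)
        if bucket is not None:
            break
    if bucket is None:
        bucket = hash_bin(experiment_name, identifiers[0])
    # Write to every identifier so a login overrides the cookie's earlier bin.
    store.assign(experiment_name, identifiers, bucket)
    return bucket

store = BinStore()
account = "user:urkel"

# Day 1: logged in on the laptop.
day1 = stateful_bin(store, "Elmo", [account, "cookie:laptop"])

# Pick an iPad cookie whose stateless hash disagrees with the account's bin,
# to reproduce the conflict from the example.
ipad = next(f"cookie:ipad-{i}" for i in range(1000)
            if hash_bin("Elmo", f"cookie:ipad-{i}") != day1)

# Day 2: logged out on the iPad, then logged in.
day2_logged_out = stateful_bin(store, "Elmo", [ipad])
day2_logged_in = stateful_bin(store, "Elmo", [account, ipad])

# Day 3: logged out on the iPad again.
day3_logged_out = stateful_bin(store, "Elmo", [ipad])
```

After the day 2 login, the iPad cookie carries the account’s bin, so the day 3 logged-out visit matches day 1. This sketch writes on every call for simplicity; a real system would likely write only when an assignment changes.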

Other Considerations and Multi-key Identity

At its core, this is a problem of identity management. Users can be represented via multiple logged-out states, various customer identifiers, or different marketing contexts from email to phone number to mailing address.

The state-based method outlined above is akin to lazily building an identity management repository as users are added into your test. One could imagine another approach in which identity management is dealt with independently and buckets are assigned to a user as a whole instead of to their individual identifiers one at a time.

If the problem is approached as one of identity management first, then you can preemptively enforce even better consistency. For example, in the case of Urkel’s day 1 and day 2 experience above, if his logged-out identities had been known beforehand, he could have been pre-allocated a consistent bucket across all of his devices. For an experiment such as a 50% promotional offer, this can be critically important.
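
As a rough illustration of the identity-first approach, the Python sketch below resolves every identifier to one canonical user key before hashing. IdentityIndex, link, resolve, and identity_bin are hypothetical names for this example, and the in-memory dictionary stands in for whatever identity management system is actually in place.

```python
import hashlib

class IdentityIndex:
    """Maps identifiers (cookies, emails, device ids) to one canonical user key.

    Hypothetical in-memory stand-in for an identity management service.
    """

    def __init__(self):
        self._canonical = {}

    def link(self, canonical_id, *identifiers):
        for identifier in identifiers:
            self._canonical[identifier] = canonical_id

    def resolve(self, identifier):
        # Unknown identifiers fall back to themselves.
        return self._canonical.get(identifier, identifier)

def identity_bin(index, experiment_name, identifier):
    # Hash the canonical key, not the raw identifier, so that all of a
    # user's known identifiers land in the same bucket.
    key = index.resolve(identifier)
    token = f"{experiment_name}-{key}".encode("utf-8")
    return "A" if int(hashlib.md5(token).hexdigest(), 16) % 100 < 50 else "B"

index = IdentityIndex()
index.link("user:urkel", "urkel@example.com", "cookie:laptop", "cookie:ipad")

buckets = {
    identity_bin(index, "Elmo", identifier)
    for identifier in ("urkel@example.com", "cookie:laptop", "cookie:ipad")
}
```

Here the bucketing itself stays stateless; the state lives in the identity index, which has to be populated before the user’s first visit on each device for the assignment to be consistent from the start.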

Conclusions

Collectively, we’ve designed and built three separate A/B testing platforms: at our previous startup, at Etsy, and at Simon. In each of these contexts, the requirements and infrastructure were quite different.

At Etsy, we got a very long way with simple hash-based bucketing that keyed off of either logged out cookies or logged in user ids, but not both. This was effective in managing a wide array of tests conducted across many tens of millions of unique devices per month.

At Simon, our testing problems are more diverse and multi-channeled, and we’ve built significant infrastructure around multi-key customer identification to support a wider array of use cases.

The two solutions sit at opposite ends of the complexity curve. Neither is perfect, and results certainly vary from company to company and from experiment to experiment. The goal here isn’t to eliminate mistakes, but rather to set up your experimental context to minimize them.

Stay tuned for future blog posts as we dive into more detail around this and related challenges surrounding user identity management.