Experimentation Lab/Automated analysis of experiments

This guide helps you define a metric (if it does not yet exist in the Metrics Catalog) and register a (configured) experiment for automated analysis.

Diagram showing architecture of the MVP of the system for automated analysis of experiments conducted with Wikimedia Experimentation Lab.

For now, the Metrics Catalog and registration of experiments for automated analysis are separate configurations from configuring experiments in xLab (formerly MPIC), and the reporting is done separately via Superset. We are considering consolidating the user experience into xLab UI as a potential hypothesis for FY25–26 SDS 2.1 KR.

Prerequisites

For experiments using the edge unique identifier for enrollment, the collected data must include the performer_is_logged_in contextual attribute.

This guide assumes that you have created a measurement plan (a source of truth describing what metrics you're planning to measure and why) and an instrumentation specification (a source of truth describing how you're planning to measure those metrics and what events you will need to collect).

Please refer to the guide to conducting an experiment for help with this step.

This guide also assumes that you have implemented the instrumentation specification and that the experiment has been configured. That is, mpic.wikimedia.org/api/v1/experiments?format=analytics includes the experiment.

Development environment

Please refer to these instructions on setting up your environment for development and testing.

Define metric

Metrics are defined in metrics_catalog.yaml and each entry must include the following:

name
The unique name by which the metric can be referred to in the experiment registry.
type
"mean" metric measures some numeric outcome, usually calculated with SUM or COUNT (e.g. pageviews, edits, session length)
"proportion" metric measures a binary outcome with a success or failure state (e.g. clicked through, retained, registered)
"prepared" is for advanced usage; refer to guide on prepared metrics
description
Descriptions can include links to additional documentation and to instrumentation source code.
query_template
Parameterized query used to calculate the metric. See below for details.
business_data_steward
Who stewards the definition and business logic, usually someone on the Product Analytics team.
technical_data_steward
Who stewards the technical implementation of the metric, usually someone on the product team that instrumented the metric.

Anyone in Product Analytics can review and merge MRs which define metrics for use in experiment analysis. Ping the team via @product-analytics-team in the #working-with-data channel in WMF Slack.

Query templates

The parameterized query in query_template can be as simple or as complex as you need it to be, and it can include CTEs, but in the end it should yield 3 columns:

  • variation: the group to which the subject was assigned
    • "control" for the control group
    • "treatment" for the other group, regardless of the name in the experiment
  • subject_id – see subject ID in experiment schema fragment
  • outcome: aggregated for each subject
    • For "mean" type metrics, outcome is any number (e.g. calculated with SUM or COUNT)
    • For "proportion" type metrics, outcome is 0 or 1; see example below
      • 0 for failure (e.g. user has not clicked on link)
      • 1 for success (e.g. user has clicked on link)

Query FROM {table}, which will be filled in by the analysis job using the table from the experiment's entry in the registry (see below).

For brevity, the WHERE clause in the query should include AND {where_boilerplate}, which will be replaced with:[1]

experiment.coordinator = 'xLab'
AND experiment.enrolled = '{experiment_name}'
AND CONCAT(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '{start_date}'
AND CONCAT(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '{end_date}'

This means that you can also refer to {start_date} and {end_date} (the start and end dates of the experiment, respectively) if your metric requires it. For example, if you need to limit a metric to a cohort of users from the first 3 days of the experiment, your query can use the following:

TO_TIMESTAMP('{start_date}') + INTERVAL 72 HOUR
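
For instance, here is a minimal sketch of a pageviews-style metric restricted to subjects whose first matching event falls within the first 72 hours of the experiment. The 'page-visit' action value and the meta.dt event timestamp field are assumptions for illustration; adapt them to your instrumentation:

SELECT
  IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
  experiment.subject_id,
  COUNT(1) AS outcome
FROM {table}
WHERE action = 'page-visit'
  AND {where_boilerplate}
GROUP BY 1, 2
-- keep only subjects whose earliest matching event is within the first 3 days
HAVING MIN(TO_TIMESTAMP(meta.dt)) < TO_TIMESTAMP('{start_date}') + INTERVAL 72 HOUR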

Example queries

Mean-type metric

Suppose we have an experiment to increase each user's page visits. The metric could have a query like:

SELECT
  IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
  experiment.subject_id,
  COUNT(1) AS outcome
FROM {table}
WHERE action = 'page-visit'
  AND {where_boilerplate}
GROUP BY 1, 2

If you have a proportion measured at the subject level, such as:

  • Revert rate: proportion of edits published by subject which are then reverted
  • Per-user clickthrough rate: proportion of impressions by each subject that resulted in a click

then you would still define it as a "mean" type metric, because the group-level measurement is the average of the per-subject proportions.
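
For example, here is a minimal sketch of a per-user clickthrough rate defined as a "mean" type metric. The 'impression' and 'click' action values are assumptions for illustration:

SELECT
  IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
  experiment.subject_id,
  -- per-subject proportion of impressions that resulted in a click
  SUM(IF(action = 'click', 1, 0)) / SUM(IF(action = 'impression', 1, 0)) AS outcome
FROM {table}
WHERE action IN ('impression', 'click')
  AND {where_boilerplate}
GROUP BY 1, 2
-- exclude subjects without any impressions, to avoid dividing by zero
HAVING SUM(IF(action = 'impression', 1, 0)) > 0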

Proportion-type metric

Suppose we have an experiment to increase the proportion of users who click on some element on the page. The metric could have a query like:

SELECT
  IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
  experiment.subject_id,
  CAST(SUM(IF(action = 'click', 1, 0)) > 0 AS INT) AS outcome
FROM {table}
WHERE action IN('impression', 'click')
  AND {where_boilerplate}
GROUP BY 1, 2

Register experiment

You must register your experiment for analysis at least 1 day before the experiment is scheduled to end for it to be automatically analyzed. We recommend that you register for analysis shortly after configuring the experiment in xLab (https://mpic.wikimedia.org/) so that the system can learn about your experiment in time and start analyzing it when the experiment launches.

Registering an experiment for analysis means adding an entry to experiments_registry.yaml.

Each entry must have the following:

  • enrolled: The machine-readable name of the experiment, as used in the instrumentation and when configuring the experiment in MPIC/xLab. This is what we would find in experiment.enrolled in the event data.
  • table: Where the data for your experiment lives. Your event data lands in a Hive table with the same name as the event stream, with dots and dashes replaced with underscores (a quick sanity-check query is sketched after this list).
  • metrics: Names of metrics as defined in the Metrics Catalog (metrics_catalog.yaml). CI will check that at least one metric is listed and that every listed metric is defined in the catalog.
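
As a quick sanity check before registering, you can confirm that enrollment events for your experiment are landing in the expected table. In the sketch below, the table name (including its database) and the experiment name are placeholders; the columns follow the experiment schema fragment used elsewhere on this page:

SELECT
  experiment.assigned,
  COUNT(DISTINCT experiment.subject_id) AS subjects
FROM event.your_stream_table            -- placeholder: the Hive table named after your event stream
WHERE experiment.coordinator = 'xLab'
  AND experiment.enrolled = 'your-experiment-name'  -- placeholder: the enrolled name
  AND year = 2025 AND month = 1 AND day = 15        -- a recent partition
GROUP BY 1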

Defining new metrics and registering the experiment for analysis can be done together in the same merge request (example).

Anyone in Product Analytics can review and merge MRs which register experiments for analysis. Ping the team via @product-analytics-team in the #working-with-data channel in WMF Slack.

View experiment results

To access the Superset dashboard with the experiment results you need to (1) have access to Superset, and (2) be a member of the analytics-privatedata-users group. If you cannot access Superset at all, refer to these instructions. If you are receiving errors about permissions, you will need to request SSH-less membership in analytics-privatedata-users. See Data Platform/Data access for instructions on requesting access and T393409 for an example of such a request.

If the experiment has been registered for analysis before or while it was active, it is analyzed by our system. Each analyzed experiment's results are then explorable at https://superset.wikimedia.org/superset/dashboard/experiment-analytics/.

Use Superset's filters sidebar (on the left) to select:

  • Experiment: search for experiments by name
  • Difference:
    • Relative (default) is defined as: (treatment - control)/control
    • Absolute is defined as: treatment - control

User segmentation

This version has limited user segmentation capability. For experiments conducted on all user traffic, you can filter on User authentication status, which includes the following segments:

  • All: analysis used data collected from all subjects in the experiment
  • Logged-in only: analysis used data collected only from subjects that were logged in at the time of data collection
  • Logged-out only: analysis used data collected only from subjects that were not logged in at the time of data collection

For experiments conducted on logged-in users, Logged-in only is the only user segment available.

Difference types

Relative change is the unit-less difference between the treatment and the control relative to the control. We report relative change as a percentage.

Absolute difference, on the other hand, is not scaled by a reference/starting value; it retains the unit of what is being compared. Absolute differences of percentages (such as clickthrough rates) are expressed in percentage points (pp).

For example:

Difference type   Control group   Treatment group   Difference
Relative          10%             11%               +10%
Absolute          10%             11%               +1pp
Relative          12%             16%               +33.3%
Absolute          12%             16%               +4pp
Relative          10              5                 -50%
Absolute          10              5                 -5

Analysis

There is no way to include error bars in Superset's bar chart showing relative changes by metric. Please review the statistical analysis table for credible/confidence intervals (CIs) to get a sense of how much uncertainty there is about these estimates of lift.

There is not one single signal to use for making a decision; you need to look at the set of signals together as a whole. Even if the p-value is greater than 0.05 and the 95% credible interval includes 0, it may include it only barely. If the vast majority – almost all – of the interval is to the right of 0, that strongly suggests that the effect is positive. In that case you might look at the risk to see the loss you can expect if the change really is negative but you choose to go with it anyway.

Experiment health

We automatically check for a 50/50 split between the groups' sample sizes using a standard chi-squared test for Sample Ratio Mismatch. In practice the groups will not be exactly the same size due to randomness, but this check tests whether the split is significantly different from what we expect. If you see a warning, there is likely something wrong with the randomization and assignment algorithms.
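
For reference, this is the standard chi-squared goodness-of-fit test against an expected 50/50 split (the exact implementation may differ in detail). With n_c and n_t the control and treatment sample sizes and N = n_c + n_t, the statistic is:

\chi^2 = \frac{(n_c - N/2)^2}{N/2} + \frac{(n_t - N/2)^2}{N/2}

With two groups the statistic has 1 degree of freedom, and a very small p-value indicates that the observed split is unlikely under a true 50/50 assignment.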

Bayesian results

Prior and posterior distributions in Bayesian analysis of an A/B test. Top: a Normal probability distribution with mean 0 and standard deviation 0.3, encoding the belief that 68% of experiments have a lift between -30% and 30% (see 68–95–99.7 rule). Bottom: the posterior distribution (also Normal, due to conjugacy), showing how beliefs about likely values of lift shifted after collecting data. The area to the right of 0 is highlighted, indicating the chance to win (the probability that the lift is greater than 0).

In Bayesian analysis we start with a prior belief about the likely values of the parameter of interest – expressed as a probability distribution – then collect data (evidence) and use it to update our beliefs, yielding the posterior (also a probability distribution). In this case – where the parameter of interest is lift (relative change) – we start with a Normal(0, 0.3) prior on the lift, encoding the belief that 68% of experiments have a lift between -30% and 30%, 95% of experiments have a lift between -60% and 60%, and 99.7% of experiments have a lift between -90% and 90%.[2] When we conduct an experiment and collect data, we update our belief about what our lift is likely to be.

The more data (number of subjects in our experiment) we collect, the lesser the influence of the prior belief. If we conduct a small experiment (say, <100 users) and observe a huge lift (say, 200%), the probability distribution's mean will shift from 0% closer to 200%, but not by much – because our prior belief says that such a lift is highly unlikely. If we were to conduct a large experiment (say, >10K users) and somehow observe a similarly huge lift (say, 150%), our posterior's mean will be close to that, because despite the prior belief about such a lift being unlikely, the evidence to the contrary is overwhelming.
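
To make this weighting concrete, here is the standard Normal–Normal conjugate update, written for a prior Normal(\mu_0, \sigma_0^2) and an observed lift estimate \hat{\delta} with standard error s. This is a sketch of the general mechanism, not necessarily the exact parameterization used by the analysis code:

\text{posterior mean} = \frac{\mu_0 / \sigma_0^2 + \hat{\delta} / s^2}{1 / \sigma_0^2 + 1 / s^2},
\qquad
\text{posterior variance} = \left( \frac{1}{\sigma_0^2} + \frac{1}{s^2} \right)^{-1}

As more subjects are collected, s shrinks, the data term 1/s^2 dominates, and the posterior mean moves toward the observed \hat{\delta}; with few subjects, s is large and the prior mean of 0 dominates.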

Chance To Win is the probability that the treatment is better than the control. In other words, the probability that the lift is greater than 0.

Credible intervals have the property that if we calculate a 95% credible interval for an experiment, there is a 95% probability the lift is in that interval. This is because the interval is based on the probability distribution of lift after observing the data.

Risk (expected loss) is the average uplift that would be lost if we chose treatment but the treatment was actually worse than the control, multiplied by the probability that control is better than treatment (or "Chance To Lose"). This is a summary of the posterior distribution to the left of 0, and the more of the distribution there is to the left of 0 the more risk there is.

Frequentist results

There are many resources online about understanding p-values correctly. If you are using p-values for decision making, be sure you are not misusing them. Furthermore, if you do use p-values for decision making, be sure your experiment was sufficiently powered.

The p-value is the probability of observing a lift as big as or bigger than the one observed in this experiment if the actual lift is zero. If the p-value is below a significance level α, you reject the null hypothesis that there is no lift. If the p-value is at or above α, you do not reject the null hypothesis. The significance level α is the false positive rate and is usually set to 0.05, indicating a 5% probability of incorrectly rejecting the null hypothesis when it should not be rejected.
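
For intuition, here is a sketch of a two-sided test on the lift estimate \hat{\delta} with standard error s, where \Phi is the standard Normal CDF (the exact test used by the system may differ):

z = \frac{\hat{\delta}}{s}, \qquad p = 2 \left( 1 - \Phi(|z|) \right), \qquad \text{95\% CI} = \hat{\delta} \pm 1.96\, s

Rejecting the null hypothesis at \alpha = 0.05 when p < 0.05 is equivalent to the 95% confidence interval excluding 0.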

A large p-value (a statistically non-significant effect) is not necessarily indicative of a bad change. It may be that there is a small, positive, and practically significant effect, but the experiment was underpowered. That is, it did not have enough statistical power to detect an effect that small. In addition to the p-value, you should review other signals (such as confidence intervals) to better understand the direction and possible magnitude of the effect, as well as the amount of uncertainty associated with the estimate.

Confidence intervals have the property that, if the experiment were repeated many times and a 95% confidence interval calculated each time, 95% of those confidence intervals would contain the true lift. This does not imply a 95% probability that the true lift is in the particular confidence interval calculated for any given experiment. However, the 95% credible interval does have that property; see the notes on Bayesian results above.

View metric across experiments

Due to the nature of edge unique-based enrollment, results are limited to the "Logged-out only" user authentication status segment for experiments conducted on all user traffic.

You can also get a high-level view of all the experiments whose analysis included a specific metric at https://superset.wikimedia.org/superset/dashboard/experiment-metrics/. This allows you to see which experiments are having the biggest impact on the metrics you care about – great for key result and objective owners.

Extract experiments results

In some cases you may wish to have the experiment results elsewhere (e.g. in spreadsheet software), either for additional analysis or for easier inclusion in presentations. You could copy and paste the results from the Superset dashboard, but you can also query the experiment results that are stored in the data lake (in wmf_experiments.experiment_results_v1).
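
A hedged sketch of such a query follows; the column names in the WHERE and ORDER BY clauses are assumptions for illustration, so inspect the actual schema (e.g. with DESCRIBE wmf_experiments.experiment_results_v1) before relying on them:

-- The filter and sort columns below are illustrative; check the schema first:
-- DESCRIBE wmf_experiments.experiment_results_v1;
SELECT *
FROM wmf_experiments.experiment_results_v1
WHERE experiment_name = 'your-experiment-name'  -- hypothetical column name
ORDER BY metric_name                            -- hypothetical column name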

Maintain automated analytics

This section is for maintainers of this sub-system of Experimentation Lab.

The analysis code lives in a Python jobs repo – https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/experiment-analytics-jobs – and has these entry points:

  • configuration_loader.py downloads configurations from xLab API and experiment-analytics-configs repo and loads them into:
    • wmf_experiments.metrics_catalog_v1
    • wmf_experiments.experiment_registry_v1
  • analysis_coordinator.py:
    1. Uses experiment_registry_v1 to determine which experiments need to be analyzed and how (type of experiment, which metrics)
    2. Gathers data for analysis of those experiments using information from metrics_catalog_v1
    3. Performs analyses of each experiment's metrics
    4. Stores results in wmf_experiments.experiment_results_v1 using MERGE INTO (updating records for re-analyzed experiments, appending records for newly analyzed experiments)

These two jobs are scheduled via the analytics-product Airflow instance.

Both of those DAGs rely on the scripts having been packaged into an artifact, the URI for which is specified in artifacts.yaml.

Resetting metrics catalog

This procedure can only be performed by somebody in the analytics-product-users group.

Due to how the metrics catalog is currently loaded into the data lake – using MERGE INTO keyed on name, only ever performing insert or update operations and never deleting (learn more here) – if a metric is renamed then a copy of its definition will remain in wmf_experiments.metrics_catalog_v1 under the previous name.

To remove a specific metric, run: sudo -u analytics-product kerberos-run-command analytics-product spark3-sql -e "DELETE FROM wmf_experiments.metrics_catalog_v1 WHERE name = '{previous metric name}';"

To do a full reset of the metrics catalog in the data lake (if there are too many renamed metrics to delete one by one):

  1. Run sudo -u analytics-product kerberos-run-command analytics-product spark3-sql -e "DELETE FROM wmf_experiments.metrics_catalog_v1 WHERE TRUE;" on any stat host.
  2. Go to airflow-analytics-product.wikimedia.org and manually run the load_configurations_daily DAG.

References

  1. https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/experiment-analytics-jobs/-/blob/dfe764161f4633e4660588f262b8271085954bc9/experiment_analytics_jobs/experiments.py#L98-114
  2. 68–95–99.7 rule