Experimentation Lab/Automated analysis of experiments
This guide helps you define a metric (if it does not yet exist in the Metrics Catalog) and register a (configured) experiment for automated analysis.

For now, the Metrics Catalog and registration of experiments for automated analysis are configured separately from experiment configuration in xLab (formerly MPIC), and reporting is done separately via Superset. We are considering consolidating the user experience into the xLab UI as a potential hypothesis for the FY25–26 SDS 2.1 KR.
Prerequisites
Segmenting results by user authentication status relies on the performer_is_logged_in contextual attribute (see the User segmentation section below).
This guide assumes that you have created a measurement plan (a source of truth describing what metrics you're planning to measure and why) and an instrumentation specification (a source of truth describing how you're planning to measure those metrics and what events you will need to collect).
Please refer to the guide to conducting an experiment for help with this step.
This guide also assumes that you have implemented the instrumentation specification and that the experiment has been configured. That is, mpic.wikimedia.org/api/v1/experiments?format=analytics includes the experiment.
Development environment
Please refer to these instructions on setting up your environment for development and testing.
Define metric
Metrics are defined in metrics_catalog.yaml and each entry must include the following:
- name: The unique name by which the metric can be referred to in the experiment registry.
- type:
  - "mean" metrics measure some numeric outcome, usually calculated with SUM or COUNT (e.g. pageviews, edits, session length)
  - "proportion" metrics measure a binary outcome with a success or failure state (e.g. clicked through, retained, registered)
  - "prepared" is for advanced usage; refer to the guide on prepared metrics
- description: Descriptions can include links to additional documentation and to instrumentation source code.
- query_template: Parameterized query used to calculate the metric. See below for details.
- business_data_steward: Who stewards the definition and business logic, usually someone on the Product Analytics team.
- technical_data_steward: Who stewards the technical implementation of the metric, usually someone on the product team that instrumented the metric.
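For illustration, a hypothetical catalog entry could look like the sketch below. The metric name, stewards, and exact YAML layout are invented for this example – follow the structure of existing entries in metrics_catalog.yaml; the query mirrors the mean-type example further down this page.

- name: example_page_visits_per_user          # invented name for illustration
  type: mean
  description: >
    (Hypothetical) Number of page visits per subject during the experiment,
    based on the instrument's page-visit action events.
  query_template: |
    SELECT
      IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
      experiment.subject_id,
      COUNT(1) AS outcome
    FROM {table}
    WHERE action = 'page-visit'
      AND {where_boilerplate}
    GROUP BY 1, 2
  business_data_steward: Example Analyst      # usually someone in Product Analytics
  technical_data_steward: Example Engineer    # usually someone on the instrumenting product team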
Anyone in Product Analytics can review and merge MRs which define metrics for use in experiment analysis. Ping the team via @product-analytics-team in the #working-with-data channel in WMF Slack.
Query templates
The parameterized query in query_template can be as simple or complex as you need (it can include CTEs), but it must ultimately yield three columns:
- variation: the group to which the subject was assigned
  - "control" for the control group
  - "treatment" for the other group, regardless of the group's name in the experiment
- subject_id: see subject ID in the experiment schema fragment
- outcome: aggregated for each subject
  - For "mean" type metrics, outcome is any number (e.g. calculated with SUM or COUNT)
  - For "proportion" type metrics, outcome is 0 or 1; see example below
    - 0 for failure (e.g. user has not clicked on link)
    - 1 for success (e.g. user has clicked on link)
The query should select FROM {table}, which will be filled in by the analysis job using the table value from the experiment's entry in the registry (see below).
For brevity, the WHERE clause in the query should include AND {where_boilerplate}, which will be replaced with:[1]
experiment.coordinator = 'xLab'
AND experiment.enrolled = '{experiment_name}'
AND CONCAT(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '{start_date}'
AND CONCAT(LPAD(year, 4, '0'), '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '{end_date}'
This means that you can also refer to {start_date} and {end_date} (the start and end dates of the experiment, respectively) if your metric requires it. For example, if you need to limit a metric to a cohort of users from the first 3 days of the experiment, your query can use the following:
TO_TIMESTAMP('{start_date}') + INTERVAL 72 HOUR
Example queries
Mean-type metric
Suppose we have an experiment intended to increase the number of pages each user visits. The metric could have a query like:
SELECT
IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
experiment.subject_id,
COUNT(1) AS outcome
FROM {table}
WHERE action = 'page-visit'
AND {where_boilerplate}
GROUP BY 1, 2
If you have a proportion measured at the subject level, such as:
- Revert rate: proportion of edits published by subject which are then reverted
- Per-user clickthrough rate: proportion of impressions by each subject that resulted in a click
then you would still define it as a mean type metric, because the group's measurement is averaged across all the subjects.
Proportion-type metric
Suppose we have an experiment to increase the proportion of users who click on some element on the page. The metric could have a query like:
SELECT
IF(experiment.assigned = 'control', 'control', 'treatment') AS variation,
experiment.subject_id,
CAST(SUM(IF(action = 'click', 1, 0)) > 0 AS INT) AS outcome
FROM {table}
WHERE action IN('impression', 'click')
AND {where_boilerplate}
GROUP BY 1, 2
Register experiment
Registering an experiment means adding an entry to experiments_registry.yaml.
Each entry must have the following:
- enrolled: The machine-readable name of the experiment, as used in the instrumentation and when configuring the experiment in MPIC/xLab. This is what we would find in experiment.enrolled in the event data.
- table: Where the data for your experiment is available. Your event data will be available in a Hive table with the same name as the event stream, with dots and dashes replaced with underscores.
- metrics: Names of metrics as defined in the Metrics Catalog (metrics_catalog.yaml). CI will check for the presence of at least one metric and that all metrics listed here are defined.
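Similarly, a hypothetical registry entry could look like the sketch below. The experiment name, table, and metric names are invented for this example – follow the structure of existing entries in experiments_registry.yaml.

- enrolled: example-awesome-feature-experiment     # invented machine-readable experiment name
  table: event.mediawiki_example_interactions      # hypothetical Hive table derived from the stream name
  metrics:
    - example_page_visits_per_user                 # must be defined in metrics_catalog.yaml
    - example_element_clickthrough_rate            # invented second metric for illustration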
Defining new metrics and registering the experiment for analysis can be done together in the same merge request (example).
Anyone in Product Analytics can review and merge MRs which register experiments for analysis. Ping the team via @product-analytics-team in the #working-with-data channel in WMF Slack.
View experiment results
Viewing results in Superset requires membership in the analytics-privatedata-users group. If you cannot access Superset at all, refer to these instructions. If you are receiving errors about permissions, you will need to request SSH-less membership in analytics-privatedata-users. See Data Platform/Data access for instructions on requesting access and T393409 for an example of such a request.
If the experiment was registered for analysis before or while it was active, it is analyzed automatically by our system. Each analyzed experiment's results are then explorable at https://superset.wikimedia.org/superset/dashboard/experiment-analytics/.
Use Superset's filters sidebar (on the left) to select:
- Experiment: search for experiments by name
- Difference:
  - Relative (default) is defined as: (treatment - control) / control
  - Absolute is defined as: treatment - control
User segmentation
This version has limited user segmentation capability. For experiments conducted on all user traffic, you can filter on User authentication status, which includes the following segments:
- All: analysis used data collected from all subjects in the experiment
- Logged-in only: analysis used data collected only from subjects who were logged in at the time of data collection
- Logged-out only: analysis used data collected only from subjects who were not logged in at the time of data collection
For experiments conducted on logged-in users, Logged-in only is the only user segment available.
Difference types
Relative change is the unit-less difference between the treatment and the control relative to the control. We report relative change as a percentage.
Absolute difference, on the other hand, is not scaled by a reference/starting value; it retains the unit of whatever is being compared. Absolute differences of percentages (such as clickthrough rates) are expressed in percentage points (pp).
For example:
| Difference type | Control group | Treatment group | Difference |
|---|---|---|---|
| Relative | 10% | 11% | +10% |
| Absolute | 10% | 11% | +1pp |
| Relative | 12% | 16% | +33.3% |
| Absolute | 12% | 16% | +4pp |
| Relative | 10 | 5 | -50% |
| Absolute | 10 | 5 | -5 |
Analysis
There is not one signal to use for making a decision. Instead, you need to look at the set of signals together as a whole. Even if the p-value is greater than 0.05 and the 95% credible interval includes 0, it may include it only barely. If the vast majority – almost all – of the interval is to the right of 0, that strongly suggests that the effect is positive. In that case you might look at the risk to see the loss you can expect if the effect really is negative but you choose to go with the change anyway.
Experiment health
We automatically check for a 50/50 split between the groups' sample sizes using a standard chi-squared test for Sample Ratio Mismatch. In practice the groups will not be exactly the same size due to randomness, but this checks whether the split is significantly different from what we expect. If you see a warning, there is likely something wrong with the randomization and assignment algorithms.
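As a sketch of what this check computes (the exact significance threshold used by the automated job is an implementation detail): with observed group sizes n_C and n_T and an expected count of E = (n_C + n_T)/2 per group under a 50/50 split, the test statistic is

\[ \chi^2 = \frac{(n_C - E)^2}{E} + \frac{(n_T - E)^2}{E}, \]

which is compared against a chi-squared distribution with one degree of freedom; a very small p-value is what triggers the Sample Ratio Mismatch warning.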
Bayesian results

In Bayesian analysis we start with a prior belief about the likely values of the parameter of interest – expressed as a probability distribution – then collect data (evidence) and use it to update our beliefs, yielding the posterior (also a probability distribution). In this case – where the parameter of interest is lift (relative change) – we start with a Normal(0, 0.3) prior on the lift, encoding the belief that 68% of experiments have a lift between -30% and 30%, 95% of experiments have a lift between -60% and 60%, and 99.7% of experiments have a lift between -90% and 90%.[2] When we conduct an experiment and collect data, we update our belief about what the lift is likely to be.
The more data (number of subjects in our experiment) we collect, the lesser the influence of the prior belief. If we conduct a small experiment (say, <100 users) and observe a huge lift (say, 200%), the probability distribution's mean will shift from 0% closer to 200%, but not by much – because our prior belief says that such a lift is highly unlikely. If we were to conduct a large experiment (say, >10K users) and somehow observe a similarly huge lift (say, 150%), our posterior's mean will be close to that, because despite the prior belief about such a lift being unlikely, the evidence to the contrary is overwhelming.
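As a simplified illustration of this shrinkage – a conjugate normal-normal sketch, not necessarily the exact model implemented by the analysis job – suppose the prior on the lift is Normal(\mu_0 = 0, \tau_0 = 0.3) and the experiment yields an observed lift estimate \hat{\ell} with standard error \sigma (which shrinks roughly like 1/\sqrt{n} as the number of subjects n grows). The posterior mean is then the precision-weighted average

\[ \mu_{\text{post}} = \frac{\tau_0^{2}\,\hat{\ell} + \sigma^{2}\,\mu_0}{\tau_0^{2} + \sigma^{2}}. \]

With few subjects, \sigma is large and the posterior mean stays close to 0; with many subjects, \sigma is small and the posterior mean approaches the observed lift.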
Chance To Win is the probability that the treatment is better than the control. In other words, the probability that the lift is greater than 0.
Credible intervals have the property that if we calculate a 95% credible interval for an experiment, there is a 95% probability the lift is in that interval. This is because the interval is based on the probability distribution of lift after observing the data.
Risk (expected loss) is the average uplift that would be lost if we chose treatment but the treatment was actually worse than the control, multiplied by the probability that control is better than treatment (or "Chance To Lose"). This is a summary of the posterior distribution to the left of 0, and the more of the distribution there is to the left of 0 the more risk there is.
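One way to write this definition as a formula over the posterior distribution of the lift is

\[ \text{Risk} = \Pr(\text{lift} < 0) \times \mathbb{E}\left[ -\text{lift} \mid \text{lift} < 0 \right] = \mathbb{E}\left[ \max(0, -\text{lift}) \right], \]

i.e. positive lift contributes nothing to the expected loss, while negative lift contributes its magnitude.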
Frequentist results
The p-value is the probability of observing a lift as large or larger than the one observed in this experiment if the actual lift is zero. If the p-value is below a significance level α, you reject the null hypothesis that there is no lift. If the p-value is at or above α, you do not reject the null hypothesis. The significance level α is the false positive rate and is usually set to 0.05, indicating a 5% probability of incorrectly rejecting the null hypothesis when it should not be rejected.
A large p-value (statistically not significant effect) is not necessarily indicative of a bad change. It may be that there is a small, positive, and practically significant effect, but the experiment was underpowered. That is, it did not have enough statistical power to detect an effect of that small size. In addition to the p-value, you should review other signals (such as confidence intervals) to better understand the direction and possible magnitude of the effect, as well as the amount of uncertainty associated with the estimate.
Confidence intervals have the property that, if the experiment is repeated many times and a 95% confidence interval is calculated each time, 95% of those confidence intervals will contain the true lift. This does not imply a 95% probability that the true lift is in the particular confidence interval calculated for any given experiment. However, the 95% credible interval does have that property; see the notes on Bayesian results above.
View metric across experiments
You can also get a high-level view of all the experiments whose analysis included a specific metric at https://superset.wikimedia.org/superset/dashboard/experiment-metrics/. This lets you see which experiments have the biggest impact on the metrics you care about – great for key result and objective owners.
Extract experiment results
In some cases you may wish to have the experiment results elsewhere (e.g. in spreadsheet software), either for additional analysis or for easier inclusion in presentations. You could copy and paste the results from the Superset dashboard, but you can also query the experiment results that are stored in the data lake (in wmf_experiments.experiment_results_v1).
Querying with Presto in Superset's SQL Lab
Just be sure to select presto_analytics_iceberg as the database (instead of presto_analytics_hive).
SELECT
metric_name,
difference_type, -- 'Relative' or 'Absolute'
-- Sample sizes:
sample_sizes.control AS sample_size_control,
sample_sizes.treatment AS sample_size_treatment,
-- NOTE: Group sample sizes are the same for all metrics
-- Per-group average (of metric):
ROUND(sample_means.control, 3) AS metric_control,
ROUND(sample_means.treatment, 3) AS metric_treatment,
-- Bayesian statistical analysis of lift:
ROUND(bayesian_results.estimate, 3) AS bayesian_estimate_of_lift,
ROUND(bayesian_results.chance_to_win, 4) AS probability_lift_greater_than_zero,
CONCAT(
'(' || CAST(ROUND(bayesian_results.credible_interval_95.lower, 2) AS VARCHAR) || ', ',
CAST(ROUND(bayesian_results.credible_interval_95.upper, 2) AS VARCHAR) || ')'
) AS credible_interval, -- 95% CI (Bayesian)
-- Frequentist statistical analysis of lift:
ROUND(frequentist_results.estimate, 3) AS frequentist_estimate_of_lift,
ROUND(frequentist_results.p_value, 4) AS p_value,
CONCAT(
'(' || CAST(ROUND(frequentist_results.confidence_interval_95.lower, 2) AS VARCHAR) || ', ',
CAST(ROUND(frequentist_results.confidence_interval_95.upper, 2) AS VARCHAR) || ')'
) AS confidence_interval -- 95% CI (Frequentist)
FROM wmf_experiments.experiment_results_v1
WHERE
experiment_name = (
SELECT name
FROM wmf_experiments.experiments_registry_v1
WHERE machine_name = 'sds2-4-11-synth-aa-test-2' -- or: 'fy24-25-we-1-7-rc-grouping-toggle'
)
AND user_auth_segment = 'Logged-in only'
Then use SQL Lab's Download to CSV button.
Querying with Python and Spark SQL in Jupyter
import wmfdata as wmf
experiment_name = "sds2-4-11-synth-aa-test-2"  # change to your experiment's machine name, e.g. "fy24-25-we-1-7-rc-grouping-toggle"
query = """
SELECT
metric_name,
difference_type, -- 'Relative' or 'Absolute'
-- Sample sizes:
sample_sizes.control AS sample_size_control,
sample_sizes.treatment AS sample_size_treatment,
-- NOTE: Group sample sizes are the same for all metrics
-- Per-group average (of metric):
ROUND(sample_means.control, 3) AS metric_control,
ROUND(sample_means.treatment, 3) AS metric_treatment,
-- Bayesian statistical analysis of lift:
ROUND(bayesian_results.estimate, 3) AS bayesian_estimate_of_lift,
ROUND(bayesian_results.chance_to_win, 4) AS probability_lift_greater_than_zero,
CONCAT(
'(', ROUND(bayesian_results.credible_interval_95.lower, 2), ', ',
ROUND(bayesian_results.credible_interval_95.upper, 2), ')'
) AS credible_interval, -- 95% CI (Bayesian)
-- Frequentist statistical analysis of lift:
ROUND(frequentist_results.estimate, 3) AS frequentist_estimate_of_lift,
ROUND(frequentist_results.p_value, 4) AS p_value,
CONCAT(
'(', ROUND(frequentist_results.confidence_interval_95.lower, 2), ', ',
ROUND(frequentist_results.confidence_interval_95.upper, 2), ')'
) AS confidence_interval -- 95% CI (Frequentist)
FROM wmf_experiments.experiment_results_v1
WHERE
experiment_name = (
SELECT name
FROM wmf_experiments.experiments_registry_v1
WHERE machine_name = '{experiment_name}'
)
AND user_auth_segment = 'Logged-in only'
"""
experiment_results_df = wmf.spark.run(query.format(
experiment_name = experiment_name
))
experiment_results_df.to_csv("experiment_results.csv", index=False)
Maintain automated analytics
The analysis code lives in a Python jobs repo – https://gitlab.wikimedia.org/repos/product-analytics/experimentation-lab/experiment-analytics-jobs – and has these entry points:
- configuration_loader.py downloads configurations from the xLab API and the experiment-analytics-configs repo and loads them into:
  - wmf_experiments.metrics_catalog_v1
  - wmf_experiments.experiment_registry_v1
- analysis_coordinator.py:
  - Uses experiment_registry_v1 to determine which experiments need to be analyzed and how (type of experiment, which metrics)
  - Gathers data for analysis of those experiments using information from metrics_catalog_v1
  - Performs analyses of each experiment's metrics
  - Stores results in wmf_experiments.experiment_results_v1 using MERGE INTO (updating records for re-analyzed experiments, appending records for newly analyzed experiments)
These two jobs are scheduled via the analytics-product Airflow instance:
- experiment_analytics_load_configurations_daily_dag.py runs configuration_loader.py on a daily basis
- experiment_analytics_analyze_experiments_hourly_dag.py runs analysis_coordinator.py on an hourly basis
Both of those DAGs rely on the scripts having been packaged into an artifact, the URI for which is specified in artifacts.yaml.
Resetting metrics catalog
Running these commands requires membership in the analytics-product-users group.
Due to how the metrics catalog is currently loaded into the data lake – using MERGE INTO by name and only ever doing insert or update operations, never delete (learn more here) – if a metric is renamed, a copy of its definition will remain in wmf_experiments.metrics_catalog_v1 under the previous name.
To remove a specific metric:
sudo -u analytics-product kerberos-run-command analytics-product spark3-sql -e "DELETE FROM wmf_experiments.metrics_catalog_v1 WHERE name = '{previous metric name}';"
To do a full reset of the metrics catalog in the data lake (if there are too many renamed metrics to delete one by one):
- Run sudo -u analytics-product kerberos-run-command analytics-product spark3-sql -e "DELETE FROM wmf_experiments.metrics_catalog_v1 WHERE TRUE;" on any stat host.
- Go to airflow-analytics-product.wikimedia.org and manually run the load_configurations_daily DAG.