Experimentation Lab/Analytics sampling

This page documents the options for data collection sampling supported by the Metrics Platform. Analytics sampling specifies how events are determined to be in-sample (sent) or out-of-sample (thrown away). You can configure sampling as part of event stream configuration or using the Experimentation Lab.

Analytics sampling controls data collection, not data generation.

  • Data generation: Instrument code determines when to submit events.
  • Data collection: Analytics sampling determines which events are actually sent for processing and stored in the database.
When instrumenting an A/B test, it is the responsibility of the instrumentation code to determine which clients get which feature variants; sampling logic cannot be used for experiment enrolment sampling.

Wiki project

The Wiki project field allows you to set sampling logic per wiki. The default applies to all wikis not specified in other rules. The wiki names used in the keys are database names, such as enwiki for English Wikipedia; see Configuration files.
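As a rough illustration of per-wiki rules with a default fallback, the lookup could be sketched as follows. The `rules` dictionary shape, the `"default"` key, and the `rate_for_wiki` helper are hypothetical names chosen for this example; they are not the platform's actual configuration format.

```python
# Hypothetical rule set: keys are wiki database names, plus a "default" entry.
rules = {
    "default": 0.1,   # applies to every wiki without its own rule
    "enwiki": 0.01,   # English Wikipedia
    "dewiki": 0.5,    # German Wikipedia
}


def rate_for_wiki(rules: dict, wiki_db_name: str) -> float:
    """Return the wiki-specific rate, falling back to the default rule."""
    return rules.get(wiki_db_name, rules["default"])


print(rate_for_wiki(rules, "enwiki"))
print(rate_for_wiki(rules, "frwiki"))  # no specific rule, so the default is used
```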

Sample rate

In the xLab UI's Instrument configuration, this setting appears in the section called Traffic and is expressed as a percentage, so "20%" in the UI corresponds to 'rate' => 0.2 in the backend.

The sample rate is the proportion of identifiers that are considered in-sample:

  • 1.0 (100%) by default, can be overridden in individual streams
  • set to 0.0 to disable the stream (if you want to keep the stream in the config but prevent events from being sent to it)
  • uses a "widening the net" approach: IDs determined to be in-sample at lower rates will also be determined to be in-sample at higher rates

For example: Suppose we have 4 streams: A, B, C, and D with sampling rates 0.01, 0.1, 0.25, 0.5, respectively. Those streams could be using the same schema or different ones. But specifically, those streams use the same identifier – let's say it's the session token. Remember, in the MEP paradigm streams map to tables inside the database. Here's what you should expect to see in those tables for any time period:

  • Table A will have data from approximately 1% of active sessions in that time period
  • Table B will have data from approx. 10% of active sessions at that time, but definitely all of the sessions found in table A
  • Table C will have data from approx. 25% of active sessions at that time, but definitely all of the sessions found in tables A and B
  • Table D will have data from approx. 50% of active sessions at that time, but definitely all of the sessions found in tables A, B, and C
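The nesting property in this example can be reproduced with a minimal sketch. It assumes each identifier is mapped deterministically to a number in [0, 1) and compared against the stream's rate; the SHA-256 mapping and the helper names `id_to_unit_interval` and `is_in_sample` are illustrative assumptions, not the Metrics Platform's actual implementation.

```python
import hashlib


def id_to_unit_interval(identifier: str) -> float:
    """Map an identifier to a stable number in [0, 1) (illustrative mapping)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the quotient is < 1.
    return int.from_bytes(digest[:4], "big") / 2**32


def is_in_sample(identifier: str, rate: float) -> bool:
    """An identifier is in-sample when its mapped value falls below the rate."""
    return id_to_unit_interval(identifier) < rate


# Streams A-D from the example, all keyed on the same (made-up) session IDs.
rates = {"A": 0.01, "B": 0.1, "C": 0.25, "D": 0.5}
session_ids = [f"session-{i}" for i in range(10_000)]
in_sample = {
    stream: {sid for sid in session_ids if is_in_sample(sid, rate)}
    for stream, rate in rates.items()
}

# Every session sampled at a lower rate is also sampled at each higher rate.
print(in_sample["A"] <= in_sample["B"] <= in_sample["C"] <= in_sample["D"])
```

Because the same identifier always maps to the same number, raising the rate can only add identifiers to the sample. A rate of 0.0 admits nothing and a rate of 1.0 admits everything, which matches the defaults described above.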

Sample unit

In-sample and out-of-sample determination lasts for the lifetime of the identifier on which it is based.

The sample unit defines the scope of the user activity to be tracked and how it is grouped for analysis. The unit determines which randomly generated identifier is used to determine whether events are in-sample or out-of-sample. You can choose to base your sample on:

  • pageview: Identified by performer_pageview_id. Determination varies from pageview to pageview.
  • session: Identified by performer_session_id. Determination varies from session to session.
  • device: Identified by agent_app_install_id. Determination varies from install to install – if a user's install is in-sample and they uninstall and re-install, a new ID will be generated and they may or may not be in-sample again.
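The relationship between units and identifiers listed above can be sketched as a simple lookup. The field names come from this page; the flat event dictionary and the `sampling_identifier` helper are assumptions made for illustration only.

```python
# Sample unit -> event field holding the identifier used for sampling.
UNIT_TO_ID_FIELD = {
    "pageview": "performer_pageview_id",
    "session": "performer_session_id",
    "device": "agent_app_install_id",
}


def sampling_identifier(event: dict, unit: str) -> str:
    """Return the identifier that the in-sample determination is based on."""
    field = UNIT_TO_ID_FIELD[unit]  # raises KeyError for an unknown unit
    return event[field]


# Hypothetical event with made-up identifier values.
event = {
    "performer_pageview_id": "pv-0001",
    "performer_session_id": "sess-0001",
    "agent_app_install_id": "install-0001",
}
print(sampling_identifier(event, "session"))
```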

Pageview

Web-specific streams can be configured to use the "pageview" unit. This will cause the determination to be made on a page-by-page basis and can be useful for getting a random sample of page views, not sessions.

Every pageview is an independent event with a unique ID. A new pageview ID is generated when the user:

  • Navigates to a page;
  • Refreshes the page;
  • Opens the page again in the same window or tab; or
  • Opens the page again in a different window or tab
FIXME: The pageview ID should be regenerated when navigating away from and then back to the page quickly.

Actions which occur within the scope of a single pageview can be correlated by pageview ID. Actions which occur within the scope of multiple pageviews cannot be correlated by pageview ID. These actions must have taken place on a single device and it is safe to assume they were performed by a single user. However, a pageview ID cannot be used as a proxy for an individual user.

Session

A browsing session consists of one or more pageviews on one domain. A new session ID is generated when:

  • The user first navigates to a page;
  • The user opens the page again in a private browsing window or tab; or
  • The session expires

Actions which occur within the scope of a session – i.e. within the scope of multiple pageviews – can be correlated by session ID. These actions must have taken place on a single device and it is safe to assume they were performed by a single user. A session ID can be used as a proxy for an individual user.

Session expiry

Sessions can expire on the Wikipedias and in the iOS and Android apps. When a session expires, a new session ID is generated. The mechanism for session expiry on the Wikipedias is different from that in the apps:

  • On MediaWiki, a session expires if the user has not clicked, typed, or scrolled in the foreground window or tab for at least 30 minutes;
  • In the apps, a session expires if the user has not used the app for at least 30 minutes
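The 30-minute inactivity rule can be illustrated with a small sketch. The `SessionTracker` class, its method names, and the use of `secrets.token_hex` to generate IDs are hypothetical; the real clients track activity and generate identifiers in their own way.

```python
import secrets

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity


class SessionTracker:
    """Keeps a session ID and regenerates it after 30 idle minutes."""

    def __init__(self, now: float) -> None:
        self.session_id = secrets.token_hex(10)
        self.last_activity = now

    def record_activity(self, now: float) -> str:
        """Register a click, keypress, scroll, or app use at time `now` (seconds)."""
        if now - self.last_activity >= SESSION_TIMEOUT_SECONDS:
            # The previous session expired, so a new session ID is generated.
            self.session_id = secrets.token_hex(10)
        self.last_activity = now
        return self.session_id


tracker = SessionTracker(now=0.0)
first_id = tracker.session_id
same_id = tracker.record_activity(now=10 * 60)  # 10 minutes idle: same session
new_id = tracker.record_activity(now=45 * 60)   # 35 minutes idle: new session
print(first_id == same_id, first_id == new_id)
```

The idle time is measured from the most recent activity rather than from the start of the session, so a continuously active user keeps the same session ID.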

Session scope

On MediaWiki, the session ID is per-domain. If a user views a page on domain A, clicks an interwiki link, and views a page on domain B, then they have two session IDs. Currently, we cannot link those two session IDs.

Additional info about sessions can be found at Analytics/Sessions.

Device

Mobile app-specific streams can be configured to use the "device" unit (app_install_id on iOS and Android). This will cause the determination to be made on a device-by-device basis. If a device is determined to be in-sample, all of its sessions and events will be in-sample. This is useful for retention metrics, cohort and longitudinal analyses, and cross-session analysis.

An app install consists of one or more sessions. A new app install ID is generated when the user first opts into tracking, and again whenever they opt out of tracking and then opt back in.

Actions which occur within the scope of an app install – i.e. within the scope of multiple sessions – can be correlated by app install ID. These actions must have taken place on a single device and it is safe to assume they were performed by a single user. An app install ID can be used as a proxy for an individual user. However, because an app install ID can be regenerated, it cannot be used as a proxy for an individual device.
