Data Platform/Analyze data

This page outlines the tools and systems available for analyzing private Wikimedia data. For public data, see meta:Research:Data.

Key terms

This term list focuses only on what you should know to get started using the Data Platform for analysis. A more comprehensive glossary is at Data_Platform/Systems/Cluster#Glossary.
Analytics cluster
The "analytics cluster" is a catch-all term for the compute resources and services running inside the Analytics VLAN, which is itself inside the WMF production network. Individual systems within the analytics cluster include Hadoop and the related components that run the Data Lake.
Data Lake
The Data Lake is a large, analytics-oriented repository of Wikimedia data.
Hadoop
A collection of services for batch processing of large datasets. See Hadoop.
HDFS
A file system for the Hadoop framework, which WMF uses to store files of various formats in the Data Lake.
Hive
A system that projects structure onto flat data (text or binary) in HDFS and allows this data to be queried using an SQL-like syntax.
Stat host
Stat hosts are servers in the production cluster which you can use to access and analyze Data Platform data.

Get access to internal data

Private data lives in the same server cluster that runs Wikimedia's production websites. This often means you need production access to work with it.

There are varying levels and combinations of access. The type of access you need depends on the tools you want to use, and the type of data you need to access.

You must read and follow these guidelines in all your work with internal data at WMF.

Follow the process to file an access request for your account.

Query and analyze data

After you have access to internal data and systems, you can start exploring and querying data in the Data Lake.

Jupyter notebooks are a friendly and powerful programming interface that works well for data analysis. The Data Platform has a cloud installation of Jupyter, which makes accessing its data easy and secure.

Browse datasets in the Data Lake and view table schemas and other metadata: https://datahub.wikimedia.org.

Run SQL queries

The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.

For lightweight analysis tasks, use Superset, which has a graphical SQL editor where you can run Presto queries, or Hue, which has a graphical SQL editor where you can run Hive queries.

Use libraries and analysis packages

Documentation pages for specific data sources may also contain example queries for working with that dataset. For example: wmf.webrequest Sample queries.

Use internal versions of public resources

You can access some popular public data sources more quickly and efficiently by using these internal data platform tools or datasets.

For a full overview of the types of data available internally and publicly, see Discover data.

Public pageviews data is available through dumps, APIs, and dashboards, but you can access more granular data internally in the wmf.pageview_hourly Hive table.
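
As an illustration of querying wmf.pageview_hourly, the following Python sketch builds a SQL string for the most-viewed pages of one project on one day. The helper name pageviews_query is hypothetical, and the column names (project, page_title, view_count, agent_type, year, month, day) are assumptions that should be checked against the table schema in DataHub. The sketch only constructs the query text; how you submit it depends on the engine or client you choose.

```python
from datetime import date


def pageviews_query(project: str, day: date, limit: int = 10) -> str:
    """Build SQL for the most-viewed pages of one project on one day.

    Hypothetical helper: the column names below are assumptions about
    the wmf.pageview_hourly schema, not confirmed by this page.
    """
    # Reject anything that is not a plain project name such as "en.wikipedia",
    # because the value is interpolated directly into the SQL text.
    if not project.replace(".", "").replace("-", "").isalnum():
        raise ValueError(f"unexpected project name: {project!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT page_title, SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        f"WHERE project = '{project}'\n"
        "  AND agent_type = 'user'\n"
        # Filtering on the date columns keeps the scan to a single day.
        f"  AND year = {day.year} AND month = {day.month} AND day = {day.day}\n"
        "GROUP BY page_title\n"
        "ORDER BY views DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(pageviews_query("en.wikipedia", date(2024, 1, 15)))
```

The resulting string can be pasted into a graphical SQL editor or passed to whichever query client you use in a notebook.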

The wmf database contains internal versions of the public data dumps published at dumps.wikimedia.org. The internal tables include raw and preprocessed edits data. For example, wmf.mediawiki_wikitext_history provides an internal version of the public XML dumps, refined into Avro data.

Internal users can access EventLogging datasets stored in the event and event_sanitized Hive databases, instead of using the public Event Streams service.

Internal MediaWiki API requests

Query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.
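
A minimal Python sketch of preparing such a request follows. It only assembles a MediaWiki Action API URL from a base address and query parameters. The function name build_api_url and the base URL https://internal-api.example are hypothetical placeholders; the actual internal host name, and any headers internal routing may require, are not specified on this page.

```python
from urllib.parse import urlencode


def build_api_url(base_url: str, **params: str) -> str:
    """Assemble an Action API URL (hypothetical helper for illustration)."""
    # Default to JSON output; caller-supplied parameters override the defaults.
    query = {"format": "json", "formatversion": "2", **params}
    # Sort the parameters so the generated URL is deterministic.
    encoded = urlencode(sorted(query.items()))
    return f"{base_url.rstrip('/')}/w/api.php?{encoded}"


if __name__ == "__main__":
    # Placeholder base URL: substitute the internal endpoint you have access to.
    print(build_api_url("https://internal-api.example", action="query", meta="siteinfo"))
```

Keeping the base address as a parameter means the same code can target a public wiki for testing and an internal endpoint in production.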

Next steps

To learn how to publish and share your analyses through dashboards, visualizations, and more, see Transform and publish data.

Category:Data platform Category:Landing page