Data Platform/Analyze data
This page outlines the tools and systems available for analyzing private Wikimedia data. For public data, see meta:Research:Data.
Key terms
- Analytics cluster
- The "analytics cluster" is a catch-all term for the compute resources and services running inside the Analytics VLAN, which is itself part of the WMF production network. Individual systems within the analytics cluster include Hadoop and the related components that run the Data Lake.
- Data Lake
- The Data Lake is a large, analytics-oriented repository of Wikimedia data.
- Hadoop
- A collection of services for batch processing of large data. See Hadoop.
- HDFS
- A file system for the Hadoop framework, which WMF uses to store files of various formats in the Data Lake.
- Hive
- A system that projects structure onto flat data (text or binary) in HDFS and allows this data to be queried using an SQL-like syntax.
- Stat host
- Stat hosts are servers in the production cluster which you can use to access and analyze Data Platform data.
Get access to internal data
Private data lives in the same server cluster that runs Wikimedia's production websites. Often, this means you need production access to work with it.
There are varying levels and combinations of access. The type of access you need depends on the tools you want to use, and the type of data you need to access.
You must read and follow these guidelines in all your work with internal data at WMF.
Follow the process to file an access request for your account.
Query and analyze data
After you have access to internal data and systems, you can start exploring and querying data in the Data Lake.
Jupyter notebooks are a friendly and powerful programming interface that works well for data analysis. The Data Platform has a hosted installation of Jupyter, which makes accessing its data easy and secure.
Browse datasets in the Data Lake and view table schemas and other metadata: https://datahub.wikimedia.org.
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.
- Quickstart notebook
- Syntax differences between query engines
- Query examples
- Query and coding conventions
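As a concrete illustration, the sketch below shows how a query might be built and run from a Jupyter notebook on an analytics client using the wmfdata-python package. The table and column names (`wmf.mediawiki_history`, `wiki_db`, `event_entity`, `event_timestamp`) reflect common Data Lake conventions, but treat the exact function names and schema details as assumptions to verify against the quickstart notebook.

```python
# Sketch of querying the Data Lake from an analytics client.
# Assumes wmfdata-python is installed there; function names and
# schema details may differ in your environment.

def monthly_edits_query(wiki: str, year: int, month: int) -> str:
    """Build a simple SQL query counting one month of edits on one wiki.

    The wmf.mediawiki_history table and its columns are assumptions
    based on common Data Lake naming conventions.
    """
    return (
        "SELECT COUNT(*) AS edits "
        "FROM wmf.mediawiki_history "
        f"WHERE wiki_db = '{wiki}' "
        "AND event_entity = 'revision' "
        f"AND event_timestamp LIKE '{year}-{month:02d}%'"
    )

# On an analytics client, the query could then be run with, e.g.:
#   import wmfdata
#   edits = wmfdata.presto.run(monthly_edits_query("enwiki", 2024, 1))
```

The same query string can usually be passed to any of the three engines; only the `run` call changes.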
For lightweight analysis tasks, use Superset, which has a graphical SQL editor where you can run Presto queries, or Hue, which has a graphical SQL editor where you can run Hive queries.
- Wmfdata-Python and Wmfdata-R (available in Jupyter environments on analytics clients)
- wmfastr: for speedy dwell-time and search preference metric calculations in R
- waxer: R wrapper for the metrics endpoint of the AQS REST API
- MediaWiki-utilities, including tools for parsing HTML and wikitext
- Tools for working with the Wikimedia dumps
- Resources for IP geolocation and geotagging
Use internal versions of public resources
You can access some popular public data sources more quickly and efficiently by using these internal data platform tools or datasets.
For a full overview of the types of data available internally and publicly, see Discover data.
Public pageviews data is available through dumps, APIs, and dashboards, but you can access more granular data internally in the wmf.pageview_hourly Hive table.
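For instance, a query against wmf.pageview_hourly might look like the sketch below. The column and partition names (`project`, `page_title`, `view_count`, `agent_type`, and the `year`/`month`/`day` partitions) are assumptions based on the table's documented conventions; always filter on the partition columns to avoid scanning the whole table.

```python
def top_pages_query(project: str, year: int, month: int,
                    day: int, limit: int = 10) -> str:
    """Build a query for the most-viewed pages on one day.

    Assumes wmf.pageview_hourly is partitioned by year/month/day/hour
    and that agent_type = 'user' filters out automated traffic.
    """
    return (
        "SELECT page_title, SUM(view_count) AS views "
        "FROM wmf.pageview_hourly "
        f"WHERE project = '{project}' "
        f"AND year = {year} AND month = {month} AND day = {day} "
        "AND agent_type = 'user' "
        "GROUP BY page_title "
        "ORDER BY views DESC "
        f"LIMIT {limit}"
    )
```

The resulting string can be run through any of the SQL engines, for example via wmfdata-python in a Jupyter notebook.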
The wmf database contains internal versions of the public data dumps published at dumps.wikimedia.org. The internal tables include raw and preprocessed edits data. For example, wmf.mediawiki_wikitext_history provides an internal version of the public XML dumps, refined into Avro data.
Internal users can access EventLogging datasets stored in the event and event_sanitized Hive databases, instead of using the public EventStreams service.
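Event tables follow the same partitioning convention as other Data Lake tables, so queries against them should always constrain the partition columns. The sketch below builds such a query; the `year`/`month`/`day` partition names match the usual convention, but the table name passed in is purely illustrative.

```python
def event_sample_query(table: str, year: int, month: int,
                       day: int, n: int = 100) -> str:
    """Build a query for a small sample of one day of events.

    `table` is a fully qualified event table name, e.g. something in
    the event or event_sanitized database (hypothetical example).
    Filtering on the year/month/day partitions keeps the scan small.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE year = {year} AND month = {month} AND day = {day} "
        f"LIMIT {n}"
    )
```

Note that only event_sanitized retains data long-term after sanitization, so choose the database according to the retention window you need.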
Query the MediaWiki APIs internally in R and Python, rather than sending requests over the internet.
Next steps
To learn about how to publish and share your analyses through dashboards, visualizations, and more, see Transform and publish data.
Category:Data platform Category:Landing page