MariaDB/PII
This page describes the procedure for removing Personally Identifiable Information (PII) from Wiki Replicas.
This procedure must be followed every time a new wiki is added to the database. The full procedure for adding a new wiki is described at Add a wiki.
Sanitize the Wiki Data
Note: As of Jun 2024 - the default section for new wikis is s5 but this might not be the case in the future.
Use the `sre.mysql.sanitize-wiki` cookbook from cumin:
sudo cookbook [GLOBAL_ARGS] sre.mysql.sanitize-wiki [-h] --wiki WIKI [--check-only] --task TASK [--only-grant-and-view]
cookbook sre.mysql.sanitize-wiki --wiki myfirstwiki --wiki mysecondwiki Will sanitize those wikis according to what is documented above. cookbook sre.mysql.sanitize-wiki --wiki mywiki --check-only This will run the first part of the procedure, limited to checking if there is some sanitization to perform.
Optional arguments:
-h, --help show this help message and exit --wiki WIKI Name of the wiki (can be repeated) (default: None) --check-only Only perform checks without making changes (default: False) --task TASK Phabricator task (default: None) --only-grant-and-view Only perform grant permissions and create view database. This should be quicker than a typical run. It will skip checking existing needs for sanitization in dbs, and focus on the last steps as in https://wikitech.wikimedia.org/wiki/MariaDB/PII#Grant_Permissions_for_SQL_Views (default: False)
Notify Relevant Teams
Once the database is sanitized and the view database is created, the work managed by DBAs is complete.
The next step is to create the views, and that is managed by the Wikimedia Cloud Services team.
Until we define a better process, assign the task to fnegri.
Restarting sanitarium instances
To restart all instances on sanitarium hosts use the `sre.mysql.sanitarium_restart` cookbook:
sudo cookbook --dry-run sre.mysql.sanitarium_restart
It shows a summary before and asks for confirmation before restarting the instances. It supports `-t` to log events into Phabricator and `--dc` and `--hostnames` to limit which hosts to operate on.
Optional arguments:
-t TASK_ID, --task-id TASK_ID Phabricator task ID --dc DC Datacenter e.g. --dc=eqiad --hostnames [HOSTNAMES ...] Limit restarts: --hostnames db0000 db1234
Manual PII cleanup process
The process used before the cookbook is described below:
Sanitize the Wiki Data
Go to the sanitarium hosts in both data centers and run the following command to clean up the data:
redact_sanitarium.sh -d $NEW_WIKI_NAME -S /run/mysqld/mysqld.s5.sock | mysql -S /run/mysqld/mysqld.s5.sock
Run the Private Data Check Script
Execute the check_private_data.py script to identify which table/columns need to be dropped:
check_private_data.py -S /run/mysqld/mysqld.s5.sock
If the output makes sense, you can pipe it directly to MySQL to drop the necessary table/columns:
check_private_data.py -S /run/mysqld/mysqld.s5.sock | mysql -S /run/mysqld/mysqld.s5.sock
Verify changes on each wikireplicas host
clouddb10[13-20]
(owned by the Cloud Services team), plus an-redacteddb1001
(owned by the Data Platform SRE team)Ensure that you run check_private_data.py on each replicas host after the whole process on your side is done to avoid leaking Personally Identifiable Information:
check_private_data.py -S /path/to/socket
Grant Permissions for SQL Views
Identify the wikireplicas hosts that belong to the relevant database section (e.g., s5). Grant the labsdbuser role the appropriate grants by running:
GRANT SELECT, SHOW VIEW ON `NEW_WIKI_NAME_p`.* TO `labsdbuser`;
This needs to be done on all wikireplicas hosts that have the relevant database section.
Create the View Database
Create the view database on all the wikireplicas hosts that belong to the relevant database section:
CREATE DATABASE NEW_WIKI_NAME_p;