Data Platform/Data Lake/Edits/Public

Right now, this page is a draft where we will work out the best way to publish this dataset. With some compression, we have roughly five billion events adding up to one terabyte of data.

Ideas for splitting

Split by wiki with grouping

Split by wiki, but group all wikis with fewer than ten million events into a single file. This results in about 50 separate files, which is nice and manageable. These may be further split into 3 separate files each for user, page, and revision histories, depending on the size and ease of working with the data. The downside is that as a wiki grows past ten million events, it will move to its own separate file, potentially causing some confusion. A possible mitigation is a machine-readable index of where each wiki lives (sketched after the list below). Dan Andreescu is currently investigating this approach.

  • grouping wikis with fewer than 10 million events results in about 50 output files (about 150 if further split by entity)
  • grouping wikis with fewer than 30 million events means about 25 output files, but it increases the "all others" group to almost the same size as the English and Wikidata wikis, and it leaves no individual "small" wikis, which would be useful for people who want to test their analysis before downloading a bigger set.
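As a minimal sketch of what the machine-readable index could look like (the JSON format, file names, and the gawiki count are assumptions, not a decided scheme; the other counts are approximated from the table below), a small script could generate a wiki-to-file mapping alongside the dumps:

import json

# Hypothetical per-wiki event counts; in practice these would come from a
# query like the one below against history_count_by_wiki.
event_counts = {
    "enwiki": 1_020_000_000,
    "wikidatawiki": 1_040_000_000,
    "cebwiki": 35_000_000,
    "gawiki": 900_000,       # below the threshold, so grouped
}

THRESHOLD = 10_000_000  # wikis below this go into a shared file

# Map each wiki to the dump file that would contain its events.
index = {
    wiki: (f"{wiki}.tsv.gz" if count >= THRESHOLD else "all_others.tsv.gz")
    for wiki, count in event_counts.items()
}

with open("wiki_file_index.json", "w") as f:
    json.dump(index, f, indent=2, sort_keys=True)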
-- Total events per wiki
with with_count as (
 select wiki_db,
        sum(events) t
   from milimetric.history_count_by_wiki
  group by wiki_db

-- Label wikis under the 10-million-event threshold as 'all others'
), with_label as (
 select if(t > 10000000, wiki_db, 'all others') wiki,
        t
   from with_count

)

-- Share of the ~5.03 billion total events per wiki (or group)
 select wiki,
        sum(t) / 5031314059 as ratio
   from with_label
  group by wiki
  order by ratio desc
  limit 1000
;
Wiki             Ratio of total events
wikidatawiki     0.206
enwiki           0.203
all others       0.100
commonswiki      0.090
dewiki           0.041
frwiki           0.036
eswiki           0.027
itwiki           0.024
ruwiki           0.023
jawiki           0.016
viwiki           0.014
zhwiki           0.013
ptwiki           0.013
enwiktionary     0.013
plwiki           0.013
nlwiki           0.012
svwiki           0.011
metawiki         0.011
arwiki           0.009
shwiki           0.009
cebwiki          0.007
mgwiktionary     0.007
fawiki           0.007
frwiktionary     0.006
ukwiki           0.006
hewiki           0.006
kowiki           0.006
srwiki           0.005
trwiki           0.005
loginwiki        0.005
huwiki           0.005
cawiki           0.005
nowiki           0.004
mediawikiwiki    0.004
fiwiki           0.004
cswiki           0.004
idwiki           0.004
rowiki           0.003
enwikisource     0.003
frwikisource     0.003
ruwiktionary     0.002
dawiki           0.002
bgwiki           0.002
incubatorwiki    0.002
enwikinews       0.002
specieswiki      0.002
thwiki           0.002

Split by wiki, data set and time in GZipped TSVs

In this splitting idea, the directory structure is:

base_path/<wiki_or_wikigroup>/<data_set>/<time_range_1>.tsv.gz
                                        /<time_range_2>.tsv.gz
                                        /...
  • Where <wiki_or_wikigroup> is: enwiki, dewiki, etc. for the top 30 wikis, or the name of a wiki group for smaller wikis, e.g.: medium_wikis (5M < events < 25M) and small_wikis (events < 5M) [thresholds are a guess, we haven't checked them; they are just to present the idea]. This is based on Dan's idea of grouping the smaller wikis, but with two groups, so that people interested in a single smaller wiki don't have to download all wikis outside the top 30.
  • Where <data_set> is: mediawiki_history, mediawiki_user_history or mediawiki_page_history.
  • Where <time_range> is either the year (YYYY) or the year and month (YYYY-MM) the events belong to. The idea is to partition dump files by time range, so that files for larger wikis are not so large. By our ballpark calculations, the full enwiki mediawiki_history would be a 200+ GB file; one year (2019) of enwiki would be around 16 GB, and one month a bit more than 1 GB. Depending on the size of the wiki or wiki group we could use YYYY or YYYY-MM partitioning, or maybe always use YYYY (and accept the 16 GB enwiki files). A sketch of how a consumer could build paths under this layout follows the list.
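For example, a consumer script could construct download paths under this layout like so (a sketch only: the base path, the group names, and the choice of yearly partitioning are assumptions, not decided values):

# Sketch: build the expected yearly dump paths for one wiki and data set under
# base_path/<wiki_or_wikigroup>/<data_set>/<time_range>.tsv.gz.
# BASE_PATH is a placeholder; no real location has been decided.
BASE_PATH = "https://dumps.example.org/mediawiki_history_dumps"

def dump_paths(wiki_or_group, data_set, years):
    """Return one <year>.tsv.gz path per requested year."""
    return [f"{BASE_PATH}/{wiki_or_group}/{data_set}/{year}.tsv.gz" for year in years]

# e.g. all yearly files of enwiki's full history, 2015-2019
for path in dump_paths("enwiki", "mediawiki_history", range(2015, 2020)):
    print(path)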

We thought TSV would be a good data format, because it doesn't repeat the field names in every record the way JSON or YAML would, and it's a bit better than CSV, because commas are more likely than tabs to appear in page titles and user names (so we'd have to escape less with TSV).
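A small illustration of the escaping argument (the row values here are made up): a tab-delimited line round-trips with a naive split, while the same values in CSV need quoting because of the comma in the page title.

import csv
import io

# Hypothetical row: a page title that contains a comma.
row = ["enwiki", "page-create", "Paris, Texas", "2019-05-01T12:00:00Z"]

# TSV: no quoting needed, a plain split on tabs recovers the original values.
tsv_line = "\t".join(row)
assert tsv_line.split("\t") == row

# CSV: the comma in the title forces quoting, so a naive split on commas breaks.
buf = io.StringIO()
csv.writer(buf).writerow(row)
csv_line = buf.getvalue().strip()   # enwiki,page-create,"Paris, Texas",2019-05-01T12:00:00Z
assert csv_line.split(",") != row   # naive comma split yields five fields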

Finally, I think gzip is a good format for the dumps, because it's a pretty standard algorithm and the compressed files can either be decompressed separately and then concatenated, or concatenated into a single compressed file and decompressed in one go. I think Parquet is too technology-specific for the main users of the data lake dumps, no?
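The concatenation property is easy to check (a sketch with made-up file names and content): gzip files concatenated byte-for-byte form one valid multi-member stream, so decompressing the concatenation gives the same text as decompressing each part and joining the output.

import gzip

# Write two hypothetical dump parts as separate gzip files.
with gzip.open("2018.tsv.gz", "wt") as f:
    f.write("enwiki\trevision-create\t2018-07-01\n")
with gzip.open("2019.tsv.gz", "wt") as f:
    f.write("enwiki\trevision-create\t2019-03-15\n")

# Concatenate the compressed bytes (equivalent to: cat 2018.tsv.gz 2019.tsv.gz > all.tsv.gz).
with open("all.tsv.gz", "wb") as out:
    for name in ("2018.tsv.gz", "2019.tsv.gz"):
        with open(name, "rb") as part:
            out.write(part.read())

# gzip reads the result as a single multi-member stream: both lines come back in order.
with gzip.open("all.tsv.gz", "rt") as f:
    print(f.read())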

Category:Data platform