Commons:GLAMwiki Toolset Project/NARA analytics pilot

The NARA analytics pilot is a pilot project to produce the first image usage analytics of a GLAM upload. This is a quick and dirty approach done at the Zürich Hackathon 2014. Beware, the data and analytics didn't go through any qualilty control so it might be full of errors.

History

Report on Requirements for Usage and Reuse Statistics for GLAM Content

At the moment the Wikimedia Foundation collects page view statistics, but no image view statistics. This means we don't know how many views our images and other media files get, we try to derive it based on page views. A long time ago someone enabled logging of image views for NARA images. The data has been collecting for several years, but nothing has been done with the data yet. Unfortunately the data was rotated (old files deleted) so the dataset we worked on was from 2014-02-05 to 2014-05-09. The logs are sampled 1:10 so as a rule of thumb you night to multiple the numbers with 10. We made an overview of requirements for usage and re-usage statistics for GLAM content so we know what questions we want to answer.

Steps

  • udplog collects NARA data. The Raw data format is at wikitech:Analytics/Data_access#Request_logs
  • This data is imported to Hadoop on stat1002.eqiad.wmnet to the table webrequest_glam_nara on database Otto (thanks Andrew)
  • We do a query to clean up the data and to get the same image names (multiple url's for the same image, I hope we solved the fileviews problem)
CREATE TABLE webrequest_glam_nara_cleaned(
  dt string COMMENT 'from deserializer', 
  country_code string COMMENT 'from deserializer', 
  uri string COMMENT 'from deserializer', 
  content_type string COMMENT 'from deserializer', 
  referer string COMMENT 'from deserializer')
--PARTITIONED BY ( 
--  year int, 
--  month int, 
--  day int)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
INSERT INTO TABLE
  webrequest_glam_nara_cleaned
SELECT
  dt, country_code, regexp_replace(uri, 'https?://upload.wikimedia.org/wikipedia/commons(/thumb)?/(\\w/\\w\\w)/([^\/]+)(.*)', '//upload.wikimedia.org/wikipedia/commons/$2/$3'), content_type, referer
FROM
  webrequest_glam_nara
WHERE 
    year=2014 
  AND 
    (uri LIKE 'http://upload.wikimedia.org/%' OR uri like 'https://upload.wikimedia.org/%');

First results

Total number of views

SELECT COUNT(*) FROM webrequest_glam_nara_cleaned;

Over the time period we had a total of 21,547,979 hits. So, taking into account the 1:10 sampling, this means roughly 200 million hits in 3 months. We could split this out by day and make a histogram

Views per day

"Give me a daily overviewcount of the views my media files had (FileViews)"

See dataset and visualisation.

There is a spike on March 24. Further analysis shows that the biggest referral on that day is Dorothy Height. Turns out this lady was featured on a Google Doodle on that day.

Views per month

"Give me a monthly overview of the views my media files had (FileViews)"

PeriodViews
2014-02 (from 05)5,394,192
2014-037,222,593
2014-046,848,673
2014-05 (end 09)2,082,521
SELECT 
  count(*) as freq, uri
FROM
  webrequest_glam_nara_cleaned 
GROUP BY 
  uri 
ORDER BY
  freq desc 
LIMIT 20;
February 2014
March 2014
April 2014

Imageviews by country

SELECT 
  country_code as country, count(*) as freq
FROM
  webrequest_glam_nara_cleaned 
GROUP BY 
  country_code 
ORDER BY
  freq desc
LIMIT 2000;

We could make a heatmap of this one.

countryfreq
US10806780
GB1338132
XX1050605
DE879286
CA822878
JP566449
FR479325
AU459854
IN370386
NL293125
RU235243
MX226716
IT217636
CN199184
ES194763
BR144575
PH134343
SE126166
PL113083
EU109736
CH104411
AT103840
IE93439
CO81917
NZ81675
SG79750
NO78881
TR75130

Imageviews per country per month

Give me a monthly overview of the provenance of the users that see my media files

February 2014
countryfreq
US2710846
GB343203
XX248388
DE224909
CA203417
JP175721
FR126888
AU103175
IN96543
MX65935
NL65767
IT60654
ES55110
RU50584
CN40986
PH39484
SE30751
BR30243
EU28404
CH25599
March 2014
countryfreq
US3540291
GB465823
XX351699
DE303557
CA273932
JP189155
FR163550
AU152841
IN123431
NL105033
RU91192
MX84487
CN73168
IT68658
ES65726
PH51155
BR46257
SE43836
PL41642
EU37951
April 2014
countryfreq
US3476374
GB404892
XX344029
CA269814
DE267976
AU158220
JP154575
FR145757
IN117831
NL96214
RU74868
IT67890
CN64992
MX58800
ES57575
BR47004
SE39736
PL36828
PH33647
EU33524

Top referers

SELECT
  referer, count(*) as freq 
FROM
  webrequest_glam_nara_cleaned
GROUP BY
  referer
ORDER BY
  freq
DESC
LIMIT 20;
refererfreq
<none>906001
http://en.wikipedia.org/wiki/World_War_I505325
http://en.m.wikipedia.org/wiki/World_War_I392841
http://en.wikipedia.org/wiki/Japanese_American_internment197975
http://en.wikipedia.org/wiki/Vietnam_War196377
http://en.wikipedia.org/wiki/Franklin_D._Roosevelt191373
http://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki188715
http://en.wikipedia.org/wiki/Martin_Luther_King,_Jr.180489
http://en.wikipedia.org/wiki/World_War_II166533
http://en.wikipedia.org/wiki/John_F._Kennedy164430
http://en.m.wikipedia.org/wiki/List_of_Presidents_of_the_United_States149187
http://en.m.wikipedia.org/wiki/Vietnam_War141585
http://en.wikipedia.org/wiki/George_H._W._Bush138513
http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States138324
https://www.google.com/136984
http://en.m.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki136312
http://en.m.wikipedia.org/wiki/John_F._Kennedy131820
http://en.m.wikipedia.org/wiki/World_War_II128544
http://en.m.wikipedia.org/wiki/Franklin_D._Roosevelt126696
http://en.wikipedia.org/wiki/March_on_Washington_for_Jobs_and_Freedom113333

Top external referers

SELECT
  referer, count(*) as freq 
FROM
  webrequest_glam_nara_cleaned
WHERE
  referer not like '%wiki%'
GROUP BY
  referer
ORDER BY
  freq
DESC
LIMIT 20;
refererfreq
https://www.google.com/136984
http://www.google.com/47612
http://www.google.com/blank.html33039
http://hainn12.blogspot.com/12769
https://www.google.co.uk/11585
http://www.reddit.com/7940
https://www.google.ca/7713
https://www.google.co.in/4959
http://www.google.co.uk/4165
https://www.google.fr/3816
https://www.google.com.au/3708
https://www.google.com.mx/3228
https://www.google.es/3182
http://hainn8x12.blogspot.com/3128
https://www.google.it/2687
https://www.google.co.jp/2646
https://www.google.de/2565
http://www.google.de/2508
http://www.google.co.uk/blank.html2460

(removed the none one)

Top external referers (without Google)

SELECT
  referer, count(*) as freq 
FROM
  webrequest_glam_nara_cleaned
WHERE
  referer not like '%wiki%' and referer not like '%google%'
GROUP BY
  referer
ORDER BY
  freq
DESC
LIMIT 20;
http://hainn12.blogspot.com/12769
http://www.reddit.com/7940
http://hainn8x12.blogspot.com/3128
http://newtownliterary.wordpress.com/wp-admin/post.php?post=767&action=edit&message=12379
http://image-search.kik.com/1336
http://hainn12.blogspot.com.es/1292
http://www.reddit.com/r/all/954
http://ex-frat-man.tumblr.com/915
http://hainn12.blogspot.co.uk/913
http://blogs.discovermagazine.com/bodyhorrors/882
http://www.reddit.com/r/HistoryPorn/710
http://www.moava.org/index.php?pageID=222670
http://hainn12.blogspot.de/615
http://www.answers.com/topic/environmental-protection-agency608
http://hainn12.blogspot.com.tr/587
http://hainn12.blogspot.tw/569
http://www.answers.com/topic/martin-luther-king-jr509
http://www.answers.com/topic/world-war-i481
http://www.answers.com/topic/new-deal445

Source code

See https://github.com/Commonists/limn-glam for a visualization code base which was based on now deprecated Limn.

Category:GLAMwiki toolset project Category:National Archives and Records Administration
Category:GLAMwiki toolset project Category:National Archives and Records Administration Category:Pages using deprecated source tags