TimedText:Wikidata Editing with OpenRefine - Part 2.webm.en.srt
1 00:00:00,000 --> 00:00:02,367 Welcome back to this tutorial
2 00:00:02,367 --> 00:00:04,700 on using OpenRefine to import data
3 00:00:04,700 --> 00:00:06,400 into Wikidata.
4 00:00:06,400 --> 00:00:08,200 In the previous video,
5 00:00:08,200 --> 00:00:10,750 we have matched films against Wikidata items
6 00:00:10,750 --> 00:00:13,100 and checked the quality of these matches.
7 00:00:13,100 --> 00:00:15,050 For each of these films,
8 00:00:15,050 --> 00:00:17,300 we want to add the filming locations
9 00:00:17,300 --> 00:00:19,550 to the Wikidata items.
10 00:00:19,550 --> 00:00:22,150 This requires reconciling the locations as well.
11 00:00:22,150 --> 00:00:25,107 So let's do this.
12 00:00:25,107 --> 00:00:27,104 The locations we have in this dataset
13 00:00:27,104 --> 00:00:29,726 are given as street addresses.
14 00:00:29,726 --> 00:00:31,974 these particular addresses are unlikely
15 00:00:31,974 --> 00:00:33,939 to have a corresponding Wikidata item,
16 00:00:33,939 --> 00:00:36,939 but the streets they are in often have one.
17 00:00:38,665 --> 00:00:40,608 So we are first going to extract
18 00:00:40,608 --> 00:00:41,871 the street names from the addresses.
19 00:00:41,871 --> 00:00:44,000 We use a regular expression
20 00:00:44,000 --> 00:00:47,467 to remove any number at the beginning of the string.
21 00:00:51,504 --> 00:00:54,500 In the preview window, we notice that
22 00:00:55,089 --> 00:00:55,400 our regular expression did not catch
23 00:00:55,400 --> 00:00:57,800 the leading spaces.
24 00:00:57,800 --> 00:01:00,500 This is an indication that these strings
25 00:01:00,500 --> 00:01:03,050 contain non-standard space characters.
26 00:01:03,050 --> 00:01:04,550 They are likely to cause problems
27 00:01:04,550 --> 00:01:07,445 during reconciliation with Wikidata.
28 00:01:07,445 --> 00:01:08,892 So let's just copy these weird characters
29 00:01:08,892 --> 00:01:10,700 and get rid of them
30 00:01:10,700 --> 00:01:13,967 with a first replace function.
31 00:01:15,479 --> 00:01:16,979 The first call to replace
32 00:01:16,979 --> 00:01:18,646 cleans up the whitespace;
33 00:01:19,302 --> 00:01:22,302 the second removes the street numbers.
34 00:01:30,550 --> 00:01:32,538 Pick a name for the column
35 00:01:32,750 --> 00:01:34,600 and create it.
36 00:01:37,050 --> 00:01:40,464 We can now reconcile these streets to Wikidata.
37 00:01:40,464 --> 00:01:43,531 Again, pick "Reconcile" -> "Start reconciling"
38 00:01:43,548 --> 00:01:45,525 and choose the Wikidata service.
39 00:01:48,250 --> 00:01:51,638 In this case, the "street" type is too narrow.
40 00:01:52,071 --> 00:01:54,735 Some locations are parks or bridges
41 00:01:54,735 --> 00:01:57,735 so we manually pick a broader type.
42 00:01:57,936 --> 00:01:59,859 Let's see what other information we could use
43 00:01:59,859 --> 00:02:02,200 to improve the matches.
44 00:02:02,200 --> 00:02:04,754 The postcode looks like a good fit
45 00:02:04,754 --> 00:02:07,300 but unfortunately postcodes are rarely
46 00:02:07,300 --> 00:02:10,300 added on street items.
47 00:02:10,600 --> 00:02:13,000 The last column contains the geographical
48 00:02:13,000 --> 00:02:14,535 coordinates of the locations,
49 00:02:14,535 --> 00:02:17,535 expressed as latitude, comma, longitude.
50 00:02:18,912 --> 00:02:22,379 We can match that to the coordinates of the streets.
51 00:02:22,688 --> 00:02:25,218 The closer these geographical points will be,
52 00:02:25,218 --> 00:02:28,218 the higher the matching score will get.
53 00:02:35,400 --> 00:02:37,949 Once reconciliation is done,
54 00:02:37,949 --> 00:02:39,902 we can inspect the matches.
55 00:02:39,902 --> 00:02:41,644 In this case, we can see that two streets
56 00:02:41,644 --> 00:02:43,034 with the same name
57 00:02:43,034 --> 00:02:45,073 got different matching scores,
58 00:02:45,073 --> 00:02:48,073 thanks to the matching on coordinates.
59 00:02:48,111 --> 00:02:51,111 The first one is the correct one.
60 00:02:52,600 --> 00:02:55,209 This cell was not matched automatically
61 00:02:55,209 --> 00:02:56,550 because the gap between the two scores
62 00:02:56,550 --> 00:02:58,300 is not big enough.
63 00:02:58,300 --> 00:03:00,550 I suspect there are more cases like this,
64 00:03:00,550 --> 00:03:02,900 so I am just going to filter the cells
65 00:03:02,900 --> 00:03:06,100 which were not matched
66 00:03:06,100 --> 00:03:10,431 but whose best candidate score is very high.
67 00:03:12,050 --> 00:03:14,500 I'm also going to add a facet
68 00:03:14,500 --> 00:03:16,238 which computes the string similarity
69 00:03:16,238 --> 00:03:17,819 between the cell content
70 00:03:17,819 --> 00:03:20,202 and the name of the best match
71 00:03:20,202 --> 00:03:23,202 and restrict to the high quality matches.
72 00:03:24,500 --> 00:03:26,127 Let's review these filtered rows
73 00:03:26,127 --> 00:03:27,860 and their best candidates.
74 00:03:48,209 --> 00:03:50,679 All these candidates are correct.
75 00:03:50,679 --> 00:03:52,469 So click "Reconcile" -> "Actions"
76 00:03:52,469 --> 00:03:55,469 -> "Match each cell to its best candidate"
77 00:03:57,015 --> 00:03:58,931 Obviously this operation should be used with care
78 00:03:58,931 --> 00:04:01,931 because it can introduce false positives.
79 00:04:03,650 --> 00:04:05,200 Let's now check the quality
80 00:04:05,200 --> 00:04:07,671 of the matched cells.
81 00:04:08,079 --> 00:04:09,582 For instance,
82 00:04:09,582 --> 00:04:11,100 we can fetch the administrative location
83 00:04:11,100 --> 00:04:15,669 of these streets.
84 00:04:23,676 --> 00:04:24,918 Once these locations are fetched,
85 00:04:24,918 --> 00:04:27,514 we can create a text facet on this column
86 00:04:27,514 --> 00:04:29,168 and sort the facet
87 00:04:29,168 --> 00:04:32,168 by decreasing number of occurrences.
88 00:04:39,550 --> 00:04:42,100 This gives us a broad overview
89 00:04:42,100 --> 00:04:47,241 of the most frequent values.
90 00:04:47,241 --> 00:04:47,800 We can review this list.
91 00:04:47,800 --> 00:04:50,050 All these locations are neighborhoods in Paris,
92 00:04:50,050 --> 00:04:52,947 which is consistent with the dataset.
93 00:05:03,250 --> 00:05:06,551 This is the end of the second part of this tutorial.
94 00:05:06,551 --> 00:05:08,500 In the next video, we are going to
95 00:05:08,500 --> 00:05:10,965 transform our table into statements
96 00:05:10,965 --> 00:05:12,832 and upload them to Wikidata.