Wikimore

1 00:00:00,000 --> 00:00:02,367 Welcome back to this tutorial

2 00:00:02,367 --> 00:00:04,700 on using OpenRefine to import data

3 00:00:04,700 --> 00:00:06,400 into Wikidata.

4 00:00:06,400 --> 00:00:08,200 In the previous video,

5 00:00:08,200 --> 00:00:10,750 we matched films against Wikidata items

6 00:00:10,750 --> 00:00:13,100 and checked the quality of these matches.

7 00:00:13,100 --> 00:00:15,050 For each of these films,

8 00:00:15,050 --> 00:00:17,300 we want to add the filming locations

9 00:00:17,300 --> 00:00:19,550 to the Wikidata items.

10 00:00:19,550 --> 00:00:22,150 This requires reconciling the locations as well.

11 00:00:22,150 --> 00:00:25,107 So let's do this.

12 00:00:25,107 --> 00:00:27,104 The locations we have in this dataset

13 00:00:27,104 --> 00:00:29,726 are given as street addresses.

14 00:00:29,726 --> 00:00:31,974 These particular addresses are unlikely

15 00:00:31,974 --> 00:00:33,939 to have a corresponding Wikidata item,

16 00:00:33,939 --> 00:00:36,939 but the streets they are in often have one.

17 00:00:38,665 --> 00:00:40,608 So we are first going to extract

18 00:00:40,608 --> 00:00:41,871 the street names from the addresses.

19 00:00:41,871 --> 00:00:44,000 We use a regular expression

20 00:00:44,000 --> 00:00:47,467 to remove any number at the beginning of the string.

21 00:00:51,504 --> 00:00:54,500 In the preview window, we notice that

22 00:00:55,089 --> 00:00:55,400 our regular expression did not catch

23 00:00:55,400 --> 00:00:57,800 the leading spaces.

24 00:00:57,800 --> 00:01:00,500 This is an indication that these strings

25 00:01:00,500 --> 00:01:03,050 contain non-standard space characters.

26 00:01:03,050 --> 00:01:04,550 They are likely to cause problems

27 00:01:04,550 --> 00:01:07,445 during reconciliation with Wikidata.

28 00:01:07,445 --> 00:01:08,892 So let's just copy these weird characters

29 00:01:08,892 --> 00:01:10,700 and get rid of them

30 00:01:10,700 --> 00:01:13,967 with a first replace function.

31 00:01:15,479 --> 00:01:16,979 The first call to replace

32 00:01:16,979 --> 00:01:18,646 cleans up the whitespace;

33 00:01:19,302 --> 00:01:22,302 the second removes the street numbers.
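
The two replace calls described above can be sketched outside OpenRefine in Python. This is a minimal illustration, not the expression typed in the video: the set of odd space characters is an assumption (the video does not name them), and `extract_street` is a hypothetical helper name.

```python
import re

# Assumption: the "weird" characters are Unicode spaces such as the
# no-break space (U+00A0); the video does not say which ones they are.
ODD_SPACES = re.compile(r"[\u00a0\u2007\u202f]")

# Any number at the beginning of the string, with surrounding whitespace.
LEADING_NUMBER = re.compile(r"^\s*\d+\s*")


def extract_street(address):
    """Return the address with odd spaces normalized and the leading number removed."""
    # First replace: turn non-standard spaces into ordinary spaces.
    cleaned = ODD_SPACES.sub(" ", address)
    # Second replace: drop the street number at the start of the string.
    return LEADING_NUMBER.sub("", cleaned).strip()


if __name__ == "__main__":
    print(extract_street("\u00a012 rue de Rivoli"))
```

The order matters: if the spaces are not cleaned first, `\s` may not match them and the number pattern no longer anchors at the start of the string.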

34 00:01:30,550 --> 00:01:32,538 Pick a name for the column

35 00:01:32,750 --> 00:01:34,600 and create it.

36 00:01:37,050 --> 00:01:40,464 We can now reconcile these streets to Wikidata.

37 00:01:40,464 --> 00:01:43,531 Again, pick "Reconcile" -> "Start reconciling"

38 00:01:43,548 --> 00:01:45,525 and choose the Wikidata service.

39 00:01:48,250 --> 00:01:51,638 In this case, the "street" type is too narrow.

40 00:01:52,071 --> 00:01:54,735 Some locations are parks or bridges

41 00:01:54,735 --> 00:01:57,735 so we manually pick a broader type.

42 00:01:57,936 --> 00:01:59,859 Let's see what other information we could use

43 00:01:59,859 --> 00:02:02,200 to improve the matches.

44 00:02:02,200 --> 00:02:04,754 The postcode looks like a good fit

45 00:02:04,754 --> 00:02:07,300 but unfortunately postcodes are rarely

46 00:02:07,300 --> 00:02:10,300 added on street items.

47 00:02:10,600 --> 00:02:13,000 The last column contains the geographical

48 00:02:13,000 --> 00:02:14,535 coordinates of the locations,

49 00:02:14,535 --> 00:02:17,535 expressed as latitude, comma, longitude.

50 00:02:18,912 --> 00:02:22,379 We can match that to the coordinates of the streets.

51 00:02:22,688 --> 00:02:25,218 The closer these geographical points are,

52 00:02:25,218 --> 00:02:28,218 the higher the matching score.
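
To make the idea of a distance-based score concrete, here is a rough Python sketch. The scoring formula actually used by the Wikidata reconciliation service is not given in the video, so `proximity_score` and its `max_km` cutoff are purely hypothetical; only the "latitude, comma, longitude" input format comes from the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0


def parse_coordinates(value):
    """Parse a 'latitude,longitude' string into a pair of floats."""
    lat, lon = (float(part) for part in value.split(","))
    return lat, lon


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def proximity_score(a, b, max_km=1.0):
    """Hypothetical score: 100 at zero distance, falling linearly to 0 at max_km."""
    return max(0.0, 100.0 * (1 - haversine_km(a, b) / max_km))
```

With a function like this, two streets sharing a name but lying in different places get different scores, which is the effect shown in the next step.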

53 00:02:35,400 --> 00:02:37,949 Once reconciliation is done,

54 00:02:37,949 --> 00:02:39,902 we can inspect the matches.

55 00:02:39,902 --> 00:02:41,644 In this case, we can see that two streets

56 00:02:41,644 --> 00:02:43,034 with the same name

57 00:02:43,034 --> 00:02:45,073 got different matching scores,

58 00:02:45,073 --> 00:02:48,073 thanks to the matching on coordinates.

59 00:02:48,111 --> 00:02:51,111 The first one is the correct one.

60 00:02:52,600 --> 00:02:55,209 This cell was not matched automatically

61 00:02:55,209 --> 00:02:56,550 because the gap between the two scores

62 00:02:56,550 --> 00:02:58,300 is not big enough.

63 00:02:58,300 --> 00:03:00,550 I suspect there are more cases like this,

64 00:03:00,550 --> 00:03:02,900 so I am just going to filter the cells

65 00:03:02,900 --> 00:03:06,100 which were not matched

66 00:03:06,100 --> 00:03:10,431 but whose best candidate score is very high.

67 00:03:12,050 --> 00:03:14,500 I'm also going to add a facet

68 00:03:14,500 --> 00:03:16,238 which computes the string similarity

69 00:03:16,238 --> 00:03:17,819 between the cell content

70 00:03:17,819 --> 00:03:20,202 and the name of the best match

71 00:03:20,202 --> 00:03:23,202 and restrict to the high-quality matches.
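
The combination of the two facets can be sketched in Python as a filter over rows. Everything here is an assumption made for illustration: the row fields (`cell`, `best_name`, `best_score`, `matched`), the thresholds, and the use of `difflib.SequenceMatcher` as a stand-in for whatever similarity measure the OpenRefine facet computes.

```python
from difflib import SequenceMatcher


def name_similarity(cell, candidate):
    """Case-insensitive similarity in [0, 1] between a cell and a candidate name."""
    return SequenceMatcher(None, cell.lower(), candidate.lower()).ratio()


def confident_unmatched(rows, min_score=90.0, min_similarity=0.9):
    """Keep unmatched rows whose best candidate scores high and has a close name."""
    return [
        row for row in rows
        if not row["matched"]
        and row["best_score"] >= min_score
        and name_similarity(row["cell"], row["best_name"]) >= min_similarity
    ]
```

Rows that pass both conditions are the ones worth reviewing by eye before matching them in bulk.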

72 00:03:24,500 --> 00:03:26,127 Let's review these filtered rows

73 00:03:26,127 --> 00:03:27,860 and their best candidates.

74 00:03:48,209 --> 00:03:50,679 All these candidates are correct.

75 00:03:50,679 --> 00:03:52,469 So click "Reconcile" -> "Actions"

76 00:03:52,469 --> 00:03:55,469 -> "Match each cell to its best candidate".

77 00:03:57,015 --> 00:03:58,931 Obviously this operation should be used with care

78 00:03:58,931 --> 00:04:01,931 because it can introduce false positives.

79 00:04:03,650 --> 00:04:05,200 Let's now check the quality

80 00:04:05,200 --> 00:04:07,671 of the matched cells.

81 00:04:08,079 --> 00:04:09,582 For instance,

82 00:04:09,582 --> 00:04:11,100 we can fetch the administrative location

83 00:04:11,100 --> 00:04:15,669 of these streets.

84 00:04:23,676 --> 00:04:24,918 Once these locations are fetched,

85 00:04:24,918 --> 00:04:27,514 we can create a text facet on this column

86 00:04:27,514 --> 00:04:29,168 and sort the facet

87 00:04:29,168 --> 00:04:32,168 by decreasing number of occurrences.
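
A text facet sorted by count is essentially a frequency table. As a small sketch of the same idea in Python, assuming the fetched column is available as a list of strings (the helper name `text_facet` is hypothetical):

```python
from collections import Counter


def text_facet(values):
    """Count each non-empty value and return (value, count) pairs, most frequent first."""
    return Counter(value for value in values if value).most_common()
```

Scanning the top of such a list is a quick way to spot values that do not belong, for example a location outside the expected city.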

88 00:04:39,550 --> 00:04:42,100 This gives us a broad overview

89 00:04:42,100 --> 00:04:47,241 of the most frequent values.

90 00:04:47,241 --> 00:04:47,800 We can review this list.

91 00:04:47,800 --> 00:04:50,050 All these locations are neighborhoods in Paris,

92 00:04:50,050 --> 00:04:52,947 which is consistent with the dataset.

93 00:05:03,250 --> 00:05:06,551 This is the end of the second part of this tutorial.

94 00:05:06,551 --> 00:05:08,500 In the next video, we are going to

95 00:05:08,500 --> 00:05:10,965 transform our table into statements

96 00:05:10,965 --> 00:05:12,832 and upload them to Wikidata.

TimedText:Wikidata Editing with OpenRefine - Part 2.webm.en.srt