Module:data consistency check/documentation
This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.
Output
Discrepancies detected:
Module:etymology languages/canonical names
- Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.
Module:etymology languages/code to canonical name
- Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.
Module:etymology languages/data
- Literary Chinese (lzh-lit) has a canonical name that is not unique; it is also used by the code lzh.
Module:families/data
- Old Indo-Aryan languages (inc-old) has no child families or languages.
- Middle Iranian languages (ira-mid) has no child families or languages.
- Old Iranian languages (ira-old) has no child families or languages.
- creole languages (qfa-cre) has no child families or languages.
- pidgin languages (qfa-pid) has no child families or languages.
Module:languages/data/2
- Norwegian Bokmål (nb) has Middle Norwegian (gmq-mno) set as an ancestor, but is not in the West Scandinavian languages (gmq-wes).
- Norwegian Bokmål (nb) has Danish (da) set as an ancestor, but is not in the East Scandinavian languages (gmq-eas).
Module:languages/data/3/h
- Caribbean Hindustani (hns) has Bhojpuri (bho) set as an ancestor, but is not in the Bihari languages (inc-bih).
- Caribbean Hindustani (hns) has Awadhi (awa) set as an ancestor, but is not in the Eastern Hindi languages (inc-hie).
Module:languages/data/exceptional
- Proto-Central Togo (alv-gtm-pro) does not have the expected name "Proto-Ghana-Togo Mountain", even though it is the proto-language of the Ghana-Togo Mountain languages (alv-gtm).
- Proto-Arawa (auf-pro) does not have the expected name "Proto-Arauan", even though it is the proto-language of the Arauan languages (auf).
- Proto-Amuesha-Chamicuro (awd-amc-pro) has a proto-language code associated with the invalid code "awd-amc".
- Proto-Kampa (awd-kmp-pro) has a proto-language code associated with the invalid code "awd-kmp".
- Proto-Arawak (awd-pro) does not have the expected name "Proto-Arawakan", even though it is the proto-language of the Arawakan languages (awd).
- Proto-Paresi-Waura (awd-prw-pro) has a proto-language code associated with the invalid code "awd-prw".
- Proto-Ta-Arawak (awd-taa-pro) does not have the expected name "Proto-Ta-Arawakan", even though it is the proto-language of the Ta-Arawakan languages (awd-taa).
- Proto-Rukai (dru-pro) has a proto-language code associated with Rukai (dru), which is not a family.
- Proto-Basque (euq-pro) does not have the expected name "Proto-Vasconic", even though it is the proto-language of the Vasconic languages (euq).
- Proto-Norse (gmq-pro) does not have the expected name "Proto-North Germanic", even though it is the proto-language of the North Germanic languages (gmq).
- Proto-Kamta (inc-krd-pro) does not have the expected name "Proto-KRDS lects", even though it is the proto-language of the KRDS lects (inc-krd).
- Kelantan Peranakan Hokkien (mis-hkl) has its canonical name ("Kelantan Peranakan Hokkien") repeated in the table of aliases.
- Proto-Chumash (nai-chu-pro) does not have the expected name "Proto-Chumashan", even though it is the proto-language of the Chumashan languages (nai-chu).
- Proto-Maidun (nai-mdu-pro) does not have the expected name "Proto-Maiduan", even though it is the proto-language of the Maiduan languages (nai-mdu).
- Proto-Mixe-Zoque (nai-miz-pro) does not have the expected name "Proto-Mixe-Zoquean", even though it is the proto-language of the Mixe-Zoquean languages (nai-miz).
- Proto-Pomo (nai-pom-pro) does not have the expected name "Proto-Pomoan", even though it is the proto-language of the Pomoan languages (nai-pom).
- Proto-Mazatec (omq-maz-pro) does not have the expected name "Proto-Mazatecan", even though it is the proto-language of the Mazatecan languages (omq-maz).
- Proto-North Sarawak (poz-swa-pro) does not have the expected name "Proto-North Sarawakan", even though it is the proto-language of the North Sarawakan languages (poz-swa).
- Proto-Salish (sal-pro) does not have the expected name "Proto-Salishan", even though it is the proto-language of the Salishan languages (sal).
- Proto-Puroik (sit-khp-pro) has a proto-language code associated with the invalid code "sit-khp".
- Proto-Northern Naga (sit-kon-pro) does not have the expected name "Proto-Konyak", even though it is the proto-language of the Konyak languages (sit-kon).
- Proto-Samic (smi-pro) does not have the expected name "Proto-Sami", even though it is the proto-language of the Sami languages (smi).
- Proto-Kuki-Chin (tbq-kuk-pro) does not have the expected name "Proto-Kukish", even though it is the proto-language of the Kukish languages (tbq-kuk).
- Proto-Saka (xsc-sak-pro) does not have the expected name "Proto-Sakan", even though it is the proto-language of the Sakan languages (xsc-sak).
Module:languages/data/wikidata.json
- apc is set as an ISO 639-3 code on multiple items: Q56593 and Q22809485.
- kjv is set as an ISO 639-3 code on multiple items: Q838165 and Q31199873.
- msn is set as an ISO 639-3 code on multiple items: Q3331111 and Q3563857.
- ttt is set as an ISO 639-3 code on multiple items: Q56489 and Q123964178.
Module:scripts/data
- Blissymbolic script (Blis) is not used by any language and has no characters listed for auto-detection.
- Cypro-Minoan script (Cpmn) is not used by any language.
- Hiragana script (Hira) is not used by any language.
- Kana script (Hrkt) is not used by any language.
- Image-rendered script (Image) is not used by any language and has no characters listed for auto-detection.
- International Phonetic Alphabet (Ipach) is not used by any language and has no characters listed for auto-detection.
- Moon script (Moon) is not used by any language and has no characters listed for auto-detection.
- Morse code (Morse) is not used by any language and has no characters listed for auto-detection.
- musical notation (Music) is not used by any language.
- unspecified script (None) is not used by any language and has no characters listed for auto-detection.
- Proto-Cuneiform script (Pcun) is not used by any language and has no characters listed for auto-detection.
- Proto-Elamite script (Pelm) is not used by any language and has no characters listed for auto-detection.
- Proto-Sinaitic script (Psin) is not used by any language and has no characters listed for auto-detection.
- Rongorongo script (Roro) is not used by any language and has no characters listed for auto-detection.
- Rumi numerals (Rumin) is not used by any language.
- flag semaphore (Semap) is not used by any language and has no characters listed for auto-detection.
- Visible Speech script (Visp) is not used by any language and has no characters listed for auto-detection.
- mathematical notation (Zmth) is not used by any language.
- symbolic script (Zsym) is not used by any language.
- undetermined script (Zyyy) is not used by any language and has no characters listed for auto-detection.
- uncoded script (Zzzz) is not used by any language and has no characters listed for auto-detection.
- The codes fa-Arab, ug-Arab, ks-Arab, ps-Arab, ur-Arab, ku-Arab, tt-Arab, ota-Arab, mzn-Arab and sd-Arab are currently alias codes. Only one code should be used in the data.
- The codes ms-Arab and kk-Arab are currently alias codes. Only one code should be used in the data.
Checks performed
For multiple data modules:
- Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
- Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
- Each name in the list of other names must appear only once.
- otherNames, if present, must be an array.
- Wikidata item IDs must be a positive integer or a string starting with Q and ending with decimal digits.
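The Wikidata ID and otherNames rules above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the module's own Lua code; the function names is_valid_wikidata_id and check_other_names are hypothetical, and the sketch assumes that a string ID consists of Q followed by nothing but decimal digits.

```python
import re


def is_valid_wikidata_id(value):
    """Return True for a positive integer or a string such as "Q1860"."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        # Assumption: "Q" followed only by decimal digits.
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False


def check_other_names(canonical_name, other_names):
    """Return a list of problems found in an otherNames list."""
    if not isinstance(other_names, list):
        return ["otherNames is not an array"]
    problems = []
    seen = set()
    for name in other_names:
        if name == canonical_name:
            problems.append(f"canonical name {name!r} repeated in otherNames")
        if name in seen:
            problems.append(f"{name!r} listed more than once")
        seen.add(name)
    return problems
```

For example, is_valid_wikidata_id("Q1860") and is_valid_wikidata_id(1860) both pass, while "1860" and 0 are rejected.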
The following must be true of the data used by Module:languages:
- Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
- The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
- If field 2 is not nil, it must be a valid Wikidata item ID.
- If field 3 or family is given and not nil, it must be a valid family code.
- If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
- If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
- If family is given, it must be a valid family code.
- If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
- If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
- If sort_key is given, it may be either a string or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
- If entry_name or sort_key is given, the from array must be at least as long as the to array.
- If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
- If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
- If link_tr is present, it must be true.
- The data must have no keys besides these: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
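As an illustration of how a few of these per-language rules might be applied to a single record, here is a minimal Python sketch. It is not the module's Lua implementation; check_language and the representation of a Lua data table as a Python dict are assumptions made for the example. It covers only the key whitelist, the canonical name, type, the from/to length rule, override_translit and link_tr.

```python
# Keys permitted in a language record, copied from the list above.
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}
ALLOWED_TYPES = {"regular", "reconstructed", "appendix-constructed"}


def check_language(code, data):
    """Return a list of problems for one language record (a dict)."""
    problems = []
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"{code}: unrecognised data key {key!r}")
    if not data.get(1):
        problems.append(f"{code}: canonical name (field 1) is missing")
    if "type" in data and data["type"] not in ALLOWED_TYPES:
        problems.append(f"{code}: unrecognised type {data['type']!r}")
    for field in ("entry_name", "sort_key"):
        value = data.get(field)
        if isinstance(value, dict):
            source = value.get("from", [])
            target = value.get("to", [])
            # The from array must be at least as long as the to array.
            if len(source) < len(target):
                problems.append(f"{code}: {field}.from is shorter than {field}.to")
    if data.get("override_translit") and not data.get("translit"):
        problems.append(f"{code}: override_translit is set without translit")
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append(f"{code}: link_tr must be true")
    return problems
```

A record such as {1: "English", 2: "Q1860", 3: "gmw", "scripts": ["Latn"]} produces no problems in this sketch.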
Checks not performed:
- If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
- If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
- If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).
These are not checked here, because module errors will quickly crop up in entries if these conditions are not met, assuming that Module:utilities attempts to generate a sortkey for a category pertaining to the language in question, or full_link attempts to use the transliteration module.
Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
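The "all the codes and names, and no more" requirement amounts to a two-way comparison between the data submodules and the two lookup tables. The Python sketch below illustrates the idea; it assumes, without confirmation from the source, that one table maps codes to names and the other maps names to codes, and check_lookup_tables is a hypothetical name.

```python
def check_lookup_tables(language_data, code_to_name, name_to_code):
    """Compare two lookup tables against the language data.

    language_data maps code -> record, where record[1] is the canonical name.
    """
    problems = []
    expected = {code: record[1] for code, record in language_data.items()}
    for code, name in expected.items():
        if code_to_name.get(code) != name:
            problems.append(f"code to canonical name: wrong or missing entry for {code}")
        if name_to_code.get(name) != code:
            problems.append(f"canonical names: wrong or missing entry for {name}")
    # "And no more": report entries that have no counterpart in the data.
    for code in code_to_name:
        if code not in expected:
            problems.append(f"code to canonical name: extra code {code}")
    expected_names = set(expected.values())
    for name in name_to_code:
        if name not in expected_names:
            problems.append(f"canonical names: extra name {name}")
    return problems
```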
The following must be true of the data used by Module:etymology languages:
- canonicalName must be given.
- parent must be given and must be a valid language, family or etymology-only language code.
- If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
- The data must have no keys besides these: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".
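A rough Python sketch of the etymology-language rules follows. It is illustrative only: check_etymology_language, valid_parents and valid_ancestors are hypothetical names, and the caller is assumed to supply the sets of valid codes. The rule that the etymology language should itself be an ancestor of a regular language is not covered.

```python
ETYMOLOGY_KEYS = {
    "canonicalName", "otherNames", "parent", "ancestors",
    "wikipedia_article", "wikidata_item",
}


def check_etymology_language(code, data, valid_parents, valid_ancestors):
    """Return a list of problems for one etymology-only language record."""
    problems = []
    for key in data:
        if key not in ETYMOLOGY_KEYS:
            problems.append(f"{code}: unrecognised data key {key!r}")
    if not data.get("canonicalName"):
        problems.append(f"{code}: canonicalName is missing")
    parent = data.get("parent")
    if parent is None:
        problems.append(f"{code}: parent is missing")
    elif parent not in valid_parents:
        problems.append(f"{code}: parent {parent!r} is not a valid code")
    ancestors = data.get("ancestors")
    if ancestors is not None:
        if not isinstance(ancestors, list):
            problems.append(f"{code}: ancestors is not an array")
        else:
            for ancestor in ancestors:
                if ancestor not in valid_ancestors:
                    problems.append(f"{code}: ancestor {ancestor!r} is not a valid code")
    return problems
```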
Codes in Module:families data must:
- Have canonicalName, which must not be the same as the canonical name of another family.
- If family is given, it must be a valid family code.
- Have at least one language or subfamily belonging to it.
- Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".
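The "at least one language or subfamily" rule is what produces the "has no child families or languages" messages in the output. A minimal Python sketch of that one check is given below; families_without_children is a hypothetical name, and ignoring a family that lists itself as its own parent is an assumption of the sketch, not something the source states.

```python
def families_without_children(families, languages):
    """Return the sorted codes of families that nothing belongs to.

    families maps code -> record with an optional "family" key;
    languages maps code -> record with the family in field 3 or "family".
    """
    has_child = set()
    for record in languages.values():
        family = record.get(3) or record.get("family")
        if family:
            has_child.add(family)
    for code, record in families.items():
        parent = record.get("family")
        # Assumption: a self-reference does not count as a subfamily.
        if parent and parent != code:
            has_child.add(parent)
    return sorted(code for code in families if code not in has_child)
```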
Codes in Module:scripts data must:
- Have canonicalName.
- Have at least one language that lists it as one of its scripts.
- Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
- Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".
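The two script rules that account for most of the messages in the output (unused scripts and missing characters) can be sketched as follows. This is an illustrative Python sketch, not the module's Lua code: check_scripts is a hypothetical name, and validation of characters as a Lua pattern is deliberately left out because it cannot be reproduced faithfully outside Lua.

```python
def check_scripts(scripts, languages):
    """Return a dict mapping script code -> description of its problems."""
    used = set()
    for record in languages.values():
        # Scripts may be given in field 4 or under the "scripts" key.
        used.update(record.get("scripts") or record.get(4) or [])
    problems = {}
    for code, record in scripts.items():
        issues = []
        if code not in used:
            issues.append("is not used by any language")
        if not record.get("characters"):
            issues.append("has no characters listed for auto-detection")
        if issues:
            problems[code] = " and ".join(issues)
    return problems
```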