
This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.

Output

Discrepancies detected:

Module:etymology languages/canonical names

Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.

Module:etymology languages/code to canonical name

Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.

Module:etymology languages/data

Literary Chinese (lzh-lit) has a canonical name that is not unique; it is also used by the code lzh.

Module:families/data

Old Indo-Aryan languages (inc-old) has no child families or languages.
Middle Iranian languages (ira-mid) has no child families or languages.
Old Iranian languages (ira-old) has no child families or languages.
creole languages (qfa-cre) has no child families or languages.
pidgin languages (qfa-pid) has no child families or languages.

Module:languages/data/2

Norwegian Bokmål (nb) has Middle Norwegian (gmq-mno) set as an ancestor, but is not in the West Scandinavian languages (gmq-wes).
Norwegian Bokmål (nb) has Danish (da) set as an ancestor, but is not in the East Scandinavian languages (gmq-eas).

Module:languages/data/3/h

Caribbean Hindustani (hns) has Bhojpuri (bho) set as an ancestor, but is not in the Bihari languages (inc-bih).
Caribbean Hindustani (hns) has Awadhi (awa) set as an ancestor, but is not in the Eastern Hindi languages (inc-hie).

Module:languages/data/exceptional

Proto-Central Togo (alv-gtm-pro) does not have the expected name "Proto-Ghana-Togo Mountain", even though it is the proto-language of the Ghana-Togo Mountain languages (alv-gtm).
Proto-Arawa (auf-pro) does not have the expected name "Proto-Arauan", even though it is the proto-language of the Arauan languages (auf).
Proto-Amuesha-Chamicuro (awd-amc-pro) has a proto-language code associated with the invalid code "awd-amc".
Proto-Kampa (awd-kmp-pro) has a proto-language code associated with the invalid code "awd-kmp".
Proto-Arawak (awd-pro) does not have the expected name "Proto-Arawakan", even though it is the proto-language of the Arawakan languages (awd).
Proto-Paresi-Waura (awd-prw-pro) has a proto-language code associated with the invalid code "awd-prw".
Proto-Ta-Arawak (awd-taa-pro) does not have the expected name "Proto-Ta-Arawakan", even though it is the proto-language of the Ta-Arawakan languages (awd-taa).
Proto-Rukai (dru-pro) has a proto-language code associated with Rukai (dru), which is not a family.
Proto-Basque (euq-pro) does not have the expected name "Proto-Vasconic", even though it is the proto-language of the Vasconic languages (euq).
Proto-Norse (gmq-pro) does not have the expected name "Proto-North Germanic", even though it is the proto-language of the North Germanic languages (gmq).
Proto-Kamta (inc-krd-pro) does not have the expected name "Proto-KRDS lects", even though it is the proto-language of the KRDS lects (inc-krd).
Kelantan Peranakan Hokkien (mis-hkl) has its canonical name ("Kelantan Peranakan Hokkien") repeated in the table of aliases.
Proto-Chumash (nai-chu-pro) does not have the expected name "Proto-Chumashan", even though it is the proto-language of the Chumashan languages (nai-chu).
Proto-Maidun (nai-mdu-pro) does not have the expected name "Proto-Maiduan", even though it is the proto-language of the Maiduan languages (nai-mdu).
Proto-Mixe-Zoque (nai-miz-pro) does not have the expected name "Proto-Mixe-Zoquean", even though it is the proto-language of the Mixe-Zoquean languages (nai-miz).
Proto-Pomo (nai-pom-pro) does not have the expected name "Proto-Pomoan", even though it is the proto-language of the Pomoan languages (nai-pom).
Proto-Mazatec (omq-maz-pro) does not have the expected name "Proto-Mazatecan", even though it is the proto-language of the Mazatecan languages (omq-maz).
Proto-North Sarawak (poz-swa-pro) does not have the expected name "Proto-North Sarawakan", even though it is the proto-language of the North Sarawakan languages (poz-swa).
Proto-Salish (sal-pro) does not have the expected name "Proto-Salishan", even though it is the proto-language of the Salishan languages (sal).
Proto-Puroik (sit-khp-pro) has a proto-language code associated with the invalid code "sit-khp".
Proto-Northern Naga (sit-kon-pro) does not have the expected name "Proto-Konyak", even though it is the proto-language of the Konyak languages (sit-kon).
Proto-Samic (smi-pro) does not have the expected name "Proto-Sami", even though it is the proto-language of the Sami languages (smi).
Proto-Kuki-Chin (tbq-kuk-pro) does not have the expected name "Proto-Kukish", even though it is the proto-language of the Kukish languages (tbq-kuk).
Proto-Saka (xsc-sak-pro) does not have the expected name "Proto-Sakan", even though it is the proto-language of the Sakan languages (xsc-sak).

Module:languages/data/wikidata.json

apc is set as an ISO 639-3 code on multiple items: Q56593 and Q22809485.
kjv is set as an ISO 639-3 code on multiple items: Q838165 and Q31199873.
msn is set as an ISO 639-3 code on multiple items: Q3331111 and Q3563857.
ttt is set as an ISO 639-3 code on multiple items: Q56489 and Q123964178.

Module:scripts/data

Blissymbolic script (Blis) is not used by any language and has no characters listed for auto-detection.
Cypro-Minoan script (Cpmn) is not used by any language.
Hiragana script (Hira) is not used by any language.
Kana script (Hrkt) is not used by any language.
Image-rendered script (Image) is not used by any language and has no characters listed for auto-detection.
International Phonetic Alphabet (Ipach) is not used by any language and has no characters listed for auto-detection.
Moon script (Moon) is not used by any language and has no characters listed for auto-detection.
Morse code (Morse) is not used by any language and has no characters listed for auto-detection.
musical notation (Music) is not used by any language.
unspecified script (None) is not used by any language and has no characters listed for auto-detection.
Proto-Cuneiform script (Pcun) is not used by any language and has no characters listed for auto-detection.
Proto-Elamite script (Pelm) is not used by any language and has no characters listed for auto-detection.
Proto-Sinaitic script (Psin) is not used by any language and has no characters listed for auto-detection.
Rongorongo script (Roro) is not used by any language and has no characters listed for auto-detection.
Rumi numerals (Rumin) is not used by any language.
flag semaphore (Semap) is not used by any language and has no characters listed for auto-detection.
Visible Speech script (Visp) is not used by any language and has no characters listed for auto-detection.
mathematical notation (Zmth) is not used by any language.
symbolic script (Zsym) is not used by any language.
undetermined script (Zyyy) is not used by any language and has no characters listed for auto-detection.
uncoded script (Zzzz) is not used by any language and has no characters listed for auto-detection.
The codes fa-Arab, ug-Arab, ks-Arab, ps-Arab, ur-Arab, ku-Arab, tt-Arab, ota-Arab, mzn-Arab and sd-Arab are currently alias codes. Only one code should be used in the data.
The codes ms-Arab and kk-Arab are currently alias codes. Only one code should be used in the data.

Checks performed

For multiple data modules:

Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
Each name in the list of other names must appear only once.
otherNames, if present, must be an array.
Each Wikidata item ID must be a positive integer or a string starting with Q and ending with decimal digits.
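
As a rough illustration of the shared checks above, the sketch below is written in Python rather than the module's own Lua, so none of it is the actual implementation. The function names are hypothetical, and "a string starting with Q and ending with decimal digits" is interpreted here as Q followed by one or more digits, which is an assumption.

```python
import re


def is_valid_wikidata_id(value):
    """Accept a positive integer, or a string of the form Q<digits> (assumed reading)."""
    if isinstance(value, bool):  # bool is a subclass of int in Python; reject it
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False


def check_other_names(canonical_name, other_names):
    """Return a list of problems found in one entry's otherNames (hypothetical helper)."""
    if other_names is None:
        return []
    if not isinstance(other_names, list):
        return ["otherNames is not an array"]
    problems = []
    seen = set()
    for name in other_names:
        if name == canonical_name:
            problems.append(f"canonical name {name!r} is repeated in otherNames")
        elif name in seen:
            problems.append(f"{name!r} is listed more than once")
        seen.add(name)
    return problems
```

For example, `is_valid_wikidata_id("Q56593")` is accepted while `is_valid_wikidata_id("56593")` is rejected, and `check_other_names("A", ["B", "B"])` reports one duplicate.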

The following must be true of the data used by Module:languages:

Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
If field 2 is not nil, it must be a valid Wikidata item ID.
If field 3 or family is given and not nil, it must be a valid family code.
If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
If family is given, it must be a valid family code.
If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
If sort_key is given, it may be either a string or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
If entry_name or sort_key is given, the from array must be at least as long as the to array.
If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
If link_tr is present, it must be true.
There must be no data keys besides these: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
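
A few of the per-language rules above can be sketched as follows. This is a Python illustration under stated assumptions, not the module's Lua code: each language entry is modelled as a dict with integer and string keys, and `check_language_entry` is a hypothetical name. The allowed keys and type values are copied from the list above.

```python
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}
VALID_TYPES = {"regular", "reconstructed", "appendix-constructed"}


def check_language_entry(data):
    """Return a list of problems for one language entry (illustrative subset of the checks)."""
    problems = []
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"unrecognised key {key!r}")
    if not data.get(1):
        problems.append("canonical name (field 1) is missing")
    if "type" in data and data["type"] not in VALID_TYPES:
        problems.append(f"unrecognised type {data['type']!r}")
    for field in ("entry_name", "sort_key"):
        value = data.get(field)
        if isinstance(value, dict):
            src, dst = value.get("from"), value.get("to")
            # The from array must be at least as long as the to array.
            if src is not None and dst is not None and len(src) < len(dst):
                problems.append(f"{field}: from is shorter than to")
    if data.get("override_translit") and not data.get("translit"):
        problems.append("override_translit is set but translit is not")
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append("link_tr is present but is not true")
    return problems
```

An entry such as `{1: "Example", "colour": "red"}` would yield a single "unrecognised key" problem. Checks that need the other data modules, such as family and script code validity, are left out of this sketch.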

Checks not performed:

If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).

These are not checked here, because module errors will quickly crop up in entries if these conditions are not met, assuming that Module:utilities attempts to generate a sortkey for a category pertaining to the language in question, or full_link attempts to use the transliteration module.

Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
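
The "all the codes and canonical names, and no more" requirement amounts to comparing each lookup table with one derived from the data. The following is a minimal Python sketch, assuming the data is a mapping from code to an entry whose field 1 is the canonical name. The function name and message wording are hypothetical and do not reproduce the module's own output.

```python
def check_lookup_tables(data, code_to_name, name_to_code):
    """Compare two lookup tables against the language data they should mirror."""
    problems = []
    expected = {code: entry[1] for code, entry in data.items()}
    for code, name in sorted(expected.items()):
        if code_to_name.get(code) != name:
            problems.append(f"code-to-name: {code} should map to {name!r}")
        if name_to_code.get(name) != code:
            problems.append(f"name-to-code: {name!r} should map to {code}")
    # Anything left over in the lookup tables is "more" than the data contains.
    for code in sorted(code_to_name.keys() - expected.keys()):
        problems.append(f"code-to-name: unexpected code {code}")
    for name in sorted(name_to_code.keys() - set(expected.values())):
        problems.append(f"name-to-code: unexpected name {name!r}")
    return problems
```

Note that if two codes shared one canonical name, the name-to-code table could hold only one of them, so a comparison of this kind would flag the other.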

The following must be true of the data used by Module:etymology languages:

canonicalName must be given.
parent must be given and must be a valid language, family or etymology-only language code.
If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
There must be no data keys besides these: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".
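
The etymology-language rules above could be checked along these lines. This is a Python sketch with toy data structures, not the module's Lua code: the three arguments are assumed to be dicts keyed by code, and the function name is hypothetical. The rule that an etymology language listed in ancestors should also be an ancestor of a regular language is not covered.

```python
ETYM_ALLOWED_KEYS = {
    "canonicalName", "otherNames", "parent", "ancestors",
    "wikipedia_article", "wikidata_item",
}


def check_etymology_languages(etym, languages, families):
    """Return problems for etymology-only language entries (illustrative)."""
    parent_codes = set(etym) | set(languages) | set(families)
    ancestor_codes = set(etym) | set(languages)
    problems = []
    for code, entry in sorted(etym.items()):
        if not entry.get("canonicalName"):
            problems.append(f"{code}: canonicalName is missing")
        parent = entry.get("parent")
        if parent is None:
            problems.append(f"{code}: parent is missing")
        elif parent not in parent_codes:
            problems.append(f"{code}: parent {parent!r} is not a valid code")
        for ancestor in entry.get("ancestors") or []:
            if ancestor not in ancestor_codes:
                problems.append(f"{code}: ancestor {ancestor!r} is not a valid code")
        for key in entry:
            if key not in ETYM_ALLOWED_KEYS:
                problems.append(f"{code}: unrecognised key {key!r}")
    return problems
```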

Codes in Module:families data must:

Have canonicalName, which must not be the same as the canonical name of another family.
Have a valid family code as family, if family is given.
Have at least one language or subfamily belonging to it.
Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".
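
The "at least one language or subfamily" rule can be sketched by collecting every family code that is referenced as a parent. The Python below is an illustration only. It assumes that families store their parent under family, that languages store their family in field 3, and that a family naming itself as its own parent does not count as a child. The function name and sample codes are made up.

```python
def families_without_children(families, languages):
    """Return the sorted codes of families that no language or subfamily belongs to."""
    referenced = set()
    for code, family in families.items():
        parent = family.get("family")
        if parent is not None and parent != code:
            referenced.add(parent)
    for language in languages.values():
        if language.get(3) is not None:
            referenced.add(language[3])
    return sorted(code for code in families if code not in referenced)


# Toy data: fam-a has two subfamilies, fam-b has one language, fam-c has nothing.
families = {
    "fam-a": {"canonicalName": "A languages"},
    "fam-b": {"canonicalName": "B languages", "family": "fam-a"},
    "fam-c": {"canonicalName": "C languages", "family": "fam-a"},
}
languages = {"xx": {1: "Example", 3: "fam-b"}}
empty_families = families_without_children(families, languages)  # ["fam-c"]
```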

Codes in Module:scripts data must:

Have canonicalName.
Have at least one language that lists it as one of its scripts.
Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".
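
The two script findings that appear in the output (unused scripts and missing characters) can be approximated as below. This is a Python sketch rather than the module's code: `check_scripts` is a hypothetical name, languages are assumed to list their scripts under a scripts key, and the sample data is a toy. Validating that characters forms a legal Lua pattern is not attempted, since Python's regular expressions use a different syntax.

```python
def check_scripts(scripts, languages):
    """Return one message per script that is unused or lacks a characters pattern."""
    used = set()
    for language in languages.values():
        used.update(language.get("scripts") or [])
    messages = []
    for code, script in sorted(scripts.items()):
        faults = []
        if code not in used:
            faults.append("is not used by any language")
        if not script.get("characters"):
            faults.append("has no characters listed for auto-detection")
        if faults:
            messages.append(f"{script['canonicalName']} ({code}) " + " and ".join(faults) + ".")
    return messages


# Toy data; the characters values are illustrative only.
scripts = {
    "Latn": {"canonicalName": "Latin script", "characters": "A-Za-z"},
    "Blis": {"canonicalName": "Blissymbolic script"},
    "Hira": {"canonicalName": "Hiragana script", "characters": "ぁ-ゖ"},
}
languages = {"en": {1: "English", "scripts": ["Latn"]}}
script_messages = check_scripts(scripts, languages)
```

With this toy data, Blis is reported for both faults and Hira only for being unused, in the same form as the messages listed under Output.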
