Unicode Confusables and Adjacency Pairs Terms of Use

 


What is the FirstVoices Unicode 'Confusables' Database?


The FirstVoices Unicode 'Confusables' Database is a collection of Unicode characters found primarily within BC-based Indigenous languages, alongside the Unicode characters that can be mistakenly used in place of them. As an example:

Latin Small Letter K With Line Below

\u1E35

is a character found in many Indigenous alphabets. Some Unicode characters commonly mistaken for this one are:

Latin Small Letter K + Combining Low Line

\u006B\u0332

Latin Small Letter K + Combining Minus Sign Below

\u006B\u0320

Latin Small Letter K + Combining Macron Below

\u006B\u0331
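To see why these sequences are easy to confuse yet distinct to software, the short Python sketch below prints the codepoints and official Unicode names behind each one. It uses only the standard library; the helper name `describe` is ours, chosen for illustration, and is not part of the database.

```python
import unicodedata

# The expected alphabet character and the three look-alike sequences listed above.
target = "\u1E35"
lookalikes = ["\u006B\u0332", "\u006B\u0320", "\u006B\u0331"]

def describe(s):
    """Spell out a string as 'U+XXXX NAME' for each codepoint it contains."""
    return " + ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s)

print(describe(target))
for seq in lookalikes:
    # Each look-alike is two codepoints, so a plain string comparison fails.
    print(describe(seq), "| same string as target:", seq == target)
```

Each look-alike is a two-codepoint string, so a naive equality check against \u1E35 fails even though the rendered result may look the same on screen.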

 


What is the FirstVoices Unicode 'Adjacency Pair' Database?


The FirstVoices Unicode 'Adjacency Pair' Database is an aggregated collection of Unicode-based character pairs used in BC-based Indigenous languages, alongside a snapshot of their type-based frequencies from the FirstVoices language platform. It demonstrates the relative frequency of different Unicode characters as well as the relative frequency of adjacent pairs of characters in BC-based Indigenous languages.

There are three component resources: a database of pairs by Unicode codepoint, a database of pairs by grapheme, and a list of graphemes. To illustrate the difference:

Here is one example adjacency pair by codepoint, involving one lowercase letter and one combining diacritic:

Latin Small Letter K + Combining Comma Above

\u006B + \u0313

Because the pair combines a base letter with a combining diacritic, it is also an example of a single grapheme.

Here is one example adjacency pair by grapheme, which involves a total of two non-combining characters:

k̓w

Latin Small Letter K, Combining Comma Above + Latin Small Letter W

\u006B\u0313 + \u0077
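The difference between the two kinds of pair can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the `graphemes` helper below simply attaches combining marks to the preceding base character, which is simpler than full Unicode grapheme segmentation and may not match how the FirstVoices grapheme list was built (it does not, for example, treat multi-letter sequences as one grapheme). Both function names are hypothetical.

```python
import unicodedata
from collections import Counter

def graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplified rule for illustration only, not full grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

def adjacency_pairs(units):
    """Count each pair of neighbouring units."""
    return Counter(zip(units, units[1:]))

word = "\u006B\u0313\u0077"  # k + combining comma above + w

by_codepoint = adjacency_pairs(list(word))
by_grapheme = adjacency_pairs(graphemes(word))
```

For this word, counting by codepoint yields two pairs (k with the comma above, and the comma above with w), while counting by grapheme yields the single pair of k̓ followed by w.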

 


Where is the database?


The FirstVoices Unicode Databases can be found here.

 


Why did we curate these databases?


Indigenous languages are sorely underrepresented online and in digital spaces. UTF-8 is the standard encoding that should be used, as it provides access to all the necessary Unicode characters. Even so, thorough testing is needed to make digital environments truly inclusive. "Uncommon" Unicode characters (i.e. characters that aren't used in the alphabets of majority languages like English) often go untested when software, websites, fonts or other digital environments are checked for how they render text. The result is that these characters often render incorrectly, or not at all.

The other problem we wanted to resolve on our own platform (FirstVoices) was search and sorting for dictionary content. We found that many issues arose from discrepancies between the Unicode characters our system expected to see as 'alphabet letters' and the Unicode characters that were actually being entered. For example, if a language has ḵ (\u1E35) in its alphabet on FirstVoices but a word is spelt using ḵ (\u006B\u0331), that word was much less likely to show up properly in search results and was often placed incorrectly in alphabetized lists of words. These 'confusable' characters were being added to language content for a number of reasons: they were copy-pasted from other documents, language recorders were using a keyboard that produced the wrong character, and more. To fix this, we documented instances of confusables for every alphabet character found on FirstVoices and developed a system that detects confusables within language content, matches each confusable to the correct alphabet character for that language, and then automatically corrects the Unicode in the content.
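The detect, match and correct steps described above might look roughly like the following Python sketch. It is not the FirstVoices implementation: the `CONFUSABLES` table, its contents, and the function names are assumptions made for illustration, and a real table would be defined per language.

```python
# Hypothetical table for one language: confusable sequence -> expected alphabet character.
CONFUSABLES = {
    "\u006B\u0332": "\u1E35",  # k + combining low line
    "\u006B\u0320": "\u1E35",  # k + combining minus sign below
    "\u006B\u0331": "\u1E35",  # k + combining macron below
}

def find_confusables(text, table=CONFUSABLES):
    """Return a sorted list of (index, sequence) for every confusable found."""
    hits = []
    for seq in table:
        start = text.find(seq)
        while start != -1:
            hits.append((start, seq))
            start = text.find(seq, start + 1)
    return sorted(hits)

def correct(text, table=CONFUSABLES):
    """Replace every confusable sequence with its expected character."""
    # Longest sequences first, so a short key never splits a longer one.
    for seq in sorted(table, key=len, reverse=True):
        text = text.replace(seq, table[seq])
    return text

entry = "\u006B\u0332a"          # typed with k + combining low line
print(find_confusables(entry))   # where the problem sequences are
print(correct(entry) == "\u1E35a")
```

Keeping detection separate from correction lets the hits be reviewed or logged before any content is changed.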

While normalization functions resolve differences between characters like á (\u00E1) and á (\u0061\u0301), they don't resolve issues like pʼ (\u0070\u02bc), p̓ (\u0070\u0313) and p̕ (\u0070\u0315). These aren't examples of Canonical Equivalence or Compatibility Equivalence, but they are commonly mistaken for one another because they look similar, depending on the font being used.
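The limits of normalization can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only; the variable and helper names are ours.

```python
import unicodedata

def nfc(s):
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    """Compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", s)

# Canonically equivalent: normalization maps the two-codepoint form to one codepoint.
a_acute_resolved = nfc("\u0061\u0301") == "\u00E1"

# Look-alikes that are not equivalent: they stay distinct after normalization.
p_variants = ["\u0070\u02BC", "\u0070\u0313", "\u0070\u0315"]
distinct_after_nfc = len({nfc(v) for v in p_variants})
distinct_after_nfkc = len({nfkc(v) for v in p_variants})

# Of the k look-alikes listed earlier, only k + combining macron below
# composes to \u1E35; k + combining low line does not.
macron_resolved = nfc("\u006B\u0331") == "\u1E35"
low_line_resolved = nfc("\u006B\u0332") == "\u1E35"

print(a_acute_resolved, distinct_after_nfc, distinct_after_nfkc)
print(macron_resolved, low_line_resolved)
```

This is why a confusables table is needed in addition to normalization: normalization handles only sequences that Unicode defines as equivalent, not sequences that merely look alike.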

 


Why are we publishing this database?


We see this work as useful and applicable for other projects, whether they have an Indigenous focus or not. Some examples of where this database may be of use:

  • Testing fonts – Is your font equipped to handle Indigenous languages in the correct way?

  • Building and testing keyboards – Does your keyboard employ the Unicode characters that are correct for the language you're working with?

  • Checking for Unicode discrepancies within Indigenous-language data sets – Are there 'confusables' within a document or data set that may cause issues later on?

  • Improving search functionality within Indigenous-language software – Are there instances of 'confusables' in your language data that may be reducing functionality?

  • Other – If you have an application for this work that we haven't considered, please let us know!

 


Terms of use


The FirstVoices Unicode 'Confusables' and 'Adjacency Pair' Databases are published under the Apache License, Version 2.0.

Our team would like to track the scope of this work and understand its broader applications, and potentially share your work as inspiration for improving Indigenous language support in digital spaces. Please email hello@firstvoices.com to let us know where and how this database is being used in your project.