What is the FirstVoices Unicode 'Confusables' Database?

...

The FirstVoices Unicode 'Confusables' Database is a collection of Unicode characters found primarily within BC-based Indigenous languages, alongside the Unicode characters that can be mistakenly used in place of them. As an example:

Latin Small Letter K With Line Below (\u1E35) is a character found in many Indigenous alphabets. Some Unicode character sequences commonly mistaken for it are:

  • Latin Small Letter K + Combining Low Line: \u006B\u0332

  • Latin Small Letter K + Combining Minus Sign Below: \u006B\u0320

  • Latin Small Letter K + Combining Macron Below: \u006B\u0331
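As a rough illustration of how such entries might be applied, the sketch below replaces the three sequences listed above with the intended character. The `CONFUSABLES` mapping and the `replace_confusables` function are hypothetical names made up for this example. The mapping is hand-built from this one example and does not reflect the actual format or contents of the database.

```python
# Hypothetical mapping built only from the single example above.
# The real database may use a different format and has many more entries.
INTENDED = "\u1E35"  # LATIN SMALL LETTER K WITH LINE BELOW

CONFUSABLES = {
    "\u006B\u0332": INTENDED,  # k + COMBINING LOW LINE
    "\u006B\u0320": INTENDED,  # k + COMBINING MINUS SIGN BELOW
    "\u006B\u0331": INTENDED,  # k + COMBINING MACRON BELOW
}


def replace_confusables(text, mapping=CONFUSABLES):
    """Return `text` with each confusable sequence replaced by its intended character."""
    # Handle longer sequences first so a short key never splits a longer one.
    for sequence in sorted(mapping, key=len, reverse=True):
        text = text.replace(sequence, mapping[sequence])
    return text


cleaned = replace_confusables("\u006B\u0332a")  # k + low line, then "a"
```

Of the three sequences, only k + Combining Macron Below is canonically equivalent to \u1E35, so standard Unicode normalization (NFC) alone would not catch the other two. That is one reason an explicit mapping like this can be useful.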

...

What is the FirstVoices Unicode 'Adjacency Pair' Database?

...

Here is one example adjacency pair by codepoint, involving one lowercase letter and one combining diacritic:

Latin Small Letter K + Combining Comma Above

\u006B + \u0313

Because the base letter and its combining diacritic are rendered as a single unit, this pair is also an example of a grapheme.

Here is one example adjacency pair by grapheme, which involves a total of two non-combining characters:

k̓w

Latin Small Letter K, Combining Comma Above + Latin Small Letter W

\u006B\u0313 + \u0077
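The difference between the two kinds of pair can be sketched in a few lines of Python. The example uses Latin Small Letter K (U+006B), Combining Comma Above (U+0313) and Latin Small Letter W (U+0077). The function names are hypothetical, and the grouping rule (a base character plus any combining marks that follow it) is a simplifying assumption, not full Unicode grapheme segmentation and not necessarily how the database was built.

```python
import unicodedata


def simple_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplified assumption: any character in a Unicode "Mark" category
    (Mn, Mc, Me) attaches to the preceding character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def adjacency_pairs(units):
    """Return every pair of neighbouring units, in order."""
    return list(zip(units, units[1:]))


word = "\u006B\u0313\u0077"  # k + COMBINING COMMA ABOVE + w

# Pairs by codepoint: (k, comma above) and (comma above, w)
by_codepoint = adjacency_pairs(list(word))

# Pairs by grapheme: (k with comma above, w)
by_grapheme = adjacency_pairs(simple_graphemes(word))
```

By codepoint the word yields two pairs, the first of which is the one shown in the codepoint example. By grapheme it yields a single pair, because the diacritic is grouped with its base letter before pairing.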

...

Where is the database?

...

The FirstVoices Unicode Databases can be found here.

...

We see this work as useful and applicable for other projects, whether they have an Indigenous focus or not. Some examples of where this database may be of use:

  • Testing fonts – Is your font equipped to handle Indigenous languages in the correct way?

  • Building and testing keyboards – Does your keyboard employ the Unicode characters that are correct for the language you're working with?

  • Checking for Unicode discrepancies within Indigenous-language data sets – Are there 'confusables' within a document or data set that may cause issues later on?

  • Improving search functionality within Indigenous-language software – Are there instances of 'confusables' in your language data that may be reducing functionality?

  • Other – If you have an application for this work that we haven't considered, please let us know!

...

Terms of use

...

The FirstVoices Unicode 'Confusables' and 'Adjacency Pair' Databases are published under the Apache License, Version 2.0.

Our team would like to track the scope of this work and understand its broader applications. We may also share your work as inspiration for improving Indigenous language support in digital spaces. Please email hello@firstvoices.com to let us know where and how you are using this database in your project.

...