FirstVoices dictionary data schemas

Introduction

In FirstVoices, words and phrases are stored in data tables. This page describes the properties of those tables, such as the fields or columns, the data types (e.g., numbers or text), and rules such as which fields are required and the maximum lengths for text.

This is intended for use by:

  • people who want to make sure their data is compatible with FirstVoices

  • developers who want to transfer data between FirstVoices and another system

  • developers who are interested in ways of modelling dictionary data for BC Indigenous languages

Tables and Overall Structure

Core Tables

PartOfSpeech

A standard list of parts of speech for use in all language sites.

Site

Represents the language site.

  • id (uuid, required)

  • visibility (“team” or “member” or “public”, required)

  • other fields unrelated to dictionary data

Site Content Tables

Common Fields

All site content tables include the following fields:

  • site (site id, required)

  • created (timestamp, automatic)

  • last_modified (timestamp, automatic)

  • created_by (user id, required)

  • last_modified_by (user id, required)

Character

Represents an alphabet character. For a discussion of related issues and schemas, see Unicode Confusables and Adjacency Pairs Terms of Use.

  • title (text, max 10 chars, required)

  • sort_order (integer, required, must be unique within the site)

  • approximate_form (text, max 20 chars, optional)

  • note (text, optional)

  • related_audio (many-to-many relation with Audio rows that have the same site id)

  • related_images (many-to-many relation with Image rows that have the same site id)

  • related_videos (many-to-many relation with Video rows that have the same site id)
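The sort_order field is what lets each site define its own alphabetical order, and the custom_order field on DictionaryEntry is generated from the configured Characters. This page does not describe the generation algorithm, so the following Python sketch shows only one plausible approach: split the title into Characters by greedy longest match, then map each one to a sortable code point. The function name build_custom_order, the private-use-area offset, and the sample alphabet are all hypothetical.

```python
def build_custom_order(title, characters):
    """Map a title to a string that sorts according to the site's alphabet.

    `characters` is a list of (title, sort_order) pairs mirroring the
    Character table. Illustrative assumption only, not the FirstVoices
    implementation.
    """
    order = {char: sort_order for char, sort_order in characters}
    # Try longer characters first so a multi-letter character like "kw" wins over "k".
    by_length = sorted(order, key=len, reverse=True)
    base = 0xE000  # start of the Unicode private use area (assumed free for sort keys)
    result = []
    i = 0
    while i < len(title):
        for char in by_length:
            if title.startswith(char, i):
                result.append(chr(base + order[char]))
                i += len(char)
                break
        else:
            # Symbols outside the alphabet (spaces, punctuation) are kept unchanged.
            result.append(title[i])
            i += 1
    return "".join(result)


# Hypothetical alphabet in which "kw" is a single character sorted before "a".
alphabet = [("kw", 1), ("a", 2), ("k", 3), ("w", 4)]
words = ["ka", "kwa", "aa"]
print(sorted(words, key=lambda w: build_custom_order(w, alphabet)))
# ['kwa', 'aa', 'ka']
```

Storing the generated key alongside the entry means ordinary string sorting in the database reproduces the site's alphabet without custom collation logic at query time.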

DictionaryEntry

Represents a word or phrase.

  • title (text, max 225 chars, required)

  • type (“word” or “phrase”, required)

  • visibility (“team” or “member” or “public”, required)

  • custom_order (text, max 225 chars, automatic, generated based on configured Characters for the same Site)

  • part_of_speech (foreign key relation with PartOfSpeech)

  • acknowledgements (array of text, each max 500 chars, optional)

  • notes (array of text, each max 500 chars, optional)

  • translations (array of text, each max 225 chars, optional)

  • pronunciations (array of text, each max 225 chars, optional)

  • alternate_spellings (array of text, each max 225 chars, optional)

  • categories (many-to-many relation with Category rows that have the same site id)

  • exclude_from_kids (boolean, default false)

  • exclude_from_games (boolean, default false)

  • related_dictionary_entries (many-to-many relation with DictionaryEntry rows that have the same site id)

  • related_characters (many-to-many relation with Character rows that have the same site id)

  • related_audio (many-to-many relation with Audio rows that have the same site id)

  • related_images (many-to-many relation with Image rows that have the same site id)

  • related_videos (many-to-many relation with Video rows that have the same site id)

  • related_video_links (array of URLs from YouTube or Vimeo)
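For readers who want to check their data against the DictionaryEntry rules before transferring it, the following Python sketch applies the required fields, allowed values, and length limits listed above. It is a minimal illustration, not FirstVoices code: the names DictionaryEntryDraft, validate_entry, and VIDEO_HOSTS are hypothetical, the list of accepted video hosts is an assumption, relation fields are omitted, and lengths are counted with Python's len (code points), which may differ from how FirstVoices counts characters.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

MAX_TITLE = 225
# Per-item length limits for the array-of-text fields.
ARRAY_LIMITS = {
    "acknowledgements": 500,
    "notes": 500,
    "translations": 225,
    "pronunciations": 225,
    "alternate_spellings": 225,
}
# Assumed host list; the schema only says "YouTube or Vimeo".
VIDEO_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "vimeo.com", "www.vimeo.com"}


@dataclass
class DictionaryEntryDraft:
    title: str
    type: str
    visibility: str
    acknowledgements: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    translations: list = field(default_factory=list)
    pronunciations: list = field(default_factory=list)
    alternate_spellings: list = field(default_factory=list)
    related_video_links: list = field(default_factory=list)
    exclude_from_kids: bool = False
    exclude_from_games: bool = False


def validate_entry(entry):
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    if not entry.title:
        problems.append("title is required")
    elif len(entry.title) > MAX_TITLE:
        problems.append(f"title exceeds {MAX_TITLE} chars")
    if entry.type not in ("word", "phrase"):
        problems.append("type must be 'word' or 'phrase'")
    if entry.visibility not in ("team", "member", "public"):
        problems.append("visibility must be 'team', 'member', or 'public'")
    for name, limit in ARRAY_LIMITS.items():
        for i, item in enumerate(getattr(entry, name)):
            if len(item) > limit:
                problems.append(f"{name}[{i}] exceeds {limit} chars")
    for url in entry.related_video_links:
        if urlparse(url).hostname not in VIDEO_HOSTS:
            problems.append(f"related_video_links: unsupported host in {url}")
    return problems


draft = DictionaryEntryDraft(
    title="example", type="word", visibility="public", notes=["x" * 501]
)
print(validate_entry(draft))
# ['notes[0] exceeds 500 chars']
```

Collecting all problems in a list, rather than raising on the first one, lets a language team see everything that needs fixing in a single pass over an import file.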

Category

Represents a topical grouping for words and phrases.

  • title (text, max 75 chars, required)

  • description (text, max 500 chars, optional)

  • parent (foreign key relation to Category in the same site, maximum 1 level of nesting enforced)
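The one-level nesting rule on parent can be sketched as a check run before a parent is assigned. The Python below is an assumption about how such a rule might be enforced, not the FirstVoices implementation; the function name can_set_parent and the sample category ids are hypothetical, and the same-site check is left out for brevity.

```python
def can_set_parent(category_id, parent_id, parents):
    """Check whether assigning parent_id keeps nesting to a single level.

    `parents` maps each category id to its parent id (None for top level).
    """
    if parent_id is None:
        return True
    if parent_id == category_id:
        return False  # a category cannot be its own parent
    if parents.get(parent_id) is not None:
        return False  # the proposed parent is itself a child
    if any(p == category_id for p in parents.values()):
        return False  # this category already has children of its own
    return True


parents = {"animals": None, "birds": "animals", "plants": None}
print(can_set_parent("plants", "animals", parents))  # True
print(can_set_parent("plants", "birds", parents))    # False: "birds" is already nested
print(can_set_parent("animals", "plants", parents))  # False: "animals" has a child
```

Both directions need checking: the proposed parent must be top level, and the category being moved must not already have children, otherwise those children would end up two levels deep.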

Audio

Represents an audio object with metadata.

  • original (one-to-one relation with File, required)

  • title (text, max 200 chars, required)

  • description (text, max 500 chars, optional)

  • acknowledgement (text, max 500 chars, optional)

  • exclude_from_kids (boolean, default false)

  • exclude_from_games (boolean, default false)

  • speakers (many-to-many relation with Person in the same Site)

Image

Represents an image object with metadata.

  • original (one-to-one relation with VisualFile, required)

  • generated thumbnail file fields: medium, small, and thumbnail

  • title (text, max 200 chars, required)

  • description (text, max 500 chars, optional)

  • acknowledgement (text, max 500 chars, optional)

  • exclude_from_kids (boolean, default false)

  • exclude_from_games (boolean, default false)

Video

Represents a video object with metadata.

  • original (one-to-one relation with VisualFile, required)

  • generated thumbnail file fields: medium, small, and thumbnail

  • title (text, max 200 chars, required)

  • description (text, max 500 chars, optional)

  • acknowledgement (text, max 500 chars, optional)

  • exclude_from_kids (boolean, default false)

  • exclude_from_games (boolean, default false)

File

Represents a file, for example an image or audio file.

  • content (file, required)

  • mimetype (text, optional, generated)

  • size (integer, optional, generated)

VisualFile

Represents a file with visual dimensions, for example an image or video file.

  • content (file, required)

  • mimetype (text, optional, generated)

  • size (integer, optional, generated)

  • height (integer, optional, generated)

  • width (integer, optional, generated)

Person

Represents a person, for example a speaker on an audio file.

Fields:

  • name (text, max 200 chars, required)

  • bio (text, max 1000 chars, optional)

Discussion

In the 20+ years that FirstVoices has been storing BC Indigenous dictionary data, we have tried many variations in data schemas, including more and less specific fields, and allowing WYSIWYG or HTML formatting in some fields. We aim to find a balance between precision (more fields that are more specific) and ease of use for language teams (fewer decisions about similar fields, and fewer fields that are unused by each team).

The current schema is based on observations from real world use and feedback from dozens of language documentation teams over the years.

  • When there are too many specific fields, such as different types of acknowledgements (source, recorder, speaker, citation, etc.), there are more opportunities to make mistakes or have different interpretations. It is more confusing for the people inputting the data, and it is harder for developers to format the data in a way that will work for all teams.

  • We have found that a list of general acknowledgements is simpler and more useful than separating types of references, credits, and thank you notes. The exception is data about speakers for audio files, where it is important to be able to group audio by the person or people speaking, separately from other types of acknowledgements.

  • In most fields, WYSIWYG or HTML formatting is rarely used and not essential. We reserve this feature for longer formatted text such as stories, rather than for smaller text items like notes on dictionary entries.

  • Including an alternate spelling field helps avoid duplicate input work by teams.

  • The distinction between words and phrases is different depending on the language context. We use the same data schema for both, and distinguish them based only on the type field.

  • Different languages have different needs related to tracking parts of speech. For example, some language teams enter prefixes as their own dictionary entries. We use a standard list of parts of speech for all teams to avoid confusion, but we are looking for ways to give each team a customizable view of their most used options.

  • For people’s names, whenever practical we use a single “full name” field to allow for whatever naming convention is best for that person.

  • While we currently do not specify the type of relationship between dictionary entries, some teams would be interested in this, for example to mark another dictionary entry as the “root” or “plural”.

  • For related dictionary entries, we use directional relations (with a “from” and “to”) to account for situations where a word or phrase has many incoming links. In those situations, the team may wish to specify a few outgoing links that are highlighted differently, for example a few “example phrases” for a common word.
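The last point, on directional relations, can be illustrated with a small Python sketch. It assumes each relation is stored as a (from, to) pair; the sample entries are hypothetical and titles stand in for row ids. A common word can have many incoming links while the team curates only a few outgoing ones.

```python
from collections import defaultdict

# Each link is a (from_entry, to_entry) pair; entry titles stand in for ids.
links = [
    ("water", "drink water"),
    ("water", "cold water"),
    ("drink water", "water"),
    ("cold water", "water"),
    ("river", "water"),
]

outgoing = defaultdict(list)
incoming = defaultdict(list)
for source, target in links:
    outgoing[source].append(target)
    incoming[target].append(source)

# The few links the team chose to highlight on the "water" entry.
print(outgoing["water"])  # ['drink water', 'cold water']
# Every entry that points at "water", which can grow much larger.
print(incoming["water"])  # ['drink water', 'cold water', 'river']
```

With undirected relations the two lists would be identical, so there would be no way to show a short curated list on a heavily linked entry.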