Quickly building (Keyman) Predictive Text keyboards

The Rowbory/Nigeria Family Blog

The excellent Keyboard App Builder helps you build keyboards for Android devices, wrapping the very clever Keyman Android Engine. I’ve been working on a keyboard for the Nigerian languages I know, and one feature I’ve been experimenting with is predictive text, where the keyboard suggests the word it thinks you are typing so you can click on it to autocomplete. I thought I’d write up some of my process in case it helps me and others in the future.

Note that you have to create one project for the keyboard and a separate project for the text model (predictive text).

Keyman Developer needs a tab-separated values (TSV) file with words and their frequencies. How do I get that data? Paratext SFM files are one great source, as are any other plain text files. Here’s my bash script (written on macOS, but it should also work on Linux):

sed 's/\\[a-z0-9+*-]*//g' \
| sed -E 's/[0-9]+ / /g' \
| tr '[:space:]' '\n' \
| tr '[:punct:][:digit:]' '\n' \
| sort | uniq -c | sort -rn \
| sed -E "s/^ *([0-9]+) (.*)$/\2$(printf '\t')\1/" | tail -n +2
  • The first line removes all standard format markers (\v, \p, etc.).
  • The next removes numbers that are followed by a space.
  • Then all whitespace is turned into newlines, leaving one word per line.
  • Punctuation and digits are turned into newlines too.
  • The lines are sorted alphabetically, each unique word is counted, and the result is sorted again by descending frequency.
  • The final sed swaps the count and the word and puts a tab (\t) between them in place of the space.
  • Then we output every line apart from the first (which is the count for the empty lines, so its word is empty).

I actually put the whole pipeline in a freq.sh shell script, which I call like this (use ./freq.sh or bash freq.sh if the script is not on your PATH):

cat *.SFM | freq.sh > wordlist.tsv

This pipes the whole contents of every SFM file into the script and dumps the result into a file for Keyman Developer to use.
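
If you would rather not depend on sed and tr, the same counting steps can be sketched in TypeScript. This is only a rough equivalent, not something Keyman provides: the function name wordFrequencies and the sample text are made up for illustration, and it only treats ASCII punctuation as word breaks, so curly quotes and similar characters would need adding to the pattern.

```typescript
// Hypothetical helper mirroring the shell pipeline; not part of Keyman.
function wordFrequencies(text: string): string {
  const words = text
    .replace(/\\[a-z0-9+*-]*/g, "")   // strip SFM markers such as \v, \p, \q1
    .replace(/[0-9]+ /g, " ")         // strip numbers that are followed by a space
    .split(/[\s!-\/:-@\[-`{-~0-9]+/)  // break on whitespace, ASCII punctuation, digits
    .filter(function (w) { return w.length > 0; }); // drop empty entries

  // Count each distinct word (null prototype avoids clashes with built-in keys).
  const counts: { [word: string]: number } = Object.create(null);
  for (let i = 0; i < words.length; i++) {
    counts[words[i]] = (counts[words[i]] || 0) + 1;
  }

  // Most frequent first; ties are ordered by code unit so the output is stable.
  return Object.keys(counts)
    .sort(function (a, b) {
      return counts[b] - counts[a] || (a < b ? -1 : a > b ? 1 : 0);
    })
    .map(function (w) { return w + "\t" + counts[w]; })
    .join("\n");
}

// Made-up sample in SFM style.
const sampleText = "\\p\n\\v 1 da ni da kai\n\\v 2 ni da shi.";
console.log(wordFrequencies(sampleText));
// da   3
// ni   2
// kai  1
// shi  1
```

Unlike the shell version, this filters out empty entries before counting, so there is no empty first line to drop at the end.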

Finally, if you want to ignore distinctions for hooked letters and diacritics when predicting text, you can add a searchTermToKey function directly to the model’s source.

const source: LexicalModelSource = {
  format: 'trie-1.0',
  wordBreaker: 'default',
  sources: ['wordlist.tsv'],
  searchTermToKey: function (wordform: string): string {
    // Use this pattern to remove common diacritical marks (accents).
    // See: https://www.compart.com/en/unicode/block/U+0300
    const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

    // Lowercase, then decompose so accents become separate combining marks.
    let key = wordform.toLowerCase();
    key = key.normalize('NFKD');

    // Fold hooked letters to their plain counterparts. The g flag matters:
    // a string pattern would replace only the first occurrence.
    key = key.replace(/ƙ/g, 'k');
    key = key.replace(/ɓ/g, 'b');
    key = key.replace(/ɗ/g, 'd');
    key = key.replace(/ƴ/g, 'y');

    // Fold vowels with a macron below (the final step would also catch these).
    key = key.replace(/a̱/g, 'a');
    key = key.replace(/i̱/g, 'i');
    key = key.replace(/o̱/g, 'o');
    key = key.replace(/e̱/g, 'e');

    // Drop the saltillo (U+A78C) and the apostrophe.
    key = key.replace(/\ua78c/g, '');
    key = key.replace(/'/g, '');

    // Strip any remaining combining marks.
    key = key.replace(COMBINING_DIACRITICAL_MARKS, '');

    return key;
  },
  // other customizations go here:
};
export default source;

Source view of the Pangu text model.
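
To see what keys such a function produces, it can be tried outside Keyman Developer. The sketch below is a condensed standalone copy of the same folding steps, assuming a JavaScript runtime that supports String.prototype.normalize. The sample strings are chosen only for illustration and do not come from the Pangu word list.

```typescript
// Standalone sketch of the folding steps, for experimenting outside Keyman.
const COMBINING_MARKS = /[\u0300-\u036f]/g;

function searchTermToKey(wordform: string): string {
  return wordform
    .toLowerCase()
    .normalize("NFKD")            // split accented letters into base + combining mark
    .replace(/ƙ/g, "k")           // hooked letters to plain letters, every occurrence
    .replace(/ɓ/g, "b")
    .replace(/ɗ/g, "d")
    .replace(/ƴ/g, "y")
    .replace(/[\ua78c']/g, "")    // drop the saltillo and the apostrophe
    .replace(COMBINING_MARKS, ""); // drop accents, including the macron below (U+0331)
}

console.log(searchTermToKey("Ƙasa"));      // → kasa
console.log(searchTermToKey("ƙanƙara"));   // → kankara
console.log(searchTermToKey("a\u0331ni")); // → ani
console.log(searchTermToKey("ba'a"));      // → baa
```

The patterns are regular expressions with the g flag because replace with a plain string pattern only changes the first match, which would leave the second ƙ in a word like ƙanƙara untouched.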

Oh, and did I mention that you can add all your lexical models to the one project in Keyman Developer so you can build all at once? Also you can build everything into the one output directory so that it’s dead easy to find the compiled files ready for release.
