Class Vocabulary

java.lang.Object
com.itextpdf.pdfocr.onnxtr.recognition.Vocabulary

public class Vocabulary extends Object
A string-based look-up table (LUT) for mapping text recognition model results to characters.

This class assumes that each character is represented by a single UTF-16 code unit, so the string itself can serve as the LUT. If that is not the case (for example, if the string contains surrogate pairs), results will be unpredictable.
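The single-code-unit assumption can be checked before a custom look-up string is passed to the constructor. The following is a minimal sketch using only standard `java.lang` calls; the class `CodeUnitCheck` and the helper `usesSingleCodeUnitsOnly` are hypothetical names introduced for illustration and are not part of the library.

```java
// Hypothetical helper, not part of the library: checks whether a candidate
// look-up string satisfies the "one character = one UTF-16 code unit" assumption.
final class CodeUnitCheck {
    private CodeUnitCheck() {
    }

    // Returns true when the string contains no surrogate code units, i.e. every
    // character is in the Basic Multilingual Plane and occupies one char.
    static boolean usesSingleCodeUnitsOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 'e' with acute accent and the euro sign each fit in one code unit.
        System.out.println(usesSingleCodeUnitsOnly("abc\u00E9\u20AC")); // true
        // U+1F600 is encoded as a surrogate pair (two code units).
        System.out.println(usesSingleCodeUnitsOnly("a\uD83D\uDE00"));   // false
    }
}
```

A string that fails this check would have one visible character spread over two LUT slots, which is the kind of input the class description warns about.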

It effectively implements IOutputLabelMapper for Character, but because that would involve unnecessary boxing, it is a standalone class instead.

  • Field Details

    • ASCII_LOWERCASE

      public static final Vocabulary ASCII_LOWERCASE
    • ASCII_UPPERCASE

      public static final Vocabulary ASCII_UPPERCASE
    • ASCII_LETTERS

      public static final Vocabulary ASCII_LETTERS
    • DIGITS

      public static final Vocabulary DIGITS
    • PUNCTUATION

      public static final Vocabulary PUNCTUATION
    • CURRENCY

      public static final Vocabulary CURRENCY
    • LATIN

      public static final Vocabulary LATIN
    • ENGLISH

      public static final Vocabulary ENGLISH
    • LEGACY_FRENCH

      public static final Vocabulary LEGACY_FRENCH
    • FRENCH

      public static final Vocabulary FRENCH
    • HINDI_DIGITS

      public static final Vocabulary HINDI_DIGITS
    • GENERIC_CYRILLIC_LETTERS

      public static final Vocabulary GENERIC_CYRILLIC_LETTERS
    • RUSSIAN_CYRILLIC_LETTERS

      public static final Vocabulary RUSSIAN_CYRILLIC_LETTERS
    • RUSSIAN_SIGNS

      public static final Vocabulary RUSSIAN_SIGNS
    • ANCIENT_GREEK

      public static final Vocabulary ANCIENT_GREEK
    • ARABIC_DIACRITICS

      public static final Vocabulary ARABIC_DIACRITICS
    • ARABIC_DIGITS

      public static final Vocabulary ARABIC_DIGITS
    • ARABIC_LETTERS

      public static final Vocabulary ARABIC_LETTERS
    • ARABIC_PUNCTUATION

      public static final Vocabulary ARABIC_PUNCTUATION
    • PERSIAN_LETTERS

      public static final Vocabulary PERSIAN_LETTERS
    • BENGALI_CONSONANTS

      public static final Vocabulary BENGALI_CONSONANTS
    • BENGALI_VOWELS

      public static final Vocabulary BENGALI_VOWELS
    • BENGALI_DIGITS

      public static final Vocabulary BENGALI_DIGITS
    • BENGALI_MATRAS

      public static final Vocabulary BENGALI_MATRAS
    • BENGALI_VIRAMA

      public static final Vocabulary BENGALI_VIRAMA
    • BENGALI_PUNCTUATION

      public static final Vocabulary BENGALI_PUNCTUATION
    • BENGALI_SIGNS

      public static final Vocabulary BENGALI_SIGNS
    • GUJARATI_CONSONANTS

      public static final Vocabulary GUJARATI_CONSONANTS
    • GUJARATI_VOWELS

      public static final Vocabulary GUJARATI_VOWELS
    • GUJARATI_DIGITS

      public static final Vocabulary GUJARATI_DIGITS
    • GUJARATI_MATRAS

      public static final Vocabulary GUJARATI_MATRAS
    • GUJARATI_VIRAMA

      public static final Vocabulary GUJARATI_VIRAMA
    • GUJARATI_PUNCTUATION

      public static final Vocabulary GUJARATI_PUNCTUATION
    • GUJARATI_SIGNS

      public static final Vocabulary GUJARATI_SIGNS
    • DEVANAGARI_CONSONANTS

      public static final Vocabulary DEVANAGARI_CONSONANTS
    • DEVANAGARI_VOWELS

      public static final Vocabulary DEVANAGARI_VOWELS
    • DEVANAGARI_DIGITS

      public static final Vocabulary DEVANAGARI_DIGITS
    • DEVANAGARI_MATRAS

      public static final Vocabulary DEVANAGARI_MATRAS
    • DEVANAGARI_VIRAMA

      public static final Vocabulary DEVANAGARI_VIRAMA
    • DEVANAGARI_PUNCTUATION

      public static final Vocabulary DEVANAGARI_PUNCTUATION
    • DEVANAGARI_SIGNS

      public static final Vocabulary DEVANAGARI_SIGNS
    • PUNJABI_CONSONANTS

      public static final Vocabulary PUNJABI_CONSONANTS
    • PUNJABI_VOWELS

      public static final Vocabulary PUNJABI_VOWELS
    • PUNJABI_DIGITS

      public static final Vocabulary PUNJABI_DIGITS
    • PUNJABI_MATRAS

      public static final Vocabulary PUNJABI_MATRAS
    • PUNJABI_VIRAMA

      public static final Vocabulary PUNJABI_VIRAMA
    • PUNJABI_PUNCTUATION

      public static final Vocabulary PUNJABI_PUNCTUATION
    • PUNJABI_SIGNS

      public static final Vocabulary PUNJABI_SIGNS
    • TAMIL_CONSONANTS

      public static final Vocabulary TAMIL_CONSONANTS
    • TAMIL_VOWELS

      public static final Vocabulary TAMIL_VOWELS
    • TAMIL_DIGITS

      public static final Vocabulary TAMIL_DIGITS
    • TAMIL_MATRAS

      public static final Vocabulary TAMIL_MATRAS
    • TAMIL_VIRAMA

      public static final Vocabulary TAMIL_VIRAMA
    • TAMIL_PUNCTUATION

      public static final Vocabulary TAMIL_PUNCTUATION
    • TAMIL_SIGNS

      public static final Vocabulary TAMIL_SIGNS
    • TAMIL_FRACTIONS

      public static final Vocabulary TAMIL_FRACTIONS
    • TELUGU_CONSONANTS

      public static final Vocabulary TELUGU_CONSONANTS
    • TELUGU_DIGITS

      public static final Vocabulary TELUGU_DIGITS
    • TELUGU_VOWELS

      public static final Vocabulary TELUGU_VOWELS
    • TELUGU_MATRAS

      public static final Vocabulary TELUGU_MATRAS
    • TELUGU_VIRAMA

      public static final Vocabulary TELUGU_VIRAMA
    • TELUGU_PUNCTUATION

      public static final Vocabulary TELUGU_PUNCTUATION
    • TELUGU_SIGNS

      public static final Vocabulary TELUGU_SIGNS
    • KANNADA_CONSONANTS

      public static final Vocabulary KANNADA_CONSONANTS
    • KANNADA_VOWELS

      public static final Vocabulary KANNADA_VOWELS
    • KANNADA_DIGITS

      public static final Vocabulary KANNADA_DIGITS
    • KANNADA_MATRAS

      public static final Vocabulary KANNADA_MATRAS
    • KANNADA_VIRAMA

      public static final Vocabulary KANNADA_VIRAMA
    • KANNADA_PUNCTUATION

      public static final Vocabulary KANNADA_PUNCTUATION
    • KANNADA_SIGNS

      public static final Vocabulary KANNADA_SIGNS
    • SINHALA_CONSONANTS

      public static final Vocabulary SINHALA_CONSONANTS
    • SINHALA_VOWELS

      public static final Vocabulary SINHALA_VOWELS
    • SINHALA_DIGITS

      public static final Vocabulary SINHALA_DIGITS
    • SINHALA_MATRAS

      public static final Vocabulary SINHALA_MATRAS
    • SINHALA_VIRAMA

      public static final Vocabulary SINHALA_VIRAMA
    • SINHALA_PUNCTUATION

      public static final Vocabulary SINHALA_PUNCTUATION
    • SINHALA_SIGNS

      public static final Vocabulary SINHALA_SIGNS
    • MALAYALAM_CONSONANTS

      public static final Vocabulary MALAYALAM_CONSONANTS
    • MALAYALAM_VOWELS

      public static final Vocabulary MALAYALAM_VOWELS
    • MALAYALAM_DIGITS

      public static final Vocabulary MALAYALAM_DIGITS
    • MALAYALAM_MATRAS

      public static final Vocabulary MALAYALAM_MATRAS
    • MALAYALAM_VIRAMA

      public static final Vocabulary MALAYALAM_VIRAMA
    • MALAYALAM_SIGNS

      public static final Vocabulary MALAYALAM_SIGNS
    • ODIA_CONSONANTS

      public static final Vocabulary ODIA_CONSONANTS
    • ODIA_VOWELS

      public static final Vocabulary ODIA_VOWELS
    • ODIA_DIGITS

      public static final Vocabulary ODIA_DIGITS
    • ODIA_MATRAS

      public static final Vocabulary ODIA_MATRAS
    • ODIA_VIRAMA

      public static final Vocabulary ODIA_VIRAMA
    • ODIA_PUNCTUATION

      public static final Vocabulary ODIA_PUNCTUATION
    • ODIA_SIGNS

      public static final Vocabulary ODIA_SIGNS
    • KHMER_CONSONANTS

      public static final Vocabulary KHMER_CONSONANTS
    • KHMER_VOWELS

      public static final Vocabulary KHMER_VOWELS
    • KHMER_DIGITS

      public static final Vocabulary KHMER_DIGITS
    • KHMER_MATRAS

      public static final Vocabulary KHMER_MATRAS
    • KHMER_DIACRITICS

      public static final Vocabulary KHMER_DIACRITICS
    • KHMER_VIRAMA

      public static final Vocabulary KHMER_VIRAMA
    • KHMER_PUNCTUATION

      public static final Vocabulary KHMER_PUNCTUATION
    • BURMESE_CONSONANTS

      public static final Vocabulary BURMESE_CONSONANTS
    • BURMESE_VOWELS

      public static final Vocabulary BURMESE_VOWELS
    • BURMESE_DIGITS

      public static final Vocabulary BURMESE_DIGITS
    • BURMESE_DIACRITICS

      public static final Vocabulary BURMESE_DIACRITICS
    • BURMESE_VIRAMA

      public static final Vocabulary BURMESE_VIRAMA
    • BURMESE_PUNCTUATION

      public static final Vocabulary BURMESE_PUNCTUATION
    • JAVANESE_CONSONANTS

      public static final Vocabulary JAVANESE_CONSONANTS
    • JAVANESE_VOWELS

      public static final Vocabulary JAVANESE_VOWELS
    • JAVANESE_DIGITS

      public static final Vocabulary JAVANESE_DIGITS
    • JAVANESE_DIACRITICS

      public static final Vocabulary JAVANESE_DIACRITICS
    • JAVANESE_VIRAMA

      public static final Vocabulary JAVANESE_VIRAMA
    • JAVANESE_PUNCTUATION

      public static final Vocabulary JAVANESE_PUNCTUATION
    • SUDANESE_CONSONANTS

      public static final Vocabulary SUDANESE_CONSONANTS
    • SUDANESE_VOWELS

      public static final Vocabulary SUDANESE_VOWELS
    • SUDANESE_DIGITS

      public static final Vocabulary SUDANESE_DIGITS
    • SUDANESE_DIACRITICS

      public static final Vocabulary SUDANESE_DIACRITICS
    • HEBREW_CANTILLATIONS

      public static final Vocabulary HEBREW_CANTILLATIONS
    • HEBREW_CONSONANTS

      public static final Vocabulary HEBREW_CONSONANTS
    • HEBREW_SPECIALS

      public static final Vocabulary HEBREW_SPECIALS
    • HEBREW_PUNCTUATION

      public static final Vocabulary HEBREW_PUNCTUATION
    • HEBREW_VOWELS

      public static final Vocabulary HEBREW_VOWELS
    • ALBANIAN

      public static final Vocabulary ALBANIAN
    • AFRIKAANS

      public static final Vocabulary AFRIKAANS
    • BASQUE

      public static final Vocabulary BASQUE
    • CATALAN

      public static final Vocabulary CATALAN
    • CROATIAN

      public static final Vocabulary CROATIAN
    • CZECH

      public static final Vocabulary CZECH
    • DANISH

      public static final Vocabulary DANISH
    • DUTCH

      public static final Vocabulary DUTCH
    • ESTONIAN

      public static final Vocabulary ESTONIAN
    • FINNISH

      public static final Vocabulary FINNISH
    • GERMAN

      public static final Vocabulary GERMAN
    • HUNGARIAN

      public static final Vocabulary HUNGARIAN
    • INDONESIAN

      public static final Vocabulary INDONESIAN
    • IRISH

      public static final Vocabulary IRISH
    • ITALIAN

      public static final Vocabulary ITALIAN
    • LUXEMBOURGISH

      public static final Vocabulary LUXEMBOURGISH
    • MALAY

      public static final Vocabulary MALAY
    • NORWEGIAN

      public static final Vocabulary NORWEGIAN
    • POLISH

      public static final Vocabulary POLISH
    • PORTUGUESE

      public static final Vocabulary PORTUGUESE
    • ROMANIAN

      public static final Vocabulary ROMANIAN
    • SERBIAN_LATIN

      public static final Vocabulary SERBIAN_LATIN
    • SLOVAK

      public static final Vocabulary SLOVAK
    • SPANISH

      public static final Vocabulary SPANISH
    • SWEDISH

      public static final Vocabulary SWEDISH
    • VIETNAMESE

      public static final Vocabulary VIETNAMESE
    • ZULU

      public static final Vocabulary ZULU
    • AZERBAIJANI

      public static final Vocabulary AZERBAIJANI
    • BOSNIAN

      public static final Vocabulary BOSNIAN
    • ESPERANTO

      public static final Vocabulary ESPERANTO
    • FRISIAN

      public static final Vocabulary FRISIAN
    • GALICIAN

      public static final Vocabulary GALICIAN
    • HAUSA

      public static final Vocabulary HAUSA
    • ICELANDIC

      public static final Vocabulary ICELANDIC
    • LATVIAN

      public static final Vocabulary LATVIAN
    • LITHUANIAN

      public static final Vocabulary LITHUANIAN
    • MALAGASY

      public static final Vocabulary MALAGASY
    • MALTESE

      public static final Vocabulary MALTESE
    • MAORI

      public static final Vocabulary MAORI
    • MONTENEGRIN

      public static final Vocabulary MONTENEGRIN
    • QUECHUA

      public static final Vocabulary QUECHUA
    • SCOTTISH_GAELIC

      public static final Vocabulary SCOTTISH_GAELIC
    • SLOVENE

      public static final Vocabulary SLOVENE
    • SOMALI

      public static final Vocabulary SOMALI
    • SWAHILI

      public static final Vocabulary SWAHILI
    • TAGALOG

      public static final Vocabulary TAGALOG
    • TURKISH

      public static final Vocabulary TURKISH
    • UZBEK_LATIN

      public static final Vocabulary UZBEK_LATIN
    • WELSH

      public static final Vocabulary WELSH
    • YORUBA

      public static final Vocabulary YORUBA
    • RUSSIAN

      public static final Vocabulary RUSSIAN
    • BELARUSIAN

      public static final Vocabulary BELARUSIAN
    • UKRAINIAN

      public static final Vocabulary UKRAINIAN
    • TATAR

      public static final Vocabulary TATAR
    • TAJIK

      public static final Vocabulary TAJIK
    • KAZAKH

      public static final Vocabulary KAZAKH
    • KYRGYZ

      public static final Vocabulary KYRGYZ
    • BULGARIAN

      public static final Vocabulary BULGARIAN
    • MACEDONIAN

      public static final Vocabulary MACEDONIAN
    • MONGOLIAN

      public static final Vocabulary MONGOLIAN
    • YAKUT

      public static final Vocabulary YAKUT
    • SERBIAN_CYRILLIC

      public static final Vocabulary SERBIAN_CYRILLIC
    • UZBEK_CYRILLIC

      public static final Vocabulary UZBEK_CYRILLIC
    • GREEK

      public static final Vocabulary GREEK
    • GREEK_EXTENDED

      public static final Vocabulary GREEK_EXTENDED
    • HEBREW

      public static final Vocabulary HEBREW
    • ARABIC

      public static final Vocabulary ARABIC
    • PERSIAN

      public static final Vocabulary PERSIAN
    • URDU

      public static final Vocabulary URDU
    • PASHTO

      public static final Vocabulary PASHTO
    • KURDISH

      public static final Vocabulary KURDISH
    • UYGHUR

      public static final Vocabulary UYGHUR
    • SINDHI

      public static final Vocabulary SINDHI
    • DEVANAGARI

      public static final Vocabulary DEVANAGARI
    • HINDI

      public static final Vocabulary HINDI
    • SANSKRIT

      public static final Vocabulary SANSKRIT
    • MARATHI

      public static final Vocabulary MARATHI
    • NEPALI

      public static final Vocabulary NEPALI
    • GUJARATI

      public static final Vocabulary GUJARATI
    • BENGALI

      public static final Vocabulary BENGALI
    • TAMIL

      public static final Vocabulary TAMIL
    • TELUGU

      public static final Vocabulary TELUGU
    • KANNADA

      public static final Vocabulary KANNADA
    • SINHALA

      public static final Vocabulary SINHALA
    • MALAYALAM

      public static final Vocabulary MALAYALAM
    • PUNJABI

      public static final Vocabulary PUNJABI
    • ODIA

      public static final Vocabulary ODIA
    • KHMER

      public static final Vocabulary KHMER
    • ARMENIAN

      public static final Vocabulary ARMENIAN
    • SUDANESE

      public static final Vocabulary SUDANESE
    • THAI

      public static final Vocabulary THAI
    • LAO

      public static final Vocabulary LAO
    • BURMESE

      public static final Vocabulary BURMESE
    • JAVANESE

      public static final Vocabulary JAVANESE
    • GEORGIAN

      public static final Vocabulary GEORGIAN
    • ETHIOPIC

      public static final Vocabulary ETHIOPIC
    • JAPANESE

      public static final Vocabulary JAPANESE
    • KOREAN

      public static final Vocabulary KOREAN
    • SIMPLIFIED_CHINESE

      public static final Vocabulary SIMPLIFIED_CHINESE
    • LATIN_EXTENDED

      public static final Vocabulary LATIN_EXTENDED
    • MULTI_LANG

      public static final Vocabulary MULTI_LANG
    • MULTI_LANG_FULL

      public static final Vocabulary MULTI_LANG_FULL
  • Constructor Details

    • Vocabulary

      public Vocabulary(String lookUpString)
      Creates a new vocabulary based on a look-up string.
      Parameters:
      lookUpString - look-up string to be used as LUT for the vocabulary
  • Method Details

    • concat

      public static Vocabulary concat(Vocabulary... vocabularies)
      Creates a new vocabulary by concatenating multiple ones.
      Parameters:
      vocabularies - vocabularies to concatenate
      Returns:
      the new aggregated vocabulary
    • getLookUpString

      public String getLookUpString()
      Returns the look-up string.
      Returns:
      the look-up string
    • size

      public int size()
      Returns the size of the vocabulary.
      Returns:
      the size of the vocabulary
    • map

      public char map(int index)
      Returns the character mapped to the specified index in the look-up string.
      Parameters:
      index - index to map
      Returns:
      mapped character
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • equals

      public boolean equals(Object o)
      Overrides:
      equals in class Object
    • toString

      public String toString()
      Overrides:
      toString in class Object
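
To illustrate how the constructor, `concat`, `size` and `map` described above might fit together, here is a self-contained sketch. `SketchVocabulary` is a hypothetical stand-in that only mirrors the documented signatures; it is not the iText class. Its behaviour rests on assumptions the page does not confirm: that `size()` equals the look-up string length, that `map(index)` behaves like `String.charAt`, and that `concat` joins look-up strings in argument order without removing duplicates. The `decode` helper is also an illustrative addition.

```java
import java.util.Objects;

// Hypothetical stand-in mirroring the documented signatures; NOT the library class.
final class SketchVocabulary {
    private final String lookUpString;

    SketchVocabulary(String lookUpString) {
        this.lookUpString = Objects.requireNonNull(lookUpString);
    }

    // Assumption: concatenation keeps argument order and does not deduplicate.
    static SketchVocabulary concat(SketchVocabulary... vocabularies) {
        StringBuilder sb = new StringBuilder();
        for (SketchVocabulary v : vocabularies) {
            sb.append(v.lookUpString);
        }
        return new SketchVocabulary(sb.toString());
    }

    String getLookUpString() {
        return lookUpString;
    }

    // Assumption: the size is the number of UTF-16 code units in the string.
    int size() {
        return lookUpString.length();
    }

    // Returns a primitive char, so no Character boxing takes place.
    // Assumption: an out-of-range index throws, as String.charAt does.
    char map(int index) {
        return lookUpString.charAt(index);
    }

    // Illustrative helper: turns a sequence of predicted indices into text.
    static String decode(SketchVocabulary vocabulary, int[] indices) {
        StringBuilder sb = new StringBuilder(indices.length);
        for (int index : indices) {
            sb.append(vocabulary.map(index));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SketchVocabulary digits = new SketchVocabulary("0123456789");
        SketchVocabulary letters = new SketchVocabulary("abc");
        SketchVocabulary both = concat(digits, letters);

        System.out.println(both.size());                          // 13
        System.out.println(both.map(10));                         // a
        System.out.println(decode(both, new int[] {1, 10, 2}));   // 1a2
    }
}
```

Returning `char` from `map` rather than `Character` is what the class description refers to when it mentions avoiding unnecessary boxing.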