public class HyphenationTree extends TernaryTree implements IPatternConsumer
This work was authored by Carlos Villegas (cav@uniscope.co.jp).
Modifier and Type | Field and Description |
---|---|
protected TernaryTree | classmap - This map stores the character classes |
protected Map<String,List> | stoplist - This map stores hyphenation exceptions |
protected ByteVector | vspace - value space: stores the interletter values |
Fields inherited from class TernaryTree: BLOCK_SIZE, eq, freenode, hi, kv, length, lo, root, sc
Constructor and Description |
---|
HyphenationTree() - Default constructor. |
Modifier and Type | Method and Description |
---|---|
void | addClass(String chargroup) - Add a character class to the tree. |
void | addException(String word, List hyphenatedword) - Add an exception to the tree. |
void | addPattern(String pattern, String ivalue) - Add a pattern to the tree. |
String | findPattern(String pat) - Find pattern. |
protected byte[] | getValues(int k) - Get values. |
protected int | hstrcmp(char[] s, int si, char[] t, int ti) - String compare, returns 0 if equal or t is a substring of s. |
Hyphenation | hyphenate(char[] w, int offset, int len, int remainCharCount, int pushCharCount) - Hyphenate word and return an array of hyphenation points. |
Hyphenation | hyphenate(String word, int remainCharCount, int pushCharCount) - Hyphenate word and return a Hyphenation object. |
void | loadPatterns(InputStream stream, String name) - Read hyphenation patterns from an XML file. |
void | loadPatterns(String filename) - Read hyphenation patterns from an XML file. |
protected int | packValues(String values) - Packs the values by storing them in 4 bits, two values into a byte. Values range from 0 to 9. |
protected void | searchPatterns(char[] word, int index, byte[] il) - Search for all possible partial matches of word starting at index and update interletter values. |
protected String | unpackValues(int k) - Unpack values. |
protected ByteVector vspace
protected TernaryTree classmap
protected int packValues(String values)
values
- a string of digits from '0' to '9' representing the interletter values.
protected String unpackValues(int k)
k
- an integer
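The 4-bit packing described for packValues can be illustrated with a small standalone sketch. It is not the library's implementation: the class name NibblePacking, the plain byte[] result (instead of an offset into vspace), and the choice to store each digit plus one so that a zero nibble can act as a terminator are all assumptions made for the example.

```java
class NibblePacking {
    // Pack a string of digits '0'..'9' two per byte, high nibble first.
    // Assumption: each digit is stored as digit + 1, so a zero nibble marks the end.
    static byte[] pack(String values) {
        int n = values.length();
        // One byte per two digits, plus one trailing zero byte as terminator.
        byte[] out = new byte[(n + 1) / 2 + 1];
        for (int i = 0; i < n; i++) {
            int v = (values.charAt(i) - '0' + 1) & 0x0f;
            if ((i & 1) == 0) {
                out[i >> 1] = (byte) (v << 4);   // even position: high nibble
            } else {
                out[i >> 1] |= (byte) v;         // odd position: low nibble
            }
        }
        return out;
    }

    // Reverse of pack: read nibbles until a zero nibble is found.
    static String unpack(byte[] packed) {
        StringBuilder sb = new StringBuilder();
        for (byte b : packed) {
            int hi = (b >>> 4) & 0x0f;
            if (hi == 0) {
                break;
            }
            sb.append((char) ('0' + hi - 1));
            int lo = b & 0x0f;
            if (lo == 0) {
                break;
            }
            sb.append((char) ('0' + lo - 1));
        }
        return sb.toString();
    }
}
```

With this scheme, a four-digit value string occupies three bytes (two data bytes and the terminator), and unpack(pack(s)) returns s for any digit string s.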
public void loadPatterns(String filename) throws HyphenationException, FileNotFoundException
filename
- the filename
HyphenationException
- In case the parsing fails
FileNotFoundException
- When the specified file is not found
public void loadPatterns(InputStream stream, String name) throws HyphenationException
stream
- the InputStream for the file
name
- unique key representing country-language combination
HyphenationException
- In case the parsing fails
public String findPattern(String pat)
pat
- a pattern
protected int hstrcmp(char[] s, int si, char[] t, int ti)
s
- first character array
si
- starting index into first array
t
- second character array
ti
- starting index into second array
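The comparison rule stated in the summary (0 if the strings are equal or t is a substring of s) can be sketched as follows. The sketch assumes both arrays are null terminated, as the searchPatterns documentation says of its word argument, and reads "substring" as "t matches the start of s at the given offsets"; the class name PrefixCompare is hypothetical.

```java
class PrefixCompare {
    // Compare two null-terminated char sequences starting at si and ti.
    static int hstrcmp(char[] s, int si, char[] t, int ti) {
        for (; s[si] == t[ti]; si++, ti++) {
            if (s[si] == 0) {
                return 0; // both sequences ended together: equal
            }
        }
        if (t[ti] == 0) {
            return 0; // t ended first: t matches the start of s
        }
        return s[si] - t[ti]; // sign gives the ordering at the first mismatch
    }
}
```

For example, comparing "hyphen" against "hy" yields 0, while comparing "hy" against "hyp" yields a negative value because s ends before t.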
protected byte[] getValues(int k)
k
- an integer
protected void searchPatterns(char[] word, int index, byte[] il)
Search for all possible partial matches of word starting at index and update the interletter values. Conceptually, this amounts to looping over every stored pattern, checking whether the word at index starts with that pattern, and updating the interletter values when it does.
The search is efficient, however, because the patterns are stored in a ternary tree. This is the whole purpose of the tree: performing the search without testing every single pattern. The number of patterns for a language such as English ranges from 4000 to 10000, so doing thousands of string comparisons for each word to hyphenate would be very slow without the tree. The tradeoff is memory, but using a ternary tree instead of a trie almost halves the memory used by Lout or TeX. It is also faster than using a hash table.
Parameters:
word
- null terminated word to match
index
- start index from word
il
- interletter values array to update
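For comparison, the brute-force search that the ternary tree avoids might look like the sketch below. Everything in it is illustrative: the class NaivePatternSearch, the Map from pattern letters to digit strings, and the rule of keeping the larger value at each position (the usual convention in Liang-style hyphenation) are assumptions, not the library's code. The sketch also treats the whole array as the word rather than relying on a null terminator.

```java
import java.util.Map;

class NaivePatternSearch {
    // Test every pattern against the word at the given index.
    // patterns maps pattern letters to a digit string holding one
    // interletter value per gap (pattern length + 1 digits).
    static void searchPatterns(char[] word, int index, byte[] il,
                               Map<String, String> patterns) {
        String rest = new String(word, index, word.length - index);
        for (Map.Entry<String, String> entry : patterns.entrySet()) {
            if (!rest.startsWith(entry.getKey())) {
                continue; // this pattern does not match here
            }
            String values = entry.getValue();
            for (int k = 0; k < values.length() && index + k < il.length; k++) {
                byte v = (byte) (values.charAt(k) - '0');
                if (v > il[index + k]) {
                    il[index + k] = v; // assumption: the larger value wins
                }
            }
        }
    }
}
```

The cost of this loop grows with the number of patterns for every start index of every word, which is the overhead the tree lookup is meant to remove.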
hyphenate
public Hyphenation hyphenate(String word,
int remainCharCount,
int pushCharCount)
Hyphenate word and return a Hyphenation object.
Parameters:
word
- the word to be hyphenated
remainCharCount
- Minimum number of characters allowed before the hyphenation point.
pushCharCount
- Minimum number of characters allowed after the hyphenation point.
Returns:
a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
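The effect of remainCharCount and pushCharCount can be shown with a standalone sketch that filters candidate break positions. It assumes the common Liang-style convention that an odd interletter value marks an allowed break, which this page does not state, and the library's exact boundary handling may differ. The class BreakPointFilter and the layout of il (il[i] is the value for the gap after i characters) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BreakPointFilter {
    // Return the positions i (number of characters before the hyphen) where
    // a break is allowed: the interletter value is odd (assumed convention),
    // at least remainCharCount characters precede the break, and at least
    // pushCharCount characters follow it.
    static List<Integer> breakPoints(byte[] il, int len,
                                     int remainCharCount, int pushCharCount) {
        List<Integer> points = new ArrayList<>();
        for (int i = 1; i < len; i++) {
            boolean odd = (il[i] & 1) == 1;
            if (odd && i >= remainCharCount && len - i >= pushCharCount) {
                points.add(i);
            }
        }
        return points;
    }
}
```

For a six-letter word with odd values at gaps 2 and 5, requiring two characters before and three after keeps only the break at 2; requiring three before and three after leaves no break at all, the case in which hyphenate is documented to return null.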
hyphenate
public Hyphenation hyphenate(char[] w,
int offset,
int len,
int remainCharCount,
int pushCharCount)
Hyphenate word and return an array of hyphenation points.
Parameters:
w
- char array that contains the word
offset
- Offset to first character in word
len
- Length of word
remainCharCount
- Minimum number of characters allowed before the hyphenation point.
pushCharCount
- Minimum number of characters allowed after the hyphenation point.
Returns:
a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
addClass
public void addClass(String chargroup)
Add a character class to the tree. It is used by PatternParser as a callback to add character classes. Character classes define the valid word characters for hyphenation. If a word contains a character not defined in any of the classes, it is not hyphenated. Character classes also define a way to normalize characters so that they can be compared with the stored patterns. Pattern files usually use only lower case characters; in that case, a class for the letter 'a', for example, should be defined as "aA", the first character being the normalization character.
Specified by:
addClass in interface IPatternConsumer
Parameters:
chargroup
- a character class (group)
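The normalization role of character classes can be sketched with an ordinary map. This is only an illustration: the class CharClassNormalizer, its HashMap (standing in for the classmap ternary tree), and the normalize helper are hypothetical names introduced for the example.

```java
import java.util.HashMap;
import java.util.Map;

class CharClassNormalizer {
    // Maps every character of a class to the class's first character.
    private final Map<Character, Character> classmap = new HashMap<>();

    // Register a class such as "aA": 'a' and 'A' both normalize to 'a'.
    void addClass(String chargroup) {
        if (chargroup.isEmpty()) {
            return;
        }
        char norm = chargroup.charAt(0);
        for (int i = 0; i < chargroup.length(); i++) {
            classmap.put(chargroup.charAt(i), norm);
        }
    }

    // Normalize a word, or return null if it contains a character that
    // belongs to no class (such a word would not be hyphenated).
    String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            Character norm = classmap.get(word.charAt(i));
            if (norm == null) {
                return null;
            }
            sb.append(norm.charValue());
        }
        return sb.toString();
    }
}
```

After registering "aA" and "bB", the word "AbBa" normalizes to "abba", while "a1" yields null because '1' is in no class.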
addException
public void addException(String word,
List hyphenatedword)
Add an exception to the tree. It is used by the PatternParser class as a callback to store the hyphenation exceptions.
Specified by:
addException in interface IPatternConsumer
Parameters:
word
- normalized word
hyphenatedword
- a list of alternating strings and hyphen objects.
addPattern
public void addPattern(String pattern,
String ivalue)
Add a pattern to the tree. It is mainly used by the PatternParser class as a callback to add a pattern to the tree.
Specified by:
addPattern in interface IPatternConsumer
Parameters:
pattern
- the hyphenation pattern
ivalue
- interletter weight values indicating the desirability and priority of hyphenating at a given point within the pattern. It should contain only digit characters ('0' to '9').
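To make the relationship between pattern and ivalue concrete, the sketch below splits a TeX-style pattern such as "hy3ph" into its letters and a digit string with one value per gap. The TeX-style input notation, the class PatternSplitter, and the use of '0' for gaps without an explicit digit are assumptions for illustration; this page does not describe how PatternParser derives the two arguments.

```java
class PatternSplitter {
    // Split a pattern with interleaved digits into { letters, values }.
    // values has letters.length() + 1 digits: one for each gap, including
    // the positions before the first and after the last letter.
    static String[] split(String raw) {
        StringBuilder letters = new StringBuilder();
        StringBuilder values = new StringBuilder();
        char pending = '0'; // value for the gap before the next letter
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isDigit(c)) {
                pending = c;
            } else {
                values.append(pending);
                letters.append(c);
                pending = '0';
            }
        }
        values.append(pending); // gap after the last letter
        return new String[] { letters.toString(), values.toString() };
    }
}
```

Under these assumptions, "hy3ph" splits into the pattern "hyph" and the interletter values "00300", which is the shape of input that addPattern(pattern, ivalue) expects: letters only in pattern, digits only in ivalue.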
Copyright © 1998–2021 iText Group NV. All rights reserved.