Package com.itextpdf.pdfocr.tesseract4
Class AbstractTesseract4OcrEngine
java.lang.Object
com.itextpdf.pdfocr.tesseract4.AbstractTesseract4OcrEngine
- All Implemented Interfaces:
  IOcrEngine, IProductAware
- Direct Known Subclasses:
  Tesseract4ExecutableOcrEngine, Tesseract4LibOcrEngine
public abstract class AbstractTesseract4OcrEngine extends Object implements IOcrEngine, IProductAware
The implementation of IOcrEngine. This class performs OCR, reads data from input files, and returns the contained text in the required format. It also gives access to features of "tesseract" (an optical character recognition engine for various operating systems).
-
Constructor Summary
Constructor | Description
AbstractTesseract4OcrEngine(Tesseract4OcrEngineProperties tesseract4OcrEngineProperties) | Creates a new AbstractTesseract4OcrEngine instance based on the provided Tesseract4OcrEngineProperties instance.
-
Method Summary
Modifier and Type | Method | Description
void | createTxtFile(List<File> inputImages, File txtFile) | Performs OCR using provided IOcrEngine for the given list of input images and saves output to a text file using provided path.
void | createTxtFile(List<File> inputImages, File txtFile, OcrProcessContext ocrProcessContext) | Performs OCR using provided IOcrEngine for the given list of input images and saves output to a text file using provided path.
Map<Integer,List<TextInfo>> | doImageOcr(File input) | Reads data from the provided input image file and returns retrieved data in the format described below.
final Map<Integer,List<TextInfo>> | doImageOcr(File input, OcrProcessContext ocrProcessContext) | Reads data from the provided input image file and returns retrieved data in the format described below.
final String | doImageOcr(File input, OutputFormat outputFormat) | Reads data from the provided input image file and returns retrieved data as string.
final String | doImageOcr(File input, OutputFormat outputFormat, OcrProcessContext ocrProcessContext) | Reads data from the provided input image file and returns retrieved data as string.
void | doTesseractOcr(File inputImage, File outputFile, OutputFormat outputFormat) | Performs tesseract OCR for the first (or for the only) image page.
void | doTesseractOcr(File inputImage, File outputFile, OutputFormat outputFormat, OcrProcessContext ocrProcessContext) | Performs tesseract OCR for the first (or for the only) image page.
final String | getLanguagesAsString() | Gets list of languages concatenated with "+" symbol to a string in format required by tesseract.
 | getMetaInfoContainer() | Gets the container with meta info.
com.itextpdf.commons.actions.data.ProductData | getProductData() | Gets object containing information about the product.
Tesseract4OcrEngineProperties | getTesseract4OcrEngineProperties() | Gets properties for AbstractTesseract4OcrEngine.
String | identifyOsType() | Identifies type of current OS and returns it (win, linux).
boolean | isWindows() | Checks current os type.
final void | setTesseract4OcrEngineProperties(Tesseract4OcrEngineProperties tesseract4OcrEngineProperties) | Sets properties for AbstractTesseract4OcrEngine.
void | validateLanguages(List<String> languagesList) | Validates list of provided languages and checks if they all exist in given tess data directory.
-
Constructor Details
-
AbstractTesseract4OcrEngine
Creates a new AbstractTesseract4OcrEngine instance based on the provided Tesseract4OcrEngineProperties instance.
- Parameters:
  tesseract4OcrEngineProperties - the Tesseract4OcrEngineProperties instance to use
-
-
Method Details
-
doTesseractOcr
Performs tesseract OCR for the first (or for the only) image page.
- Parameters:
  inputImage - input image File
  outputFile - output file for the result for the first page
  outputFormat - selected OutputFormat for tesseract
-
doTesseractOcr
public void doTesseractOcr(File inputImage, File outputFile, OutputFormat outputFormat, OcrProcessContext ocrProcessContext)
Performs tesseract OCR for the first (or for the only) image page.
- Parameters:
  inputImage - input image File
  outputFile - output file for the result for the first page
  outputFormat - selected OutputFormat for tesseract
  ocrProcessContext - ocr process context
-
createTxtFile
Performs OCR using the provided IOcrEngine for the given list of input images and saves the output to a text file at the provided path.
- Specified by:
  createTxtFile in interface IOcrEngine
- Parameters:
  inputImages - List of images to be OCRed
  txtFile - file to be created
-
createTxtFile
public void createTxtFile(List<File> inputImages, File txtFile, OcrProcessContext ocrProcessContext)
Performs OCR using the provided IOcrEngine for the given list of input images and saves the output to a text file at the provided path.
- Specified by:
  createTxtFile in interface IOcrEngine
- Parameters:
  inputImages - List of images to be OCRed
  txtFile - file to be created
  ocrProcessContext - ocr process context
-
getTesseract4OcrEngineProperties
Gets properties for AbstractTesseract4OcrEngine.
- Returns:
  the Tesseract4OcrEngineProperties that have been set
-
setTesseract4OcrEngineProperties
public final void setTesseract4OcrEngineProperties(Tesseract4OcrEngineProperties tesseract4OcrEngineProperties)
Sets properties for AbstractTesseract4OcrEngine.
- Parameters:
  tesseract4OcrEngineProperties - set of properties Tesseract4OcrEngineProperties for AbstractTesseract4OcrEngine
-
getLanguagesAsString
Gets the list of languages concatenated with the "+" symbol into a string in the format required by tesseract.
- Returns:
  String of concatenated languages
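The "+"-separated format described above can be illustrated with a minimal sketch. It is not the iText implementation; the class name LanguageStringSketch and the helper joinLanguages are hypothetical, and the language codes are only examples.

```java
import java.util.Arrays;
import java.util.List;

public class LanguageStringSketch {

    // Hypothetical helper (not part of iText): joins language codes with "+",
    // the separator tesseract expects when several languages are requested.
    static String joinLanguages(List<String> languages) {
        return String.join("+", languages);
    }

    public static void main(String[] args) {
        List<String> languages = Arrays.asList("eng", "deu", "fra");
        // Prints the combined language string for the three example codes.
        System.out.println(joinLanguages(languages));
    }
}
```

A single-language list yields that language code unchanged, since there is nothing to join.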
-
doImageOcr
Reads data from the provided input image file and returns retrieved data in the format described below.
- Specified by:
  doImageOcr in interface IOcrEngine
- Parameters:
  input - input image File
- Returns:
  Map where the key is an Integer representing the page number and the value is a List of TextInfo elements, where each TextInfo element contains a word or a line and its 4 coordinates (bbox)
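To make the shape of the returned map easier to picture, here is a self-contained sketch that walks a page-number-to-entries map. It does not use the iText classes: WordBox is a hypothetical stand-in for TextInfo, and the sample words and coordinate values are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OcrResultShapeSketch {

    // Hypothetical stand-in for TextInfo: recognized text plus four bbox coordinates.
    static final class WordBox {
        final String text;
        final float[] bbox;

        WordBox(String text, float... bbox) {
            this.text = text;
            this.bbox = bbox;
        }
    }

    // Builds a small hand-written result: page 1 holds two words.
    static Map<Integer, List<WordBox>> sample() {
        Map<Integer, List<WordBox>> result = new LinkedHashMap<>();
        result.put(1, Arrays.asList(
                new WordBox("Hello", 10f, 700f, 60f, 720f),
                new WordBox("world", 65f, 700f, 120f, 720f)));
        return result;
    }

    // Joins the text of all entries on one page with single spaces.
    static String pageText(Map<Integer, List<WordBox>> result, int page) {
        StringBuilder text = new StringBuilder();
        for (WordBox box : result.getOrDefault(page, Collections.emptyList())) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(box.text);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Integer, List<WordBox>> result = sample();
        for (Integer page : result.keySet()) {
            System.out.println("page " + page + ": " + pageText(result, page));
        }
    }
}
```

With the real API, the same iteration pattern would apply to the Map<Integer,List<TextInfo>> returned by doImageOcr.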
-
doImageOcr
public final Map<Integer,List<TextInfo>> doImageOcr(File input, OcrProcessContext ocrProcessContext)
Reads data from the provided input image file and returns retrieved data in the format described below.
- Specified by:
  doImageOcr in interface IOcrEngine
- Parameters:
  input - input image File
  ocrProcessContext - ocr process context
- Returns:
  Map where the key is an Integer representing the page number and the value is a List of TextInfo elements, where each TextInfo element contains a word or a line and its 4 coordinates (bbox)
-
doImageOcr
public final String doImageOcr(File input, OutputFormat outputFormat, OcrProcessContext ocrProcessContext)
Reads data from the provided input image file and returns retrieved data as a string.
- Parameters:
  input - input image File
  outputFormat - the OutputFormat of the returned result
  ocrProcessContext - ocr process context
- Returns:
  OCR result as a String that is returned after processing the given image
-
doImageOcr
Reads data from the provided input image file and returns retrieved data as a string.
- Parameters:
  input - input image File
  outputFormat - the OutputFormat of the returned result
- Returns:
  OCR result as a String that is returned after processing the given image
-
isWindows
public boolean isWindows()
Checks the current OS type.
- Returns:
  true if the current OS is Windows, false otherwise
-
identifyOsType
Identifies the type of the current OS and returns it (win, linux).
- Returns:
  type of the current OS as a String
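One common way to detect the OS type in Java is to inspect the "os.name" system property. The sketch below shows that approach under stated assumptions: the class OsTypeSketch, its methods, and the "other" fallback value are hypothetical and do not reproduce the iText implementation.

```java
import java.util.Locale;

public class OsTypeSketch {

    // Hypothetical check: treats any OS name starting with "windows" as Windows.
    static boolean isWindows(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    // Hypothetical classification into the two values named in the documentation,
    // plus an assumed "other" fallback for everything else.
    static String classify(String osName) {
        String name = osName.toLowerCase(Locale.ROOT);
        if (name.startsWith("windows")) {
            return "win";
        }
        if (name.contains("linux")) {
            return "linux";
        }
        return "other";
    }

    public static void main(String[] args) {
        // "os.name" is a standard JVM system property.
        String osName = System.getProperty("os.name", "unknown");
        System.out.println(classify(osName) + " (windows: " + isWindows(osName) + ")");
    }
}
```

Passing the OS name as a parameter, rather than reading the property inside each method, keeps the logic easy to exercise with fixed inputs.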
-
validateLanguages
Validates the list of provided languages and checks that they all exist in the given tess data directory.
- Parameters:
  languagesList - List of provided languages
- Throws:
  PdfOcrTesseract4Exception - if tess data wasn't found for one of the languages from the provided list
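The kind of check described above can be sketched as a lookup of one trained-data file per language. This is an illustration only, not the iText implementation: the class TessDataCheckSketch and the method missingLanguages are hypothetical, the "tessdata" default directory is an assumption, and the sketch assumes the usual tesseract convention of a file named after the language code with a ".traineddata" extension. Instead of throwing, it returns the languages that were not found.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TessDataCheckSketch {

    // Hypothetical helper: returns the languages that have no
    // "<language>.traineddata" file in the given directory.
    static List<String> missingLanguages(File tessDataDir, List<String> languages) {
        List<String> missing = new ArrayList<>();
        for (String language : languages) {
            File trainedData = new File(tessDataDir, language + ".traineddata");
            if (!trainedData.isFile()) {
                missing.add(language);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Directory name is an assumption; pass a real tess data path as the first argument.
        File tessDataDir = new File(args.length > 0 ? args[0] : "tessdata");
        List<String> missing = missingLanguages(tessDataDir, Arrays.asList("eng", "deu"));
        if (missing.isEmpty()) {
            System.out.println("All languages found in " + tessDataDir);
        } else {
            System.out.println("Missing tess data for: " + missing);
        }
    }
}
```

A caller that prefers the exception-based contract documented here could throw when the returned list is not empty.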
-
getMetaInfoContainer
Gets the container with meta info.
- Specified by:
  getMetaInfoContainer in interface IProductAware
- Returns:
  the held meta info container
-
getProductData
public com.itextpdf.commons.actions.data.ProductData getProductData()
Description copied from interface: IProductAware
Gets object containing information about the product.
- Specified by:
  getProductData in interface IProductAware
- Returns:
  product data
-