public class Pdf2DataExtractor extends Object
Pdf2DataExtractor extracts data from files.
To create an instance of Pdf2DataExtractor for extracting data from PDF files, use create(File).
To create an instance of Pdf2DataExtractor for extracting data from images, use create(File, OcrWithPostProcessingEngine).
To extract data from a PDF file, use the extract(File) method.
To extract data from an image, use the extract(File, RecognitionProperties) method, specifying the file type via the RecognitionProperties instance.
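Because PDF and image inputs go through different extract overloads, calling code has to know which kind of file it holds. The following is a minimal sketch of one way to decide, assuming a well-formed PDF starts with the bytes `%PDF-`. The class `InputKind` and the method `looksLikePdf` are hypothetical helpers, not part of pdf2Data, and the sketch uses only the Java standard library (Java 11 or later for `readNBytes`).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of the pdf2Data API.
final class InputKind {
    // Header bytes that open a well-formed PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private InputKind() {}

    /** Returns true if the file begins with the PDF header bytes. */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // Reads at most PDF_MAGIC.length bytes; shorter files yield a shorter array.
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstration on a temporary file that carries only a PDF header.
        Path sample = Files.createTempFile("sample", ".bin");
        try {
            Files.write(sample, "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII));
            System.out.println(looksLikePdf(sample)
                    ? "use extract(File)"
                    : "use extract(File, RecognitionProperties)");
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

A header check of this kind is only a heuristic: it says nothing about whether the PDF is encrypted or otherwise readable, so the exceptions documented for extract still need handling.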
| Modifier and Type | Method and Description |
|---|---|
| Map<String,Integer> | check(File targetPDF) - Recognizes the PDF file and returns the number of recognition results. |
| Map<String,Integer> | check(File targetFile, RecognitionProperties properties) - Recognizes the document and returns the number of recognition results. |
| Map<String,Integer> | check(InputStream targetInputStream) - Recognizes the PDF file and returns the number of recognition results. |
| Map<String,Integer> | check(InputStream targetInputStream, RecognitionProperties properties) - Recognizes the document and returns the number of recognition results. |
| static Pdf2DataExtractor | create(File p2dFile) - Creates an instance of Pdf2DataExtractor from a pdf2data template file. |
| static Pdf2DataExtractor | create(File p2dFile, OcrWithPostProcessingEngine ocrEngine) - Creates an instance of Pdf2DataExtractor from a pdf2data template file with the provided OCR engine. |
| static Pdf2DataExtractor | createFromTemplateContentJson(InputStream templateContentJsonStream) - Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format. |
| static Pdf2DataExtractor | createFromTemplateContentJson(InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine) - Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format, with the provided OCR engine. |
| RecognitionResultHolder | extract(File targetPDF) - Recognizes the PDF file. |
| RecognitionResultHolder | extract(File targetFile, RecognitionProperties properties) - Recognizes the file. |
| RecognitionResultHolder | extract(InputStream targetInputStream) - Recognizes the PDF file. |
| RecognitionResultHolder | extract(InputStream targetInputStream, RecognitionProperties properties) - Recognizes the file. |
| com.itextpdf.pdfocr.IOcrEngine | getOcrEngine() - Gets the current OCR engine instance. |
| com.itextpdf.pdf2data.template.Template | getTemplate() - Gets the current template instance. |
public static Pdf2DataExtractor create(File p2dFile) throws IOException
Creates an instance of Pdf2DataExtractor from a pdf2data template file. Note that the template should be processed.
p2dFile - pdf2data template archive
IOException - if any I/O exception occurs
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract template from passed archive
public static Pdf2DataExtractor create(File p2dFile, OcrWithPostProcessingEngine ocrEngine) throws IOException
Creates an instance of Pdf2DataExtractor from a pdf2data template file with the provided OCR engine. Note that the template should be processed.
p2dFile - pdf2data template archive
ocrEngine - OCR engine to use for recognitions that involve OCR. May be null if no such recognitions are used.
IOException - if any I/O exception occurs
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract template from passed archive
public static Pdf2DataExtractor createFromTemplateContentJson(InputStream templateContentJsonStream)
Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format. Note that the template should be processed.
templateContentJsonStream - processed template content stream
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract template from passed archive
public static Pdf2DataExtractor createFromTemplateContentJson(InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine)
Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format, with the provided OCR engine. Note that the template should be processed.
templateContentJsonStream - processed template content stream
ocrEngine - OCR engine to use for recognitions that involve OCR. May be null if no such recognitions are used.
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract template from passed archive
public com.itextpdf.pdf2data.template.Template getTemplate()
public com.itextpdf.pdfocr.IOcrEngine getOcrEngine()
public RecognitionResultHolder extract(File targetPDF) throws IOException
targetPDF - pdf file for recognition
RecognitionResultHolder instance
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public RecognitionResultHolder extract(File targetFile, RecognitionProperties properties) throws IOException
targetFile - file for recognition
properties - a RecognitionProperties instance
RecognitionResultHolder instance
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public RecognitionResultHolder extract(InputStream targetInputStream) throws IOException
targetInputStream - input stream from pdf file for recognition
RecognitionResultHolder instance
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public RecognitionResultHolder extract(InputStream targetInputStream, RecognitionProperties properties) throws IOException
targetInputStream - input stream from file for recognition
properties - a RecognitionProperties instance
RecognitionResultHolder instance
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public Map<String,Integer> check(File targetPDF) throws IOException
targetPDF - pdf file for recognition
Map containing the recognition results as key-value pairs of strings and integers.
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public Map<String,Integer> check(File targetFile, RecognitionProperties properties) throws IOException
targetFile - file for recognition
properties - a RecognitionProperties instance
Map containing the recognition results as key-value pairs of strings and integers.
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public Map<String,Integer> check(InputStream targetInputStream) throws IOException
targetInputStream - input stream from pdf file for recognition
Map containing the recognition results as key-value pairs of strings and integers.
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
public Map<String,Integer> check(InputStream targetInputStream, RecognitionProperties properties) throws IOException
targetInputStream - input stream from file for recognition
properties - a RecognitionProperties instance
Map containing the recognition results as key-value pairs of strings and integers.
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if pdf document is encrypted and extracting text is not permitted
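The check methods return a Map<String,Integer>, which lends itself to a quick scan for entries that produced nothing. The sketch below assumes that each key names a template element and each value is the number of results recognized for it; the documentation above does not spell this out, so treat that reading as an assumption. The class `CheckSummary`, the method `keysWithoutResults`, and the sample keys are hypothetical and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the pdf2Data API.
final class CheckSummary {
    private CheckSummary() {}

    /** Returns the keys whose count is null or zero, in map iteration order. */
    static List<String> keysWithoutResults(Map<String, Integer> checkResult) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : checkResult.entrySet()) {
            Integer count = entry.getValue();
            if (count == null || count == 0) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for a map returned by check(...); keys and counts are made up.
        Map<String, Integer> result = new LinkedHashMap<>();
        result.put("invoiceNumber", 1);
        result.put("totalAmount", 0);
        result.put("lineItems", 3);
        System.out.println(keysWithoutResults(result)); // prints [totalAmount]
    }
}
```

In real use the map would come from one of the check overloads on a Pdf2DataExtractor instance, and a non-empty list could serve as a signal that the template does not fit the document.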
Copyright © 2024. All rights reserved.