Package com.itextpdf.pdf2data
Class Pdf2DataExtractor
java.lang.Object
com.itextpdf.pdf2data.Pdf2DataExtractor
Pdf2DataExtractor is a class for extracting data from files.
To create an instance of Pdf2DataExtractor for extracting data from a PDF file, use create(File).
To create an instance of Pdf2DataExtractor for extracting data from an image, use create(File, OcrWithPostProcessingEngine).
To extract data from a PDF file, use the extract(File) method.
To extract data from an image, use the extract(File, RecognitionProperties) method, with the file type specified via the RecognitionProperties instance.
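The overview separates PDF input from image input, so a caller has to decide which overload applies. The following is a minimal sketch of that decision, assuming the file extension is a reliable indicator of the input kind. That assumption, the class InputKind, and both helper methods are hypothetical and not part of the library; the sketch only names the documented overload and does not call it.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical helper, not part of pdf2data: picks an overload by file extension.
final class InputKind {

    /** Returns true when the file name ends with ".pdf", ignoring case. */
    public static boolean looksLikePdf(File file) {
        return file.getName().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Names the documented extract overload that matches the input kind. */
    public static String overloadFor(File file) {
        return looksLikePdf(file)
                ? "extract(File)"
                : "extract(File, RecognitionProperties)";
    }

    public static void main(String[] args) {
        System.out.println(overloadFor(new File("invoice.PDF")));
        System.out.println(overloadFor(new File("scan.png")));
    }
}
```

In real code the returned label would be replaced by the corresponding call on a Pdf2DataExtractor instance; checking the file content rather than its name would be more robust.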
-
Method Summary
Modifier and Type | Method | Description
Map<String,Integer> | check(File targetPDF) | Recognizes the PDF file and returns the recognition result counts.
Map<String,Integer> | check(File targetFile, RecognitionProperties properties) | Recognizes the document and returns the recognition result counts.
Map<String,Integer> | check(InputStream targetInputStream) | Recognizes the PDF file and returns the recognition result counts.
Map<String,Integer> | check(InputStream targetInputStream, RecognitionProperties properties) | Recognizes the document and returns the recognition result counts.
static Pdf2DataExtractor | create(File p2dFile) | Creates an instance of Pdf2DataExtractor from a pdf2data template file.
static Pdf2DataExtractor | create(File p2dFile, OcrWithPostProcessingEngine ocrEngine) | Creates an instance of Pdf2DataExtractor from a pdf2data template file with the provided OCR engine.
static Pdf2DataExtractor | createFromTemplateContentJson(InputStream templateContentJsonStream) | Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format.
static Pdf2DataExtractor | createFromTemplateContentJson(InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine) | Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format.
RecognitionResultHolder | extract(File targetPDF) | Recognizes the PDF file.
RecognitionResultHolder | extract(File targetFile, RecognitionProperties properties) | Recognizes the file.
RecognitionResultHolder | extract(InputStream targetInputStream) | Recognizes the PDF file.
RecognitionResultHolder | extract(InputStream targetInputStream, RecognitionProperties properties) | Recognizes the file.
com.itextpdf.pdfocr.IOcrEngine | getOcrEngine() | Gets the current OCR engine instance.
com.itextpdf.pdf2data.template.Template | getTemplate() | Gets the current template instance.
-
Method Details
-
create
Creates an instance of Pdf2DataExtractor from a pdf2data template file. Note that the template should be processed.
Parameters:
p2dFile - pdf2data template archive
Returns:
a Pdf2DataExtractor instance
Throws:
IOException - if any I/O exception occurs
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract the template from the passed archive
-
create
public static Pdf2DataExtractor create(File p2dFile, OcrWithPostProcessingEngine ocrEngine) throws IOException
Creates an instance of Pdf2DataExtractor from a pdf2data template file with the provided OCR engine. Note that the template should be processed.
Parameters:
p2dFile - pdf2data template archive
ocrEngine - OCR engine to be used for recognitions that involve OCR. May be null if no recognitions that involve OCR are used.
Returns:
a Pdf2DataExtractor instance
Throws:
IOException - if any I/O exception occurs
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract the template from the passed archive
-
createFromTemplateContentJson
public static Pdf2DataExtractor createFromTemplateContentJson(InputStream templateContentJsonStream)
Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format. Note that the template should be processed.
Parameters:
templateContentJsonStream - processed template content stream
Returns:
a Pdf2DataExtractor instance
Throws:
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract the template from the passed archive
-
createFromTemplateContentJson
public static Pdf2DataExtractor createFromTemplateContentJson(InputStream templateContentJsonStream, OcrWithPostProcessingEngine ocrEngine)
Creates an instance of Pdf2DataExtractor from a stream that contains pdf2data template content in JSON format. Note that the template should be processed.
Parameters:
templateContentJsonStream - processed template content stream
ocrEngine - OCR engine to be used for recognitions that involve OCR. May be null if no recognitions that involve OCR are used.
Returns:
a Pdf2DataExtractor instance
Throws:
com.itextpdf.pdf2data.exceptions.TemplateConversionException - if it's impossible to extract the template from the passed archive
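Both createFromTemplateContentJson overloads take an InputStream, so the caller has to obtain one first. The following is a minimal sketch of two ways to do that with the JDK alone. The class TemplateStreams and its methods are hypothetical helpers, the UTF-8 encoding is an assumption, and "{}" is a placeholder rather than a valid pdf2data template. Whether the library closes the stream is not stated in this documentation, so the sketch closes it on the caller side.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helpers, not part of pdf2data: produce the InputStream argument.
final class TemplateStreams {

    /** Opens a stream over a JSON file on disk; the caller must close it. */
    public static InputStream fromFile(Path jsonFile) throws IOException {
        return Files.newInputStream(jsonFile);
    }

    /** Wraps JSON text already held in memory, encoded as UTF-8 (an assumption). */
    public static ByteArrayInputStream fromString(String json) {
        return new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // "{}" is a placeholder, not a valid pdf2data template.
        try (InputStream in = fromString("{}")) {
            // Here the stream would be passed to
            // Pdf2DataExtractor.createFromTemplateContentJson(in).
            System.out.println(in.available() + " bytes ready");
        }
    }
}
```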
-
getTemplate
public com.itextpdf.pdf2data.template.Template getTemplate()
Gets the current template instance.
Returns:
the current template instance
-
getOcrEngine
public com.itextpdf.pdfocr.IOcrEngine getOcrEngine()
Gets the current OCR engine instance.
Returns:
the current OCR engine instance
-
extract
Recognizes the PDF file.
Parameters:
targetPDF - PDF file for recognition
Returns:
a RecognitionResultHolder instance
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
extract
public RecognitionResultHolder extract(File targetFile, RecognitionProperties properties) throws IOException Recognize the file.- Parameters:
-
targetFile
- file for recognition -
properties
- aRecognitionProperties
instance - Returns:
-
RecognitionResultHolder
instance - Throws:
-
IOException
- if any I/O issue occurs -
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException
- if pdf document is encrypted and extracting text is not permitted
-
extract
Recognizes the PDF file.
Parameters:
targetInputStream - input stream from the PDF file for recognition
Returns:
a RecognitionResultHolder instance
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
extract
public RecognitionResultHolder extract(InputStream targetInputStream, RecognitionProperties properties) throws IOException
Recognizes the file.
Parameters:
targetInputStream - input stream from the file for recognition
properties - a RecognitionProperties instance
Returns:
a RecognitionResultHolder instance
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
check
Recognizes the PDF file and returns the recognition result counts.
Parameters:
targetPDF - PDF file for recognition
Returns:
a Map containing the recognition results as key-value pairs of strings and integers
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
check
public Map<String,Integer> check(File targetFile, RecognitionProperties properties) throws IOException
Recognizes the document and returns the recognition result counts.
Parameters:
targetFile - file for recognition
properties - a RecognitionProperties instance
Returns:
a Map containing the recognition results as key-value pairs of strings and integers
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
check
Recognizes the PDF file and returns the recognition result counts.
Parameters:
targetInputStream - input stream from the PDF file for recognition
Returns:
a Map containing the recognition results as key-value pairs of strings and integers
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
check
public Map<String,Integer> check(InputStream targetInputStream, RecognitionProperties properties) throws IOException
Recognizes the document and returns the recognition result counts.
Parameters:
targetInputStream - input stream from the file for recognition
properties - a RecognitionProperties instance
Returns:
a Map containing the recognition results as key-value pairs of strings and integers
Throws:
IOException - if any I/O issue occurs
com.itextpdf.pdf2data.exceptions.DocumentExtractionDeniedException - if the PDF document is encrypted and extracting text is not permitted
-
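The check overloads all return a Map<String,Integer>. The following is a minimal sketch of how such a map might be summarized, assuming each key names a template field and each value is the number of results recognized for it. That interpretation, the class CheckSummary, and the sample keys are assumptions for illustration; the documentation above states only that the map holds strings and integers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of pdf2data: summarizes a map shaped like check() output.
final class CheckSummary {

    /** Sums all counts; a null value is treated as zero. */
    public static int totalResults(Map<String, Integer> counts) {
        int total = 0;
        for (Integer count : counts.values()) {
            total += (count == null) ? 0 : count;
        }
        return total;
    }

    /** Lists the keys whose count is null or zero, in map iteration order. */
    public static List<String> keysWithoutResults(Map<String, Integer> counts) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            Integer count = entry.getValue();
            if (count == null || count == 0) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample data standing in for the result of a check(...) call.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("invoiceNumber", 1);
        counts.put("total", 0);
        counts.put("lineItems", 3);

        System.out.println("total results: " + totalResults(counts));
        System.out.println("no results for: " + keysWithoutResults(counts));
    }
}
```

A summary like this could serve as a quick test of whether a template matches a document before running a full extract call.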