Class OcrPdfCreator

java.lang.Object
com.itextpdf.pdfocr.OcrPdfCreator

public class OcrPdfCreator extends Object
OcrPdfCreator creates PDF documents containing the input images and the text recognized in them by the provided IOcrEngine.

OcrPdfCreator lets you set the list of input images to be used for OCR, the scaling mode for images, the color of the text in the output PDF document and a fixed size for the PDF document's page. It then performs OCR on the given images and returns a PdfDocument as the result. OCR is performed by the provided IOcrEngine (e.g. a Tesseract reader). The engine is mandatory and must be provided either in the constructor or through the setter.

  • Constructor Summary

    Constructors
    Constructor
    Description
    OcrPdfCreator(IOcrEngine ocrEngine)
    Creates a new OcrPdfCreator instance.
    OcrPdfCreator(IOcrEngine ocrEngine, OcrPdfCreatorProperties ocrPdfCreatorProperties)
    Creates a new OcrPdfCreator instance.
  • Method Summary

    Modifier and Type
    Method
    Description
    final com.itextpdf.kernel.pdf.PdfDocument
    createPdf(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter.
    final com.itextpdf.kernel.pdf.PdfDocument
    createPdf(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter.
    final com.itextpdf.kernel.pdf.PdfDocument
    createPdf(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties, IOcrProcessProperties ocrProcessProperties)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter.
    final com.itextpdf.kernel.pdf.PdfDocument
    createPdfA(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter, DocumentProperties and PdfOutputIntent.
    final com.itextpdf.kernel.pdf.PdfDocument
    createPdfA(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent, IOcrProcessProperties ocrProcessProperties)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter, DocumentProperties and PdfOutputIntent.
    final com.itextpdf.kernel.pdf.PdfDocument
    createPdfA(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter and PdfOutputIntent.
    void
    createPdfAFile(List<File> inputImages, File outPdfFile, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided File and PdfOutputIntent.
    void
    createPdfFile(List<File> inputImages, File outPdfFile)
    Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided File.
    final IOcrEngine
    getOcrEngine()
    Gets the IOcrEngine reader object used to perform OCR.
    final OcrPdfCreatorProperties
    getOcrPdfCreatorProperties()
    Gets properties for OcrPdfCreator.
    void
    makePdfSearchable(com.itextpdf.kernel.pdf.PdfDocument pdfDoc)
    Performs OCR of all images in an input PDF document and adds recognized text on top of the images.
    void
    makePdfSearchable(com.itextpdf.kernel.pdf.PdfDocument pdfDoc, IOcrProcessProperties ocrProcessProperties)
    Performs OCR of all images in an input PDF document and adds recognized text on top of the images.
    void
    makePdfSearchable(File inputPdf, File outputPdf)
    Performs OCR of all images in an input PDF file and generates a searchable PDF.
    void
    makePdfSearchable(File inputPdf, File outputPdf, IOcrProcessProperties ocrProcessProperties)
    Performs OCR of all images in an input PDF file and generates a searchable PDF.
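    final void
    setOcrEngine(IOcrEngine reader)
    Sets the IOcrEngine reader object used to perform OCR.
    final void
    setOcrPdfCreatorProperties(OcrPdfCreatorProperties ocrPdfCreatorProperties)
    Sets properties for OcrPdfCreator.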
    protected void
    validateInputPdfDocument(com.itextpdf.kernel.pdf.PdfDocument pdfDoc)
    Validates input PDF document.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

  • Method Details

    • getOcrPdfCreatorProperties

      public final OcrPdfCreatorProperties getOcrPdfCreatorProperties()
      Gets properties for OcrPdfCreator.
      Returns:
      the OcrPdfCreatorProperties currently set
    • setOcrPdfCreatorProperties

      public final void setOcrPdfCreatorProperties(OcrPdfCreatorProperties ocrPdfCreatorProperties)
      Sets properties for OcrPdfCreator.
      Parameters:
      ocrPdfCreatorProperties - the OcrPdfCreatorProperties to use for this OcrPdfCreator
    • createPdfA

      public final com.itextpdf.kernel.pdf.PdfDocument createPdfA(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent, IOcrProcessProperties ocrProcessProperties) throws PdfOcrException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter, DocumentProperties and PdfOutputIntent. A PDF/A-3u document is created if the provided PdfOutputIntent is not null.

      NOTE: executing this method dispatches a product event from both itextcore and pdfOcr. Use it only if you need to work with the generated PdfDocument. Otherwise, use the createPdfAFile(java.util.List, java.io.File, com.itextpdf.kernel.pdf.PdfOutputIntent) method, which dispatches only the pdfOcr event.

      Parameters:
      inputImages - List of images to be OCRed
      pdfWriter - the PdfWriter object to write final PDF document to
      documentProperties - document properties
      pdfOutputIntent - PdfOutputIntent for PDF/A-3u document
      ocrProcessProperties - extra OCR process properties passed to OcrProcessContext
      Returns:
      result PDF/A-3u PdfDocument object
      Throws:
      PdfOcrException - if it was not possible to read provided or default font
    • createPdfA

      public final com.itextpdf.kernel.pdf.PdfDocument createPdfA(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent) throws PdfOcrException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter and PdfOutputIntent. A PDF/A-3u document is created if the provided PdfOutputIntent is not null.

      NOTE: executing this method dispatches a product event from both itextcore and pdfOcr. Use it only if you need to work with the generated PdfDocument. Otherwise, use the createPdfAFile(java.util.List, java.io.File, com.itextpdf.kernel.pdf.PdfOutputIntent) method, which dispatches only the pdfOcr event.

      Parameters:
      inputImages - List of images to be OCRed
      pdfWriter - the PdfWriter object to write final PDF document to
      pdfOutputIntent - PdfOutputIntent for PDF/A-3u document
      Returns:
      result PDF/A-3u PdfDocument object
      Throws:
      PdfOcrException - if it was not possible to read provided or default font
    • createPdfA

      public final com.itextpdf.kernel.pdf.PdfDocument createPdfA(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent) throws PdfOcrException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter, DocumentProperties and PdfOutputIntent. A PDF/A-3u document is created if the provided PdfOutputIntent is not null.

      NOTE: executing this method dispatches a product event from both itextcore and pdfOcr. Use it only if you need to work with the generated PdfDocument. Otherwise, use the createPdfAFile(java.util.List, java.io.File, com.itextpdf.kernel.pdf.PdfOutputIntent) method, which dispatches only the pdfOcr event.

      Parameters:
      inputImages - List of images to be OCRed
      pdfWriter - the PdfWriter object to write final PDF document to
      documentProperties - document properties
      pdfOutputIntent - PdfOutputIntent for PDF/A-3u document
      Returns:
      result PDF/A-3u PdfDocument object
      Throws:
      PdfOcrException - if it was not possible to read provided or default font
    • createPdf

      public final com.itextpdf.kernel.pdf.PdfDocument createPdf(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties, IOcrProcessProperties ocrProcessProperties) throws PdfOcrException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter.

      NOTE: executing this method dispatches a product event from both itextcore and pdfOcr. Use it only if you need to work with the generated PdfDocument. Otherwise, use the createPdfFile(java.util.List, java.io.File) method, which dispatches only the pdfOcr event.

      Parameters:
      inputImages - List of images to be OCRed
      pdfWriter - the PdfWriter object to write final PDF document to
      documentProperties - document properties
      ocrProcessProperties - extra OCR process properties passed to OcrProcessContext
      Returns:
      result PdfDocument object
      Throws:
      PdfOcrException - if provided font is incorrect
    • createPdf

      public final com.itextpdf.kernel.pdf.PdfDocument createPdf(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter, com.itextpdf.kernel.pdf.DocumentProperties documentProperties) throws PdfOcrException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter.

      NOTE: executing this method dispatches a product event from both itextcore and pdfOcr. Use it only if you need to work with the generated PdfDocument. Otherwise, use the createPdfFile(java.util.List, java.io.File) method, which dispatches only the pdfOcr event.

      Parameters:
      inputImages - List of images to be OCRed
      pdfWriter - the PdfWriter object to write final PDF document to
      documentProperties - document properties
      Returns:
      result PdfDocument object
      Throws:
      PdfOcrException - if provided font is incorrect
    • createPdf

      public final com.itextpdf.kernel.pdf.PdfDocument createPdf(List<File> inputImages, com.itextpdf.kernel.pdf.PdfWriter pdfWriter) throws PdfOcrException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided PdfWriter.

      NOTE: executing this method dispatches a product event from both itextcore and pdfOcr. Use it only if you need to work with the generated PdfDocument. Otherwise, use the createPdfFile(java.util.List, java.io.File) method, which dispatches only the pdfOcr event.

      Parameters:
      inputImages - List of images to be OCRed
      pdfWriter - the PdfWriter object to write final PDF document to
      Returns:
      result PdfDocument object
      Throws:
      PdfOcrException - if provided font is incorrect
    • createPdfFile

      public void createPdfFile(List<File> inputImages, File outPdfFile) throws PdfOcrException, IOException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided File.
      Parameters:
      inputImages - List of images to be OCRed
      outPdfFile - the File object to write final PDF document to
      Throws:
      IOException - signals that an I/O exception of some sort has occurred
      PdfOcrException - if it was not possible to read provided or default font
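      The inputImages argument is an ordinary java.util.List of File objects, so it can be assembled with the JDK alone. The following is a minimal sketch and not part of pdfOcr: the OcrInputImages class, its collect method and the extension list are hypothetical, and the image formats that can actually be read depend on the IOcrEngine implementation. Sorting by file name is meant to keep the order stable, on the assumption that the pages of the resulting PDF follow the order of the list.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that builds an inputImages list from a directory. */
final class OcrInputImages {
    // Illustrative extension list (assumption): supported formats depend on the IOcrEngine in use.
    private static final List<String> EXTENSIONS =
            Arrays.asList(".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp");

    private OcrInputImages() {
    }

    /** Returns the image files located directly inside dir, sorted by file name. */
    static List<File> collect(File dir) {
        List<File> images = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            // dir does not exist, is not a directory, or could not be read
            return images;
        }
        for (File entry : entries) {
            // Skip subdirectories, even if their name looks like an image file
            if (entry.isFile() && hasImageExtension(entry.getName())) {
                images.add(entry);
            }
        }
        images.sort(Comparator.comparing(File::getName));
        return images;
    }

    /** Case-insensitive check of the file name against the extension list. */
    private static boolean hasImageExtension(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String extension : EXTENSIONS) {
            if (lower.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }
}
```

      The returned list could then be passed as the inputImages argument, for example createPdfFile(OcrInputImages.collect(new File("scans")), new File("scans.pdf")) on a configured OcrPdfCreator. Both paths are placeholders.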
    • createPdfAFile

      public void createPdfAFile(List<File> inputImages, File outPdfFile, com.itextpdf.kernel.pdf.PdfOutputIntent pdfOutputIntent) throws PdfOcrException, IOException
      Performs OCR with set parameters using provided IOcrEngine and creates PDF using provided File and PdfOutputIntent. A PDF/A-3u document is created if the provided PdfOutputIntent is not null.
      Parameters:
      inputImages - List of images to be OCRed
      outPdfFile - the File object to write final PDF document to
      pdfOutputIntent - PdfOutputIntent for PDF/A-3u document
      Throws:
      IOException - signals that an I/O exception of some sort has occurred
      PdfOcrException - if it was not possible to read provided or default font
    • getOcrEngine

      public final IOcrEngine getOcrEngine()
      Gets the IOcrEngine reader object used to perform OCR.
      Returns:
      selected IOcrEngine instance
    • setOcrEngine

      public final void setOcrEngine(IOcrEngine reader)
      Sets the IOcrEngine reader object used to perform OCR.
      Parameters:
      reader - selected IOcrEngine instance
    • makePdfSearchable

      public void makePdfSearchable(File inputPdf, File outputPdf) throws com.itextpdf.io.exceptions.IOException, PdfOcrException
      Performs OCR of all images in an input PDF file and generates a searchable PDF.

      By default, this method does not allow OCR of PDF/A documents or tagged documents. The reason is that the resulting document might not comply with the PDF/A specification and the added content might not be tagged, depending on the IOcrEngine implementation. To change this behavior, override validateInputPdfDocument(com.itextpdf.kernel.pdf.PdfDocument) with an empty implementation.

      Note that OcrPdfCreatorProperties.setPageSize(com.itextpdf.kernel.geom.Rectangle), OcrPdfCreatorProperties.setScaleMode(ScaleMode) and OcrPdfCreatorProperties.setImageLayerName(String) have no effect on this method.

      Parameters:
      inputPdf - PDF file to OCR
      outputPdf - searchable PDF with the recognized text on top of the images
      Throws:
      com.itextpdf.io.exceptions.IOException - if an image cannot be extracted from a PDF file
      PdfOcrException - in case of any other OCR error
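      This overload takes separate input and output files. A caller may want to derive the output path from the input path and make sure the two never coincide, on the assumption (not stated in this documentation) that inputPdf is still being read while outputPdf is written. The following is a minimal JDK-only sketch: the SearchableTargets class and the "-searchable" suffix are hypothetical choices, not part of pdfOcr.

```java
import java.io.File;

/** Hypothetical helper that picks an output file for a given input PDF. */
final class SearchableTargets {
    private SearchableTargets() {
    }

    /**
     * Returns a file in the same directory as inputPdf whose name is the input
     * base name followed by "-searchable.pdf". The result always differs from
     * the input name, because the suffix is appended to the base name.
     */
    static File forInput(File inputPdf) {
        String name = inputPdf.getName();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension or the name starts with a dot
        String base = dot > 0 ? name.substring(0, dot) : name;
        // A null parent is accepted by this File constructor and yields a relative path
        return new File(inputPdf.getParentFile(), base + "-searchable.pdf");
    }
}
```

      The result could be passed as outputPdf, for example makePdfSearchable(inputPdf, SearchableTargets.forInput(inputPdf)) on a configured OcrPdfCreator.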
    • makePdfSearchable

      public void makePdfSearchable(File inputPdf, File outputPdf, IOcrProcessProperties ocrProcessProperties) throws com.itextpdf.io.exceptions.IOException, PdfOcrException
      Performs OCR of all images in an input PDF file and generates a searchable PDF.

      By default, this method does not allow OCR of PDF/A documents or tagged documents. The reason is that the resulting document might not comply with the PDF/A specification and the added content might not be tagged, depending on the IOcrEngine implementation. To change this behavior, override validateInputPdfDocument(com.itextpdf.kernel.pdf.PdfDocument) with an empty implementation.

      Note that OcrPdfCreatorProperties.setPageSize(com.itextpdf.kernel.geom.Rectangle), OcrPdfCreatorProperties.setScaleMode(ScaleMode) and OcrPdfCreatorProperties.setImageLayerName(String) have no effect on this method.

      Parameters:
      inputPdf - PDF file to OCR
      outputPdf - searchable PDF with the recognized text on top of the images
      ocrProcessProperties - extra OCR process properties passed to OcrProcessContext
      Throws:
      com.itextpdf.io.exceptions.IOException - if an image cannot be extracted from a PDF file
      PdfOcrException - in case of any other OCR error
    • makePdfSearchable

      public void makePdfSearchable(com.itextpdf.kernel.pdf.PdfDocument pdfDoc) throws com.itextpdf.io.exceptions.IOException, PdfOcrException
      Performs OCR of all images in an input PDF document and adds recognized text on top of the images.

      By default, this method does not allow OCR of PDF/A documents or tagged documents. The reason is that the resulting document might not comply with the PDF/A specification and the added content might not be tagged, depending on the IOcrEngine implementation. To change this behavior, override validateInputPdfDocument(com.itextpdf.kernel.pdf.PdfDocument) with an empty implementation.

      Note that OcrPdfCreatorProperties.setPageSize(com.itextpdf.kernel.geom.Rectangle), OcrPdfCreatorProperties.setScaleMode(ScaleMode) and OcrPdfCreatorProperties.setImageLayerName(String) have no effect on this method.

      Parameters:
      pdfDoc - PDF document with images to OCR
      Throws:
      com.itextpdf.io.exceptions.IOException - if an image cannot be extracted from a PDF document
      PdfOcrException - in case of any other OCR error
    • makePdfSearchable

      public void makePdfSearchable(com.itextpdf.kernel.pdf.PdfDocument pdfDoc, IOcrProcessProperties ocrProcessProperties) throws com.itextpdf.io.exceptions.IOException, PdfOcrException
      Performs OCR of all images in an input PDF document and adds recognized text on top of the images.

      By default, this method does not allow OCR of PDF/A documents or tagged documents. The reason is that the resulting document might not comply with the PDF/A specification and the added content might not be tagged, depending on the IOcrEngine implementation. To change this behavior, override validateInputPdfDocument(com.itextpdf.kernel.pdf.PdfDocument) with an empty implementation.

      Note that OcrPdfCreatorProperties.setPageSize(com.itextpdf.kernel.geom.Rectangle), OcrPdfCreatorProperties.setScaleMode(ScaleMode) and OcrPdfCreatorProperties.setImageLayerName(String) have no effect on this method.

      Parameters:
      pdfDoc - PDF document with images to OCR
      ocrProcessProperties - extra OCR process properties passed to OcrProcessContext
      Throws:
      com.itextpdf.io.exceptions.IOException - if an image cannot be extracted from a PDF document
      PdfOcrException - in case of any other OCR error
    • validateInputPdfDocument

      protected void validateInputPdfDocument(com.itextpdf.kernel.pdf.PdfDocument pdfDoc)
      Validates the input PDF document.

      It checks that the input document is neither tagged nor PDF/A. If you need to OCR tagged and/or PDF/A documents, override this method with an empty implementation. In that case, it is best to use the makePdfSearchable(PdfDocument, IOcrProcessProperties) overload, because you can pass it a PdfADocument or PdfUADocument instance, which will validate the output document.

      Parameters:
      pdfDoc - a PDF document to check