Class SingleDocumentExtraction


  • public class SingleDocumentExtraction
    extends Object
    This class acts as a facade through which all extractors applicable to a given MIME type can be run on a single document. Extractors are filtered automatically by MIME type.
    • Constructor Detail

      • SingleDocumentExtraction

        public SingleDocumentExtraction(Configuration configuration,
                                        DocumentSource in,
                                        ExtractorGroup extractors,
                                        TripleHandler output)
        Builds an extraction from the given configuration, document source, extractor group and output triple handler.
        Parameters:
        configuration - configuration applied during extraction.
        in - input document source.
        extractors - group of extractors to be applied.
        output - output triple handler.
      • SingleDocumentExtraction

        public SingleDocumentExtraction(Configuration configuration,
                                        DocumentSource in,
                                        ExtractorFactory<?> factory,
                                        TripleHandler output)
        Builds an extraction from the given configuration, document source, extractor factory and output triple handler.
        Parameters:
        configuration - configuration applied during extraction.
        in - input document source.
        factory - the extractors factory.
        output - output triple handler.
      • SingleDocumentExtraction

        public SingleDocumentExtraction(DocumentSource in,
                                        ExtractorFactory<?> factory,
                                        TripleHandler output)
        Builds an extraction from the given document source, extractor factory and output triple handler, using the DefaultConfiguration.
        Parameters:
        in - input document source.
        factory - the extractors factory.
        output - output triple handler.
    • Method Detail

      • setLocalCopyFactory

        public void setLocalCopyFactory(LocalCopyFactory copyFactory)
        Sets the internal factory used to generate the local copy of the document. If null, the MemCopyFactory is used.
        Parameters:
        copyFactory - local copy factory.
        See Also:
        DocumentSource
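
        The null fallback described above can be sketched as follows. This is only an illustration under stated assumptions: Source, CopyFactory, IN_MEMORY and localCopy are hypothetical stand-ins written for this example, not the Any23 types, and the idea that an in-memory copy exists so the content can be read more than once is an assumption about the purpose of the default factory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins, not the Any23 types.
public class LocalCopySketch {

    /** Source stand-in; each call to open() may hit the network or disk. */
    interface Source {
        InputStream open();
    }

    /** Copy factory stand-in. */
    interface CopyFactory {
        Source createLocalCopy(Source in);
    }

    /** In-memory default: reads the source once, then serves the bytes from memory. */
    static final CopyFactory IN_MEMORY = in -> {
        final byte[] bytes = readFully(in);
        return () -> new ByteArrayInputStream(bytes);
    };

    /** Applies the given factory, falling back to the in-memory one when it is null. */
    static Source localCopy(CopyFactory factory, Source in) {
        CopyFactory effective = (factory != null) ? factory : IN_MEMORY;
        return effective.createLocalCopy(in);
    }

    /** Reads the whole source, wrapping checked I/O errors. */
    static byte[] readFully(Source in) {
        try (InputStream stream = in.open()) {
            return stream.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole source as UTF-8 text. */
    static String readText(Source in) {
        return new String(readFully(in), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] opens = {0};
        // Counts how often the original source is opened.
        Source remote = () -> {
            opens[0]++;
            return new ByteArrayInputStream("<p>hi</p>".getBytes(StandardCharsets.UTF_8));
        };
        Source copy = localCopy(null, remote); // null selects the in-memory default
        System.out.println(readText(copy));
        System.out.println(readText(copy));
        System.out.println("original opened " + opens[0] + " time(s)");
    }
}
```

        In this sketch the copy can be read repeatedly while the original source is opened only once, which is the property several extractors working on the same document would rely on.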
      • setMIMETypeDetector

        public void setMIMETypeDetector(MIMETypeDetector detector)
        Sets the internal MIME type detector. If null, MIME type detection is skipped and all extractors are activated.
        Parameters:
        detector - detector instance.
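
        The filtering rule described above (extractors selected by detected MIME type, or all of them when no detector is set) can be modelled with a small self-contained sketch. Everything in it is hypothetical and written for this example: SketchExtractor, activated, toyDetect and the extractor names are not part of Any23, and the toy detector is far simpler than a real one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins, not the Any23 types: they only model the
// "null detector activates everything" rule described above.
public class MimeFilterSketch {

    /** Extractor stand-in: a name plus the MIME types it accepts. */
    static final class SketchExtractor {
        final String name;
        final List<String> mimeTypes;

        SketchExtractor(String name, String... mimeTypes) {
            this.name = name;
            this.mimeTypes = Arrays.asList(mimeTypes);
        }
    }

    /**
     * Returns the names of the extractors to activate for the content.
     * A null detector skips detection and activates every extractor.
     */
    static List<String> activated(List<SketchExtractor> all,
                                  Function<String, String> detector,
                                  String content) {
        List<String> names = new ArrayList<>();
        String detected = (detector == null) ? null : detector.apply(content);
        for (SketchExtractor e : all) {
            if (detector == null || e.mimeTypes.contains(detected)) {
                names.add(e.name);
            }
        }
        return names;
    }

    /** Toy detector (assumption): markup starts with '<', anything else is Turtle. */
    static String toyDetect(String content) {
        return content.startsWith("<") ? "text/html" : "text/turtle";
    }

    /** Fixed extractor set used by the demo; the names are illustrative. */
    static List<SketchExtractor> demoExtractors() {
        return Arrays.asList(
            new SketchExtractor("html-sketch", "text/html", "application/xhtml+xml"),
            new SketchExtractor("turtle-sketch", "text/turtle"),
            new SketchExtractor("csv-sketch", "text/csv"));
    }

    public static void main(String[] args) {
        List<SketchExtractor> all = demoExtractors();
        // With a detector, only the extractor accepting text/html is selected.
        System.out.println(activated(all, MimeFilterSketch::toyDetect, "<html></html>"));
        // Without a detector, every extractor is selected.
        System.out.println(activated(all, null, "<html></html>"));
    }
}
```

        Passing null trades precision for coverage: every extractor gets a chance to run, including those that cannot handle the content.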
      • getDetectedMIMEType

        public String getDetectedMIMEType()
                                   throws IOException
        Returns the MIME type detected for the given DocumentSource.
        Returns:
        string containing the detected MIME type.
        Throws:
        IOException - if an error occurred while accessing the data.
      • hasMatchingExtractors

        public boolean hasMatchingExtractors()
                                      throws IOException
        Checks whether the content of the given DocumentSource activates at least one extractor.
        Returns:
        true if at least one extractor is activated, false otherwise.
        Throws:
        IOException - if there is an error locating matching extractors.
      • getMatchingExtractors

        public List<Extractor> getMatchingExtractors()
        Returns:
        the list of all the activated extractors for the given DocumentSource.
      • getParserEncoding

        public String getParserEncoding()
        Returns:
        the configured parsing encoding.
      • setParserEncoding

        public void setParserEncoding(String encoding)
        Sets the document parser encoding.
        Parameters:
        encoding - parser encoding.
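
        To illustrate why the parser encoding matters, the sketch below decodes the same bytes under two charsets using only the Java standard library. ParserEncodingSketch and its decode method are hypothetical and written for this example. The UTF-8 fallback used when no encoding is configured is purely an assumption of the sketch; the real class may behave differently, for example by detecting the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in, not the Any23 class.
public class ParserEncodingSketch {

    private String parserEncoding; // null means "not configured"

    public String getParserEncoding() {
        return parserEncoding;
    }

    public void setParserEncoding(String encoding) {
        this.parserEncoding = encoding;
    }

    /**
     * Decodes the content with the configured encoding.
     * Falls back to UTF-8 when none is set (assumption of this sketch).
     */
    String decode(byte[] content) {
        Charset charset = (parserEncoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(parserEncoding);
        return new String(content, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1: the last byte is not valid UTF-8 on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        ParserEncodingSketch sketch = new ParserEncodingSketch();
        sketch.setParserEncoding("ISO-8859-1");
        System.out.println("matching encoding round-trips: "
                + sketch.decode(latin1).equals("caf\u00e9"));

        sketch.setParserEncoding("UTF-8");
        System.out.println("mismatched encoding round-trips: "
                + sketch.decode(latin1).equals("caf\u00e9"));
    }
}
```

        Decoding with a charset that does not match the document silently corrupts non-ASCII characters, so an explicit encoding is worth setting when the source is known.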