Class Any23


  • public class Any23
    extends Object
    A facade with convenience methods for typical Any23 extraction operations.
    Author:
    Richard Cyganiak (richard@cyganiak.de), Michele Mostarda (michele.mostarda@gmail.com)
    • Field Detail

      • VERSION

        public static final String VERSION
        Any23 core library version. NOTE: pom.xml also contains a version string; the two should match.
      • DEFAULT_HTTP_CLIENT_USER_AGENT

        public static final String DEFAULT_HTTP_CLIENT_USER_AGENT
        Default HTTP User Agent defined in default configuration.
      • logger

        protected static final org.slf4j.Logger logger
    • Constructor Detail

      • Any23

        public Any23​(Configuration configuration,
                     ExtractorGroup extractorGroup)
        Constructor that allows the specification of a custom configuration and of a list of extractors.
        Parameters:
        configuration - configuration used to build the Any23 instance.
        extractorGroup - the group of extractors to be applied.
      • Any23

        public Any23​(ExtractorGroup extractorGroup)
        Constructor that allows the specification of a list of extractors.
        Parameters:
        extractorGroup - the group of extractors to be applied.
      • Any23

        public Any23​(Configuration configuration,
                     String... extractorNames)
        Constructor that allows the specification of a custom configuration and of a list of extractor names.
        Parameters:
        configuration - a Configuration object.
        extractorNames - list of extractor names.
      • Any23

        public Any23​(String... extractorNames)
        Constructor that allows the specification of a list of extractor names.
        Parameters:
        extractorNames - list of extractor names.
      • Any23

        public Any23()
        Constructor with default configuration.
    • Method Detail

      • setHTTPUserAgent

        public void setHTTPUserAgent​(String userAgent)
        Sets the HTTP User-Agent header; see RFC 2616, section 14.43.
        Parameters:
        userAgent - text describing the user agent.
      • getHTTPUserAgent

        public String getHTTPUserAgent()
        Returns the HTTP User-Agent header; see RFC 2616, section 14.43.
        Returns:
        text describing the user agent.
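To make concrete what the user agent string ends up as, the following minimal sketch sets a User-Agent request header using only the JDK's HttpURLConnection. It does not use Any23 or its HTTPClient; the class name UserAgentHeaderSketch, the method userAgentSentBy, and the example.org URL are assumptions made for illustration. No network connection is opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical illustration class; not part of Any23.
class UserAgentHeaderSketch {

    // Prepares (but never sends) a request and returns the User-Agent
    // header value that the request would carry.
    static String userAgentSentBy(String userAgent) {
        try {
            HttpURLConnection connection = (HttpURLConnection)
                    URI.create("http://example.org/").toURL().openConnection();
            // Request properties can only be changed before connecting.
            connection.setRequestProperty("User-Agent", userAgent);
            return connection.getRequestProperty("User-Agent");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(userAgentSentBy("my-crawler/1.0"));
    }
}
```

With Any23 itself, the equivalent step would presumably be a call to setHTTPUserAgent before the first document is fetched.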
      • setHTTPClient

        public void setHTTPClient​(HTTPClient httpClient)
        Sets the HTTPClient implementation used to retrieve content. The default instance is DefaultHTTPClient.
        Parameters:
        httpClient - a valid client instance.
        Throws:
        IllegalStateException - if invoked after the client has been initialized.
      • getHTTPClient

        public HTTPClient getHTTPClient()
                                 throws IOException
        Returns the current HTTPClient implementation.
        Returns:
        instance of HTTPClient.
        Throws:
        IOException - if the HTTP client fails to initialize.
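The contract described for setHTTPClient and getHTTPClient (a client may be replaced only until it has been initialized) can be illustrated with a small stand-alone sketch. This is not Any23's implementation: LazyClientHolder, its nested Client interface, and the "initialized on first get" rule are assumptions chosen to show the pattern with plain Java.

```java
// Hypothetical illustration class; not part of Any23.
class LazyClientHolder {

    // Stand-in for an HTTP client; not the Any23 HTTPClient interface.
    interface Client {
        String describe();
    }

    private Client client = () -> "default client";
    private boolean initialized = false;

    // Replaces the client, but only while it has not been handed out yet.
    void setClient(Client newClient) {
        if (initialized) {
            throw new IllegalStateException("client already initialized");
        }
        this.client = newClient;
    }

    // The first call marks the client as initialized; later calls to
    // setClient are rejected.
    Client getClient() {
        initialized = true;
        return client;
    }

    public static void main(String[] args) {
        LazyClientHolder holder = new LazyClientHolder();
        holder.setClient(() -> "custom client");
        System.out.println(holder.getClient().describe());
        try {
            holder.setClient(() -> "too late");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The practical consequence for callers is ordering: configure the client first, then start retrieving documents.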
      • setCacheFactory

        public void setCacheFactory​(LocalCopyFactory cache)
        Sets the LocalCopyFactory instance.
        Parameters:
        cache - a valid cache instance.
      • setMIMETypeDetector

        public void setMIMETypeDetector​(MIMETypeDetector detector)
        Sets the MIMETypeDetector instance.
        Parameters:
        detector - a valid detector instance; if null, all the detectors will be used.
      • createDocumentSource

        public DocumentSource createDocumentSource​(String documentIRI)
                                            throws URISyntaxException,
                                                   IOException

        Returns the most appropriate DocumentSource for the given documentIRI.

        N.B. A documentIRI should include a protocol, e.g. http:, https:, file:

        Parameters:
        documentIRI - the document IRI.
        Returns:
        a new instance of DocumentSource.
        Throws:
        URISyntaxException - if an error occurs while parsing the documentIRI as an IRI.
        IOException - if an error occurs while initializing the internal HTTPClient.
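The note that a documentIRI should include a protocol can be checked up front. The sketch below uses java.net.URI from the JDK to test whether a string parses and carries a scheme; the class name DocumentIriCheck and the method hasScheme are hypothetical, and the check is only an approximation of whatever validation Any23 performs internally.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration class; not part of Any23.
class DocumentIriCheck {

    // Returns true when the string parses as a URI and has a scheme
    // such as http, https, or file; returns false otherwise.
    static boolean hasScheme(String documentIRI) {
        try {
            return new URI(documentIRI).getScheme() != null;
        } catch (URISyntaxException e) {
            // Not parseable at all, so it cannot carry a usable scheme.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasScheme("http://example.org/page.html"));
        System.out.println(hasScheme("file:/tmp/page.html"));
        System.out.println(hasScheme("example.org/page.html"));
    }
}
```

A string such as example.org/page.html parses as a relative reference without a scheme, which is the case the note warns about.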
      • extract

        public ExtractionReport extract​(String in,
                                        String documentIRI,
                                        String contentType,
                                        String encoding,
                                        TripleHandler outputHandler)
                                 throws IOException,
                                        ExtractionException
        Performs metadata extraction on the string in, which is associated with the IRI documentIRI and has the declared contentType and encoding. The generated events are sent to the specified outputHandler.
        Parameters:
        in - raw data to be analyzed.
        documentIRI - IRI from which the raw data has been extracted.
        contentType - declared data content type.
        encoding - declared data encoding.
        outputHandler - handler responsible for collecting the extracted metadata.
        Returns:
        an ExtractionReport describing the extraction that was performed.
        Throws:
        IOException - if there is an error reading the DocumentSource
        ExtractionException - if there is an error during extraction
      • extract

        public ExtractionReport extract​(String in,
                                        String documentIRI,
                                        TripleHandler outputHandler)
                                 throws IOException,
                                        ExtractionException
        Performs metadata extraction on the string in, which is associated with the IRI documentIRI, sending the generated events to the specified outputHandler.
        Parameters:
        in - raw data to be analyzed.
        documentIRI - IRI from which the raw data has been extracted.
        outputHandler - handler responsible for collecting the extracted metadata.
        Returns:
        an ExtractionReport describing the extraction that was performed.
        Throws:
        IOException - if there is an error reading the DocumentSource
        ExtractionException - if there is an error during extraction
      • extract

        public ExtractionReport extract​(File file,
                                        TripleHandler outputHandler)
                                 throws IOException,
                                        ExtractionException
        Performs metadata extraction from the content of the given file, sending the generated events to the specified outputHandler.
        Parameters:
        file - file containing raw data.
        outputHandler - handler responsible for collecting the extracted metadata.
        Returns:
        an ExtractionReport describing the extraction that was performed.
        Throws:
        IOException - if there is an error reading the DocumentSource
        ExtractionException - if there is an error during extraction
      • extract

        public ExtractionReport extract​(ExtractionParameters eps,
                                        String documentIRI,
                                        TripleHandler outputHandler)
                                 throws IOException,
                                        ExtractionException
        Performs metadata extraction from the content of the given documentIRI, sending the generated events to the specified outputHandler. If the request for the IRI is answered with a redirect, the redirect is followed.
        Parameters:
        eps - the parameters to be applied to the extraction.
        documentIRI - the IRI from which to retrieve the document.
        outputHandler - handler responsible for collecting the extracted metadata.
        Returns:
        an ExtractionReport describing the extraction that was performed.
        Throws:
        IOException - if there is an error reading the DocumentSource
        ExtractionException - if there is an error during extraction
      • extract

        public ExtractionReport extract​(String documentIRI,
                                        TripleHandler outputHandler)
                                 throws IOException,
                                        ExtractionException
        Performs metadata extraction from the content of the given documentIRI, sending the generated events to the specified outputHandler. If the request for the IRI is answered with a redirect, the redirect is followed.
        Parameters:
        documentIRI - the IRI from which to retrieve the document.
        outputHandler - handler responsible for collecting the extracted metadata.
        Returns:
        an ExtractionReport describing the extraction that was performed.
        Throws:
        IOException - if there is an error reading the DocumentSource
        ExtractionException - if there is an error during extraction
      • extract

        public ExtractionReport extract​(DocumentSource in,
                                        TripleHandler outputHandler)
                                 throws IOException,
                                        ExtractionException
        Performs metadata extraction from the content of the given document source in, sending the generated events to the specified outputHandler.
        Parameters:
        in - the input document source.
        outputHandler - handler responsible for collecting the extracted metadata.
        Returns:
        an ExtractionReport describing the extraction that was performed.
        Throws:
        IOException - if there is an error reading the DocumentSource
        ExtractionException - if there is an error during extraction