Getting started with Apache Any23

Apache Any23 can be used:

  • via CLI (command line interface) from your preferred shell environment;
  • as a RESTful Webservice;
  • as a library.

Apache Any23 Modules

Apache Any23 is composed of the following modules:

  • api/ The base API definitions, e.g. the Any23 API.
  • core/ The core library containing all extractor functionality.
  • cli/ A command line interface enabling easy invocation of Any23 tools.
  • csvutils/ Utility code for CSV extractions.
  • encoding/ Character set detection and encoding.
  • mime/ Media-type detection.
  • service/ The REST service.
  • plugins/ The core additional plugins.

Use the Apache Any23 CLI

Command-line tool support is provided by the cli module.

Once Apache Any23 has been correctly installed, you can use it as a command-line tool through the shell scripts in the cli/target/appassembler/bin/ directory. Scripts are provided for both Unix (Linux/OSX) and Windows.

The any23 script provides analysis, documentation, testing and debugging utilities.

Running ./any23 without options prints the usage information:

$ cli/target/appassembler/bin/any23

A command must be specified.
Usage: any23 [options] [command] [command options]
  Options:
    -h, --help
       Display help information.
       Default: false
        --plugins-dir
       The Any23 plugins directory.
       Default: /Users/lmcgibbn/.any23/plugins
    -X, --verbose
       Produce execution verbose output.
       Default: false
    -v, --version
       Display version information.
       Default: false
  Commands:
    extractor      Utility for obtaining documentation about metadata extractors.
      Usage: extractor [options] Extractor name
        Options:
          -a, --all
             shows a report about all available extractors
             Default: false
          -i, --input
             shows example input for the given extractor
             Default: false
          -l, --list
             shows the names of all available extractors
             Default: false
          -o, --outut
             shows example output for the given extractor
             Default: false

    microdata      Commandline Tool for extracting Microdata from file/HTTP source.
      Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/localFile.html}

    mimes      MIME Type Detector Tool.
      Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}

    verify      Utility for plugin management verification.
      Usage: verify [options] plugins-dir

    rover      Any23 Command Line Tool.
      Usage: rover [options] input IRIs {<url>|<file>}+
        Options:
          -d, --defaultns
             Override the default namespace used to produce statements.
          -e, --extractors
             a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle
             Default: []
          -f, --format
             the output format
             Default: json
          -l, --log
             Produce log within a file.
          -n, --nesting
             Disable production of nesting triples.
             Default: false
          -t, --notrivial
             Filter trivial statements (e.g. CSS related ones).
             Default: false
          -o, --output
             Specify Output file (defaults to standard output)
             Default: java.io.PrintStream@5204062d
          -p, --pedantic
             Validate and fixes HTML content detecting commons issues.
             Default: false
          -s, --stats
             Print out extraction statistics.
             Default: false

    vocab      Prints out the RDF Schema of the vocabularies used by Any23.
      Usage: vocab [options]
        Options:
          -f, --format
             Vocabulary output format
             Default: N-Quads (mimeTypes=application/n-quads, text/x-nquads, text/nquads; ext=nq)

The any23 script detects the utilities available within the core and plugins classpath and allows you to invoke them.

The any23-core CLI tools are:

  • extractor: a utility for obtaining useful information about extractors.
  • microdata: a command-line parser that extracts Microdata content from a web page (local or remote) and produces JSON output compliant with the Microdata specification (http://www.w3.org/TR/microdata/).
  • mimes: detects the MIME Type for any HTTP / file / direct input resource.
  • verify: a utility for verifying Apache Any23 plugins.
  • rover: the RDF extraction tool.
  • vocab: dumps all the RDF Schema vocabularies declared within Apache Any23.

The Rover tool

Rover is the main extraction tool. It extracts metadata from local and remote (HTTP) resources and lets you specify a custom list of extractors, the desired output format, and other flags that suppress noise or generate advanced reports.

Extract metadata from an HTML page:

cli$ any23 rover http://yourdomain/yourfile

Extract metadata from a local resource:

cli$ any23 rover myfoaf.rdf

To specify the output format, use the "-f" or "--format" option (the default output format is TURTLE):

cli$ any23 rover -f quad myfoaf.rdf

Filtering trivial statements

By default, Apache Any23 extracts HTML/head meta information, such as links to CSS stylesheets or metadata like the author or the software used to create the HTML. If you are only interested in the structured content of the HTML/body tag, you can filter these statements out with the "-t" command line argument.

cli$ any23 rover -t -f quad myfoaf.rdf
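
When several local files need the same treatment, the rover invocations above can be wrapped in a small shell function. The following is only a sketch: the function name rover_batch, the ANY23 variable and the ".out" suffix for result files are assumptions made for illustration and are not part of Apache Any23. It relies solely on the -t, -f and -o options listed in the rover help output.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Apache Any23.
# ANY23 is an assumed variable naming the launcher script; adjust it to
# e.g. cli/target/appassembler/bin/any23 if the script is not on the PATH.
ANY23="${ANY23:-any23}"

# Usage: rover_batch FORMAT INPUT...
# Runs "any23 rover" once per input and stops at the first failure.
rover_batch() {
    format="$1"
    shift
    for input in "$@"; do
        # -t filters trivial statements, -f selects the output format,
        # -o writes the result to a file instead of standard output.
        "$ANY23" rover -t -f "$format" -o "${input}.out" "$input" || return 1
    done
}
```

For example, "rover_batch quad myfoaf.rdf other.rdf" would run rover twice and leave the results in myfoaf.rdf.out and other.rdf.out.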

The ExtractorDocumentation tool

The ExtractorDocumentation tool returns human-readable information about the registered extractors.

List all the available extractors:

cli$ any23 extractor --list
                      csv [org.apache.any23.extractor.csv.CSVExtractorFactory]
     html-embedded-jsonld [org.apache.any23.extractor.html.EmbeddedJSONLDExtractorFactory]
           html-head-icbm [org.apache.any23.extractor.html.ICBMExtractorFactory]
          html-head-links [org.apache.any23.extractor.html.HeadLinkExtractorFactory]
           html-head-meta [org.apache.any23.extractor.html.HTMLMetaExtractorFactory]
          html-head-title [org.apache.any23.extractor.html.TitleExtractorFactory]
              html-mf-adr [org.apache.any23.extractor.html.AdrExtractorFactory]
              html-mf-geo [org.apache.any23.extractor.html.GeoExtractorFactory]
        html-mf-hcalendar [org.apache.any23.extractor.html.HCalendarExtractorFactory]
            html-mf-hcard [org.apache.any23.extractor.html.HCardExtractorFactory]
         html-mf-hlisting [org.apache.any23.extractor.html.HListingExtractorFactory]
          html-mf-hrecipe [org.apache.any23.extractor.html.HRecipeExtractorFactory]
          html-mf-hresume [org.apache.any23.extractor.html.HResumeExtractorFactory]
          html-mf-hreview [org.apache.any23.extractor.html.HReviewExtractorFactory]
html-mf-hreview-aggregate [org.apache.any23.extractor.html.HReviewAggregateExtractorFactory]
          html-mf-license [org.apache.any23.extractor.html.LicenseExtractorFactory]
          html-mf-species [org.apache.any23.extractor.html.SpeciesExtractorFactory]
              html-mf-xfn [org.apache.any23.extractor.html.XFNExtractorFactory]
           html-microdata [org.apache.any23.extractor.microdata.MicrodataExtractorFactory]
              html-rdfa11 [org.apache.any23.extractor.rdfa.RDFa11ExtractorFactory]
               html-xpath [org.apache.any23.extractor.xpath.XPathExtractorFactory]
               rdf-jsonld [org.apache.any23.extractor.rdf.JSONLDExtractorFactory]
                   rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory]
                   rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory]
                 rdf-trix [org.apache.any23.extractor.rdf.TriXExtractorFactory]
               rdf-turtle [org.apache.any23.extractor.rdf.TurtleExtractorFactory]
                  rdf-xml [org.apache.any23.extractor.rdf.RDFXMLExtractorFactory]
                     yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory]
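
The names in the first column are the identifiers that the rover -e option expects (its help text gives rdf-xml,rdf-turtle as an example). A minimal sketch of turning the listing into such a comma-separated list follows. It assumes the listing is written to standard output in the "name [factory]" layout shown above; extractor_names is a hypothetical helper, not an Any23 command.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Apache Any23.
# Reads "any23 extractor --list" output on standard input and prints the
# names that start with the given prefix as one comma-separated line.
extractor_names() {
    prefix="$1"
    awk -v p="$prefix" '
        # Keep only lines shaped like "name [factory.class]" whose name
        # begins with the requested prefix.
        NF == 2 && $2 ~ /^\[.*\]$/ && index($1, p) == 1 {
            printf "%s%s", sep, $1
            sep = ","
        }
        END { print "" }
    '
}
```

With this helper, restricting an extraction to the RDF extractors could look like:

cli$ any23 rover -e "$(any23 extractor --list | extractor_names rdf-)" myfoaf.rdf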

The MicrodataParser tool

The MicrodataParser tool applies only the MicrodataExtractor to a specific input source and returns the extracted data in the JSON format defined in the JSON section of the Microdata specification.

cli$ any23 microdata http://path/to/resource.html

The VocabPrinter tool

The VocabPrinter tool prints out the RDF Schema of all the vocabularies declared within Apache Any23.

Just launch the command below to see all the managed vocabularies.

cli$ any23 vocab

NOTE: This tool is still in beta version.

The MimeDetector tool

The MimeDetector tool detects the MIME type of a given source (http://, file:// or inline://).

Examples:

cli$ any23 mimes http://www.michelemostarda.com/foaf.rdf
application/rdf+xml
cli$ any23 mimes file://../src/test/resources/application/trix/test1.trx
application/trix
cli$ any23 mimes 'inline://<http://s> <http://p> <http://o> .'
text/n3
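
Because mimes prints the detected type as a single line in the examples above, its result can be used as a guard in scripts. The sketch below assumes that the tool writes nothing but the MIME type to standard output, which may not hold if logging is enabled. The function name has_mime and the ANY23 variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Apache Any23.
# ANY23 is an assumed variable naming the launcher script.
ANY23="${ANY23:-any23}"

# Usage: has_mime SOURCE TYPE...
# Returns 0 if the MIME type detected for SOURCE equals one of the given
# types, 1 if it does not, and 2 if the detection itself failed.
has_mime() {
    src="$1"
    shift
    detected=$("$ANY23" mimes "$src") || return 2
    for wanted in "$@"; do
        [ "$detected" = "$wanted" ] && return 0
    done
    return 1
}
```

A script could then call, for instance, "has_mime file:///path/to/local.file application/rdf+xml application/trix" before deciding whether to pass the file to rover.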

The PluginVerifier tool

The PluginVerifier tool checks the plugins installed in the specified input directory.

Launch the command below to sanity-check the input plugins directory:

cli$ any23 verify [/path/to/plugins/dir]

Apache Any23 CLI Plugins

The Apache Any23 ToolRunner CLI (bin/any23) supports the auto-detection of Tool plugins within the classpath. For further details, see the Plugins section.

The default any23 CLI plugins are listed below.

Crawler Plugin

The Crawler Plugin provides basic site crawling and metadata extraction capabilities.

cli$ any23 -h
[...]
    crawler      Any23 Crawler Command Line Tool.
      Usage: crawler [options] input IRIs {<url>|<file>}+
  Options:
          -d, --defaultns          Override the default namespace used to
                                   produce statements.
          -e, --extractors         a comma-separated list of extractors, e.g.
                                   rdf-xml,rdf-turtle
                                   Default: []
          -f, --format             the output format
                                   Default: turtle
          -l, --log                Produce log within a file.
          -md, --maxdepth          Max allowed crawler depth.
                                   Default: 2147483647
          -mp, --maxpages          Max number of pages before interrupting
                                   crawl.
                                   Default: 2147483647
          -n, --nesting            Disable production of nesting triples.
                                   Default: false
          -t, --notrivial          Filter trivial statements (e.g. CSS related
                                   ones).
                                   Default: false
          -nc, --numcrawlers       Sets the number of crawlers.
                                   Default: 10
          -o, --output             Specify Output file (defaults to standard
                                   output)
                                   Default: java.io.PrintStream@2911a3a4
          -pf, --pagefilter        Regex used to filter out page URLs during
                                   crawling.
                                   Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
          -p, --pedantic           Validate and fixes HTML content detecting
                                   commons issues.
                                   Default: false
          -pd, --politenessdelay   Politeness delay in milliseconds.
                                   Default: 2147483647
          -s, --stats              Print out extraction statistics.
                                   Default: false
          -sf, --storagefolder     Folder used to store crawler temporary data.
                                   Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce

A usage example:

cli$ any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
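
The help output above shows 2147483647 (the largest 32-bit signed integer) as the default for both --maxdepth and --maxpages, so a crawl is effectively unbounded unless limits are given. A minimal sketch of a bounded crawl follows. The function name bounded_crawl, the ANY23 variable and the specific limit values are illustrative assumptions; only flags listed in the crawler help are used.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Apache Any23.
# ANY23 is an assumed variable naming the launcher script.
ANY23="${ANY23:-any23}"

# Usage: bounded_crawl SEED_URL OUTPUT_FILE
bounded_crawl() {
    seed="$1"
    out="$2"
    # -md limits the crawl depth, -mp the number of pages,
    # -pd sets the politeness delay in milliseconds.
    # The values 2, 100 and 1000 are arbitrary examples.
    "$ANY23" crawler -md 2 -mp 100 -pd 1000 -f ntriples -o "$out" "$seed"
}
```

Calling "bounded_crawl http://www.repubblica.it out.nt" would then stop after two levels or one hundred pages, whichever comes first.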

Use Apache Any23 as a RESTful Web Service

Apache Any23 provides a Web Service that can be used to extract RDF from Web documents. Apache Any23 services can be accessed through a RESTful API.

Running the server

The server command line tool is defined within the service module. Run the any23server script

service$ ./bin/any23server

from the command line to start the server, then open the server's address in a web browser to access the web interface. A live demo of the service is also available online. You can also start the server from Java by running the Apache Any23 Servlet class. Maven can be used to create a WAR file for deployment into an existing servlet container such as Apache Tomcat.

Use Apache Any23 as a Library

See our Developers guide for more details.