Getting started with Apache Any23

Apache Any23 can be used:

  • via CLI (command line interface) from your preferred shell environment;
  • as a RESTful Webservice;
  • as a library.

Apache Any23 Modules

Apache Any23 is composed of the following modules:

  • api/ Any23 library external API.
  • core/ The library core codebase.
  • csvutils/ A CSV-specific package.
  • encoding/ Encoding detection library.
  • mime/ MIME type detection library.
  • nquads/ NQuads parsing and serialization library.
  • plugins/ Library plugins codebase (read plugins/README.txt for further details).
  • service/ The library HTTP service codebase.
  • src Packaging of Any23 artifacts.
  • test-resources/ Material relating to Any23 JUnit test cases.
  • RELEASE-NOTES.txt File reporting main release notes for every version.
  • LICENSE.txt Applicable Apache Software v2.0 project license.
  • README.txt The go-to resource for new users/developers.

Use the Apache Any23 CLI

Command-line tool support is provided by the any23-core module.

Once Apache Any23 has been correctly installed, you can use it as a command line tool through the shell scripts in the any23-core/bin directory. Scripts are provided for both Unix (Linux/OSX) and Windows.

The any23 script provides analysis, documentation, testing and debugging utilities.

Running ./bin/any23 without options shows the usage information:

any23-core$ ./bin/any23
A command must be specified.
Usage: any23 [options] [command] [command options]
    -h, --help          Display help information.
                        Default: false
        --plugins-dir   The Any23 plugins directory.
                        Default: ~/.any23/plugins
    -X, --verbose       Produce execution verbose output.
                        Default: false
    -v, --version       Display version information.
                        Default: false
    extractor      Utility for obtaining documentation about metadata extractors.
      Usage: extractor [options] Extractor name      
          -a, --all     shows a report about all available extractors
                        Default: false
          -i, --input   shows example input for the given extractor
                        Default: false
          -l, --list    shows the names of all available extractors
                        Default: false
          -o, --outut   shows example output for the given extractor
                        Default: false

    microdata      Commandline Tool for extracting Microdata from file/HTTP source.
      Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/local.file}
    mimes      MIME Type Detector Tool.
      Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content}
    verify      Utility for plugin management verification.
      Usage: verify [options] plugins-dir
    rover      Any23 Command Line Tool.
      Usage: rover [options] input URIs {<url>|<file>}+
          -d, --defaultns    Override the default namespace used to produce
          -e, --extractors   a comma-separated list of extractors, e.g.
                             Default: []
          -f, --format       the output format
                             Default: turtle
          -l, --log          Produce log within a file.
          -n, --nesting      Disable production of nesting triples.
                             Default: false
          -t, --notrivial    Filter trivial statements (e.g. CSS related ones).
                             Default: false
          -o, --output       Specify Output file (defaults to standard output)
          -p, --pedantic     Validate and fixes HTML content detecting commons
                             Default: false
          -s, --stats        Print out extraction statistics.
                             Default: false

    vocab      Prints out the RDF Schema of the vocabularies used by Any23.
      Usage: vocab [options]      
          -f, --format   Vocabulary output format
                         Default: NQuads

The any23 script detects the utilities available within the any23-core and plugins classpath and lets you run them.

The any23-core CLI tools are:

  • extractor: a utility for obtaining useful information about extractors.
  • microdata: command-line parser that extracts Microdata content from a web page (local or remote) and produces JSON output compliant with the Microdata specification.
  • mimes: detects the MIME Type for any HTTP / file / direct input resource.
  • verify: a utility for verifying Apache Any23 plugins.
  • rover: the RDF extraction tool.
  • vocab: dumps all the RDFSchema vocabularies declared within Apache Any23.

The Rover tool

Rover is the main extraction tool. It extracts metadata from local and remote (HTTP) resources and lets you specify a custom list of extractors, the desired output format, and other flags that suppress noise or generate advanced reports.

Extract metadata from an HTML page:

any23-core$ ./bin/any23 rover http://yourdomain/yourfile

Extract metadata from a local resource:

any23-core$ ./bin/any23 rover myfoaf.rdf

To specify the output format, use the "-f" or "--format" option (the default output format is TURTLE):

any23-core$ ./bin/any23 rover -f quad myfoaf.rdf

Filtering trivial statements

By default, Apache Any23 extracts HTML/head meta information, such as links to CSS stylesheets or metadata like the author or the software used to create the HTML. If you are only interested in the structured content of the HTML/body tag, the "-t" command line argument activates a filter that drops these trivial statements.

any23-core$ ./bin/any23 rover -t -f quad myfoaf.rdf
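When the same rover options are applied to several inputs, a small shell wrapper can save typing. The sketch below is only an illustration and is not part of the Any23 distribution: the function name rover_to_file and the ANY23 variable are assumptions introduced here, while the -t, -f and -o options are the ones listed in the usage output above.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Any23.
# ANY23 is assumed to point at the launcher script described above.
ANY23="${ANY23:-./bin/any23}"

# rover_to_file INPUT OUTPUT [FORMAT]
# Runs rover on one input with trivial statements filtered (-t) and
# writes the result to OUTPUT (-o) in FORMAT (-f, default: turtle).
rover_to_file() {
    input=$1
    output=$2
    format=${3:-turtle}
    "$ANY23" rover -t -f "$format" -o "$output" "$input"
}
```

With the function loaded into the current shell, "rover_to_file myfoaf.rdf myfoaf.nq quad" would run rover with the arguments -t -f quad -o myfoaf.nq myfoaf.rdf.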

The ExtractorDocumentation tool

The ExtractorDocumentation tool returns human-readable information about the registered extractors.

List all the available extractors:

any23-core/core$ ./bin/any23 extractor --list
                      csv [class org.apache.any23.extractor.csv.CSVExtractor]
           html-head-icbm [class org.apache.any23.extractor.html.ICBMExtractor]
          html-head-links [class org.apache.any23.extractor.html.HeadLinkExtractor]
          html-head-title [class org.apache.any23.extractor.html.TitleExtractor]
              html-mf-adr [class org.apache.any23.extractor.html.AdrExtractor]
              html-mf-geo [class org.apache.any23.extractor.html.GeoExtractor]
        html-mf-hcalendar [class org.apache.any23.extractor.html.HCalendarExtractor]
            html-mf-hcard [class org.apache.any23.extractor.html.HCardExtractor]
         html-mf-hlisting [class org.apache.any23.extractor.html.HListingExtractor]
          html-mf-hrecipe [class org.apache.any23.extractor.html.HRecipeExtractor]
          html-mf-hresume [class org.apache.any23.extractor.html.HResumeExtractor]
          html-mf-hreview [class org.apache.any23.extractor.html.HReviewExtractor]
          html-mf-license [class org.apache.any23.extractor.html.LicenseExtractor]
          html-mf-species [class org.apache.any23.extractor.html.SpeciesExtractor]
              html-mf-xfn [class org.apache.any23.extractor.html.XFNExtractor]
           html-microdata [class org.apache.any23.extractor.microdata.MicrodataExtractor]
              html-rdfa11 [class org.apache.any23.extractor.rdfa.RDFa11Extractor]
       html-script-turtle [class org.apache.any23.extractor.html.TurtleHTMLExtractor]
                   rdf-nq [class org.apache.any23.extractor.rdf.NQuadsExtractor]
                   rdf-nt [class org.apache.any23.extractor.rdf.NTriplesExtractor]
                 rdf-trix [class org.apache.any23.extractor.rdf.TriXExtractor]
               rdf-turtle [class org.apache.any23.extractor.rdf.TurtleExtractor]
                  rdf-xml [class org.apache.any23.extractor.rdf.RDFXMLExtractor]
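The extractor names in this listing are the values that rover accepts in its comma-separated "-e" option. As a rough sketch, the listing can be reduced to a list of names with standard tools. The function names extractor_names and join_commas are hypothetical, and the sketch assumes the listing is written to standard output in the "name [class ...]" layout shown above.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with Any23.

# extractor_names: reads `extractor --list` output on stdin and prints
# one extractor name per line. Only lines shaped like
# "name [class some.Class]" are kept; anything else is ignored.
extractor_names() {
    awk 'NF >= 2 && $2 == "[class" { print $1 }'
}

# join_commas: joins the lines read from stdin into one comma-separated
# string, the form expected by the rover "-e" option.
join_commas() {
    paste -s -d, -
}
```

For instance, to run rover with only the microformat extractors:

any23-core$ ./bin/any23 rover -e "$(./bin/any23 extractor --list | extractor_names | grep '^html-mf-' | join_commas)" http://yourdomain/yourfile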

The MicrodataParser tool

The MicrodataParser tool applies only the MicrodataExtractor to a specific input source and returns the extracted data in the JSON format defined in the JSON section of the Microdata specification.

any23-core/core$ ./bin/any23 microdata http://path/to/resource.html

The VocabPrinter tool

The VocabPrinter tool prints out the RDFSchema of all the vocabularies declared within Apache Any23.

Just launch the command below to see all the managed vocabularies.

any23-core/core$ ./bin/any23 vocab

NOTE: This tool is still in beta version.

The MimeDetector tool

The MimeDetector tool detects the MIME type of a given source (http://, file:// or inline://).


any23-core$ ./bin/any23 mimes
any23-core$ ./bin/any23 mimes file://../src/test/resources/application/trix/test1.trx
any23-core$ ./bin/any23 mimes 'inline://<http://s> <http://p> <http://o> .'
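The three source schemes differ only in how the argument is built, which the following sketch makes explicit. The function names mime_of_inline and mime_of_file and the ANY23 variable are assumptions made for illustration. The file variant turns a local path into an absolute file:/// URL, the form shown in the mimes usage text.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with Any23.
# ANY23 is assumed to point at the launcher script described above.
ANY23="${ANY23:-./bin/any23}"

# mime_of_inline CONTENT
# Passes a literal string to the detector through the inline:// scheme.
mime_of_inline() {
    "$ANY23" mimes "inline://$1"
}

# mime_of_file PATH
# Resolves PATH to an absolute path and passes it as a file:// URL.
# Fails if the containing directory does not exist.
mime_of_file() {
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    "$ANY23" mimes "file://$dir/$(basename "$1")"
}
```

Quoting the content, as in mime_of_inline '<http://s> <http://p> <http://o> .', keeps the shell from interpreting the angle brackets as redirections.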

The PluginVerifier tool

The PluginVerifier tool checks the plugins installed in the specified input directory.

Launch the command below to sanity-check the input plugins directory:

any23-core$ ./bin/any23 verify [/path/to/plugins/dir]
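In scripts it can help to check that the directory exists before handing it to the verifier. The following is a minimal sketch under stated assumptions: verify_plugins and ANY23 are hypothetical names, and the fallback directory is the --plugins-dir default printed in the usage output above.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Any23.
# ANY23 is assumed to point at the launcher script described above.
ANY23="${ANY23:-./bin/any23}"

# verify_plugins [DIR]
# Runs the verify tool on DIR, falling back to ~/.any23/plugins.
# Returns 1 without calling Any23 when DIR is not a directory.
verify_plugins() {
    dir=${1:-"$HOME/.any23/plugins"}
    if [ ! -d "$dir" ]; then
        echo "plugins directory not found: $dir" >&2
        return 1
    fi
    "$ANY23" verify "$dir"
}
```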

Apache Any23 CLI Plugins

The Apache Any23 ToolRunner CLI (bin/any23tools) supports the auto-detection of Tool plugins within the classpath. For further details see the Plugins section.

The default any23 CLI plugins are listed below.

Crawler Plugin

crawler-tool: the Crawler Plugin provides basic site crawling and metadata extraction capabilities.

any23-core$ ./bin/any23 -h
    crawler      Any23 Crawler Command Line Tool.
      Usage: crawler [options] input URIs {<url>|<file>}+
          -d, --defaultns          Override the default namespace used to
                                   produce statements.
          -e, --extractors         a comma-separated list of extractors, e.g.
                                   Default: []
          -f, --format             the output format
                                   Default: turtle
          -l, --log                Produce log within a file.
          -md, --maxdepth          Max allowed crawler depth.
                                   Default: 2147483647
          -mp, --maxpages          Max number of pages before interrupting
                                   Default: 2147483647
          -n, --nesting            Disable production of nesting triples.
                                   Default: false
          -t, --notrivial          Filter trivial statements (e.g. CSS related
                                   Default: false
          -nc, --numcrawlers       Sets the number of crawlers.
                                   Default: 10
          -o, --output             Specify Output file (defaults to standard
          -pf, --pagefilter        Regex used to filter out page URLs during
                                   Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
          -p, --pedantic           Validate and fixes HTML content detecting
                                   commons issues.
                                   Default: false
          -pd, --politenessdelay   Politeness delay in milliseconds.
                                   Default: 2147483647
          -s, --stats              Print out extraction statistics.
                                   Default: false
          -sf, --storagefolder     Folder used to store crawler temporary data.
                                   Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce

A usage example, where <url> stands for the seed URL to crawl:

any23-core$ ./bin/any23 crawler -s -f ntriples <url> 1> out.nt 2> repubblica.log
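Because the depth and page limits default to 2147483647, an unattended crawl is effectively unbounded unless limits are set. The sketch below shows one way to cap a crawl with the -md, -mp and -pd options from the usage output above. The function name crawl_site, the ANY23 variable and the limit values (depth 2, 100 pages, 1000 ms delay) are illustrative assumptions, not recommended settings.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Any23.
# ANY23 is assumed to point at the launcher script described above.
ANY23="${ANY23:-./bin/any23}"

# Illustrative limits; tune them for the site being crawled.
MAX_DEPTH=2      # -md: max allowed crawler depth
MAX_PAGES=100    # -mp: max number of pages
DELAY_MS=1000    # -pd: politeness delay in milliseconds

# crawl_site SEED_URL OUT_FILE LOG_FILE
# Crawls SEED_URL with the limits above, writing N-Triples to OUT_FILE
# (standard output) and statistics and messages to LOG_FILE (standard error).
crawl_site() {
    "$ANY23" crawler -s -f ntriples \
        -md "$MAX_DEPTH" -mp "$MAX_PAGES" -pd "$DELAY_MS" \
        "$1" 1> "$2" 2> "$3"
}
```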

Use Apache Any23 as a RESTful Web Service

Apache Any23 provides a Web Service that can be used to extract RDF from Web documents. Apache Any23 services can be accessed through a RESTful API.

Running the server

The server command line tool is defined within the any23-service module. Run the any23server script

any23-service$ ./bin/any23server

from the command line to start the server, then open the web interface in a browser. A live demo version of the service is also running online. You can also start the server from Java by running the Apache Any23 Servlet class. Maven can be used to create a WAR file for deployment into an existing servlet container such as Apache Tomcat.

Use Apache Any23 as a Library

See our Developers guide for more details.