Getting started with Apache Any23
Apache Any23 can be used:
- via CLI (command line interface) from your preferred shell environment;
- as a RESTful Webservice;
- as a library.
Apache Any23 Modules
Apache Any23 is composed of the following modules:
api/
The base API definitions e.g. The Any23 API.core/
The core library containing all extractor functionality.cli/
A command line interface enabling easy invocation of Any23 tools.csvutils/
Utility code for CSV extractions.encoding/
Characterset detection and encoding.mime/
Media-type detection.service/
The REST service.plugins/
The core additional plugins.openie/
Additional extractor logic for the Open Information Extraction (Open IE) system.
Use the Apache Any23 CLI
The command-line tools support is provided by the cli module.
Once Apache Any23 has been correctly installed, if you want to use it as a command line tool, use the shell script within the cli/target/appassembler/bin/
directory. These are provided both for Unix (Linux/OSX) and Windows.
The any23
script provides analysis, documentation, testing and debugging utilities.
Simply running ./any23 without options will show the usage options.
$ cli/target/appassembler/bin/any23 A command must be specified. Usage: any23 [options] [command] [command options] Options: -h, --help Display help information. Default: false --plugins-dir The Any23 plugins directory. Default: /Users/lmcgibbn/.any23/plugins -X, --verbose Produce execution verbose output. Default: false -v, --version Display version information. Default: false Commands: extractor Utility for obtaining documentation about metadata extractors. Usage: extractor [options] Extractor name Options: -a, --all shows a report about all available extractors Default: false -i, --input shows example input for the given extractor Default: false -l, --list shows the names of all available extractors Default: false -o, --outut shows example output for the given extractor Default: false microdata Commandline Tool for extracting Microdata from file/HTTP source. Usage: microdata [options] Input document URL, {http://path/to/resource.html|file:/path/to/localFile.html} mimes MIME Type Detector Tool. Usage: mimes [options] Input document URL, {http://path/to/resource.html|file:///path/to/local.file|inline:// some inline content} verify Utility for plugin management verification. Usage: verify [options] plugins-dir rover Any23 Command Line Tool. Usage: rover [options] input IRIs {<url>|<file>}+ Options: -d, --defaultns Override the default namespace used to produce statements. -e, --extractors a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle Default: [] -f, --format the output format Default: json -l, --log Produce log within a file. -n, --nesting Disable production of nesting triples. Default: false -t, --notrivial Filter trivial statements (e.g. CSS related ones). Default: false -o, --output Specify Output file (defaults to standard output) Default: java.io.PrintStream@5204062d -p, --pedantic Validate and fixes HTML content detecting commons issues. Default: false -s, --stats Print out extraction statistics. Default: false vocab Prints out the RDF Schema of the vocabularies used by Any23. Usage: vocab [options] Options: -f, --format Vocabulary output format Default: N-Quads (mimeTypes=application/n-quads, text/x-nquads, text/nquads; ext=nq)
The any23
script detects a list of available utilities within the core and plugins classpath and allows to activate them.
The any23-core CLI tools are:
extractor
: a utility for obtaining useful information about extractors.microdata
: commandline parser to extract specific Microdata content from a web page (local or remote) and produce a JSON output compliant with the Microdata specification (http://www.w3.org/TR/microdata/).mimes
: detects the MIME Type for any HTTP / file / direct input resource.verify
: a utility for verifying Apache Any23 plugins.rover
: the RDF extraction tool.vocab
: allows to dump all the RDFSchema vocabularies declared within Apache Any23.
The Rover tool
Rover is the main extraction tool. It allows to extract metadata from local and remote (HTTP) resources, specify a custom list of extractors, specify the desired output format and other flags to suppress noise and generate advanced reports.
Extract metadata from an HTML page:
cli$ any23 rover http://yourdomain/yourfile
Extract metadata from a local resource:
cli$ any23 rover myfoaf.rdf
Specify the output format, use the option "-f" or "--format": (Default output format is TURTLE).
cli$ any23 rover -f quad myfoaf.rdf
Filtering trivial statements
By default, Apache Any23 will extract HTML/head meta information, such as links to CSS stylesheets or meta information like the author or the software used to create the html. Hence, if the user is only interested in the structured content from the HTML/body tag we offer a filter functionality, activated by the "-t" command line argument.
core$ any23 rover -t -f quad myfoaf.rdf
The ExtractorDocumentation tool
The ExtractorDocumentation returns human readable information about the registered extractors.
List all the available extractors:
cli$ any23 extractor --list csv [org.apache.any23.extractor.csv.CSVExtractorFactory] [text/csv;q=0.1] html-embedded-jsonld [org.apache.any23.extractor.html.EmbeddedJSONLDExtractorFactory] [text/html;q=0.02, application/xhtml+xml;q=0.02] html-head-icbm [org.apache.any23.extractor.html.ICBMExtractorFactory] [text/html;q=0.01, application/xhtml+xml;q=0.01] html-head-links [org.apache.any23.extractor.html.HeadLinkExtractorFactory] [text/html;q=0.05, application/xhtml+xml;q=0.05] html-head-meta [org.apache.any23.extractor.html.HTMLMetaExtractorFactory] [text/html;q=0.02, application/xhtml+xml;q=0.02] html-head-title [org.apache.any23.extractor.html.TitleExtractorFactory] [text/html;q=0.02, application/xhtml+xml;q=0.02] html-mf-adr [org.apache.any23.extractor.html.AdrExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-geo [org.apache.any23.extractor.html.GeoExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hcalendar [org.apache.any23.extractor.html.HCalendarExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hcard [org.apache.any23.extractor.html.HCardExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hlisting [org.apache.any23.extractor.html.HListingExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hrecipe [org.apache.any23.extractor.html.HRecipeExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hresume [org.apache.any23.extractor.html.HResumeExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hreview [org.apache.any23.extractor.html.HReviewExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-hreview-aggregate [org.apache.any23.extractor.html.HReviewAggregateExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-license [org.apache.any23.extractor.html.LicenseExtractorFactory] [text/html;q=0.01, application/xhtml+xml;q=0.01] html-mf-species [org.apache.any23.extractor.html.SpeciesExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-mf-xfn [org.apache.any23.extractor.html.XFNExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-microdata [org.apache.any23.extractor.microdata.MicrodataExtractorFactory] [text/html;q=0.1, application/xhtml+xml;q=0.1] html-rdfa11 [org.apache.any23.extractor.rdfa.RDFa11ExtractorFactory] [application/xhtml+xml;q=0.3, application/html;q=0.3, text/html;q=0.3] html-xpath [org.apache.any23.extractor.xpath.XPathExtractorFactory] [text/html;q=0.02, application/xhtml+xml;q=0.02] ical [org.apache.any23.extractor.calendar.ICalExtractorFactory] [text/calendar] jcal [org.apache.any23.extractor.calendar.JCalExtractorFactory] [application/calendar+json] owl-functional [org.apache.any23.extractor.rdf.FunctionalSyntaxExtractorFactory] [text/owl-functional] owl-manchester [org.apache.any23.extractor.rdf.ManchesterSyntaxExtractorFactory] [text/owl-manchester] rdf-jsonld [org.apache.any23.extractor.rdf.JSONLDExtractorFactory] [application/ld+json;q=0.1] rdf-nq [org.apache.any23.extractor.rdf.NQuadsExtractorFactory] [application/n-quads, text/x-nquads;q=0.1, text/rdf+nq;q=0.1, text/nq;q=0.1, text/nquads;q=0.1, text/n-quads;q=0.1] rdf-nt [org.apache.any23.extractor.rdf.NTriplesExtractorFactory] [application/n-triples;q=0.1, text/nt;q=0.1, text/ntriples;q=0.1, text/plain;q=0.1] rdf-trix [org.apache.any23.extractor.rdf.TriXExtractorFactory] [application/trix] rdf-turtle [org.apache.any23.extractor.rdf.TurtleExtractorFactory] [text/turtle, text/rdf+n3, text/n3, application/n3, application/x-turtle, application/turtle] rdf-xml [org.apache.any23.extractor.rdf.RDFXMLExtractorFactory] [application/rdf+xml, text/rdf, text/rdf+xml, application/rdf] xcal [org.apache.any23.extractor.calendar.XCalExtractorFactory] [application/calendar+xml] yaml [org.apache.any23.extractor.yaml.YAMLExtractorFactory] [text/x-yaml;q=0.5]
The MicrodataParser tool
The MicrodataParser tool allows to apply the only MicrodataExtractor on a specific input source and returns the extracted data in the JSON format declared in the Microdata specification section JSON.
cli$ any23 microdata http://path/to/resource.html
The VocabPrinter tool
The VocabPrinter Tool prints out the RDFSchema declared by all the Apache Any23 declared vocabularies.
Just launch the command below to see all the managed vocabularies.
cli$ any23 vocab
NOTE: This tool is still in beta version.
The MimeDetector tool
The MimeDetector Tool extracts the MIME Type for a given source (http:// file:// inline://).
Examples:
cli$ any23 mimes http://www.michelemostarda.com/foaf.rdf application/rdf+xml
cli$ any23 mimes file://../src/test/resources/application/trix/test1.trx application/trix
cli$ any23 mimes 'inline://<http://s> <http://p> <http://o> .' text/n3
The PluginVerifier tool
The PluginVerifier tool allows checking installed plugin in the specified input directory
Just launch the command below to sanity-check the input plugins directory
cli$ any23 verify [/path/to/plugins/dir]
Apache Any23 CLI Plugins
The Apache Any23 ToolRunner CLI (bin/any23) supports the auto detection of Tool plugins within the classpath. For further details see Plugins section.
The default any23 CLI plugins are enlisted below.
Crawler Plugin
crawler-tool The Crawler Plugin provides basic site crawling and metadata extraction capabilities.
cli$ any23 -h [...] crawler Any23 Crawler Command Line Tool. Usage: crawler [options] input IRIs {<url>|<file>}+ Options: -d, --defaultns Override the default namespace used to produce statements. -e, --extractors a comma-separated list of extractors, e.g. rdf-xml,rdf-turtle Default: [] -f, --format the output format Default: turtle -l, --log Produce log within a file. -md, --maxdepth Max allowed crawler depth. Default: 2147483647 -mp, --maxpages Max number of pages before interrupting crawl. Default: 2147483647 -n, --nesting Disable production of nesting triples. Default: false -t, --notrivial Filter trivial statements (e.g. CSS related ones). Default: false -nc, --numcrawlers Sets the number of crawlers. Default: 10 -o, --output Specify Output file (defaults to standard output) Default: java.io.PrintStream@2911a3a4 -pf, --pagefilter Regex used to filter out page URLs during crawling. Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$ -p, --pedantic Validate and fixes HTML content detecting commons issues. Default: false -pd, --politenessdelay Politeness delay in milliseconds. Default: 2147483647 -s, --stats Print out extraction statistics. Default: false -sf, --storagefolder Folder used to store crawler temporary data. Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce
A usage example:
cli$ any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
Use Apache Any23 as a RESTful Web Service
Apache Any23 provides a Web Service that can be used to extract RDF from Web documents. Apache Any23 services can be accessed through a RESTful API.
Running the server
The server command line tool is defined within the service module. Run the any23server
script
service$ ./bin/any23server
from the command line in order to start up the server, then go to to access the web interface. A live demo version of such service is running at . You can also start the server from Java by running the Apache Any23 Servlet class. Maven can be used to create a WAR file for deployment into an existing servlet container such as Apache Tomcat.
Use Apache Any23 as a Library
See our Developers guide for more details.