This project has retired. For details please refer to its Attic page.

Data Extraction

/*1*/ Any23 runner = new Any23();
/*2*/ runner.setHTTPUserAgent("test-user-agent");
/*3*/ HTTPClient httpClient = runner.getHTTPClient();
/*4*/ DocumentSource source = new HTTPDocumentSource(
          httpClient,
          "..." // URL of the document to process
      );
/*5*/ ByteArrayOutputStream out = new ByteArrayOutputStream();
/*6*/ TripleHandler handler = new NTriplesWriter(out);
      try {
/*7*/     runner.extract(source, handler);
      } finally {
/*8*/     handler.close();
      }
/*9*/ String n3 = out.toString("UTF-8");

This example demonstrates data extraction, which is the main purpose of the Apache Any23 library. At line 1 we create the Apache Any23 facade instance. As described before, the constructor can be used to enforce the use of specific extractors.

Line 2 sets the HTTP User Agent, which identifies the client while data is fetched over HTTP. At line 3 we ask the runner for an HTTPClient instance, which HTTPDocumentSource uses to fetch HTTP content.

Line 4 creates an HTTPDocumentSource, specifying the HTTPClient and the URL of the content to be processed.

At line 5 we define an in-memory output stream that stores the data produced by the TripleHandler defined at line 6.

The extract method at line 7 runs the metadata extraction. The produced metadata is written to the TripleHandler instance passed to it.

The TripleHandler must be closed explicitly; this is done safely in a finally block at line 8.

At line 9 the collected bytes are decoded as a UTF-8 string. The expected output is:

<> <>
"Semantic Loft (beta) - Trastevere apartments | Rental in Rome -" .

<> .

<> .

<> .

<> .

_:node14r93a8dex1 .

[The complete output is omitted for brevity.]
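
The buffer, close, and decode steps (lines 5 to 9) can be tried in isolation with plain JDK classes. The following is a minimal sketch, not Any23 code: a BufferedWriter stands in for the TripleHandler, and the class name, method names, and the example.org triple are made up for illustration. It shows why the close call belongs in a finally block: a writer that buffers internally may not hand its data to the underlying stream until it is closed.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the handler/stream pair; uses only JDK classes.
class BufferCloseDecodeSketch {

    /** Returns how many bytes have reached the byte buffer before the writer is closed. */
    static int bytesVisibleBeforeClose(String line) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            try {
                writer.write(line);
                return out.size(); // evaluated before the finally block closes the writer
            } finally {
                writer.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the line, closes the writer in a finally block, then decodes the bytes as UTF-8. */
    static String writeCloseDecode(String line) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            try {
                writer.write(line);
            } finally {
                writer.close(); // flushes any pending characters into the byte buffer
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up N-Triples-style line with a non-ASCII character.
        String line = "_:a <http://example.org/p> \"caf\u00e9\" .\n";
        System.out.println(bytesVisibleBeforeClose(line));
        System.out.print(writeCloseDecode(line));
    }
}
```

For a short line, the first method reports zero bytes because BufferedWriter keeps small writes in its internal buffer; only after close does the full line, including the non-ASCII character, come back intact from the UTF-8 decode.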

Filter Out Accidental Triples

To remove accidental triples, Apache Any23 provides a set of filters located in the org.apache.any23.filter package.

The IgnoreTitlesOfEmptyDocuments filter removes the triples generated by the TitleExtractor when the document is empty.

The IgnoreAccidentalRDFa filter removes accidental CSS-related triples.

RDFWriter rdfWriter = ...
TripleHandler rdfWriterHandler = new RDFWriterTripleHandler(rdfWriter);
TripleHandler tripleHandler = new ReportingTripleHandler(
        new IgnoreAccidentalRDFa(
                new IgnoreTitlesOfEmptyDocuments(rdfWriterHandler),
                true // if true the CSS triples will be removed in any case.
        )
);
DocumentSource documentSource = ...
// Pass the outermost handler so that the filters are actually applied.
any23.extract(documentSource, tripleHandler);
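
The filters are composed by wrapping one handler in another, so only data sent to the outermost handler passes through the whole chain. The sketch below illustrates that wrapping idea with a simplified title filter. It is an assumption-laden illustration, not the Any23 implementation: the Sink interface, the CollectingSink and DropTitlesOfEmptyDocuments classes, and the "title" extractor label are all hypothetical, and treating a document as empty when no other extractor produced data is a simplification chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

class FilterChainSketch {

    /** Hypothetical, simplified stand-in for a triple handler; not the Any23 interface. */
    interface Sink {
        void receive(String extractor, String triple);
        void close();
    }

    /** Terminal sink that collects whatever reaches the end of the chain. */
    static final class CollectingSink implements Sink {
        final List<String> triples = new ArrayList<>();

        public void receive(String extractor, String triple) {
            triples.add(triple);
        }

        public void close() {
            // Nothing to release in this sketch.
        }
    }

    /** Holds back title triples and forwards them on close only if other data was seen. */
    static final class DropTitlesOfEmptyDocuments implements Sink {
        private final Sink wrapped;
        private final List<String> heldTitles = new ArrayList<>();
        private boolean sawOtherData;

        DropTitlesOfEmptyDocuments(Sink wrapped) {
            this.wrapped = wrapped;
        }

        public void receive(String extractor, String triple) {
            if ("title".equals(extractor)) {
                heldTitles.add(triple); // decision is deferred until close()
            } else {
                sawOtherData = true;
                wrapped.receive(extractor, triple);
            }
        }

        public void close() {
            if (sawOtherData) {
                for (String title : heldTitles) {
                    wrapped.receive("title", title);
                }
            }
            wrapped.close();
        }
    }

    /** Feeds (extractor, triple) pairs to the outermost sink and returns what got through. */
    static List<String> run(String[][] input) {
        CollectingSink sink = new CollectingSink();
        Sink chain = new DropTitlesOfEmptyDocuments(sink);
        try {
            for (String[] item : input) {
                chain.receive(item[0], item[1]);
            }
        } finally {
            chain.close();
        }
        return sink.triples;
    }
}
```

Calling run with only a title triple yields an empty list, while adding a triple from any other extractor lets the title through as well. Feeding the pairs directly to the inner CollectingSink instead of the chain would bypass the filter entirely, which is why the wrapping handler is the one to hand to the extraction call.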