This project has retired. For details please refer to its Attic page.

HTML Scraper Plugin

The HTML Scraper Plugin scrapes an HTML page and extracts only its human-readable text. For each page it generates a set of triples, one per extractor, like the following:

<http://source-page-url> <http://vocab.sindice.net/pagecontent/de>  "<DE  Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/ae>  "<AE  Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/lce> "<LCE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/ce>  "<CE  Extractor Result>" .

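The plugin engine is based on the extractors of the Boilerpipe library. The abbreviations DE, AE, LCE and CE refer to extractors defined within that library; going by their names, these are DefaultExtractor, ArticleExtractor, LargestContentExtractor and CanolaExtractor respectively. Each one applies a different strategy for separating the main text from page boilerplate, so the four literals for a single page can differ.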
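
To make the output shape concrete, here is a minimal sketch of how the triples above could be assembled once each extractor has returned its text. It is not the plugin's actual code: the function names, the `results` dictionary and the example URL are assumptions made for illustration, and only the property URIs and the four keys (de, ae, lce, ce) come from the triples shown above.

```python
# Illustrative sketch only; not taken from the Any23 sources.
# The namespace and the four keys mirror the triples shown above.
PAGECONTENT_NS = "http://vocab.sindice.net/pagecontent/"
EXTRACTOR_KEYS = ("de", "ae", "lce", "ce")


def escape_literal(text):
    """Escape a string so it is safe inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def page_content_triples(page_url, results):
    """Build one N-Triples line per extractor result.

    `results` is a hypothetical mapping from extractor key
    ("de", "ae", "lce", "ce") to the text that extractor produced.
    Keys missing from the mapping are skipped.
    """
    lines = []
    for key in EXTRACTOR_KEYS:
        if key in results:
            lines.append(
                '<%s> <%s%s> "%s" .'
                % (page_url, PAGECONTENT_NS, key, escape_literal(results[key]))
            )
    return lines


if __name__ == "__main__":
    sample = {"de": "Full page text", "ae": "Main article text"}
    for line in page_content_triples("http://example.org/page", sample):
        print(line)
```

The sketch keeps the extractor order fixed (de, ae, lce, ce) so the output is deterministic, and it escapes quotes and line breaks because scraped text frequently contains both.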