HTML Scraper Plugin
The HTML Scraper Plugin scrapes any HTML page and extracts only its human-readable text. For each page, the plugin generates a set of triples such as:
<http://source-page-url> <http://vocab.sindice.net/pagecontent/de> "<DE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/ae> "<AE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/lce> "<LCE Extractor Result>" .
<http://source-page-url> <http://vocab.sindice.net/pagecontent/ce> "<CE Extractor Result>" .
The plugin engine is based on the extractors of the Boilerpipe library. The abbreviations DE, AE, LCE and CE refer to extractors defined in that library: DefaultExtractor, ArticleExtractor, LargestContentExtractor and CanolaExtractor, respectively.
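As an illustration of the output shape only, the following Python sketch assembles triples in the pattern shown above from already-extracted text. It does not call Boilerpipe and is not the plugin's implementation; the function names `escape_literal` and `page_content_triples`, and the idea of passing results as a dictionary keyed by `de`, `ae`, `lce` and `ce`, are assumptions made for this example.

```python
# Hypothetical helper: formats page-content triples in N-Triples style.
# The namespace and the de/ae/lce/ce suffixes come from the example above;
# everything else (names, dict input) is assumed for illustration.

PAGECONTENT_NS = "http://vocab.sindice.net/pagecontent/"
EXTRACTOR_KEYS = ("de", "ae", "lce", "ce")


def escape_literal(text):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes stay intact
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def page_content_triples(page_url, results):
    """Return one triple line per extractor key present in `results`.

    `results` maps an extractor key ("de", "ae", "lce", "ce") to the text
    that extractor produced for `page_url`. Keys are emitted in a fixed order.
    """
    lines = []
    for key in EXTRACTOR_KEYS:
        if key in results:
            lines.append(
                '<%s> <%s%s> "%s" .'
                % (page_url, PAGECONTENT_NS, key, escape_literal(results[key]))
            )
    return lines


if __name__ == "__main__":
    sample = {"de": "Default text", "ae": "Article text"}
    for line in page_content_triples("http://example.org/page", sample):
        print(line)
```

Escaping the literal matters because scraped page text routinely contains quotes and line breaks, which would otherwise produce malformed triples.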