HTMLScraperExtractor (Apache Any23 2.4-SNAPSHOT API)

java.lang.Object
- org.apache.any23.plugin.htmlscraper.HTMLScraperExtractor

All Implemented Interfaces:

Extractor<InputStream>, Extractor.ContentExtractor
```
public class HTMLScraperExtractor
extends Object
implements Extractor.ContentExtractor
```
Implementation of content extractor for performing HTML scraping.

Author:

Michele Mostarda (mostarda@fbk.eu)

Nested Class Summary
- Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor
  Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor

Field Summary

Fields
Modifier and Type	Field and Description
`static org.eclipse.rdf4j.model.IRI`	`PAGE_CONTENT_AE_PROPERTY`
`static org.eclipse.rdf4j.model.IRI`	`PAGE_CONTENT_CE_PROPERTY`
`static org.eclipse.rdf4j.model.IRI`	`PAGE_CONTENT_DE_PROPERTY`
`static org.eclipse.rdf4j.model.IRI`	`PAGE_CONTENT_LCE_PROPERTY`

Constructor Summary

Constructors
Constructor and Description
`HTMLScraperExtractor()`

Method Summary

Modifier and Type	Method and Description
`void`	`addTextExtractor(String name, org.eclipse.rdf4j.model.IRI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor)`
`ExtractorDescription`	`getDescription()` Returns an `ExtractorDescription` of this extractor.
`String[]`	`getTextExtractors()`
`void`	`run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult)` Executes the extractor.
`void`	`setStopAtFirstError(boolean b)` If `true`, the extractor will stop at the first parsing error; if `false`, the extractor will attempt to ignore all parsing errors.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PAGE_CONTENT_DE_PROPERTY
```
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_DE_PROPERTY
```
  - PAGE_CONTENT_AE_PROPERTY
```
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_AE_PROPERTY
```
  - PAGE_CONTENT_LCE_PROPERTY
```
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_LCE_PROPERTY
```
  - PAGE_CONTENT_CE_PROPERTY
```
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_CE_PROPERTY
```
- Constructor Detail
  - HTMLScraperExtractor
```
public HTMLScraperExtractor()
```
- Method Detail
  - addTextExtractor
```
public void addTextExtractor(String name,
                             org.eclipse.rdf4j.model.IRI property,
                             de.l3s.boilerpipe.BoilerpipeExtractor extractor)
```
  - getTextExtractors
```
public String[] getTextExtractors()
```
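    The source gives no description for addTextExtractor or getTextExtractors. Judging only from the two signatures, one plausible reading is a registry that pairs each name with a property and an extractor, and that reports the registered names. The sketch below illustrates that reading under stated assumptions: TextExtractorRegistry, its Entry class, and the apply method are hypothetical; a String stands in for org.eclipse.rdf4j.model.IRI, and a Function<String, String> stands in for de.l3s.boilerpipe.BoilerpipeExtractor. It is not the Any23 implementation, and the real class may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical stand-in for illustration; not an Any23 class. */
class TextExtractorRegistry {

    /** One registration: a name, a property identifier, and a text function. */
    private static final class Entry {
        final String name;
        final String property;                     // stand-in for an rdf4j IRI
        final Function<String, String> extractor;  // stand-in for a BoilerpipeExtractor

        Entry(String name, String property, Function<String, String> extractor) {
            this.name = Objects.requireNonNull(name, "name");
            this.property = Objects.requireNonNull(property, "property");
            this.extractor = Objects.requireNonNull(extractor, "extractor");
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Registers an extractor under a name, keeping insertion order. */
    void addTextExtractor(String name, String property, Function<String, String> extractor) {
        entries.add(new Entry(name, property, extractor));
    }

    /** Returns the registered names in insertion order. */
    String[] getTextExtractors() {
        String[] names = new String[entries.size()];
        for (int i = 0; i < names.length; i++) {
            names[i] = entries.get(i).name;
        }
        return names;
    }

    /** Runs every registered extractor and returns "property=text" strings. */
    List<String> apply(String html) {
        List<String> results = new ArrayList<>();
        for (Entry e : entries) {
            results.add(e.property + "=" + e.extractor.apply(html));
        }
        return results;
    }
}
```

    In this sketch, registering two entries and then calling getTextExtractors() yields their names in registration order. Whether Any23 preserves order or rejects duplicate names is not stated in the source.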
  - run
```
public void run(ExtractionParameters extractionParameters,
                ExtractionContext extractionContext,
                InputStream inputStream,
                ExtractionResult extractionResult)
         throws IOException,
                ExtractionException
```
    Description copied from interface: Extractor
    
    Executes the extractor. It will be invoked only once; extractors are not reusable.
    
    Specified by:
    
    run in interface Extractor<InputStream>
    
    Parameters:
    
    extractionParameters - The parameters to be applied during the extraction.
    
    extractionContext - The document context.
    
    inputStream - The extractor input data.
    
    extractionResult - The collector for the extracted data.
    
    Throws:
    
    IOException - On error while reading from the input stream.
    
    ExtractionException - On other error, such as parse errors.
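
    Because run is documented as a single-use call, a caller would create a fresh extractor for each document. As a minimal sketch, assuming one wanted to make that contract explicit, a guard flag can reject a second invocation. OneShotExtractor and its simplified run(String) signature are hypothetical stand-ins, not Any23 classes, and nothing on this page says Any23 enforces the rule this way.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for illustration; not an Any23 class. */
class OneShotExtractor {
    // Flips to true on the first call to run and never resets.
    private final AtomicBoolean used = new AtomicBoolean(false);

    /**
     * Simplified stand-in for run(...): returns the input length.
     * Throws if this instance has already been run once.
     */
    int run(String input) {
        if (!used.compareAndSet(false, true)) {
            throw new IllegalStateException("Extractor already used; create a new instance");
        }
        return input.length();
    }
}
```

    A caller following this pattern constructs a new OneShotExtractor for every input rather than keeping one instance around.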
  - getDescription
```
public ExtractorDescription getDescription()
```
    Description copied from interface: Extractor
    
    Returns an ExtractorDescription of this extractor.
    
    Specified by:
    
    getDescription in interface Extractor<InputStream>
    
    Returns:
    
    the object representing the extractor description.
  - setStopAtFirstError
```
public void setStopAtFirstError(boolean b)
```
    Description copied from interface: Extractor.ContentExtractor
    
    If true, the extractor will stop at the first parsing error; if false, the extractor will attempt to ignore all parsing errors.
    
    Specified by:
    
    setStopAtFirstError in interface Extractor.ContentExtractor
    
    Parameters:
    
    b - tolerance flag.
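
    The effect of the tolerance flag can be illustrated with a small sketch. It does not use Any23 classes: TolerantLineParser and its parse method are hypothetical, and integer parsing stands in for HTML parsing. Only the documented contract is mirrored, namely that true aborts on the first parsing error and false skips errors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not an Any23 class. */
class TolerantLineParser {
    private boolean stopAtFirstError = false;

    /** Mirrors the documented contract of setStopAtFirstError(boolean). */
    void setStopAtFirstError(boolean b) {
        this.stopAtFirstError = b;
    }

    /** Parses each line as an int; a bad line either aborts or is skipped. */
    List<Integer> parse(List<String> lines) {
        List<Integer> parsed = new ArrayList<>();
        for (String line : lines) {
            try {
                parsed.add(Integer.parseInt(line.trim()));
            } catch (NumberFormatException e) {
                if (stopAtFirstError) {
                    // Strict mode: surface the first error to the caller.
                    throw new IllegalStateException("Parse error at: " + line, e);
                }
                // Tolerant mode: ignore the bad line and continue.
            }
        }
        return parsed;
    }
}
```

    In tolerant mode the sketch returns whatever parsed cleanly; in strict mode the caller sees the first failure and no partial result.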
