public class HTMLScraperExtractor extends Object implements Extractor.ContentExtractor
Nested classes/interfaces inherited from interface Extractor: Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor

| Modifier and Type | Field and Description |
|---|---|
| static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_AE_PROPERTY |
| static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_CE_PROPERTY |
| static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_DE_PROPERTY |
| static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_LCE_PROPERTY |

| Constructor and Description |
|---|
| HTMLScraperExtractor() |

| Modifier and Type | Method and Description |
|---|---|
| void | addTextExtractor(String name, org.eclipse.rdf4j.model.IRI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor) |
| ExtractorDescription | getDescription() Returns an ExtractorDescription of this extractor. |
| String[] | getTextExtractors() |
| void | run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult) Executes the extractor. |
| void | setStopAtFirstError(boolean b) If true, the extractor will stop at the first parsing error; if false, the extractor will attempt to ignore all parsing errors. |
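
The two behaviours of setStopAtFirstError(boolean) can be illustrated with a small, self-contained sketch. It is not the Any23 implementation: the class StopAtFirstErrorSketch, the parseAll and getIgnoredErrors methods, the use of integer tokens as a stand-in for HTML parsing, and the default flag value are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in showing strict versus lenient error handling; not an Any23 class. */
class StopAtFirstErrorSketch {

    // Assumed default for this sketch; the real extractor's default is not stated in the source.
    private boolean stopAtFirstError = false;

    // Tokens that failed to parse while running in lenient mode.
    private final List<String> ignoredErrors = new ArrayList<>();

    /** Mirrors the documented signature: true stops at the first error, false ignores errors. */
    public void setStopAtFirstError(boolean b) {
        this.stopAtFirstError = b;
    }

    /** Parses each token as an integer; integer parsing stands in for HTML parsing here. */
    public List<Integer> parseAll(List<String> tokens) {
        List<Integer> values = new ArrayList<>();
        for (String token : tokens) {
            try {
                values.add(Integer.parseInt(token.trim()));
            } catch (NumberFormatException e) {
                if (stopAtFirstError) {
                    // Strict mode: abort on the first parsing error.
                    throw new IllegalStateException("Parse error at token: " + token, e);
                }
                // Lenient mode: record the problem and continue with the next token.
                ignoredErrors.add(token);
            }
        }
        return values;
    }

    public List<String> getIgnoredErrors() {
        return ignoredErrors;
    }
}
```

In this sketch, lenient mode returns whatever could be parsed and keeps a record of what was skipped, while strict mode surfaces the first failure to the caller as an exception.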
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_DE_PROPERTY
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_AE_PROPERTY
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_LCE_PROPERTY
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_CE_PROPERTY
public void addTextExtractor(String name, org.eclipse.rdf4j.model.IRI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
public String[] getTextExtractors()
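
The pairing of addTextExtractor and getTextExtractors suggests a name-keyed registry of text extractors, each tied to an output property. The sketch below models that pattern with standard-library types only. It is not the Any23 class: TextExtractorRegistrySketch, extractAll, stripTags, the example IRI, and the use of a plain String and a Function in place of org.eclipse.rdf4j.model.IRI and de.l3s.boilerpipe.BoilerpipeExtractor are all hypothetical stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical stand-in for a name-keyed extractor registry; not an Any23 class. */
class TextExtractorRegistrySketch {

    /** Pairs an output property (an IRI given as a string) with a text-extraction function. */
    private static final class Entry {
        final String propertyIri;
        final Function<String, String> extractor;

        Entry(String propertyIri, Function<String, String> extractor) {
            this.propertyIri = propertyIri;
            this.extractor = extractor;
        }
    }

    // Insertion-ordered so that getTextExtractors() returns names in registration order.
    private final Map<String, Entry> extractors = new LinkedHashMap<>();

    /** Registers an extractor under a name; re-using a name replaces the earlier entry. */
    public void addTextExtractor(String name, String propertyIri, Function<String, String> extractor) {
        Objects.requireNonNull(name, "name");
        Objects.requireNonNull(propertyIri, "propertyIri");
        Objects.requireNonNull(extractor, "extractor");
        extractors.put(name, new Entry(propertyIri, extractor));
    }

    /** Returns the names of all registered extractors. */
    public String[] getTextExtractors() {
        return extractors.keySet().toArray(new String[0]);
    }

    /** Applies every registered extractor and maps each property to its extracted text. */
    public Map<String, String> extractAll(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Entry entry : extractors.values()) {
            result.put(entry.propertyIri, entry.extractor.apply(html));
        }
        return result;
    }

    /** Naive tag stripper used only as a placeholder for a real content extractor. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " ");
    }
}
```

A caller would register an extractor with something like `registry.addTextExtractor("strip-tags", "http://example.org/vocab#text", TextExtractorRegistrySketch::stripTags)` and then list the registered names with `registry.getTextExtractors()`. How the real class treats duplicate names or ordering is not stated in this page, so the choices above are assumptions.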
public void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult) throws IOException, ExtractionException
Description copied from interface: Extractor
Executes the extractor.
Specified by: run in interface Extractor<InputStream>
Parameters:
extractionParameters - the parameters to be applied during the extraction.
extractionContext - The document context.
inputStream - The extractor input data.
extractionResult - the collector for the extracted data.
Throws:
IOException - On error while reading from the input stream.
ExtractionException - On other error, such as parse errors.

public ExtractorDescription getDescription()
Description copied from interface: Extractor
Returns an ExtractorDescription of this extractor.
Specified by: getDescription in interface Extractor<InputStream>

public void setStopAtFirstError(boolean b)
Description copied from interface: Extractor.ContentExtractor
If true, the extractor will stop at the first parsing error; if false, the extractor will attempt to ignore all parsing errors.
Specified by: setStopAtFirstError in interface Extractor.ContentExtractor
Parameters:
b - tolerance flag.

Copyright © 2010–2019 The Apache Software Foundation. All rights reserved.