public class HTMLScraperExtractor extends Object implements Extractor.ContentExtractor
Nested classes/interfaces inherited from interface Extractor:
Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor
Modifier and Type | Field and Description |
---|---|
static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_AE_PROPERTY |
static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_CE_PROPERTY |
static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_DE_PROPERTY |
static org.eclipse.rdf4j.model.IRI | PAGE_CONTENT_LCE_PROPERTY |
Constructor and Description |
---|
HTMLScraperExtractor() |
Modifier and Type | Method and Description |
---|---|
void | addTextExtractor(String name, org.eclipse.rdf4j.model.IRI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor) |
ExtractorDescription | getDescription() Returns an ExtractorDescription of this extractor. |
String[] | getTextExtractors() |
void | run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult) Executes the extractor. |
void | setStopAtFirstError(boolean b) If true, the extractor will stop at the first parsing error; if false, the extractor will attempt to ignore all parsing errors. |
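The tolerance flag described for setStopAtFirstError in the table above can be pictured with a small self-contained sketch. Everything here is a hypothetical stand-in: the class StopAtFirstErrorSketch, its parseAll method, and the choice of IllegalStateException are assumptions made for illustration, not the Any23 implementation, and integer parsing merely plays the role of HTML parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, NOT the Any23 class: it only illustrates the
// documented contract of setStopAtFirstError(boolean).
final class StopAtFirstErrorSketch {

    // Java's default (false) is used here; the real extractor's default is not stated on this page.
    private boolean stopAtFirstError;

    void setStopAtFirstError(boolean b) {
        this.stopAtFirstError = b;
    }

    // Parses each token as an integer. A malformed token is a "parsing error":
    // it aborts the run when the flag is true and is skipped when it is false.
    List<Integer> parseAll(List<String> tokens) {
        List<Integer> parsed = new ArrayList<>();
        for (String token : tokens) {
            try {
                parsed.add(Integer.parseInt(token));
            } catch (NumberFormatException e) {
                if (stopAtFirstError) {
                    throw new IllegalStateException("Parse error at token: " + token, e);
                }
                // Tolerant mode: ignore the bad token and continue.
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        StopAtFirstErrorSketch parser = new StopAtFirstErrorSketch();
        List<String> input = Arrays.asList("1", "x", "3");

        parser.setStopAtFirstError(false);
        System.out.println("tolerant: " + parser.parseAll(input));

        parser.setStopAtFirstError(true);
        try {
            parser.parseAll(input);
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
    }
}
```

In tolerant mode the caller gets whatever could be parsed; in strict mode the first malformed input ends the run, which is usually the better choice when partial output would be misleading.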
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_DE_PROPERTY
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_AE_PROPERTY
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_LCE_PROPERTY
public static final org.eclipse.rdf4j.model.IRI PAGE_CONTENT_CE_PROPERTY
public void addTextExtractor(String name, org.eclipse.rdf4j.model.IRI property, de.l3s.boilerpipe.BoilerpipeExtractor extractor)
public String[] getTextExtractors()
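This page gives no description for addTextExtractor and getTextExtractors. One plausible reading of the signatures is a registry that maps a name to an output property and a text extractor, with getTextExtractors listing the registered names. The sketch below illustrates that reading only. It is an assumption, not the Any23 implementation: TextExtractorRegistrySketch, extractAll, and stripTags are hypothetical, plain String values stand in for org.eclipse.rdf4j.model.IRI, and a Function<String, String> stands in for de.l3s.boilerpipe.BoilerpipeExtractor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in, NOT the Any23 class: it only mirrors the shape of
// addTextExtractor(name, property, extractor) and getTextExtractors().
final class TextExtractorRegistrySketch {

    // Pairs an output property (a String instead of an RDF4J IRI) with a
    // text extraction function (instead of a BoilerpipeExtractor).
    private static final class Entry {
        final String property;
        final Function<String, String> extractor;

        Entry(String property, Function<String, String> extractor) {
            this.property = property;
            this.extractor = extractor;
        }
    }

    // LinkedHashMap keeps registration order, so names are listed predictably.
    private final Map<String, Entry> extractors = new LinkedHashMap<>();

    void addTextExtractor(String name, String property, Function<String, String> extractor) {
        extractors.put(name, new Entry(property, extractor));
    }

    String[] getTextExtractors() {
        return extractors.keySet().toArray(new String[0]);
    }

    // Runs every registered extractor and maps each property to its extracted text.
    Map<String, String> extractAll(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Entry entry : extractors.values()) {
            result.put(entry.property, entry.extractor.apply(html));
        }
        return result;
    }

    // Naive tag stripper used as a toy extractor; real boilerplate removal is far more involved.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        TextExtractorRegistrySketch registry = new TextExtractorRegistrySketch();
        registry.addTextExtractor("strip", "http://example.org/vocab#text",
                TextExtractorRegistrySketch::stripTags);
        registry.addTextExtractor("upper", "http://example.org/vocab#upper",
                html -> stripTags(html).toUpperCase());

        System.out.println(String.join(", ", registry.getTextExtractors()));
        System.out.println(registry.extractAll("<p>Hello <b>world</b></p>"));
    }
}
```

Keying the registry by name lets a caller replace an extractor by registering again under the same name, while the property decides where each extractor's text ends up in the output.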
public void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, InputStream inputStream, ExtractionResult extractionResult) throws IOException, ExtractionException
Description copied from interface: Extractor
Executes the extractor.
Specified by: run in interface Extractor<InputStream>
Parameters:
extractionParameters - the parameters to be applied during the extraction.
extractionContext - the document context.
inputStream - the extractor input data.
extractionResult - the collector for the extracted data.
Throws:
IOException - on error while reading from the input stream.
ExtractionException - on other errors, such as parse errors.
public ExtractorDescription getDescription()
Description copied from interface: Extractor
Returns an ExtractorDescription of this extractor.
Specified by: getDescription in interface Extractor<InputStream>
public void setStopAtFirstError(boolean b)
Description copied from interface: Extractor.ContentExtractor
If true, the extractor will stop at the first parsing error; if false, the extractor will attempt to ignore all parsing errors.
Specified by: setStopAtFirstError in interface Extractor.ContentExtractor
Parameters:
b - tolerance flag.
Copyright © 2010–2019 The Apache Software Foundation. All rights reserved.