Package org.apache.any23.extractor.html
Class EntityBasedMicroformatExtractor
- java.lang.Object
  - org.apache.any23.extractor.html.MicroformatExtractor
    - org.apache.any23.extractor.html.EntityBasedMicroformatExtractor
- All Implemented Interfaces:
Extractor<Document>, Extractor.TagSoupDOMExtractor
- Direct Known Subclasses:
AdrExtractor, GeoExtractor, HAdrExtractor, HCardExtractor, HCardExtractor, HEntryExtractor, HEventExtractor, HGeoExtractor, HItemExtractor, HListingExtractor, HProductExtractor, HRecipeExtractor, HRecipeExtractor, HResumeExtractor, HResumeExtractor, HReviewAggregateExtractor, HReviewExtractor, SpeciesExtractor
public abstract class EntityBasedMicroformatExtractor extends MicroformatExtractor
Base class for microformat extractors based on entities.
- Author:
- Gabriele Renzi
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor
Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor
-
-
Field Summary
-
Fields inherited from class org.apache.any23.extractor.html.MicroformatExtractor
BEGIN_SCRIPT, END_SCRIPT, valueFactory
-
-
Constructor Summary
Constructors:
- EntityBasedMicroformatExtractor()
-
Method Summary
Modifier and type, method, description:
- boolean extract()
  Performs the extraction of the data and writes it to the model.
- protected abstract boolean extractEntity(Node node, ExtractionResult out)
  Extracts an entity from a DOM node.
- protected abstract String getBaseClassName()
  Returns the base class name for the extractor.
- protected org.eclipse.rdf4j.model.BNode getBlankNodeFor(Node node)
- protected abstract void resetExtractor()
  Resets the internal status of the extractor to prepare it for a new extraction.
-
Methods inherited from class org.apache.any23.extractor.html.MicroformatExtractor
addBNodeProperty, addBNodeProperty, addIRIProperty, conditionallyAddLiteralProperty, conditionallyAddResourceProperty, conditionallyAddStringProperty, fixLink, fixLink, getCurrentExtractionResult, getDescription, getDocumentIRI, getExtractionContext, getHTMLDocument, includes, openSubResult, run, setCurrentExtractionResult
Method Detail
-
getBaseClassName
protected abstract String getBaseClassName()
Returns the base class name for the extractor.
- Returns:
- a string containing the base class name of the extractor.
-
resetExtractor
protected abstract void resetExtractor()
Resets the internal status of the extractor to prepare it for a new extraction.
-
extractEntity
protected abstract boolean extractEntity(Node node, ExtractionResult out) throws ExtractionException
Extracts an entity from a DOM node.
- Parameters:
- node - the DOM node.
- out - the extraction result collector.
- Returns:
- true if the extraction has produced something, false otherwise.
- Throws:
- ExtractionException - if there is an error during extraction
-
extract
public boolean extract() throws ExtractionException
Description copied from class: MicroformatExtractor
Performs the extraction of the data and writes it to the model. The nodes generated in the model can have any name or implicit label, but if possible they SHOULD have names (either URIs or AnonId) that are uniquely derivable from their position in the DOM tree, so that multiple extractors can merge information.
- Specified by:
- extract in class MicroformatExtractor
- Returns:
- true if extraction is successful
- Throws:
ExtractionException
- if there is an error during extraction
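The relationship between extract() and the three abstract methods can be illustrated with a small, self-contained sketch. It does not use Any23 at all: EntityExtractorSketch, GeoSketch, the parse helper and the List<String> collector are hypothetical stand-ins, and the control flow (find every element carrying the base class name, reset, then extract one entity per element) is an assumption about how a class like this typically drives its subclasses, not a copy of the library's implementation.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for EntityBasedMicroformatExtractor; not the Any23 class.
abstract class EntityExtractorSketch {
    protected abstract String getBaseClassName();
    protected abstract void resetExtractor();
    // The documented method takes an ExtractionResult; a plain list plays that role here.
    protected abstract boolean extractEntity(Node node, List<String> out);

    // Assumed flow: visit every element whose class attribute contains the base class name.
    public boolean extract(Document doc, List<String> out) {
        boolean foundAny = false;
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            List<String> classes = Arrays.asList(el.getAttribute("class").trim().split("\\s+"));
            if (!classes.contains(getBaseClassName())) {
                continue;
            }
            resetExtractor();              // clear per-entity state before each entity
            if (extractEntity(el, out)) {  // true when the entity produced something
                foundAny = true;
            }
        }
        return foundAny;
    }

    // Helper for the sketch only: parses a well-formed XHTML string with the JDK parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}

// Hypothetical subclass: collects the text of every element marked class="geo".
class GeoSketch extends EntityExtractorSketch {
    int resets = 0;

    @Override protected String getBaseClassName() { return "geo"; }

    @Override protected void resetExtractor() { resets++; }

    @Override protected boolean extractEntity(Node node, List<String> out) {
        String text = node.getTextContent().trim();
        if (text.isEmpty()) {
            return false;
        }
        out.add(text);
        return true;
    }
}
```

In this sketch a subclass only states which class name marks an entity and how one entity is read; the iteration and the overall boolean result stay in the base class.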
-
getBlankNodeFor
protected org.eclipse.rdf4j.model.BNode getBlankNodeFor(Node node)
- Parameters:
- node - a DOM node representing a blank node
- Returns:
- an RDF blank node corresponding to that DOM node, by using a blank node ID like "MD5 of http://doc-uri/#xpath/to/node"
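
The ID scheme described above (a hash of the document IRI plus the node's XPath) makes blank node IDs reproducible, which is what lets several extractors refer to the same node. A minimal sketch of that idea using only the JDK follows. BlankNodeIdSketch and blankNodeId are hypothetical names, and the exact key layout and the "node" prefix are assumptions rather than the format Any23 produces.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; illustrates the "MD5 of doc-uri#xpath" idea only.
class BlankNodeIdSketch {

    // Builds a deterministic ID from the document IRI and the XPath of the DOM node.
    static String blankNodeId(String documentIri, String xpath) {
        String key = documentIri + "#" + xpath;  // assumed key layout
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder id = new StringBuilder("node");  // assumed prefix
            for (byte b : digest) {
                id.append(String.format("%02x", b));  // two hex digits per byte
            }
            return id.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Calling blankNodeId twice with the same IRI and XPath yields the same string, while a different XPath yields a different one, so two extractors that visit the same DOM node independently arrive at the same blank node.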