Package org.apache.any23.extractor.html
Class MicroformatExtractor
- java.lang.Object
-
- org.apache.any23.extractor.html.MicroformatExtractor
-
- All Implemented Interfaces:
Extractor<Document>, Extractor.TagSoupDOMExtractor
- Direct Known Subclasses:
EntityBasedMicroformatExtractor, HCalendarExtractor
public abstract class MicroformatExtractor extends Object implements Extractor.TagSoupDOMExtractor
The abstract base class for any Microformat specification extractor.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor
Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor
-
-
Field Summary
Fields (modifier and type, field):
- static String BEGIN_SCRIPT
- static String END_SCRIPT
- protected Any23ValueFactoryWrapper valueFactory
-
Constructor Summary
Constructors:
- MicroformatExtractor()
-
Method Summary
Methods (modifier and type, method, description):
- protected void addBNodeProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
Helper method that adds a BNode property to a node.
- protected void addBNodeProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
Helper method that adds a BNode property to a node.
- protected void addIRIProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI object)
Helper method that adds an IRI property to a node.
- protected boolean conditionallyAddLiteralProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.Literal literal)
Helper method that adds a literal property to a node.
- protected boolean conditionallyAddResourceProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI uri)
Helper method that adds an IRI property to a node.
- protected boolean conditionallyAddStringProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI p, String value)
Helper method that adds a literal property to a subject only if the value of the property is a valid string.
- protected abstract boolean extract()
Performs the extraction of the data and writes them to the model.
- protected org.eclipse.rdf4j.model.IRI fixLink(String link)
- protected org.eclipse.rdf4j.model.IRI fixLink(String link, String defaultSchema)
- protected ExtractionResult getCurrentExtractionResult()
Returns the ExtractionResult associated with the extraction session.
- abstract ExtractorDescription getDescription()
Returns the description of this extractor.
- org.eclipse.rdf4j.model.IRI getDocumentIRI()
- ExtractionContext getExtractionContext()
- HTMLDocument getHTMLDocument()
- static boolean includes(Class<? extends MicroformatExtractor> including, Class<? extends MicroformatExtractor> included)
This method checks if there is a native nesting relationship between two MicroformatExtractor classes.
- protected ExtractionResult openSubResult(ExtractionContext context)
- void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, Document in, ExtractionResult out)
Executes the extractor.
- protected void setCurrentExtractionResult(ExtractionResult out)
-
-
-
Field Detail
-
BEGIN_SCRIPT
public static final String BEGIN_SCRIPT
- See Also:
- Constant Field Values
-
END_SCRIPT
public static final String END_SCRIPT
- See Also:
- Constant Field Values
-
valueFactory
protected final Any23ValueFactoryWrapper valueFactory
-
-
Method Detail
-
getDescription
public abstract ExtractorDescription getDescription()
Returns the description of this extractor.
- Specified by:
getDescription in interface Extractor<Document>
- Returns:
a human-readable description.
-
extract
protected abstract boolean extract() throws ExtractionException
Performs the extraction of the data and writes them to the model. The nodes generated in the model can have any name or implicit label, but if possible they SHOULD have names (either URIs or AnonId) that are uniquely derivable from their position in the DOM tree, so that multiple extractors can merge information.
- Returns:
true if extraction is successful
- Throws:
ExtractionException - if there is an error during extraction
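The recommendation that node names be derivable from DOM position can be illustrated with a small, self-contained sketch. It uses only the JDK's org.w3c.dom API. The class name PositionalNames and the methods pathOf, blankNodeIdFor, and parse are hypothetical and are not part of Any23. The naming scheme (hashing an XPath-like path) is an assumption chosen for illustration, not necessarily what the library does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not an Any23 class.
final class PositionalNames {

    /** Builds a path such as /html[1]/body[1]/div[2] from element positions. */
    static String pathOf(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node cur = node;
             cur != null && cur.getNodeType() == Node.ELEMENT_NODE;
             cur = cur.getParentNode()) {
            // Count preceding sibling elements that share this element's name.
            int index = 1;
            for (Node sib = cur.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
                if (sib.getNodeType() == Node.ELEMENT_NODE
                        && sib.getNodeName().equals(cur.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + cur.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    /** Assumed scheme: the identifier depends only on the node's position. */
    static String blankNodeIdFor(Node node) {
        return "node" + Integer.toHexString(pathOf(node).hashCode());
    }

    /** Parses a well-formed XML string; wraps checked exceptions for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Because the identifier is a pure function of the path, two extractors that visit the same element independently would compute the same name, which is what allows their output to be merged.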
-
getHTMLDocument
public HTMLDocument getHTMLDocument()
-
getExtractionContext
public ExtractionContext getExtractionContext()
-
getDocumentIRI
public org.eclipse.rdf4j.model.IRI getDocumentIRI()
-
run
public final void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, Document in, ExtractionResult out) throws IOException, ExtractionException
Description copied from interface: Extractor
Executes the extractor. Will be invoked only once; extractors are not reusable.
- Specified by:
run in interface Extractor<Document>
- Parameters:
extractionParameters - the parameters to be applied during the extraction.
extractionContext - the document context.
in - the extractor input data.
out - the collector for the extracted data.
- Throws:
IOException - on error while reading from the input stream.
ExtractionException - on other errors, such as parse errors.
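The relationship between the final run method and the abstract extract method follows the template-method pattern. The sketch below models that shape with plain JDK types. RunOnceSketch, SketchExtractor, and RecordingExtractor are hypothetical names, and the explicit single-use guard is an assumption added for illustration: the documentation states the once-only contract but does not say whether it is enforced.

```java
// Hypothetical model of the run()/extract() split; not the Any23 types.
final class RunOnceSketch {

    abstract static class SketchExtractor {
        private boolean used;   // assumption: this sketch enforces single use itself
        private String input;

        /** Final entry point: stores the session state, then delegates to extract(). */
        public final void run(String in) {
            if (used) {
                throw new IllegalStateException("extractor instances are not reusable");
            }
            used = true;
            this.input = in;
            extract();
        }

        /** Accessor for subclasses, analogous in spirit to getHTMLDocument(). */
        protected String getInput() {
            return input;
        }

        /** Subclasses implement the format-specific work here. */
        protected abstract boolean extract();
    }

    /** Trivial subclass that only records what it was given. */
    static final class RecordingExtractor extends SketchExtractor {
        String seen;

        @Override
        protected boolean extract() {
            seen = getInput();
            return seen != null;
        }
    }
}
```

Keeping run final means subclasses cannot skip the state setup. They only supply extract and read the session state through the accessors.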
-
getCurrentExtractionResult
protected ExtractionResult getCurrentExtractionResult()
Returns the ExtractionResult associated with the extraction session.
- Returns:
a valid extraction result.
-
setCurrentExtractionResult
protected void setCurrentExtractionResult(ExtractionResult out)
-
openSubResult
protected ExtractionResult openSubResult(ExtractionContext context)
-
conditionallyAddStringProperty
protected boolean conditionallyAddStringProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI p, String value)
Helper method that adds a literal property to a subject only if the value of the property is a valid string.
- Parameters:
n - the HTML node from which the property value has been extracted.
subject - the property subject.
p - the property IRI.
value - the property value.
- Returns:
true if the value has been accepted and added, false otherwise.
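The "add only if valid" contract can be sketched without any RDF library. In the example below, "valid string" is assumed to mean non-null and non-empty after trimming; the documentation does not define the term, so that rule is a guess. ConditionalAddSketch and its statements list are hypothetical stand-ins for the extractor and its output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an extractor's output handling; not an Any23 class.
final class ConditionalAddSketch {

    /** Collected statements, rendered as simple strings for illustration. */
    final List<String> statements = new ArrayList<>();

    /**
     * Adds a literal statement only when the value passes the assumed check
     * (non-null, non-empty after trimming). Returns whether it was added.
     */
    boolean conditionallyAddStringProperty(String subject, String property, String value) {
        if (value == null) {
            return false;
        }
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        statements.add(subject + " " + property + " \"" + trimmed + "\"");
        return true;
    }
}
```

Returning a boolean lets a caller track whether any property was found for an entity, for example to decide whether an otherwise empty resource should be emitted at all.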
-
conditionallyAddLiteralProperty
protected boolean conditionallyAddLiteralProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.Literal literal)
Helper method that adds a literal property to a node.
- Parameters:
n - the HTML node from which the property value has been extracted.
subject - the property subject.
property - the property IRI.
literal - the property value.
- Returns:
true if the literal has been accepted and added, false otherwise.
-
conditionallyAddResourceProperty
protected boolean conditionallyAddResourceProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI uri)
Helper method that adds an IRI property to a node.
- Parameters:
subject - the property subject.
property - the property IRI.
uri - the property object.
- Returns:
true if the resource has been added, false otherwise.
-
addBNodeProperty
protected void addBNodeProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
Helper method that adds a BNode property to a node.
- Parameters:
n - the HTML node used for extracting such property.
subject - the property subject.
property - the property IRI.
bnode - the property value.
-
addBNodeProperty
protected void addBNodeProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
Helper method that adds a BNode property to a node.
- Parameters:
subject - the property subject.
property - the property IRI.
bnode - the property value.
-
addIRIProperty
protected void addIRIProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI object)
Helper method that adds an IRI property to a node.
- Parameters:
subject - subject to add
property - predicate to add
object - object to add
-
fixLink
protected org.eclipse.rdf4j.model.IRI fixLink(String link)
-
includes
public static boolean includes(Class<? extends MicroformatExtractor> including, Class<? extends MicroformatExtractor> included)
This method checks if there is a native nesting relationship between two MicroformatExtractor classes.
- Parameters:
including - the including MicroformatExtractor
included - the included MicroformatExtractor
- Returns:
true if there is a declared nesting relationship
- See Also:
Includes
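A declared nesting relationship of this kind can be modeled with a runtime annotation. The sketch below assumes the relationship is declared on the including class, which the See Also reference to Includes suggests but this page does not spell out. SketchIncludes, CardExtractor, and AddressExtractor are hypothetical names, and the lookup logic is illustrative rather than the library's own.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

// Hypothetical model of an annotation-declared nesting check; not Any23 code.
final class NestingSketch {

    /** Assumed shape: the including class lists the classes it may nest. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface SketchIncludes {
        Class<?>[] extractors();
    }

    static class AddressExtractor {}

    @SketchIncludes(extractors = {AddressExtractor.class})
    static class CardExtractor {}

    /** True only when 'including' explicitly declares 'included'. */
    static boolean includes(Class<?> including, Class<?> included) {
        SketchIncludes declared = including.getAnnotation(SketchIncludes.class);
        return declared != null
                && Arrays.asList(declared.extractors()).contains(included);
    }
}
```

In this model the check is directional: CardExtractor includes AddressExtractor, but the reverse lookup returns false because AddressExtractor carries no declaration.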
-
-