Package org.apache.any23.extractor.html
Class MicroformatExtractor
- java.lang.Object
-
- org.apache.any23.extractor.html.MicroformatExtractor
-
- All Implemented Interfaces:
Extractor<Document>, Extractor.TagSoupDOMExtractor
- Direct Known Subclasses:
EntityBasedMicroformatExtractor, HCalendarExtractor
public abstract class MicroformatExtractor extends Object implements Extractor.TagSoupDOMExtractor
The abstract base class for any Microformat specification extractor.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.apache.any23.extractor.Extractor
Extractor.BlindExtractor, Extractor.ContentExtractor, Extractor.TagSoupDOMExtractor
-
-
Field Summary
Fields
- static String BEGIN_SCRIPT
- static String END_SCRIPT
- protected Any23ValueFactoryWrapper valueFactory
-
Constructor Summary
Constructors
- MicroformatExtractor()
-
Method Summary
Methods
- protected void addBNodeProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
  Helper method that adds a BNode property to a node.
- protected void addBNodeProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
  Helper method that adds a BNode property to a node.
- protected void addIRIProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI object)
  Helper method that adds an IRI property to a node.
- protected boolean conditionallyAddLiteralProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.Literal literal)
  Helper method that adds a literal property to a node.
- protected boolean conditionallyAddResourceProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI uri)
  Helper method that adds an IRI property to a node.
- protected boolean conditionallyAddStringProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI p, String value)
  Helper method that adds a literal property to a subject only if the value of the property is a valid string.
- protected abstract boolean extract()
  Performs the extraction of the data and writes them to the model.
- protected org.eclipse.rdf4j.model.IRI fixLink(String link)
- protected org.eclipse.rdf4j.model.IRI fixLink(String link, String defaultSchema)
- protected ExtractionResult getCurrentExtractionResult()
  Returns the ExtractionResult associated with the extraction session.
- abstract ExtractorDescription getDescription()
  Returns the description of this extractor.
- org.eclipse.rdf4j.model.IRI getDocumentIRI()
- ExtractionContext getExtractionContext()
- HTMLDocument getHTMLDocument()
- static boolean includes(Class<? extends MicroformatExtractor> including, Class<? extends MicroformatExtractor> included)
  Checks whether there is a native nesting relationship between two MicroformatExtractor classes.
- protected ExtractionResult openSubResult(ExtractionContext context)
- void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, Document in, ExtractionResult out)
  Executes the extractor.
- protected void setCurrentExtractionResult(ExtractionResult out)
-
-
-
Field Detail
-
BEGIN_SCRIPT
public static final String BEGIN_SCRIPT
- See Also:
- Constant Field Values
-
END_SCRIPT
public static final String END_SCRIPT
- See Also:
- Constant Field Values
-
valueFactory
protected final Any23ValueFactoryWrapper valueFactory
-
-
Method Detail
-
getDescription
public abstract ExtractorDescription getDescription()
Returns the description of this extractor.
- Specified by:
getDescription in interface Extractor<Document>
- Returns:
- a human readable description.
-
extract
protected abstract boolean extract() throws ExtractionException
Performs the extraction of the data and writes them to the model. The nodes generated in the model can have any name or implicit label, but if possible they SHOULD have names (either URIs or AnonId) that are uniquely derivable from their position in the DOM tree, so that multiple extractors can merge information.
- Returns:
- true if extraction is successful
- Throws:
ExtractionException - if there is an error during extraction
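
The recommendation above, that node names be uniquely derivable from DOM position, can be illustrated with a minimal sketch. It uses only the JDK's org.w3c.dom API. The class DomPositionNames and its methods are hypothetical helpers written for this example and are not part of Any23, and the path format is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical helper, not part of Any23: derives a stable name from a node's DOM position. */
final class DomPositionNames {

    /** Builds a positional path such as /html[1]/body[1]/div[2] for an element node. */
    static String pathOf(Node node) {
        StringBuilder path = new StringBuilder();
        // Walk upwards until the parent is no longer an element (the Document node).
        for (Node n = node; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            int index = 1;
            // Count preceding sibling elements that share this element's name.
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    /** Parses a small well-formed document; a real extractor would receive its Document as input. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }
}
```

Because the path depends only on the node's position, two independent extractors visiting the same node would compute the same string and could use it (or a hash of it) as a shared node name.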
-
getHTMLDocument
public HTMLDocument getHTMLDocument()
-
getExtractionContext
public ExtractionContext getExtractionContext()
-
getDocumentIRI
public org.eclipse.rdf4j.model.IRI getDocumentIRI()
-
run
public final void run(ExtractionParameters extractionParameters, ExtractionContext extractionContext, Document in, ExtractionResult out) throws IOException, ExtractionException
Description copied from interface: Extractor
Executes the extractor. Will be invoked only once; extractors are not reusable.
- Specified by:
run in interface Extractor<Document>
- Parameters:
extractionParameters - the parameters to be applied during the extraction.
extractionContext - the document context.
in - the extractor input data.
out - the collector for the extracted data.
- Throws:
IOException - on error while reading from the input stream.
ExtractionException - on other errors, such as parse errors.
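
Since run is final and extract is abstract, the class appears to follow a template-method arrangement: a fixed entry point that subclasses complete with format-specific work. The sketch below shows that pattern with stand-in types only. SingleUseExtractor and UpperCaseExtractor are hypothetical, String and StringBuilder stand in for Document and ExtractionResult, and the explicit single-use guard is an assumption, since this page does not say how (or whether) reuse is rejected.

```java
import java.util.Locale;

/** Stand-in for the pattern only; names and bodies are hypothetical, not Any23 code. */
abstract class SingleUseExtractor {
    private String input;          // stands in for the input Document
    private StringBuilder output;  // stands in for the ExtractionResult
    private boolean used;

    /** Final entry point: stores the session state, then delegates to extract(). */
    public final void run(String in, StringBuilder out) {
        if (used) {
            // Assumed guard: one way to enforce "invoked only once".
            throw new IllegalStateException("extractors are not reusable");
        }
        used = true;
        this.input = in;
        this.output = out;
        extract();
    }

    protected String getInput() { return input; }

    protected StringBuilder getOutput() { return output; }

    /** Subclasses implement the format-specific work here. */
    protected abstract boolean extract();
}

/** Toy subclass that writes the upper-cased input to the output. */
final class UpperCaseExtractor extends SingleUseExtractor {
    @Override
    protected boolean extract() {
        getOutput().append(getInput().toUpperCase(Locale.ROOT));
        return true;
    }
}
```

Keeping run final means every subclass sees the same session setup, while accessor methods give extract() access to the stored input and output.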
-
getCurrentExtractionResult
protected ExtractionResult getCurrentExtractionResult()
Returns the ExtractionResult associated with the extraction session.
- Returns:
- a valid extraction result.
-
setCurrentExtractionResult
protected void setCurrentExtractionResult(ExtractionResult out)
-
openSubResult
protected ExtractionResult openSubResult(ExtractionContext context)
-
conditionallyAddStringProperty
protected boolean conditionallyAddStringProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI p, String value)
Helper method that adds a literal property to a subject only if the value of the property is a valid string.
- Parameters:
n - the HTML node from which the property value has been extracted.
subject - the property subject.
p - the property IRI.
value - the property value.
- Returns:
- true if the value has been accepted and added, false otherwise.
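
The conditional-add contract (add only when the value is acceptable, and report the outcome as a boolean) can be sketched in isolation. The class ConditionalAdd is hypothetical and stores statements as plain strings instead of RDF4J objects. Treating "valid" as non-null and non-blank after trimming is an assumption, because this page does not define validity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in: collects (subject, property, value) statements as strings. */
final class ConditionalAdd {
    final List<String> statements = new ArrayList<>();

    /**
     * Adds the statement only when the value is non-null and non-blank.
     * That reading of "valid string" is an assumption made for this sketch.
     */
    boolean conditionallyAddString(String subject, String property, String value) {
        if (value == null) {
            return false;
        }
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        statements.add(subject + " " + property + " \"" + trimmed + "\"");
        return true;
    }
}
```

Returning a boolean lets the caller track whether any property was actually emitted, for example to decide whether an extracted entity is worth keeping.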
-
conditionallyAddLiteralProperty
protected boolean conditionallyAddLiteralProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.Literal literal)
Helper method that adds a literal property to a node.
- Parameters:
n - the HTML node from which the property value has been extracted.
subject - the property subject.
property - the property IRI.
literal - the property value.
- Returns:
- true if the literal has been accepted and added, false otherwise.
-
conditionallyAddResourceProperty
protected boolean conditionallyAddResourceProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI uri)
Helper method that adds an IRI property to a node.
- Parameters:
subject - the property subject.
property - the property IRI.
uri - the property object.
- Returns:
- true if the resource has been added, false otherwise.
-
addBNodeProperty
protected void addBNodeProperty(Node n, org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
Helper method that adds a BNode property to a node.
- Parameters:
n - the HTML node used for extracting such property.
subject - the property subject.
property - the property IRI.
bnode - the property value.
-
addBNodeProperty
protected void addBNodeProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.BNode bnode)
Helper method that adds a BNode property to a node.
- Parameters:
subject - the property subject.
property - the property IRI.
bnode - the property value.
-
addIRIProperty
protected void addIRIProperty(org.eclipse.rdf4j.model.Resource subject, org.eclipse.rdf4j.model.IRI property, org.eclipse.rdf4j.model.IRI object)
Helper method that adds an IRI property to a node.
- Parameters:
subject - subject to add
property - predicate to add
object - object to add
-
fixLink
protected org.eclipse.rdf4j.model.IRI fixLink(String link)
-
includes
public static boolean includes(Class<? extends MicroformatExtractor> including, Class<? extends MicroformatExtractor> included)
Checks whether there is a native nesting relationship between two MicroformatExtractor classes.
- Parameters:
including - the including MicroformatExtractor
included - the included MicroformatExtractor
- Returns:
- true if there is a declared nesting relationship
- See Also:
Includes
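
A "declared nesting relationship" between classes is commonly expressed with a runtime annotation that is read back through reflection. The sketch below shows that general technique with JDK reflection only. The annotation Nests, the classes CardLike and GeoLike, and NestingCheck are all hypothetical stand-ins, and the sketch makes no claim about how the Includes annotation referenced above is actually defined or read.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical annotation standing in for a declared nesting relationship. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Nests {
    Class<?>[] value();
}

/** Hypothetical nested format. */
class GeoLike {}

/** Hypothetical format that declares GeoLike as nestable inside it. */
@Nests(GeoLike.class)
class CardLike {}

final class NestingCheck {
    /** Returns true if 'including' declares 'included' in its Nests annotation. */
    static boolean includes(Class<?> including, Class<?> included) {
        Nests nests = including.getAnnotation(Nests.class);
        if (nests == null) {
            return false; // no declaration at all
        }
        for (Class<?> candidate : nests.value()) {
            if (candidate.equals(included)) {
                return true;
            }
        }
        return false;
    }
}
```

In this sketch the relationship is directional: CardLike nests GeoLike, but the reverse check returns false because GeoLike carries no declaration.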
-
-