Package org.apache.any23.extractor.html
Class HTMLDocument
- java.lang.Object
-
- org.apache.any23.extractor.html.HTMLDocument
-
public class HTMLDocument extends Object
A wrapper around the DOM representation of an HTML document. Provides convenient access to various parts of the document.
- Author:
- Gabriele Renzi, Michele Mostarda
-
-
Nested Class Summary
Nested Classes
static class HTMLDocument.TextField
    Represents text extracted from the HTML DOM, together with the node from which that text was retrieved.
-
Constructor Summary
Constructors
HTMLDocument(Node document)
    Constructor accepting the root node.
-
Method Summary
Modifier and Type, Method, Description
static String extractRelTag(String hrefAttributeContent)
    Extracts the href specific rel-tag string.
static String extractRelTag(NamedNodeMap attributes)
    Extracts the href specific rel-tag string.
HTMLDocument.TextField[] extractRelTagNodes()
    Extracts all the rel tag nodes.
String find(String xpath)
List<Node> findAll(String xpath)
List<Node> findAllByClassName(String clazz)
    Finds all the nodes by class name.
Node findMicroformattedObjectNode(String objectTag, String name)
String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)
Node findNodeById(String id)
String getDefaultLanguage()
    Returns the document default language.
Node getDocument()
String[] getPathToLocalRoot()
    Returns the sequence of ancestors from the document root to the local root (document).
HTMLDocument.TextField[] getPluralTextField(String className)
    Returns a plural text field.
HTMLDocument.TextField[] getPluralUrlField(String className)
    Returns the list of URLs associated with the fields marked with class className.
HTMLDocument.TextField getSingularTextField(String className)
    Returns a singular text field.
HTMLDocument.TextField getSingularUrlField(String className)
    Returns the URL associated with the field marked with class className.
String getText()
    Returns the text contained inside a node if it is a leaf, null otherwise.
String readAttribute(String attribute)
    Reads an attribute while avoiding NullPointerExceptions; if the attribute is missing, it returns an empty string.
static String readNodeContent(Node node, boolean prettify)
    Reads the text content of the given node and returns it.
static HTMLDocument.TextField readTextField(Node node)
    Reads a text field from the given node.
static void readUrlField(List<HTMLDocument.TextField> res, Node node)
    Reads a URL field from the given node, adding the content to the given res list.
org.eclipse.rdf4j.model.IRI resolveIRI(String uri)
-
-
-
Method Detail
-
readTextField
public static HTMLDocument.TextField readTextField(Node node)
Reads a text field from the given node.
- Parameters:
node - the node from which to read the content.
- Returns:
- a valid TextField
-
readUrlField
public static void readUrlField(List<HTMLDocument.TextField> res, Node node)
Reads a URL field from the given node, adding the content to the given res list.
- Parameters:
res - List of HTMLDocument.TextField
node - the node to read
-
extractRelTag
public static String extractRelTag(String hrefAttributeContent)
Extracts the href specific rel-tag string. See the rel-tag specification.
- Parameters:
hrefAttributeContent - the content of the href attribute.
- Returns:
- the rel-tag string.
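For orientation: under the rel-tag microformat, the tag value is carried by the last path segment of the link's href. The standalone sketch below illustrates only that idea. The class RelTagSketch and the method lastPathSegment are hypothetical names, and this is not necessarily how extractRelTag is implemented (for example, the sketch does no percent-decoding).

```java
// Hypothetical sketch: derive a rel-tag style value from an href string.
final class RelTagSketch {

    // Returns the last path segment of href, ignoring any query string,
    // fragment, and trailing slashes. No percent-decoding is attempted.
    static String lastPathSegment(String href) {
        // Keep only the part before the first '?' or '#'.
        String path = href.split("[#?]", 2)[0];
        // Drop trailing slashes so ".../tags/java/" still yields "java".
        while (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        // lastIndexOf returns -1 when there is no slash; substring(0) then keeps everything.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment("http://example.org/tags/java/#top"));
    }
}
```

With the real class, the equivalent call would presumably be HTMLDocument.extractRelTag(href), passing the content of the href attribute.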
-
extractRelTag
public static String extractRelTag(NamedNodeMap attributes)
Extracts the href specific rel-tag string. See the rel-tag specification.
- Parameters:
attributes - the list of attributes of a node.
- Returns:
- the rel-tag string.
-
readNodeContent
public static String readNodeContent(Node node, boolean prettify)
Reads the text content of the given node and returns it. If the prettify flag is true, the text is cleaned up.
- Parameters:
node - the node whose content is read.
prettify - if true, blank characters will be removed.
- Returns:
- the read text.
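To make the effect of a prettify-style flag concrete, here is a small sketch built only on the JDK's DOM API. It assumes that "cleaning up" means collapsing runs of whitespace and trimming the ends; the actual rules applied by readNodeContent may differ. NodeContentSketch, readContent and parse are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: read a node's text, optionally normalizing whitespace.
final class NodeContentSketch {

    static String readContent(Node node, boolean prettify) {
        String text = node.getTextContent();
        if (text == null) {
            return ""; // some node types (e.g. Document) have no text content
        }
        // Assumption: "prettify" collapses whitespace runs and trims the ends.
        return prettify ? text.replaceAll("\\s+", " ").trim() : text;
    }

    // Parses well-formed markup into a DOM document (sample helper only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>  Hello\n   <b>world</b> </p>");
        System.out.println("[" + readContent(doc.getDocumentElement(), true) + "]");
    }
}
```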
-
resolveIRI
public org.eclipse.rdf4j.model.IRI resolveIRI(String uri) throws ExtractionException
- Parameters:
uri - string to resolve to IRI
- Returns:
- An absolute IRI, or null if the IRI is not fixable
- Throws:
ExtractionException - If the base IRI is invalid
-
findMicroformattedValue
public String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)
-
getDocument
public Node getDocument()
-
getSingularTextField
public HTMLDocument.TextField getSingularTextField(String className)
Returns a singular text field.
- Parameters:
className - name of the class containing the text.
- Returns:
- the first value if multiple values are found; to check for multiple (n-ary) values, use the plural finder.
-
getPluralTextField
public HTMLDocument.TextField[] getPluralTextField(String className)
Returns a plural text field.
- Parameters:
className - name of the class node containing text.
- Returns:
- list of fields.
-
getSingularUrlField
public HTMLDocument.TextField getSingularUrlField(String className)
Returns the URL associated with the field marked with class className.
- Parameters:
className - name of the node class containing the URL field.
- Returns:
- the first value if multiple values are found; to check for multiple (n-ary) values, use the plural finder.
-
getPluralUrlField
public HTMLDocument.TextField[] getPluralUrlField(String className)
Returns the list of URLs associated with the fields marked with class className.
- Parameters:
className - name of the node class containing the URL field.
- Returns:
- the list of HTMLDocument.TextField found.
-
findMicroformattedObjectNode
public Node findMicroformattedObjectNode(String objectTag, String name)
-
readAttribute
public String readAttribute(String attribute)
Reads an attribute while avoiding NullPointerExceptions; if the attribute is missing, it returns an empty string.
- Parameters:
attribute - the attribute name.
- Returns:
- the string representing the attribute.
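The null-safe contract described above can be sketched against the plain DOM API as follows. This is an illustration under assumptions, not the class's own code: AttributeSketch and its static readAttribute(Node, String) are hypothetical, whereas the real method takes only the attribute name and reads from the wrapped document node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: read an attribute without risking a NullPointerException.
final class AttributeSketch {

    static String readAttribute(Node node, String name) {
        // getAttributes() is null for non-element nodes such as text nodes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return "";
        }
        // getNamedItem() is null when the attribute is absent.
        Node attribute = attributes.getNamedItem(name);
        return attribute == null ? "" : attribute.getNodeValue();
    }

    // Parses well-formed markup into a DOM document (sample helper only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Node link = parse("<a href=\"/tag/java\">java</a>").getDocumentElement();
        System.out.println("href=[" + readAttribute(link, "href") + "]");
        System.out.println("rel=[" + readAttribute(link, "rel") + "]");
    }
}
```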
-
findAllByClassName
public List<Node> findAllByClassName(String clazz)
Finds all the nodes by class name.
- Parameters:
clazz - the class name.
- Returns:
- list of matching nodes.
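One way to picture a class-name lookup is a plain DOM traversal that compares whitespace-separated class tokens. The sketch below uses that approach with JDK classes only; it is a stand-in technique, and the library may well use a different mechanism (the class also exposes XPath-based finders). ClassNameSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: collect elements whose class attribute contains a given token.
final class ClassNameSketch {

    static List<Node> findAllByClassName(Document doc, String clazz) {
        List<Node> matches = new ArrayList<>();
        if (clazz == null || clazz.isEmpty()) {
            return matches;
        }
        // "*" matches every element in document order.
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            // getAttribute returns "" when the attribute is missing.
            for (String token : element.getAttribute("class").trim().split("\\s+")) {
                if (token.equals(clazz)) {
                    matches.add(element);
                    break; // count each element once
                }
            }
        }
        return matches;
    }

    // Parses well-formed markup into a DOM document (sample helper only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div><span class='fn org'>A</span><span class='fnx'>B</span></div>");
        System.out.println(findAllByClassName(doc, "fn").size());
    }
}
```

Matching on whole tokens rather than substrings keeps a class such as fnx from being reported when fn is requested.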
-
getText
public String getText()
Returns the text contained inside a node if it is a leaf, null otherwise.
- Returns:
- the text of a leaf node.
-
getDefaultLanguage
public String getDefaultLanguage()
Returns the document default language.
- Returns:
- default language if any, null otherwise.
-
getPathToLocalRoot
public String[] getPathToLocalRoot()
Returns the sequence of ancestors from the document root to the local root (document).
- Returns:
- a sequence of node names.
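A root-to-node path of this kind can be computed by walking parent links upward and prepending each node name. The sketch below shows that walk with JDK classes only. Whether the real method includes the local root itself in the returned sequence is not stated here, so including it is an assumption; PathSketch and pathFromRoot are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: list node names from the document root down to a given node.
final class PathSketch {

    static String[] pathFromRoot(Node node) {
        Deque<String> names = new ArrayDeque<>();
        // Walk upward; stop at the Document node, which is not part of the path.
        for (Node n = node; n != null && n.getNodeType() != Node.DOCUMENT_NODE; n = n.getParentNode()) {
            names.addFirst(n.getNodeName()); // prepend so the root ends up first
        }
        return names.toArray(new String[0]);
    }

    // Parses well-formed markup into a DOM document (sample helper only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div><p>x</p></div></body></html>");
        Node paragraph = doc.getElementsByTagName("p").item(0);
        System.out.println(String.join("/", pathFromRoot(paragraph)));
    }
}
```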
-
extractRelTagNodes
public HTMLDocument.TextField[] extractRelTagNodes()
Extracts all the rel tag nodes.
- Returns:
- list of rel tag nodes.
-
-