Package org.apache.any23.extractor.html
Class HTMLDocument
- java.lang.Object
-
- org.apache.any23.extractor.html.HTMLDocument
-
public class HTMLDocument extends Object
A wrapper around the DOM representation of an HTML document. Provides convenient access to various parts of the document.
- Author:
- Gabriele Renzi, Michele Mostarda
-
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
- static class HTMLDocument.TextField: Represents a text extracted from the HTML DOM, together with the node from which the text was retrieved.
-
Constructor Summary
Constructors (constructor, description):
- HTMLDocument(Node document): Constructor accepting the root node.
-
Method Summary
Methods (modifier and type, method, description):
- static String extractRelTag(String hrefAttributeContent): Extracts the href-specific rel-tag string.
- static String extractRelTag(NamedNodeMap attributes): Extracts the href-specific rel-tag string.
- HTMLDocument.TextField[] extractRelTagNodes(): Extracts all the rel tag nodes.
- String find(String xpath)
- List<Node> findAll(String xpath)
- List<Node> findAllByClassName(String clazz): Finds all the nodes by class name.
- Node findMicroformattedObjectNode(String objectTag, String name)
- String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)
- Node findNodeById(String id)
- String getDefaultLanguage(): Returns the document default language.
- Node getDocument()
- String[] getPathToLocalRoot(): Returns the sequence of ancestors from the document root to the local root (document).
- HTMLDocument.TextField[] getPluralTextField(String className): Returns a plural text field.
- HTMLDocument.TextField[] getPluralUrlField(String className): Returns the list of URLs associated with the fields marked with class className.
- HTMLDocument.TextField getSingularTextField(String className): Returns a singular text field.
- HTMLDocument.TextField getSingularUrlField(String className): Returns the URL associated with the field marked with class className.
- String getText(): Returns the text contained inside a node if it is a leaf, null otherwise.
- String readAttribute(String attribute): Reads an attribute while avoiding NullPointerExceptions; if the attribute is missing, returns an empty string.
- static String readNodeContent(Node node, boolean prettify): Reads the text content of the given node and returns it.
- static HTMLDocument.TextField readTextField(Node node): Reads a text field from the given node.
- static void readUrlField(List<HTMLDocument.TextField> res, Node node): Reads a URL field from the given node, adding the content to the given res list.
- org.eclipse.rdf4j.model.IRI resolveIRI(String uri)
-
-
-
Method Detail
-
readTextField
public static HTMLDocument.TextField readTextField(Node node)
Reads a text field from the given node.
- Parameters:
node - the node from which to read the content.
- Returns:
- a valid TextField
-
readUrlField
public static void readUrlField(List<HTMLDocument.TextField> res, Node node)
Reads a URL field from the given node, adding the content to the given res list.
- Parameters:
res - List of HTMLDocument.TextField
node - the node to read
-
extractRelTag
public static String extractRelTag(String hrefAttributeContent)
Extracts the href-specific rel-tag string. See the rel-tag specification.
- Parameters:
hrefAttributeContent - the content of the href attribute.
- Returns:
- the rel-tag specification.
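The rel-tag microformat defines the tag as the last path component of the link's href. The following is a minimal sketch of that rule using only the JDK; the class and method names (RelTagSketch, lastPathSegment) are hypothetical, and this is not necessarily how extractRelTag is implemented in Any23.

```java
import java.net.URI;

final class RelTagSketch {
    /** Returns the last non-empty path segment of href, or null if there is none. */
    static String lastPathSegment(String href) {
        // URI.getPath() decodes percent-escapes and is null for opaque URIs (e.g. mailto:).
        String path = URI.create(href).getPath();
        if (path == null) {
            return null;
        }
        // Ignore trailing slashes so ".../tags/tech/" still yields "tech".
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        if (end == 0) {
            return null;
        }
        int start = path.lastIndexOf('/', end - 1) + 1;
        return path.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment("http://example.org/tags/tech"));
    }
}
```

Whether the real method also validates the rel attribute or normalizes the tag further is not stated in this page, so treat the sketch as an illustration of the specification only.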
-
extractRelTag
public static String extractRelTag(NamedNodeMap attributes)
Extracts the href-specific rel-tag string. See the rel-tag specification.
- Parameters:
attributes - the list of attributes of a node.
- Returns:
- the rel-tag specification.
-
readNodeContent
public static String readNodeContent(Node node, boolean prettify)
Reads the text content of the given node and returns it. If the prettify flag is true, the text is cleaned up.
- Parameters:
node - the node whose content is read.
prettify - if true, blank chars will be removed.
- Returns:
- the read text.
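To make the effect of the prettify flag concrete, here is a minimal JDK-only sketch. It assumes that "cleaned up" means trimming the text and collapsing runs of whitespace into a single space; the exact cleanup performed by readNodeContent is not specified here. NodeContentSketch, readContent and parse are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeContentSketch {
    /** Returns the node's text; when prettify is true, whitespace is trimmed and collapsed (assumed behavior). */
    static String readContent(Node node, boolean prettify) {
        String text = node.getTextContent();
        if (text == null) {
            return "";
        }
        return prettify ? text.trim().replaceAll("\\s+", " ") : text;
    }

    /** Parses a well-formed XML/XHTML snippet into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node p = parse("<p>  Hello \n  <b>world</b> </p>").getDocumentElement();
        System.out.println("[" + readContent(p, true) + "]");
    }
}
```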
-
resolveIRI
public org.eclipse.rdf4j.model.IRI resolveIRI(String uri) throws ExtractionException
- Parameters:
uri - string to resolve to IRI
- Returns:
- An absolute IRI, or null if the IRI is not fixable
- Throws:
ExtractionException - if the base IRI is invalid
-
findMicroformattedValue
public String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)
-
getDocument
public Node getDocument()
-
getSingularTextField
public HTMLDocument.TextField getSingularTextField(String className)
Returns a singular text field.
- Parameters:
className - name of the class containing the text.
- Returns:
- the first value if multiple values are found; to check that there are no n-ary values, use the plural finder.
-
getPluralTextField
public HTMLDocument.TextField[] getPluralTextField(String className)
Returns a plural text field.
- Parameters:
className - name of the node class containing the text.
- Returns:
- list of fields.
-
getSingularUrlField
public HTMLDocument.TextField getSingularUrlField(String className)
Returns the URL associated with the field marked with class className.
- Parameters:
className - name of the node class containing the URL field.
- Returns:
- the first value if multiple values are found; to check that there are no n-ary values, use the plural finder.
-
getPluralUrlField
public HTMLDocument.TextField[] getPluralUrlField(String className)
Returns the list of URLs associated with the fields marked with class className.
- Parameters:
className - name of the node class containing the URL field.
- Returns:
- the list of HTMLDocument.TextField found.
-
findMicroformattedObjectNode
public Node findMicroformattedObjectNode(String objectTag, String name)
-
readAttribute
public String readAttribute(String attribute)
Reads an attribute while avoiding NullPointerExceptions; if the attribute is missing, it returns an empty string.
- Parameters:
attribute - the attribute name.
- Returns:
- the string representing the attribute.
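The null-safe contract described above can be sketched against the plain W3C DOM API as follows. AttributeSketch and its two-argument readAttribute are hypothetical; the real method takes only the attribute name and reads it from the wrapped document node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AttributeSketch {
    /** Returns the attribute value, or "" when the node has no attributes or lacks this one. */
    static String readAttribute(Node node, String name) {
        // getAttributes() is null for non-element nodes such as text nodes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return "";
        }
        Node attribute = attributes.getNamedItem(name);
        return attribute == null ? "" : attribute.getNodeValue();
    }

    /** Parses a well-formed XML/XHTML snippet into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node a = parse("<a href='x'>t</a>").getDocumentElement();
        System.out.println("[" + readAttribute(a, "title") + "]");
    }
}
```

Returning an empty string rather than null lets callers chain string operations without a null check, at the cost of not distinguishing a missing attribute from an empty one.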
-
findAllByClassName
public List<Node> findAllByClassName(String clazz)
Finds all the nodes by class name.
- Parameters:
clazz - the class name.
- Returns:
- list of matching nodes.
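One common way to find nodes by class name in a DOM is an XPath expression that matches whole tokens of the class attribute. The sketch below shows that technique with the JDK's javax.xml.xpath; it is an assumption that findAllByClassName behaves similarly, and ClassNameSketch is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClassNameSketch {
    /** Returns descendants of root whose class attribute contains clazz as a whole token. */
    static List<Node> findAllByClassName(Node root, String clazz) {
        if (clazz.indexOf('\'') >= 0) {
            throw new IllegalArgumentException("class name must not contain a quote");
        }
        // Padding with spaces makes "fn" match "fn org" but not "fnx".
        String expr = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' " + clazz + " ')]";
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses a well-formed XML/XHTML snippet into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div><span class='fn org'>A</span><span class='fnx'>B</span></div>");
        System.out.println(findAllByClassName(doc, "fn").size());
    }
}
```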
-
getText
public String getText()
Returns the text contained inside a node if it is a leaf, null otherwise.
- Returns:
- the text of a leaf node.
-
getDefaultLanguage
public String getDefaultLanguage()
Returns the document default language.
- Returns:
- default language if any, null otherwise.
-
getPathToLocalRoot
public String[] getPathToLocalRoot()
Returns the sequence of ancestors from the document root to the local root (document).
- Returns:
- a sequence of node names.
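As an illustration of what such an ancestor sequence can look like, the sketch below walks up from a node and collects the names of its element ancestors, outermost first. It assumes the node itself and the #document node are excluded; this page does not say whether getPathToLocalRoot makes the same choices. AncestorPathSketch and ancestorNames are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AncestorPathSketch {
    /** Returns the names of the element ancestors of node, ordered from the root element down to its parent. */
    static String[] ancestorNames(Node node) {
        Deque<String> names = new ArrayDeque<>();
        // Stop at the first non-element parent, which is the Document node at the top of the tree.
        for (Node parent = node.getParentNode();
                parent != null && parent.getNodeType() == Node.ELEMENT_NODE;
                parent = parent.getParentNode()) {
            names.addFirst(parent.getNodeName());
        }
        return names.toArray(new String[0]);
    }

    /** Parses a well-formed XML/XHTML snippet into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div><p>t</p></div></body></html>");
        Node p = doc.getElementsByTagName("p").item(0);
        System.out.println(String.join("/", ancestorNames(p)));
    }
}
```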
-
extractRelTagNodes
public HTMLDocument.TextField[] extractRelTagNodes()
Extracts all the rel tag nodes.
- Returns:
- list of rel tag nodes.
-
-