java.lang.Object
- org.apache.any23.extractor.html.HTMLDocument

```
public class HTMLDocument
extends Object
```
A wrapper around the DOM representation of an HTML document. Provides convenience access to various parts of the document.

Author:

Gabriele Renzi, Michele Mostarda

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`HTMLDocument.TextField`	Represents text extracted from the HTML DOM, together with the node from which that text was retrieved.

Constructor Summary

Constructors
Constructor	Description
`HTMLDocument(Node document)`	Constructor accepting the root node.

Method Summary

Modifier and Type	Method	Description
`static String`	`extractRelTag(String hrefAttributeContent)`	Extracts the href-specific rel-tag string.
`static String`	`extractRelTag(NamedNodeMap attributes)`	Extracts the href-specific rel-tag string.
`HTMLDocument.TextField[]`	`extractRelTagNodes()`	Extracts all the `rel` tag nodes.
`String`	`find(String xpath)`
`List<Node>`	`findAll(String xpath)`
`List<Node>`	`findAllByClassName(String clazz)`	Finds all the nodes by class name.
`Node`	`findMicroformattedObjectNode(String objectTag, String name)`
`String`	`findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)`
`Node`	`findNodeById(String id)`
`String`	`getDefaultLanguage()`	Returns the document default language.
`Node`	`getDocument()`
`String[]`	`getPathToLocalRoot()`	Returns the sequence of ancestors from the document root to the local root (document).
`HTMLDocument.TextField[]`	`getPluralTextField(String className)`	Returns a plural text field.
`HTMLDocument.TextField[]`	`getPluralUrlField(String className)`	Returns the list of URLs associated with the fields marked with class className.
`HTMLDocument.TextField`	`getSingularTextField(String className)`	Returns a singular text field.
`HTMLDocument.TextField`	`getSingularUrlField(String className)`	Returns the URL associated with the field marked with class className.
`String`	`getText()`	Returns the text contained inside a node if leaf, `null` otherwise.
`String`	`readAttribute(String attribute)`	Reads an attribute without risking a NullPointerException; if the attribute is missing, an empty string is returned.
`static String`	`readNodeContent(Node node, boolean prettify)`	Reads the text content of the given node and returns it.
`static HTMLDocument.TextField`	`readTextField(Node node)`	Reads a text field from the given node and returns it.
`static void`	`readUrlField(List<HTMLDocument.TextField> res, Node node)`	Reads a URL field from the given node, adding the content to the given res list.
`org.eclipse.rdf4j.model.IRI`	`resolveIRI(String uri)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - HTMLDocument
```
public HTMLDocument(Node document)
```
    Constructor accepting the root node.
    
    Parameters:
    
    document - a Node
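
    The constructor takes an `org.w3c.dom.Node`, so a DOM tree has to exist before an `HTMLDocument` can wrap it. The sketch below shows one way to obtain such a node using only the JDK parser; the class name `DomRootSketch` and its helper methods are hypothetical, and the input is assumed to be well-formed XHTML (real-world HTML usually needs a lenient parser first). The constructor call itself is left as a comment because it requires Any23 on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomRootSketch {

    /** Parses well-formed XHTML and returns the DOM document node. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Covers parser configuration, I/O and well-formedness errors.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Name of the document element; used here only to inspect the result. */
    static String rootElementName(String xhtml) {
        return ((Document) parse(xhtml)).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        String markup = "<html><body><p class=\"fn\">Ada</p></body></html>";
        Node root = parse(markup);
        // With Any23 on the classpath, this node could be passed to the constructor:
        // HTMLDocument doc = new HTMLDocument(root);
        System.out.println(root.getNodeType() == Node.DOCUMENT_NODE);
        System.out.println(rootElementName(markup));
    }
}
```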
- Method Detail
  - readTextField
```
public static HTMLDocument.TextField readTextField(Node node)
```
    Reads a text field from the given node and returns it.
    
    Parameters:
    
    node - the node from which to read the content.
    
    Returns:
    
    a valid TextField
  - readUrlField
```
public static void readUrlField(List<HTMLDocument.TextField> res,
                                Node node)
```
    Reads a URL field from the given node, adding the content to the given res list.
    
    Parameters:
    
    res - List of HTMLDocument.TextField
    
    node - the node to read
  - extractRelTag
```
public static String extractRelTag(String hrefAttributeContent)
```
    Extracts the href-specific rel-tag string. See the rel-tag specification.
    
    Parameters:
    
    hrefAttributeContent - the content of the href attribute.
    
    Returns:
    
    the rel-tag string extracted from the href.
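
    In the rel-tag microformat, the tag is carried by the last path segment of the link's `href`. The sketch below illustrates that convention with plain string handling; `RelTagSketch` and `lastPathSegment` are hypothetical names, and the handling of query strings, fragments and trailing slashes is an assumption of this sketch, not a statement about how `extractRelTag` treats them.

```java
final class RelTagSketch {

    /** Returns the last non-empty path segment of an href, ignoring query and fragment. */
    static String lastPathSegment(String href) {
        // Drop anything from the first '?' or '#' onwards.
        String path = href.split("[#?]", 2)[0];
        // Strip trailing slashes so that ".../tag/java/" still yields "java".
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        path = path.substring(0, end);
        // lastIndexOf returns -1 when there is no slash, so the whole string is kept.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment("http://technorati.com/tag/tech"));
        System.out.println(lastPathSegment("http://example.org/tags/java/?page=2#top"));
    }
}
```

    Percent-decoding of the segment is deliberately left out to keep the sketch short.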
  - extractRelTag
```
public static String extractRelTag(NamedNodeMap attributes)
```
    Extracts the href-specific rel-tag string. See the rel-tag specification.
    
    Parameters:
    
    attributes - the list of attributes of a node.
    
    Returns:
    
    the rel-tag string extracted from the href attribute.
  - readNodeContent
```
public static String readNodeContent(Node node,
                                     boolean prettify)
```
    Reads the text content of the given node and returns it. If the prettify flag is true the text is cleaned up.
    
    Parameters:
    
    node - the node whose content is read.
    
    prettify - if true, blank characters are removed.
    
    Returns:
    
    the read text.
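
    To make the effect of a clean-up flag concrete, the sketch below reads a node's text with the standard DOM `getTextContent()` call and, when asked, collapses runs of whitespace and trims the ends. This is only one plausible reading of "cleaned up"; the exact rules applied by `readNodeContent` are not stated here, and `NodeTextSketch` with its methods is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeTextSketch {

    /** Returns the node's text; with prettify, whitespace runs become one space. */
    static String read(Node node, boolean prettify) {
        String text = node.getTextContent();
        if (text == null) {
            return ""; // document nodes and similar have no text content
        }
        return prettify ? text.replaceAll("\\s+", " ").trim() : text;
    }

    /** Test helper: parses the XML and reads the text of its root element. */
    static String readFromXml(String xml, boolean prettify) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return read(root, prettify);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>  Hello\n   <b>world</b>  </p>";
        System.out.println("[" + readFromXml(xml, true) + "]");
        System.out.println("[" + readFromXml(xml, false) + "]");
    }
}
```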
  - resolveIRI
```
public org.eclipse.rdf4j.model.IRI resolveIRI(String uri)
                                       throws ExtractionException
```
    Parameters:
    
    uri - string to resolve to IRI
    
    Returns:
    
    An absolute IRI, or null if the IRI is not fixable
    
    Throws:
    
    ExtractionException - If the base IRI is invalid
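
    The contract above (absolute result, `null` when the reference cannot be repaired, an exception when the base is invalid) can be illustrated with `java.net.URI`. This is a stand-in sketch, not the RDF4J-based resolution Any23 performs: `IriResolveSketch` is a hypothetical name, it returns a `String` instead of an `IRI`, it makes no attempt to fix malformed references, and it throws `IllegalArgumentException` where the real method declares `ExtractionException`.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class IriResolveSketch {

    /**
     * Resolves a reference against a base.
     * Returns null if the reference is not a valid URI;
     * throws IllegalArgumentException if the base is invalid or relative.
     */
    static String resolve(String baseIri, String reference) {
        URI base = URI.create(baseIri); // IllegalArgumentException on bad syntax
        if (!base.isAbsolute()) {
            throw new IllegalArgumentException("Base must be absolute: " + baseIri);
        }
        try {
            return base.resolve(new URI(reference.trim())).toString();
        } catch (URISyntaxException e) {
            return null; // this sketch does not try to repair the reference
        }
    }

    public static void main(String[] args) {
        String base = "http://example.org/a/b.html";
        System.out.println(resolve(base, "../c"));
        System.out.println(resolve(base, "img/logo.png"));
        System.out.println(resolve(base, "has space"));
    }
}
```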
  - find
```
public String find(String xpath)
```
  - findNodeById
```
public Node findNodeById(String id)
```
  - findAll
```
public List<Node> findAll(String xpath)
```
  - findMicroformattedValue
```
public String findMicroformattedValue(String objectTag,
                                      String object,
                                      String fieldTag,
                                      String field,
                                      String key)
```
  - getDocument
```
public Node getDocument()
```
  - getSingularTextField
```
public HTMLDocument.TextField getSingularTextField(String className)
```
    Returns a singular text field.
    
    Parameters:
    
    className - name of class containing text.
    
    Returns:
    
    the first matching field. If multiple values are found, only the first is returned; to check that there are no n-ary values, use the plural finder (getPluralTextField).
  - getPluralTextField
```
public HTMLDocument.TextField[] getPluralTextField(String className)
```
    Returns a plural text field.
    
    Parameters:
    
    className - name of the node class containing the text.
    
    Returns:
    
    list of fields.
  - getSingularUrlField
```
public HTMLDocument.TextField getSingularUrlField(String className)
```
    Returns the URL associated with the field marked with class className.
    
    Parameters:
    
    className - name of node class containing the URL field.
    
    Returns:
    
    the first matching field. If multiple values are found, only the first is returned; to check that there are no n-ary values, use the plural finder (getPluralUrlField).
  - getPluralUrlField
```
public HTMLDocument.TextField[] getPluralUrlField(String className)
```
    Returns the list of URLs associated with the fields marked with class className.
    
    Parameters:
    
    className - name of node class containing the URL field.
    
    Returns:
    
    the list of HTMLDocument.TextField found.
  - findMicroformattedObjectNode
```
public Node findMicroformattedObjectNode(String objectTag,
                                         String name)
```
  - readAttribute
```
public String readAttribute(String attribute)
```
    Reads an attribute without risking a NullPointerException; if the attribute is missing, an empty string is returned.
    
    Parameters:
    
    attribute - the attribute name.
    
    Returns:
    
    the string representing the attribute.
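
    The null-safe behaviour described above maps onto two checks in the standard DOM API: `getAttributes()` returns `null` for non-element nodes, and `getNamedItem()` returns `null` for a missing attribute. The sketch below shows that pattern on a bare `Node`; `AttributeSketch` and its methods are hypothetical and are not taken from the Any23 source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AttributeSketch {

    /** Returns the attribute value, or "" when the node or attribute has none. */
    static String readAttribute(Node node, String name) {
        NamedNodeMap attributes = node.getAttributes(); // null for non-elements
        if (attributes == null) {
            return "";
        }
        Node attribute = attributes.getNamedItem(name); // null when missing
        return attribute == null ? "" : attribute.getNodeValue();
    }

    /** Test helper: parses the XML and reads an attribute of its root element. */
    static String readFromXml(String xml, String name) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return readAttribute(root, name);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a href=\"/tags/java\">java</a>";
        System.out.println("[" + readFromXml(xml, "href") + "]");
        System.out.println("[" + readFromXml(xml, "rel") + "]");
    }
}
```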
  - findAllByClassName
```
public List<Node> findAllByClassName(String clazz)
```
    Finds all the nodes by class name.
    
    Parameters:
    
    clazz - the class name.
    
    Returns:
    
    list of matching nodes.
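
    Matching by class name means matching one whitespace-separated token of the `class` attribute, not a substring. A common way to express that is the XPath idiom below, evaluated here with the JDK's `javax.xml.xpath` API. How `findAllByClassName` is implemented internally is not stated in this page, so this is a sketch of the technique only; `ClassNameLookupSketch` and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClassNameLookupSketch {

    /** Returns every element under root whose class attribute contains the token. */
    static List<Node> findAllByClassName(Node root, String className) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $cls as a variable so the class name is never spliced into the query.
        xpath.setXPathVariableResolver(
                name -> "cls".equals(name.getLocalPart()) ? className : null);
        // Padding both sides with spaces makes "fn" match "url fn" but not "fnx".
        String expression = "descendant-or-self::*[contains("
                + "concat(' ', normalize-space(@class), ' '), concat(' ', $cls, ' '))]";
        try {
            NodeList hits = (NodeList) xpath.evaluate(expression, root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Test helper: parses the XML and counts the matching elements. */
    static int countByClassName(String xml, String className) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return findAllByClassName(root, className).size();
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><span class=\"fn n\">A</span>"
                + "<span class=\"fnx\">B</span><p class=\"url fn\">C</p></div>";
        System.out.println(countByClassName(xml, "fn"));
    }
}
```

    The singular getters described earlier correspond to taking the first element of such a result, the plural getters to keeping all of them.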
  - getText
```
public String getText()
```
    Returns the text contained inside a node if leaf, null otherwise.
    
    Returns:
    
    the text of a leaf node.
  - getDefaultLanguage
```
public String getDefaultLanguage()
```
    Returns the document default language.
    
    Returns:
    
    default language if any, null otherwise.
  - getPathToLocalRoot
```
public String[] getPathToLocalRoot()
```
    Returns the sequence of ancestors from the document root to the local root (document).
    
    Returns:
    
    a sequence of node names.
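
    A path of ancestor names can be built by following `getParentNode()` upwards and reversing the order. The sketch below does that for an arbitrary DOM node; it assumes that only element ancestors are listed and that the starting node itself is excluded, which may differ from what `getPathToLocalRoot` actually returns. `AncestorPathSketch` and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AncestorPathSketch {

    /** Names of the element ancestors of the node, outermost first. */
    static String[] ancestorNames(Node localRoot) {
        Deque<String> names = new ArrayDeque<>();
        for (Node n = localRoot.getParentNode(); n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.addFirst(n.getNodeName()); // prepend to end up root-first
            }
        }
        return names.toArray(new String[0]);
    }

    /** Test helper: path to the first element with the given tag, joined by '/'. */
    static String pathTo(String xml, String tagName) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node target = dom.getElementsByTagName(tagName).item(0);
            if (target == null) {
                throw new IllegalArgumentException("No element named " + tagName);
            }
            return String.join("/", ancestorNames(target));
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><div><p>x</p></div></body></html>";
        System.out.println(pathTo(xml, "p"));
    }
}
```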
  - extractRelTagNodes
```
public HTMLDocument.TextField[] extractRelTagNodes()
```
    Extracts all the rel tag nodes.
    
    Returns:
    
    list of rel tag nodes.
