Class HTMLDocument


  • public class HTMLDocument
    extends Object
    A wrapper around the DOM representation of an HTML document. Provides convenience access to various parts of the document.
    Author:
    Gabriele Renzi, Michele Mostarda
    • Constructor Detail

      • HTMLDocument

        public HTMLDocument(Node document)
        Constructor accepting the root node.
        Parameters:
        document - the root Node of the DOM tree to wrap.
    • Method Detail

      • readTextField

        public static HTMLDocument.TextField readTextField(Node node)
        Reads a text field from the given node.
        Parameters:
        node - the node from which to read the content.
        Returns:
        a valid TextField
      • extractRelTag

        public static String extractRelTag(String hrefAttributeContent)
        Extracts the rel-tag string from the content of an href attribute. See the rel-tag specification.
        Parameters:
        hrefAttributeContent - the content of the href attribute.
        Returns:
        the rel-tag string.
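
        The rel-tag microformat takes the tag from the last path segment of the link target. The following is a minimal sketch of that rule using only the JDK. It is not the implementation used by this class; the class name RelTagSketch and the method lastPathSegment are hypothetical, and URL-decoding of the segment is deliberately left out.

```java
class RelTagSketch {

    /**
     * Returns the last non-empty path segment of an href,
     * ignoring any query string or fragment.
     * Hypothetical helper, not part of the documented API.
     */
    static String lastPathSegment(String href) {
        // Keep only the part before the first '?' or '#'.
        String path = href.split("[?#]", 2)[0];

        // Drop trailing slashes so ".../tags/java/" still yields "java".
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        path = path.substring(0, end);

        // Everything after the last remaining slash is the tag.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment("http://example.org/tags/java/"));
        System.out.println(lastPathSegment("http://example.org/tag/tech?page=2#top"));
    }
}
```

        A production version would also need to decode percent-escapes in the segment, which this sketch skips.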
      • extractRelTag

        public static String extractRelTag(NamedNodeMap attributes)
        Extracts the rel-tag string from the href found among the given node attributes. See the rel-tag specification.
        Parameters:
        attributes - the attributes of a node.
        Returns:
        the rel-tag string.
      • readNodeContent

        public static String readNodeContent(Node node,
                                             boolean prettify)
        Reads the text content of the given node and returns it. If the prettify flag is true, the text is cleaned up.
        Parameters:
        node - the node whose content is read.
        prettify - if true, blank characters are removed.
        Returns:
        the read text.
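
        As an illustration of what a prettify flag can do, the sketch below reads a node's text with the standard DOM API and, when asked, collapses runs of whitespace and trims the ends. This is an assumption about the cleanup, not the behaviour of readNodeContent itself; NodeContentSketch, readContent and parse are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeContentSketch {

    /** Returns the node's text; with prettify, whitespace runs become one space. */
    static String readContent(Node node, boolean prettify) {
        String text = node.getTextContent();
        if (!prettify) {
            return text;
        }
        // Assumed cleanup: collapse whitespace runs, then trim both ends.
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Parses well-formed markup and returns its root element (sample helper). */
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Node p = parse("<p>  Hello\n   <b>world</b>  </p>");
        System.out.println("[" + readContent(p, true) + "]"); // prints "[Hello world]"
    }
}
```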
      • resolveIRI

        public org.eclipse.rdf4j.model.IRI resolveIRI(String uri)
                                               throws ExtractionException
        Resolves the given string to an absolute IRI.
        Parameters:
        uri - the string to resolve to an IRI.
        Returns:
        an absolute IRI, or null if the IRI is not fixable.
        Throws:
        ExtractionException - If the base IRI is invalid
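
        The contract above (null for an unusable reference, an exception for a bad base) can be sketched with java.net.URI. This is only an analogy under stated assumptions: it uses URI instead of the RDF4J IRI type, IllegalArgumentException stands in for ExtractionException, and ResolveSketch and resolve are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ResolveSketch {

    /**
     * Resolves ref against base.
     * Returns null when ref cannot be parsed; throws when base is unusable.
     */
    static URI resolve(String base, String ref) {
        URI baseUri;
        try {
            baseUri = new URI(base);
        } catch (URISyntaxException e) {
            // Stand-in for the "base IRI is invalid" failure.
            throw new IllegalArgumentException("invalid base: " + base, e);
        }
        if (!baseUri.isAbsolute()) {
            throw new IllegalArgumentException("base is not absolute: " + base);
        }
        try {
            // An absolute ref is returned as is; a relative one is merged with the base.
            return baseUri.resolve(new URI(ref.trim()));
        } catch (URISyntaxException e) {
            // Stand-in for the "not fixable" case.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.org/a/b.html", "../c.html"));
        System.out.println(resolve("http://example.org/", "not a valid uri"));
    }
}
```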
      • findNodeById

        public Node findNodeById(String id)
      • getDocument

        public Node getDocument()
      • getSingularTextField

        public HTMLDocument.TextField getSingularTextField(String className)
        Returns a singular text field.
        Parameters:
        className - name of the class containing the text.
        Returns:
        the matching text field. If multiple values are found, only the first is returned; to check that there are no multiple values, use getPluralTextField.
      • getPluralTextField

        public HTMLDocument.TextField[] getPluralTextField(String className)
        Returns a plural text field.
        Parameters:
        className - name of the node class containing the text.
        Returns:
        an array of the fields found.
      • getSingularUrlField

        public HTMLDocument.TextField getSingularUrlField(String className)
        Returns the URL associated with the field marked with class className.
        Parameters:
        className - name of the node class containing the URL field.
        Returns:
        the matching URL field. If multiple values are found, only the first is returned; to check that there are no multiple values, use getPluralUrlField.
      • getPluralUrlField

        public HTMLDocument.TextField[] getPluralUrlField(String className)
        Returns the URLs associated with the fields marked with class className.
        Parameters:
        className - name of the node class containing the URL field.
        Returns:
        an array of the HTMLDocument.TextField instances found.
      • findMicroformattedObjectNode

        public Node findMicroformattedObjectNode(String objectTag,
                                                 String name)
      • readAttribute

        public String readAttribute(String attribute)
        Reads an attribute while avoiding NullPointerExceptions: if the attribute is missing, an empty string is returned.
        Parameters:
        attribute - the attribute name.
        Returns:
        the attribute value, or an empty string if the attribute is missing.
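
        The null-safe pattern described above can be sketched with the standard DOM API as follows. Unlike readAttribute, this hypothetical helper takes the node explicitly; AttributeSketch and newDocument are names invented for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class AttributeSketch {

    /** Returns the attribute value, or "" when the node or attribute has none. */
    static String readAttribute(Node node, String name) {
        // getAttributes() is null for non-element nodes such as text nodes.
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return "";
        }
        // getNamedItem() is null when the attribute is absent.
        Node attribute = attributes.getNamedItem(name);
        return attribute == null ? "" : attribute.getNodeValue();
    }

    /** Creates an empty DOM document for building sample nodes. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element link = doc.createElement("a");
        link.setAttribute("href", "http://example.org/");
        System.out.println(readAttribute(link, "href"));            // the href value
        System.out.println(readAttribute(link, "rel").isEmpty());   // prints "true"
    }
}
```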
      • findAllByClassName

        public List<Node> findAllByClassName(String clazz)
        Finds all the nodes by class name.
        Parameters:
        clazz - the class name.
        Returns:
        list of matching nodes.
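
        A class-name lookup has to treat the class attribute as a whitespace-separated list of tokens, so that "fn" matches class="fn n" but not class="fnx". The sketch below shows one way to do that with a plain recursive DOM walk. It is an illustration only, not how this class performs the search; ClassNameSketch and its methods are hypothetical, and the starting node is passed in explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ClassNameSketch {

    /** Collects, in document order, every element whose class list contains clazz. */
    static List<Node> findAllByClassName(Node root, String clazz) {
        List<Node> matches = new ArrayList<>();
        collect(root, clazz, matches);
        return matches;
    }

    private static void collect(Node node, String clazz, List<Node> matches) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // getAttribute returns "" when the attribute is absent.
            String classes = ((Element) node).getAttribute("class");
            if (Arrays.asList(classes.trim().split("\\s+")).contains(clazz)) {
                matches.add(node);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, clazz, matches);
        }
    }

    /** Parses well-formed markup into a DOM document (sample helper). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div><span class=\"fn n\">A</span><i class=\"fnx\">C</i></div>");
        System.out.println(findAllByClassName(doc, "fn").size()); // prints "1"
    }
}
```

        Taking the first element of such a result corresponds to the singular lookups described earlier, which return only the first of several matches.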
      • getText

        public String getText()
        Returns the text contained inside a node if leaf, null otherwise.
        Returns:
        the text of a leaf node.
      • getDefaultLanguage

        public String getDefaultLanguage()
        Returns the document default language.
        Returns:
        default language if any, null otherwise.
      • getPathToLocalRoot

        public String[] getPathToLocalRoot()
        Returns the sequence of ancestors from the document root to the local root (document).
        Returns:
        a sequence of node names.
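
        One way to build such a sequence of node names is to walk the parent chain and prepend each name. The sketch below does that with the standard DOM API. Whether getPathToLocalRoot includes the local root itself is not stated here, so including the starting node is an assumption; PathSketch and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PathSketch {

    /** Returns node names from the outermost element down to the given node. */
    static String[] pathFromRoot(Node node) {
        Deque<String> names = new ArrayDeque<>();
        // Walk upwards, stopping before the Document node itself.
        for (Node n = node; n != null && n.getNodeType() != Node.DOCUMENT_NODE; n = n.getParentNode()) {
            names.addFirst(n.getNodeName());
        }
        return names.toArray(new String[0]);
    }

    /** Parses well-formed markup into a DOM document (sample helper). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><div><p>x</p></div></body></html>");
        Node p = doc.getElementsByTagName("p").item(0);
        System.out.println(String.join("/", pathFromRoot(p))); // prints "html/body/div/p"
    }
}
```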
      • extractRelTagNodes

        public HTMLDocument.TextField[] extractRelTagNodes()
        Extracts all the rel tag nodes.
        Returns:
        an array of rel-tag nodes.