Class DomUtils


  • public class DomUtils
    extends Object
    This class provides utility methods for DOM manipulation. It is separated from HTMLDocument so that its methods can be run on single DOM nodes without having to wrap them into an HTMLDocument.

    We use a mix of XPath and DOM manipulation.

    This is likely to be a performance bottleneck but at least everything is localized here.
    • Method Detail

      • getIndexInParent

        public static int getIndexInParent​(Node n)
        Returns the index of the given node within the list of children of its parent node.
        Parameters:
        n - the node whose index is returned.
        Returns:
        a non-negative index.
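
        The lookup described above can be sketched with the plain org.w3c.dom API. This is an illustrative sketch, not the DomUtils implementation: the class IndexInParentSketch and its methods are hypothetical names, and it assumes the index counts every child node (text nodes included) and that a node without a parent is reported at index 0.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class IndexInParentSketch {

    /** Returns the position of n among all children of its parent. */
    static int indexInParent(Node n) {
        Node parent = n.getParentNode();
        if (parent == null) {
            return 0; // assumption: a node without a parent is reported at index 0
        }
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).isSameNode(n)) {
                return i;
            }
        }
        throw new IllegalStateException("Node not found among the children of its parent.");
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```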
      • getXPathForNode

        public static String getXPathForNode​(Node node)
        Walks the DOM tree from the given node up to the root to generate a unique XPath expression leading to this node. The generated XPath is the canonical one based on sibling index, for example /html[1]/body[1]/div[2]/span[3].
        Parameters:
        node - the input node.
        Returns:
        the XPath location of node as String.
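
        A reverse walk of this kind might look like the following sketch. It is not the DomUtils code: XPathForNodeSketch and xpathFor are hypothetical names, only element nodes are handled, and the sibling index is assumed to count preceding siblings with the same element name, starting at 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class XPathForNodeSketch {

    /** Builds /name[i]/name[j]/... by walking from the element up to the root. */
    static String xpathFor(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node cur = node;
             cur != null && cur.getNodeType() == Node.ELEMENT_NODE;
             cur = cur.getParentNode()) {
            int position = 1; // XPath positions are 1-based
            for (Node sib = cur.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
                if (sib.getNodeType() == Node.ELEMENT_NODE
                        && sib.getNodeName().equals(cur.getNodeName())) {
                    position++;
                }
            }
            // Prepend, because the walk goes from the leaf towards the root.
            path.insert(0, "/" + cur.getNodeName() + "[" + position + "]");
        }
        return path.toString();
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

        For the third span inside the second div of <html><body><div/><div><span/><p/><span/><span/></div></body></html>, this sketch yields /html[1]/body[1]/div[2]/span[3].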
      • getXPathListForNode

        public static String[] getXPathListForNode​(Node n)
        Returns a list of tag names representing the path from the document root to the given node n.
        Parameters:
        n - the node for which to retrieve the path.
        Returns:
        a sequence of HTML tag names.
      • getNodeLocation

        public static int[] getNodeLocation​(Node n)
        Returns the row/col location of the given node.
        Parameters:
        n - input node.
        Returns:
        an array of four elements of the form [<begin-row>, <begin-col>, <end-row>, <end-col>], or null if the location cannot be extracted.
      • isAncestorOf

        public static boolean isAncestorOf​(Node candidateAncestor,
                                           Node candidateSibling,
                                           boolean strict)
        Checks whether a node is an ancestor of, or the same node as, another node.
        Parameters:
        candidateAncestor - the candidate ancestor node.
        candidateSibling - the candidate descendant node.
        strict - if true, the two nodes are not allowed to be the same node.
        Returns:
        true if candidateAncestor is an ancestor of candidateSibling, false otherwise.
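
        The role of the strict flag can be illustrated with a short parent-chain walk. This is a sketch under assumptions, not the DomUtils implementation: AncestorSketch is a hypothetical class, and node identity is assumed to be checked with Node.isSameNode.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class AncestorSketch {

    /** Walks up from the descendant; with strict=true the walk starts at its parent. */
    static boolean isAncestorOf(Node candidateAncestor, Node candidateDescendant, boolean strict) {
        Node cur = strict ? candidateDescendant.getParentNode() : candidateDescendant;
        while (cur != null) {
            if (cur.isSameNode(candidateAncestor)) {
                return true;
            }
            cur = cur.getParentNode();
        }
        return false;
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```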
      • isAncestorOf

        public static boolean isAncestorOf​(Node candidateAncestor,
                                           Node candidateSibling)
        Checks whether a node is an ancestor of, or the same node as, another node. Equivalent to isAncestorOf(org.w3c.dom.Node, org.w3c.dom.Node, boolean) with strict=false.
        Parameters:
        candidateAncestor - the candidate ancestor node.
        candidateSibling - the candidate descendant node.
        Returns:
        true if candidateAncestor is an ancestor of candidateSibling or the same node, false otherwise.
      • findAllByClassName

        public static List<Node> findAllByClassName​(Node root,
                                                    String className)
        Finds all nodes that have a declared class. Note that the className is transformed to lower case before being matched against the DOM.
        Parameters:
        root - the root node from which to start searching.
        className - the name of the filtered class.
        Returns:
        list of matching nodes or an empty list.
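
        Since the class mixes XPath and DOM manipulation, a class-name search could be expressed with the standard javax.xml.xpath API as in the sketch below. This is not the DomUtils code: ClassNameSearchSketch is a hypothetical class, the XPath idiom is one common way to match a whitespace-separated class token, and only the className argument is lower-cased (class values in the DOM are assumed to be lower case already).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class ClassNameSearchSketch {

    /** Finds elements at or below root whose class attribute contains className as a token. */
    static List<Node> findAllByClassName(Node root, String className) {
        // Reject quotes and whitespace so the name can be embedded in the expression safely.
        if (!className.matches("[^\\s'\"]+")) {
            throw new IllegalArgumentException("Unsupported class name: " + className);
        }
        String token = className.toLowerCase(Locale.ROOT);
        // Pad the normalized class value with spaces so only whole tokens match.
        String expr = "descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' "
                + token + " ')]";
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, root, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```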
      • findAllByAttributeName

        public static List<Node> findAllByAttributeName​(Node root,
                                                        String attrName)
        Finds all nodes that have a declared attribute. Note that the attrName is transformed to lower case before being matched against the DOM.
        Parameters:
        root - the root node from which to start searching.
        attrName - the name of the filtered attribute.
        Returns:
        list of matching nodes or an empty list.
      • findAllByAttributeContains

        public static List<Node> findAllByAttributeContains​(Node node,
                                                            String attrName,
                                                            String attrContains)
      • findAllByTagAndClassName

        public static List<Node> findAllByTagAndClassName​(Node root,
                                                          String tagName,
                                                          String className)
      • findNodeById

        public static Node findNodeById​(Node root,
                                        String id)
        Finds a node by its id, mimicking the JS DOM API or Prototype's $().
        Parameters:
        root - the root node from which to search
        id - the id of the node to locate
        Returns:
        the matching Node, if one exists
      • findAll

        public static List<Node> findAll​(Node node,
                                         String xpath)
        Returns a list of all the nodes that match an XPath expression, which must be valid.
        Parameters:
        node - the context node against which the expression is evaluated
        xpath - an XPath expression
        Returns:
        a list of the matching nodes, if any exist
      • find

        public static String find​(Node node,
                                  String xpath)
        Gets the string value of an XPath expression.
        Parameters:
        node - the context node against which the expression is evaluated
        xpath - an XPath expression
        Returns:
        the string value of the XPath expression
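
        The two XPath helpers above, findAll and find, map naturally onto the standard javax.xml.xpath API. The following sketch shows one way to do it. XPathHelpersSketch is a hypothetical class, and rethrowing an invalid expression as IllegalArgumentException is an assumption made for brevity, not documented DomUtils behavior.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class XPathHelpersSketch {

    /** Evaluates the expression as a node set and copies the hits into a List. */
    static List<Node> findAll(Node context, String expression) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                result.add(hits.item(i));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    /** Evaluates the expression and returns its string value. */
    static String find(Node context, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, context);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```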
      • hasClassName

        public static boolean hasClassName​(Node node,
                                           String className)
        Tells whether an element has the given class name, mimicking the CSS .foo match. Only the element itself is checked, not its parents in the hierarchy.
        Parameters:
        node - the node to check
        className - the CSS class name
        Returns:
        true if the element has the class name
      • hasAttribute

        public static boolean hasAttribute​(Node node,
                                           String attributeName,
                                           String className)
        Checks for the presence of a value in an attribute that contains a whitespace-separated list of values. The semantics are those of CSS classes: "foo" matches "bar foo" and "foo", but not "foob".
        Parameters:
        node - the node to check
        attributeName - the attribute name
        className - the value to look for in the attribute's list of values
        Returns:
        true if the attribute contains the given value
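
        The whole-token semantics described above ("foo" matches "bar foo" but not "foob") can be sketched as follows. TokenMatchSketch and hasToken are hypothetical names. Treating a missing attribute or an empty token as "no match" is an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class TokenMatchSketch {

    /** Returns true if the attribute's whitespace-separated value list contains token. */
    static boolean hasToken(Node node, String attributeName, String token) {
        if (!(node instanceof Element) || token.isEmpty()) {
            return false;
        }
        // Element.getAttribute returns "" when the attribute is absent.
        String value = ((Element) node).getAttribute(attributeName);
        for (String candidate : value.trim().split("\\s+")) {
            if (candidate.equals(token)) {
                return true; // whole-token match only, never a substring match
            }
        }
        return false;
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```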
      • hasAttribute

        public static boolean hasAttribute​(Node node,
                                           String attributeName)
        Checks the presence of an attribute in the given node.
        Parameters:
        node - the node container.
        attributeName - the name of the attribute.
        Returns:
        true if the attribute is present
      • isElementNode

        public static boolean isElementNode​(Node target)
        Verifies if the given target node is an element.
        Parameters:
        target - target node to check
        Returns:
        true if the node is an element, false otherwise.
      • readAttribute

        public static String readAttribute​(Node node,
                                           String attribute,
                                           String defaultValue)
        Reads the value of the specified attribute, returning the defaultValue string if not present.
        Parameters:
        node - the node from which to read the attribute.
        attribute - attribute name.
        defaultValue - the default value to return if attribute is not found.
        Returns:
        the attribute value or defaultValue if not found.
      • readAttributeWithPrefix

        public static String readAttributeWithPrefix​(Node node,
                                                     String attributePrefix,
                                                     String defaultValue)
        Reads the value of the first attribute whose name matches the specified attributePrefix. Returns the defaultValue if not found.
        Parameters:
        node - node to look for attributes.
        attributePrefix - attribute prefix.
        defaultValue - default returned value.
        Returns:
        the value found or default.
      • readAttribute

        public static String readAttribute​(Node node,
                                           String attribute)
        Reads the value of an attribute, returning the empty string if not present.
        Parameters:
        node - the node from which to read the attribute.
        attribute - attribute name.
        Returns:
        the attribute value or "" if not found.
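
        The three attribute readers above differ only in how the attribute is selected and what is returned when it is missing. A minimal sketch with the plain DOM API follows. ReadAttributeSketch is a hypothetical class. It assumes that "matches the prefix" means the attribute name starts with the prefix, and that "first" follows the iteration order of NamedNodeMap, which the DOM does not guarantee.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class ReadAttributeSketch {

    /** Returns the attribute value, or defaultValue when the attribute is absent. */
    static String readAttribute(Node node, String attribute, String defaultValue) {
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs == null) {
            return defaultValue;
        }
        Node attr = attrs.getNamedItem(attribute);
        return attr == null ? defaultValue : attr.getNodeValue();
    }

    /** Same as above, with the empty string as the default. */
    static String readAttribute(Node node, String attribute) {
        return readAttribute(node, attribute, "");
    }

    /** Returns the value of the first attribute whose name starts with the prefix. */
    static String readAttributeWithPrefix(Node node, String attributePrefix, String defaultValue) {
        NamedNodeMap attrs = node.getAttributes();
        if (attrs == null) {
            return defaultValue;
        }
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            if (attr.getNodeName().startsWith(attributePrefix)) {
                return attr.getNodeValue();
            }
        }
        return defaultValue;
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```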
      • serializeToXML

        public static String serializeToXML​(Node node,
                                            boolean indent)
                                     throws TransformerException,
                                            IOException
        Produces the XML serialization of the given DOM Node, omitting the XML declaration.
        Parameters:
        node - node to be serialized.
        indent - if true the output is indented.
        Returns:
        the XML serialization.
        Throws:
        TransformerException - if an error occurs during the serializer initialization and activation.
        IOException - if there is an error locating the node
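
        A serialization with these two options can be sketched with the JDK's identity Transformer. This is an assumption about how such a method could be built, not the DomUtils implementation. SerializeSketch is a hypothetical class, and the exact whitespace produced when indent is true depends on the Transformer in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; not part of DomUtils.
final class SerializeSketch {

    /** Serializes the node to XML text without the leading XML declaration. */
    static String serializeToXml(Node node, boolean indent) throws TransformerException {
        // newTransformer() with no stylesheet gives an identity transformation.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    /** Test helper: parses an XML string, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```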
      • documentToInputStream

        public static InputStream documentToInputStream​(Document doc)
        Returns an input stream representing the given Document.
        Parameters:
        doc - the input Document
        Returns:
        an InputStream
      • nodeToInputStream

        public static InputStream nodeToInputStream​(Node node)
        Converts a W3C DOM node to an InputStream.
        Parameters:
        node - Node to convert
        Returns:
        the converted InputStream
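
        Both stream conversions above can be approximated by serializing into an in-memory buffer and wrapping the bytes. The sketch below is hypothetical (NodeToStreamSketch is not part of DomUtils). It assumes UTF-8 output and wraps checked exceptions for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical sketch class; not part of DomUtils.
final class NodeToStreamSketch {

    /** Serializes the node into memory and exposes the bytes as an InputStream. */
    static InputStream nodeToInputStream(Node node) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // assumption: UTF-8 output
            transformer.transform(new DOMSource(node), new StreamResult(buffer));
            return new ByteArrayInputStream(buffer.toByteArray());
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Test helper: parses XML from a stream, wrapping checked exceptions. */
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

        A Document is itself a Node, so the same sketch covers the Document case. Parsing the returned stream again should give back an equivalent tree.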