Package org.apache.any23.extractor.html
Class DomUtils
java.lang.Object
    org.apache.any23.extractor.html.DomUtils
public class DomUtils extends Object
This class provides utility methods for DOM manipulation. It is separated from HTMLDocument so that its methods can be run on single DOM nodes without having to wrap them into an HTMLDocument.
We use a mix of XPath and DOM manipulation. This is likely to be a performance bottleneck, but at least everything is localized here.
Method Summary
Modifier and Type, Method, and Description:

static InputStream documentToInputStream(Document doc)
    Given a Document, returns an input stream representing that document.
static String find(Node node, String xpath)
    Gets the string value of an XPath expression.
static List<Node> findAll(Node node, String xpath)
    Returns a list composed of all the nodes that match an XPath expression, which must be valid.
static List<Node> findAllByAttributeContains(Node node, String attrName, String attrContains)
static List<Node> findAllByAttributeName(Node root, String attrName)
    Finds all nodes that have a declared attribute.
static List<Node> findAllByClassName(Node root, String className)
    Finds all nodes that have a declared class.
static List<Node> findAllByTag(Node root, String tagName)
static List<Node> findAllByTagAndClassName(Node root, String tagName, String className)
static Node findNodeById(Node root, String id)
    Mimics the JS DOM API, or Prototype's $().
static int getIndexInParent(Node n)
    Returns the index of the given node within the list of children of its parent node.
static int[] getNodeLocation(Node n)
    Returns the row/col location of the given node.
static String getXPathForNode(Node node)
    Walks the DOM tree in reverse to generate a unique XPath expression leading to this node.
static String[] getXPathListForNode(Node n)
    Returns a list of tag names representing the path from the document root to the given node n.
static boolean hasAttribute(Node node, String attributeName)
    Checks the presence of an attribute in the given node.
static boolean hasAttribute(Node node, String attributeName, String className)
    Checks the presence of an attribute value in attributes that contain whitespace-separated lists of values.
static boolean hasClassName(Node node, String className)
    Tells whether an element has a class name, without checking the parents in the hierarchy, mimicking the CSS .foo match.
static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling)
    Checks whether a node is an ancestor of, or the same as, another node.
static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling, boolean strict)
    Checks whether a node is an ancestor of, or the same as, another node.
static boolean isElementNode(Node target)
    Verifies whether the given target node is an element.
static InputStream nodeToInputStream(Node node)
    Converts a W3C DOM node to an InputStream.
static String readAttribute(Node node, String attribute)
    Reads the value of an attribute, returning the empty string if not present.
static String readAttribute(Node node, String attribute, String defaultValue)
    Reads the value of the specified attribute, returning the defaultValue string if not present.
static String readAttributeWithPrefix(Node node, String attributePrefix, String defaultValue)
    Reads the value of the first attribute whose name matches the specified attributePrefix.
static String serializeToXML(Node node, boolean indent)
    Given a DOM Node, produces the XML serialization, omitting the XML declaration.
Method Detail
-
getIndexInParent
public static int getIndexInParent(Node n)
Returns the index of the given node within the list of children of its parent node.
Parameters:
n - the node whose index is returned.
Returns:
a non-negative number.
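As a rough illustration of what an "index in parent" lookup involves, here is a minimal sketch built only on the JDK's org.w3c.dom API. The class IndexInParentSketch and its methods parse and indexInParent are hypothetical names, not part of Any23. The sketch counts every child node (including text nodes) and returns 0 for a parentless node; both choices are assumptions, and DomUtils may count siblings differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class IndexInParentSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /** Position of n among ALL child nodes of its parent; 0 if n has no parent (assumption). */
    static int indexInParent(Node n) {
        Node parent = n.getParentNode();
        if (parent == null) {
            return 0;
        }
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).isSameNode(n)) {
                return i;
            }
        }
        throw new IllegalStateException("node not found among its parent's children");
    }

    public static void main(String[] args) {
        Document doc = parse("<r><a/><b/><c/></r>");
        Node b = doc.getElementsByTagName("b").item(0);
        System.out.println(indexInParent(b)); // prints 1
    }
}
```

Note that whitespace between tags becomes text nodes in a parsed document, so an index computed this way shifts when the markup is pretty-printed.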
-
getXPathForNode
public static String getXPathForNode(Node node)
Walks the DOM tree in reverse to generate a unique XPath expression leading to this node. The generated XPath is the canonical one based on sibling index, for example /html[1]/body[1]/div[2]/span[3].
Parameters:
node - the input node.
Returns:
the XPath location of node as a String.
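The reverse walk described above can be sketched with plain JDK DOM calls. This is not the Any23 source: XPathForNodeSketch, parse, siblingIndex and xpathFor are hypothetical names, and the sketch assumes that the index in each step counts only preceding element siblings with the same tag name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

final class XPathForNodeSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /** 1-based position of an element among its same-named element siblings. */
    static int siblingIndex(Node element) {
        int index = 1;
        for (Node s = element.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE
                    && s.getNodeName().equals(element.getNodeName())) {
                index++;
            }
        }
        return index;
    }

    /** Walks from the node up to the root element, prepending one step per element. */
    static String xpathFor(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node n = node;
             n != null && n.getNodeType() == Node.ELEMENT_NODE;
             n = n.getParentNode()) {
            path.insert(0, "/" + n.getNodeName() + "[" + siblingIndex(n) + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><div/><div><span/><span/><span/></div></body></html>");
        Node thirdSpan = doc.getElementsByTagName("span").item(2);
        System.out.println(xpathFor(thirdSpan)); // prints /html[1]/body[1]/div[2]/span[3]
    }
}
```

The walk stops at the Document node because it is not an element, so the path always starts at the root element.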
-
getXPathListForNode
public static String[] getXPathListForNode(Node n)
Returns a list of tag names representing the path from the document root to the given node n.
Parameters:
n - the node for which to retrieve the path.
Returns:
a sequence of HTML tag names.
-
getNodeLocation
public static int[] getNodeLocation(Node n)
Returns the row/col location of the given node.
Parameters:
n - input node.
Returns:
an array of four elements of the form [<begin-row>, <begin-col>, <end-row>, <end-col>], or null if it is not possible to extract such data.
-
isAncestorOf
public static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling, boolean strict)
Checks whether a node is an ancestor of, or the same as, another node.
Parameters:
candidateAncestor - the candidate ancestor node.
candidateSibling - the candidate sibling node.
strict - if true, the ancestor and the sibling are not allowed to be the same node.
Returns:
true if candidateAncestor is an ancestor of candidateSibling, false otherwise.
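The effect of the strict flag is easiest to see in code. The following is a minimal sketch using only the JDK DOM API. AncestorSketch and parse are hypothetical names, and rejecting null arguments with a NullPointerException is an assumption rather than documented behavior.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

final class AncestorSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /**
     * Climbs the parent chain of candidateSibling looking for candidateAncestor.
     * With strict=true the climb starts at the parent, so a node is never
     * reported as its own ancestor.
     */
    static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling, boolean strict) {
        Objects.requireNonNull(candidateAncestor, "candidateAncestor"); // assumption
        Objects.requireNonNull(candidateSibling, "candidateSibling");   // assumption
        Node current = strict ? candidateSibling.getParentNode() : candidateSibling;
        while (current != null) {
            if (current.isSameNode(candidateAncestor)) {
                return true;
            }
            current = current.getParentNode();
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b><c/></b></a>");
        Node a = doc.getDocumentElement();
        Node c = doc.getElementsByTagName("c").item(0);
        System.out.println(isAncestorOf(a, c, true));  // prints true
        System.out.println(isAncestorOf(a, a, false)); // prints true
        System.out.println(isAncestorOf(a, a, true));  // prints false
    }
}
```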
-
isAncestorOf
public static boolean isAncestorOf(Node candidateAncestor, Node candidateSibling)
Checks whether a node is an ancestor of, or the same as, another node. Equivalent to isAncestorOf(org.w3c.dom.Node, org.w3c.dom.Node, boolean) with strict=false.
Parameters:
candidateAncestor - the candidate ancestor node.
candidateSibling - the candidate sibling node.
Returns:
true if candidateAncestor is an ancestor of candidateSibling, false otherwise.
-
findAllByClassName
public static List<Node> findAllByClassName(Node root, String className)
Finds all nodes that have a declared class. Note that the className is transformed to lower case before being matched against the DOM.
Parameters:
root - the root node from which to start searching.
className - the name of the filtered class.
Returns:
a list of matching nodes, or an empty list.
-
findAllByAttributeName
public static List<Node> findAllByAttributeName(Node root, String attrName)
Finds all nodes that have a declared attribute. Note that the className is transformed to lower case before being matched against the DOM.
Parameters:
root - the root node from which to start searching.
attrName - the name of the filtered attribute.
Returns:
a list of matching nodes, or an empty list.
-
findAllByAttributeContains
public static List<Node> findAllByAttributeContains(Node node, String attrName, String attrContains)
-
findAllByTagAndClassName
public static List<Node> findAllByTagAndClassName(Node root, String tagName, String className)
-
findNodeById
public static Node findNodeById(Node root, String id)
Mimics the JS DOM API, or Prototype's $().
Parameters:
root - the root node from which to start searching.
id - the id of the node to locate.
Returns:
the Node, if one exists.
-
findAll
public static List<Node> findAll(Node node, String xpath)
Returns a list composed of all the nodes that match an XPath expression, which must be valid.
Parameters:
node - the node on which the expression is evaluated.
xpath - an XPath expression.
Returns:
a list of Nodes, if any exist.
-
find
public static String find(Node node, String xpath)
Gets the string value of an XPath expression.
Parameters:
node - the node on which the expression is evaluated.
xpath - an XPath expression.
Returns:
the string value of the XPath expression.
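To show the difference between a node-list lookup and a string-value lookup, here is a minimal sketch on top of the JDK's javax.xml.xpath package. XPathLookupSketch and parse are hypothetical names. Rethrowing an invalid expression as IllegalArgumentException is an assumption; this page does not say how findAll and find report that case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XPathLookupSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /** Evaluates the expression as a node set and copies it into a List. */
    static List<Node> findAll(Node node, String xpath) {
        try {
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, node, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                result.add(matches.item(i));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + xpath, e); // assumption
        }
    }

    /** Evaluates the expression and returns its XPath string value. */
    static String find(Node node, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, node);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + xpath, e); // assumption
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<ul><li>one</li><li>two</li></ul>");
        System.out.println(findAll(doc, "//li").size()); // prints 2
        System.out.println(find(doc, "//li[2]"));        // prints two
    }
}
```

In XPath 1.0 the string value of an empty node set is the empty string, so a string lookup that matches nothing yields "" rather than null.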
-
hasClassName
public static boolean hasClassName(Node node, String className)
Tells whether an element has a class name, without checking the parents in the hierarchy, mimicking the CSS .foo match.
Parameters:
node - the node to check.
className - the CSS class name.
Returns:
true if the class name exists.
-
hasAttribute
public static boolean hasAttribute(Node node, String attributeName, String className)
Checks the presence of an attribute value in attributes that contain whitespace-separated lists of values. The semantics are those of CSS classes: "foo" matches "bar foo" and "foo", but not "foob".
Parameters:
node - the node to check.
attributeName - the attribute name.
className - the value to look for, matched like a CSS class name.
Returns:
true if the attribute contains the given value.
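The whitespace-separated matching rule can be sketched as a token comparison. This is an illustration built on the JDK DOM API, not the Any23 implementation; AttributeTokenSketch, parse and hasToken are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

final class AttributeTokenSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /** True if the attribute exists and one of its whitespace-separated tokens equals value. */
    static boolean hasToken(Node node, String attributeName, String value) {
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes == null) {
            return false;
        }
        Node attribute = attributes.getNamedItem(attributeName);
        if (attribute == null) {
            return false;
        }
        for (String token : attribute.getNodeValue().trim().split("\\s+")) {
            if (token.equals(value)) {
                return true; // whole-token match, so "foo" does not match "foob"
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node barFoo = parse("<div class='bar foo'/>").getDocumentElement();
        Node foob = parse("<div class='foob'/>").getDocumentElement();
        System.out.println(hasToken(barFoo, "class", "foo")); // prints true
        System.out.println(hasToken(foob, "class", "foo"));   // prints false
    }
}
```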
-
hasAttribute
public static boolean hasAttribute(Node node, String attributeName)
Checks the presence of an attribute in the given node.
Parameters:
node - the node container.
attributeName - the name of the attribute.
Returns:
true if the attribute is present.
-
isElementNode
public static boolean isElementNode(Node target)
Verifies whether the given target node is an element.
Parameters:
target - the target node to check.
Returns:
true if the node is an element, false otherwise.
-
readAttribute
public static String readAttribute(Node node, String attribute, String defaultValue)
Reads the value of the specified attribute, returning the defaultValue string if not present.
Parameters:
node - the node from which to read the attribute.
attribute - the attribute name.
defaultValue - the default value to return if the attribute is not found.
Returns:
the attribute value, or defaultValue if not found.
-
readAttributeWithPrefix
public static String readAttributeWithPrefix(Node node, String attributePrefix, String defaultValue)
Reads the value of the first attribute whose name matches the specified attributePrefix. Returns the defaultValue if not found.
Parameters:
node - the node in which to look for attributes.
attributePrefix - the attribute prefix.
defaultValue - the default returned value.
Returns:
the value found, or the default.
-
readAttribute
public static String readAttribute(Node node, String attribute)
Reads the value of an attribute, returning the empty string if not present.
Parameters:
node - the node from which to read the attribute.
attribute - the attribute name.
Returns:
the attribute value, or "" if not found.
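The three attribute readers above share one pattern: look the attribute up, and fall back to a default when it is missing. A minimal sketch using the JDK DOM API follows. ReadAttributeSketch and parse are hypothetical names, and interpreting "matches the prefix" as String.startsWith on the attribute name is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

final class ReadAttributeSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /** Attribute value, or defaultValue when the node lacks the attribute. */
    static String readAttribute(Node node, String attribute, String defaultValue) {
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes == null) {
            return defaultValue;
        }
        Node found = attributes.getNamedItem(attribute);
        return found == null ? defaultValue : found.getNodeValue();
    }

    /** Same lookup with the empty string as the default. */
    static String readAttribute(Node node, String attribute) {
        return readAttribute(node, attribute, "");
    }

    /** Value of the first attribute whose name starts with the prefix (assumed semantics). */
    static String readAttributeWithPrefix(Node node, String attributePrefix, String defaultValue) {
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return defaultValue;
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            Node candidate = attributes.item(i);
            if (candidate.getNodeName().startsWith(attributePrefix)) {
                return candidate.getNodeValue();
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Node link = parse("<a href='x' data-id='7'/>").getDocumentElement();
        System.out.println(readAttribute(link, "href"));                    // prints x
        System.out.println(readAttribute(link, "title", "n/a"));            // prints n/a
        System.out.println(readAttributeWithPrefix(link, "data-", "none")); // prints 7
    }
}
```

A NamedNodeMap does not guarantee any particular attribute order, so "first" is only well defined when a single attribute carries the prefix.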
-
serializeToXML
public static String serializeToXML(Node node, boolean indent) throws TransformerException, IOException
Given a DOM Node, produces the XML serialization, omitting the XML declaration.
Parameters:
node - the node to be serialized.
indent - if true, the output is indented.
Returns:
the XML serialization.
Throws:
TransformerException - if an error occurs during the serializer initialization and activation.
IOException - if there is an error locating the node.
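Serialization without the XML declaration can be sketched with the JDK's javax.xml.transform API. SerializeSketch, parse and serialize are hypothetical names. Unlike the method documented above, this sketch wraps TransformerException in an unchecked exception to keep the example short; whether DomUtils uses the same transformer settings is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

final class SerializeSketch {

    /** Hypothetical helper: parses a well-formed XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample document", e);
        }
    }

    /** Serializes the node with an identity transform, leaving out the XML declaration. */
    static String serialize(Node node, boolean indent) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped only to keep the sketch free of checked exceptions.
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b>t</b></a>");
        // The output carries no leading <?xml ...?> declaration.
        System.out.println(serialize(doc, false));
    }
}
```

The exact whitespace produced with indent=true depends on the transformer implementation, so callers should not rely on a specific layout.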
-
documentToInputStream
public static InputStream documentToInputStream(Document doc)
Given a Document, returns an input stream representing that document.
Parameters:
doc - the input Document.
Returns:
an InputStream.
-
nodeToInputStream
public static InputStream nodeToInputStream(Node node)
Converts a W3C DOM node to an InputStream.
Parameters:
node - the Node to convert.
Returns:
the converted InputStream.