Class TagSoupParser


  • public class TagSoupParser
    extends Object

    Parses an InputStream into an HTML DOM tree.

    Note: The resulting DOM tree is not namespace-aware; all element names are upper case, while attribute names are lower case. This is because the HTML parser uses the Xerces HTML DOM implementation, which does not support namespaces and forces upper-case element names. This works with the RDFa XSLT Converter and with XPath, so the behavior was left as it is.
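    The practical consequence of the note above is that XPath expressions against the resulting tree need upper-case element names and lower-case attribute names. The following is a minimal sketch using only the JDK XML APIs. It does not call TagSoupParser; the class name UpperCaseDomXPath and the helpers buildSampleDom and count are hypothetical, and the hand-built DOM is only a stand-in that mimics the shape described in the note.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of the documented API.
class UpperCaseDomXPath {

    // Builds a tiny stand-in DOM (assumption: it only imitates the parser's output):
    // upper-case element names, lower-case attribute names, no namespaces.
    static Document buildSampleDom() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(false);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            Element link = doc.createElement("A");
            link.setAttribute("href", "http://example.org/");
            link.setTextContent("example");
            body.appendChild(link);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes selected by an XPath expression evaluated against the document.
    static int count(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildSampleDom();
        // Upper-case element name, lower-case attribute name.
        System.out.println("//A[@href] matches: " + count(doc, "//A[@href]"));
        // Same query with a lower-case element name, for comparison.
        System.out.println("//a[@href] matches: " + count(doc, "//a[@href]"));
    }
}
```

    Because XPath name tests are case-sensitive, the upper-case query should select the A element, while the lower-case variant should select nothing. Queries written for lower-case HTML would need their element names upper-cased before being applied to a tree of this shape.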

    Author:
    Richard Cyganiak (richard at cyganiak dot de), Michele Mostarda (mostarda@fbk.eu), Davide Palmisano (palmisano@fbk.eu)
    • Method Detail

      • getDOM

        public Document getDOM()
                        throws IOException
        Returns the DOM of the given document IRI.
        Returns:
        the HTML DOM.
        Throws:
        IOException - if there is an error whilst accessing the DOM
      • getValidatedDOM

        public DocumentReport getValidatedDOM(boolean applyFix)
                                       throws IOException,
                                              ValidatorException
        Returns the validated DOM, applying fixes to it if applyFix is true.
        Parameters:
        applyFix - whether to apply fixes to the DOM
        Returns:
        a report containing the HTML DOM that has been validated, and fixed if applyFix is true. The report also contains information about the activated rules and the detected issues.
        Throws:
        IOException - if there is an error accessing the DOM
        ValidatorException - if there is an error validating the DOM