Package org.apache.any23.extractor.html
Class TagSoupParser
- java.lang.Object
-
- org.apache.any23.extractor.html.TagSoupParser
-
public class TagSoupParser extends Object
Parses an InputStream into an HTML DOM tree.
Note: The resulting DOM tree will not be namespace-aware, and all element names will be upper case, while attribute names will be lower case. This is because the HTML parser uses the Xerces HTML DOM implementation, which doesn't support namespaces and forces upper-case element names. This works with the RDFa XSLT Converter and with XPath, so we left it this way.
- Author:
- Richard Cyganiak (richard at cyganiak dot de), Michele Mostarda (mostarda@fbk.eu), Davide Palmisano (palmisano@fbk.eu)
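The note on element and attribute case affects how XPath expressions must be written against the resulting tree. The following is a minimal sketch using only JDK classes: the DOM is built by hand as a stand-in for what getDOM() might return (upper-case element names, lower-case attribute names, no namespaces). The class name UpperCaseDomXPath, its helper methods, and the sample URL are hypothetical and not part of Any23.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class UpperCaseDomXPath {

    // Builds a small DOM that mimics the shape described in the note:
    // upper-case element names, lower-case attribute names, no namespaces.
    // This is an assumption-based stand-in, not output of TagSoupParser.
    static Document buildSampleDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element html = doc.createElement("HTML");
        Element body = doc.createElement("BODY");
        Element link = doc.createElement("A");
        link.setAttribute("href", "http://example.org/");
        link.setTextContent("Example");
        body.appendChild(link);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Evaluates an XPath expression that selects attribute nodes
    // and returns their values in document order.
    static List<String> attributeValues(Document dom, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Document dom = buildSampleDom();
        // XPath is case-sensitive: the upper-case step matches the element.
        System.out.println(attributeValues(dom, "//A/@href"));
        // The lower-case step selects nothing in this tree.
        System.out.println(attributeValues(dom, "//a/@href"));
    }
}
```

Because XPath name tests are case-sensitive and the tree carries no namespaces, expressions written for lower-case or namespaced XHTML would likely need adjusting before being applied to a tree of this shape.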
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
static class TagSoupParser.ElementLocation
Describes a DOM Element location.
-
Field Summary
Fields (Modifier and Type, Field, Description)
static String ELEMENT_LOCATION
-
Constructor Summary
Constructors (Constructor, Description)
TagSoupParser(InputStream input, String documentIRI)
TagSoupParser(InputStream input, String documentIRI, String encoding)
-
Method Summary
Methods (Modifier and Type, Method, Description)
Document getDOM()
Returns the DOM of the given document IRI.
DocumentReport getValidatedDOM(boolean applyFix)
Returns the validated DOM and applies fixes on it if applyFix is set to true.
-
-
-
Field Detail
-
ELEMENT_LOCATION
public static final String ELEMENT_LOCATION
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
TagSoupParser
public TagSoupParser(InputStream input, String documentIRI)
-
TagSoupParser
public TagSoupParser(InputStream input, String documentIRI, String encoding)
-
-
Method Detail
-
getDOM
public Document getDOM() throws IOException
Returns the DOM of the given document IRI.
- Returns:
- the HTML DOM.
- Throws:
IOException
- if there is an error whilst accessing the DOM
-
getValidatedDOM
public DocumentReport getValidatedDOM(boolean applyFix) throws IOException, ValidatorException
Returns the validated DOM and applies fixes to it if applyFix is set to true.
- Parameters:
applyFix
- whether to apply fixes to the DOM
- Returns:
- a report containing the HTML DOM that has been validated, and fixed if applyFix is true. The report also contains information about the activated rules and the detected issues.
- Throws:
IOException
- if there is an error accessing the DOM
ValidatorException
- if there is an error validating the DOM
-
-