TagSoupParser (Apache Any23 2.8-SNAPSHOT API)

This project has retired. For details please refer to its Attic page.

java.lang.Object
- org.apache.any23.extractor.html.TagSoupParser

```
public class TagSoupParser
extends Object
```
Parses an InputStream into an HTML DOM tree.

Note: The resulting DOM tree is not namespace aware. All element names are upper case, while attribute names are lower case. This is because the HTML parser uses the Xerces HTML DOM implementation, which doesn't support namespaces and forces uppercase element names. This works with the RDFa XSLT Converter and with XPath, so we left it this way.
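
The practical consequence of this note for XPath users can be sketched with the JDK alone. The example below does not call `TagSoupParser`. It builds a small stand-in DOM by hand that follows the same convention (no namespaces, uppercase element names, lowercase attribute names) and queries it with the standard `javax.xml.xpath` API. The class name `UpperCaseDomXPath` and its helper methods are hypothetical names chosen for illustration. Since XPath 1.0 name tests are case-sensitive, an expression written with lowercase element names should be expected to miss elements in such a tree.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class UpperCaseDomXPath {

    /**
     * Builds a hand-made stand-in DOM that mimics the convention described
     * in the note: not namespace aware, uppercase element names, lowercase
     * attribute names. This is an assumption-based substitute for a parsed page.
     */
    public static Document buildSampleDom() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false);
        Document doc = factory.newDocumentBuilder().newDocument();

        Element html = doc.createElement("HTML");
        Element body = doc.createElement("BODY");
        Element link = doc.createElement("A");
        link.setAttribute("href", "http://example.org/");
        link.setTextContent("example");

        body.appendChild(link);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    /** Evaluates an XPath expression and returns the href of each matched element. */
    public static List<String> selectHrefs(Document dom, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        Document dom = buildSampleDom();
        // Uppercase element name, lowercase attribute name: matches the link.
        System.out.println(selectHrefs(dom, "//A[@href]"));
        // Lowercase element name: no match, because name tests are case-sensitive.
        System.out.println(selectHrefs(dom, "//a[@href]"));
    }
}
```

With a real page, the `Document` would presumably come from `getDOM()` instead of `buildSampleDom()`, and the same uppercase naming would apply to the XPath expressions.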

Author:

Richard Cyganiak (richard at cyganiak dot de), Michele Mostarda (mostarda@fbk.eu), Davide Palmisano (palmisano@fbk.eu)

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`TagSoupParser.ElementLocation`	Describes a DOM Element location.

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`ELEMENT_LOCATION`	

Constructor Summary

Constructors
Constructor	Description
`TagSoupParser(InputStream input, String documentIRI)`
`TagSoupParser(InputStream input, String documentIRI, String encoding)`

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method	Description
`Document`	`getDOM()`	Returns the DOM of the given document IRI.
`DocumentReport`	`getValidatedDOM(boolean applyFix)`	Returns the validated DOM and applies fixes on it if applyFix is set to `true`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - ELEMENT_LOCATION
```
public static final String ELEMENT_LOCATION
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - TagSoupParser
```
public TagSoupParser(InputStream input,
                     String documentIRI)
```
  - TagSoupParser
```
public TagSoupParser(InputStream input,
                     String documentIRI,
                     String encoding)
```
- Method Detail
  - getDOM
```
public Document getDOM()
                throws IOException
```
    Returns the DOM of the given document IRI.
    
    Returns:
    
    the HTML DOM.
    
    Throws:
    
    IOException - if there is an error whilst accessing the DOM
  - getValidatedDOM
```
public DocumentReport getValidatedDOM(boolean applyFix)
                               throws IOException,
                                      ValidatorException
```
    Returns the validated DOM and applies fixes on it if applyFix is set to true.
    
    Parameters:
    
    applyFix - whether to apply fixes to the DOM
    
    Returns:
    
    a report containing the HTML DOM that has been validated, and fixed if applyFix is true. The report also contains information about the activated rules and the detected issues.
    
    Throws:
    
    IOException - if there is an error accessing the DOM
    
    ValidatorException - if there is an error validating the DOM