Class MicrodataParser


  • public class MicrodataParser
    extends Object
    This class provides utility methods for handling Microdata nodes contained within a DOM document.
    Author:
    Michele Mostarda (mostarda@fbk.eu), Hans Brende (hansbrende@apache.org)
    • Field Detail

      • SRC_TAGS

        public static final Set<String> SRC_TAGS
        Set of tags providing the src attribute.
      • HREF_TAGS

        public static final Set<String> HREF_TAGS
        Set of tags providing the href attribute.
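As an illustration of how two such tag sets can be used, the sketch below picks the attribute that carries a URL value for a given tag. The tag names are an assumption taken from the HTML Microdata specification, not read from these fields, and `UrlAttributeSketch` and `urlAttributeFor` are hypothetical names that are not part of this class.

```java
import java.util.Locale;
import java.util.Set;

class UrlAttributeSketch {
    // Assumed contents, following the HTML Microdata specification.
    // The actual values of MicrodataParser.SRC_TAGS / HREF_TAGS may differ.
    static final Set<String> SRC_TAGS =
            Set.of("audio", "embed", "iframe", "img", "source", "track", "video");
    static final Set<String> HREF_TAGS = Set.of("a", "area", "link");

    // Returns the attribute expected to hold the URL value for the tag,
    // or null when the tag belongs to neither set.
    static String urlAttributeFor(String tagName) {
        String tag = tagName.toLowerCase(Locale.ROOT);
        if (SRC_TAGS.contains(tag)) {
            return "src";
        }
        if (HREF_TAGS.contains(tag)) {
            return "href";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(urlAttributeFor("IMG"));  // prints "src"
        System.out.println(urlAttributeFor("a"));    // prints "href"
        System.out.println(urlAttributeFor("span")); // prints "null"
    }
}
```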
    • Constructor Detail

      • MicrodataParser

        public MicrodataParser​(Document document)
    • Method Detail

      • getItemScopeNodes

        public static List<Node> getItemScopeNodes​(Node node)
        Returns all the itemScopes detected within the given root node.
        Parameters:
        node - root node to search in.
        Returns:
        list of detected itemscope nodes.
      • isItemScope

        public static boolean isItemScope​(Node node)
        Check whether a node is an itemScope.
        Parameters:
        node - node to check.
        Returns:
        true if the node is an itemScope, false otherwise.
      • getItemPropNodes

        public static List<Node> getItemPropNodes​(Node node)
        Returns all the itemProps detected within the given root node.
        Parameters:
        node - root node to search in.
        Returns:
        list of detected itemprop nodes.
      • isItemProp

        public static boolean isItemProp​(Node node)
        Check whether a node is an itemProp.
        Parameters:
        node - node to check.
        Returns:
        true if the node is an itemProp, false otherwise.
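The checks and searches described above can be approximated with the JDK DOM API alone. The following is a minimal sketch, assuming that a node counts as an itemScope or itemProp when it is an element carrying the `itemscope` or `itemprop` attribute. `ItemNodeSketch` and its methods are hypothetical names, and this is not the Any23 implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ItemNodeSketch {
    // True when the node is an element that carries the named attribute.
    static boolean hasAttribute(Node node, String name) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && node.getAttributes().getNamedItem(name) != null;
    }

    // Collects, in document order, every node under root (root included)
    // that carries the named attribute.
    static List<Node> collect(Node root, String attribute) {
        List<Node> found = new ArrayList<>();
        collectInto(root, attribute, found);
        return found;
    }

    private static void collectInto(Node node, String attribute, List<Node> found) {
        if (hasAttribute(node, attribute)) {
            found.add(node);
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectInto(c, attribute, found);
        }
    }

    // Helper for the demo: parses well-formed XML/XHTML from a string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div><div itemscope=\"\">"
                + "<span itemprop=\"name\">Ada</span></div></div>");
        System.out.println(collect(doc, "itemscope").size());
        System.out.println(collect(doc, "itemprop").size());
    }
}
```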
      • getTopLevelItemScopeNodes

        public static List<Node> getTopLevelItemScopeNodes​(Node node)
        Returns only the itemScopes that are top level items.
        Parameters:
        node - root node to search in.
        Returns:
        list of detected top-level itemscope nodes.
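To make "top level" concrete: the HTML Microdata specification treats an item as top-level when its element has no `itemprop` attribute. The sketch below applies only that rule with the JDK DOM API. The parser itself may apply further rules, and `TopLevelItemSketch` is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TopLevelItemSketch {
    // Collects elements that have itemscope but not itemprop
    // (assumed rule, taken from the Microdata specification).
    static List<Element> topLevelItems(Node root) {
        List<Element> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(Node node, List<Element> found) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if (e.hasAttribute("itemscope") && !e.hasAttribute("itemprop")) {
                found.add(e);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, found);
        }
    }

    // Helper for the demo: parses well-formed XML/XHTML from a string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div id=\"person\" itemscope=\"\">"
                + "<div id=\"address\" itemprop=\"address\" itemscope=\"\"/></div>");
        for (Element e : topLevelItems(doc)) {
            System.out.println(e.getAttribute("id")); // prints "person"
        }
    }
}
```

The nested `address` element is skipped because it is itself the value of a property of the enclosing item.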
      • getMicrodata

        public static MicrodataParserReport getMicrodata​(Document document,
                                                         org.apache.any23.extractor.microdata.MicrodataParser.ErrorMode errorMode)
                                                  throws MicrodataParserException
        Returns all the Microdata items detected within the given document.
        Parameters:
        document - document to be processed.
        errorMode - error management policy.
        Returns:
        report containing the detected itemscope items.
        Throws:
        MicrodataParserException - if errorMode == MicrodataParser.ErrorMode.STOP_AT_FIRST_ERROR and an error occurs.
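The difference between stopping at the first error and producing a full report can be sketched generically. The code below is not the Any23 implementation: the `ErrorMode` enum and `Report` class are local stand-ins, the constant name `FULL_REPORT` is an assumption, and integer parsing is used only as an operation that can fail.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorModeSketch {
    // Local stand-in for the parser's error policy (names assumed).
    enum ErrorMode { STOP_AT_FIRST_ERROR, FULL_REPORT }

    // Local stand-in for a report: successful results plus collected errors.
    static final class Report {
        final List<Integer> values = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
    }

    // Parses every input. On failure, either aborts immediately or records
    // the error and keeps going, depending on the mode.
    static Report parseAll(List<String> inputs, ErrorMode mode) {
        Report report = new Report();
        for (String input : inputs) {
            try {
                report.values.add(Integer.parseInt(input));
            } catch (NumberFormatException e) {
                if (mode == ErrorMode.STOP_AT_FIRST_ERROR) {
                    throw new IllegalStateException("stopped at first error: " + input, e);
                }
                report.errors.add(input);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Report r = parseAll(List.of("1", "x", "3"), ErrorMode.FULL_REPORT);
        System.out.println(r.values + " errors=" + r.errors);
    }
}
```

In full report mode the caller always receives a result and inspects the error list afterwards; in stop mode the caller must be prepared to catch the exception.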
      • getMicrodata

        public static MicrodataParserReport getMicrodata​(Document document)
        Returns all the Microdata items detected within the given document. Works in full report mode.
        Parameters:
        document - document to be processed.
        Returns:
        report containing the detected itemscope items.
      • getMicrodataAsJSON

        public static void getMicrodataAsJSON​(Document document,
                                              PrintStream ps)
        Writes to the given PrintStream a JSON document containing the list of all extracted Microdata, as described at Microdata JSON Specification.
        Parameters:
        document - document to be processed.
        ps - the PrintStream to write the JSON to.
      • setErrorMode

        public void setErrorMode​(org.apache.any23.extractor.microdata.MicrodataParser.ErrorMode errorMode)
      • getErrorMode

        public org.apache.any23.extractor.microdata.MicrodataParser.ErrorMode getErrorMode()
      • getItemProps

        public List<ItemProp> getItemProps​(Node scopeNode,
                                           boolean skipRoot)
                                    throws MicrodataParserException
        Returns all the itemprops for the given itemscope node.
        Parameters:
        scopeNode - node representing the itemscope.
        skipRoot - if true the given root node will not be read as a property, even if it contains the itemprop attribute.
        Returns:
        the list of itemprops detected within the given itemscope.
        Throws:
        MicrodataParserException - if an error occurs while retrieving a property value.
      • deferProperties

        public ItemProp[] deferProperties​(String... refs)
                                   throws MicrodataParserException
        Given a list of itemprop references, returns the corresponding itemprops from the document.
        Parameters:
        refs - list of references.
        Returns:
        list of retrieved itemprops.
        Throws:
        MicrodataParserException - if a loop is detected or a property name is missing.
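The two failure cases named above (a loop, or a missing name) are typical of any reference-following step. The sketch below shows one common way to detect both, using a plain map from an identifier to the identifiers it refers to. The data model, the class `ReferenceSketch`, and the exception types are assumptions for illustration and do not reflect how this method is implemented.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReferenceSketch {
    // Follows each reference depth-first and returns the identifiers visited.
    // Throws IllegalArgumentException for an unknown identifier and
    // IllegalStateException when a reference chain leads back to itself.
    static List<String> resolve(Map<String, List<String>> refsById, String... refs) {
        List<String> resolved = new ArrayList<>();
        for (String ref : refs) {
            visit(refsById, ref, new HashSet<>(), resolved);
        }
        return resolved;
    }

    private static void visit(Map<String, List<String>> refsById, String id,
                              Set<String> path, List<String> resolved) {
        if (!refsById.containsKey(id)) {
            throw new IllegalArgumentException("missing reference: " + id);
        }
        // 'path' holds the identifiers on the current chain only.
        if (!path.add(id)) {
            throw new IllegalStateException("loop detected at: " + id);
        }
        resolved.add(id);
        for (String next : refsById.get(id)) {
            visit(refsById, next, path, resolved);
        }
        path.remove(id);
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = Map.of(
                "a", List.of("b"),
                "b", List.<String>of());
        System.out.println(resolve(refs, "a")); // prints "[a, b]"
    }
}
```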