Class HTMLTagBalancer

java.lang.Object
org.cyberneko.html.HTMLTagBalancer
All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent

public class HTMLTagBalancer extends Object implements org.apache.xerces.xni.parser.XMLDocumentFilter, HTMLComponent
Balances tags in an HTML document. This component receives document events and tries to correct many common mistakes that human (and computer) HTML document authors make. This tag balancer can:
  • add missing parent elements;
  • automatically close elements with optional end tags; and
  • handle mismatched inline element tags.
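
The behaviours listed above can be illustrated with a small stack-based sketch. This is not the NekoHTML implementation: the class `TagBalanceSketch`, its method names, and the rule that a new `p` or `li` closes an open one of the same name are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not part of org.cyberneko.html.
final class TagBalanceSketch {
    private final Deque<String> open = new ArrayDeque<>(); // head = innermost open element
    private final List<String> out = new ArrayList<>();

    /** Distance from the innermost open element to the named one, or -1 if it is not open. */
    int depthOf(String name) {
        int depth = 0;
        for (String e : open) {              // iterates from innermost to outermost
            if (e.equals(name)) return depth;
            depth++;
        }
        return -1;
    }

    void start(String name) {
        // Assumed rule: <p> and <li> have optional end tags, so a repeated
        // start tag closes the one that is currently innermost.
        if ((name.equals("p") || name.equals("li")) && name.equals(open.peek())) {
            end(name);
        }
        open.push(name);
        out.add("<" + name + ">");
    }

    void end(String name) {
        int depth = depthOf(name);
        if (depth < 0) return;               // stray end tag: drop it
        for (int i = 0; i <= depth; i++) {   // close everything nested inside, then the element
            out.add("</" + open.pop() + ">");
        }
    }

    void finish() {
        while (!open.isEmpty()) out.add("</" + open.pop() + ">");
    }

    String result() { return String.join("", out); }

    public static void main(String[] args) {
        TagBalanceSketch b = new TagBalanceSketch();
        b.start("div"); b.start("p"); b.start("p"); b.start("b");
        b.end("i");     // no matching open element, ignored
        b.end("div");   // closes b, p and div
        b.finish();
        System.out.println(b.result());
    }
}
```

The real component works on XNI events and consults an element table for parent and closing rules, so treat the sketch only as a picture of the stack discipline.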

This component recognizes the following features:

  • http://cyberneko.org/html/features/augmentations
  • http://cyberneko.org/html/features/report-errors
  • http://cyberneko.org/html/features/balance-tags/document-fragment
  • http://cyberneko.org/html/features/balance-tags/ignore-outside-content
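
A component that recognizes features such as these typically pairs each identifier with a default state, which is what `getRecognizedFeatures` and `getFeatureDefault` expose. The following standalone sketch shows that pattern with parallel arrays. The class name `FeatureDefaultsSketch` and the default values are hypothetical; this page does not state the actual defaults.

```java
// Hypothetical illustration only; not part of org.cyberneko.html.
final class FeatureDefaultsSketch {
    static final String AUGMENTATIONS =
        "http://cyberneko.org/html/features/augmentations";
    static final String REPORT_ERRORS =
        "http://cyberneko.org/html/features/report-errors";
    static final String DOCUMENT_FRAGMENT =
        "http://cyberneko.org/html/features/balance-tags/document-fragment";
    static final String IGNORE_OUTSIDE_CONTENT =
        "http://cyberneko.org/html/features/balance-tags/ignore-outside-content";

    private static final String[] RECOGNIZED_FEATURES = {
        AUGMENTATIONS, REPORT_ERRORS, DOCUMENT_FRAGMENT, IGNORE_OUTSIDE_CONTENT
    };

    // Assumed defaults: null means "this component supplies no default".
    private static final Boolean[] FEATURE_DEFAULTS = {
        null, null, Boolean.FALSE, Boolean.FALSE
    };

    /** Returns a copy so callers cannot modify the internal table. */
    static String[] getRecognizedFeatures() {
        return RECOGNIZED_FEATURES.clone();
    }

    /** Returns the default for a recognized feature, or null if unknown or unset. */
    static Boolean getFeatureDefault(String featureId) {
        for (int i = 0; i < RECOGNIZED_FEATURES.length; i++) {
            if (RECOGNIZED_FEATURES[i].equals(featureId)) {
                return FEATURE_DEFAULTS[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getFeatureDefault(DOCUMENT_FRAGMENT));
        System.out.println(getFeatureDefault("http://example.org/unknown"));
    }
}
```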

This component recognizes the following properties:

  • http://cyberneko.org/html/properties/names/elems
  • http://cyberneko.org/html/properties/names/attrs
  • http://cyberneko.org/html/properties/error-reporter
  • http://cyberneko.org/html/properties/balance-tags/current-stack
Version:
$Id: HTMLTagBalancer.java,v 1.20 2005/02/14 04:06:22 andyc Exp $
Author:
Andy Clark, Marc Guillemot
  • Field Details

    • NAMESPACES

      protected static final String NAMESPACES
      Namespaces.
    • AUGMENTATIONS

      protected static final String AUGMENTATIONS
      Include infoset augmentations.
    • REPORT_ERRORS

      protected static final String REPORT_ERRORS
      Report errors.
    • DOCUMENT_FRAGMENT_DEPRECATED

      protected static final String DOCUMENT_FRAGMENT_DEPRECATED
      Document fragment balancing only (deprecated).
    • DOCUMENT_FRAGMENT

      protected static final String DOCUMENT_FRAGMENT
      Document fragment balancing only.
    • IGNORE_OUTSIDE_CONTENT

      protected static final String IGNORE_OUTSIDE_CONTENT
      Ignore outside content.
    • NAMES_ELEMS

      protected static final String NAMES_ELEMS
      Modify HTML element names: { "upper", "lower", "default" }.
    • NAMES_ATTRS

      protected static final String NAMES_ATTRS
      Modify HTML attribute names: { "upper", "lower", "default" }.
    • ERROR_REPORTER

      protected static final String ERROR_REPORTER
      Error reporter.
    • FRAGMENT_CONTEXT_STACK

      public static final String FRAGMENT_CONTEXT_STACK
      EXPERIMENTAL: may change in next release
      Name of the property holding the stack of elements in which context a document fragment should be parsed.
    • NAMES_NO_CHANGE

      protected static final short NAMES_NO_CHANGE
      Don't modify HTML names.
    • NAMES_MATCH

      protected static final short NAMES_MATCH
      Match HTML element names.
    • NAMES_UPPERCASE

      protected static final short NAMES_UPPERCASE
      Uppercase HTML names.
    • NAMES_LOWERCASE

      protected static final short NAMES_LOWERCASE
      Lowercase HTML names.
    • SYNTHESIZED_ITEM

      protected static final HTMLEventInfo SYNTHESIZED_ITEM
      Synthesized event info item.
    • fNamespaces

      protected boolean fNamespaces
      Namespaces.
    • fAugmentations

      protected boolean fAugmentations
      Include infoset augmentations.
    • fReportErrors

      protected boolean fReportErrors
      Report errors.
    • fDocumentFragment

      protected boolean fDocumentFragment
      Document fragment balancing only.
    • fIgnoreOutsideContent

      protected boolean fIgnoreOutsideContent
      Ignore outside content.
    • fAllowSelfclosingIframe

      protected boolean fAllowSelfclosingIframe
      Allows self-closing iframe tags.
    • fAllowSelfclosingTags

      protected boolean fAllowSelfclosingTags
      Allows self-closing tags.
    • fNamesElems

      protected short fNamesElems
      Modify HTML element names.
    • fNamesAttrs

      protected short fNamesAttrs
      Modify HTML attribute names.
    • fErrorReporter

      protected HTMLErrorReporter fErrorReporter
      Error reporter.
    • fDocumentSource

      protected org.apache.xerces.xni.parser.XMLDocumentSource fDocumentSource
      The document source.
    • fDocumentHandler

      protected org.apache.xerces.xni.XMLDocumentHandler fDocumentHandler
      The document handler.
    • fElementStack

      protected final HTMLTagBalancer.InfoStack fElementStack
      The element stack.
    • fInlineStack

      protected final HTMLTagBalancer.InfoStack fInlineStack
      The inline stack.
    • fSeenAnything

      protected boolean fSeenAnything
      True if anything has been seen. Important for the XML declaration.
    • fSeenDoctype

      protected boolean fSeenDoctype
      True if the doctype declaration has been seen.
    • fSeenRootElement

      protected boolean fSeenRootElement
      True if the root element has been seen.
    • fSeenRootElementEnd

      protected boolean fSeenRootElementEnd
      True if the end of the document element has been seen. In other words, this variable stays false until the end </HTML> tag is seen (or synthesized). It is used to ensure that extraneous events after the end of the document element do not make the document stream ill-formed.
    • fSeenHeadElement

      protected boolean fSeenHeadElement
      True if the <head> element has been seen.
    • fSeenBodyElement

      protected boolean fSeenBodyElement
      True if the <body> element has been seen.
    • fOpenedForm

      protected boolean fOpenedForm
      True if a form is in the stack (allows the opening of nested forms to be discarded).
    • tagBalancingListener

      protected HTMLTagBalancingListener tagBalancingListener
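
The roles of the fSeenRootElementEnd and fOpenedForm flags described in the field list above can be pictured with a minimal event filter. This is a sketch under stated assumptions, not the component's code: the class `SeenFlagsSketch` and its string-based events are hypothetical, and the real balancer's handling (for example under the ignore-outside-content feature) may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.cyberneko.html.
final class SeenFlagsSketch {
    private boolean seenRootElementEnd; // plays the role of fSeenRootElementEnd
    private boolean openedForm;         // plays the role of fOpenedForm
    private final List<String> forwarded = new ArrayList<>();

    void startElement(String name) {
        if (seenRootElementEnd) return;      // nothing may follow </html>
        if (name.equals("form")) {
            if (openedForm) return;          // discard the start of a nested form
            openedForm = true;
        }
        forwarded.add("<" + name + ">");
    }

    void endElement(String name) {
        if (seenRootElementEnd) return;
        if (name.equals("form")) {
            if (!openedForm) return;         // end tag of a discarded nested form
            openedForm = false;
        }
        forwarded.add("</" + name + ">");
        if (name.equals("html")) seenRootElementEnd = true;
    }

    List<String> forwarded() { return forwarded; }

    public static void main(String[] args) {
        SeenFlagsSketch f = new SeenFlagsSketch();
        f.startElement("html");
        f.startElement("form");
        f.startElement("form");   // nested: dropped
        f.endElement("form");
        f.endElement("form");     // unmatched: dropped
        f.endElement("html");
        f.startElement("p");      // after the root end: dropped
        System.out.println(String.join("", f.forwarded()));
    }
}
```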
  • Constructor Details

    • HTMLTagBalancer

      public HTMLTagBalancer()
  • Method Details

    • getFeatureDefault

      public Boolean getFeatureDefault(String featureId)
      Returns the default state for a feature.
      Specified by:
      getFeatureDefault in interface HTMLComponent
      Specified by:
      getFeatureDefault in interface org.apache.xerces.xni.parser.XMLComponent
    • getPropertyDefault

      public Object getPropertyDefault(String propertyId)
      Returns the default state for a property.
      Specified by:
      getPropertyDefault in interface HTMLComponent
      Specified by:
      getPropertyDefault in interface org.apache.xerces.xni.parser.XMLComponent
    • getRecognizedFeatures

      public String[] getRecognizedFeatures()
      Returns recognized features.
      Specified by:
      getRecognizedFeatures in interface org.apache.xerces.xni.parser.XMLComponent
    • getRecognizedProperties

      public String[] getRecognizedProperties()
      Returns recognized properties.
      Specified by:
      getRecognizedProperties in interface org.apache.xerces.xni.parser.XMLComponent
    • reset

      public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager) throws org.apache.xerces.xni.parser.XMLConfigurationException
      Resets the component.
      Specified by:
      reset in interface org.apache.xerces.xni.parser.XMLComponent
      Throws:
      org.apache.xerces.xni.parser.XMLConfigurationException
    • setFeature

      public void setFeature(String featureId, boolean state) throws org.apache.xerces.xni.parser.XMLConfigurationException
      Sets a feature.
      Specified by:
      setFeature in interface org.apache.xerces.xni.parser.XMLComponent
      Throws:
      org.apache.xerces.xni.parser.XMLConfigurationException
    • setProperty

      public void setProperty(String propertyId, Object value) throws org.apache.xerces.xni.parser.XMLConfigurationException
      Sets a property.
      Specified by:
      setProperty in interface org.apache.xerces.xni.parser.XMLComponent
      Throws:
      org.apache.xerces.xni.parser.XMLConfigurationException
    • setDocumentHandler

      public void setDocumentHandler(org.apache.xerces.xni.XMLDocumentHandler handler)
      Sets the document handler.
      Specified by:
      setDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
    • getDocumentHandler

      public org.apache.xerces.xni.XMLDocumentHandler getDocumentHandler()
      Returns the document handler.
      Specified by:
      getDocumentHandler in interface org.apache.xerces.xni.parser.XMLDocumentSource
    • startDocument

      public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start document.
      Specified by:
      startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • xmlDecl

      public void xmlDecl(String version, String encoding, String standalone, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      XML declaration.
      Specified by:
      xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • doctypeDecl

      public void doctypeDecl(String rootElementName, String publicId, String systemId, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Doctype declaration.
      Specified by:
      doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • endDocument

      public void endDocument(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End document.
      Specified by:
      endDocument in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • comment

      public void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Comment.
      Specified by:
      comment in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • processingInstruction

      public void processingInstruction(String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Processing instruction.
      Specified by:
      processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • startElement

      public void startElement(org.apache.xerces.xni.QName elem, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start element.
      Specified by:
      startElement in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • emptyElement

      public void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Empty element.
      Specified by:
      emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • startGeneralEntity

      public void startGeneralEntity(String name, org.apache.xerces.xni.XMLResourceIdentifier id, String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start entity.
      Specified by:
      startGeneralEntity in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • textDecl

      public void textDecl(String version, String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Text declaration.
      Specified by:
      textDecl in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • endGeneralEntity

      public void endGeneralEntity(String name, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End entity.
      Specified by:
      endGeneralEntity in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • startCDATA

      public void startCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start CDATA section.
      Specified by:
      startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • endCDATA

      public void endCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End CDATA section.
      Specified by:
      endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • characters

      public void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Characters.
      Specified by:
      characters in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • ignorableWhitespace

      public void ignorableWhitespace(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Ignorable whitespace.
      Specified by:
      ignorableWhitespace in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • endElement

      public void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End element.
      Specified by:
      endElement in interface org.apache.xerces.xni.XMLDocumentHandler
      Throws:
      org.apache.xerces.xni.XNIException
    • setDocumentSource

      public void setDocumentSource(org.apache.xerces.xni.parser.XMLDocumentSource source)
      Sets the document source.
      Specified by:
      setDocumentSource in interface org.apache.xerces.xni.XMLDocumentHandler
    • getDocumentSource

      public org.apache.xerces.xni.parser.XMLDocumentSource getDocumentSource()
      Returns the document source.
      Specified by:
      getDocumentSource in interface org.apache.xerces.xni.XMLDocumentHandler
    • startDocument

      public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start document.
      Throws:
      org.apache.xerces.xni.XNIException
    • startPrefixMapping

      public void startPrefixMapping(String prefix, String uri, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Start prefix mapping.
      Throws:
      org.apache.xerces.xni.XNIException
    • endPrefixMapping

      public void endPrefixMapping(String prefix, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      End prefix mapping.
      Throws:
      org.apache.xerces.xni.XNIException
    • getElement

      protected HTMLElements.Element getElement(org.apache.xerces.xni.QName elementName)
      Returns an HTML element.
    • callStartElement

      protected final void callStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Call document handler start element.
      Throws:
      org.apache.xerces.xni.XNIException
    • callEndElement

      protected final void callEndElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
      Call document handler end element.
      Throws:
      org.apache.xerces.xni.XNIException
    • getElementDepth

      protected final int getElementDepth(HTMLElements.Element element)
      Returns the depth of the open tag associated with the specified element name or -1 if no matching element is found.
      Parameters:
      element - The element.
    • getParentDepth

      protected int getParentDepth(HTMLElements.Element[] parents, short bounds)
      Returns the depth of the open tag associated with the specified element parent names or -1 if no matching element is found.
      Parameters:
      parents - The parent elements.
    • emptyAttributes

      protected final org.apache.xerces.xni.XMLAttributes emptyAttributes()
      Returns a set of empty attributes.
    • synthesizedAugs

      protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
      Returns an augmentations object with a synthesized item added.
    • modifyName

      protected static final String modifyName(String name, short mode)
      Modifies the given name based on the specified mode.
    • getNamesValue

      protected static final short getNamesValue(String value)
      Converts HTML names string value to constant value.
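
As a rough picture of how getNamesValue and modifyName fit together, the sketch below maps the property strings "upper", "lower", and "default" to a mode and then applies that mode to a name. It is an assumption-laden illustration: it uses a hypothetical `Mode` enum instead of the short constants, treats every unrecognized string as "no change", and picks `Locale.ROOT` for case conversion, none of which is confirmed by this page.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of org.cyberneko.html.
final class NamesSketch {
    /** Stand-in for the NAMES_NO_CHANGE, NAMES_UPPERCASE and NAMES_LOWERCASE constants. */
    enum Mode { NO_CHANGE, UPPERCASE, LOWERCASE }

    /** Converts a names property value to a mode; anything unrecognized means no change (assumed). */
    static Mode getNamesValue(String value) {
        if ("upper".equals(value)) return Mode.UPPERCASE;
        if ("lower".equals(value)) return Mode.LOWERCASE;
        return Mode.NO_CHANGE;
    }

    /** Returns the name rewritten according to the mode. */
    static String modifyName(String name, Mode mode) {
        switch (mode) {
            case UPPERCASE: return name.toUpperCase(Locale.ROOT);
            case LOWERCASE: return name.toLowerCase(Locale.ROOT);
            default:        return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyName("Body", getNamesValue("upper")));
        System.out.println(modifyName("Body", getNamesValue("lower")));
        System.out.println(modifyName("Body", getNamesValue("default")));
    }
}
```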