Class HtmlTool


  • @DefaultKey("htmlTool")
    public class HtmlTool
    extends org.apache.velocity.tools.generic.SafeConfig
    An Apache Velocity tool that provides utility methods to manipulate HTML code using the jsoup HTML5 parser.

    The methods utilise CSS selectors to refer to specific elements for manipulation.

    Since:
    1.0
    Author:
    Andrius Velykis
    See Also:
    jsoup HTML parser, jsoup CSS selectors
    • Constructor Detail

      • HtmlTool

        public HtmlTool()
    • Method Detail

      • configure

        protected void configure​(org.apache.velocity.tools.generic.ValueParser values)
        Overrides:
        configure in class org.apache.velocity.tools.generic.SafeConfig
        See Also:
        SafeConfig.configure(ValueParser)
      • split

        public List<String> split​(String content,
                                  String separatorCssSelector)
        Splits the given HTML content into partitions based on the given separator selector. The separators themselves are dropped from the results.
        Parameters:
        content - HTML content to split
        separatorCssSelector - CSS selector for separators.
        Returns:
        a list of HTML partitions split on separator locations, but without the separators.
        Since:
        1.0
        See Also:
        split(String, String, JoinSeparator)
      • splitOnStarts

        public List<String> splitOnStarts​(String content,
                                          String separatorCssSelector)
        Splits the given HTML content into partitions based on the given separator selector. The separators are kept as first elements of the partitions.

        Note that the first part is removed if the split was successful. This is because the first part does not include the separator.

        Parameters:
        content - HTML content to split
        separatorCssSelector - CSS selector for separators
        Returns:
        a list of HTML partitions split on separator locations (except the first one), with separators at the beginning of each partition
        Since:
        1.0
        See Also:
        split(String, String, JoinSeparator)
      • split

        public List<String> split​(String content,
                                  String separatorCssSelector,
                                  String separatorStrategy)
        Splits the given HTML content into partitions based on the given separator selector. The separators are either dropped or joined with the partition before/after them, depending on the indicated separator strategy.
        Parameters:
        content - HTML content to split
        separatorCssSelector - CSS selector for separators
        separatorStrategy - strategy to drop or keep separators, one of "after", "before" or "no"
        Returns:
        a list of HTML partitions split on separator locations.
        Since:
        1.0
        See Also:
        split(String, String, JoinSeparator)
      • split

        public List<String> split​(String content,
                                  String separatorCssSelector,
                                  HtmlTool.JoinSeparator separatorStrategy)
        Splits the given HTML content into partitions based on the given separator selector. The separators are either dropped or joined with the partition before/after them, depending on the indicated separator strategy.

        Note that the splitting algorithm tries to resolve nested elements so that the returned partitions are self-contained HTML elements. The nesting is normally contained within the first applicable partition.

        Parameters:
        content - HTML content to split
        separatorCssSelector - CSS selector for separators
        separatorStrategy - strategy to drop or keep separators
        Returns:
        a list of HTML partitions split on separator locations. If no splitting occurs, the original content is returned as the single element of the list.
        Since:
        1.0
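        The separator strategies above can be illustrated without jsoup on a flat list of items. The following is a minimal sketch only, not the library's implementation: SplitSketch, Join and the "p:"/"h2:" item labels are hypothetical names, nesting is ignored, and it assumes that "after" joins a separator to the partition that follows it and "before" to the partition that precedes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class SplitSketch {
    // Hypothetical stand-in for HtmlTool.JoinSeparator.
    enum Join { NO, BEFORE, AFTER }

    // Splits a flat list of items into partitions at every separator item.
    static List<List<String>> split(List<String> items,
                                    Predicate<String> isSeparator,
                                    Join join) {
        List<List<String>> parts = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String item : items) {
            if (!isSeparator.test(item)) {
                current.add(item);
                continue;
            }
            if (join == Join.BEFORE) {
                current.add(item);          // separator closes the preceding partition
            }
            parts.add(current);
            current = new ArrayList<>();
            if (join == Join.AFTER) {
                current.add(item);          // separator opens the following partition
            }
        }
        parts.add(current);
        return parts;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("p:intro", "h2:A", "p:a", "h2:B", "p:b");
        Predicate<String> isHeading = s -> s.startsWith("h2:");
        System.out.println(split(items, isHeading, Join.NO));
        // [[p:intro], [p:a], [p:b]]
        System.out.println(split(items, isHeading, Join.AFTER));
        // [[p:intro], [h2:A, p:a], [h2:B, p:b]]
    }
}
```

        Under the same assumption, dropping the first partition of the AFTER result gives the behaviour described for splitOnStarts: every remaining partition begins with its separator.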
      • reorderToTop

        public String reorderToTop​(String content,
                                   String selector,
                                   int amount)
        Reorders elements in HTML content so that selected elements are found at the top of the content. Can be limited to a certain amount, e.g. to bring just the first of selected elements to the top.
        Parameters:
        content - HTML content to reorder
        selector - CSS selector for elements to bring to top of the content
        amount - Maximum number of elements to reorder
        Returns:
        HTML content with reordered elements, or the original content if no such elements are found.
        Since:
        1.0
      • reorderToTop

        public String reorderToTop​(String content,
                                   String selector,
                                   int amount,
                                   String wrapRemaining)
        Reorders elements in HTML content so that selected elements are found at the top of the content. Can be limited to a certain amount, e.g. to bring just the first of selected elements to the top.
        Parameters:
        content - HTML content to reorder
        selector - CSS selector for elements to bring to top of the content
        amount - Maximum number of elements to reorder
        wrapRemaining - HTML to wrap the remaining (non-reordered) part
        Returns:
        HTML content with reordered elements, or the original content if no such elements are found.
        Since:
        1.0
      • extract

        public HtmlTool.ExtractResult extract​(String content,
                                              String selector,
                                              int amount)
        Extracts HTML elements from the main HTML content. The result consists of the extracted HTML elements and the remainder of HTML content, with these elements removed. Can be limited to a certain amount, e.g. to extract just the first of selected elements.
        Parameters:
        content - HTML content to extract elements from
        selector - CSS selector for elements to extract
        amount - Maximum number of elements to extract
        Returns:
        HTML content of the extracted elements together with the remainder of the original content. If no elements are found, the remainder contains the original content.
        Since:
        1.0
      • setAttr

        public String setAttr​(String content,
                              String selector,
                              String attributeKey,
                              String value)
        Sets an attribute to the given value on the selected elements in HTML.
        Parameters:
        content - HTML content to set attributes on
        selector - CSS selector for elements to modify
        attributeKey - Attribute name
        value - Attribute value
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • getAttr

        public List<String> getAttr​(String content,
                                    String selector,
                                    String attributeKey)
        Retrieves attribute values from the selected elements in HTML. Returns all attribute values for the selector, since it can match more than one element.
        Parameters:
        content - HTML content to read attributes from
        selector - CSS selector for elements to find
        attributeKey - Attribute name
        Returns:
        Attribute values for all matching elements. If no elements are found, an empty list is returned.
        Since:
        1.0
      • addClass

        public String addClass​(String content,
                               String selector,
                               List<String> classNames,
                               int amount)
        Adds given class names to the elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to add classes to
        classNames - Names of classes to add to the selected elements
        amount - Maximum number of elements to modify
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • addClass

        public String addClass​(String content,
                               String selector,
                               List<String> classNames)
        Adds given class names to the elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to add classes to
        classNames - Names of classes to add to the selected elements
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • addClass

        public String addClass​(String content,
                               String selector,
                               String className)
        Adds given class to the elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to add the class to
        className - Name of class to add to the selected elements
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • wrap

        public String wrap​(String content,
                           String selector,
                           String wrapHtml,
                           int amount)
        Wraps elements in HTML with the given HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to wrap
        wrapHtml - HTML to use for wrapping the selected elements
        amount - Maximum number of elements to modify
        Returns:
        HTML content with modified elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • remove

        public String remove​(String content,
                             String selector)
        Removes elements from HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to remove
        Returns:
        HTML content with removed elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • replace

        public String replace​(String content,
                              String selector,
                              String replacement)
        Replaces elements in HTML.
        Parameters:
        content - HTML content to modify
        selector - CSS selector for elements to replace
        replacement - HTML replacement (must parse to a single element)
        Returns:
        HTML content with replaced elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • replaceAll

        public String replaceAll​(String content,
                                 Map<String,​String> replacements)
        Replaces elements in HTML.
        Parameters:
        content - HTML content to modify
        replacements - Map of CSS selectors to their replacement HTML texts. CSS selectors find elements to be replaced with the HTML in the mapping. The HTML must parse to a single element.
        Returns:
        HTML content with replaced elements. If no elements are found, the original content is returned.
        Since:
        1.0
      • text

        public List<String> text​(String content,
                                 String selector)
        Retrieves text content of the selected elements in HTML. Renders the element's text as it would be displayed on the web page (including its children).
        Parameters:
        content - HTML content with the elements
        selector - CSS selector for elements to extract contents
        Returns:
        A list of element texts as rendered to display. Empty list if no elements are found.
        Since:
        1.0
      • headingAnchorToId

        public String headingAnchorToId​(String content)
        Transforms the given HTML content by moving anchor (<a name="myheading">) names to IDs for heading elements.

        The anchors are used to indicate positions within an HTML page. In HTML5, however, the name attribute is no longer supported on the <a> tag. Positions within pages are indicated using the id attribute instead, e.g. <h1 id="myheading">.

        The method finds anchors inside, immediately before or after the heading tags and uses their name as heading id instead. The anchors themselves are removed.

        Parameters:
        content - HTML content to modify
        Returns:
        HTML content with modified elements. Anchor names are used for adjacent headings, and anchor tags are removed. If no elements are found, the original content is returned.
        Since:
        1.0
      • concat

        public static List<String> concat​(List<String> elements,
                                          String text,
                                          boolean append)
        Utility method to concatenate a String to a list of Strings. The text can be either appended or prepended.
        Parameters:
        elements - list of elements to append/prepend the text to
        text - the given text to append/prepend
        append - if true, text will be appended to the elements. If false, it will be prepended
        Returns:
        list of elements with the text appended/prepended
        Since:
        1.0
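        The append/prepend behaviour can be sketched in a few lines. This is an illustration under an assumption, not the library source: it reads the parameter description as meaning that the text is joined to each element of the list, and ConcatSketch is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConcatSketch {
    // Returns a new list in which the text is joined to every element.
    // Assumption: the text applies to each element, not to the list as a whole.
    static List<String> concat(List<String> elements, String text, boolean append) {
        List<String> result = new ArrayList<>(elements.size());
        for (String element : elements) {
            result.add(append ? element + text : text + element);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("intro", "usage");
        System.out.println(concat(ids, ".html", true));  // [intro.html, usage.html]
        System.out.println(concat(ids, "#", false));     // [#intro, #usage]
    }
}
```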
      • ensureHeadingIds

        public String ensureHeadingIds​(String content,
                                       String idSeparator)
        Transforms the given HTML content by adding IDs to all heading elements (h1-6) that do not have one.

        IDs on heading elements are used to indicate positions within an HTML page in HTML5. If a heading tag without an id is found, its "slug" is generated automatically based on the heading contents and used as the ID.

        Note that the algorithm also modifies existing IDs that have symbols not allowed in CSS selectors, e.g. ":", ".", etc. The symbols are removed.

        Parameters:
        content - HTML content to modify
        Returns:
        HTML content with all heading elements having id attributes. If all headings already had IDs, the original content is returned.
        Since:
        1.0
      • fixTableHeads

        public String fixTableHeads​(String content)
        Fixes table heads: wraps rows with <th> (table heading) elements into a <thead> element if they are currently in <tbody>.
        Parameters:
        content - HTML content to modify
        Returns:
        HTML content with all table heads fixed. If all heads were correct, the original content is returned.
        Since:
        1.0
      • slug

        public static String slug​(String input,
                                  String separator)
        Creates a slug (Latin text with no whitespace or other symbols) for a longer text (e.g. to use in URLs).
        Parameters:
        input - text to generate the slug from
        separator - separator for whitespace replacement
        Returns:
        the slug of the given text that contains alphanumeric symbols and separator only
        Since:
        1.0
        See Also:
        https://www.codecodex.com/wiki/Generate_a_url_slug
      • slug

        public static String slug​(String input)
        Creates a slug (Latin text with no whitespace or other symbols) for a longer text (e.g. to use in URLs). Uses "-" as a whitespace separator.
        Parameters:
        input - text to generate the slug from
        Returns:
        the slug of the given text that contains alphanumeric symbols and "-" only
        Since:
        1.0
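        One common way to produce such a slug is to decompose accented characters, drop everything that is not an ASCII letter or digit, and join the remaining words with the separator. The sketch below follows that approach using only the JDK; SlugSketch is a hypothetical name, and the exact rules of the library (for example its handling of punctuation inside words or of letter case) may differ.

```java
import java.text.Normalizer;
import java.util.Locale;

class SlugSketch {
    // Builds a lower-case slug from the input, joining words with the separator.
    static String slug(String input, String separator) {
        // NFD splits accented letters into a base letter plus combining marks,
        // so the marks can be discarded together with other symbols.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (String word : decomposed.trim().split("\\s+")) {
            String cleaned = word.replaceAll("[^A-Za-z0-9]", "");
            if (cleaned.isEmpty()) {
                continue;                   // the word consisted of symbols only
            }
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(cleaned.toLowerCase(Locale.ENGLISH));
        }
        return out.toString();
    }

    // Mirrors the one-argument overload: "-" is the default separator.
    static String slug(String input) {
        return slug(input, "-");
    }

    public static void main(String[] args) {
        System.out.println(slug("Hello World"));              // hello-world
        System.out.println(slug("Caf\u00e9 au lait!", "_"));  // cafe_au_lait
    }
}
```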
      • headingTree

        public List<? extends HtmlTool.IdElement> headingTree​(String content)
        Reads all headings in the given HTML content as a hierarchy. Subsequent smaller headings are nested within bigger ones, e.g. <h2> is nested under preceding <h1>.

        Only headings with IDs are included in the hierarchy. The result elements contain ID and heading text for each heading. The hierarchy is useful to generate a Table of Contents for a page.

        Parameters:
        content - HTML content to extract heading hierarchy from
        Returns:
        a list of top-level heading items (with id and text). The remaining headings are nested within these top-level items. Empty list if no headings are in the content.
        Since:
        1.0
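        The nesting rule described above (a smaller heading goes under the closest preceding bigger one) can be sketched with a stack of currently open headings. This is an illustrative sketch that works on an already extracted flat list; HeadingTreeSketch and Node are hypothetical names and do not reflect the actual HtmlTool.IdElement type.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class HeadingTreeSketch {
    // Hypothetical heading item: level 1-6, id and text, plus nested children.
    static final class Node {
        final int level;
        final String id;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(int level, String id, String text) {
            this.level = level;
            this.id = id;
            this.text = text;
        }
    }

    // Turns headings in document order into a hierarchy of top-level items.
    static List<Node> buildTree(List<Node> headings) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> open = new ArrayDeque<>();   // chain of ancestors, innermost first
        for (Node heading : headings) {
            // Close every open heading of the same or a smaller size.
            while (!open.isEmpty() && open.peek().level >= heading.level) {
                open.pop();
            }
            if (open.isEmpty()) {
                roots.add(heading);
            } else {
                open.peek().children.add(heading);
            }
            open.push(heading);
        }
        return roots;
    }
}
```

        For the sequence h1, h2, h3, h2, h1 this yields two top-level items; the first holds both h2 headings, and the h3 is nested under the first h2.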
      • parseBodyFragment

        public static org.jsoup.nodes.Element parseBodyFragment​(String content)
        A generic method to use jsoup parser on an arbitrary HTML body fragment. Allows writing HTML manipulations in the template without adding Java code to the class.
        Parameters:
        content - HTML content to parse
        Returns:
        the wrapper element for the parsed content (i.e. the body element, as if the content were the contents of the body).
        Since:
        1.0