JSoup with html snippet

François De Serres edited this page Jul 10, 2015 · 1 revision

JSoup is more HTML5-friendly than TagSoup. However, html-snippet passes a java.io.StringReader instance to html-resource, and Jsoup/parse has no overload that accepts a Reader, so the two cannot be combined directly.

Many thanks to @dhruvbhatia for proposing this workaround:

(ns my.namespace
  (:import [org.jsoup Jsoup]
           [org.jsoup.nodes Attribute Attributes Comment DataNode Document
                            DocumentType Element Node TextNode XmlDeclaration]
           [org.jsoup.parser Parser Tag]))

(def ^:private ->key (comp keyword #(.. % toString toLowerCase)))

(defprotocol IEnlive
  (->nodes [d] "Convert object into Enlive node(s)."))

(extend-protocol IEnlive
  Attribute
  (->nodes [a] [(->key (.getKey a)) (.getValue a)])

  Attributes
  (->nodes [as] (not-empty (into {} (map ->nodes as))))

  Comment
  (->nodes [c] {:type :comment :data (.getData c)})

  DataNode
  (->nodes [dn] (str dn))

  Document
  (->nodes [d] (not-empty (map ->nodes (.childNodes d))))

  DocumentType
  (->nodes [dtd] {:type :dtd :data ((juxt :name :publicid :systemid) (->nodes (.attributes dtd)))})

  Element
  (->nodes [e] {:tag     (->key (.tagName e))
                :attrs   (->nodes (.attributes e))
                :content (not-empty (map ->nodes (.childNodes e)))})

  TextNode
  (->nodes [tn] (.getWholeText tn))

  nil
  (->nodes [_] nil))

; custom parser fn that lets Enlive use JSoup
; note: the charset given to Jsoup/parse must match the encoding of the bytes in the stream
(defn parser
  "Parse an HTML document stream into Enlive nodes using JSoup."
  [stream]
  (with-open [^java.io.InputStream stream stream]
    (->nodes (Jsoup/parse stream "ISO-8859-1" ""))))

; with that in place, html-resource accepts an input stream
(net.cgrand.enlive-html/html-resource
  (-> "<h1>Hi, cgrand!</h1>"
      (.getBytes "ISO-8859-1")
      java.io.ByteArrayInputStream.)
  {:parser parser})
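
The string-to-stream conversion in the call above could be factored into a small helper. The following is only a sketch: the name string->stream is hypothetical and not part of Enlive or JSoup, and the default charset is assumed to be the same ISO-8859-1 that the parser function hardcodes.

```clojure
;; Hypothetical helper, not part of Enlive or JSoup.
(defn string->stream
  "Encode the HTML string s with charset (default ISO-8859-1, to match
  the parser fn above) and wrap the resulting bytes in an InputStream."
  ([s] (string->stream s "ISO-8859-1"))
  ([^String s ^String charset]
   (java.io.ByteArrayInputStream. (.getBytes s charset))))

(comment
  ;; Assumes Enlive is on the classpath and `parser` is the fn defined above.
  (net.cgrand.enlive-html/html-resource
    (string->stream "<h1>Hi, cgrand!</h1>")
    {:parser parser}))
```

Because the bytes are encoded as ISO-8859-1, characters outside that charset would be replaced during encoding. If that matters, the helper and the parser function would both need to use another charset such as UTF-8.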