Can I access the contents of HTML elements without it being trimmed? #160
Describe what you want to achieve

I want to use this library to parse HTML documents and find links whose text (incorrectly) contains leading or trailing spaces. For example:

Please click <a href="www.example.com">link </a>to visit my website.

But it appears that the content is always trimmed. Is there any way I can access the contents of these links without trim being invoked?

Code Sample

val text: List<String> = ....
val links = text.flatMap {
    try {
        htmlDocument(it).findAll("a")
    } catch (e: ElementNotFoundException) {
        emptyList()
    }
}
val linksWithSpaces = links.filter {
    it.text != it.text.trim()
}
Replies: 1 comment
Hey, I think I would probably do something like this:

@Language("HTML")
val someMarkupWithLinks: String = """
    <div class="foo">
        <ul>
            <li>
                <span>abc</span>
                <a href="http://www.valid.com">a valid link</a>
            </li>
            <li>
                <span>def</span>
                <a href="http://www . invalid . com">an invalid link</a>
            </li>
            <li>
                <span>ghi</span>
                <a href="http://valid.com/rocks">another valid link</a>
            </li>
            <li>
                <span>jkl</span>
                <a href="http://www.invalid.com/ whitespaced/path">another invalid link</a>
            </li>
        </ul>
    </div>
"""

fun main() {
    // get the href attributes of all a-tags in the document
    val hrefValues = htmlDocument(someMarkupWithLinks) {
        a {
            findAll {
                eachHref
            }
        }
    }
    // filter for the ones that contain a whitespace
    val hrefsWithWhiteSpace = hrefValues.filter { it.contains(" ") }
    println(hrefsWithWhiteSpace) // will print following list --> '[http://www . invalid . com, http://www.invalid.com/ whitespaced/path]'
}

If you need the hrefs together with their corresponding link text, you can use eachLink instead:

htmlDocument(someMarkupWithLinks) { a { findAll { eachLink } } }
    .filter { it.value.contains(" ") }
    .also { println(it) }
// will print --> '{an invalid link=http://www . invalid . com, another invalid link=http://www.invalid.com/ whitespaced/path}'

If you want to know whether the link text contains whitespace, you can filter on the eachLink keys instead of the values.

Hope this helps. Just let me know if you have more questions :)

I'm using version
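A rough sketch of filtering on the keys rather than the values might look like the following. To stay runnable without the library, it uses a hand-written map as a hypothetical stand-in for the eachLink result (link text mapped to href, as the printed output suggests), and the helper name untrimmedLinkTexts is made up for illustration. It also assumes the keys still carry their surrounding whitespace, which this thread does not confirm; if the library trims link text before building the map, this filter would find nothing.

```kotlin
// Hypothetical stand-in for the result of `a { findAll { eachLink } }`:
// a map from link text to href. Not produced by the library here.
val links: Map<String, String> = mapOf(
    "a valid link" to "http://www.valid.com",
    "link " to "www.example.com", // trailing space, as in the question
)

// Keep only entries whose link text has leading or trailing whitespace.
// Assumes the keys have NOT already been trimmed.
fun untrimmedLinkTexts(links: Map<String, String>): Map<String, String> =
    links.filterKeys { it != it.trim() }

fun main() {
    println(untrimmedLinkTexts(links))
}
```

Using `it != it.trim()` mirrors the check from the question, so ordinary spaces between words in the link text are not flagged, only whitespace at the edges.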