If I configure the parser to track positions and feed in the fragment <table>foo<tr><td>bar</td></tr></table>, both foo and bar get parsed as text nodes. However, the source range for foo contains invalid start/end positions. I realise foo is misplaced text, but is there a specific reason why we do not populate its positions? I also see the same unexpected behaviour if I add whitespace between <table> and <tr>: the resulting text node contains invalid positions.
Here's a simple test program:
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.Range;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.HtmlTreeBuilder;
import org.jsoup.parser.Parser;
import org.jsoup.select.NodeTraversor;

public class Test {
    public static void main(String[] args) {
        // Build an HTML parser with source position tracking enabled
        HtmlTreeBuilder treeBuilder = new HtmlTreeBuilder();
        Parser parser = new Parser(treeBuilder);
        parser.setTrackPosition(true);

        // "foo" is misplaced text directly inside <table>
        Document document = parser.parseInput("<table>foo<tr><td>bar</td></tr></table>", "");

        // Print the source range of every text node in the document
        NodeTraversor.traverse((Node node, int depth) -> {
            if (node instanceof TextNode textNode) {
                Range sourceRange = textNode.sourceRange();
                System.out.printf("text=%s start=%d end=%d%n",
                    textNode.text(),
                    sourceRange.start().pos(),
                    sourceRange.end().pos());
            }
        }, document);
    }
}
And the unexpected output:
$ java -cp ~/.m2/repository/org/jsoup/jsoup/1.15.4/jsoup-1.15.4.jar Test.java
text=foo start=0 end=-1 # start/end positions are invalid here
text=bar start=18 end=21
Thanks, fixed. The parser was losing the Token start/end positions for fostered table text because we were only storing the pending string data rather than the original token.
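For anyone reading along, here is a rough, self-contained sketch of the kind of bug described above. It is not jsoup's actual code: `CharToken`, `TextNode`, `flushStrings`, and `flushTokens` are hypothetical names made up for illustration, and `-1` is used here simply as an "untracked" marker. It only shows why buffering the string data alone drops the positions, while buffering the tokens keeps them.

```java
import java.util.ArrayList;
import java.util.List;

public class FosterTextSketch {
    // Hypothetical stand-in for a character token that knows its source offsets.
    record CharToken(String data, int start, int end) {}

    // Hypothetical stand-in for a text node; -1 marks an untracked position.
    record TextNode(String text, int start, int end) {}

    // Lossy variant: only the string data is kept pending, so there are
    // no positions left to attach when the text is finally inserted.
    static List<TextNode> flushStrings(List<CharToken> tokens) {
        List<String> pending = new ArrayList<>();
        for (CharToken token : tokens) {
            pending.add(token.data());
        }
        List<TextNode> nodes = new ArrayList<>();
        for (String data : pending) {
            nodes.add(new TextNode(data, -1, -1));
        }
        return nodes;
    }

    // Position-preserving variant: the whole token is kept pending, so the
    // start/end offsets are still available at insertion time.
    static List<TextNode> flushTokens(List<CharToken> tokens) {
        List<CharToken> pending = new ArrayList<>(tokens);
        List<TextNode> nodes = new ArrayList<>();
        for (CharToken token : pending) {
            nodes.add(new TextNode(token.data(), token.start(), token.end()));
        }
        return nodes;
    }

    public static void main(String[] args) {
        // "foo" sits at offsets 7..10 in "<table>foo<tr><td>bar</td></tr></table>"
        List<CharToken> tableText = List.of(new CharToken("foo", 7, 10));
        System.out.println(flushStrings(tableText));
        System.out.println(flushTokens(tableText));
    }
}
```

The offsets 7 and 10 are just where foo sits in the example fragment from the report; the sketch makes no claim about what any particular jsoup release prints.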