Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lightning: not set pos to end if fail to ReadUntil #40232

Merged
merged 14 commits into from
Jan 12, 2023
4 changes: 3 additions & 1 deletion br/pkg/lightning/mydump/csv_parser.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ import (
"github.com/pingcap/tidb/br/pkg/lightning/worker"
"github.com/pingcap/tidb/types"
"github.com/pingcap/tidb/util/mathutil"
"go.uber.org/zap"
)

var (
Expand Down Expand Up @@ -341,7 +342,7 @@ func (parser *CSVParser) readUntil(chars *byteSet) ([]byte, byte, error) {
if err == nil {
err = io.EOF
}
parser.pos += int64(len(buf))
// parser.pos += int64(len(buf))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changed the semantics of pos. pos means the current file offset. Since we have read to the end, the pos should be set to the end offset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a test. I didn't change it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This demand is from import on cloud. We are trying to use the offset reported in the syntax error to show the content in users' files. However, in an unterminated quote error, the offset always exceeds the range of the file. So I am about to return the position to the previous one.

If we meet an error, setting back the pos seems safe because we will never use it again.

Copy link
Contributor

@sleepymole sleepymole Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, in an unterminated quote error, the offset always exceeds the range of the file.

Why does offset exceed the range of the file? The end offset is also a valid offset.

It's hard to tell which pos is the starting point of syntax error. Suppose we have a line of data: a,b,c,"xxxxx
It's obvious that it has an unterminated quote error. But what is the right pos of syntax error? Is it the start pos of "xxxx or the end pos of this line? In my opinion, the end pos is not a wrong answer, because everything is still valid until the end.

I'm not sure how you use the syntax error pos. If you just want to print content near the error, you can print the content around the pos. e.g. When pos is 1000, print content[800:1200].

Copy link
Contributor Author

@buchuitoudegou buchuitoudegou Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tend to present the starting point of the unterminated quote (i.e., the first quote). Because users should fix the data files and import them again, informing them of the starting point would be better than telling them the end offset of the whole file, which is a bit meaningless

Copy link
Contributor Author

@buchuitoudegou buchuitoudegou Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does offset exceed the range of the file? The end offset is also a valid offset.

It is the position of EOF (e.g., the size of the file is 35 bytes, then the final offset is 35).

return buf, 0, errors.Trace(err)
}
index := IndexAnyByte(parser.buf, chars)
Expand Down Expand Up @@ -450,6 +451,7 @@ func (parser *CSVParser) readQuotedField() error {
content, terminator, err := parser.readUntil(&parser.quoteByteSet)
err = parser.replaceEOF(err, errUnterminatedQuotedField)
if err != nil {
parser.Logger.Error("err content", zap.ByteString("content", content))
sleepymole marked this conversation as resolved.
Show resolved Hide resolved
return err
}
parser.recordBuffer = append(parser.recordBuffer, content...)
Expand Down
1 change: 1 addition & 0 deletions br/tests/lightning_csv/errData/db-schema-create.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
create database if not exists db;
1 change: 1 addition & 0 deletions br/tests/lightning_csv/errData/db.test-schema.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
create table test(a int primary key, b int, c int, d int);
4 changes: 4 additions & 0 deletions br/tests/lightning_csv/errData/db.test.1.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
a,b,c,d
1,2,3,4
2,10,4,5
1111,",7,8
6 changes: 6 additions & 0 deletions br/tests/lightning_csv/run.sh
Original file line number Diff line number Diff line change
Expand Up @@ -41,3 +41,9 @@ for BACKEND in tidb local; do
check_not_contains 'id:'

done

run_lightning --backend local -d "tests/$TEST_NAME/errData" --log-file "$TEST_DIR/lightning-err.log"
# err content presented
grep ",7,8" "$TEST_DIR/lightning-err.log"
# pos should not set to end
grep "[\"syntax error\"] [pos=31]" "$TEST_DIR/lightning-err.log"