ClickHouseBulkCopy dropping rows #218

Closed
jad97 opened this issue Oct 25, 2022 · 5 comments
Labels: bug (Something isn't working)
@jad97 commented Oct 25, 2022

Using a sample data table and a sample data file containing 100 million rows:

CREATE TABLE my_database.data
(
	sensor UInt32,
	time DateTime,
	temp UInt32,
	precip Float32,
	wind_speed UInt32,
	wind_dir UInt32
)
ENGINE = MergeTree()
PRIMARY KEY (sensor, time)

and the following code to bulk-insert it in batches:

// Runs inside an async method; cancelToken is a CancellationToken in scope.
using (var fs = new StreamReader(File.Open("100M.dat", FileMode.Open, FileAccess.Read)))
using (var connection = new ClickHouse.Client.ADO.ClickHouseConnection("conn_string"))
using (var bulkCopy = new ClickHouseBulkCopy(connection))
{
    connection.Open();
    bulkCopy.DestinationTableName = "my_database.data";
    fs.ReadLine(); // throw away header line

    List<object[]> rows = new List<object[]>();
    int batchSize = 100000; // Setting to >1M drops rows
    string? line = null;
    while ((line = fs.ReadLine()) != null)
    {
        object[] vals = line.Split(',');
        rows.Add(new object[] { Convert.ToInt32(vals[0]), Convert.ToDateTime(vals[1]), Convert.ToInt32(vals[2]),
            Convert.ToDouble(vals[3]), Convert.ToInt32(vals[4]), Convert.ToInt32(vals[5]) });
        if (rows.Count % batchSize == 0)
        {
            await bulkCopy.WriteToServerAsync(rows, cancelToken);
            if (bulkCopy.RowsWritten % batchSize != 0)
            {
                throw new Exception($"Dropped rows: {bulkCopy.RowsWritten} rows written");
            }
            rows = new List<object[]>();
        }
    }
    if (rows.Count > 0)
    {
        await bulkCopy.WriteToServerAsync(rows, cancelToken);
    }
}

Setting batchSize to 1,000,000 or more will consistently drop rows (bulkCopy.RowsWritten ends up less than the total number of rows submitted).
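One way to cross-check the drop, assuming the connection and bulkCopy instances from the code above are still in scope, is to compare RowsWritten against a server-side count. ClickHouseConnection follows the standard ADO.NET pattern, so a minimal sketch (not part of the original report) might look like:

// Sketch: compare the cumulative RowsWritten counter against the actual
// table count after the load. Assumes ClickHouse.Client's ADO.NET surface
// and the table from the schema above.
using var cmd = connection.CreateCommand();
cmd.CommandText = "SELECT count() FROM my_database.data";
var rowsInTable = Convert.ToInt64(await cmd.ExecuteScalarAsync());
Console.WriteLine($"table: {rowsInTable}, RowsWritten: {bulkCopy.RowsWritten}");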

DarkWanderer self-assigned this Oct 25, 2022
DarkWanderer added the bug label Oct 25, 2022
@DarkWanderer (Owner) commented Oct 25, 2022

  1. Does this reproduce with the latest version? (6.x)
  2. Does this reproduce with the previous major version? (5.x)
  3. Does the number of rows in the table correspond to the RowsWritten counter or to the original data?
  4. Are there any exceptions if you run the above code under a debugger?

@jad97 (Author) commented Oct 25, 2022

Thanks for the quick response. Answers:

  1. The latest version (6.1.0) reproduces the problem with batch sizes of 1M and 5M.
  2. Version 5.1.1 works OK with batch sizes of 1M and 5M.
  3. The number of rows in the table corresponds to the RowsWritten counter. For example, with a batch size of 1M the output is "System.Exception: Dropped rows: 999991 rows written" and the table contains 999991 rows.
  4. No exceptions in the debugger.

Code used to generate data:

public void GenerateData(int records)
{
    // 100M = ~40s ~3.5GB
    int sensors = 20;
    DateTime time = new DateTime(2010, 1, 1);
    int count = 0;

    var r = new Random();
    using (var fs = new StreamWriter(File.Open("100M.dat", FileMode.CreateNew, FileAccess.Write)))
    {
        fs.WriteLine("sensor,time,temp,precip,wind_speed,wind_dir");
        while (count < records)
        {
            for (int sensor = 1; sensor <= sensors; sensor++)
            {
                int temp = r.Next(15, 105);
                double precip = r.Next(0, 100) / 100.0;
                int windSpeed = r.Next(0, 30);
                int windDir = r.Next(0, 7);

                fs.WriteLine($"{sensor},{time.ToString("yyyy-MM-dd HH:mm:ss")},{temp},{precip},{windSpeed},{windDir}");
                count++;
            }
            time = time.AddSeconds(1);
        }
    }
}

@DarkWanderer (Owner) commented Oct 30, 2022

Found the reason for the issue - I had a logic bug in the new BulkCopy implementation (version 6.1.0 specifically was affected). Thank you very much for the bug report and the detailed reproduction scenario.

I've released a fix in version 6.1.1 - please let me know if the issue still reproduces.

@jad97 (Author) commented Nov 1, 2022

Looks to be fixed. Used batch sizes of both 1M and 5M rows. Thanks!

@DarkWanderer (Owner) commented
Great, glad to hear it's fixed!

Note that you can tweak the internal batch size in ClickHouseBulkCopy by setting the BatchSize property on the object:

var bulkCopy = new ClickHouseBulkCopy(connection)
{
    DestinationTableName = "mydatabase.mytable",
    MaxDegreeOfParallelism = 2,
    BatchSize = 500000
};
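With internal batching, the manual batching loop from the repro can collapse to a single call. A minimal sketch, assuming the same CSV layout and table as above and that WriteToServerAsync accepts a lazily-evaluated IEnumerable<object[]> (as the repro's List<object[]> usage suggests):

// Sketch: stream rows lazily and let ClickHouseBulkCopy batch internally
// via BatchSize. Assumes the CSV layout and table from the repro above.
static IEnumerable<object[]> ReadRows(string path)
{
    using var fs = new StreamReader(path);
    fs.ReadLine(); // skip header
    string? line;
    while ((line = fs.ReadLine()) != null)
    {
        var vals = line.Split(',');
        yield return new object[] { Convert.ToInt32(vals[0]), Convert.ToDateTime(vals[1]),
            Convert.ToInt32(vals[2]), Convert.ToDouble(vals[3]),
            Convert.ToInt32(vals[4]), Convert.ToInt32(vals[5]) };
    }
}

// Usage (inside an async method):
using var connection = new ClickHouse.Client.ADO.ClickHouseConnection("conn_string");
connection.Open();
var bulkCopy = new ClickHouseBulkCopy(connection)
{
    DestinationTableName = "my_database.data",
    BatchSize = 500000
};
await bulkCopy.WriteToServerAsync(ReadRows("100M.dat"), CancellationToken.None);
Console.WriteLine($"{bulkCopy.RowsWritten} rows written");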
