
Make COPY INTO load multiple files from a directory in parallel #4584

Closed
wubx opened this issue Mar 26, 2022 · 2 comments
Labels: A-query Area: databend query · C-feature Category: feature · good first issue Category: good first issue
Milestone: v0.8

Comments

wubx (Member) commented Mar 26, 2022

Summary

EC2: c5n.9xlarge
There are 100 files under the m_ontime directory.

copy into ontime
  from 's3://repo.databend.rs/m_ontime/'
  pattern ='.*[.]csv'
  file_format = (type = 'CSV' field_delimiter = '\t'  record_delimiter = '\n' skip_header = 0);

Query OK, 0 rows affected (31 min 34.38 sec)
Read 405375310 rows, 293.43 GB in 1894.383 sec., 213.99 thousand rows/sec., 154.89 MB/sec.

dstat output during the COPY INTO run:

--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  2   1  97   0   0|   0     0 | 624B  450B|   0     0 | 425   140
  2   0  97   0   0|   0     0 | 100k   24M|   0     0 |3109   201
  4   0  96   0   0|   0     0 |  57M  130k|   0     0 |7143  5933
  3   0  97   0   0|   0     0 |  27M  102k|   0     0 |4794   993
  4   0  96   0   0|   0     0 |  50M  168k|   0     0 |7801  3669
  3   0  97   0   0|   0     0 |  32M  125k|   0     0 |5592  1823
  4   0  96   0   0|   0     0 |  35M  165k|   0     0 |7531  4034
  4   0  96   0   0|   0     0 |  42M  165k|   0     0 |7608  3506
  4   0  96   0   0|   0   104k|  49M  198k|   0     0 |9039  4587
  3   1  96   0   0|   0     0 |  17M   76k|   0     0 |3639  1329
  2   0  97   0   0|   0     0 | 108B  376B|   0     0 | 428   160
  2   0  98   0   0|   0     0 |6854k   25M|   0     0 |4691  1431
  4   0  96   0   0|   0     0 |  43M  106k|   0     0 |7060  4434
  4   0  96   0   0|   0     0 |  40M  130k|   0     0 |7516  4505
  4   0  96   0   0|   0     0 |  43M  153k|   0     0 |7406  4944
  4   0  96   0   0|   0     0 |  33M  116k|   0     0 |5792  3671
  4   0  96   0   0|   0     0 |  45M  179k|   0     0 |8793  5368
  4   0  96   0   0|   0     0 |  47M  178k|   0     0 |8669  5459
  4   0  96   0   0|   0     0 |  49M  152k|   0     0 |7376  4569
  2   1  97   0   0|   0     0 |  11M   28k|   0     0 |1728   887
  2   0  97   0   0|   0     0 |4146B  929k|   0     0 | 506   134
  3   0  97   0   0|   0     0 |  46M   22M|   0     0 |9437  6507

Most of the machine's resources go unused. For comparison, here is the same data loaded in parallel via the streaming_load HTTP endpoint:

mkdir ontime
cd ontime
wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip

ls *.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset"
time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8000/v1/streaming_load
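Here xargs -P 8 keeps eight concurrent curl uploads in flight, each streaming one CSV into the table through the /v1/streaming_load endpoint, which soaks up the cores that COPY INTO left idle.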

real 2m54.668s
user 0m9.752s
sys 0m46.410s

The parallel streaming_load run finishes in under 3 minutes versus more than 31 minutes for COPY INTO, roughly a 10x speedup. I think supporting parallelism in COPY INTO would make data loading much faster.
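Until COPY INTO can parallelize internally, a client-side fan-out in the same spirit is possible. Below is a minimal sketch, assuming COPY INTO accepts a FILES = ('<name>') clause to restrict the copy to one file, that Databend's MySQL-protocol endpoint is on localhost:3307, and that the objects follow a made-up ontime_{N}.csv naming; none of these details come from this issue:

# Sketch only: fan out one COPY INTO per file, 8 at a time.
# FILES = (...) support, the connection details, and the
# ontime_{N}.csv file names are all assumptions, not facts from this issue.
seq 1 100 | xargs -P 8 -I{} mysql -h127.0.0.1 -P3307 -uroot -e "
  copy into ontime
    from 's3://repo.databend.rs/m_ontime/'
    files = ('ontime_{}.csv')
    file_format = (type = 'CSV' field_delimiter = '\t' record_delimiter = '\n' skip_header = 0);"

Server-side parallelism would still be preferable, since it avoids per-connection overhead and can balance work across files of uneven size.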

wubx added the C-feature Category: feature label Mar 26, 2022
BohuTANG added the good first issue Category: good first issue label Mar 26, 2022
BohuTANG added the A-query Area: databend query label Mar 28, 2022
GrapeBaBa (Contributor) commented

@sundy-li Is this actually the same issue as the one we discussed yesterday?

sundy-li (Member) commented Apr 10, 2022

Yes, duplicate of #4308.

Xuanwo added this to the v0.8 milestone May 20, 2022