Data providing institutions

Julaiti Alafate edited this page Apr 6, 2020 · 14 revisions

Institutions used for training

We use the data from 7 institutions for training:

  • AGSO
  • JAMSTEC/JAMSTEC2
  • NGA/NGA2
  • NGDC
  • NOAA_geodas
  • SIO
  • US_multi

Two institutions, JAMSTEC and NGA, have their cruises divided into two subsets each (JAMSTEC/JAMSTEC2 and NGA/NGA2). We train a separate model on each subset and test the models individually, so that we can evaluate whether a model trained on one subset generalizes to the other.

How we divide data

We divide data in two steps:

  1. Chunking: we chunk the measurements of each cruise into multiple small segments. For single-beam measurements, each segment contains around 5,000 measurements; for multi-beam measurements, each segment contains around 100,000 measurements.
  2. Train/test split: we then split the segments into three sets, used for training, validation, and testing respectively.
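
The two steps above can be sketched as follows. This is a minimal illustration, not the code used to produce the numbers on this page: the function names, the random shuffling, and the 70/15/15 split fractions are assumptions (the fractions are only a rough reading of the tables further down), and the segment sizes are the approximate values quoted in step 1.

```python
import random

def chunk_cruise(measurements, target_size):
    """Split one cruise into consecutive segments of roughly target_size items.

    target_size would be about 5,000 for single-beam data and about
    100,000 for multi-beam data, per the description above.
    """
    n = len(measurements)
    if n == 0:
        return []
    # Pick the number of segments so that each one is close to target_size.
    num_segments = max(1, round(n / target_size))
    base, extra = divmod(n, num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # Spread the remainder over the first `extra` segments.
        size = base + (1 if i < extra else 0)
        segments.append(measurements[start:start + size])
        start += size
    return segments

def split_segments(segments, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle segments and split them into train, validate, and test sets.

    The fractions are an assumption, not a value stated on this page.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validate, test
```

For example, `chunk_cruise(list(range(12000)), 5000)` yields two segments of 6,000 measurements each, and passing a list of such segments to `split_segments` returns three disjoint lists that together contain every segment exactly once.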

Data location

  • Raw data: /cryosat3/btozer/CREATE_ML_FEATURES/tsv_all
  • Chunked data: /cryosat3/jalafate/bathymetry/data/chulks (sic: the directory name contains a typo)

Edit rate

After splitting into chunks

| Institution | Overall edit rate | Edit rate in train set | Edit rate in validate set | Edit rate in test set |
|---|---|---|---|---|
| AGSO | 0.72 | 0.64 | 0.35 | 1.35 |
| JAMSTEC | 0.57 | 0.66 | 0.42 | 0.23 |
| JAMSTEC2 | 6.45 | 5.74 | 6.31 | 12.57 |
| NGA | 21.29 | 21.23 | 24.13 | 18.98 |
| NGA2 | 0.12 | 0.11 | 0.12 | 0.13 |
| NGDC | 4.20 | 3.85 | 4.20 | 5.77 |
| NOAA_geodas | 10.39 | 10.64 | 9.84 | 9.82 |
| SIO | 13.11 | 12.72 | 15.00 | 13.00 |
| US_multi | 5.06 | 4.66 | 4.60 | 7.17 |

Original

| Institution | Overall edit rate | Edit rate in train set | Edit rate in validate set | Edit rate in test set |
|---|---|---|---|---|
| AGSO | 0.77 | 0.36 | 2.08 | 0.12 |
| JAMSTEC | 0.55 | 0.44 | 1.29 | 0.40 |
| JAMSTEC2 | 7.28 | 7.80 | 10.27 | 1.67 |
| NGA | 21.53 | 22.74 | 23.16 | 10.69 |
| NGA2 | 0.12 | 0.13 | 0.02 | 0.59 |
| NGDC | 4.19 | 4.46 | 3.93 | 3.30 |
| NOAA_geodas | 10.29 | 10.29 | 8.10 | 12.67 |
| SIO | 13.71 | 15.43 | 13.84 | 3.41 |
| US_multi | 5.30 | 5.61 | 6.43 | 2.63 |
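
This page does not define the edit rate. The sketch below assumes it is the percentage of measurements in a set that are flagged as edited; that reading, the function name `edit_rate`, and the boolean flag representation are all assumptions made for illustration.

```python
def edit_rate(flags):
    """Return the percentage of flagged measurements.

    `flags` is a sequence of booleans, one per measurement, where True
    means the measurement was flagged as edited (an assumed encoding).
    """
    if not flags:
        return 0.0
    flagged = sum(1 for f in flags if f)
    return 100.0 * flagged / len(flags)

# Hypothetical usage: compare the rate across the three splits.
splits = {
    "train": [False, False, True, False],
    "validate": [False, False, False, False],
    "test": [True, True, False, False],
}
rates = {name: edit_rate(flags) for name, flags in splits.items()}
```

Computing the rate separately per split, as in `rates`, is what makes it possible to compare the train, validate, and test columns in the tables above.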

Number of cruises from each institution and perplexity (in parentheses)

After splitting into chunks

| Institution | Overall | Train set | Validate set | Test set |
|---|---|---|---|---|
| AGSO | 209 (201.32) | 144 (138.94) | 31 (29.16) | 34 (33.22) |
| JAMSTEC | 1088 (888.69) | 773 (641.08) | 159 (124.48) | 156 (123.16) |
| JAMSTEC2 | 168 (84.81) | 128 (63.52) | 25 (12.50) | 15 (8.88) |
| NGA | 2034 (1200.50) | 1408 (825.80) | 302 (180.07) | 324 (194.68) |
| NGA2 | 1921 (1918.99) | 1347 (1345.58) | 302 (302.00) | 272 (271.41) |
| NGDC | 3974 (1785.70) | 2754 (1240.03) | 611 (271.08) | 609 (274.65) |
| NOAA_geodas | 8073 (7527.66) | 5593 (5208.89) | 1267 (1185.24) | 1213 (1133.56) |
| SIO | 6037 (3901.72) | 4194 (2716.40) | 933 (608.04) | 910 (577.43) |
| US_multi | 772 (565.04) | 534 (388.26) | 116 (88.05) | 122 (88.77) |

Original

| Institution | Overall | Train set | Validate set | Test set |
|---|---|---|---|---|
| AGSO | 48 (25.31) | 35 (22.08) | 7 (2.78) | 6 (2.85) |
| JAMSTEC | 538 (165.75) | 379 (121.12) | 80 (27.72) | 79 (18.43) |
| JAMSTEC2 | 150 (42.70) | 105 (29.93) | 23 (8.47) | 22 (4.98) |
| NGA | 1368 (54.42) | 960 (31.48) | 204 (14.35) | 204 (15.97) |
| NGA2 | 24 (11.94) | 17 (8.90) | 4 (2.16) | 3 (1.94) |
| NGDC | 1040 (445.19) | 731 (328.96) | 155 (58.43) | 154 (60.14) |
| NOAA_geodas | 3672 (1681.77) | 2572 (1164.31) | 551 (256.48) | 549 (264.38) |
| SIO | 243 (91.31) | 172 (70.02) | 35 (16.36) | 36 (6.87) |
| US_multi | 615 (296.38) | 432 (209.54) | 92 (45.92) | 91 (41.23) |
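
This page does not say how the perplexity is computed. One common definition, assumed here, is the exponential of the Shannon entropy of the distribution of measurements over cruises (or segments). Under that definition the perplexity equals the number of cruises when all cruises are the same size, and drops toward 1 when a few cruises hold most of the measurements. The function name and input format are hypothetical.

```python
import math

def perplexity(counts):
    """Perplexity of the distribution given by per-cruise measurement counts.

    Computed as exp(H), where H is the Shannon entropy (natural log) of
    p_i = counts[i] / sum(counts). This is an assumed definition.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return math.exp(entropy)
```

For instance, four cruises with equal counts give a perplexity of 4, while a single cruise holding all measurements gives 1. Read this way, the value in parentheses can be seen as an effective number of cruises.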

Number of cruises

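| Institution | Total num. of cruises | Num. of cruises in train set | Num. of cruises in validate set | Num. of cruises in test set |
|---|---|---|---|---|
| AGSO | 48 | 35 | 7 | 6 |
| JAMSTEC | 538 | 379 | 80 | 79 |
| JAMSTEC2 | 150 | 105 | 23 | 22 |
| NGA | 1368 | 960 | 204 | 204 |
| NGA2 | 24 | 17 | 4 | 3 |
| NGDC | 1040 | 731 | 155 | 154 |
| NOAA_geodas | 3672 | 2572 | 551 | 549 |
| SIO | 243 | 172 | 35 | 36 |
| US_multi | 615 | 432 | 92 | 91 |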

Number of measurements (in millions)

After splitting into chunks

| Institution | Total num. of measurements | Num. of measurements in train set | Num. of measurements in validate set | Num. of measurements in test set |
|---|---|---|---|---|
| AGSO | 19.63 | 13.53 | 2.84 | 3.26 |
| JAMSTEC | 81.37 | 59.04 | 11.23 | 11.10 |
| JAMSTEC2 | 5.44 | 4.14 | 0.81 | 0.49 |
| NGA | 4.55 | 3.13 | 0.67 | 0.75 |
| NGA2 | 9.59 | 6.73 | 1.51 | 1.35 |
| NGDC | 110.27 | 76.19 | 17.02 | 17.06 |
| NOAA_geodas | 35.28 | 24.38 | 5.56 | 5.34 |
| SIO | 39.59 | 27.52 | 6.10 | 5.97 |
| US_multi | 45.30 | 31.02 | 6.98 | 7.30 |

Original

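| Institution | Total num. of measurements | Num. of measurements in train set | Num. of measurements in validate set | Num. of measurements in test set |
|---|---|---|---|---|
| AGSO | 20.4 | 13.8 | 5.1 | 1.6 |
| JAMSTEC | 89.3 | 64.9 | 11.6 | 12.9 |
| JAMSTEC2 | 6.1 | 4.5 | 0.8 | 0.8 |
| NGA | 4.7 | 3.5 | 0.7 | 0.5 |
| NGA2 | 9.6 | 7.1 | 2.3 | 0.3 |
| NGDC | 124.9 | 86.2 | 18.3 | 20.4 |
| NOAA_geodas | 39.2 | 28.0 | 5.8 | 5.3 |
| SIO | 40.9 | 30.4 | 5.4 | 5.1 |
| US_multi | 52.5 | 36.0 | 8.7 | 7.9 |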

Data types

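| Institution | Unknown | Multi-beam | Grid | Single-beam | Point measurement |
|---|---|---|---|---|---|
| AGSO | 21.0 M | 0 | 0 | 0 | 0 |
| SIO | 0 | 11.6 M | 29.6 M | 0 | 0 |
| NGDC | 0 | 112.8 M | 0 | 12.6 M | 0 |
| US_multi | 0 | 52.6 M | 0 | 0 | 0 |
| JAMSTEC | 0 | 90.1 M | 0 | 0 | 0 |
| JAMSTEC2 | 0 | 6.1 M | 0 | 0 | 0 |
| NOAA_geodas | 0 | 0 | 0 | 39.2 M | 0 |
| NGA | 0 | 0 | 0 | 4.7 M | 0 |
| NGA2 | 0 | 0 | 0 | 9.6 M | 0 |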

Institutions not used

We are not using the cruises provided by these agencies:

  • 3DGBR
  • DNC
  • IFREMER
  • GEBCO
  • lakes
  • IBCAO
  • NAVO
  • NOAA
  • CCOM
  • GEOMAR

Source

This wiki page is generated using the notebook data-process/Count-Lines.ipynb.