dataset card
lhoestq committed Nov 10, 2021
1 parent 804fad1 commit 2efb9bd
Showing 1 changed file with 11 additions and 3 deletions.
14 changes: 11 additions & 3 deletions datasets/id_newspapers_2018/README.md
@@ -85,7 +85,15 @@ Indonesian
```
### Data Instances

-[More Information Needed]
+An instance from the dataset is
+
+```
+{'id': '0',
+'url': 'https://www.cnnindonesia.com/olahraga/20161221234219-156-181385/lorenzo-ingin-samai-rekor-rossi-dan-stoner',
+'date': '2016-12-22 07:00:00',
+'title': 'Lorenzo Ingin Samai Rekor Rossi dan Stoner',
+'content': 'Jakarta, CNN Indonesia -- Setelah bergabung dengan Ducati, Jorge Lorenzo berharap bisa masuk dalam jajaran pebalap yang mampu jadi juara dunia kelas utama dengan dua pabrikan berbeda. Pujian Max Biaggi untuk Valentino Rossi Jorge Lorenzo Hadir dalam Ucapan Selamat Natal Yamaha Iannone: Saya Sering Jatuh Karena Ingin yang Terbaik Sepanjang sejarah, hanya ada lima pebalap yang mampu jadi juara kelas utama (500cc/MotoGP) dengan dua pabrikan berbeda, yaitu Geoff Duke, Giacomo Agostini, Eddie Lawson, Valentino Rossi, dan Casey Stoner. Lorenzo ingin bergabung dalam jajaran legenda tersebut. “Fakta ini sangat penting bagi saya karena hanya ada lima pebalap yang mampu menang dengan dua pabrikan berbeda dalam sejarah balap motor.” “Kedatangan saya ke Ducati juga menghadirkan tantangan yang sangat menarik karena hampir tak ada yang bisa menang dengan Ducati sebelumnya, kecuali Casey Stoner. Hal itu jadi motivasi yang sangat bagus bagi saya,” tutur Lorenzo seperti dikutip dari Crash Lorenzo saat ini diliputi rasa penasaran yang besar untuk menunggang sepeda motor Desmosedici yang dipakai tim Ducati karena ia baru sekali menjajal motor tersebut pada sesi tes di Valencia, usai MotoGP musim 2016 berakhir. “Saya sangat tertarik dengan Ducati arena saya hanya memiliki kesempatan mencoba motor itu di Valencia dua hari setelah musim berakhir. Setelah itu saya tak boleh lagi menjajalnya hingga akhir Januari mendatang. Jadi saya menjalani penantian selama dua bulan yang panjang,” kata pebalap asal Spanyol ini. Dengan kondisi tersebut, maka Lorenzo memanfaatkan waktu yang ada untuk liburan dan melepaskan penat. “Setidaknya apa yang terjadi pada saya saat ini sangat bagus karena saya jadi memiliki waktu bebas dan sedikit liburan.” “Namun tentunya saya tak akan larut dalam liburan karena saya harus lebih bersiap, terutama dalam kondisi fisik dibandingkan sebelumnya, karena saya akan menunggangi motor yang sulit dikendarai,” ucap Lorenzo. Selama sembilan musim bersama Yamaha, Lorenzo sendiri sudah tiga kali jadi juara dunia, yaitu pada 2010, 2012, dan 2015. (kid)'}
+```

### Data Fields
- `id`: id of the sample
@@ -96,7 +104,7 @@ Indonesian

### Data Splits

-The dataset contains train set.
+The dataset contains train set of 499164 samples.

## Dataset Creation

@@ -153,7 +161,7 @@ The dataset contains train set.

### Citation Information

-[More Information Needed]
+[N/A]

### Contributions

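The instance added under Data Instances stores both `id` and `date` as strings. A minimal sketch of converting them to typed values, assuming every record uses the `YYYY-MM-DD HH:MM:SS` layout seen in that one example (the card does not document the format, and `parse_record` is a hypothetical helper, not part of the dataset):

```python
from datetime import datetime

# Field values copied from the instance shown in the dataset card.
sample = {
    "id": "0",
    "date": "2016-12-22 07:00:00",
    "title": "Lorenzo Ingin Samai Rekor Rossi dan Stoner",
}

# Assumed format, inferred from the single example above.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_record(record: dict) -> dict:
    """Return a copy of the record with `id` as int and `date` as datetime."""
    parsed = dict(record)
    parsed["id"] = int(record["id"])
    parsed["date"] = datetime.strptime(record["date"], DATE_FORMAT)
    return parsed


parsed = parse_record(sample)
print(parsed["id"], parsed["date"].isoformat())
```

`datetime.strptime` raises `ValueError` on a string that does not match the format, so a record with a different date layout would surface immediately rather than being parsed silently.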

1 comment on commit 2efb9bd

@github-actions
Show benchmarks

PyArrow==3.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.083882 / 0.011353 (0.072529) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004486 / 0.011008 (-0.006522) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037292 / 0.038508 (-0.001216) |
| read_batch_unformated after write_array2d | 0.041451 / 0.023109 (0.018342) |
| read_batch_unformated after write_flattened_sequence | 0.359360 / 0.275898 (0.083462) |
| read_batch_unformated after write_nested_sequence | 0.388361 / 0.323480 (0.064881) |
| read_col_formatted_as_numpy after write_array2d | 0.095019 / 0.007986 (0.087033) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004715 / 0.004328 (0.000387) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.010492 / 0.004250 (0.006242) |
| read_col_unformated after write_array2d | 0.047207 / 0.037052 (0.010154) |
| read_col_unformated after write_flattened_sequence | 0.353298 / 0.258489 (0.094809) |
| read_col_unformated after write_nested_sequence | 0.389953 / 0.293841 (0.096112) |
| read_formatted_as_numpy after write_array2d | 0.101002 / 0.128546 (-0.027545) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010307 / 0.075646 (-0.065340) |
| read_formatted_as_numpy after write_nested_sequence | 0.301742 / 0.419271 (-0.117529) |
| read_unformated after write_array2d | 0.053720 / 0.043533 (0.010187) |
| read_unformated after write_flattened_sequence | 0.361617 / 0.255139 (0.106478) |
| read_unformated after write_nested_sequence | 0.384933 / 0.283200 (0.101734) |
| write_array2d | 0.093146 / 0.141683 (-0.048537) |
| write_flattened_sequence | 1.988605 / 1.452155 (0.536450) |
| write_nested_sequence | 2.091876 / 1.492716 (0.599160) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.269741 / 0.018006 (0.251734) |
| get_batch_of_1024_rows | 0.479156 / 0.000490 (0.478667) |
| get_first_row | 0.003111 / 0.000200 (0.002911) |
| get_last_row | 0.000085 / 0.000054 (0.000030) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042223 / 0.037411 (0.004812) |
| shard | 0.026390 / 0.014526 (0.011864) |
| shuffle | 0.029541 / 0.176557 (-0.147016) |
| sort | 0.231537 / 0.737135 (-0.505599) |
| train_test_split | 0.034763 / 0.296338 (-0.261575) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.494580 / 0.215209 (0.279371) |
| read 50000 | 4.945767 / 2.077655 (2.868112) |
| read_batch 50000 10 | 2.117809 / 1.504120 (0.613689) |
| read_batch 50000 100 | 1.861950 / 1.541195 (0.320756) |
| read_batch 50000 1000 | 1.928338 / 1.468490 (0.459848) |
| read_formatted numpy 5000 | 0.494557 / 4.584777 (-4.090220) |
| read_formatted pandas 5000 | 5.881541 / 3.745712 (2.135829) |
| read_formatted tensorflow 5000 | 2.555998 / 5.269862 (-2.713864) |
| read_formatted torch 5000 | 1.040581 / 4.565676 (-3.525095) |
| read_formatted_batch numpy 5000 10 | 0.059043 / 0.424275 (-0.365232) |
| read_formatted_batch numpy 5000 1000 | 0.013206 / 0.007607 (0.005599) |
| shuffled read 5000 | 0.619625 / 0.226044 (0.393580) |
| shuffled read 50000 | 6.209380 / 2.268929 (3.940452) |
| shuffled read_batch 50000 10 | 2.675444 / 55.444624 (-52.769180) |
| shuffled read_batch 50000 100 | 2.214821 / 6.876477 (-4.661656) |
| shuffled read_batch 50000 1000 | 2.339120 / 2.142072 (0.197047) |
| shuffled read_formatted numpy 5000 | 0.627031 / 4.805227 (-4.178196) |
| shuffled read_formatted_batch numpy 5000 10 | 0.136332 / 6.500664 (-6.364332) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066199 / 0.075469 (-0.009270) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.839878 / 1.841788 (-0.001910) |
| map fast-tokenizer batched | 14.204868 / 8.074308 (6.130560) |
| map identity | 31.383696 / 10.191392 (21.192304) |
| map identity batched | 0.888334 / 0.680424 (0.207911) |
| map no-op batched | 0.615957 / 0.534201 (0.081756) |
| map no-op batched numpy | 0.440629 / 0.579283 (-0.138654) |
| map no-op batched pandas | 0.632877 / 0.434364 (0.198514) |
| map no-op batched pytorch | 0.297980 / 0.540337 (-0.242358) |
| map no-op batched tensorflow | 0.310410 / 1.386936 (-1.076526) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.081806 / 0.011353 (0.070454) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004428 / 0.011008 (-0.006580) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037183 / 0.038508 (-0.001325) |
| read_batch_unformated after write_array2d | 0.039026 / 0.023109 (0.015917) |
| read_batch_unformated after write_flattened_sequence | 0.352839 / 0.275898 (0.076941) |
| read_batch_unformated after write_nested_sequence | 0.385502 / 0.323480 (0.062022) |
| read_col_formatted_as_numpy after write_array2d | 0.098957 / 0.007986 (0.090972) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005378 / 0.004328 (0.001049) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008700 / 0.004250 (0.004450) |
| read_col_unformated after write_array2d | 0.042726 / 0.037052 (0.005674) |
| read_col_unformated after write_flattened_sequence | 0.346499 / 0.258489 (0.088010) |
| read_col_unformated after write_nested_sequence | 0.400307 / 0.293841 (0.106466) |
| read_formatted_as_numpy after write_array2d | 0.099993 / 0.128546 (-0.028553) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010292 / 0.075646 (-0.065354) |
| read_formatted_as_numpy after write_nested_sequence | 0.322380 / 0.419271 (-0.096892) |
| read_unformated after write_array2d | 0.053540 / 0.043533 (0.010007) |
| read_unformated after write_flattened_sequence | 0.354130 / 0.255139 (0.098991) |
| read_unformated after write_nested_sequence | 0.380151 / 0.283200 (0.096951) |
| write_array2d | 0.092665 / 0.141683 (-0.049018) |
| write_flattened_sequence | 2.059725 / 1.452155 (0.607571) |
| write_nested_sequence | 2.083787 / 1.492716 (0.591071) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.255258 / 0.018006 (0.237252) |
| get_batch_of_1024_rows | 0.480278 / 0.000490 (0.479788) |
| get_first_row | 0.002016 / 0.000200 (0.001816) |
| get_last_row | 0.000095 / 0.000054 (0.000040) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.039036 / 0.037411 (0.001625) |
| shard | 0.024885 / 0.014526 (0.010359) |
| shuffle | 0.030048 / 0.176557 (-0.146508) |
| sort | 0.228400 / 0.737135 (-0.508736) |
| train_test_split | 0.031342 / 0.296338 (-0.264997) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.504909 / 0.215209 (0.289700) |
| read 50000 | 5.069833 / 2.077655 (2.992178) |
| read_batch 50000 10 | 2.210994 / 1.504120 (0.706874) |
| read_batch 50000 100 | 1.926114 / 1.541195 (0.384919) |
| read_batch 50000 1000 | 2.012512 / 1.468490 (0.544022) |
| read_formatted numpy 5000 | 0.493427 / 4.584777 (-4.091350) |
| read_formatted pandas 5000 | 5.796446 / 3.745712 (2.050734) |
| read_formatted tensorflow 5000 | 3.922640 / 5.269862 (-1.347221) |
| read_formatted torch 5000 | 1.053075 / 4.565676 (-3.512602) |
| read_formatted_batch numpy 5000 10 | 0.059072 / 0.424275 (-0.365203) |
| read_formatted_batch numpy 5000 1000 | 0.013115 / 0.007607 (0.005508) |
| shuffled read 5000 | 0.637779 / 0.226044 (0.411734) |
| shuffled read 50000 | 6.362307 / 2.268929 (4.093378) |
| shuffled read_batch 50000 10 | 2.784345 / 55.444624 (-52.660279) |
| shuffled read_batch 50000 100 | 2.331512 / 6.876477 (-4.544964) |
| shuffled read_batch 50000 1000 | 2.454581 / 2.142072 (0.312508) |
| shuffled read_formatted numpy 5000 | 0.635041 / 4.805227 (-4.170186) |
| shuffled read_formatted_batch numpy 5000 10 | 0.135902 / 6.500664 (-6.364762) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066527 / 0.075469 (-0.008942) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.857131 / 1.841788 (0.015343) |
| map fast-tokenizer batched | 13.951172 / 8.074308 (5.876864) |
| map identity | 32.201135 / 10.191392 (22.009743) |
| map identity batched | 0.893061 / 0.680424 (0.212637) |
| map no-op batched | 0.623401 / 0.534201 (0.089200) |
| map no-op batched numpy | 0.443569 / 0.579283 (-0.135714) |
| map no-op batched pandas | 0.607035 / 0.434364 (0.172671) |
| map no-op batched pytorch | 0.302768 / 0.540337 (-0.237569) |
| map no-op batched tensorflow | 0.315261 / 1.386936 (-1.071675) |
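
Each benchmark entry above is reported as `new / old (diff)`, where the figures are consistent with `diff = new - old`. A minimal sketch of parsing one entry and checking that relationship, assuming the text layout shown in the comment (`parse_entry` is a hypothetical helper, and the comment does not state the units):

```python
import re
from typing import Tuple

# Matches one "new / old (diff)" entry as printed in the benchmark comment.
ENTRY_PATTERN = re.compile(r"([\d.]+) / ([\d.]+) \((-?[\d.]+)\)")


def parse_entry(text: str) -> Tuple[float, float, float]:
    """Split a 'new / old (diff)' string into three floats."""
    match = ENTRY_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark entry: {text!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


# get_first_row under PyArrow==3.0.0, copied from the comment.
new, old, reported_diff = parse_entry("0.003111 / 0.000200 (0.002911)")

# Allow a small tolerance: the printed values are rounded to six decimals.
assert abs((new - old) - reported_diff) < 2e-6
ratio = new / old
print(f"diff={new - old:.6f} ratio={ratio:.1f}x")
```

The tolerance matters because some printed differences are off by one in the last digit relative to the rounded `new` and `old` values, for example `0.269741 / 0.018006 (0.251734)`.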
