[Data] Add a dict to dataset with add_column and update it with map, but get wrong result. #42190

Closed
zhijianma opened this issue Jan 5, 2024 · 2 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

zhijianma commented Jan 5, 2024

What happened + What you expected to happen

  1. I use add_column to add an empty dict column, meta, to a dataset and then use map to update the meta dict, but the resulting dataset is wrong.
def test1():
    ds = ray.data.range(40)
    ds = ds.add_column('meta',lambda df: [{}] * len(df))
    def fn(sample):
        sample['meta']['id'] = sample['id']
        print(sample)
        return sample
    ds = ds.map(fn)
  2. Expected behaviour
sample[0]['meta']['id'] == 0
sample[1]['meta']['id'] == 1
...
sample[38]['meta']['id'] == 38
sample[39]['meta']['id'] == 39

But I get:

sample[0]['meta']['id'] == 1
sample[1]['meta']['id'] == 1
...
sample[38]['meta']['id'] == 39
sample[39]['meta']['id'] == 39
  3. Log

RAY_DEDUP_LOGS=0 python test.py

2024-01-05 09:43:24,850 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2024-01-05 09:43:24,868 INFO worker.py:1642 -- Connected to Ray cluster.
2024-01-05 09:43:26,440 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->MapBatches(process_batch)->Map(fn)]
2024-01-05 09:43:26,441 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-05 09:43:26,441 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 0, 'meta': {'id': 0}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 1, 'meta': {'id': 1}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 20, 'meta': {'id': 20}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 21, 'meta': {'id': 21}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 22, 'meta': {'id': 22}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 23, 'meta': {'id': 23}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 24, 'meta': {'id': 24}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 25, 'meta': {'id': 25}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 26, 'meta': {'id': 26}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 27, 'meta': {'id': 27}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 28, 'meta': {'id': 28}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 29, 'meta': {'id': 29}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 30, 'meta': {'id': 30}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 31, 'meta': {'id': 31}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 32, 'meta': {'id': 32}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 33, 'meta': {'id': 33}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 34, 'meta': {'id': 34}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 35, 'meta': {'id': 35}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 36, 'meta': {'id': 36}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 37, 'meta': {'id': 37}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 38, 'meta': {'id': 38}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73956) {'id': 39, 'meta': {'id': 39}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73955) {'id': 2, 'meta': {'id': 2}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73955) {'id': 3, 'meta': {'id': 3}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73963) {'id': 12, 'meta': {'id': 12}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73963) {'id': 13, 'meta': {'id': 13}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73966) {'id': 18, 'meta': {'id': 18}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73966) {'id': 19, 'meta': {'id': 19}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73962) {'id': 10, 'meta': {'id': 10}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73962) {'id': 11, 'meta': {'id': 11}}                                                                  
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73959) {'id': 4, 'meta': {'id': 4}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73959) {'id': 5, 'meta': {'id': 5}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73961) {'id': 8, 'meta': {'id': 8}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73961) {'id': 9, 'meta': {'id': 9}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73960) {'id': 6, 'meta': {'id': 6}}                                                                    
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73960) {'id': 7, 'meta': {'id': 7}}                                                                    
test1 ds =  [{'id': 0, 'meta': {'id': 1}}, {'id': 1, 'meta': {'id': 1}}, {'id': 20, 'meta': {'id': 21}}, {'id': 21, 'meta': {'id': 21}}, {'id': 22, 'meta': {'id': 23}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 25}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 27}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 29}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 31}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 33}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 35}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 37}}, {'id': 37, 'meta': {'id': 37}}, {'id': 2, 'meta': {'id': 3}}, {'id': 3, 'meta': {'id': 3}}, {'id': 38, 'meta': {'id': 39}}, {'id': 39, 'meta': {'id': 39}}, {'id': 10, 'meta': {'id': 11}}, {'id': 11, 'meta': {'id': 11}}, {'id': 12, 'meta': {'id': 13}}, {'id': 13, 'meta': {'id': 13}}, {'id': 18, 'meta': {'id': 19}}, {'id': 19, 'meta': {'id': 19}}, {'id': 4, 'meta': {'id': 5}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 7}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 9}}, {'id': 9, 'meta': {'id': 9}}, {'id': 14, 'meta': {'id': 15}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 17}}, {'id': 17, 'meta': {'id': 17}}]
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73965) {'id': 16, 'meta': {'id': 16}}
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73965) {'id': 17, 'meta': {'id': 17}}
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73964) {'id': 14, 'meta': {'id': 14}}
(ReadRange->MapBatches(process_batch)->Map(fn) pid=73964) {'id': 15, 'meta': {'id': 15}}
2024-01-05 09:43:29,506 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->Map(fn)]
2024-01-05 09:43:29,506 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-05 09:43:29,506 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
test2 ds =  [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}, {'id': 2, 'meta': {'id': 2}}, {'id': 3, 'meta': {'id': 3}}, {'id': 4, 'meta': {'id': 4}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 6}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 8}}, {'id': 9, 'meta': {'id': 9}}, {'id': 12, 'meta': {'id': 12}}, {'id': 13, 'meta': {'id': 13}}, {'id': 14, 'meta': {'id': 14}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 16}}, {'id': 17, 'meta': {'id': 17}}, {'id': 18, 'meta': {'id': 18}}, {'id': 19, 'meta': {'id': 19}}, {'id': 20, 'meta': {'id': 20}}, {'id': 21, 'meta': {'id': 21}}, {'id': 22, 'meta': {'id': 22}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 24}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 26}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 28}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 30}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 32}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 34}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 36}}, {'id': 37, 'meta': {'id': 37}}, {'id': 38, 'meta': {'id': 38}}, {'id': 39, 'meta': {'id': 39}}, {'id': 10, 'meta': {'id': 10}}, {'id': 11, 'meta': {'id': 11}}]

Versions / Dependencies

Package Version Editable project location


absl-py 1.3.0
accelerate 0.20.3
addict 2.4.0
aie-ipyleaflet 0.15.1
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
alabaster 0.7.12
albumentations 1.3.0
alembic 1.11.1
aliyun-python-sdk-core 2.13.36
aliyun-python-sdk-kms 2.16.0
altair 4.2.2
anaconda-client 1.11.0
anaconda-navigator 2.3.1
anaconda-project 0.11.1
anyio 3.6.2
appdirs 1.4.4
applaunchservices 0.3.0
appnope 0.1.2
appscript 1.1.2
APScheduler 3.10.1
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
array-record 0.4.0
arrow 1.2.2
astor 0.8.1
astroid 2.11.7
astropy 5.1
asttokens 2.4.1
astunparse 1.6.3
async-timeout 4.0.2
atomicwrites 1.4.0
attrs 23.1.0
audioread 3.0.0
Automat 20.2.0
autopage 0.5.1
autopep8 1.6.0
av 10.0.0
Babel 2.9.1
backcall 0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile 1.0
backports.weakref 1.0.post1
base58 2.1.1
bcrypt 3.2.0
beautifulsoup4 4.11.1
binaryornot 0.4.4
bitarray 2.5.1
bitsandbytes 0.38.0
bkcharts 0.2
black 22.3.0
bleach 4.1.0
blinker 1.6.2
blis 0.7.9
bokeh 2.4.3
boltons 23.0.0
boto3 1.16.49
botocore 1.19.63
Bottleneck 1.3.5
Brotli 1.0.9
brotlipy 0.7.0
cachetools 5.2.0
catalogue 2.0.8
certifi 2022.12.7
cffi 1.15.1
cfgv 3.3.1
chardet 4.0.0
charset-normalizer 3.1.0
chex 0.1.7
click 8.1.3
cliff 4.3.0
cloudpickle 2.0.0
clu 0.0.9
clyent 1.2.2
cmaes 0.9.1
cmd2 2.4.3
codecarbon 2.2.3
colorama 0.4.5
colorcet 3.0.0
coloredlogs 15.0.1
colorlog 6.7.0
commonmark 0.9.1
conda 23.3.1
conda-build 3.22.0
conda-content-trust 0.1.3
conda-pack 0.6.0
conda-package-handling 1.9.0
conda-repo-cli 1.0.20
conda-token 0.4.0
conda-verify 3.4.2
confection 0.0.3
constantly 15.1.0
contextlib2 21.6.0
contourpy 1.0.7
cookiecutter 1.7.3
courlan 0.9.3
crcmod 1.7
cryptography 38.0.1
cssselect 1.1.0
cycler 0.11.0
cymem 2.0.7
Cython 0.29.32
cytoolz 0.11.0
daal4py 2021.6.0
dask 2022.7.0
data-juicer 0.1.0
dataclasses 0.6
datasets 2.11.0
datashader 0.14.1
datashape 0.5.4
datasketch 1.5.9
dateparser 1.1.8
debugpy 1.5.1
decorator 4.4.2
defusedxml 0.7.1
descartes 1.1.0
diff-match-patch 20200713
diffusers 0.16.1
dill 0.3.4
distlib 0.3.6
distributed 2022.7.0
dlib 19.24.2
dm-tree 0.1.8
docopt 0.6.2
docstring-parser 0.15
docutils 0.18.1
easydict 1.10
editdistance 0.6.2
einops 0.6.1
embeddings 0.0.8
emoji 2.2.0
en-core-web-md 3.5.0
entrypoints 0.4
et-xmlfile 1.1.0
etils 1.3.0
evaluate 0.3.0
exceptiongroup 1.1.2
executing 2.0.1
fairscale 0.4.12
Faker 18.9.0
fastapi 0.95.1
fastcore 1.5.27
fastdownload 0.0.7
fastjsonschema 2.16.2
fastprogress 1.0.3
fasttext 0.9.2
ffmpeg 1.4
ffmpeg-python 0.2.0
ffmpy 0.3.0
filelock 3.11.0
fire 0.4.0
flake8 4.0.1
Flask 1.1.2
flatbuffers 2.0.7
flax 0.6.11
fonttools 4.39.3
frozendict 2.3.8
frozenlist 1.3.3
fsspec 2023.3.0
ftfy 6.1.1
future 0.18.2
fuzzywuzzy 0.18.0
gast 0.4.0
gdown 4.7.1
gensim 4.1.2
gin-config 0.5.0
gitdb 4.0.10
GitPython 3.1.31
glob2 0.7
gmpy2 2.1.2
google-auth 2.21.0
google-auth-oauthlib 1.0.0
google-pasta 0.2.0
googleapis-common-protos 1.59.1
gradio 3.35.2
gradio_client 0.2.7
graphviz 0.20.1
greenlet 1.1.1
grpcio 1.50.0
h11 0.14.0
h5py 3.7.0
harvesttext 0.8.1.8
HeapDict 1.0.1
hjson 3.1.0
holoviews 1.15.0
htmldate 1.4.3
httpcore 0.17.0
httpx 0.24.0
huggingface-hub 0.15.1
humanfriendly 10.0
hvplot 0.8.0
hyperlink 21.0.0
hypothesis 6.80.0
identify 2.5.5
idna 3.4
imagecodecs 2021.8.26
imagededup 0.3.2
imageio 2.9.0
imageio-ffmpeg 0.4.7
imagesize 1.4.1
imgaug 0.4.0
immutabledict 2.2.4
importlib 1.0.4
importlib-metadata 4.11.3
importlib-resources 5.12.0
incremental 21.3.0
inflate64 0.3.1
inflection 0.5.1
iniconfig 1.1.1
intake 0.6.5
internetarchive 3.5.0
intervaltree 3.1.0
ipadic 1.0.0
ipykernel 6.15.2
ipython 8.18.1
ipython-genutils 0.2.0
ipywidgets 7.6.5
isodate 0.6.1
isort 4.3.21
itemadapter 0.3.0
itemloaders 1.0.4
itsdangerous 2.0.1
jax 0.3.25
jaxlib 0.3.25
jdcal 1.4.1
jedi 0.18.1
jellyfish 0.9.0
jieba 0.42.1
Jinja2 3.1.2
jinja2-time 0.2.0
jiwer 2.2.0
jmespath 0.10.0
joblib 1.2.0
json-tricks 3.16.1
json5 0.9.6
jsonargparse 4.21.1
jsonlines 3.1.0
jsonpatch 1.32
jsonplus 0.8.0
jsonpointer 2.1
jsonschema 4.17.3
jupyter 1.0.0
jupyter_client 7.3.4
jupyter-console 6.4.3
jupyter_core 4.11.1
jupyter-server 1.18.1
jupyterlab 3.4.4
jupyterlab-pygments 0.1.2
jupyterlab-server 2.10.3
jupyterlab-widgets 1.0.0
just-testsimhash-pybind 0.0.1
jusText 3.0.0
kaleido 0.2.1
kenlm 0.0.0
keras 2.12.0
keyring 23.4.0
kiwisolver 1.4.4
kornia 0.6.8
langcodes 3.3.0
langid 1.1.6
lazy-object-proxy 1.6.0
Levenshtein 0.21.1
libarchive-c 2.9
libclang 16.0.0
librosa 0.8.0
linkify-it-py 2.0.0
livereload 2.6.3
llvmlite 0.39.1
lmdb 1.3.0
locket 1.0.0
loguru 0.5.3
lpips 0.1.4
ltp 4.2.13
ltp-core 0.1.4
ltp-extension 0.1.10
lxml 4.9.2
lz4 3.1.3
Mako 1.2.4
Markdown 3.3.4
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mccabe 0.6.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
megatron-util 1.3.2
mesh-tensorflow 0.1.21
mistune 0.8.4
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
ml-collections 0.1.1
ml-datasets 0.2.0
ml-dtypes 0.2.0
mmcls 0.24.1
mmdet 2.25.3
mock 2.0.0
modelscope 1.9.5
moviepy 1.0.3
mpmath 1.2.1
msgpack 1.0.3
multidict 6.0.4
multipledispatch 0.6.0
multiprocess 0.70.12
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.9
mypy 1.0.1
mypy-extensions 0.4.3
navigator-updater 0.3.0
nbclassic 0.3.5
nbclient 0.5.13
nbconvert 6.4.4
nbformat 5.5.0
nest-asyncio 1.5.5
networkx 2.8.4
nh3 0.2.15
ninja 1.11.1
nlpaug 1.1.11
nltk 3.5
nodeenv 1.7.0
nose 1.3.7
notebook 6.4.12
numba 0.56.4
numexpr 2.8.3
numpy 1.23.5
numpydoc 1.4.0
nuscenes-devkit 1.1.9
oauthlib 3.2.2
olefile 0.46
onnxruntime 1.13.1
OpenCC 1.1.6
opencc-python-reimplemented 0.1.7
opencv-python 4.6.0.66
opencv-python-headless 4.6.0.66
openpyxl 3.0.10
opt-einsum 3.3.0
optax 0.1.5
optuna 2.10.0
orjson 3.8.10
oss2 2.16.0
packaging 23.2
pai-easycv 0.7.0
pandas 2.0.0
pandocfilters 1.5.0
panel 0.13.1
param 1.12.0
parsel 1.6.0
parso 0.8.3
partd 1.2.0
pathlib 1.0.1
pathspec 0.9.0
pathy 0.10.2
patsy 0.5.2
pbr 5.11.1
pdfminer 20191125
pdfminer.six 20221105
pdfplumber 0.9.0
pep8 1.7.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.5.0
pip 23.1.2
pkginfo 1.8.2
platformdirs 2.5.2
plotly 5.14.1
pluggy 1.0.0
ply 3.11
pooch 1.7.0
portalocker 2.7.0
poyo 0.5.0
pre-commit 3.2.1
preshed 3.0.8
prettytable 3.5.0
proglog 0.1.10
prometheus-client 0.14.1
promise 2.3
prompt-toolkit 3.0.41
Protego 0.1.16
protobuf 3.20.3
psutil 5.9.0
psycopg2 2.8.6
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 9.0.0
py-data-juicer 0.1.2 /Users/mazhijian/Documents/Project_2023/P01_LLM/C02_Solutions/data-juicer
py4j 0.10.9.7
py7zr 0.20.5
pyarrow 12.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pybcj 1.0.1
pybind11 2.10.4
pyclipper 1.3.0.post4
pycocotools 2.0.6
pycodestyle 2.8.0
pycosat 0.6.3
pycparser 2.21
pycryptodome 3.15.0
pycryptodomex 3.18.0
pyct 0.4.8
pycurl 7.45.1
pydantic 1.7.4
pydeck 0.8.1b0
PyDispatcher 2.0.5
pydocstyle 6.1.1
pydub 0.25.1
pyerfa 2.0.0
pyflakes 2.4.0
pyglove 0.3.0
Pygments 2.15.1
PyHamcrest 2.0.2
PyJWT 2.6.0
pylint 2.14.5
pyls-spyder 0.4.0
pyltp 0.4.0
Pympler 1.0.1
pynvml 11.5.0
pyobjc-core 8.5
pyobjc-framework-Cocoa 8.5
pyobjc-framework-CoreServices 8.5
pyobjc-framework-FSEvents 8.5
pyodbc 4.0.34
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyperclip 1.8.2
pypinyin 0.49.0
pyplumber 0.1.9
pyppmd 1.0.0
PyQt5-sip 12.11.0
pyquaternion 0.9.9
pyrsistent 0.19.3
PySocks 1.7.1
pyspark 3.4.0
pytest 7.1.2
pytest-timeout 1.4.2
pythainlp 4.0.2
python-crfsuite 0.9.9
python-dateutil 2.8.2
python-docx 0.8.11
python-Levenshtein 0.21.1
python-louvain 0.16
python-lsp-black 1.2.1
python-lsp-jsonrpc 1.0.0
python-lsp-server 1.5.0
python-multipart 0.0.6
python-pptx 0.6.21
python-slugify 8.0.1
python-snappy 0.6.0
pytorch-metric-learning 1.6.3
pytz 2023.3
pytz-deprecation-shim 0.1.0.post0
pyvi 0.1.1
pyviz-comms 2.0.2
PyWavelets 1.3.0
PyYAML 5.4.1
pyzmq 23.2.0
pyzstd 0.15.9
QDarkStyle 3.0.2
qstylizer 0.1.10
QtAwesome 1.0.3
qtconsole 5.3.2
QtPy 2.2.0
qudida 0.0.4
queuelib 1.5.0
rapidfuzz 2.13.2
ray 2.7.1
rdflib 6.3.2
readme-renderer 42.0
recommonmark 0.7.1
redis 4.5.5
regex 2022.7.9
requests 2.28.2
requests-file 1.5.1
requests-oauthlib 1.3.1
requests-toolbelt 1.0.0
resampy 0.4.2
responses 0.18.0
rfc3986 2.0.0
rich 13.3.5
rope 0.22.0
rouge 1.0.1
rouge-score 0.1.2
rsa 4.9
Rtree 0.9.7
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.15.100
s3transfer 0.3.7
sacrebleu 2.0.0
sacremoses 0.0.53
safetensors 0.4.0
schema 0.7.5
scikit-image 0.19.3
scikit-learn 1.2.2
scikit-learn-intelex 2021.20221004.121333
scipy 1.11.3
Scrapy 2.6.2
seaborn 0.11.2
selectolax 0.3.13
semantic-version 2.10.0
Send2Trash 1.8.0
sentencepiece 0.1.95
seqeval 1.2.2
seqio 0.0.16
seqio-nightly 0.0.15.dev20230702
service-identity 18.1.0
setuptools 68.0.0
Shapely 1.8.5.post1
shotdetect-scenedetect-lgss 0.0.3
simhash-py 0.4.2
simhash-pybind 0.0.2
simplejson 3.18.0
sip 6.6.2
six 1.16.0
sklearn 0.0.post1
sklearn-crfsuite 0.3.6
smart-open 5.2.1
smmap 5.0.0
sniffio 1.3.0
snowballstemmer 2.2.0
sortedcollections 2.1.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.3.1
spacy 3.5.0
spacy-legacy 3.0.12
spacy-loggers 1.0.4
spacy-pkuseg 0.0.32
Sphinx 5.0.2
sphinx-autobuild 2021.3.14
sphinx-rtd-theme 1.2.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jquery 4.1
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
spyder 5.3.3
spyder-kernels 2.3.3
SQLAlchemy 1.4.39
srsly 2.4.5
stack-data 0.6.3
stanza 1.7.0
starlette 0.26.1
statsmodels 0.13.2
stevedore 5.1.0
streamlit 1.25.0
subword-nmt 0.3.8
sympy 1.10.1
t5 0.9.4
tables 3.6.1
tabulate 0.8.10
TBB 0.2
tblib 1.7.0
tenacity 8.2.2
tensorboard 2.12.3
tensorboard-data-server 0.7.1
tensorboard-plugin-wit 1.8.1
tensorflow-datasets 4.9.2
tensorflow-estimator 2.12.0
tensorflow-hub 0.13.0
tensorflow-io-gcs-filesystem 0.32.0
tensorflow-metadata 1.13.1
tensorflow-text 2.12.1
tensorstore 0.1.40
termcolor 2.1.0
terminado 0.13.1
terminaltables 3.1.10
testpath 0.6.0
text-unidecode 1.3
textdistance 4.2.1
texttable 1.6.7
tf-slim 1.1.0
tfds-nightly 4.9.2.dev202307030045
thinc 8.1.10
thinc-apple-ops 0.1.3
thop 0.1.1.post2209072238
threadpoolctl 2.2.0
three-merge 0.1.1
tifffile 2021.7.2
timm 0.6.11
tinycss 0.4
tld 0.13
tldextract 3.2.0
tokenizers 0.13.3
toml 0.10.2
tomli 1.2.3
tomlkit 0.11.1
toolz 0.12.0
torch 2.1.1
torch-struct 0.5
torchmetrics 0.10.3
torchvision 0.16.1
tornado 6.1
tqdm 4.66.1
trafilatura 1.6.0
traitlets 5.1.1
traittypes 0.2.1
trankit 1.1.1
transformers 4.31.0
twine 4.0.2
Twisted 22.2.0
typer 0.7.0
types-mock 5.0.0.7
types-requests 2.31.0.1
types-setuptools 68.0.0.0
types-urllib3 1.26.25.13
typeshed-client 2.3.0
typing 3.7.4.3
typing_extensions 4.5.0
tzdata 2023.3
tzlocal 4.3
uc-micro-py 1.0.1
ujson 5.4.0
ukkonen 1.0.1
Unidecode 1.2.0
urllib3 1.26.15
uvicorn 0.21.1
validators 0.20.0
virtualenv 20.17.1
w3lib 1.21.0
Wand 0.6.11
wasabi 0.10.1
watchdog 2.1.6
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
websockets 11.0.1
Werkzeug 2.0.3
wget 3.2
whatthepatch 1.0.2
wheel 0.40.0
widgetsnbextension 3.5.2
wrapt 1.14.1
wurlitzer 3.0.2
xarray 0.20.1
xgboost 1.5.2
xlrd 2.0.1
XlsxWriter 3.0.3
xlwings 0.27.15
xtcocotools 1.12
xxhash 3.1.0
xyzservices 2022.9.0
yacs 0.1.8
yapf 0.31.0
yarl 1.8.2
zh-core-web-md 3.5.0
zhconv 1.4.3
zhon 1.1.5
zict 2.1.0
zipp 3.8.0
zope.interface 5.4.0
zstandard 0.21.0

Reproduction script

Source code in test.py

import ray
ray.init()

# The Result is Wrong.
def test1():
    ds = ray.data.range(40)
    ds = ds.add_column('meta',lambda df: [{}] * len(df))

    def fn(sample):
        sample['meta']['id'] = sample['id']
        print(sample)
        return sample
    ds = ds.map(fn)
    print('test1 ds = ', ds.take_all())

# The Result is Correct.
def test2():
    ds = ray.data.range(40)

    def fn(sample):
        if 'meta' not in sample:
            sample['meta'] = {}
        sample['meta']['id'] = sample['id']
        return sample
    ds = ds.map(fn)
    print('test2 ds = ', ds.take_all())

test1()
test2()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@zhijianma zhijianma added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 5, 2024
@anyscalesam anyscalesam added the data Ray Data-related issues label Jan 8, 2024
@scottjlee (Contributor)

This happens because of this line:

ds = ds.add_column('meta',lambda df: [{}] * len(df))

List multiplication repeats a reference to a single dict, so the same dict object is used for each element in the resulting list. After updating this to:

ds = ds.add_column('meta',lambda df: [{} for _ in range(len(df))])

I get the following expected result:

test1 ds =  [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}, {'id': 20, 'meta': {'id': 20}}, {'id': 21, 'meta': {'id': 21}}, {'id': 2, 'meta': {'id': 2}}, {'id': 3, 'meta': {'id': 3}}, {'id': 4, 'meta': {'id': 4}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 6}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 8}}, {'id': 9, 'meta': {'id': 9}}, {'id': 10, 'meta': {'id': 10}}, {'id': 11, 'meta': {'id': 11}}, {'id': 12, 'meta': {'id': 12}}, {'id': 13, 'meta': {'id': 13}}, {'id': 14, 'meta': {'id': 14}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 16}}, {'id': 17, 'meta': {'id': 17}}, {'id': 18, 'meta': {'id': 18}}, {'id': 19, 'meta': {'id': 19}}, {'id': 22, 'meta': {'id': 22}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 24}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 26}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 28}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 30}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 32}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 34}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 36}}, {'id': 37, 'meta': {'id': 37}}, {'id': 38, 'meta': {'id': 38}}, {'id': 39, 'meta': {'id': 39}}]
2024-01-08 14:53:37,353	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=20 for stage ReadRange to satisfy parallelism at least twice the available number of CPUs (10).
2024-01-08 14:53:37,353	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange->Map(fn)]
2024-01-08 14:53:37,353	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-01-08 14:53:37,353	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
test2 ds =  [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}, {'id': 2, 'meta': {'id': 2}}, {'id': 3, 'meta': {'id': 3}}, {'id': 4, 'meta': {'id': 4}}, {'id': 5, 'meta': {'id': 5}}, {'id': 6, 'meta': {'id': 6}}, {'id': 7, 'meta': {'id': 7}}, {'id': 8, 'meta': {'id': 8}}, {'id': 9, 'meta': {'id': 9}}, {'id': 10, 'meta': {'id': 10}}, {'id': 11, 'meta': {'id': 11}}, {'id': 12, 'meta': {'id': 12}}, {'id': 13, 'meta': {'id': 13}}, {'id': 14, 'meta': {'id': 14}}, {'id': 15, 'meta': {'id': 15}}, {'id': 16, 'meta': {'id': 16}}, {'id': 17, 'meta': {'id': 17}}, {'id': 18, 'meta': {'id': 18}}, {'id': 19, 'meta': {'id': 19}}, {'id': 20, 'meta': {'id': 20}}, {'id': 21, 'meta': {'id': 21}}, {'id': 22, 'meta': {'id': 22}}, {'id': 23, 'meta': {'id': 23}}, {'id': 24, 'meta': {'id': 24}}, {'id': 25, 'meta': {'id': 25}}, {'id': 26, 'meta': {'id': 26}}, {'id': 27, 'meta': {'id': 27}}, {'id': 28, 'meta': {'id': 28}}, {'id': 29, 'meta': {'id': 29}}, {'id': 30, 'meta': {'id': 30}}, {'id': 31, 'meta': {'id': 31}}, {'id': 32, 'meta': {'id': 32}}, {'id': 33, 'meta': {'id': 33}}, {'id': 34, 'meta': {'id': 34}}, {'id': 35, 'meta': {'id': 35}}, {'id': 36, 'meta': {'id': 36}}, {'id': 37, 'meta': {'id': 37}}, {'id': 38, 'meta': {'id': 38}}, {'id': 39, 'meta': {'id': 39}}]
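
The aliasing can be reproduced without Ray. The sketch below is plain Python; the helper name `apply_fn_to_block` and the two-row block are assumptions chosen to mimic the pairs seen in the original `test1` output, not Ray internals.

```python
def fn(sample):
    # Same update as in the issue: mutate the nested dict in place.
    sample['meta']['id'] = sample['id']
    return sample


def apply_fn_to_block(ids, metas):
    # Hypothetical stand-in for mapping fn over the rows of one block.
    rows = [{'id': i, 'meta': m} for i, m in zip(ids, metas)]
    return [fn(row) for row in rows]


ids = [0, 1]

# List multiplication repeats a reference to ONE dict.
shared = [{}] * len(ids)
assert shared[0] is shared[1]
buggy = apply_fn_to_block(ids, shared)

# A comprehension evaluates {} once per element, giving distinct dicts.
distinct = [{} for _ in range(len(ids))]
assert distinct[0] is not distinct[1]
fixed = apply_fn_to_block(ids, distinct)

print(buggy)  # [{'id': 0, 'meta': {'id': 1}}, {'id': 1, 'meta': {'id': 1}}]
print(fixed)  # [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}]
```

With the shared dict, the last write wins for every row that references it. This would also be consistent with the original log, where each print inside fn shows the right value at the moment it runs while the collected result does not.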

Please feel free to re-open the issue if I missed anything.

@zhijianma (Author)

@scottjlee Thank you so much. It works for me.
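
When the code that builds the column cannot be changed, another option is to have the mapped function copy the nested dict instead of mutating it in place. This is a plain-Python sketch and not something proposed in the thread; `fn_copy` is a hypothetical name, and the two-row list only imitates rows that share one dict object.

```python
def fn_copy(sample):
    # Copy the nested dict so rows sharing one dict object stop affecting each other.
    meta = dict(sample['meta'])
    meta['id'] = sample['id']
    # Return a new row rather than modifying the input row.
    return {**sample, 'meta': meta}


# Two rows whose 'meta' values are the same dict object, as with [{}] * n.
shared = [{}] * 2
rows = [{'id': i, 'meta': m} for i, m in enumerate(shared)]

result = [fn_copy(row) for row in rows]
print(result)  # [{'id': 0, 'meta': {'id': 0}}, {'id': 1, 'meta': {'id': 1}}]
```

The copy is shallow, so it only protects the top level of meta; deeper nested values that are shared would need `copy.deepcopy`.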
