merge #11

Merged
merged 2 commits into from
Feb 23, 2024
13 changes: 12 additions & 1 deletion Dockerfile
@@ -3,11 +3,22 @@

FROM python:3.8.18

+ # prepare the java env
+ WORKDIR /opt
+ # download jdk
+ RUN wget https://aka.ms/download-jdk/microsoft-jdk-17.0.9-linux-x64.tar.gz -O jdk.tar.gz && \
+ tar -xzf jdk.tar.gz && \
+ rm -rf jdk.tar.gz && \
+ mv jdk-17.0.9+8 jdk
+
+ # set the environment variable
+ ENV JAVA_HOME=/opt/jdk
+
WORKDIR /data-juicer

# install requirements first to better reuse installed library cache
COPY environments/ environments/
- RUN cat environments/* | xargs pip install
+ RUN cat environments/* | xargs pip install --default-timeout 1000

# install data-juicer then
COPY . .
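
Review note on the hunk above: it sets `JAVA_HOME=/opt/jdk` in the image. As a minimal sketch of how Python code running in the container might locate the `java` executable from that variable, consider the helper below. The name `java_binary` is hypothetical (it is not part of data-juicer), and the `bin/java` layout is an assumption about the unpacked JDK archive.

```python
import os
from pathlib import Path


def java_binary(env=None):
    """Return the expected path of the java executable, or None.

    Hypothetical helper: assumes the JDK keeps its launcher under
    $JAVA_HOME/bin/java, as JDK tarballs conventionally do.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        # JAVA_HOME is unset or empty, so there is nothing to resolve.
        return None
    return Path(java_home) / "bin" / "java"


if __name__ == "__main__":
    path = java_binary()
    if path is None:
        print("JAVA_HOME is not set")
    else:
        print(f"{path} exists: {path.exists()}")
```

Running this inside a container built from the image should report whether the launcher is actually present at the resolved path.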
6 changes: 3 additions & 3 deletions README.md
@@ -169,7 +169,7 @@ pip install py-data-juicer
latest `data-juicer` with provided [Dockerfile](Dockerfile):

```shell
- docker build -t data-juicer:<version_tag> .
+ docker build -t datajuicer/data-juicer:<version_tag> .
```

### Installation check
@@ -276,7 +276,7 @@ docker run --rm \ # remove container after the processing
--name dj \ # name of the container
-v <host_data_path>:<image_data_path> \ # mount data or config directory into the container
-v ~/.cache/:/root/.cache/ \ # mount the cache directory into the container to reuse caches and models (recommended)
- data-juicer:<version_tag> \ # image to run
+ datajuicer/data-juicer:<version_tag> \ # image to run
dj-process --config /path/to/config.yaml # similar data processing commands
```

@@ -289,7 +289,7 @@ docker run -dit \ # run the container in the background
--name dj \
-v <host_data_path>:<image_data_path> \
-v ~/.cache/:/root/.cache/ \
- data-juicer:latest /bin/bash
+ datajuicer/data-juicer:latest /bin/bash

# enter into this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash
6 changes: 3 additions & 3 deletions README_ZH.md
@@ -154,7 +154,7 @@ pip install py-data-juicer
- Or run the following command to build a docker image that includes the latest `data-juicer`, using the [Dockerfile](Dockerfile) we provide:

```shell
- docker build -t data-juicer:<version_tag> .
+ docker build -t datajuicer/data-juicer:<version_tag> .
```

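### Installation check
@@ -254,7 +254,7 @@ docker run --rm \ # remove the container after processing finishes
--name dj \ # name of the container
-v <host_data_path>:<image_data_path> \ # mount the local data or config directory into the container
-v ~/.cache/:/root/.cache/ \ # mount the cache directory into the container to reuse caches and model resources (recommended)
- data-juicer:<version_tag> \ # image to run
+ datajuicer/data-juicer:<version_tag> \ # image to run
dj-process --config /path/to/config.yaml # similar data processing commands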
```

@@ -267,7 +267,7 @@ docker run -dit \ # start the container in the background
--name dj \
-v <host_data_path>:<image_data_path> \
-v ~/.cache/:/root/.cache/ \
- data-juicer:latest /bin/bash
+ datajuicer/data-juicer:latest /bin/bash

# enter this container, and then you can use data-juicer in editable mode
docker exec -it <container_id> bash
4 changes: 2 additions & 2 deletions data_juicer/ops/common/helper_func.py
@@ -32,7 +32,7 @@ def strip(document, strip_characters):
emojis).

:param document: document to be processed
- :param strip_characters: characters uesd for stripping document
+ :param strip_characters: characters used for stripping document
:return: stripped document
"""
if not document:
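
Review note: the body of `strip` is collapsed in this diff, so the following is only a sketch of the behavior the docstring describes, not the repository's implementation. It assumes `strip_characters` is a collection of single characters (a set here) and that stripping means trimming them from both ends of the document. The name `strip_sketch` is hypothetical.

```python
def strip_sketch(document, strip_characters):
    """Trim leading and trailing characters found in strip_characters.

    Sketch only: assumes strip_characters is a set of single characters.
    """
    if not document:
        # Empty input is returned unchanged, mirroring the guard in the hunk.
        return document
    start, end = 0, len(document)
    # Advance past leading characters that should be stripped.
    while start < end and document[start] in strip_characters:
        start += 1
    # Walk back over trailing characters that should be stripped.
    while end > start and document[end - 1] in strip_characters:
        end -= 1
    return document[start:end]


if __name__ == "__main__":
    print(strip_sketch("**hello**", {"*"}))
```

Characters in the middle of the document are left alone; only the two ends are trimmed.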
@@ -76,7 +76,7 @@ def split_on_newline_tab_whitespace(document):

First split on "\\\\n", then on "\\\\t", then on " ".
:param document: document to be splited
- :return: setence list obtained after splitting document
+ :return: sentence list obtained after splitting document
"""
sentences = document.split('\n')
sentences = [sentence.split('\t') for sentence in sentences]
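
Review note: the hunk cuts off after the tab split. A minimal sketch of the three-level split the docstring describes (newline, then tab, then space) could look like the following. It assumes the result stays nested rather than flattened, and it uses a plain `str.split(' ')` for the last step; the function name `split_sketch` is hypothetical and the real function may handle whitespace differently.

```python
def split_sketch(document):
    """Split on newlines, then tabs, then single spaces.

    Returns a nested list: lines -> tab-separated cells -> words.
    Sketch only; the nesting and the final split are assumptions.
    """
    lines = document.split('\n')
    cells_per_line = [line.split('\t') for line in lines]
    return [[cell.split(' ') for cell in cells] for cells in cells_per_line]


if __name__ == "__main__":
    print(split_sketch("a b\tc\nd"))
```

Keeping the nesting preserves which separator each boundary came from, which matters if the pieces are later joined back together.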
4 changes: 2 additions & 2 deletions data_juicer/ops/filter/image_size_filter.py
@@ -8,7 +8,7 @@

@OPERATORS.register_module('image_size_filter')
class ImageSizeFilter(Filter):
- """Keep data samples whose image size (in bytes/kb/MB/...) within a
+ """Keep data samples whose image size (in Bytes/KB/MB/...) within a
specific range.
"""

@@ -24,7 +24,7 @@ def __init__(self,
:param min_size: The min image size to keep samples. set to be "0" by
default for no size constraint
:param max_size: The max image size to keep samples. set to be
- "1Tb" by default, an approximate for un-limited case
+ "1TB" by default, an approximate for un-limited case
:param any_or_all: keep this sample with 'any' or 'all' strategy of
all images. 'any': keep this sample if any images meet the
condition. 'all': keep this sample only if all images meet the
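
Review note: the docstring describes size bounds given as strings ("0", "1TB") and an 'any'/'all' strategy across the images of a sample. The sketch below illustrates both ideas under stated assumptions: units are binary multiples (1 KB = 1024 bytes), unit matching is case-insensitive, and image sizes arrive as byte counts. `parse_size` and `keep_sample` are hypothetical names, not the filter's actual methods.

```python
import re

# Assumed binary multiples; the real filter may use a different convention.
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a size string such as '0', '512KB' or '1TB' to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])


def keep_sample(image_sizes, min_size="0", max_size="1TB", any_or_all="any"):
    """Decide whether a sample is kept, given its image sizes in bytes."""
    low, high = parse_size(min_size), parse_size(max_size)
    in_range = [low <= size <= high for size in image_sizes]
    if any_or_all == "any":
        return any(in_range)
    if any_or_all == "all":
        return all(in_range)
    raise ValueError("any_or_all must be 'any' or 'all'")


if __name__ == "__main__":
    print(keep_sample([10, 2000], max_size="1KB", any_or_all="any"))
    print(keep_sample([10, 2000], max_size="1KB", any_or_all="all"))
```

With one image under the 1 KB limit and one over it, the 'any' strategy keeps the sample and the 'all' strategy drops it.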
@@ -16,7 +16,7 @@ def __init__(self,
"""
Initialization method.

- :param keep_alphabet: whether to keep alpabet
+ :param keep_alphabet: whether to keep alphabet
:param keep_number: whether to keep number
:param keep_punc: whether to keep punctuation
:param args: extra args