fix logger not init when call preprocess_data.py (#744)
GuoxiaWang authored Sep 16, 2022
1 parent 85870f8 commit 12fbfd2
Showing 4 changed files with 16 additions and 9 deletions.
12 changes: 6 additions & 6 deletions ppfleetx/data/data_tools/gpt/README.md
@@ -48,8 +48,8 @@ cd PaddleFleetX
### Raw data
First, download the sample data:
```
-mkdir -p data/wikitext_103_en
-wget -O data/wikitext_103_en/wikitext-103-en.txt http://fleet.bj.bcebos.com/datasets/gpt/wikitext-103-en.txt
+mkdir -p dataset/wikitext_103_en
+wget -O dataset/wikitext_103_en/wikitext-103-en.txt http://fleet.bj.bcebos.com/datasets/gpt/wikitext-103-en.txt
```
### Converting raw data to jsonl format
Use `raw_trans_to_json.py` to convert the raw data to JSON-string format. The script's usage is shown below.
@@ -82,7 +82,7 @@ optional arguments:
```
Following the usage above, the simple command below produces the `wikitext_103_en.jsonl` file. Here, all docs are shuffled.
```shell
-python ppfleetx/data/data_tools/gpt/raw_trans_to_json.py --input_path ./data/wikitext_103_en --output_path ./data/wikitext_103_en/wikitext_103_en
+python ppfleetx/data/data_tools/gpt/raw_trans_to_json.py --input_path ./dataset/wikitext_103_en --output_path ./dataset/wikitext_103_en/wikitext_103_en

# output of terminal
# Time to startup: 0.0075109004974365234
@@ -93,7 +93,7 @@ python ppfleetx/data/data_tools/gpt/raw_trans_to_json.py --input_path ./data/wi
# File shuffled!!!

# Inspect the data. Because the data is shuffled, the content below may differ.
-tail -1 ./data/wikitext_103_en/wikitext_103_en.jsonl
+tail -1 ./dataset/wikitext_103_en/wikitext_103_en.jsonl
{"text": "The album was released in June 1973 . Although it received good reviews , it did not sell well , except in Austin , where it sold more copies than earlier records by Nelson did nationwide . The recording led Nelson to a new style ; he later stated regarding his new musical identity that Shotgun Willie had \" cleared his throat . \" It became his breakthrough record , and one of the first of the outlaw movement , music created without the influence of the conservative Nashville Sound . The album — the first to feature Nelson with long hair and a beard on the cover — gained him the interest of younger audiences . It peaked at number 41 on Billboard 's album chart and the songs \" Shotgun Willie \" and \" Stay All Night ( Stay A Little Longer ) \" peaked at number 60 and 22 on Billboard Hot 100 respectively .\nRolling Stone wrote : \" With this flawless album , Willie Nelson finally demonstrates why he has for so long been regarded as a Country & Western singer @-@ songwriter 's singer @-@ songwriter ... At the age of 39 , Nelson finally seems destined for the stardom he deserves \" . Robert Christgau wrote : \" This attempt to turn Nelson into a star runs into trouble when it induces him to outshout Memphis horns or Western swing . \"\nBillboard wrote : \" This is Willie Nelson at his narrative best . He writes and sings with the love and the hurt and the down @-@ to @-@ earth things he feels , and he has a few peers . \" Texas Monthly praised Nelson and Wexler regarding the change in musical style : \" They 've switched his arrangements from Ray Price to Ray Charles — the result : a revitalized music . He 's the same old Willie , but veteran producer Jerry Wexler finally captured on wax the energy Nelson projects in person \" . School Library Journal wrote : \" Willie Nelson differs ( from ) rock artists framing their music with a country & western facade — in that he appears a honky @-@ tonk stardust cowboy to the core . 
This album abounds in unabashed sentimentalism , nasal singing , lyrics preoccupied with booze , religion , and love gone bad , and stereotyped Nashville instrumentation ( twangy steel guitars , fiddles , and a clean rhythm section characterized by the minimal use of bass drum and cymbals , both of which gain heavy mileage with rock performers ) .\nStephen Thomas Erlewine wrote in his review for Allmusic : \" Willie Nelson offered his finest record to date for his debut – possibly his finest album ever . Shotgun Willie encapsulates Willie 's world view and music , finding him at a peak as a composer , interpreter , and performer . This is laid @-@ back , deceptively complex music , equal parts country , rock attitude , jazz musicianship , and troubadour storytelling \" .\n"}
```
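The record layout shown above (one JSON object per line, with the document under a `text` key) can be sketched in a few lines of Python. This is only an illustration of the jsonl format and of the shuffle step, not the implementation of `raw_trans_to_json.py`; `to_jsonl`, `read_jsonl`, and the `seed` parameter are hypothetical names introduced here.

```python
import json
import random


def to_jsonl(docs, path, seed=None):
    """Shuffle docs and write one {"text": ...} JSON object per line.

    Hypothetical helper: it mirrors the output format shown above,
    not the actual logic of raw_trans_to_json.py.
    """
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            # json.dumps escapes embedded newlines as \n, so each
            # document stays on a single physical line.
            f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield the "text" field of every non-empty line in a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)["text"]
```

Because the documents are shuffled, a reader should not rely on their order, which is why `tail -1` may print a different record on each run.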

@@ -156,9 +156,9 @@ python ppfleetx/data/data_tools/gpt/preprocess_data.py \
--model_name gpt2 \
--tokenizer_name GPTTokenizer \
--data_format JSON \
---input_path ./data/wikitext_103_en/wikitext_103_en.jsonl \
+--input_path ./dataset/wikitext_103_en/wikitext_103_en.jsonl \
--append_eos \
---output_prefix ./data/wikitext_103_en/wikitext_103_en \
+--output_prefix ./dataset/wikitext_103_en/wikitext_103_en \
--workers 40 \
--log_interval 1000
9 changes: 8 additions & 1 deletion ppfleetx/data/data_tools/gpt/preprocess_data.py
@@ -24,7 +24,14 @@
import numpy as np
from tqdm import tqdm

-import ppfleetx.data.tokenizers as tfs
+try:
+    from ppfleetx.data import tokenizers as tfs
+except ImportError:
+    __dir__ = os.path.dirname(os.path.abspath(__file__))
+    sys.path.append(os.path.abspath(os.path.join(__dir__, '../../../../')))
+    from ppfleetx.data import tokenizers as tfs
+from ppfleetx.utils.logger import init_logger
+init_logger()

try:
    import nltk
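The import fallback in this hunk follows a common pattern: try the package import first, and if the script is being run directly from inside the repository (so the package is not on `sys.path`), append the repository root and retry. A minimal generic sketch of that pattern, assuming nothing about `ppfleetx` itself; `import_with_repo_fallback` and its parameters are hypothetical names used only for illustration:

```python
import importlib
import os
import sys


def import_with_repo_fallback(module_name, script_path, levels_up):
    """Import module_name; on ImportError, add an ancestor directory of
    script_path to sys.path and retry once.

    Hypothetical helper illustrating the pattern in the hunk above.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        script_dir = os.path.dirname(os.path.abspath(script_path))
        # Walk `levels_up` directories upwards, like '../../../../' above.
        repo_root = os.path.abspath(
            os.path.join(script_dir, *[".."] * levels_up))
        sys.path.append(repo_root)
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

In the commit, the equivalent of `levels_up` is 4, because `preprocess_data.py` sits four directories below the repository root (`ppfleetx/data/data_tools/gpt/`).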
2 changes: 1 addition & 1 deletion ppfleetx/utils/download.py
@@ -16,7 +16,7 @@
import time
import requests
import shutil
-from . import logger
+from ppfleetx.utils import logger
from tqdm import tqdm
import paddle

2 changes: 1 addition & 1 deletion ppfleetx/utils/logger.py
@@ -40,7 +40,7 @@ def init_logger(name='ppfleetx', log_file=None, log_level=logging.INFO):
    """
    global _logger

-    # solve mutiple init issue when using paddleclas.py and engin.engin
+    # solve mutiple init issue
    init_flag = False
    if _logger is None:
        _logger = logging.getLogger(name)
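The guard visible in this hunk (`if _logger is None:`) is the usual way to make a logger initialiser safe to call more than once. A minimal standalone sketch of that idea, assuming a single stdout handler; it is not the `ppfleetx.utils.logger` implementation, and the `info` wrapper and its `RuntimeError` are illustrative assumptions that make the "logger not init" failure mode explicit:

```python
import logging
import sys

_logger = None


def init_logger(name="demo", log_level=logging.INFO):
    """Create the module-level logger once; later calls reuse it."""
    global _logger
    if _logger is None:
        _logger = logging.getLogger(name)
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("[%(asctime)s] [%(levelname)s] %(message)s"))
        # The handler is attached only on first init, so repeated calls
        # never produce duplicate log lines.
        _logger.addHandler(handler)
        _logger.propagate = False
    _logger.setLevel(log_level)
    return _logger


def info(msg):
    """Log at INFO level; raise if init_logger() was never called."""
    if _logger is None:
        raise RuntimeError("logger not initialised; call init_logger() first")
    _logger.info(msg)
```

A script that is run directly, as `preprocess_data.py` is, has to call the initialiser itself before any module-level logging helper is used, which is what the added `init_logger()` call in the commit does.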
