I have questions about creating a pre-training model. #22
If you happen to notice, there are two flags in `data_utils.py`: `--task` and `--num_task`. Hence, one (possible but not recommended) way to do this using bash alone could be:

```sh
NUM_PROC=1000
for i in `seq 0 $((NUM_PROC - 1))`; do
  python data_utils.py \
    .... \  # other flags
    --task=${i} \
    --num_task=${NUM_PROC} &
done
```

Essentially, you launch 1000 processes to process the data in parallel. For better handling, you could wrap the code with the Python `multiprocessing` module.

For the second question, we never tried other sub-word models, but my guess is it should be fine.
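The suggestion to wrap this with Python's `multiprocessing` module could be sketched as follows. This is a minimal illustration, not the project's actual code: `process_shard` is a hypothetical stand-in for the per-shard work that `data_utils.py` performs when given `--task` and `--num_task`.

```python
from multiprocessing import Pool

NUM_PROC = 8  # e.g. 1000 on a large machine, as in the bash example

def process_shard(task):
    # Hypothetical stand-in for the real work: data_utils.py would
    # process only the records assigned to this shard.
    corpus = range(100)  # placeholder corpus of 100 "sentences"
    shard = [s for s in corpus if s % NUM_PROC == task]
    return len(shard)  # number of records handled by this shard

def run():
    # One worker per shard, mirroring --task=i --num_task=NUM_PROC.
    with Pool(NUM_PROC) as pool:
        return pool.map(process_shard, range(NUM_PROC))

if __name__ == "__main__":
    counts = run()
    print(sum(counts))
```

Compared with backgrounded shell jobs, `Pool` gives you ordered results and straightforward error propagation from the workers.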
Hi, we're working on pre-training a model, and I have two questions about the process.
First, the amount of data I have is about 180 million sentences, and it takes too long to create the TFRecords. I need advice on speeding this up.
Second, is there any performance problem if I choose a different model type when creating the SentencePiece model, e.g. bpe, char, or word?
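For reference, the subword algorithm is selected via the SentencePiece trainer's `--model_type` flag; a sketch of the CLI invocation, where the input and output names are placeholders:

```sh
# Train a SentencePiece model; --model_type accepts unigram (the default),
# bpe, char, or word. File names here are placeholders.
spm_train --input=corpus.txt \
          --model_prefix=sp_model \
          --vocab_size=32000 \
          --model_type=bpe
```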