
Convert Llama2 pickle to Llama3 #55

Open
G4V opened this issue Aug 12, 2024 · 14 comments
@G4V (Contributor) commented Aug 12, 2024

No description provided.

@G4V (Contributor, Author) commented Aug 12, 2024

import pickle
import sys
from functools import partial

import pandas as pd
from transformers import AutoTokenizer

llama_prompt_system = "<|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
llama_prompt_no_system = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def format_llama_input(row):
    if row['system_prompt']:
        return llama_prompt_system.format(row['system_prompt'], row['question'])
    else:
        return llama_prompt_no_system.format(row['question'])

def _tokenize_helper(x, llama_tokenizer=None):
    if not isinstance(x, str):
        return []

    return llama_tokenizer(x)["input_ids"]

input_pkl = sys.argv[1]  #"/local/mnt/workspace/gsimpson/work_collection_old/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
output_pkl = sys.argv[2] #"/local/mnt/workspace/gsimpson/work_collection/downloaded_openorca_mlperf_dataset_llama3_full/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
with open(input_pkl, "rb") as f:
    df = pickle.load(f)

df["input"] = df.apply(format_llama_input, axis=1)

input_tokenizer = partial(_tokenize_helper, llama_tokenizer=tok)
output_tokenizer = partial(_tokenize_helper, llama_tokenizer=tok)
df['tok_input'] = df['input'].apply(input_tokenizer)
df['tok_output'] = df['output'].apply(output_tokenizer)
df['tok_input_length'] = df['tok_input'].apply(len)
df['tok_output_length'] = df['tok_output'].apply(len)

print(df["input"][0])
print(input_tokenizer(df["input"][0]))
print(df["tok_input"][0] == input_tokenizer(df["input"][0]))

with open(output_pkl, "wb") as f:
    pickle.dump(df, f)
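For a quick sanity check of the two templates without downloading the tokenizer, the formatting logic above can be exercised on a toy DataFrame (the template strings and the `system_prompt`/`question` column names are taken from the script; the rows below are made up):

```python
# Toy sanity check for the prompt templates in the script above.
# The two template strings are copied verbatim; the example rows are invented.
import pandas as pd

llama_prompt_system = "<|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
llama_prompt_no_system = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

def format_llama_input(row):
    # Rows with an empty system_prompt fall through to the no-system template.
    if row["system_prompt"]:
        return llama_prompt_system.format(row["system_prompt"], row["question"])
    return llama_prompt_no_system.format(row["question"])

df = pd.DataFrame({
    "system_prompt": ["You are a helpful assistant.", ""],
    "question": ["What is 2 + 2?", "Name a colour."],
})
df["input"] = df.apply(format_llama_input, axis=1)
```

Both branches should produce a prompt ending in the assistant header, which is what the converter relies on when re-tokenizing.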

@maria-18-git maria-18-git self-assigned this Aug 19, 2024
@maria-18-git (Contributor)

1. Download hf_tokeniser

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,hf_tokeniser,model_family=llama2,variant=7b,hf_token=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
...
        "/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Llama-2-7b-chat-hf" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser" --local-dir-use-symlinks False --token=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
/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/lib/python3.8/site-packages/huggingface_hub/commands/download.py:132: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Fetching 3 files:   0%| 0/3 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.json.a6e931b92caff4c79c5c56282f1e89569a0ae558.incomplete'
Downloading 'tokenizer.model' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.incomplete'
Downloading 'tokenizer_config.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer_config.json.a0024735c8dd7afe47fe72792b2c4edaff63bd3b.incomplete'
tokenizer_config.json: 100% 1.62k/1.62k [00:00<00:00, 363kB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer_config.json
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.36MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.json
tokenizer.model: 100% 500k/500k [00:00<00:00, 2.52MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.model
Fetching 3 files: 100% 3/3 [00:00<00:00, 3.29it/s]
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_Llama-2-7b-chat-hf_tokeniser']

real    0m6.397s
user    0m2.993s
sys     0m0.231s

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,hf_tokeniser,model_family=llama2 --- , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
total 2312
drwxr-xr-x  3 mmirkina users    4096 Aug 28 06:08 .
drwxr-xr-x 67 mmirkina users    4096 Aug 28 06:08 ..
drwxr-xr-x  3 mmirkina users    4096 Aug 28 06:08 .cache
-rw-r--r--  1 mmirkina users     969 Aug 28 06:08 data_axs.json
-rw-r--r--  1 mmirkina users    1618 Aug 28 06:08 tokenizer_config.json
-rw-r--r--  1 mmirkina users 1842767 Aug 28 06:08 tokenizer.json
-rw-r--r--  1 mmirkina users  499723 Aug 28 06:08 tokenizer.model

@maria-18-git (Contributor) commented Aug 28, 2024

2. Download dataset

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,dataset_name=openorca
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 831.7050881385803 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv

...
['^', 'byname', 'downloaded_openorca_mlperf_dataset']

real    15m36.378s
user    14m44.304s
sys     0m38.590s

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,dataset_name=openorca , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
total 3992124
drwxr-xr-x  2 mmirkina users       4096 Aug 28 06:34 .
drwxr-xr-x 70 mmirkina users       4096 Aug 28 06:18 ..
-rw-r--r--  1 mmirkina users       2861 Aug 28 06:34 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 28 06:34 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 28 06:33 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 28 06:34 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 28 06:33 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 28 06:34 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 28 06:34 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 28 06:34 open_orca_gpt4_tokenized_llama.t0.pkl
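The processorca.py log above draws 6144 samples from each of the four subset origins to build the sampled_24576 pickle. A toy sketch of that per-subset sampling with pandas (the subset names come from the log; the rows and sample size here are invented for illustration):

```python
# Toy illustration of the per-subset sampling seen in the processorca.py log.
# Subset names are from the log; rows and per_subset size here are invented.
import pandas as pd

df = pd.DataFrame({
    "origin": ["cot"] * 10 + ["flan"] * 10 + ["niv"] * 10 + ["t0"] * 10,
    "question": [f"q{i}" for i in range(40)],
})

per_subset = 5  # 6144 in the real run, for 4 * 6144 = 24576 total samples
sampled = df.groupby("origin").sample(n=per_subset, random_state=0)
```

Stratifying per origin keeps the four subsets equally represented in the final pickle, matching the "Sampling 6144 from ..." lines in the log.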

@maria-18-git (Contributor) commented Aug 28, 2024

3. Convert llama2 pickle file to llama3 pickle file

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>

You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'converted_pickle_file_llama2_to_llama3']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3 , get_path
/local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
total 87156
drwxr-xr-x  2 mmirkina users     4096 Aug 28 10:15 .
drwxr-xr-x 74 mmirkina users     4096 Aug 28 10:15 ..
-rw-r--r--  1 mmirkina users      334 Aug 28 10:15 data_axs.json
-rw-r--r--  1 mmirkina users 89234160 Aug 28 10:15 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
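The token-id dump printed by the converter starts with the Llama 3 special tokens. The mapping below is my reading of the Meta-Llama-3 reserved ids (treat it as an assumption to be cross-checked against the downloaded tokenizer_config.json, not something stated in the logs):

```python
# Assumed Llama 3 special-token ids (verify against the tokenizer files).
special = {
    128000: "<|begin_of_text|>",
    128006: "<|start_header_id|>",
    128007: "<|end_header_id|>",
    128009: "<|eot_id|>",
}

# First few ids from the converter's printed list above; 9125 and 271 are
# ordinary vocabulary tokens ("system" and the following newlines).
head = [128000, 128006, 9125, 128007, 271]
named = [special.get(i, "?") for i in head]
```

If the mapping holds, the dump decodes as begin_of_text, start_header_id, "system", end_header_id, "\n\n" -- i.e. exactly the header structure of the template.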

@maria-18-git (Contributor)

Renamed an entry:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev$ axs byname convert_pickle_file_llama2_to_llama3
['^', 'byname', 'convert_pickle_file_llama2_to_llama3']

@maria-18-git (Contributor) commented Sep 3, 2024

Preprocess converted pickle file from llama2

  • Download the llama2 dataset.
  • Convert the llama2 pickle file to llama3.
  • Run preprocessing for llama3 (using the converted pickle file).

1. Download llama2 dataset:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/dataset_openorca_mlperf_recipe$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 834.4697501659393 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv
Sampling 6144 from t0
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_openorca_mlperf_dataset_llama2_7b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
total 3992132
drwxr-xr-x  2 mmirkina users       4096 Aug 30 09:24 .
drwxr-xr-x 77 mmirkina users       4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users        378 Aug 30 08:20 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 30 08:35 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 30 08:35 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 30 08:35 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 30 08:35 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 30 08:35 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 30 08:35 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 30 08:35 open_orca_gpt4_tokenized_llama.t0.pkl

@maria-18-git (Contributor) commented Sep 3, 2024

2. Convert pickle file from llama2 to llama3:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convert_pickle_file_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>

You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_openorca_dataset_llama3_8b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
total 87160
drwxr-xr-x  2 mmirkina users     4096 Aug 30 08:58 .
drwxr-xr-x 77 mmirkina users     4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users      402 Aug 30 08:57 data_axs.json
-rw-r--r--  1 mmirkina users 89234160 Aug 30 08:58 open_orca_gpt4_tokenized_llama.sampled_24576.pkl

@maria-18-git (Contributor)

3. Preprocess llama2

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama2,variant=7b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama2_7b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 09:43 .
drwxr-xr-x 76 mmirkina users      4096 Aug 30 09:43 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 09:43 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 09:43 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 masked_tokens.bin
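The .bin sizes listed above are consistent with 24576 samples, a sequence length of 1024, and 4-byte (int32) elements, with input_lengths.bin holding one int32 per sample. This layout is inferred from the byte counts, not stated anywhere in the logs:

```python
# Inferred layout of the preprocessed .bin files. Assumptions (deduced from
# the byte sizes above, not confirmed by the logs): int32 elements,
# 24576 samples, seqlen 1024.
import numpy as np

num_samples, seqlen = 24576, 1024
itemsize = np.dtype(np.int32).itemsize  # 4 bytes

# attention_mask.bin, input_ids_padded.bin, masked_tokens.bin: one int32 per token
per_token_bytes = num_samples * seqlen * itemsize

# input_lengths.bin: one int32 per sample
per_sample_bytes = num_samples * itemsize

# If the assumption holds, the files could be memory-mapped as, e.g.:
#   ids = np.memmap("input_ids_padded.bin", dtype=np.int32,
#                   mode="r", shape=(num_samples, seqlen))
```

The arithmetic matches the listings exactly (100663296 and 98304 bytes), which is a useful quick check that preprocessing produced full-sized outputs.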

@maria-18-git (Contributor)

4. Preprocess llama3

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 09:48 .
drwxr-xr-x 77 mmirkina users      4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 09:48 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 09:48 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 masked_tokens.bin

@maria-18-git (Contributor)

We should add a download of the llama3 tokenizer and use it for preprocessing.

  1. Download tokenizer
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/llm_hf_weights_recipe$ axs byquery downloaded,hf_tokeniser,model_family=llama3,variant=8b,hf_token=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
...
        "/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Meta-Llama-3-8B" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser" --local-dir-use-symlinks False --token=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
...
['^', 'byname', 'downloaded_Meta-Llama-3-8B_tokeniser']

Then run preprocess again:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']

@maria-18-git (Contributor)

Results:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 12:41 .
drwxr-xr-x 79 mmirkina users      4096 Aug 30 12:41 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 12:41 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 12:41 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 masked_tokens.bin

md5sum:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b$ md5sum *
ad85eb788057c30b577515c6a0ea9dde  attention_mask.bin
8d75c4e008272ca80b86921c3ce74c13  data_axs.json
ab2342a9d49ab1f262dda8a631c89ed3  input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5  input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb  masked_tokens.bin

Gavin's results - md5sum:

gsimpson@aus121-r760-0:~/work_collection/preprocessed_openorca_dataset_full_2024.08.09_07h56m49s$ md5sum *
ad85eb788057c30b577515c6a0ea9dde  attention_mask.bin
bc8a9916b4b544ed2bc4034ca10fede3  input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5  input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb  masked_tokens.bin
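The md5sum comparison above can also be reproduced from Python, which is convenient when scripting the check across both work collections (a small helper sketch; the file paths are the ones listed above):

```python
# Chunked md5 of a file, producing the same digests as the md5sum output above.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Example usage against the preprocessed outputs:
#   md5_of("preprocessed_openorca_dataset_full_llama3_8b/input_ids_padded.bin")
```

Reading in chunks keeps memory flat even for the multi-hundred-megabyte .bin files.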

@maria-18-git (Contributor) commented Sep 3, 2024

Only input_ids_padded.bin differs.

@maria-18-git (Contributor)

Converted pickle file:

gsimpson@aus121-r760-0:~/datasets/llama3/openorca$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
526a7f803d9600d90b766f42b8a4ca75  open_orca_gpt4_tokenized_llama.sampled_24576.pkl
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
9abc215c84747ff248d5c3e5cec4442f  open_orca_gpt4_tokenized_llama.sampled_24576.pkl
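Since the two converted pickles have different md5 sums, a natural next step is locating the rows where they diverge. A toy sketch of a row-wise tok_input comparison (the column name comes from the conversion script; in practice the two DataFrames would come from pickle.load on the .pkl files above):

```python
# Toy row-wise diff of the tok_input column between two DataFrames.
# In practice a and b would be pickle.load-ed from the two .pkl files above.
import pandas as pd

a = pd.DataFrame({"tok_input": [[1, 2], [3, 4], [5]]})
b = pd.DataFrame({"tok_input": [[1, 2], [3, 9], [5]]})

# Indices where the token lists disagree (assumes identical row ordering).
diff_rows = [i for i in a.index if a["tok_input"][i] != b["tok_input"][i]]
```

Printing a few diverging rows (and decoding them with the llama3 tokenizer) would show whether the difference is in the prompt text or only in tokenization.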

@maria-18-git (Contributor) commented Sep 3, 2024

Commits:
- axs2mlperf:
Added model_family, variant for llama2 in dataset_openorca_mlperf_recipe
Added tokeniser rule for llama3
Added model_family, variant to openorca_preprocessor
- axs2qaic-dev:
Added model_family and variant to convert_pickle_file_llama2_to_llama3

For debugging: run a short accuracy experiment, as in the reference code, using the converted pickle file.
