
Convert Llama2 pickle to Llama3 #55

Open
G4V opened this issue Aug 12, 2024 · 14 comments
@G4V (Contributor) commented Aug 12, 2024

No description provided.

@G4V (Contributor, Author) commented Aug 12, 2024

import pickle
import sys
from functools import partial

import pandas as pd
from transformers import AutoTokenizer

llama_prompt_system = "<|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
llama_prompt_no_system = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def format_llama_input(row):
    if row['system_prompt']:
        return llama_prompt_system.format(row['system_prompt'], row['question'])
    else:
        return llama_prompt_no_system.format(row['question'])

def _tokenize_helper(x, llama_tokenizer=None):
    if not isinstance(x, str):
        return []

    return llama_tokenizer(x)["input_ids"]

input_pkl = sys.argv[1]  #"/local/mnt/workspace/gsimpson/work_collection_old/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
output_pkl = sys.argv[2] #"/local/mnt/workspace/gsimpson/work_collection/downloaded_openorca_mlperf_dataset_llama3_full/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
with open(input_pkl, "rb") as f:
    df = pickle.load(f)

df["input"] = df.apply(format_llama_input, axis=1)

input_tokenizer = partial(_tokenize_helper, llama_tokenizer=tok)
output_tokenizer = partial(_tokenize_helper, llama_tokenizer=tok)
df['tok_input'] = df['input'].apply(input_tokenizer)
df['tok_output'] = df['output'].apply(output_tokenizer)
df['tok_input_length'] = df['tok_input'].apply(len)
df['tok_output_length'] = df['tok_output'].apply(len)

print(df["input"][0])
print(input_tokenizer(df["input"][0]))
print(df["tok_input"][0] == input_tokenizer(df["input"][0]))

with open(output_pkl, "wb") as f:
    pickle.dump(df, f)
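For a quick sanity check of the two templates without downloading the tokenizer, the formatting logic above can be exercised on a toy DataFrame (the template strings and the `system_prompt`/`question` column names are taken from the script; the rows below are made up):

```python
# Toy sanity check for the prompt templates in the script above.
# The two template strings are copied verbatim; the example rows are invented.
import pandas as pd

llama_prompt_system = "<|start_header_id|>system<|end_header_id|>\n\n{}<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
llama_prompt_no_system = "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

def format_llama_input(row):
    # Rows with an empty system_prompt fall through to the no-system template.
    if row["system_prompt"]:
        return llama_prompt_system.format(row["system_prompt"], row["question"])
    return llama_prompt_no_system.format(row["question"])

df = pd.DataFrame({
    "system_prompt": ["You are a helpful assistant.", ""],
    "question": ["What is 2 + 2?", "Name a colour."],
})
df["input"] = df.apply(format_llama_input, axis=1)
```

Both branches should produce a prompt ending in the assistant header, which is what the converter relies on when re-tokenizing.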

@maria-18-git maria-18-git self-assigned this Aug 19, 2024
@maria-18-git (Contributor)

1. Download hf_tokeniser

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,hf_tokeniser,model_family=llama2,variant=7b,hf_token=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
...
        "/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Llama-2-7b-chat-hf" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser" --local-dir-use-symlinks False --token=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
/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/lib/python3.8/site-packages/huggingface_hub/commands/download.py:132: FutureWarning: Ignoring --local-dir-use-symlinks. Downloading to a local directory does not use symlinks anymore.
  warnings.warn(
Fetching 3 files:   0%| 0/3 [00:00<?, ?it/s]
Downloading 'tokenizer.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.json.a6e931b92caff4c79c5c56282f1e89569a0ae558.incomplete'
Downloading 'tokenizer.model' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.incomplete'
Downloading 'tokenizer_config.json' to '/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/.cache/huggingface/download/tokenizer_config.json.a0024735c8dd7afe47fe72792b2c4edaff63bd3b.incomplete'
tokenizer_config.json: 100% 1.62k/1.62k [00:00<00:00, 363kB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer_config.json
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 5.36MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.json
tokenizer.model: 100% 500k/500k [00:00<00:00, 2.52MB/s]
Download complete. Moving file to /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser/tokenizer.model
Fetching 3 files: 100% 3/3 [00:00<00:00, 3.29it/s]
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_Llama-2-7b-chat-hf_tokeniser']

real    0m6.397s
user    0m2.993s
sys     0m0.231s

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,hf_tokeniser,model_family=llama2 --- , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser
total 2312
drwxr-xr-x  3 mmirkina users    4096 Aug 28 06:08 .
drwxr-xr-x 67 mmirkina users    4096 Aug 28 06:08 ..
drwxr-xr-x  3 mmirkina users    4096 Aug 28 06:08 .cache
-rw-r--r--  1 mmirkina users     969 Aug 28 06:08 data_axs.json
-rw-r--r--  1 mmirkina users    1618 Aug 28 06:08 tokenizer_config.json
-rw-r--r--  1 mmirkina users 1842767 Aug 28 06:08 tokenizer.json
-rw-r--r--  1 mmirkina users  499723 Aug 28 06:08 tokenizer.model

@maria-18-git (Contributor) commented Aug 28, 2024

2. Download dataset

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ time axs byquery downloaded,dataset_name=openorca
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 831.7050881385803 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv

...
['^', 'byname', 'downloaded_openorca_mlperf_dataset']

real    15m36.378s
user    14m44.304s
sys     0m38.590s

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery downloaded,dataset_name=openorca , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset
total 3992124
drwxr-xr-x  2 mmirkina users       4096 Aug 28 06:34 .
drwxr-xr-x 70 mmirkina users       4096 Aug 28 06:18 ..
-rw-r--r--  1 mmirkina users       2861 Aug 28 06:34 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 28 06:34 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 28 06:33 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 28 06:34 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 28 06:33 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 28 06:34 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 28 06:34 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 28 06:34 open_orca_gpt4_tokenized_llama.t0.pkl
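The processorca.py log above draws 6144 samples from each of the four subset origins to build the sampled_24576 pickle. A toy sketch of that per-subset sampling with pandas (the subset names come from the log; the rows and sample size here are invented for illustration):

```python
# Toy illustration of the per-subset sampling seen in the processorca.py log.
# Subset names are from the log; rows and per_subset size here are invented.
import pandas as pd

df = pd.DataFrame({
    "origin": ["cot"] * 10 + ["flan"] * 10 + ["niv"] * 10 + ["t0"] * 10,
    "question": [f"q{i}" for i in range(40)],
})

per_subset = 5  # 6144 in the real run, for 4 * 6144 = 24576 total samples
sampled = df.groupby("origin").sample(n=per_subset, random_state=0)
```

Stratifying per origin keeps the four subsets equally represented in the final pickle, matching the "Sampling 6144 from ..." lines in the log.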

@maria-18-git (Contributor) commented Aug 28, 2024

3. Convert llama2 pickle file to llama3 pickle file

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>

You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'converted_pickle_file_llama2_to_llama3']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ axs byquery converted,pickle_file,llama2_to_llama3 , get_path
/local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convertor_pickle_llama2_to_llama3$ ls -la /local/mnt/workspace/mmirkina/work_collection/converted_pickle_file_llama2_to_llama3
total 87156
drwxr-xr-x  2 mmirkina users     4096 Aug 28 10:15 .
drwxr-xr-x 74 mmirkina users     4096 Aug 28 10:15 ..
-rw-r--r--  1 mmirkina users      334 Aug 28 10:15 data_axs.json
-rw-r--r--  1 mmirkina users 89234160 Aug 28 10:15 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
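The token-id dump printed by the converter starts with the Llama 3 special tokens. The mapping below is my reading of the Meta-Llama-3 reserved ids (treat it as an assumption to be cross-checked against the downloaded tokenizer_config.json, not something stated in the logs):

```python
# Assumed Llama 3 special-token ids (verify against the tokenizer files).
special = {
    128000: "<|begin_of_text|>",
    128006: "<|start_header_id|>",
    128007: "<|end_header_id|>",
    128009: "<|eot_id|>",
}

# First few ids from the converter's printed list above; 9125 and 271 are
# ordinary vocabulary tokens ("system" and the following newlines).
head = [128000, 128006, 9125, 128007, 271]
named = [special.get(i, "?") for i in head]
```

If the mapping holds, the dump decodes as begin_of_text, start_header_id, "system", end_header_id, "\n\n" -- i.e. exactly the header structure of the template.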

@maria-18-git (Contributor)

Renamed an entry:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev$ axs byname convert_pickle_file_llama2_to_llama3
['^', 'byname', 'convert_pickle_file_llama2_to_llama3']

@maria-18-git (Contributor) commented Sep 3, 2024

Preprocess converted pickle file from llama2

  • Download the llama2 dataset.
  • Convert the llama2 pickle file to llama3.
  • Run preprocessing for llama3 (using the converted pickle file).

1. Download llama2 dataset:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/dataset_openorca_mlperf_recipe$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/mlperf_inference_git_master/language/llama2-70b/processorca.py --dataset_pq_path=/local/mnt/workspace/mmirkina/work_collection/downloaded_1M-GPT4-Augmented.parquet/1M-GPT4-Augmented.parquet --model_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser --seqlen_limit=1024 --export_dir=/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b --num_total_samples=24576
Tokenizing input
Loaded parquet and tokenized in 834.4697501659393 sec.
Unique sample origin datasets: ['flan' 't0' 'cot' 'niv']
Subset 'cot' has 69692 samples
Subset 'flan' has 371689 samples
Subset 'niv' has 25195 samples
Subset 't0' has 109271 samples
Sampling 6144 from cot
Sampling 6144 from flan
Sampling 6144 from niv
Sampling 6144 from t0
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_openorca_mlperf_dataset_llama2_7b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama2,variant=7b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b
total 3992132
drwxr-xr-x  2 mmirkina users       4096 Aug 30 09:24 .
drwxr-xr-x 77 mmirkina users       4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users        378 Aug 30 08:20 data_axs.json
-rw-r--r--  1 mmirkina users    3708395 Aug 30 08:35 open_orca_gpt4_tokenized_llama.calibration_1000.pkl
-rw-r--r--  1 mmirkina users  163295727 Aug 30 08:35 open_orca_gpt4_tokenized_llama.cot.pkl
-rw-r--r--  1 mmirkina users 1203462167 Aug 30 08:35 open_orca_gpt4_tokenized_llama.flan.pkl
-rw-r--r--  1 mmirkina users 1996603812 Aug 30 08:35 open_orca_gpt4_tokenized_llama.full.pkl
-rw-r--r--  1 mmirkina users  109943881 Aug 30 08:35 open_orca_gpt4_tokenized_llama.niv.pkl
-rw-r--r--  1 mmirkina users   90970516 Aug 30 08:35 open_orca_gpt4_tokenized_llama.sampled_24576.pkl
-rw-r--r--  1 mmirkina users  519905608 Aug 30 08:35 open_orca_gpt4_tokenized_llama.t0.pkl

@maria-18-git (Contributor) commented Sep 3, 2024

2. Convert pickle file from llama2 to llama3:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2qaic-dev/convert_pickle_file_llama2_to_llama3/convert_pickle_llama2_to_llama3.py --input_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl --output_pkl_path /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|start_header_id|>system<|end_header_id|>

You are an AI assistant that helps people find information. User will you give you a question. Your task is to answer as faithfully as you can. While answering think step-bystep and justify your answer.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the sentence "A woman with a fairy tattoo on her back is carrying a purse with a red floral print." can we conclude that "The woman's purse has red flowers on it."?
Options:
- yes
- it is not possible to tell
- no Now, let's be accurate as possible. Some thinking first:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 15592, 18328, 430, 8779, 1274, 1505, 2038, 13, 2724, 690, 499, 3041, 499, 264, 3488, 13, 4718, 3465, 374, 311, 4320, 439, 94176, 439, 499, 649, 13, 6104, 36864, 1781, 3094, 1481, 599, 752, 323, 9541, 701, 4320, 13, 128009, 198, 128006, 882, 128007, 271, 22818, 279, 11914, 330, 32, 5333, 449, 264, 45586, 32894, 389, 1077, 1203, 374, 15691, 264, 53101, 449, 264, 2579, 46119, 1194, 1210, 649, 584, 32194, 430, 330, 791, 5333, 596, 53101, 706, 2579, 19837, 389, 433, 1210, 5380, 3883, 512, 12, 10035, 198, 12, 433, 374, 539, 3284, 311, 3371, 198, 12, 912, 4800, 11, 1095, 596, 387, 13687, 439, 3284, 13, 4427, 7422, 1176, 25, 128009, 128006, 78191, 128007]
True
INFO:root:Matched Rule #1/2 produced an entry, which matches the original query.

['^', 'byname', 'downloaded_openorca_dataset_llama3_8b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery downloaded,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b
total 87160
drwxr-xr-x  2 mmirkina users     4096 Aug 30 08:58 .
drwxr-xr-x 77 mmirkina users     4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users      402 Aug 30 08:57 data_axs.json
-rw-r--r--  1 mmirkina users 89234160 Aug 30 08:58 open_orca_gpt4_tokenized_llama.sampled_24576.pkl

@maria-18-git (Contributor)

3. Preprocess llama2

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama2,variant=7b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_mlperf_dataset_llama2_7b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama2_7b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama2_7b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 09:43 .
drwxr-xr-x 76 mmirkina users      4096 Aug 30 09:43 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 09:43 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 09:43 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:43 masked_tokens.bin
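The .bin sizes listed above are consistent with 24576 samples, a sequence length of 1024, and 4-byte (int32) elements, with input_lengths.bin holding one int32 per sample. This layout is inferred from the byte counts, not stated anywhere in the logs:

```python
# Inferred layout of the preprocessed .bin files. Assumptions (deduced from
# the byte sizes above, not confirmed by the logs): int32 elements,
# 24576 samples, seqlen 1024.
import numpy as np

num_samples, seqlen = 24576, 1024
itemsize = np.dtype(np.int32).itemsize  # 4 bytes

# attention_mask.bin, input_ids_padded.bin, masked_tokens.bin: one int32 per token
per_token_bytes = num_samples * seqlen * itemsize

# input_lengths.bin: one int32 per sample
per_sample_bytes = num_samples * itemsize

# If the assumption holds, the files could be memory-mapped as, e.g.:
#   ids = np.memmap("input_ids_padded.bin", dtype=np.int32,
#                   mode="r", shape=(num_samples, seqlen))
```

The arithmetic matches the listings exactly (100663296 and 98304 bytes), which is a useful quick check that preprocessing produced full-sized outputs.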

@maria-18-git (Contributor)

4. Preprocess llama3

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Llama-2-7b-chat-hf_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']

Path:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 09:48 .
drwxr-xr-x 77 mmirkina users      4096 Aug 30 09:48 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 09:48 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 09:48 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 09:48 masked_tokens.bin

@maria-18-git (Contributor)

We should add a download of the llama3 tokenizer and use it for preprocessing.

  1. Download tokenizer
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/llm_hf_weights_recipe$ axs byquery downloaded,hf_tokeniser,model_family=llama3,variant=8b,hf_token=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
...
        "/local/mnt/workspace/mmirkina/work_collection/huggingface_hub_package_for_python3.8/install/bin/huggingface-cli" download "meta-llama/Meta-Llama-3-8B" --include "tokenizer*" --local-dir "/local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser" --local-dir-use-symlinks False --token=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
...
['^', 'byname', 'downloaded_Meta-Llama-3-8B_tokeniser']

Then run preprocess again:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b
...
        /usr/local/bin/python3 /local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor/main.py /local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b/open_orca_gpt4_tokenized_llama.sampled_24576.pkl /local/mnt/workspace/mmirkina/work_collection/downloaded_Meta-Llama-3-8B_tokeniser /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:root:Matched Rule #1/1 produced an entry, which matches the original query.

['^', 'byname', 'preprocessed_openorca_dataset_full_llama3_8b']

@maria-18-git (Contributor)

Results:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ axs byquery preprocessed,dataset_name=openorca,model_family=llama3,variant=8b , get_path
/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/axs2mlperf/openorca_preprocessor$ ls -la /local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b
total 295020
drwxr-xr-x  2 mmirkina users      4096 Aug 30 12:41 .
drwxr-xr-x 79 mmirkina users      4096 Aug 30 12:41 ..
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 attention_mask.bin
-rw-r--r--  1 mmirkina users       269 Aug 30 12:41 data_axs.json
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 input_ids_padded.bin
-rw-r--r--  1 mmirkina users     98304 Aug 30 12:41 input_lengths.bin
-rw-r--r--  1 mmirkina users 100663296 Aug 30 12:41 masked_tokens.bin

md5sum:

mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/preprocessed_openorca_dataset_full_llama3_8b$ md5sum *
ad85eb788057c30b577515c6a0ea9dde  attention_mask.bin
8d75c4e008272ca80b86921c3ce74c13  data_axs.json
ab2342a9d49ab1f262dda8a631c89ed3  input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5  input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb  masked_tokens.bin

Gavin's results - md5sum:

gsimpson@aus121-r760-0:~/work_collection/preprocessed_openorca_dataset_full_2024.08.09_07h56m49s$ md5sum *
ad85eb788057c30b577515c6a0ea9dde  attention_mask.bin
bc8a9916b4b544ed2bc4034ca10fede3  input_ids_padded.bin
185870a3dcf544c5e8019b9253799bc5  input_lengths.bin
8f9a38edc2e0b024eb4240f7dd93a0cb  masked_tokens.bin
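The md5sum comparison above can also be reproduced from Python, which is convenient when scripting the check across both work collections (a small helper sketch; the file paths are the ones listed above):

```python
# Chunked md5 of a file, producing the same digests as the md5sum output above.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Example usage against the preprocessed outputs:
#   md5_of("preprocessed_openorca_dataset_full_llama3_8b/input_ids_padded.bin")
```

Reading in chunks keeps memory flat even for the multi-hundred-megabyte .bin files.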

@maria-18-git (Contributor) commented Sep 3, 2024

Only input_ids_padded.bin differs.

@maria-18-git (Contributor)

Converted pickle file:

gsimpson@aus121-r760-0:~/datasets/llama3/openorca$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
526a7f803d9600d90b766f42b8a4ca75  open_orca_gpt4_tokenized_llama.sampled_24576.pkl
mmirkina@aus121-r760-0:/local/mnt/workspace/mmirkina/work_collection/downloaded_openorca_dataset_llama3_8b$ md5sum open_orca_gpt4_tokenized_llama.sampled_24576.pkl
9abc215c84747ff248d5c3e5cec4442f  open_orca_gpt4_tokenized_llama.sampled_24576.pkl
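Since the two converted pickles have different md5 sums, a natural next step is locating the rows where they diverge. A toy sketch of a row-wise tok_input comparison (the column name comes from the conversion script; in practice the two DataFrames would come from pickle.load on the .pkl files above):

```python
# Toy row-wise diff of the tok_input column between two DataFrames.
# In practice a and b would be pickle.load-ed from the two .pkl files above.
import pandas as pd

a = pd.DataFrame({"tok_input": [[1, 2], [3, 4], [5]]})
b = pd.DataFrame({"tok_input": [[1, 2], [3, 9], [5]]})

# Indices where the token lists disagree (assumes identical row ordering).
diff_rows = [i for i in a.index if a["tok_input"][i] != b["tok_input"][i]]
```

Printing a few diverging rows (and decoding them with the llama3 tokenizer) would show whether the difference is in the prompt text or only in tokenization.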

@maria-18-git (Contributor) commented Sep 3, 2024

Commits:
- axs2mlperf:
Added model_family, variant for llama2 in dataset_openorca_mlperf_recipe
Added tokeniser rule for llama3
Added model_family, variant to openorca_preprocessor
- axs2qaic-dev:
Added model_family and variant to convert_pickle_file_llama2_to_llama3

For debugging: run a short accuracy experiment, as in the reference code, using the converted pickle file.
