diff --git a/README.md b/README.md
index b256fb88..71606405 100644
--- a/README.md
+++ b/README.md
@@ -19,18 +19,27 @@
 ---
 ## Announcement
+
+- [2024-11] 🔈🔊 The `lmms-eval/v0.3.0` has been upgraded to support audio evaluations for audio models like Qwen2-Audio and Gemini_Audio across tasks such as AIR-Bench, Clotho-AQA, LibriSpeech, and more. Please refer to the [blog](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md) for more details!
+
+- [2024-07] 🎉🎉 We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)!
+
+- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details!
+
+- [2024-03] 📝📝 We have released the first version of `lmms-eval`; please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details!
+
+
+We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors.
+
- [2024-10] 🎉🎉 We welcome the new task [NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
- [2024-10] 🎉🎉 We welcome the new task [TemporalBench](https://huggingface.co/datasets/microsoft/TemporalBench) for fine-grained temporal understanding and reasoning for videos, which reveals a huge (>30%) human-AI gap.
- [2024-10] 🎉🎉 We welcome the new tasks [VDC](https://rese1f.github.io/aurora-web/) for video detailed captioning, [MovieChat-1K](https://rese1f.github.io/MovieChat/) for long-form video understanding, and [Vinoground](https://vinoground.github.io/), a temporal counterfactual LMM benchmark composed of 1000 short natural video-caption pairs. We also welcome the new models [AuroraCap](https://github.com/rese1f/aurora) and [MovieChat](https://github.com/rese1f/MovieChat).
- [2024-09] 🎉🎉 We welcome the new tasks [MMSearch](https://mmsearch.github.io/) and [MME-RealWorld](https://mme-realworld.github.io/).
- [2024-09] ⚙️⚙️ We upgrade `lmms-eval` to `0.2.3` with more tasks and features. We support a compact set of language task evaluations (code credit to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)), and we remove the registration logic at startup (for all models and tasks) to reduce overhead, so `lmms-eval` now launches only the necessary tasks and models. Please check the [release notes](https://github.com/EvolvingLMMs-Lab/lmms-eval/releases/tag/v0.2.3) for more details.
- [2024-08] 🎉🎉 We welcome the new models [LLaVA-OneVision](https://huggingface.co/papers/2408.03326) and [Mantis](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/162), and the new tasks [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench), [LongVideoBench](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/117), and [MMStar](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/158). We also provide a new SGLang Runtime API feature for the llava-onevision model; please refer to the [doc](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/commands.md) for inference acceleration.
-- [2024-07] 🎉🎉 We have released the [technical report](https://arxiv.org/abs/2407.12772) and [LiveBench](https://huggingface.co/spaces/lmms-lab/LiveBench)!
- [2024-07] 👨‍💻👨‍💻 The `lmms-eval/v0.2.1` has been upgraded to support more models, including [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA), [InternVL-2](https://github.com/OpenGVLab/InternVL), [VILA](https://github.com/NVlabs/VILA), and many more evaluation tasks, e.g. [Details Captions](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/136), [MLVU](https://arxiv.org/abs/2406.04264), [WildVision-Bench](https://huggingface.co/datasets/WildVision/wildvision-arena-data), [VITATECS](https://github.com/lscpku/VITATECS), and [LLaVA-Interleave-Bench](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/).
-- [2024-06] 🎬🎬 The `lmms-eval/v0.2.0` has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more.
Please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.2/) for more details
-
-- [2024-03] 📝📝 We have released the first version of `lmms-eval`, please refer to the [blog](https://lmms-lab.github.io/posts/lmms-eval-0.1/) for more details
+
## Why `lmms-eval`?

diff --git a/docs/lmms-eval-0.3.md b/docs/lmms-eval-0.3.md
new file mode 100644
index 00000000..b402b70c
--- /dev/null
+++ b/docs/lmms-eval-0.3.md
@@ -0,0 +1,265 @@
+# Integration of Audio Evaluation in LMMs-Eval
+
+## **Introduction**
+
+Humans perceive the world through both sight and sound, integrating visual cues with auditory signals such as speech, environmental sounds, and emotional tones.
+
+This dual sensory input enhances decision-making and overall understanding. Similarly, for multimodal models to achieve human-like comprehension, it is essential that they process visual and auditory data together.
+
+While many models have made progress in integrating audio understanding, there is still no reproducible and efficient evaluation toolkit to fairly assess their capabilities.
+
+To address this, we introduce an upgrade to the `lmms-eval` framework, focusing on audio understanding. Building on the success of `lmms-eval/v0.2.0`, the new `lmms-eval/v0.3.0` includes dedicated modules and designs for audio tasks, ensuring consistent evaluation across audio and visual modalities.
+
+This upgrade includes multiple benchmarks for audio understanding and instruction following, enabling standardized and reproducible comparisons of various audio models.
+
+## Audio Evaluation Pipeline
+
+1. **Improved Pipeline for Audio Evaluations**
+
+    Here's a breakdown of how audio dataset support is added.
+
+    1. **Load audio:** Audio files are hosted on Hugging Face and are loaded via the `doc_to_audio` function.
+        - The snippet below shows how audio documents are handled in lmms-eval.
+
+        ```python
+        def air_bench_doc_to_audio(doc):
+            return [doc["audio"]]
+        ```
+
+    2. **Format questions:** Questions and instructions are defined in each task's `utils.py`. For some Audio Instruction Following (AIF) tasks, we create custom prompts aligned with Qwen2-Audio's evaluation format, since the default dataset instructions are sometimes not clear enough. Model-specific prompts can be added alongside the default instruction.
+        - The snippet below shows an example of question formatting.
+
+        ```python
+        # This is the place where you format your question
+        def common_voice_15_doc_to_text(doc, lmms_eval_specific_kwargs):
+            pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+            post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+            return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}"
+        ```
+
+    3. **Process results:** Model outputs are evaluated using metrics from the official dataset implementations, or aligned with the implementation in [AudioBench](https://github.com/AudioLLMs/AudioBench). We primarily adopt three types of metrics:
+
+        **a. Accuracy:** Used for tasks with definitive ground-truth answers, such as multiple-choice questions.
+
+        **b. WER:** Applied to some Automatic Speech Recognition (ASR) tasks (a toy WER sketch follows after the GPT-4 example below).
+
+        **c. GPT-4 Eval:** Applied to open-ended responses. We align the evaluation prompt with the implementation in [AudioBench](https://github.com/AudioLLMs/AudioBench).
+
+        - The snippet below shows an example prompt for GPT-4 evaluation.
+
+        ```python
+        eval_prompt = """
+        [Question]
+        {question}
+
+        [Reference Answer]
+        {ground_truth}
+
+        [Model Answer]
+        {model_response}
+
+        [Task]
+        Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details.
+        Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
+        Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference.
+        Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
+        Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
+        Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
+        Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
+        Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.
+
+        Your response should be formatted as follows:
+        Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
+        Rating: (int)"""
+        ```
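+        - As a toy illustration of metric (b), the sketch below computes WER as word-level edit distance divided by reference length. This is an illustrative sketch only, assuming whitespace tokenization and lowercasing; the actual tasks follow the official dataset scorers, which also normalize punctuation.
+
+        ```python
+        def wer(reference: str, hypothesis: str) -> float:
+            # Word error rate: word-level Levenshtein distance / reference length
+            ref, hyp = reference.lower().split(), hypothesis.lower().split()
+            # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
+            dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
+            for i in range(len(ref) + 1):
+                dp[i][0] = i
+            for j in range(len(hyp) + 1):
+                dp[0][j] = j
+            for i in range(1, len(ref) + 1):
+                for j in range(1, len(hyp) + 1):
+                    sub = 0 if ref[i - 1] == hyp[j - 1] else 1
+                    dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
+                                   dp[i][j - 1] + 1,        # insertion
+                                   dp[i - 1][j - 1] + sub)  # substitution
+            return dp[-1][-1] / max(len(ref), 1)
+
+        print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6, about 0.167
+        ```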
+    4. **Aggregate results:** After evaluating each data instance, we aggregate the individual results to produce the overall evaluation metrics. Finally, we provide a summary table that consolidates all the evaluation results, similar to the one in [Google's Gemini report](https://arxiv.org/abs/2312.11805). A toy sketch of this aggregation appears at the end of this section.
+    5. **Grouped tasks:** For tasks with multiple subsets, we group all subset tasks together. For example, the AIR-Bench-Chat dataset includes 4 subsets: sound, music, speech, and mixed. By running `--tasks air_bench_chat`, all 4 subsets can be evaluated together, eliminating the need to specify each subset individually. We summarize all the grouped task names in Table 1. This pipeline ensures a thorough and standardized evaluation process for audio tasks, facilitating consistent and reliable performance assessment across tasks and datasets.
+        - The snippet below shows an example YAML file for task grouping.
+
+        ```yaml
+        group: air_bench_chat
+        task:
+          - air_bench_chat_sound
+          - air_bench_chat_music
+          - air_bench_chat_speech
+          - air_bench_chat_mixed
+        ```
+
+2. **Audio-based Capabilities**
+
+    Our selected benchmarks collectively evaluate several essential audio-based capabilities, inspired by [AudioBench](https://github.com/AudioLLMs/AudioBench):
+
+    1. **Audio Captioning:** The ability to accurately transcribe human speech and convert audio content into text.
+    2. **Speech Understanding:** The capability to comprehend the semantic meaning of human speech, enabling appropriate responses to questions and audio instructions.
+    3. **Audio Scene Understanding:** The ability to interpret non-human sounds, such as environmental sounds.
+    4. **Voice Understanding:** The capability to analyze non-speech human vocal information, including emotional states, accents, and speaker characteristics.
+    5. **Specialized Audio Processing:** The ability to analyze other audio types, such as musical compositions and multilingual content.
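+Before moving on to the datasets, here is a toy illustration of the aggregation and grouping described in steps 4 and 5 above. The record format is hypothetical; the real hooks are the `aggregation` functions referenced in each task YAML.
+
+```python
+from collections import defaultdict
+
+# Hypothetical (task, score) records produced by per-instance processing
+records = [
+    ("air_bench_chat_sound", 6.0), ("air_bench_chat_sound", 7.0),
+    ("air_bench_chat_music", 6.5), ("air_bench_chat_speech", 7.5),
+    ("air_bench_chat_mixed", 5.5),
+]
+
+per_task = defaultdict(list)
+for task, score in records:
+    per_task[task].append(score)
+
+# Per-subset means, then a macro average for the whole group
+summary = {task: sum(s) / len(s) for task, s in per_task.items()}
+summary["air_bench_chat"] = sum(summary.values()) / len(summary)
+print(summary)
+```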
+
+### **Meta Information for Audio Datasets**
+
+#### Table 1: Meta information for audio datasets
+
+| **Dataset** | **Year** | **Task Name in lmms-eval** | **Split** | **Task Format** | **Evaluation Metric** | **Number of QAs** | **Feature** |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| **AIR-Bench** | 2024 | air_bench_chat \| air_bench_foundation | chat, foundation | AIF | GPT-4 Eval (chat) \| Accuracy (foundation) | 2k (chat) \| 19k (foundation) | 1. Comprehensive tasks and audio types |
+| **Alpaca Audio** | 2024 | alpaca_audio | test | AIF | GPT-4 Eval | 100 | 1. Synthetic voice |
+| **Clotho-AQA** | 2022 | clotho_aqa | test \| val | AIF | Accuracy | test_v2 (2.06k), test \| val (1.44k \| 1.05k) | 1. Audio Question Answering<br>2. Single word answer<br>3. Text based question |
+| **Common_voice** | 2023 | common_voice_15 | test | ASR | WER(↓) (align with Qwen-audio) | en (16.4k) \| fr (16.1k) \| zh (10.6k) | 1. Real people voice<br>2. Captioning |
+| **GigaSpeech** | 2021 | gigaspeech | test \| dev | ASR | WER(↓) | dev (6.75k) \| test (25.6k) | 1. Transcription<br>2. Audio book<br>3. YouTube<br>4. Podcasts |
+| **LibriSpeech** | 2015 | librispeech | dev-clean \| dev-other \| test-clean \| test-other | ASR | WER(↓) | dev-clean (~2.48k) \| dev-other (~2.66k) \| test-clean (~2.55k) \| test-other (~2.70k) | 1. Transcription (audio book) |
+| **OpenHermes** | 2024 | openhermes | test | AIF | GPT-4 Eval | 100 | 1. Synthetic voice |
+| **MuchoMusic** | 2024 | muchomusic | test | AIF | Accuracy | 1.19k | 1. Music understanding |
+| **People_speech** | 2021 | people_speech_val | val | ASR | WER(↓) | 18.6k | 1. Real people voice<br>2. Captioning |
+| **Tedium v3** | 2018 | tedlium_dev_test | val | ASR | WER(↓) | 591 | 1. TED talk<br>2. Real people ASR<br>3. Captioning |
+| **VocalSound** | 2022 | vocalsound_test | test \| val | AIF | Accuracy | test (3.59k) \| val (1.86k) | 1. Vocal sound recognition<br>2. Non-speech |
+| **WavCaps** | 2024 | wavcaps | test | ASR | GPT-4 Eval | 1.73k | 1. Audio Captioning<br>2. ChatGPT-augmented captions |
+
+AIF refers to Audio Instruction Following, and ASR refers to Automatic Speech Recognition.
+
+### Alignment Check for Audio Datasets
+
+#### Table 2: Alignment check for audio datasets
+
+| | | **Metric** | **Qwen2-Audio-Instruct (lmms-eval)** | **Qwen2-Audio (lmms-eval)** |
+| --- | --- | --- | --- | --- |
+| **AIR-Bench-Chat** | Speech | GPT-Eval | 7.16 | |
+| | Sound | | 6.14 | |
+| | Music | | 6.66 | |
+| | Mixed | | 5.75 | |
+| **AIR-Bench-Foundation** | Speech | Acc | 62.89 | |
+| | Sound | | 55.42 | |
+| | Music | | 56.77 | |
+| **Alpaca** | test | GPT-Eval | 51.8 | |
+| **Clotho_aqa** | test | GPT-Eval | 0.7587 | |
+| **Common_voice** | zh | WER(↓) | 15.78 | 6.7 |
+| | en | | 36.01 | 27.9 |
+| | fr | | 39.88 | 34.8 |
+| **GigaSpeech** | dev | WER(↓) | 19.45 | 14 |
+| | test | | 22.6 | 15.01 |
+| **LibriSpeech** | dev-clean | WER(↓) | 4.24 | 1.66 |
+| | dev-other | | 6.54 | 3.66 |
+| | test-clean | | 3.59 | 1.74 |
+| | test-other | | 7.46 | 3.87 |
+| **MuchoMusic** | test | Acc | 68.32 | 45.07 |
+| **OpenHermes** | test | GPT-Eval | 46.8 | |
+| **People_speech** | val | WER(↓) | 25.86 | 17.1 |
+| **Tedium** | val | WER(↓) | 10.92 | 8.29 |
+| **VocalSound** | test | Acc | 0.936 | 0.81 |
+| | val | | 0.9288 | 0.8 |
+| **WavCaps** | test | GPT-Eval | 1.73 | |
+
+Results may be inconsistent with officially reported numbers: we do not have access to the original prompts, and we must maintain a fair environment for all models. For the base model, we do not test on the chat benchmarks.
+
+Certain datasets face an alignment challenge: datasets scored with rigid metrics such as WER, CIDEr, or BLEU cannot be aligned exactly, because model responses are sensitive to the prompt. We investigate this more deeply in the section [Robustness of the model](#robustness-of-the-model).
+
+## Evaluation Analysis and Thinking
+
+During our implementation, we observed several interesting phenomena that are worth discussing. We believe that reflecting on these aspects can help accelerate the development of truly robust audio evaluations.
+
+### Robustness of the model
+
+While trying to align the results, we found that the choice of chat template significantly impacts model performance, even for instruction-tuned models. This finding emerged while analyzing the Qwen2-Audio model. The original Qwen2-Audio repository uses a minimal prompt format: `"<|audio_bos|><|AUDIO|><|audio_eos|>"`.
+
+This basic format is then combined with various question prompts for different evaluation scenarios. However, it is not an instruction-style format, and applying a chat template can change the model's performance significantly.
+
+#### Table 3: Impact of Chat Template on Qwen2-Audio-Instruct's Performance
+
+| **Impact of Chat Template** | **Split** | **Metric** | **Chat Template (Off)** | **Chat Template (On)** |
+| --- | --- | --- | --- | --- |
+| **LibriSpeech** | dev-clean | WER(↓) | 2.65 | 4.24 |
+| | dev-other | | 5.36 | 6.54 |
+| | test-clean | | 2.91 | 3.59 |
+| | test-other | | 5.14 | 7.46 |
+| **People_speech** | val | WER(↓) | 21.92 | 25.86 |
+| **Tedium** | dev_test | WER(↓) | 9.56 | 10.92 |
+
+As the table above shows, the influence of the chat template is substantial. We believe this reflects the actual robustness of these models: current audio models may not be stable enough when coping with different text inputs.
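+To make the difference concrete, here is an illustrative sketch of the two prompt formats. The instruction-style version follows the chat-template conventions registered in our `qwen2_audio` model class; the exact special tokens are an assumption and may differ.
+
+```python
+question = "Please recognize the speech and only output the recognized content:"
+
+# Chat template OFF: the minimal base-model format
+prompt_off = "<|audio_bos|><|AUDIO|><|audio_eos|>" + question
+
+# Chat template ON: the same request wrapped as an instruction turn
+prompt_on = (
+    "<|im_start|>user\n"
+    "Audio 1: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
+    f"{question}<|im_end|>\n"
+    "<|im_start|>assistant\n"
+)
+
+# The model conditions on materially different contexts,
+# which is enough to shift WER noticeably (see Table 3).
+print(prompt_off)
+print(prompt_on)
+```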
+This also leads us to another question: "Are current metrics good at evaluating a model's performance?"
+
+### Rethinking the evaluation metrics
+
+Traditional fixed-format metrics like WER, CIDEr, and BLEU face several limitations in audio model evaluation:
+
+1. **Format Rigidity:** Fixed metrics struggle to properly assess responses that are semantically correct but differ in format from reference answers.
+2. **Prompt Sensitivity:** These metrics are highly sensitive to variations in input prompts, leading to inconsistent evaluation results.
+
+Due to these limitations, the scores reported in `lmms-eval` might slightly differ from those reported in original papers, highlighting the challenge of maintaining consistent evaluation standards across different frameworks.
+
+Looking ahead, model-based evaluators such as GPT-4 could offer a more flexible and robust evaluation approach. Such evaluators can better understand semantic meaning, handle diverse response formats, and provide more consistent scoring across different implementations. This shift from rigid metrics to intelligent evaluation systems may better capture the true capabilities of audio processing models.
+
+## Additional Experiments
+
+### Batch Size
+
+We perform an exploratory batch-inference experiment on Qwen2-Audio with the following results:
+
+#### Table 4: Impact of batch size
+
+| | **Split** | **Metric** | **Qwen2-Audio (BS=4)** | **Qwen2-Audio (BS=1)** |
+| --- | --- | --- | --- | --- |
+| **LibriSpeech** | dev-clean | WER(↓) | 1.66 | 1.66 |
+| | dev-other | | 4.4 | 3.66 |
+| | test-clean | | 1.75 | 1.74 |
+| | test-other | | 4.06 | 3.87 |
+| **Total Time** | | | 10 min 50 s | 5 min 23 s |
+
+As shown above, while batch inference (BS=4) significantly reduces inference time, it can lead to evaluation inconsistencies compared with single-sample processing (BS=1). This is a known issue in the `transformers` library that currently lacks a solution; a toy sketch of the padding mechanics involved follows below.
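+For intuition on where batch-induced differences come from, here is a toy sketch (made-up token ids, no real model) of the left padding that batching requires, and of the prompt trimming used in `qwen2_audio.py`:
+
+```python
+PAD = 0
+prompts = [[5, 6, 7, 8, 9], [11, 12]]  # two tokenized prompts
+
+# Batching requires equal lengths, so shorter prompts are left-padded
+width = max(len(p) for p in prompts)
+batch = [[PAD] * (width - len(p)) + p for p in prompts]
+print(batch)  # [[5, 6, 7, 8, 9], [0, 0, 0, 11, 12]]
+
+# Padded rows see extra (masked) positions, and batched kernels may reorder
+# floating-point operations, so BS>1 logits can drift slightly from BS=1 logits.
+
+# After generation, new tokens are recovered by trimming the prompt,
+# as in qwen2_audio.py: out_ids[len(in_ids):]
+fake_outputs = [row + [101, 102] for row in batch]  # pretend 2 generated tokens
+print([out[len(inp):] for inp, out in zip(batch, fake_outputs)])
+# [[101, 102], [101, 102]]
+```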
+### More Details and Feature Updates with `v0.3.0`
+
+1. **Supported Audio Tasks**
+    1. [AirBench](https://github.com/OFA-Sys/AIR-Bench)
+    2. [Alpaca Audio](https://tango2-web.github.io/)
+    3. [Clotho-AQA](https://github.com/partha2409/AquaNet)
+    4. [Common_voice_15](https://github.com/common-voice/common-voice)
+    5. [GigaSpeech](https://github.com/SpeechColab/GigaSpeech)
+    6. [LibriSpeech](https://www.openslr.org/12)
+    7. [OpenHermes](https://huggingface.co/datasets/AudioLLMs/openhermes_instruction_test)
+    8. [MuchoMusic](https://github.com/mulab-mir/muchomusic)
+    9. [Peoples_speech](https://mlcommons.org/datasets/peoples-speech/)
+    10. [Tedium v3](https://www.openslr.org/51/)
+    11. [VocalSound](https://github.com/YuanGongND/vocalsound)
+    12. [WavCaps](https://github.com/XinhaoMei/WavCaps)
+2. **Supported Audio Models**
+    1. [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio)
+    2. [Gemini_Audio](https://arxiv.org/abs/2312.11805)
+3. **Multi-Round Evaluation Support**
+    1. [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by [@CaraJ7](https://github.com/CaraJ7) in [#277](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/277)
+4. **Regression Test**
+    1. [Feat] add regression test and change saving logic related to `output_path` by [@Luodian](https://github.com/Luodian) in [#259](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/259)
+5. **Speed-Up by Loading Only Required Tasks and Models**
+    1. [feat] remove registeration logic and adding language evaluation tasks. by [@Luodian](https://github.com/Luodian) in [#218](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/218)
+6. **LMMs-Eval Analysis Tool**
+    1. Lite/Core-set Selection by Kaichen Zhang: https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/tools/lite
+    2. LiveBench by Fanyi Pu: https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/tools/live_bench
+7. **SGLang Evaluation**
+    1. [Feat] SGLang SRT commands in one go, async input for openai server by [@kcz358](https://github.com/kcz358) in [#212](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/212)
+    2. [Fix] Fix async append result in different order issue by [@kcz358](https://github.com/kcz358) in [#244](https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/244)
+
+## Contributors
+
+> Listed in order of contribution significance.
+
+**Core Contributors**
+
+Pengyun Wang, Cong Pham Ba, Yingluo Li, Fanyi Pu
+
+**Release Managers**
+
+Kairui Hu, Kaichen Zhang, Bo Li
diff --git a/lmms_eval/evaluator.py b/lmms_eval/evaluator.py
index 0646e21f..c6460e9d 100755
--- a/lmms_eval/evaluator.py
+++ b/lmms_eval/evaluator.py
@@ -492,7 +492,15 @@ def evaluate(
                     metrics = task.process_results(doc, [req.filtered_resps[filter_key] for req in requests])
                     if log_samples:
                         target = task.doc_to_target(doc)
-                        saved_doc = {key: value for key, value in doc.items() if "image" not in key}
+                        saved_doc = {}
+                        for key, value in doc.items():
+                            # Skip image columns, matched by key
+                            if "image" not in key:
+                                # Skip audio columns, which are dicts carrying a raw "array"
+                                if isinstance(value, dict) and "array" in value:
+                                    continue
+                                else:
+                                    saved_doc[key] = value
                         filtered_arguments = []
                         for req in requests:
                             # check if req.args is a list of tuples, and each item in the list is a serializable object
diff --git a/lmms_eval/evaluator_utils.py b/lmms_eval/evaluator_utils.py
index 48b5c978..58537d0f 100644
--- a/lmms_eval/evaluator_utils.py
+++ b/lmms_eval/evaluator_utils.py
@@ -346,7 +346,7 @@ def consolidate_group_results(
     task_root=None,
     show_group_table=False,
     task_aggregation_list=None,
-) -> Tuple[dict, dict, bool, Union[None,]]:
+) -> Tuple[dict, dict, bool, Union[None, dict]]:
     """
     (Recursively) calculates groups' aggregated metrics and updates the results and versions dictionaries with this info.
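The `saved_doc` change above keeps logged samples lightweight by skipping media columns. A toy sketch of the same predicate, assuming (as in the hunk) that audio columns are dicts carrying a raw `array`:

```python
doc = {
    "question": "What sound is this?",
    "image": "<PIL.Image>",                                  # dropped by key
    "audio": {"array": [0.1, 0.2], "sampling_rate": 16000},  # dropped by value
    "answer": "rain",
}

saved_doc = {
    key: value
    for key, value in doc.items()
    if "image" not in key and not (isinstance(value, dict) and "array" in value)
}
print(saved_doc)  # {'question': 'What sound is this?', 'answer': 'rain'}
```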
diff --git a/lmms_eval/models/__init__.py b/lmms_eval/models/__init__.py index 93103d28..ddf281f6 100644 --- a/lmms_eval/models/__init__.py +++ b/lmms_eval/models/__init__.py @@ -40,6 +40,7 @@ "phi3v": "Phi3v", "qwen_vl": "Qwen_VL", "qwen2_vl": "Qwen2_VL", + "qwen2_audio": "Qwen2_Audio", "qwen_vl_api": "Qwen_VL_API", "reka": "Reka", "srt_api": "SRT_API", diff --git a/lmms_eval/models/gemini_api.py b/lmms_eval/models/gemini_api.py index 65718bc5..69f520c1 100644 --- a/lmms_eval/models/gemini_api.py +++ b/lmms_eval/models/gemini_api.py @@ -4,6 +4,7 @@ import time from typing import List, Tuple +import datasets from accelerate import Accelerator, DistributedType from loguru import logger as eval_logger from PIL import Image @@ -25,16 +26,22 @@ eval_logger.error(f"Error importing generativeai: {str(e)}") genai = None +try: + import soundfile as sf +except Exception as e: + eval_logger.warning(f"Error importing soundfile, audio generation will not work: {str(e)}") + @register_model("gemini_api") class GeminiAPI(lmms): def __init__( self, model_version: str = "gemini-1.5-pro", - modality: str = "image", + # modality: str = "image", timeout: int = 120, continual_mode: bool = False, - response_persistent_folder: str = None, # We will cache the Gemini API response in this path and use it for future requests + response_persistent_folder: str = "./logs/gemini_persistent_folder", + # We will cache the Gemini API response in this path and use it for future requests **kwargs, ) -> None: super().__init__() @@ -42,12 +49,13 @@ def __init__( self.timeout = timeout self.model = genai.GenerativeModel(model_version) self.continual_mode = continual_mode - if self.continual_mode and response_persistent_folder is None: - raise ValueError("Continual mode requires a persistent path for the response. We will cache the Gemini API response in this path and use it for future requests. Please provide a valid path.") - self.response_persistent_folder = response_persistent_folder - if not os.path.exists(self.response_persistent_folder): - os.makedirs(self.response_persistent_folder) - self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") + # if self.continual_mode and response_persistent_folder is None: + # raise ValueError("Continual mode requires a persistent path for the response. We will cache the Gemini API response in this path and use it for future requests. 
Please provide a valid path.") + if self.continual_mode: + self.response_persistent_folder = response_persistent_folder + if not os.path.exists(self.response_persistent_folder): + os.makedirs(self.response_persistent_folder) + self.response_persistent_file = os.path.join(self.response_persistent_folder, f"{self.model_version}_response.json") if os.path.exists(self.response_persistent_file): with open(self.response_persistent_file, "r") as f: @@ -73,7 +81,7 @@ def __init__( self.device = self.accelerator.device - self.modality = modality + # self.modality = modality self.video_pool = [] @@ -107,9 +115,17 @@ def encode_video(self, video_path): self.video_pool.append(uploaded_obj) return uploaded_obj - def convert_video(self, images): + def encode_audio(self, audio): + audio_io = io.BytesIO() + sf.write(audio_io, audio["array"], audio["sampling_rate"], format="WAV") + return genai.upload_file(audio_io, mime_type="audio/wav") + + def convert_modality(self, images): for idx, img in enumerate(images): - if self.modality == "video" and isinstance(img, str): + if isinstance(img, dict) and "sampling_rate" in img: # audio + audio = self.encode_audio(img) + images[idx] = audio + elif isinstance(img, str): # video try: images[idx] = self.encode_video(img) except Exception as e: @@ -145,7 +161,7 @@ def get_uuid(task, split, doc_id): visuals = [doc_to_visual(self.task_dict[task][split][doc_id])] visuals = self.flatten(visuals) - visuals = self.convert_video(visuals) + visuals = self.convert_modality(visuals) message = [contexts] + visuals diff --git a/lmms_eval/models/model_utils/audio_processing.py b/lmms_eval/models/model_utils/audio_processing.py new file mode 100644 index 00000000..2c27970b --- /dev/null +++ b/lmms_eval/models/model_utils/audio_processing.py @@ -0,0 +1,7 @@ +import numpy as np +from librosa import resample + + +def downsample_audio(audio_array: np.ndarray, original_sr: int, target_sr: int) -> np.ndarray: + audio_resample_array = resample(audio_array, orig_sr=original_sr, target_sr=target_sr) + return audio_resample_array diff --git a/lmms_eval/models/qwen2_audio.py b/lmms_eval/models/qwen2_audio.py new file mode 100755 index 00000000..01f0cf22 --- /dev/null +++ b/lmms_eval/models/qwen2_audio.py @@ -0,0 +1,284 @@ +import base64 +from io import BytesIO +from typing import List, Optional, Tuple, Union + +import decord +import torch +from accelerate import Accelerator, DistributedType +from loguru import logger as eval_logger +from tqdm import tqdm +from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration + +from lmms_eval import utils +from lmms_eval.api.instance import Instance +from lmms_eval.api.model import lmms +from lmms_eval.api.registry import register_model +from lmms_eval.models.model_utils.audio_processing import downsample_audio + + +@register_model("qwen2_audio") +class Qwen2_Audio(lmms): + """ + Qwen2_Audio Model + "https://github.com/QwenLM/Qwen2-Audio" + """ + + def __init__( + self, + pretrained: str = "Qwen/Qwen2-Audio-7B", # Qwen/Qwen2-Audio-7B-Instruct + device: Optional[str] = "cuda", + device_map: Optional[str] = "cuda", + batch_size: Optional[Union[int, str]] = 1, + use_cache=True, + add_generation_prompt: bool = True, + add_system_prompt: bool = True, + simple_prompt: bool = False, + **kwargs, + ) -> None: + super().__init__() + # Do not use kwargs for now + assert kwargs == {}, f"Unexpected kwargs: {kwargs}" + + accelerator = Accelerator() + self.add_generation_prompt = add_generation_prompt + self.add_system_prompt = add_system_prompt + # If 
using the simple prompt, only prepend "<|audio_bos|><|AUDIO|><|audio_eos|>" + # to the question, to align with the original Qwen2-Audio format + self.simple_prompt = simple_prompt + if accelerator.num_processes > 1: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + elif accelerator.num_processes == 1 and device_map == "auto": + self._device = torch.device(device) + self.device_map = device_map + else: + self._device = torch.device(f"cuda:{accelerator.local_process_index}") + self.device_map = f"cuda:{accelerator.local_process_index}" + + self._model = Qwen2AudioForConditionalGeneration.from_pretrained( + pretrained, + torch_dtype="auto", + device_map=self.device_map, + ).eval() + + self.processor = AutoProcessor.from_pretrained(pretrained) + self.processor.tokenizer.padding_side = "left" + self._tokenizer = self.processor.tokenizer + + if not self.add_system_prompt: + # Overwrite chat template to exclude system prompt + self.processor.chat_template = ( + "{% set audio_count = namespace(value=0) %}" + "{% for message in messages %}" + "<|im_start|>{{ message['role'] }}\n" + "{% if message['content'] is string %}" + "{{ message['content'] }}<|im_end|>\n" + "{% else %}" + "{% for content in message['content'] %}" + "{% if 'audio' in content or 'audio_url' in content %}" + "{% set audio_count.value = audio_count.value + 1 %}" + "Audio {{ audio_count.value }}: <|audio_bos|><|AUDIO|><|audio_eos|>\n" + "{% elif 'text' in content %}" + "{{ content['text'] }}" + "{% endif %}" + "{% endfor %}" + "<|im_end|>\n" + "{% endif %}" + "{% endfor %}" + "{% if add_generation_prompt %}" + "<|im_start|>assistant\n" + "{% endif %}" + ) + + self._config = self.model.config + self.batch_size_per_gpu = int(batch_size) + self.use_cache = use_cache + + if accelerator.num_processes > 1: + assert accelerator.distributed_type in [ + DistributedType.FSDP, + DistributedType.MULTI_GPU, + ], "Unsupported distributed type provided. Only DDP and FSDP are supported." + if accelerator.distributed_type == DistributedType.FSDP: + self._model = accelerator.prepare(self.model) + else: + self._model = accelerator.prepare_model(self.model, evaluation_mode=True) + self.accelerator = accelerator + if self.accelerator.is_local_main_process: + eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism") + self._rank = self.accelerator.local_process_index + self._world_size = self.accelerator.num_processes + else: + self.model.to(self._device) + self._rank = 0 + self._world_size = 1 + + @property + def config(self): + # returns the associated transformers.AutoConfig for the given pretrained model
+ return self._config + + @property + def tokenizer(self): + return self._tokenizer + + @property + def model(self): + # returns the model, unwrapping it if using Accelerate + if hasattr(self, "accelerator"): + return self.accelerator.unwrap_model(self._model) + else: + return self._model + + @property + def eot_token_id(self): + return self.tokenizer.eos_token_id + + @property + def max_length(self): + return self._max_length + + @property + def batch_size(self): + return self.batch_size_per_gpu + + @property + def device(self): + return self._device + + @property + def rank(self): + return self._rank + + @property + def world_size(self): + return self._world_size + + def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: + raise NotImplementedError("Loglikelihood is not implemented for Qwen2_Audio") + + def flatten(self, input): + new_list = [] + for i in input: + for j in i: + new_list.append(j) + return new_list + + def generate_until(self, requests: List[Instance]) -> List[str]: + res = [] + + def _collate(x): + # the negative sign on len(toks) sorts descending - this has a few advantages: + # - time estimates will always be over not underestimates, which is more useful for planning + # - to know the size of a batch when going through the list, you know the first one is always the batch + # padded context length. this is useful to simplify the batching logic and more importantly to make + # automatic adaptive batches much much easier to implement + # - any OOMs will happen right away rather than near the end + toks = self.tokenizer.encode(x[0]) + return -len(toks), x[0] + + pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding") + # we group requests by their generation_kwargs, + # so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling + # in the same batch. + re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True) + chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None) + for chunk in chunks: + contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split = zip(*chunk) + task = task[0] + split = split[0] + batched_audios = [doc_to_visual[0](self.task_dict[task][split][ids]) for ids in doc_id] + flattened_audios = self.flatten(batched_audios) + + # we assume all gen kwargs in the batch are the same + # this is safe to assume because the `grouper` object ensures it. 
+ gen_kwargs = all_gen_kwargs[0] + + # Set default values for until and max_new_tokens + until = [self.tokenizer.decode(self.eot_token_id)] + + # Update values from gen_kwargs if present + if "until" in gen_kwargs: + until = gen_kwargs.pop("until") + if isinstance(until, str): + until = [until] + elif not isinstance(until, list): + raise ValueError(f"Expected `gen_kwargs['until']` to be of type Union[str,list] but got {type(until)}") + + # contexts = "<|audio_bos|><|AUDIO|><|audio_eos|>" + contexts + + if isinstance(contexts, tuple): + contexts = list(contexts) + + if not self.simple_prompt: + conversations = [] + for idx, context in enumerate(contexts): + conv = [{"role": "user", "content": []}] + for _ in batched_audios[idx]: + # This placeholder is just used to make the chat template render; + # the sampled audio arrays are passed to the processor separately + conv[0]["content"].append({"type": "audio", "audio_url": "placeholder.wav"}) + conv[0]["content"].append({"type": "text", "text": context}) + conversations.append(conv) + + text = [self.processor.apply_chat_template(conversation, add_generation_prompt=self.add_generation_prompt, tokenize=False) for conversation in conversations] + else: + text = ["<|audio_bos|><|AUDIO|><|audio_eos|>" + context for context in contexts] + audios = [downsample_audio(audio["array"], audio["sampling_rate"], self.processor.feature_extractor.sampling_rate) for audio in flattened_audios] + + inputs = self.processor(text=text, audios=audios, return_tensors="pt", padding=True, sampling_rate=self.processor.feature_extractor.sampling_rate) + + if self.device_map == "auto": + inputs = inputs.to("cuda") + else: + inputs = inputs.to(self.device) + + if "max_new_tokens" not in gen_kwargs: + gen_kwargs["max_new_tokens"] = 256 + if "temperature" not in gen_kwargs: + gen_kwargs["temperature"] = 0 + if "top_p" not in gen_kwargs: + gen_kwargs["top_p"] = None + if "num_beams" not in gen_kwargs: + gen_kwargs["num_beams"] = 1 + + try: + cont = self.model.generate( + **inputs, + do_sample=True if gen_kwargs["temperature"] > 0 else False, + temperature=gen_kwargs["temperature"], + top_p=gen_kwargs["top_p"], + num_beams=gen_kwargs["num_beams"], + max_new_tokens=gen_kwargs["max_new_tokens"], + min_new_tokens=1, + use_cache=self.use_cache, + ) + + # cont = self.model.generate(**inputs, max_new_tokens=256, min_new_tokens=1, do_sample=False) + + generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, cont)] + # generated_ids_trimmed = cont[:, inputs.input_ids.size(1):] + answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False) + for i, ans in enumerate(answers): + for term in until: + if len(term) > 0: + ans = ans.split(term)[0] + answers[i] = ans + + except Exception as e: + eval_logger.debug(f"Error while generating: {e}.
It is possibly due to blank audio in {contexts}") + answers = [""] * len(contexts) + + for ans, context in zip(answers, contexts): + res.append(ans) + self.cache_hook.add_partial("generate_until", (context, gen_kwargs), ans) + pbar.update(1) + + # reorder this group of results back to original unsorted form + res = re_ords.get_original(res) + + pbar.close() + return res + + def generate_until_multi_round(self, requests) -> List[str]: + raise NotImplementedError("TODO: Implement multi-round generation") diff --git a/lmms_eval/tasks/air_bench/_default_template_yaml b/lmms_eval/tasks/air_bench/_default_template_yaml new file mode 100644 index 00000000..fccfc5a4 --- /dev/null +++ b/lmms_eval/tasks/air_bench/_default_template_yaml @@ -0,0 +1,7 @@ +dataset_path: lmms-lab/AIR_Bench +dataset_kwargs: + token: True + +metadata: + gpt_eval_model_name: gpt-4o + version: 0.0 diff --git a/lmms_eval/tasks/air_bench/air_bench_chat.yaml b/lmms_eval/tasks/air_bench/air_bench_chat.yaml new file mode 100644 index 00000000..7bc686f3 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_chat.yaml @@ -0,0 +1,6 @@ +group: air_bench_chat +task: + - air_bench_chat_sound + - air_bench_chat_music + - air_bench_chat_speech + - air_bench_chat_mixed diff --git a/lmms_eval/tasks/air_bench/air_bench_chat_mixed.yaml b/lmms_eval/tasks/air_bench/air_bench_chat_mixed.yaml new file mode 100644 index 00000000..1d831978 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_chat_mixed.yaml @@ -0,0 +1,25 @@ +task: "air_bench_chat_mixed" +dataset_name: "Chat" +test_split: mixed +doc_to_target: "answer_gt" +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_chat + +generation_kwargs: + max_new_tokens: 1024 + temperature: 0.2 + top_p: 1.0 + num_beams: 1 + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Give a detailed answer to the question in English." +metric_list: + - metric: gpt_eval + aggregation: !function utils.air_bench_aggregate_results_chat + higher_is_better: true + +process_results: !function utils.air_bench_process_results_chat + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/air_bench_chat_music.yaml b/lmms_eval/tasks/air_bench/air_bench_chat_music.yaml new file mode 100644 index 00000000..e4fd4a10 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_chat_music.yaml @@ -0,0 +1,25 @@ +task: "air_bench_chat_music" +dataset_name: "Chat" +test_split: music +doc_to_target: "answer_gt" +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_chat + +generation_kwargs: + max_new_tokens: 1024 + temperature: 0.2 + top_p: 1.0 + num_beams: 1 + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Give a detailed answer to the question in English."
+metric_list: + - metric: gpt_eval + aggregation: !function utils.air_bench_aggregate_results_chat + higher_is_better: true + +process_results: !function utils.air_bench_process_results_chat + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/air_bench_chat_sound.yaml b/lmms_eval/tasks/air_bench/air_bench_chat_sound.yaml new file mode 100644 index 00000000..fb281452 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_chat_sound.yaml @@ -0,0 +1,25 @@ +task: "air_bench_chat_sound" +dataset_name: "Chat" +test_split: sound +doc_to_target: "answer_gt" +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_chat + +generation_kwargs: + max_new_tokens: 1024 + temperature: 0.2 + top_p: 1.0 + num_beams: 1 + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Give a detailed answer to the question in English." +metric_list: + - metric: gpt_eval + aggregation: !function utils.air_bench_aggregate_results_chat + higher_is_better: true + +process_results: !function utils.air_bench_process_results_chat + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/air_bench_chat_speech.yaml b/lmms_eval/tasks/air_bench/air_bench_chat_speech.yaml new file mode 100644 index 00000000..9a488ba4 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_chat_speech.yaml @@ -0,0 +1,25 @@ +task: "air_bench_chat_speech" +dataset_name: "Chat" +test_split: speech +doc_to_target: "answer_gt" +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_chat + +generation_kwargs: + max_new_tokens: 1024 + temperature: 0.2 + top_p: 1.0 + num_beams: 1 + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Give a detailed answer to the question in English." +metric_list: + - metric: gpt_eval + aggregation: !function utils.air_bench_aggregate_results_chat + higher_is_better: true + +process_results: !function utils.air_bench_process_results_chat + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/air_bench_foundation.yaml b/lmms_eval/tasks/air_bench/air_bench_foundation.yaml new file mode 100644 index 00000000..bdcbe038 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_foundation.yaml @@ -0,0 +1,5 @@ +group: air_bench_foundation +task: + - air_bench_foundation_sound + - air_bench_foundation_music + - air_bench_foundation_speech diff --git a/lmms_eval/tasks/air_bench/air_bench_foundation_music.yaml b/lmms_eval/tasks/air_bench/air_bench_foundation_music.yaml new file mode 100644 index 00000000..566ad7a2 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_foundation_music.yaml @@ -0,0 +1,26 @@ +task: "air_bench_foundation_music" +dataset_name: "Foundation" +test_split: music +doc_to_target: !function utils.air_bench_doc_to_target_foundation +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_foundation +doc_to_choice: !function utils.air_bench_doc_to_choice_foundation +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + do_sample: False +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nAnswer with the option's letter from the given choices directly."
+metric_list: + - metric: accuracy + aggregation: !function utils.air_bench_aggregate_results_foundation + higher_is_better: true + - metric: submission + aggregation: !function utils.air_bench_aggregate_results_for_submission + higher_is_better: true + +process_results: !function utils.air_bench_process_results_foundation + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/air_bench_foundation_sound.yaml b/lmms_eval/tasks/air_bench/air_bench_foundation_sound.yaml new file mode 100644 index 00000000..f92160e9 --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_foundation_sound.yaml @@ -0,0 +1,26 @@ +task: "air_bench_foundation_sound" +dataset_name: "Foundation" +test_split: sound +doc_to_target: !function utils.air_bench_doc_to_target_foundation +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_foundation +doc_to_choice: !function utils.air_bench_doc_to_choice_foundation +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + do_sample: False +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nAnswer with the option's letter from the given choices directly." +metric_list: + - metric: accuracy + aggregation: !function utils.air_bench_aggregate_results_foundation + higher_is_better: true + - metric: submission + aggregation: !function utils.air_bench_aggregate_results_for_submission + higher_is_better: true + +process_results: !function utils.air_bench_process_results_foundation + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/air_bench_foundation_speech.yaml b/lmms_eval/tasks/air_bench/air_bench_foundation_speech.yaml new file mode 100644 index 00000000..9e1dba8f --- /dev/null +++ b/lmms_eval/tasks/air_bench/air_bench_foundation_speech.yaml @@ -0,0 +1,26 @@ +task: "air_bench_foundation_speech" +dataset_name: "Foundation" +test_split: speech +doc_to_target: !function utils.air_bench_doc_to_target_foundation +doc_to_visual: !function utils.air_bench_doc_to_audio +doc_to_text: !function utils.air_bench_doc_to_text_foundation +doc_to_choice: !function utils.air_bench_doc_to_choice_foundation +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + do_sample: False +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nAnswer with the option's letter from the given choices directly." 
+metric_list: + - metric: accuracy + aggregation: !function utils.air_bench_aggregate_results_foundation + higher_is_better: true + - metric: submission + aggregation: !function utils.air_bench_aggregate_results_for_submission + higher_is_better: true + +process_results: !function utils.air_bench_process_results_foundation + +include: _default_template_yaml diff --git a/lmms_eval/tasks/air_bench/utils.py b/lmms_eval/tasks/air_bench/utils.py new file mode 100644 index 00000000..32ea79c5 --- /dev/null +++ b/lmms_eval/tasks/air_bench/utils.py @@ -0,0 +1,267 @@ +import datetime +import json +import os +import random +import re +import sys +import time +from pathlib import Path + +import numpy as np +import requests +import yaml +from loguru import logger as eval_logger + +from lmms_eval.tasks._task_utils.file_utils import generate_submission_file + + +def air_bench_doc_to_audio(doc): + return [doc["audio"]] + + +def air_bench_doc_to_text_chat(doc, lmms_eval_specific_kwargs): + question = doc["question"] + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{question}{post_prompt}" + + +with open(Path(__file__).parent / "_default_template_yaml", "r") as f: + raw_data = f.readlines() + safe_data = [] + for i, line in enumerate(raw_data): + # remove function definition since yaml load cannot handle it + if "!function" not in line: + safe_data.append(line) + + config = yaml.safe_load("".join(safe_data)) + +# specify api type and key in .env +GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"] +API_TYPE = os.getenv("API_TYPE", "azure") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } + +# prompt taken from the AIR-Bench repo +eval_prompt = ( + "You are a helpful and precise assistant for checking the quality of the answer.\n" + "[Detailed Audio Description]\nXAudioX\n[Question]\nXQuestionX\n" + "[The Start of Assistant 1s Answer]\nXAssistant1X\n[The End of Assistant 1s Answer]\n" + "[The Start of Assistant 2s Answer]\nXAssistant2X\n[The End of Assistant 2s Answer]\n[System]\n" + "We would like to request your feedback on the performance of two AI assistants in response to the user question " + "and audio description displayed above. AI assistants are provided with detailed audio descriptions and questions.\n" + "Please rate the helpfulness, relevance, accuracy, and comprehensiveness of their responses. " + "Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. " + "Please output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. " + "The two scores are separated by a space. Please only output the 2 required number and no text." 
+) + +retries = 3 +NUM_SECONDS_TO_SLEEP = 5 + + +def get_eval(max_tokens: int, content: str, retries: int = retries): + global headers + + messages = [ + {"role": "user", "content": content}, + ] + + payload = { + "model": GPT_EVAL_MODEL_NAME, + "messages": messages, + "temperature": 0, + "max_tokens": max_tokens, + } + + for attempt in range(retries): + try: + response = requests.post(API_URL, headers=headers, json=payload, timeout=60) + response.raise_for_status() + response_data = response.json() + + content = response_data["choices"][0]["message"]["content"].strip() + if content != "": + return content, response_data["model"] + + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}") + if attempt < retries - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log the failure + eval_logger.error(f"All {retries} attempts failed. Last error message: {e}") + return "", "" + + +def air_bench_process_results_chat(doc, result): + path = doc["path"] + question = doc["question"] + answer_gt = doc["answer_gt"] + task_name = doc["task_name"] + dataset_name = doc["dataset_name"] + response = result[0] + + if response is None: + exit(1) + + if doc["meta_info"] is None: + print("Document lacks meta_info") + exit(1) + else: + meta_info = doc["meta_info"] + + # Get the evaluation score twice: once with our model as assistant 1 and once as assistant 2, to prevent position bias + content = eval_prompt.replace("XAudioX", meta_info).replace("XQuestionX", question).replace("XAssistant1X", answer_gt).replace("XAssistant2X", response) + eval_answer, model_name = get_eval(max_tokens=1024, content=content) + content = eval_prompt.replace("XAudioX", meta_info).replace("XQuestionX", question).replace("XAssistant1X", response).replace("XAssistant2X", answer_gt) + eval_answer2, model_name2 = get_eval(max_tokens=1024, content=content) + + return { + "gpt_eval": {"eval_answer": [eval_answer, eval_answer2], "model_name": model_name}, + } + + +def air_bench_aggregate_results_chat(results): + score = 0 + for result in results: + eval_answer = result["eval_answer"] + eval_answer = [eval_answer[i].strip().replace(".", "") for i in range(len(eval_answer))] + pattern = r"\b(?:[1-9]|10)\b" + + # Find all score matches in both judgments + try: + matches1 = re.findall(pattern, eval_answer[0]) + matches2 = re.findall(pattern, eval_answer[1]) + # Our model's score is the second value in run 1 (assistant 2) and the first value in run 2 (assistant 1) + eval_score = float(matches1[1]) + float(matches2[0]) + score += eval_score + except Exception as e: + eval_logger.error(f"Error parsing eval_score: {e}") + continue + + return score / (2 * len(results)) + + +# Functions for Foundation tasks + + +def air_bench_doc_to_text_foundation(doc, lmms_eval_specific_kwargs): + question = doc["question"] + # Only render the choices that actually exist for this question + for letter, key in zip("ABCD", ["choice_a", "choice_b", "choice_c", "choice_d"]): + if doc.get(key): + question = f"{question}\n{letter}. {doc[key]}" + question = f"{question}\n"
{answers[3]}\n" + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{question}{post_prompt}" + + +def air_bench_doc_to_target_foundation(doc): + return get_gt(doc) + + +def air_bench_doc_to_choice_foundation(doc): + choices = [] + for option in ["choice_a", "choice_b", "choice_c", "choice_d"]: + if doc.get(option) and doc[option].strip(): + choices.append(option[-1].upper()) + return choices + + +def air_bench_process_results_foundation(doc, result): + response = result[0].strip() + all_choices = [choice[-1].upper() for choice in ["choice_a", "choice_b", "choice_c", "choice_d"] if doc.get(choice)] + pred = parse_multi_choice_response(response, all_choices) # AdaptfromMMMU + gt_ans = get_gt(doc) + score = 1.0 if pred == gt_ans else 0.0 + submission_dict = {} + submission_dict = {doc.get("uniq_id", "unknown"): pred} + return {"accuracy": {"score": score, "task": doc["task_name"]}, "submission": submission_dict} + + +def air_bench_aggregate_results_for_submission(results, args): + path = generate_submission_file("air_bench_test_submission.json", args) + with open(path, "w") as f: + json.dump(results, f) + eval_logger.info(f"Results saved to {path}.") + + +def air_bench_aggregate_results_foundation(results): + score = 0 + categorical_correct = {} + categorical_total = {} + for result in results: + score += result["score"] + if result["task"] not in categorical_correct.keys(): + categorical_correct[result["task"]] = 0 + categorical_total[result["task"]] = 0 + categorical_correct[result["task"]] += result["score"] + categorical_total[result["task"]] += 1 + + return {"overall_accuracy": score / len(results), "categorical_accuracy": {task: categorical_correct[task] / categorical_total[task] for task in categorical_correct.keys()}} + + +def parse_multi_choice_response(response, all_choices): + """ + Parse the prediction from the generated response. + Return the predicted choice letter e.g., A, B, C, D. + """ + # Clean response of unwanted characters + for char in [",", ".", "!", "?", ";", ":", "'"]: + response = response.strip(char) + response = " " + response + " " # Add space to avoid partial match + + candidates = [] + # Look for choices with parentheses, e.g., (A) + for choice in all_choices: + if f"({choice})" in response: + candidates.append(choice) + + # Look for simple choices, e.g., A, B, C + if len(candidates) == 0: + for choice in all_choices: + if f" {choice} " in response: + candidates.append(choice) + + # Look for choices with periods, e.g., A., B., C. + if len(candidates) == 0: + for choice in all_choices: + if f"{choice}." 
in response: + candidates.append(choice) + + # If no candidates, randomly choose one + if len(candidates) == 0: + pred_index = random.choice(all_choices) + elif len(candidates) > 1: + # If more than one candidate, choose the last one found + start_indexes = [response.rfind(f" {can} ") for can in candidates] + pred_index = candidates[np.argmax(start_indexes)] + else: + # If only one candidate, use it + pred_index = candidates[0] + + return pred_index + + +def get_gt(doc): + if doc["answer_gt"] == doc["choice_a"]: + return "A" + elif doc["answer_gt"] == doc["choice_b"]: + return "B" + elif doc["answer_gt"] == doc.get("choice_c", None): + return "C" + elif doc["answer_gt"] == doc.get("choice_d", None): + return "D" diff --git a/lmms_eval/tasks/alpaca_audio/alpaca_audio.yaml b/lmms_eval/tasks/alpaca_audio/alpaca_audio.yaml new file mode 100644 index 00000000..737ba1e2 --- /dev/null +++ b/lmms_eval/tasks/alpaca_audio/alpaca_audio.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/alpaca_audio +dataset_kwargs: + token: True + +task: "alpaca_audio" +test_split: test +doc_to_target: "answer" +doc_to_visual: !function utils.doc_to_audio +doc_to_text: !function utils.doc_to_text + +generation_kwargs: + max_new_tokens: 1024 + temperature: 0.2 + top_p: 1.0 + num_beams: 1 + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nPlease give a detailed answer to the question in the audio." +metric_list: + - metric: gpt_eval + aggregation: !function utils.alpaca_audio_aggregate_results + higher_is_better: true + +process_results: !function utils.alpaca_audio_process_results + +metadata: + gpt_eval_model_name: gpt-4o + version: 0.0 diff --git a/lmms_eval/tasks/alpaca_audio/utils.py b/lmms_eval/tasks/alpaca_audio/utils.py new file mode 100644 index 00000000..1ba1719a --- /dev/null +++ b/lmms_eval/tasks/alpaca_audio/utils.py @@ -0,0 +1,138 @@ +import datetime +import json +import os +import random +import re +import sys +import time +from pathlib import Path + +import requests +import yaml +from loguru import logger as eval_logger + +import lmms_eval.tasks._task_utils.file_utils as file_utils +from lmms_eval.filters.extraction import ExtendedRegexFilter + + +def doc_to_audio(doc): + return [doc["context"]] + + +def doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{post_prompt}" + + +with open(Path(__file__).parent / "alpaca_audio.yaml", "r") as f: + raw_data = f.readlines() + safe_data = [] + for i, line in enumerate(raw_data): + # remove function definition since yaml load cannot handle it + if "!function" not in line: + safe_data.append(line) + + config = yaml.safe_load("".join(safe_data)) + +# specify api type and key in .env +GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"] +API_TYPE = os.getenv("API_TYPE", "azure") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } + +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } + +eval_prompt = """ + [Question] + {question} + + [Reference Answer] + {ground_truth} + + [Model Answer] +
+ +eval_prompt = """ + [Question] + {question} + + [Reference Answer] + {ground_truth} + + [Model Answer] + {model_response} + + [Task] + Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details. + Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance. + Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference. + Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference. + Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance. + Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects. + Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed. + Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail. + + Your response should be formatted as follows: + Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...") + Rating: (int)""" + +retries = 3 +NUM_SECONDS_TO_SLEEP = 5 + + +def get_eval(max_tokens: int, content: str, retries: int = retries): + global headers + + messages = [ + {"role": "user", "content": content}, + ] + + payload = {"model": GPT_EVAL_MODEL_NAME, "messages": messages, "temperature": 0.7, "max_tokens": max_tokens, "top_p": 0.95, "frequency_penalty": 0, "presence_penalty": 0, "stop": None} + + for attempt in range(retries): + try: + response = requests.post(API_URL, headers=headers, json=payload, timeout=60) + response.raise_for_status() + response_data = response.json() + + content = response_data["choices"][0]["message"]["content"].strip() + if content != "": + return content, response_data["model"] + break # Empty content: stop retrying and fall through to the failure return + + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}") + if attempt < retries - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All {retries} attempts failed. Last error message: {e}") + return "", "" + return "", ""
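+ + +# Illustrative judge reply that the parsing below expects (hypothetical content): +# "Explanation: The reference answer is [X], while the model's answer is [Y]. I think ... +# Rating: 4"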
+ + +def alpaca_audio_process_results(doc, result): + pred = result[0] + ground_truth_str = doc["answer"] + content = eval_prompt.format(model_response=pred, ground_truth=ground_truth_str, question=doc["speech_instruction"]) + eval_answer, model_name = get_eval(max_tokens=1024, content=content) + return { + "gpt_eval": {"eval_answer": eval_answer, "model_name": model_name}, + } + + +def alpaca_audio_aggregate_results(results): + score = 0 + for result in results: + try: + eval_answer = result["eval_answer"] + # Prefer the digit on the "Rating:" line so digits inside the explanation are not picked up + match = re.search(r"Rating:\s*([0-5])", eval_answer) or re.search(r"([0-5])", eval_answer) + eval_score = float(match.group(1)) + except Exception as e: + eval_logger.error(f"Error parsing eval_score: {e}") + eval_score = 0.0 + score += eval_score + + return score / len(results) * 20 # mean 0-5 rating rescaled to 0-100 diff --git a/lmms_eval/tasks/clotho_aqa/_default_template_yaml b/lmms_eval/tasks/clotho_aqa/_default_template_yaml new file mode 100644 index 00000000..3d167621 --- /dev/null +++ b/lmms_eval/tasks/clotho_aqa/_default_template_yaml @@ -0,0 +1,10 @@ +dataset_path: lmms-lab/ClothoAQA +dataset_kwargs: + token: True +doc_to_target: "answer" +doc_to_visual: !function utils.clotho_aqa_doc_to_audio +doc_to_text: !function utils.clotho_aqa_doc_to_text + +metadata: + gpt_eval_model_name: gpt-4o + version: 0.0 diff --git a/lmms_eval/tasks/clotho_aqa/clotho_aqa.yaml b/lmms_eval/tasks/clotho_aqa/clotho_aqa.yaml new file mode 100644 index 00000000..ed6e3799 --- /dev/null +++ b/lmms_eval/tasks/clotho_aqa/clotho_aqa.yaml @@ -0,0 +1,4 @@ +group: clotho_aqa +task: + - clotho_aqa_val + - clotho_aqa_test diff --git a/lmms_eval/tasks/clotho_aqa/clotho_aqa_test.yaml b/lmms_eval/tasks/clotho_aqa/clotho_aqa_test.yaml new file mode 100644 index 00000000..47dfcfda --- /dev/null +++ b/lmms_eval/tasks/clotho_aqa/clotho_aqa_test.yaml @@ -0,0 +1,19 @@ +task: "clotho_aqa_test" +dataset_name: "clotho_aqa" +test_split: clotho_aqa_test_filtered +generation_kwargs: + max_new_tokens: 8 + temperature: 0 + do_sample: False +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nAnswer the question using a single word only. " +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true + +include: _default_template_yaml diff --git a/lmms_eval/tasks/clotho_aqa/clotho_aqa_val.yaml b/lmms_eval/tasks/clotho_aqa/clotho_aqa_val.yaml new file mode 100644 index 00000000..008472c6 --- /dev/null +++ b/lmms_eval/tasks/clotho_aqa/clotho_aqa_val.yaml @@ -0,0 +1,19 @@ +task: "clotho_aqa_val" +dataset_name: "clotho_aqa" +test_split: clotho_aqa_val_filtered +generation_kwargs: + max_new_tokens: 8 + temperature: 0 + do_sample: False +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nAnswer the question using a single word only. 
" +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true + +include: _default_template_yaml diff --git a/lmms_eval/tasks/clotho_aqa/clotho_asqa_test_v2.yaml b/lmms_eval/tasks/clotho_aqa/clotho_asqa_test_v2.yaml new file mode 100644 index 00000000..d8de65ec --- /dev/null +++ b/lmms_eval/tasks/clotho_aqa/clotho_asqa_test_v2.yaml @@ -0,0 +1,19 @@ +task: "clotho_asqa_test_v2" +dataset_name: "clotho_asqa_test_v2" +test_split: test +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + do_sample: False +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" +metric_list: + - metric: gpt_eval + aggregation: !function utils.clotho_aqa_v2_aggregate_results + higher_is_better: true + +process_results: !function utils.clotho_aqa_v2_process_results + +include: _default_template_yaml diff --git a/lmms_eval/tasks/clotho_aqa/utils.py b/lmms_eval/tasks/clotho_aqa/utils.py new file mode 100644 index 00000000..b7148003 --- /dev/null +++ b/lmms_eval/tasks/clotho_aqa/utils.py @@ -0,0 +1,142 @@ +import datetime +import json +import os +import re +import sys +import time +from pathlib import Path + +import requests +import yaml +from loguru import logger as eval_logger + +import lmms_eval.tasks._task_utils.file_utils as file_utils +from lmms_eval.filters.extraction import ExtendedRegexFilter + + +def clotho_aqa_doc_to_audio(doc): + return [doc["audio"]] + + +def clotho_aqa_doc_to_text(doc, lmms_eval_specific_kwargs): + question = doc["question"] + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{question}{post_prompt}" + + +# functions for the clotho_asqa_v2 task, need to be tested later + +with open(Path(__file__).parent / "_default_template_yaml", "r") as f: + raw_data = f.readlines() + safe_data = [] + for i, line in enumerate(raw_data): + # remove function definition since yaml load cannot handle it + if "!function" not in line: + safe_data.append(line) + + config = yaml.safe_load("".join(safe_data)) + + +NUM_SECONDS_TO_SLEEP = 2 +GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"] +API_TYPE = os.getenv("API_TYPE", "azure") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } + +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } + +eval_prompt = """ + [Question] + {question} + + [Reference Answer] + {ground_truth} + + [Model Answer] + {model_response} + + [Task] + Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details. + Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance. + Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference. + Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference. + Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance. 
+ Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects. + Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed. + Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail. + + Your response should be formatted as follows: + Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...") + Rating: (int)""" + + +retries = 3 +NUM_SECONDS_TO_SLEEP = 5 + + +def get_eval(max_tokens: int, content: str, retries: int = retries): + global headers + + messages = [ + {"role": "user", "content": content}, + ] + + payload = {"model": GPT_EVAL_MODEL_NAME, "messages": messages, "temperature": 0.7, "max_tokens": max_tokens, "top_p": 0.95, "frequency_penalty": 0, "presence_penalty": 0, "stop": None} + + for attempt in range(retries): + try: + response = requests.post(API_URL, headers=headers, json=payload, timeout=60) + response.raise_for_status() + response_data = response.json() + + content = response_data["choices"][0]["message"]["content"].strip() + if content != "": + return content, response_data["model"] + break # Empty content: stop retrying and fall through to the failure return + + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}") + if attempt < retries - 1: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All {retries} attempts failed. Last error message: {e}") + return "", "" + return "", "" + + +def clotho_aqa_v2_process_results(doc, result): + pred = result[0] + ground_truth_str = doc["answer"] + content = eval_prompt.format(model_response=pred, ground_truth=ground_truth_str, question=doc["question"]) + eval_answer, model_name = get_eval(max_tokens=1024, content=content) + return { + "gpt_eval": {"eval_answer": eval_answer, "model_name": model_name}, + } + + +def clotho_aqa_v2_aggregate_results(results): + score = 0 + for result in results: + try: + eval_answer = result["eval_answer"] + # Parse inside the try block so an unparseable reply is logged instead of raising, + # and prefer the digit on the "Rating:" line over digits in the explanation + match = re.search(r"Rating:\s*([0-5])", eval_answer) or re.search(r"([0-5])", eval_answer) + eval_score = float(match.group(1)) + except Exception as e: + eval_logger.error(f"Error parsing eval_score: {e}") + eval_score = 0.0 + score += eval_score + + return score / len(results) * 20 # mean 0-5 rating rescaled to 0-100 diff --git a/lmms_eval/tasks/common_voice_15/common_voice_15.yaml b/lmms_eval/tasks/common_voice_15/common_voice_15.yaml new file mode 100644 index 00000000..4ae38dfd --- /dev/null +++ b/lmms_eval/tasks/common_voice_15/common_voice_15.yaml @@ -0,0 +1,5 @@ +group: common_voice_15 +task: +- common_voice_15_zh-CN +- common_voice_15_en +- common_voice_15_fr \ No newline at end of file diff --git a/lmms_eval/tasks/common_voice_15/common_voice_15_en.yaml b/lmms_eval/tasks/common_voice_15/common_voice_15_en.yaml new file mode 100644 index 00000000..b8e2fb36 --- /dev/null +++ b/lmms_eval/tasks/common_voice_15/common_voice_15_en.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/common_voice_15 +dataset_kwargs: + token: True +task : "common_voice_15_en" +test_split: test +dataset_name: en +output_type: generate_until +doc_to_visual: !function utils.common_voice_15_doc_to_audio +doc_to_text: !function utils.common_voice_15_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 
1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.common_voice_15_process_result +metric_list: + - metric: wer + aggregation : !function utils.common_voice_15_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/common_voice_15/common_voice_15_fr.yaml b/lmms_eval/tasks/common_voice_15/common_voice_15_fr.yaml new file mode 100644 index 00000000..1254c2bc --- /dev/null +++ b/lmms_eval/tasks/common_voice_15/common_voice_15_fr.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/common_voice_15 +dataset_kwargs: + token: True +task : "common_voice_15_fr" +test_split: test +dataset_name: fr +output_type: generate_until +doc_to_visual: !function utils.common_voice_15_doc_to_audio +doc_to_text: !function utils.common_voice_15_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.common_voice_15_process_result +metric_list: + - metric: wer + aggregation : !function utils.common_voice_15_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|fr|>" \ No newline at end of file diff --git a/lmms_eval/tasks/common_voice_15/common_voice_15_zh-CN.yaml b/lmms_eval/tasks/common_voice_15/common_voice_15_zh-CN.yaml new file mode 100644 index 00000000..92dba1d5 --- /dev/null +++ b/lmms_eval/tasks/common_voice_15/common_voice_15_zh-CN.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/common_voice_15 +dataset_kwargs: + token: True +task : "common_voice_15_zh-CN" +test_split: test +dataset_name: zh-CN +output_type: generate_until +doc_to_visual: !function utils.common_voice_15_doc_to_audio +doc_to_text: !function utils.common_voice_15_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.common_voice_15_process_result +metric_list: + - metric: wer + aggregation : !function utils.common_voice_15_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|zh|>" \ No newline at end of file diff --git a/lmms_eval/tasks/common_voice_15/utils.py b/lmms_eval/tasks/common_voice_15/utils.py new file mode 100644 index 00000000..312bfa17 --- /dev/null +++ b/lmms_eval/tasks/common_voice_15/utils.py @@ -0,0 +1,182 @@ +import os +import re +import unicodedata + +import editdistance as ed +import zhconv + +from lmms_eval.tasks.librispeech.cn_tn import TextNorm +from lmms_eval.tasks.librispeech.whisper_normalizer.basic import BasicTextNormalizer +from lmms_eval.tasks.librispeech.whisper_normalizer.english import EnglishTextNormalizer + +# ImportError: To support decoding audio files, please install 'librosa' and 'soundfile'. 
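+# Normalization sketch (illustrative; behavior approximated from the whisper_normalizer package): +# english_normalizer("He won't go, Mr. Smith!") -> roughly "he will not go mister smith" +# basic_normalizer lowercases, drops bracketed spans, and strips symbols/diacritics for other languages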
+english_normalizer = EnglishTextNormalizer() +chinese_normalizer = TextNorm( + to_banjiao=False, + to_upper=False, + to_lower=False, + remove_fillers=False, + remove_erhua=False, + check_chars=False, + remove_space=False, + cc_mode="", +) +basic_normalizer = BasicTextNormalizer() + +dir_name = os.path.dirname(os.path.abspath(__file__)) + + +def common_voice_15_doc_to_audio(doc): + return [doc["audio"]] + + +def common_voice_15_doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}" + + +def common_voice_15_process_result(doc, result): + pred = result[0] if len(result) > 0 else "" + + gt = doc["gt"] + source = doc["source"] + task = doc["task"] + + data_dict = {"gt": gt, "pred": pred, "source": source, "task": task} + + return {"wer": data_dict} + + +PUNCS = "!,.?;:" + + +def remove_sp(text, language): + gt = re.sub(r"<\|.*?\|>", " ", text) + gt = re.sub(rf"\s+", r" ", gt) # Replace consecutive spaces in the text with a single space. + gt = re.sub(f" ?([{PUNCS}])", r"\1", gt) + gt = gt.lstrip(" ") + if language == "zh": + gt = re.sub(rf"\s+", r"", gt) + return gt + + +class EvaluationTokenizer(object): + """A generic evaluation-time tokenizer, which leverages built-in tokenizers + in sacreBLEU (https://github.com/mjpost/sacrebleu). It additionally provides + lowercasing, punctuation removal and character tokenization, which are + applied after sacreBLEU tokenization. + + Args: + tokenizer_type (str): the type of sacreBLEU tokenizer to apply. + lowercase (bool): lowercase the text. + punctuation_removal (bool): remove punctuation (based on unicode + category) from text. + character_tokenization (bool): tokenize the text to characters. 
+ """ + + SPACE = chr(32) + SPACE_ESCAPE = chr(9601) + # ALL_TOKENIZER_TYPES = ChoiceEnum(["none", "13a", "intl", "zh", "ja-mecab"]) + + def __init__( + self, + tokenizer_type: str = "13a", + lowercase: bool = False, + punctuation_removal: bool = False, + character_tokenization: bool = False, + ): + from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a + from sacrebleu.tokenizers.tokenizer_char import TokenizerChar + from sacrebleu.tokenizers.tokenizer_intl import TokenizerV14International + from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab + from sacrebleu.tokenizers.tokenizer_none import NoneTokenizer + from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh + + TOKENIZERS = { + "none": NoneTokenizer, + "13a": Tokenizer13a, + "intl": TokenizerV14International, + "zh": TokenizerZh, + "ja-mecab": TokenizerJaMecab, + "char": TokenizerChar, + } + + assert tokenizer_type in TOKENIZERS, f"{tokenizer_type}, {TOKENIZERS}" + self.lowercase = lowercase + self.punctuation_removal = punctuation_removal + self.character_tokenization = character_tokenization + self.tokenizer = TOKENIZERS[tokenizer_type] + # self.tokenizer = tokenizer_none + + @classmethod + def remove_punctuation(cls, sent: str): + """Remove punctuation based on Unicode category.""" + return cls.SPACE.join(t for t in sent.split(cls.SPACE) if not all(unicodedata.category(c)[0] == "P" for c in t)) + + def tokenize(self, sent: str): + tokenized = self.tokenizer()(sent) + + if self.punctuation_removal: + tokenized = self.remove_punctuation(tokenized) + + if self.character_tokenization: + tokenized = self.SPACE.join(list(tokenized.replace(self.SPACE, self.SPACE_ESCAPE))) + + if self.lowercase: + tokenized = tokenized.lower() + + return tokenized + + +def compute_wer(refs, hyps, language): + distance = 0 + ref_length = 0 + tokenizer = EvaluationTokenizer( + tokenizer_type="none", + lowercase=True, + punctuation_removal=True, + character_tokenization=False, + ) + for i in range(len(refs)): + ref = refs[i] + pred = hyps[i] + if language in ["yue"]: + ref = zhconv.convert(ref, "zh-cn") + pred = zhconv.convert(pred, "zh-cn") + if language in ["en"]: + ref = english_normalizer(ref) + pred = english_normalizer(pred) + if language in ["zh"]: + ref = chinese_normalizer(ref) + pred = chinese_normalizer(pred) + else: + ref = basic_normalizer(ref) + pred = basic_normalizer(pred) + ref_items = tokenizer.tokenize(ref).split() + pred_items = tokenizer.tokenize(pred).split() + if language in ["zh", "yue"]: + ref_items = [x for x in "".join(ref_items)] + pred_items = [x for x in "".join(pred_items)] + if i == 0: + print(f"ref: {ref}") + print(f"pred: {pred}") + print(f"ref_items:\n{ref_items}\n{len(ref_items)}\n{ref_items[0]}") + print(f"pred_items:\n{pred_items}\n{len(ref_items)}\n{ref_items[0]}") + distance += ed.eval(ref_items, pred_items) + ref_length += len(ref_items) + return distance / ref_length + + +def common_voice_15_wer(results, args): + refs, hyps = [], [] + for result in results: + lan = result["task"][4:] + gt = result["gt"] + response = result["pred"] + gt = remove_sp(gt, lan) + response = remove_sp(response, lan) + refs.append(gt) + hyps.append(response) + wer = compute_wer(refs, hyps, lan) + return wer * 100 diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech.yaml new file mode 100644 index 00000000..55d33b41 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech.yaml @@ -0,0 +1,4 @@ +group: gigaspeech +task: + - gigaspeech_dev + - gigaspeech_test diff --git 
a/lmms_eval/tasks/gigaspeech/gigaspeech_dev.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_dev.yaml new file mode 100644 index 00000000..d4ed3346 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_dev.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: dev +dataset_kwargs: + token: True +task : "gigaspeech_dev" +test_split: validation +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_l_dev.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_l_dev.yaml new file mode 100644 index 00000000..97fa7f4e --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_l_dev.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: l +dataset_kwargs: + token: True +task : "gigaspeech_l_dev" +test_split: validation +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_l_test.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_l_test.yaml new file mode 100644 index 00000000..75bb14f0 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_l_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: l +dataset_kwargs: + token: True +task : "gigaspeech_l_test" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_m_dev.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_m_dev.yaml new file mode 100644 index 00000000..29f0bbfd --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_m_dev.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: m +dataset_kwargs: + token: True +task : "gigaspeech_m_dev" +test_split: validation +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + 
temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_m_test.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_m_test.yaml new file mode 100644 index 00000000..9e9dfbcb --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_m_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: m +dataset_kwargs: + token: True +task : "gigaspeech_m_test" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_s_dev.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_s_dev.yaml new file mode 100644 index 00000000..cbdf8fe9 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_s_dev.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: s +dataset_kwargs: + token: True +task : "gigaspeech_s_dev" +test_split: validation +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_s_test.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_s_test.yaml new file mode 100644 index 00000000..d878a0cc --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_s_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: s +dataset_kwargs: + token: True +task : "gigaspeech_s_test" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_test.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_test.yaml new file mode 100644 index 00000000..d6f98db1 --- 
/dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: test +dataset_kwargs: + token: True +task : "gigaspeech_test" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_xl_dev.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_xl_dev.yaml new file mode 100644 index 00000000..86d39b69 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_xl_dev.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: xl +dataset_kwargs: + token: True +task : "gigaspeech_xl_dev" +test_split: validation +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_xl_test.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_xl_test.yaml new file mode 100644 index 00000000..67e1f463 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_xl_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: xl +dataset_kwargs: + token: True +task : "gigaspeech_xl_test" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_xs_dev.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_xs_dev.yaml new file mode 100644 index 00000000..558a154e --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_xs_dev.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: xs +dataset_kwargs: + token: True +task : "gigaspeech_xs_dev" +test_split: validation +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - 
metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/gigaspeech_xs_test.yaml b/lmms_eval/tasks/gigaspeech/gigaspeech_xs_test.yaml new file mode 100644 index 00000000..744a0260 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/gigaspeech_xs_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/gigaspeech +dataset_name: xs +dataset_kwargs: + token: True +task : "gigaspeech_xs_test" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.gigaspeech_doc_to_audio +doc_to_text: !function utils.gigaspeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.gigaspeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.gigaspeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/utils.py b/lmms_eval/tasks/gigaspeech/utils.py new file mode 100755 index 00000000..c45a9d61 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/utils.py @@ -0,0 +1,162 @@ +import os +import re +import unicodedata + +import editdistance as ed # TODO: new package + +from lmms_eval.tasks.gigaspeech.whisper_normalizer.basic import BasicTextNormalizer +from lmms_eval.tasks.gigaspeech.whisper_normalizer.english import EnglishTextNormalizer + +# ImportError: To support decoding audio files, please install 'librosa' and 'soundfile'. +english_normalizer = EnglishTextNormalizer() + +basic_normalizer = BasicTextNormalizer() + +dir_name = os.path.dirname(os.path.abspath(__file__)) + + +def gigaspeech_doc_to_audio(doc): + return [doc["audio"]] + + +def gigaspeech_doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}" + + +def gigaspeech_process_result(doc, result): + pred = result[0] if len(result) > 0 else "" + gt = doc["gt"] + data_dict = {"gt": gt, "pred": pred} + + return {"wer": data_dict} + + +def gigaspeech_xl_process_result(doc, result): + pred = result[0] if len(result) > 0 else "" + gt = doc["text"] + data_dict = {"gt": gt, "pred": pred} + + return {"wer": data_dict} + + +PUNCS = "!,.?;:" + + +def remove_sp(text): + gt = re.sub(r"<\|.*?\|>", " ", text) + gt = re.sub(rf"\s+", r" ", gt) # Replace consecutive spaces in the text with a single space. + gt = re.sub(f" ?([{PUNCS}])", r"\1", gt) + gt = gt.lstrip(" ") + return gt + + +class EvaluationTokenizer(object): + """A generic evaluation-time tokenizer, which leverages built-in tokenizers + in sacreBLEU (https://github.com/mjpost/sacrebleu). It additionally provides + lowercasing, punctuation removal and character tokenization, which are + applied after sacreBLEU tokenization. + + Args: + tokenizer_type (str): the type of sacreBLEU tokenizer to apply. + lowercase (bool): lowercase the text. + punctuation_removal (bool): remove punctuation (based on unicode + category) from text. + character_tokenization (bool): tokenize the text to characters. 
+ """ + + SPACE = chr(32) + SPACE_ESCAPE = chr(9601) + # ALL_TOKENIZER_TYPES = ChoiceEnum(["none", "13a", "intl", "zh", "ja-mecab"]) + + def __init__( + self, + tokenizer_type: str = "13a", + lowercase: bool = False, + punctuation_removal: bool = False, + character_tokenization: bool = False, + ): + # from sacrebleu.tokenizers import TOKENIZERS + # from sacrebleu.tokenizers import tokenizer_none + from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a + from sacrebleu.tokenizers.tokenizer_char import TokenizerChar + from sacrebleu.tokenizers.tokenizer_intl import TokenizerV14International + from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab + from sacrebleu.tokenizers.tokenizer_none import NoneTokenizer + from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh + + TOKENIZERS = { + "none": NoneTokenizer, + "13a": Tokenizer13a, + "intl": TokenizerV14International, + "zh": TokenizerZh, + "ja-mecab": TokenizerJaMecab, + "char": TokenizerChar, + } + + assert tokenizer_type in TOKENIZERS, f"{tokenizer_type}, {TOKENIZERS}" + self.lowercase = lowercase + self.punctuation_removal = punctuation_removal + self.character_tokenization = character_tokenization + self.tokenizer = TOKENIZERS[tokenizer_type] + # self.tokenizer = tokenizer_none + + @classmethod + def remove_punctuation(cls, sent: str): + """Remove punctuation based on Unicode category.""" + return cls.SPACE.join(t for t in sent.split(cls.SPACE) if not all(unicodedata.category(c)[0] == "P" for c in t)) + + def tokenize(self, sent: str): + tokenized = self.tokenizer()(sent) + + if self.punctuation_removal: + tokenized = self.remove_punctuation(tokenized) + + if self.character_tokenization: + tokenized = self.SPACE.join(list(tokenized.replace(self.SPACE, self.SPACE_ESCAPE))) + + if self.lowercase: + tokenized = tokenized.lower() + + return tokenized + + +def compute_wer(refs, hyps): + distance = 0 + ref_length = 0 + tokenizer = EvaluationTokenizer( + tokenizer_type="none", + lowercase=True, + punctuation_removal=True, + character_tokenization=False, + ) + for i in range(len(refs)): + ref = refs[i] + pred = hyps[i] + ref = english_normalizer(ref) + pred = english_normalizer(pred) + ref_items = tokenizer.tokenize(ref).split() + pred_items = tokenizer.tokenize(pred).split() + if i == 0: + print(f"ref: {ref}") + print(f"pred: {pred}") + print(f"ref_items:\n{ref_items}\n{len(ref_items)}\n{ref_items[0]}") + print(f"pred_items:\n{pred_items}\n{len(ref_items)}\n{ref_items[0]}") + distance += ed.eval(ref_items, pred_items) + ref_length += len(ref_items) + return distance / ref_length + + +def gigaspeech_wer(results, args): + refs, hyps = [], [] + for result in results: + gt = result["gt"] + response = result["pred"] + gt = remove_sp(gt) + response = remove_sp(response) + refs.append(gt) + hyps.append(response) + wer = compute_wer(refs, hyps) + # print(f"source: {source} cnt: {len(refs)} wer: {wer:.4f}") + return wer * 100 diff --git a/lmms_eval/tasks/gigaspeech/whisper_normalizer/basic.py b/lmms_eval/tasks/gigaspeech/whisper_normalizer/basic.py new file mode 100644 index 00000000..00a54dcc --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/whisper_normalizer/basic.py @@ -0,0 +1,58 @@ +import re +import unicodedata + +import regex + +# non-ASCII letters that are not separated by "NFKD" normalization +ADDITIONAL_DIACRITICS = { + "ล“": "oe", + "ล’": "OE", + "รธ": "o", + "ร˜": "O", + "รฆ": "ae", + "ร†": "AE", + "รŸ": "ss", + "แบž": "SS", + "ฤ‘": "d", + "ฤ": "D", + "รฐ": "d", + "ร": "D", + "รพ": "th", + "รž": "th", + "ล‚": "l", + 
"ล": "L", +} + + +def remove_symbols_and_diacritics(s: str, keep=""): + """ + Replace any other markers, symbols, and punctuations with a space, + and drop any diacritics (category 'Mn' and some manual mappings) + """ + return "".join(c if c in keep else ADDITIONAL_DIACRITICS[c] if c in ADDITIONAL_DIACRITICS else "" if unicodedata.category(c) == "Mn" else " " if unicodedata.category(c)[0] in "MSP" else c for c in unicodedata.normalize("NFKD", s)) + + +def remove_symbols(s: str): + """ + Replace any other markers, symbols, punctuations with a space, keeping diacritics + """ + return "".join(" " if unicodedata.category(c)[0] in "MSP" else c for c in unicodedata.normalize("NFKC", s)) + + +class BasicTextNormalizer: + def __init__(self, remove_diacritics: bool = False, split_letters: bool = False): + self.clean = remove_symbols_and_diacritics if remove_diacritics else remove_symbols + self.split_letters = split_letters + + def __call__(self, s: str): + s = s.lower() + s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets + s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parenthesis + s = self.clean(s).lower() + + if self.split_letters: + s = " ".join(regex.findall(r"\X", s, regex.U)) + + s = re.sub(r"\s+", " ", s) # replace any successive whitespace characters with a space + + return s diff --git a/lmms_eval/tasks/gigaspeech/whisper_normalizer/english.json b/lmms_eval/tasks/gigaspeech/whisper_normalizer/english.json new file mode 100644 index 00000000..566e4812 --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/whisper_normalizer/english.json @@ -0,0 +1,1741 @@ +{ + "accessorise": "accessorize", + "accessorised": "accessorized", + "accessorises": "accessorizes", + "accessorising": "accessorizing", + "acclimatisation": "acclimatization", + "acclimatise": "acclimatize", + "acclimatised": "acclimatized", + "acclimatises": "acclimatizes", + "acclimatising": "acclimatizing", + "accoutrements": "accouterments", + "aeon": "eon", + "aeons": "eons", + "aerogramme": "aerogram", + "aerogrammes": "aerograms", + "aeroplane": "airplane", + "aeroplanes": "airplanes", + "aesthete": "esthete", + "aesthetes": "esthetes", + "aesthetic": "esthetic", + "aesthetically": "esthetically", + "aesthetics": "esthetics", + "aetiology": "etiology", + "ageing": "aging", + "aggrandisement": "aggrandizement", + "agonise": "agonize", + "agonised": "agonized", + "agonises": "agonizes", + "agonising": "agonizing", + "agonisingly": "agonizingly", + "almanack": "almanac", + "almanacks": "almanacs", + "aluminium": "aluminum", + "amortisable": "amortizable", + "amortisation": "amortization", + "amortisations": "amortizations", + "amortise": "amortize", + "amortised": "amortized", + "amortises": "amortizes", + "amortising": "amortizing", + "amphitheatre": "amphitheater", + "amphitheatres": "amphitheaters", + "anaemia": "anemia", + "anaemic": "anemic", + "anaesthesia": "anesthesia", + "anaesthetic": "anesthetic", + "anaesthetics": "anesthetics", + "anaesthetise": "anesthetize", + "anaesthetised": "anesthetized", + "anaesthetises": "anesthetizes", + "anaesthetising": "anesthetizing", + "anaesthetist": "anesthetist", + "anaesthetists": "anesthetists", + "anaesthetize": "anesthetize", + "anaesthetized": "anesthetized", + "anaesthetizes": "anesthetizes", + "anaesthetizing": "anesthetizing", + "analogue": "analog", + "analogues": "analogs", + "analyse": "analyze", + "analysed": "analyzed", + "analyses": "analyzes", + "analysing": "analyzing", + "anglicise": "anglicize", + "anglicised": "anglicized", + "anglicises": 
"anglicizes", + "anglicising": "anglicizing", + "annualised": "annualized", + "antagonise": "antagonize", + "antagonised": "antagonized", + "antagonises": "antagonizes", + "antagonising": "antagonizing", + "apologise": "apologize", + "apologised": "apologized", + "apologises": "apologizes", + "apologising": "apologizing", + "appal": "appall", + "appals": "appalls", + "appetiser": "appetizer", + "appetisers": "appetizers", + "appetising": "appetizing", + "appetisingly": "appetizingly", + "arbour": "arbor", + "arbours": "arbors", + "archeological": "archaeological", + "archaeologically": "archeologically", + "archaeologist": "archeologist", + "archaeologists": "archeologists", + "archaeology": "archeology", + "ardour": "ardor", + "armour": "armor", + "armoured": "armored", + "armourer": "armorer", + "armourers": "armorers", + "armouries": "armories", + "armoury": "armory", + "artefact": "artifact", + "artefacts": "artifacts", + "authorise": "authorize", + "authorised": "authorized", + "authorises": "authorizes", + "authorising": "authorizing", + "axe": "ax", + "backpedalled": "backpedaled", + "backpedalling": "backpedaling", + "bannister": "banister", + "bannisters": "banisters", + "baptise": "baptize", + "baptised": "baptized", + "baptises": "baptizes", + "baptising": "baptizing", + "bastardise": "bastardize", + "bastardised": "bastardized", + "bastardises": "bastardizes", + "bastardising": "bastardizing", + "battleax": "battleaxe", + "baulk": "balk", + "baulked": "balked", + "baulking": "balking", + "baulks": "balks", + "bedevilled": "bedeviled", + "bedevilling": "bedeviling", + "behaviour": "behavior", + "behavioural": "behavioral", + "behaviourism": "behaviorism", + "behaviourist": "behaviorist", + "behaviourists": "behaviorists", + "behaviours": "behaviors", + "behove": "behoove", + "behoved": "behooved", + "behoves": "behooves", + "bejewelled": "bejeweled", + "belabour": "belabor", + "belaboured": "belabored", + "belabouring": "belaboring", + "belabours": "belabors", + "bevelled": "beveled", + "bevvies": "bevies", + "bevvy": "bevy", + "biassed": "biased", + "biassing": "biasing", + "bingeing": "binging", + "bougainvillaea": "bougainvillea", + "bougainvillaeas": "bougainvilleas", + "bowdlerise": "bowdlerize", + "bowdlerised": "bowdlerized", + "bowdlerises": "bowdlerizes", + "bowdlerising": "bowdlerizing", + "breathalyse": "breathalyze", + "breathalysed": "breathalyzed", + "breathalyser": "breathalyzer", + "breathalysers": "breathalyzers", + "breathalyses": "breathalyzes", + "breathalysing": "breathalyzing", + "brutalise": "brutalize", + "brutalised": "brutalized", + "brutalises": "brutalizes", + "brutalising": "brutalizing", + "busses": "buses", + "bussing": "busing", + "caesarean": "cesarean", + "caesareans": "cesareans", + "calibre": "caliber", + "calibres": "calibers", + "calliper": "caliper", + "callipers": "calipers", + "callisthenics": "calisthenics", + "canalise": "canalize", + "canalised": "canalized", + "canalises": "canalizes", + "canalising": "canalizing", + "cancelation": "cancellation", + "cancelations": "cancellations", + "cancelled": "canceled", + "cancelling": "canceling", + "candour": "candor", + "cannibalise": "cannibalize", + "cannibalised": "cannibalized", + "cannibalises": "cannibalizes", + "cannibalising": "cannibalizing", + "canonise": "canonize", + "canonised": "canonized", + "canonises": "canonizes", + "canonising": "canonizing", + "capitalise": "capitalize", + "capitalised": "capitalized", + "capitalises": "capitalizes", + "capitalising": "capitalizing", + 
"caramelise": "caramelize", + "caramelised": "caramelized", + "caramelises": "caramelizes", + "caramelising": "caramelizing", + "carbonise": "carbonize", + "carbonised": "carbonized", + "carbonises": "carbonizes", + "carbonising": "carbonizing", + "carolled": "caroled", + "carolling": "caroling", + "catalogue": "catalog", + "catalogued": "cataloged", + "catalogues": "catalogs", + "cataloguing": "cataloging", + "catalyse": "catalyze", + "catalysed": "catalyzed", + "catalyses": "catalyzes", + "catalysing": "catalyzing", + "categorise": "categorize", + "categorised": "categorized", + "categorises": "categorizes", + "categorising": "categorizing", + "cauterise": "cauterize", + "cauterised": "cauterized", + "cauterises": "cauterizes", + "cauterising": "cauterizing", + "cavilled": "caviled", + "cavilling": "caviling", + "centigramme": "centigram", + "centigrammes": "centigrams", + "centilitre": "centiliter", + "centilitres": "centiliters", + "centimetre": "centimeter", + "centimetres": "centimeters", + "centralise": "centralize", + "centralised": "centralized", + "centralises": "centralizes", + "centralising": "centralizing", + "centre": "center", + "centred": "centered", + "centrefold": "centerfold", + "centrefolds": "centerfolds", + "centrepiece": "centerpiece", + "centrepieces": "centerpieces", + "centres": "centers", + "channelled": "channeled", + "channelling": "channeling", + "characterise": "characterize", + "characterised": "characterized", + "characterises": "characterizes", + "characterising": "characterizing", + "cheque": "check", + "chequebook": "checkbook", + "chequebooks": "checkbooks", + "chequered": "checkered", + "cheques": "checks", + "chilli": "chili", + "chimaera": "chimera", + "chimaeras": "chimeras", + "chiselled": "chiseled", + "chiselling": "chiseling", + "circularise": "circularize", + "circularised": "circularized", + "circularises": "circularizes", + "circularising": "circularizing", + "civilise": "civilize", + "civilised": "civilized", + "civilises": "civilizes", + "civilising": "civilizing", + "clamour": "clamor", + "clamoured": "clamored", + "clamouring": "clamoring", + "clamours": "clamors", + "clangour": "clangor", + "clarinettist": "clarinetist", + "clarinettists": "clarinetists", + "collectivise": "collectivize", + "collectivised": "collectivized", + "collectivises": "collectivizes", + "collectivising": "collectivizing", + "colonisation": "colonization", + "colonise": "colonize", + "colonised": "colonized", + "coloniser": "colonizer", + "colonisers": "colonizers", + "colonises": "colonizes", + "colonising": "colonizing", + "colour": "color", + "colourant": "colorant", + "colourants": "colorants", + "coloured": "colored", + "coloureds": "coloreds", + "colourful": "colorful", + "colourfully": "colorfully", + "colouring": "coloring", + "colourize": "colorize", + "colourized": "colorized", + "colourizes": "colorizes", + "colourizing": "colorizing", + "colourless": "colorless", + "colours": "colors", + "commercialise": "commercialize", + "commercialised": "commercialized", + "commercialises": "commercializes", + "commercialising": "commercializing", + "compartmentalise": "compartmentalize", + "compartmentalised": "compartmentalized", + "compartmentalises": "compartmentalizes", + "compartmentalising": "compartmentalizing", + "computerise": "computerize", + "computerised": "computerized", + "computerises": "computerizes", + "computerising": "computerizing", + "conceptualise": "conceptualize", + "conceptualised": "conceptualized", + "conceptualises": "conceptualizes", + 
"conceptualising": "conceptualizing", + "connexion": "connection", + "connexions": "connections", + "contextualise": "contextualize", + "contextualised": "contextualized", + "contextualises": "contextualizes", + "contextualising": "contextualizing", + "cosier": "cozier", + "cosies": "cozies", + "cosiest": "coziest", + "cosily": "cozily", + "cosiness": "coziness", + "cosy": "cozy", + "councillor": "councilor", + "councillors": "councilors", + "counselled": "counseled", + "counselling": "counseling", + "counsellor": "counselor", + "counsellors": "counselors", + "crenelated": "crenellated", + "criminalise": "criminalize", + "criminalised": "criminalized", + "criminalises": "criminalizes", + "criminalising": "criminalizing", + "criticise": "criticize", + "criticised": "criticized", + "criticises": "criticizes", + "criticising": "criticizing", + "crueller": "crueler", + "cruellest": "cruelest", + "crystallisation": "crystallization", + "crystallise": "crystallize", + "crystallised": "crystallized", + "crystallises": "crystallizes", + "crystallising": "crystallizing", + "cudgelled": "cudgeled", + "cudgelling": "cudgeling", + "customise": "customize", + "customised": "customized", + "customises": "customizes", + "customising": "customizing", + "cypher": "cipher", + "cyphers": "ciphers", + "decentralisation": "decentralization", + "decentralise": "decentralize", + "decentralised": "decentralized", + "decentralises": "decentralizes", + "decentralising": "decentralizing", + "decriminalisation": "decriminalization", + "decriminalise": "decriminalize", + "decriminalised": "decriminalized", + "decriminalises": "decriminalizes", + "decriminalising": "decriminalizing", + "defence": "defense", + "defenceless": "defenseless", + "defences": "defenses", + "dehumanisation": "dehumanization", + "dehumanise": "dehumanize", + "dehumanised": "dehumanized", + "dehumanises": "dehumanizes", + "dehumanising": "dehumanizing", + "demeanour": "demeanor", + "demilitarisation": "demilitarization", + "demilitarise": "demilitarize", + "demilitarised": "demilitarized", + "demilitarises": "demilitarizes", + "demilitarising": "demilitarizing", + "demobilisation": "demobilization", + "demobilise": "demobilize", + "demobilised": "demobilized", + "demobilises": "demobilizes", + "demobilising": "demobilizing", + "democratisation": "democratization", + "democratise": "democratize", + "democratised": "democratized", + "democratises": "democratizes", + "democratising": "democratizing", + "demonise": "demonize", + "demonised": "demonized", + "demonises": "demonizes", + "demonising": "demonizing", + "demoralisation": "demoralization", + "demoralise": "demoralize", + "demoralised": "demoralized", + "demoralises": "demoralizes", + "demoralising": "demoralizing", + "denationalisation": "denationalization", + "denationalise": "denationalize", + "denationalised": "denationalized", + "denationalises": "denationalizes", + "denationalising": "denationalizing", + "deodorise": "deodorize", + "deodorised": "deodorized", + "deodorises": "deodorizes", + "deodorising": "deodorizing", + "depersonalise": "depersonalize", + "depersonalised": "depersonalized", + "depersonalises": "depersonalizes", + "depersonalising": "depersonalizing", + "deputise": "deputize", + "deputised": "deputized", + "deputises": "deputizes", + "deputising": "deputizing", + "desensitisation": "desensitization", + "desensitise": "desensitize", + "desensitised": "desensitized", + "desensitises": "desensitizes", + "desensitising": "desensitizing", + "destabilisation": 
"destabilization", + "destabilise": "destabilize", + "destabilised": "destabilized", + "destabilises": "destabilizes", + "destabilising": "destabilizing", + "dialled": "dialed", + "dialling": "dialing", + "dialogue": "dialog", + "dialogues": "dialogs", + "diarrhoea": "diarrhea", + "digitise": "digitize", + "digitised": "digitized", + "digitises": "digitizes", + "digitising": "digitizing", + "disc": "disk", + "discolour": "discolor", + "discoloured": "discolored", + "discolouring": "discoloring", + "discolours": "discolors", + "discs": "disks", + "disembowelled": "disemboweled", + "disembowelling": "disemboweling", + "disfavour": "disfavor", + "dishevelled": "disheveled", + "dishonour": "dishonor", + "dishonourable": "dishonorable", + "dishonourably": "dishonorably", + "dishonoured": "dishonored", + "dishonouring": "dishonoring", + "dishonours": "dishonors", + "disorganisation": "disorganization", + "disorganised": "disorganized", + "distil": "distill", + "distils": "distills", + "dramatisation": "dramatization", + "dramatisations": "dramatizations", + "dramatise": "dramatize", + "dramatised": "dramatized", + "dramatises": "dramatizes", + "dramatising": "dramatizing", + "draught": "draft", + "draughtboard": "draftboard", + "draughtboards": "draftboards", + "draughtier": "draftier", + "draughtiest": "draftiest", + "draughts": "drafts", + "draughtsman": "draftsman", + "draughtsmanship": "draftsmanship", + "draughtsmen": "draftsmen", + "draughtswoman": "draftswoman", + "draughtswomen": "draftswomen", + "draughty": "drafty", + "drivelled": "driveled", + "drivelling": "driveling", + "duelled": "dueled", + "duelling": "dueling", + "economise": "economize", + "economised": "economized", + "economises": "economizes", + "economising": "economizing", + "edoema": "edema", + "editorialise": "editorialize", + "editorialised": "editorialized", + "editorialises": "editorializes", + "editorialising": "editorializing", + "empathise": "empathize", + "empathised": "empathized", + "empathises": "empathizes", + "empathising": "empathizing", + "emphasise": "emphasize", + "emphasised": "emphasized", + "emphasises": "emphasizes", + "emphasising": "emphasizing", + "enamelled": "enameled", + "enamelling": "enameling", + "enamoured": "enamored", + "encyclopaedia": "encyclopedia", + "encyclopaedias": "encyclopedias", + "encyclopaedic": "encyclopedic", + "endeavour": "endeavor", + "endeavoured": "endeavored", + "endeavouring": "endeavoring", + "endeavours": "endeavors", + "energise": "energize", + "energised": "energized", + "energises": "energizes", + "energising": "energizing", + "enrol": "enroll", + "enrols": "enrolls", + "enthral": "enthrall", + "enthrals": "enthralls", + "epaulette": "epaulet", + "epaulettes": "epaulets", + "epicentre": "epicenter", + "epicentres": "epicenters", + "epilogue": "epilog", + "epilogues": "epilogs", + "epitomise": "epitomize", + "epitomised": "epitomized", + "epitomises": "epitomizes", + "epitomising": "epitomizing", + "equalisation": "equalization", + "equalise": "equalize", + "equalised": "equalized", + "equaliser": "equalizer", + "equalisers": "equalizers", + "equalises": "equalizes", + "equalising": "equalizing", + "eulogise": "eulogize", + "eulogised": "eulogized", + "eulogises": "eulogizes", + "eulogising": "eulogizing", + "evangelise": "evangelize", + "evangelised": "evangelized", + "evangelises": "evangelizes", + "evangelising": "evangelizing", + "exorcise": "exorcize", + "exorcised": "exorcized", + "exorcises": "exorcizes", + "exorcising": "exorcizing", + "extemporisation": 
"extemporization", + "extemporise": "extemporize", + "extemporised": "extemporized", + "extemporises": "extemporizes", + "extemporising": "extemporizing", + "externalisation": "externalization", + "externalisations": "externalizations", + "externalise": "externalize", + "externalised": "externalized", + "externalises": "externalizes", + "externalising": "externalizing", + "factorise": "factorize", + "factorised": "factorized", + "factorises": "factorizes", + "factorising": "factorizing", + "faecal": "fecal", + "faeces": "feces", + "familiarisation": "familiarization", + "familiarise": "familiarize", + "familiarised": "familiarized", + "familiarises": "familiarizes", + "familiarising": "familiarizing", + "fantasise": "fantasize", + "fantasised": "fantasized", + "fantasises": "fantasizes", + "fantasising": "fantasizing", + "favour": "favor", + "favourable": "favorable", + "favourably": "favorably", + "favoured": "favored", + "favouring": "favoring", + "favourite": "favorite", + "favourites": "favorites", + "favouritism": "favoritism", + "favours": "favors", + "feminise": "feminize", + "feminised": "feminized", + "feminises": "feminizes", + "feminising": "feminizing", + "fertilisation": "fertilization", + "fertilise": "fertilize", + "fertilised": "fertilized", + "fertiliser": "fertilizer", + "fertilisers": "fertilizers", + "fertilises": "fertilizes", + "fertilising": "fertilizing", + "fervour": "fervor", + "fibre": "fiber", + "fibreglass": "fiberglass", + "fibres": "fibers", + "fictionalisation": "fictionalization", + "fictionalisations": "fictionalizations", + "fictionalise": "fictionalize", + "fictionalised": "fictionalized", + "fictionalises": "fictionalizes", + "fictionalising": "fictionalizing", + "fillet": "filet", + "filleted": "fileted", + "filleting": "fileting", + "fillets": "filets", + "finalisation": "finalization", + "finalise": "finalize", + "finalised": "finalized", + "finalises": "finalizes", + "finalising": "finalizing", + "flautist": "flutist", + "flautists": "flutists", + "flavour": "flavor", + "flavoured": "flavored", + "flavouring": "flavoring", + "flavourings": "flavorings", + "flavourless": "flavorless", + "flavours": "flavors", + "flavoursome": "flavorsome", + "flyer / flier": "flier / flyer", + "foetal": "fetal", + "foetid": "fetid", + "foetus": "fetus", + "foetuses": "fetuses", + "formalisation": "formalization", + "formalise": "formalize", + "formalised": "formalized", + "formalises": "formalizes", + "formalising": "formalizing", + "fossilisation": "fossilization", + "fossilise": "fossilize", + "fossilised": "fossilized", + "fossilises": "fossilizes", + "fossilising": "fossilizing", + "fraternisation": "fraternization", + "fraternise": "fraternize", + "fraternised": "fraternized", + "fraternises": "fraternizes", + "fraternising": "fraternizing", + "fulfil": "fulfill", + "fulfilment": "fulfillment", + "fulfils": "fulfills", + "funnelled": "funneled", + "funnelling": "funneling", + "galvanise": "galvanize", + "galvanised": "galvanized", + "galvanises": "galvanizes", + "galvanising": "galvanizing", + "gambolled": "gamboled", + "gambolling": "gamboling", + "gaol": "jail", + "gaolbird": "jailbird", + "gaolbirds": "jailbirds", + "gaolbreak": "jailbreak", + "gaolbreaks": "jailbreaks", + "gaoled": "jailed", + "gaoler": "jailer", + "gaolers": "jailers", + "gaoling": "jailing", + "gaols": "jails", + "gasses": "gases", + "gage": "gauge", + "gaged": "gauged", + "gages": "gauges", + "gaging": "gauging", + "generalisation": "generalization", + "generalisations": 
"generalizations", + "generalise": "generalize", + "generalised": "generalized", + "generalises": "generalizes", + "generalising": "generalizing", + "ghettoise": "ghettoize", + "ghettoised": "ghettoized", + "ghettoises": "ghettoizes", + "ghettoising": "ghettoizing", + "gipsies": "gypsies", + "glamorise": "glamorize", + "glamorised": "glamorized", + "glamorises": "glamorizes", + "glamorising": "glamorizing", + "glamor": "glamour", + "globalisation": "globalization", + "globalise": "globalize", + "globalised": "globalized", + "globalises": "globalizes", + "globalising": "globalizing", + "glueing": "gluing", + "goitre": "goiter", + "goitres": "goiters", + "gonorrhoea": "gonorrhea", + "gramme": "gram", + "grammes": "grams", + "gravelled": "graveled", + "grey": "gray", + "greyed": "grayed", + "greying": "graying", + "greyish": "grayish", + "greyness": "grayness", + "greys": "grays", + "grovelled": "groveled", + "grovelling": "groveling", + "groyne": "groin", + "groynes": "groins", + "gruelling": "grueling", + "gruellingly": "gruelingly", + "gryphon": "griffin", + "gryphons": "griffins", + "gynaecological": "gynecological", + "gynaecologist": "gynecologist", + "gynaecologists": "gynecologists", + "gynaecology": "gynecology", + "haematological": "hematological", + "haematologist": "hematologist", + "haematologists": "hematologists", + "haematology": "hematology", + "haemoglobin": "hemoglobin", + "haemophilia": "hemophilia", + "haemophiliac": "hemophiliac", + "haemophiliacs": "hemophiliacs", + "haemorrhage": "hemorrhage", + "haemorrhaged": "hemorrhaged", + "haemorrhages": "hemorrhages", + "haemorrhaging": "hemorrhaging", + "haemorrhoids": "hemorrhoids", + "harbour": "harbor", + "harboured": "harbored", + "harbouring": "harboring", + "harbours": "harbors", + "harmonisation": "harmonization", + "harmonise": "harmonize", + "harmonised": "harmonized", + "harmonises": "harmonizes", + "harmonising": "harmonizing", + "homoeopath": "homeopath", + "homoeopathic": "homeopathic", + "homoeopaths": "homeopaths", + "homoeopathy": "homeopathy", + "homogenise": "homogenize", + "homogenised": "homogenized", + "homogenises": "homogenizes", + "homogenising": "homogenizing", + "honour": "honor", + "honourable": "honorable", + "honourably": "honorably", + "honoured": "honored", + "honouring": "honoring", + "honours": "honors", + "hospitalisation": "hospitalization", + "hospitalise": "hospitalize", + "hospitalised": "hospitalized", + "hospitalises": "hospitalizes", + "hospitalising": "hospitalizing", + "humanise": "humanize", + "humanised": "humanized", + "humanises": "humanizes", + "humanising": "humanizing", + "humour": "humor", + "humoured": "humored", + "humouring": "humoring", + "humourless": "humorless", + "humours": "humors", + "hybridise": "hybridize", + "hybridised": "hybridized", + "hybridises": "hybridizes", + "hybridising": "hybridizing", + "hypnotise": "hypnotize", + "hypnotised": "hypnotized", + "hypnotises": "hypnotizes", + "hypnotising": "hypnotizing", + "hypothesise": "hypothesize", + "hypothesised": "hypothesized", + "hypothesises": "hypothesizes", + "hypothesising": "hypothesizing", + "idealisation": "idealization", + "idealise": "idealize", + "idealised": "idealized", + "idealises": "idealizes", + "idealising": "idealizing", + "idolise": "idolize", + "idolised": "idolized", + "idolises": "idolizes", + "idolising": "idolizing", + "immobilisation": "immobilization", + "immobilise": "immobilize", + "immobilised": "immobilized", + "immobiliser": "immobilizer", + "immobilisers": "immobilizers", + 
"immobilises": "immobilizes", + "immobilising": "immobilizing", + "immortalise": "immortalize", + "immortalised": "immortalized", + "immortalises": "immortalizes", + "immortalising": "immortalizing", + "immunisation": "immunization", + "immunise": "immunize", + "immunised": "immunized", + "immunises": "immunizes", + "immunising": "immunizing", + "impanelled": "impaneled", + "impanelling": "impaneling", + "imperilled": "imperiled", + "imperilling": "imperiling", + "individualise": "individualize", + "individualised": "individualized", + "individualises": "individualizes", + "individualising": "individualizing", + "industrialise": "industrialize", + "industrialised": "industrialized", + "industrialises": "industrializes", + "industrialising": "industrializing", + "inflexion": "inflection", + "inflexions": "inflections", + "initialise": "initialize", + "initialised": "initialized", + "initialises": "initializes", + "initialising": "initializing", + "initialled": "initialed", + "initialling": "initialing", + "instal": "install", + "instalment": "installment", + "instalments": "installments", + "instals": "installs", + "instil": "instill", + "instils": "instills", + "institutionalisation": "institutionalization", + "institutionalise": "institutionalize", + "institutionalised": "institutionalized", + "institutionalises": "institutionalizes", + "institutionalising": "institutionalizing", + "intellectualise": "intellectualize", + "intellectualised": "intellectualized", + "intellectualises": "intellectualizes", + "intellectualising": "intellectualizing", + "internalisation": "internalization", + "internalise": "internalize", + "internalised": "internalized", + "internalises": "internalizes", + "internalising": "internalizing", + "internationalisation": "internationalization", + "internationalise": "internationalize", + "internationalised": "internationalized", + "internationalises": "internationalizes", + "internationalising": "internationalizing", + "ionisation": "ionization", + "ionise": "ionize", + "ionised": "ionized", + "ioniser": "ionizer", + "ionisers": "ionizers", + "ionises": "ionizes", + "ionising": "ionizing", + "italicise": "italicize", + "italicised": "italicized", + "italicises": "italicizes", + "italicising": "italicizing", + "itemise": "itemize", + "itemised": "itemized", + "itemises": "itemizes", + "itemising": "itemizing", + "jeopardise": "jeopardize", + "jeopardised": "jeopardized", + "jeopardises": "jeopardizes", + "jeopardising": "jeopardizing", + "jewelled": "jeweled", + "jeweller": "jeweler", + "jewellers": "jewelers", + "jewellery": "jewelry", + "judgement": "judgment", + "kilogramme": "kilogram", + "kilogrammes": "kilograms", + "kilometre": "kilometer", + "kilometres": "kilometers", + "labelled": "labeled", + "labelling": "labeling", + "labour": "labor", + "laboured": "labored", + "labourer": "laborer", + "labourers": "laborers", + "labouring": "laboring", + "labours": "labors", + "lacklustre": "lackluster", + "legalisation": "legalization", + "legalise": "legalize", + "legalised": "legalized", + "legalises": "legalizes", + "legalising": "legalizing", + "legitimise": "legitimize", + "legitimised": "legitimized", + "legitimises": "legitimizes", + "legitimising": "legitimizing", + "leukaemia": "leukemia", + "levelled": "leveled", + "leveller": "leveler", + "levellers": "levelers", + "levelling": "leveling", + "libelled": "libeled", + "libelling": "libeling", + "libellous": "libelous", + "liberalisation": "liberalization", + "liberalise": "liberalize", + "liberalised": 
"liberalized", + "liberalises": "liberalizes", + "liberalising": "liberalizing", + "licence": "license", + "licenced": "licensed", + "licences": "licenses", + "licencing": "licensing", + "likeable": "likable", + "lionisation": "lionization", + "lionise": "lionize", + "lionised": "lionized", + "lionises": "lionizes", + "lionising": "lionizing", + "liquidise": "liquidize", + "liquidised": "liquidized", + "liquidiser": "liquidizer", + "liquidisers": "liquidizers", + "liquidises": "liquidizes", + "liquidising": "liquidizing", + "litre": "liter", + "litres": "liters", + "localise": "localize", + "localised": "localized", + "localises": "localizes", + "localising": "localizing", + "louvre": "louver", + "louvred": "louvered", + "louvres": "louvers", + "lustre": "luster", + "magnetise": "magnetize", + "magnetised": "magnetized", + "magnetises": "magnetizes", + "magnetising": "magnetizing", + "manoeuvrability": "maneuverability", + "manoeuvrable": "maneuverable", + "manoeuvre": "maneuver", + "manoeuvred": "maneuvered", + "manoeuvres": "maneuvers", + "manoeuvring": "maneuvering", + "manoeuvrings": "maneuverings", + "marginalisation": "marginalization", + "marginalise": "marginalize", + "marginalised": "marginalized", + "marginalises": "marginalizes", + "marginalising": "marginalizing", + "marshalled": "marshaled", + "marshalling": "marshaling", + "marvelled": "marveled", + "marvelling": "marveling", + "marvellous": "marvelous", + "marvellously": "marvelously", + "materialisation": "materialization", + "materialise": "materialize", + "materialised": "materialized", + "materialises": "materializes", + "materialising": "materializing", + "maximisation": "maximization", + "maximise": "maximize", + "maximised": "maximized", + "maximises": "maximizes", + "maximising": "maximizing", + "meagre": "meager", + "mechanisation": "mechanization", + "mechanise": "mechanize", + "mechanised": "mechanized", + "mechanises": "mechanizes", + "mechanising": "mechanizing", + "mediaeval": "medieval", + "memorialise": "memorialize", + "memorialised": "memorialized", + "memorialises": "memorializes", + "memorialising": "memorializing", + "memorise": "memorize", + "memorised": "memorized", + "memorises": "memorizes", + "memorising": "memorizing", + "mesmerise": "mesmerize", + "mesmerised": "mesmerized", + "mesmerises": "mesmerizes", + "mesmerising": "mesmerizing", + "metabolise": "metabolize", + "metabolised": "metabolized", + "metabolises": "metabolizes", + "metabolising": "metabolizing", + "metre": "meter", + "metres": "meters", + "micrometre": "micrometer", + "micrometres": "micrometers", + "militarise": "militarize", + "militarised": "militarized", + "militarises": "militarizes", + "militarising": "militarizing", + "milligramme": "milligram", + "milligrammes": "milligrams", + "millilitre": "milliliter", + "millilitres": "milliliters", + "millimetre": "millimeter", + "millimetres": "millimeters", + "miniaturisation": "miniaturization", + "miniaturise": "miniaturize", + "miniaturised": "miniaturized", + "miniaturises": "miniaturizes", + "miniaturising": "miniaturizing", + "minibusses": "minibuses", + "minimise": "minimize", + "minimised": "minimized", + "minimises": "minimizes", + "minimising": "minimizing", + "misbehaviour": "misbehavior", + "misdemeanour": "misdemeanor", + "misdemeanours": "misdemeanors", + "misspelt": "misspelled", + "mitre": "miter", + "mitres": "miters", + "mobilisation": "mobilization", + "mobilise": "mobilize", + "mobilised": "mobilized", + "mobilises": "mobilizes", + "mobilising": "mobilizing", + 
"modelled": "modeled", + "modeller": "modeler", + "modellers": "modelers", + "modelling": "modeling", + "modernise": "modernize", + "modernised": "modernized", + "modernises": "modernizes", + "modernising": "modernizing", + "moisturise": "moisturize", + "moisturised": "moisturized", + "moisturiser": "moisturizer", + "moisturisers": "moisturizers", + "moisturises": "moisturizes", + "moisturising": "moisturizing", + "monologue": "monolog", + "monologues": "monologs", + "monopolisation": "monopolization", + "monopolise": "monopolize", + "monopolised": "monopolized", + "monopolises": "monopolizes", + "monopolising": "monopolizing", + "moralise": "moralize", + "moralised": "moralized", + "moralises": "moralizes", + "moralising": "moralizing", + "motorised": "motorized", + "mould": "mold", + "moulded": "molded", + "moulder": "molder", + "mouldered": "moldered", + "mouldering": "moldering", + "moulders": "molders", + "mouldier": "moldier", + "mouldiest": "moldiest", + "moulding": "molding", + "mouldings": "moldings", + "moulds": "molds", + "mouldy": "moldy", + "moult": "molt", + "moulted": "molted", + "moulting": "molting", + "moults": "molts", + "moustache": "mustache", + "moustached": "mustached", + "moustaches": "mustaches", + "moustachioed": "mustachioed", + "multicoloured": "multicolored", + "nationalisation": "nationalization", + "nationalisations": "nationalizations", + "nationalise": "nationalize", + "nationalised": "nationalized", + "nationalises": "nationalizes", + "nationalising": "nationalizing", + "naturalisation": "naturalization", + "naturalise": "naturalize", + "naturalised": "naturalized", + "naturalises": "naturalizes", + "naturalising": "naturalizing", + "neighbour": "neighbor", + "neighbourhood": "neighborhood", + "neighbourhoods": "neighborhoods", + "neighbouring": "neighboring", + "neighbourliness": "neighborliness", + "neighbourly": "neighborly", + "neighbours": "neighbors", + "neutralisation": "neutralization", + "neutralise": "neutralize", + "neutralised": "neutralized", + "neutralises": "neutralizes", + "neutralising": "neutralizing", + "normalisation": "normalization", + "normalise": "normalize", + "normalised": "normalized", + "normalises": "normalizes", + "normalising": "normalizing", + "odour": "odor", + "odourless": "odorless", + "odours": "odors", + "oesophagus": "esophagus", + "oesophaguses": "esophaguses", + "oestrogen": "estrogen", + "offence": "offense", + "offences": "offenses", + "omelette": "omelet", + "omelettes": "omelets", + "optimise": "optimize", + "optimised": "optimized", + "optimises": "optimizes", + "optimising": "optimizing", + "organisation": "organization", + "organisational": "organizational", + "organisations": "organizations", + "organise": "organize", + "organised": "organized", + "organiser": "organizer", + "organisers": "organizers", + "organises": "organizes", + "organising": "organizing", + "orthopaedic": "orthopedic", + "orthopaedics": "orthopedics", + "ostracise": "ostracize", + "ostracised": "ostracized", + "ostracises": "ostracizes", + "ostracising": "ostracizing", + "outmanoeuvre": "outmaneuver", + "outmanoeuvred": "outmaneuvered", + "outmanoeuvres": "outmaneuvers", + "outmanoeuvring": "outmaneuvering", + "overemphasise": "overemphasize", + "overemphasised": "overemphasized", + "overemphasises": "overemphasizes", + "overemphasising": "overemphasizing", + "oxidisation": "oxidization", + "oxidise": "oxidize", + "oxidised": "oxidized", + "oxidises": "oxidizes", + "oxidising": "oxidizing", + "paederast": "pederast", + "paederasts": 
"pederasts", + "paediatric": "pediatric", + "paediatrician": "pediatrician", + "paediatricians": "pediatricians", + "paediatrics": "pediatrics", + "paedophile": "pedophile", + "paedophiles": "pedophiles", + "paedophilia": "pedophilia", + "palaeolithic": "paleolithic", + "palaeontologist": "paleontologist", + "palaeontologists": "paleontologists", + "palaeontology": "paleontology", + "panelled": "paneled", + "panelling": "paneling", + "panellist": "panelist", + "panellists": "panelists", + "paralyse": "paralyze", + "paralysed": "paralyzed", + "paralyses": "paralyzes", + "paralysing": "paralyzing", + "parcelled": "parceled", + "parcelling": "parceling", + "parlour": "parlor", + "parlours": "parlors", + "particularise": "particularize", + "particularised": "particularized", + "particularises": "particularizes", + "particularising": "particularizing", + "passivisation": "passivization", + "passivise": "passivize", + "passivised": "passivized", + "passivises": "passivizes", + "passivising": "passivizing", + "pasteurisation": "pasteurization", + "pasteurise": "pasteurize", + "pasteurised": "pasteurized", + "pasteurises": "pasteurizes", + "pasteurising": "pasteurizing", + "patronise": "patronize", + "patronised": "patronized", + "patronises": "patronizes", + "patronising": "patronizing", + "patronisingly": "patronizingly", + "pedalled": "pedaled", + "pedalling": "pedaling", + "pedestrianisation": "pedestrianization", + "pedestrianise": "pedestrianize", + "pedestrianised": "pedestrianized", + "pedestrianises": "pedestrianizes", + "pedestrianising": "pedestrianizing", + "penalise": "penalize", + "penalised": "penalized", + "penalises": "penalizes", + "penalising": "penalizing", + "pencilled": "penciled", + "pencilling": "penciling", + "personalise": "personalize", + "personalised": "personalized", + "personalises": "personalizes", + "personalising": "personalizing", + "pharmacopoeia": "pharmacopeia", + "pharmacopoeias": "pharmacopeias", + "philosophise": "philosophize", + "philosophised": "philosophized", + "philosophises": "philosophizes", + "philosophising": "philosophizing", + "philtre": "filter", + "philtres": "filters", + "phoney": "phony", + "plagiarise": "plagiarize", + "plagiarised": "plagiarized", + "plagiarises": "plagiarizes", + "plagiarising": "plagiarizing", + "plough": "plow", + "ploughed": "plowed", + "ploughing": "plowing", + "ploughman": "plowman", + "ploughmen": "plowmen", + "ploughs": "plows", + "ploughshare": "plowshare", + "ploughshares": "plowshares", + "polarisation": "polarization", + "polarise": "polarize", + "polarised": "polarized", + "polarises": "polarizes", + "polarising": "polarizing", + "politicisation": "politicization", + "politicise": "politicize", + "politicised": "politicized", + "politicises": "politicizes", + "politicising": "politicizing", + "popularisation": "popularization", + "popularise": "popularize", + "popularised": "popularized", + "popularises": "popularizes", + "popularising": "popularizing", + "pouffe": "pouf", + "pouffes": "poufs", + "practise": "practice", + "practised": "practiced", + "practises": "practices", + "practising": "practicing", + "praesidium": "presidium", + "praesidiums": "presidiums", + "pressurisation": "pressurization", + "pressurise": "pressurize", + "pressurised": "pressurized", + "pressurises": "pressurizes", + "pressurising": "pressurizing", + "pretence": "pretense", + "pretences": "pretenses", + "primaeval": "primeval", + "prioritisation": "prioritization", + "prioritise": "prioritize", + "prioritised": "prioritized", + 
"prioritises": "prioritizes", + "prioritising": "prioritizing", + "privatisation": "privatization", + "privatisations": "privatizations", + "privatise": "privatize", + "privatised": "privatized", + "privatises": "privatizes", + "privatising": "privatizing", + "professionalisation": "professionalization", + "professionalise": "professionalize", + "professionalised": "professionalized", + "professionalises": "professionalizes", + "professionalising": "professionalizing", + "programme": "program", + "programmes": "programs", + "prologue": "prolog", + "prologues": "prologs", + "propagandise": "propagandize", + "propagandised": "propagandized", + "propagandises": "propagandizes", + "propagandising": "propagandizing", + "proselytise": "proselytize", + "proselytised": "proselytized", + "proselytiser": "proselytizer", + "proselytisers": "proselytizers", + "proselytises": "proselytizes", + "proselytising": "proselytizing", + "psychoanalyse": "psychoanalyze", + "psychoanalysed": "psychoanalyzed", + "psychoanalyses": "psychoanalyzes", + "psychoanalysing": "psychoanalyzing", + "publicise": "publicize", + "publicised": "publicized", + "publicises": "publicizes", + "publicising": "publicizing", + "pulverisation": "pulverization", + "pulverise": "pulverize", + "pulverised": "pulverized", + "pulverises": "pulverizes", + "pulverising": "pulverizing", + "pummelled": "pummel", + "pummelling": "pummeled", + "pyjama": "pajama", + "pyjamas": "pajamas", + "pzazz": "pizzazz", + "quarrelled": "quarreled", + "quarrelling": "quarreling", + "radicalise": "radicalize", + "radicalised": "radicalized", + "radicalises": "radicalizes", + "radicalising": "radicalizing", + "rancour": "rancor", + "randomise": "randomize", + "randomised": "randomized", + "randomises": "randomizes", + "randomising": "randomizing", + "rationalisation": "rationalization", + "rationalisations": "rationalizations", + "rationalise": "rationalize", + "rationalised": "rationalized", + "rationalises": "rationalizes", + "rationalising": "rationalizing", + "ravelled": "raveled", + "ravelling": "raveling", + "realisable": "realizable", + "realisation": "realization", + "realisations": "realizations", + "realise": "realize", + "realised": "realized", + "realises": "realizes", + "realising": "realizing", + "recognisable": "recognizable", + "recognisably": "recognizably", + "recognisance": "recognizance", + "recognise": "recognize", + "recognised": "recognized", + "recognises": "recognizes", + "recognising": "recognizing", + "reconnoitre": "reconnoiter", + "reconnoitred": "reconnoitered", + "reconnoitres": "reconnoiters", + "reconnoitring": "reconnoitering", + "refuelled": "refueled", + "refuelling": "refueling", + "regularisation": "regularization", + "regularise": "regularize", + "regularised": "regularized", + "regularises": "regularizes", + "regularising": "regularizing", + "remodelled": "remodeled", + "remodelling": "remodeling", + "remould": "remold", + "remoulded": "remolded", + "remoulding": "remolding", + "remoulds": "remolds", + "reorganisation": "reorganization", + "reorganisations": "reorganizations", + "reorganise": "reorganize", + "reorganised": "reorganized", + "reorganises": "reorganizes", + "reorganising": "reorganizing", + "revelled": "reveled", + "reveller": "reveler", + "revellers": "revelers", + "revelling": "reveling", + "revitalise": "revitalize", + "revitalised": "revitalized", + "revitalises": "revitalizes", + "revitalising": "revitalizing", + "revolutionise": "revolutionize", + "revolutionised": "revolutionized", + 
"revolutionises": "revolutionizes", + "revolutionising": "revolutionizing", + "rhapsodise": "rhapsodize", + "rhapsodised": "rhapsodized", + "rhapsodises": "rhapsodizes", + "rhapsodising": "rhapsodizing", + "rigour": "rigor", + "rigours": "rigors", + "ritualised": "ritualized", + "rivalled": "rivaled", + "rivalling": "rivaling", + "romanticise": "romanticize", + "romanticised": "romanticized", + "romanticises": "romanticizes", + "romanticising": "romanticizing", + "rumour": "rumor", + "rumoured": "rumored", + "rumours": "rumors", + "sabre": "saber", + "sabres": "sabers", + "saltpetre": "saltpeter", + "sanitise": "sanitize", + "sanitised": "sanitized", + "sanitises": "sanitizes", + "sanitising": "sanitizing", + "satirise": "satirize", + "satirised": "satirized", + "satirises": "satirizes", + "satirising": "satirizing", + "saviour": "savior", + "saviours": "saviors", + "savour": "savor", + "savoured": "savored", + "savouries": "savories", + "savouring": "savoring", + "savours": "savors", + "savoury": "savory", + "scandalise": "scandalize", + "scandalised": "scandalized", + "scandalises": "scandalizes", + "scandalising": "scandalizing", + "sceptic": "skeptic", + "sceptical": "skeptical", + "sceptically": "skeptically", + "scepticism": "skepticism", + "sceptics": "skeptics", + "sceptre": "scepter", + "sceptres": "scepters", + "scrutinise": "scrutinize", + "scrutinised": "scrutinized", + "scrutinises": "scrutinizes", + "scrutinising": "scrutinizing", + "secularisation": "secularization", + "secularise": "secularize", + "secularised": "secularized", + "secularises": "secularizes", + "secularising": "secularizing", + "sensationalise": "sensationalize", + "sensationalised": "sensationalized", + "sensationalises": "sensationalizes", + "sensationalising": "sensationalizing", + "sensitise": "sensitize", + "sensitised": "sensitized", + "sensitises": "sensitizes", + "sensitising": "sensitizing", + "sentimentalise": "sentimentalize", + "sentimentalised": "sentimentalized", + "sentimentalises": "sentimentalizes", + "sentimentalising": "sentimentalizing", + "sepulchre": "sepulcher", + "sepulchres": "sepulchers", + "serialisation": "serialization", + "serialisations": "serializations", + "serialise": "serialize", + "serialised": "serialized", + "serialises": "serializes", + "serialising": "serializing", + "sermonise": "sermonize", + "sermonised": "sermonized", + "sermonises": "sermonizes", + "sermonising": "sermonizing", + "sheikh": "sheik", + "shovelled": "shoveled", + "shovelling": "shoveling", + "shrivelled": "shriveled", + "shrivelling": "shriveling", + "signalise": "signalize", + "signalised": "signalized", + "signalises": "signalizes", + "signalising": "signalizing", + "signalled": "signaled", + "signalling": "signaling", + "smoulder": "smolder", + "smouldered": "smoldered", + "smouldering": "smoldering", + "smoulders": "smolders", + "snivelled": "sniveled", + "snivelling": "sniveling", + "snorkelled": "snorkeled", + "snorkelling": "snorkeling", + "snowplough": "snowplow", + "snowploughs": "snowplow", + "socialisation": "socialization", + "socialise": "socialize", + "socialised": "socialized", + "socialises": "socializes", + "socialising": "socializing", + "sodomise": "sodomize", + "sodomised": "sodomized", + "sodomises": "sodomizes", + "sodomising": "sodomizing", + "solemnise": "solemnize", + "solemnised": "solemnized", + "solemnises": "solemnizes", + "solemnising": "solemnizing", + "sombre": "somber", + "specialisation": "specialization", + "specialisations": "specializations", + "specialise": 
"specialize", + "specialised": "specialized", + "specialises": "specializes", + "specialising": "specializing", + "spectre": "specter", + "spectres": "specters", + "spiralled": "spiraled", + "spiralling": "spiraling", + "splendour": "splendor", + "splendours": "splendors", + "squirrelled": "squirreled", + "squirrelling": "squirreling", + "stabilisation": "stabilization", + "stabilise": "stabilize", + "stabilised": "stabilized", + "stabiliser": "stabilizer", + "stabilisers": "stabilizers", + "stabilises": "stabilizes", + "stabilising": "stabilizing", + "standardisation": "standardization", + "standardise": "standardize", + "standardised": "standardized", + "standardises": "standardizes", + "standardising": "standardizing", + "stencilled": "stenciled", + "stencilling": "stenciling", + "sterilisation": "sterilization", + "sterilisations": "sterilizations", + "sterilise": "sterilize", + "sterilised": "sterilized", + "steriliser": "sterilizer", + "sterilisers": "sterilizers", + "sterilises": "sterilizes", + "sterilising": "sterilizing", + "stigmatisation": "stigmatization", + "stigmatise": "stigmatize", + "stigmatised": "stigmatized", + "stigmatises": "stigmatizes", + "stigmatising": "stigmatizing", + "storey": "story", + "storeys": "stories", + "subsidisation": "subsidization", + "subsidise": "subsidize", + "subsidised": "subsidized", + "subsidiser": "subsidizer", + "subsidisers": "subsidizers", + "subsidises": "subsidizes", + "subsidising": "subsidizing", + "succour": "succor", + "succoured": "succored", + "succouring": "succoring", + "succours": "succors", + "sulphate": "sulfate", + "sulphates": "sulfates", + "sulphide": "sulfide", + "sulphides": "sulfides", + "sulphur": "sulfur", + "sulphurous": "sulfurous", + "summarise": "summarize", + "summarised": "summarized", + "summarises": "summarizes", + "summarising": "summarizing", + "swivelled": "swiveled", + "swivelling": "swiveling", + "symbolise": "symbolize", + "symbolised": "symbolized", + "symbolises": "symbolizes", + "symbolising": "symbolizing", + "sympathise": "sympathize", + "sympathised": "sympathized", + "sympathiser": "sympathizer", + "sympathisers": "sympathizers", + "sympathises": "sympathizes", + "sympathising": "sympathizing", + "synchronisation": "synchronization", + "synchronise": "synchronize", + "synchronised": "synchronized", + "synchronises": "synchronizes", + "synchronising": "synchronizing", + "synthesise": "synthesize", + "synthesised": "synthesized", + "synthesiser": "synthesizer", + "synthesisers": "synthesizers", + "synthesises": "synthesizes", + "synthesising": "synthesizing", + "syphon": "siphon", + "syphoned": "siphoned", + "syphoning": "siphoning", + "syphons": "siphons", + "systematisation": "systematization", + "systematise": "systematize", + "systematised": "systematized", + "systematises": "systematizes", + "systematising": "systematizing", + "tantalise": "tantalize", + "tantalised": "tantalized", + "tantalises": "tantalizes", + "tantalising": "tantalizing", + "tantalisingly": "tantalizingly", + "tasselled": "tasseled", + "technicolour": "technicolor", + "temporise": "temporize", + "temporised": "temporized", + "temporises": "temporizes", + "temporising": "temporizing", + "tenderise": "tenderize", + "tenderised": "tenderized", + "tenderises": "tenderizes", + "tenderising": "tenderizing", + "terrorise": "terrorize", + "terrorised": "terrorized", + "terrorises": "terrorizes", + "terrorising": "terrorizing", + "theatre": "theater", + "theatregoer": "theatergoer", + "theatregoers": "theatergoers", + "theatres": 
"theaters", + "theorise": "theorize", + "theorised": "theorized", + "theorises": "theorizes", + "theorising": "theorizing", + "tonne": "ton", + "tonnes": "tons", + "towelled": "toweled", + "towelling": "toweling", + "toxaemia": "toxemia", + "tranquillise": "tranquilize", + "tranquillised": "tranquilized", + "tranquilliser": "tranquilizer", + "tranquillisers": "tranquilizers", + "tranquillises": "tranquilizes", + "tranquillising": "tranquilizing", + "tranquillity": "tranquility", + "tranquillize": "tranquilize", + "tranquillized": "tranquilized", + "tranquillizer": "tranquilizer", + "tranquillizers": "tranquilizers", + "tranquillizes": "tranquilizes", + "tranquillizing": "tranquilizing", + "tranquilly": "tranquility", + "transistorised": "transistorized", + "traumatise": "traumatize", + "traumatised": "traumatized", + "traumatises": "traumatizes", + "traumatising": "traumatizing", + "travelled": "traveled", + "traveller": "traveler", + "travellers": "travelers", + "travelling": "traveling", + "travelog": "travelogue", + "travelogs": "travelogues", + "trialled": "trialed", + "trialling": "trialing", + "tricolour": "tricolor", + "tricolours": "tricolors", + "trivialise": "trivialize", + "trivialised": "trivialized", + "trivialises": "trivializes", + "trivialising": "trivializing", + "tumour": "tumor", + "tumours": "tumors", + "tunnelled": "tunneled", + "tunnelling": "tunneling", + "tyrannise": "tyrannize", + "tyrannised": "tyrannized", + "tyrannises": "tyrannizes", + "tyrannising": "tyrannizing", + "tyre": "tire", + "tyres": "tires", + "unauthorised": "unauthorized", + "uncivilised": "uncivilized", + "underutilised": "underutilized", + "unequalled": "unequaled", + "unfavourable": "unfavorable", + "unfavourably": "unfavorably", + "unionisation": "unionization", + "unionise": "unionize", + "unionised": "unionized", + "unionises": "unionizes", + "unionising": "unionizing", + "unorganised": "unorganized", + "unravelled": "unraveled", + "unravelling": "unraveling", + "unrecognisable": "unrecognizable", + "unrecognised": "unrecognized", + "unrivalled": "unrivaled", + "unsavoury": "unsavory", + "untrammelled": "untrammeled", + "urbanisation": "urbanization", + "urbanise": "urbanize", + "urbanised": "urbanized", + "urbanises": "urbanizes", + "urbanising": "urbanizing", + "utilisable": "utilizable", + "utilisation": "utilization", + "utilise": "utilize", + "utilised": "utilized", + "utilises": "utilizes", + "utilising": "utilizing", + "valour": "valor", + "vandalise": "vandalize", + "vandalised": "vandalized", + "vandalises": "vandalizes", + "vandalising": "vandalizing", + "vaporisation": "vaporization", + "vaporise": "vaporize", + "vaporised": "vaporized", + "vaporises": "vaporizes", + "vaporising": "vaporizing", + "vapour": "vapor", + "vapours": "vapors", + "verbalise": "verbalize", + "verbalised": "verbalized", + "verbalises": "verbalizes", + "verbalising": "verbalizing", + "victimisation": "victimization", + "victimise": "victimize", + "victimised": "victimized", + "victimises": "victimizes", + "victimising": "victimizing", + "videodisc": "videodisk", + "videodiscs": "videodisks", + "vigour": "vigor", + "visualisation": "visualization", + "visualisations": "visualizations", + "visualise": "visualize", + "visualised": "visualized", + "visualises": "visualizes", + "visualising": "visualizing", + "vocalisation": "vocalization", + "vocalisations": "vocalizations", + "vocalise": "vocalize", + "vocalised": "vocalized", + "vocalises": "vocalizes", + "vocalising": "vocalizing", + "vulcanised": 
"vulcanized", + "vulgarisation": "vulgarization", + "vulgarise": "vulgarize", + "vulgarised": "vulgarized", + "vulgarises": "vulgarizes", + "vulgarising": "vulgarizing", + "waggon": "wagon", + "waggons": "wagons", + "watercolour": "watercolor", + "watercolours": "watercolors", + "weaselled": "weaseled", + "weaselling": "weaseling", + "westernisation": "westernization", + "westernise": "westernize", + "westernised": "westernized", + "westernises": "westernizes", + "westernising": "westernizing", + "womanise": "womanize", + "womanised": "womanized", + "womaniser": "womanizer", + "womanisers": "womanizers", + "womanises": "womanizes", + "womanising": "womanizing", + "woollen": "woolen", + "woollens": "woolens", + "woollies": "woolies", + "woolly": "wooly", + "worshipped": "worshiped", + "worshipping": "worshiping", + "worshipper": "worshiper", + "yodelled": "yodeled", + "yodelling": "yodeling", + "yoghourt": "yogurt", + "yoghourts": "yogurts", + "yoghurt": "yogurt", + "yoghurts": "yogurts", + "mhm": "hmm", + "mmm": "hmm" +} \ No newline at end of file diff --git a/lmms_eval/tasks/gigaspeech/whisper_normalizer/english.py b/lmms_eval/tasks/gigaspeech/whisper_normalizer/english.py new file mode 100644 index 00000000..7102f2ec --- /dev/null +++ b/lmms_eval/tasks/gigaspeech/whisper_normalizer/english.py @@ -0,0 +1,529 @@ +import json +import os +import re +from fractions import Fraction +from typing import Iterator, List, Match, Optional, Union + +from more_itertools import windowed # TODO: new package + +from .basic import remove_symbols_and_diacritics + + +class EnglishNumberNormalizer: + """ + Convert any spelled-out numbers into arabic numbers, while handling: + + - remove any commas + - keep the suffixes such as: `1960s`, `274th`, `32nd`, etc. + - spell out currency symbols after the number. e.g. 
`$20 million` -> `20000000 dollars` + - spell out `one` and `ones` + - interpret successive single-digit numbers as nominal: `one oh one` -> `101` + """ + + def __init__(self): + super().__init__() + + self.zeros = {"o", "oh", "zero"} + self.ones = { + name: i + for i, name in enumerate( + [ + "one", + "two", + "three", + "four", + "five", + "six", + "seven", + "eight", + "nine", + "ten", + "eleven", + "twelve", + "thirteen", + "fourteen", + "fifteen", + "sixteen", + "seventeen", + "eighteen", + "nineteen", + ], + start=1, + ) + } + self.ones_plural = {"sixes" if name == "six" else name + "s": (value, "s") for name, value in self.ones.items()} + self.ones_ordinal = { + "zeroth": (0, "th"), + "first": (1, "st"), + "second": (2, "nd"), + "third": (3, "rd"), + "fifth": (5, "th"), + "twelfth": (12, "th"), + **{name + ("h" if name.endswith("t") else "th"): (value, "th") for name, value in self.ones.items() if value > 3 and value != 5 and value != 12}, + } + self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal} + + self.tens = { + "twenty": 20, + "thirty": 30, + "forty": 40, + "fifty": 50, + "sixty": 60, + "seventy": 70, + "eighty": 80, + "ninety": 90, + } + self.tens_plural = {name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()} + self.tens_ordinal = {name.replace("y", "ieth"): (value, "th") for name, value in self.tens.items()} + self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal} + + self.multipliers = { + "hundred": 100, + "thousand": 1_000, + "million": 1_000_000, + "billion": 1_000_000_000, + "trillion": 1_000_000_000_000, + "quadrillion": 1_000_000_000_000_000, + "quintillion": 1_000_000_000_000_000_000, + "sextillion": 1_000_000_000_000_000_000_000, + "septillion": 1_000_000_000_000_000_000_000_000, + "octillion": 1_000_000_000_000_000_000_000_000_000, + "nonillion": 1_000_000_000_000_000_000_000_000_000_000, + "decillion": 1_000_000_000_000_000_000_000_000_000_000_000, + } + self.multipliers_plural = {name + "s": (value, "s") for name, value in self.multipliers.items()} + self.multipliers_ordinal = {name + "th": (value, "th") for name, value in self.multipliers.items()} + self.multipliers_suffixed = { + **self.multipliers_plural, + **self.multipliers_ordinal, + } + self.decimals = {*self.ones, *self.tens, *self.zeros} + + self.preceding_prefixers = { + "minus": "-", + "negative": "-", + "plus": "+", + "positive": "+", + } + self.following_prefixers = { + "pound": "ยฃ", + "pounds": "ยฃ", + "euro": "โ‚ฌ", + "euros": "โ‚ฌ", + "dollar": "$", + "dollars": "$", + "cent": "ยข", + "cents": "ยข", + } + self.prefixes = set(list(self.preceding_prefixers.values()) + list(self.following_prefixers.values())) + self.suffixers = { + "per": {"cent": "%"}, + "percent": "%", + } + self.specials = {"and", "double", "triple", "point"} + + self.words = set( + [ + key + for mapping in [ + self.zeros, + self.ones, + self.ones_suffixed, + self.tens, + self.tens_suffixed, + self.multipliers, + self.multipliers_suffixed, + self.preceding_prefixers, + self.following_prefixers, + self.suffixers, + self.specials, + ] + for key in mapping + ] + ) + self.literal_words = {"one", "ones"} + + def process_words(self, words: List[str]) -> Iterator[str]: + prefix: Optional[str] = None + value: Optional[Union[str, int]] = None + skip = False + + def to_fraction(s: str): + try: + return Fraction(s) + except ValueError: + return None + + def output(result: Union[str, int]): + nonlocal prefix, value + result = str(result) + if prefix is not None: + result = prefix + result + value = 
None + prefix = None + return result + + if len(words) == 0: + return + + for prev, current, next in windowed([None] + words + [None], 3): + if skip: + skip = False + continue + + next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next) + has_prefix = current[0] in self.prefixes + current_without_prefix = current[1:] if has_prefix else current + if re.match(r"^\d+(\.\d+)?$", current_without_prefix): + # arabic numbers (potentially with signs and fractions) + f = to_fraction(current_without_prefix) + assert f is not None + if value is not None: + if isinstance(value, str) and value.endswith("."): + # concatenate decimals / ip address components + value = str(value) + str(current) + continue + else: + yield output(value) + + prefix = current[0] if has_prefix else prefix + if f.denominator == 1: + value = f.numerator # store integers as int + else: + value = current_without_prefix + elif current not in self.words: + # non-numeric words + if value is not None: + yield output(value) + yield output(current) + elif current in self.zeros: + value = str(value or "") + "0" + elif current in self.ones: + ones = self.ones[current] + + if value is None: + value = ones + elif isinstance(value, str) or prev in self.ones: + if prev in self.tens and ones < 10: # replace the last zero with the digit + assert value[-1] == "0" + value = value[:-1] + str(ones) + else: + value = str(value) + str(ones) + elif ones < 10: + if value % 10 == 0: + value += ones + else: + value = str(value) + str(ones) + else: # eleven to nineteen + if value % 100 == 0: + value += ones + else: + value = str(value) + str(ones) + elif current in self.ones_suffixed: + # ordinal or cardinal; yield the number right away + ones, suffix = self.ones_suffixed[current] + if value is None: + yield output(str(ones) + suffix) + elif isinstance(value, str) or prev in self.ones: + if prev in self.tens and ones < 10: + assert value[-1] == "0" + yield output(value[:-1] + str(ones) + suffix) + else: + yield output(str(value) + str(ones) + suffix) + elif ones < 10: + if value % 10 == 0: + yield output(str(value + ones) + suffix) + else: + yield output(str(value) + str(ones) + suffix) + else: # eleven to nineteen + if value % 100 == 0: + yield output(str(value + ones) + suffix) + else: + yield output(str(value) + str(ones) + suffix) + value = None + elif current in self.tens: + tens = self.tens[current] + if value is None: + value = tens + elif isinstance(value, str): + value = str(value) + str(tens) + else: + if value % 100 == 0: + value += tens + else: + value = str(value) + str(tens) + elif current in self.tens_suffixed: + # ordinal or cardinal; yield the number right away + tens, suffix = self.tens_suffixed[current] + if value is None: + yield output(str(tens) + suffix) + elif isinstance(value, str): + yield output(str(value) + str(tens) + suffix) + else: + if value % 100 == 0: + yield output(str(value + tens) + suffix) + else: + yield output(str(value) + str(tens) + suffix) + elif current in self.multipliers: + multiplier = self.multipliers[current] + if value is None: + value = multiplier + elif isinstance(value, str) or value == 0: + f = to_fraction(value) + p = f * multiplier if f is not None else None + if f is not None and p.denominator == 1: + value = p.numerator + else: + yield output(value) + value = multiplier + else: + before = value // 1000 * 1000 + residual = value % 1000 + value = before + residual * multiplier + elif current in self.multipliers_suffixed: + multiplier, suffix = self.multipliers_suffixed[current] + if 
value is None: + yield output(str(multiplier) + suffix) + elif isinstance(value, str): + f = to_fraction(value) + p = f * multiplier if f is not None else None + if f is not None and p.denominator == 1: + yield output(str(p.numerator) + suffix) + else: + yield output(value) + yield output(str(multiplier) + suffix) + else: # int + before = value // 1000 * 1000 + residual = value % 1000 + value = before + residual * multiplier + yield output(str(value) + suffix) + value = None + elif current in self.preceding_prefixers: + # apply prefix (positive, minus, etc.) if it precedes a number + if value is not None: + yield output(value) + + if next in self.words or next_is_numeric: + prefix = self.preceding_prefixers[current] + else: + yield output(current) + elif current in self.following_prefixers: + # apply prefix (dollars, cents, etc.) only after a number + if value is not None: + prefix = self.following_prefixers[current] + yield output(value) + else: + yield output(current) + elif current in self.suffixers: + # apply suffix symbols (percent -> '%') + if value is not None: + suffix = self.suffixers[current] + if isinstance(suffix, dict): + if next in suffix: + yield output(str(value) + suffix[next]) + skip = True + else: + yield output(value) + yield output(current) + else: + yield output(str(value) + suffix) + else: + yield output(current) + elif current in self.specials: + if next not in self.words and not next_is_numeric: + # apply special handling only if the next word can be numeric + if value is not None: + yield output(value) + yield output(current) + elif current == "and": + # ignore "and" after hundreds, thousands, etc. + if prev not in self.multipliers: + if value is not None: + yield output(value) + yield output(current) + elif current == "double" or current == "triple": + if next in self.ones or next in self.zeros: + repeats = 2 if current == "double" else 3 + ones = self.ones.get(next, 0) + value = str(value or "") + str(ones) * repeats + skip = True + else: + if value is not None: + yield output(value) + yield output(current) + elif current == "point": + if next in self.decimals or next_is_numeric: + value = str(value or "") + "." 
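# --- Editorial note, not part of the patch: the long elif chain above is a
# small accumulator state machine. `value` holds the number built so far,
# `prefix` a pending sign or currency symbol, and `skip` swallows a lookahead
# token that was already consumed. Hand-traced behavior, using a hypothetical
# instance `norm = EnglishNumberNormalizer()`:
#   norm("one")                           -> "one"        # postprocess keeps 1/1s readable
#   norm("one oh one")                    -> "101"        # successive single digits concatenate
#   norm("twenty two")                    -> "22"
#   norm("two thousand and twenty four")  -> "2024"       # "and" after a multiplier is dropped
#   norm("$20 million")                   -> "$20000000"  # prefix re-attached on output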
+ else: + # should all have been covered at this point + raise ValueError(f"Unexpected token: {current}") + else: + # all should have been covered at this point + raise ValueError(f"Unexpected token: {current}") + + if value is not None: + yield output(value) + + def preprocess(self, s: str): + # replace " and a half" with " point five" + results = [] + + segments = re.split(r"\band\s+a\s+half\b", s) + for i, segment in enumerate(segments): + if len(segment.strip()) == 0: + continue + if i == len(segments) - 1: + results.append(segment) + else: + results.append(segment) + last_word = segment.rsplit(maxsplit=2)[-1] + if last_word in self.decimals or last_word in self.multipliers: + results.append("point five") + else: + results.append("and a half") + + s = " ".join(results) + + # put a space at number/letter boundary + s = re.sub(r"([a-z])([0-9])", r"\1 \2", s) + s = re.sub(r"([0-9])([a-z])", r"\1 \2", s) + + # but remove spaces which could be a suffix + s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s) + + return s + + def postprocess(self, s: str): + def combine_cents(m: Match): + try: + currency = m.group(1) + integer = m.group(2) + cents = int(m.group(3)) + return f"{currency}{integer}.{cents:02d}" + except ValueError: + return m.string + + def extract_cents(m: Match): + try: + return f"ยข{int(m.group(1))}" + except ValueError: + return m.string + + # apply currency postprocessing; "$2 and ยข7" -> "$2.07" + s = re.sub(r"([โ‚ฌยฃ$])([0-9]+) (?:and )?ยข([0-9]{1,2})\b", combine_cents, s) + s = re.sub(r"[โ‚ฌยฃ$]0.([0-9]{1,2})\b", extract_cents, s) + + # write "one(s)" instead of "1(s)", just for the readability + s = re.sub(r"\b1(s?)\b", r"one\1", s) + + return s + + def __call__(self, s: str): + s = self.preprocess(s) + s = " ".join(word for word in self.process_words(s.split()) if word is not None) + s = self.postprocess(s) + + return s + + +class EnglishSpellingNormalizer: + """ + Applies British-American spelling mappings as listed in [1]. + + [1] https://www.tysto.com/uk-us-spelling-list.html + """ + + def __init__(self): + mapping_path = os.path.join(os.path.dirname(__file__), "english.json") + self.mapping = json.load(open(mapping_path)) + + def __call__(self, s: str): + return " ".join(self.mapping.get(word, word) for word in s.split()) + + +class EnglishTextNormalizer: + def __init__(self): + self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um)\b" + self.replacers = { + # common contractions + r"\bwon't\b": "will not", + r"\bcan't\b": "can not", + r"\blet's\b": "let us", + r"\bain't\b": "aint", + r"\by'all\b": "you all", + r"\bwanna\b": "want to", + r"\bgotta\b": "got to", + r"\bgonna\b": "going to", + r"\bi'ma\b": "i am going to", + r"\bimma\b": "i am going to", + r"\bwoulda\b": "would have", + r"\bcoulda\b": "could have", + r"\bshoulda\b": "should have", + r"\bma'am\b": "madam", + # contractions in titles/prefixes + r"\bmr\b": "mister ", + r"\bmrs\b": "missus ", + r"\bst\b": "saint ", + r"\bdr\b": "doctor ", + r"\bprof\b": "professor ", + r"\bcapt\b": "captain ", + r"\bgov\b": "governor ", + r"\bald\b": "alderman ", + r"\bgen\b": "general ", + r"\bsen\b": "senator ", + r"\brep\b": "representative ", + r"\bpres\b": "president ", + r"\brev\b": "reverend ", + r"\bhon\b": "honorable ", + r"\basst\b": "assistant ", + r"\bassoc\b": "associate ", + r"\blt\b": "lieutenant ", + r"\bcol\b": "colonel ", + r"\bjr\b": "junior ", + r"\bsr\b": "senior ", + r"\besq\b": "esquire ", + # prefect tenses, ideally it should be any past participles, but it's harder.. 
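# --- Editorial note, not part of the patch: `replacers` is applied via re.sub
# in insertion order (dicts preserve it since Python 3.7), so the specific
# perfect-tense patterns here must stay ahead of the generic suffix rules at
# the end of the dict; e.g. "she'd been" becomes "she had been" via `'d been\b`
# rather than "she would been" via the later catch-all `'d\b` -> " would".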
+ r"'d been\b": " had been", + r"'s been\b": " has been", + r"'d gone\b": " had gone", + r"'s gone\b": " has gone", + r"'d done\b": " had done", # "'s done" is ambiguous + r"'s got\b": " has got", + # general contractions + r"n't\b": " not", + r"'re\b": " are", + r"'s\b": " is", + r"'d\b": " would", + r"'ll\b": " will", + r"'t\b": " not", + r"'ve\b": " have", + r"'m\b": " am", + } + self.standardize_numbers = EnglishNumberNormalizer() + self.standardize_spellings = EnglishSpellingNormalizer() + + def __call__(self, s: str): + s = s.lower() + + s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets + s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parenthesis + s = re.sub(self.ignore_patterns, "", s) + s = re.sub(r"\s+'", "'", s) # when there's a space before an apostrophe + + for pattern, replacement in self.replacers.items(): + s = re.sub(pattern, replacement, s) + + s = re.sub(r"(\d),(\d)", r"\1\2", s) # remove commas between digits + s = re.sub(r"\.([^0-9]|$)", r" \1", s) # remove periods not followed by numbers + s = remove_symbols_and_diacritics(s, keep=".%$ยขโ‚ฌยฃ") # keep numeric symbols + + s = self.standardize_numbers(s) + s = self.standardize_spellings(s) + + # now remove prefix/suffix symbols that are not preceded/followed by numbers + s = re.sub(r"[.$ยขโ‚ฌยฃ]([^0-9])", r" \1", s) + s = re.sub(r"([^0-9])%", r"\1 ", s) + + s = re.sub(r"\s+", " ", s) # replace any successive whitespaces with a space + + return s diff --git a/lmms_eval/tasks/librispeech/cn_tn.py b/lmms_eval/tasks/librispeech/cn_tn.py new file mode 100644 index 00000000..18128d17 --- /dev/null +++ b/lmms_eval/tasks/librispeech/cn_tn.py @@ -0,0 +1,1174 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# copied from https://github.com/speechio/chinese_text_normalization/blob/master/python/cn_tn.py +# Authors: +# 2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git) +# 2019.9 - 2022 Jiayu DU +# +# requirements: +# - python 3.X +# notes: python 2.X WILL fail or produce misleading results + +import argparse +import csv +import os +import re +import string +import sys + +# ================================================================================ # +# basic constant +# ================================================================================ # +CHINESE_DIGIS = "้›ถไธ€ไบŒไธ‰ๅ››ไบ”ๅ…ญไธƒๅ…ซไน" +BIG_CHINESE_DIGIS_SIMPLIFIED = "้›ถๅฃน่ดฐๅ่‚†ไผ้™†ๆŸ’ๆŒ็Ž–" +BIG_CHINESE_DIGIS_TRADITIONAL = "้›ถๅฃน่ฒณๅƒ่‚†ไผ้™ธๆŸ’ๆŒ็Ž–" +SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = "ๅ็™พๅƒไธ‡" +SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = "ๆ‹พไฝฐไปŸ่ฌ" +LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = "ไบฟๅ…†ไบฌๅž“็งญ็ฉฐๆฒŸๆถงๆญฃ่ฝฝ" +LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = "ๅ„„ๅ…†ไบฌๅž“็งญ็ฉฐๆบๆพ—ๆญฃ่ผ‰" +SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = "ๅ็™พๅƒไธ‡" +SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = "ๆ‹พไฝฐไปŸ่ฌ" + +ZERO_ALT = "ใ€‡" +ONE_ALT = "ๅนบ" +TWO_ALTS = ["ไธค", "ๅ…ฉ"] + +POSITIVE = ["ๆญฃ", "ๆญฃ"] +NEGATIVE = ["่ดŸ", "่ฒ "] +POINT = ["็‚น", "้ปž"] +# PLUS = [u'ๅŠ ', u'ๅŠ '] +# SIL = [u'ๆ ', u'ๆง“'] + +FILLER_CHARS = ["ๅ‘ƒ", "ๅ•Š"] + +ER_WHITELIST = "(ๅ„ฟๅฅณ|ๅ„ฟๅญ|ๅ„ฟๅญ™|ๅฅณๅ„ฟ|ๅ„ฟๅชณ|ๅฆปๅ„ฟ|" "่ƒŽๅ„ฟ|ๅฉดๅ„ฟ|ๆ–ฐ็”Ÿๅ„ฟ|ๅฉดๅนผๅ„ฟ|ๅนผๅ„ฟ|ๅฐ‘ๅ„ฟ|ๅฐๅ„ฟ|ๅ„ฟๆญŒ|ๅ„ฟ็ซฅ|ๅ„ฟ็ง‘|ๆ‰˜ๅ„ฟๆ‰€|ๅญคๅ„ฟ|" "ๅ„ฟๆˆ|ๅ„ฟๅŒ–|ๅฐๅ„ฟๅบ„|้นฟๅ„ฟๅฒ›|ๆญฃๅ„ฟๅ…ซ็ป|ๅŠๅ„ฟ้ƒŽๅฝ“|็”Ÿๅ„ฟ่‚ฒๅฅณ|ๆ‰˜ๅ„ฟๅธฆๅฅณ|ๅ…ปๅ„ฟ้˜ฒ่€|็—ดๅ„ฟๅ‘†ๅฅณ|" "ไฝณๅ„ฟไฝณๅฆ‡|ๅ„ฟๆ€œๅ…ฝๆ‰ฐ|ๅ„ฟๆ— ๅธธ็ˆถ|ๅ„ฟไธๅซŒๆฏไธ‘|ๅ„ฟ่กŒๅƒ้‡Œๆฏๆ‹…ๅฟง|ๅ„ฟๅคงไธ็”ฑ็ˆท|่‹ไนžๅ„ฟ)" +ER_WHITELIST_PATTERN = re.compile(ER_WHITELIST) + +# 
ไธญๆ–‡ๆ•ฐๅญ—็ณป็ปŸ็ฑปๅž‹ +NUMBERING_TYPES = ["low", "mid", "high"] + +CURRENCY_NAMES = "(ไบบๆฐ‘ๅธ|็พŽๅ…ƒ|ๆ—ฅๅ…ƒ|่‹ฑ้•‘|ๆฌงๅ…ƒ|้ฉฌๅ…‹|ๆณ•้ƒŽ|ๅŠ ๆ‹ฟๅคงๅ…ƒ|ๆพณๅ…ƒ|ๆธฏๅธ|ๅ…ˆไปค|่Šฌๅ…ฐ้ฉฌๅ…‹|็ˆฑๅฐ”ๅ…ฐ้•‘|" "้‡Œๆ‹‰|่ทๅ…ฐ็›พ|ๅŸƒๆ–ฏๅบ“ๅคš|ๆฏ”ๅกžๅก”|ๅฐๅฐผ็›พ|ๆž—ๅ‰็‰น|ๆ–ฐ่ฅฟๅ…ฐๅ…ƒ|ๆฏ”็ดข|ๅขๅธƒ|ๆ–ฐๅŠ ๅกๅ…ƒ|้Ÿฉๅ…ƒ|ๆณฐ้“ข)" +CURRENCY_UNITS = "((ไบฟ|ๅƒไธ‡|็™พไธ‡|ไธ‡|ๅƒ|็™พ)|(ไบฟ|ๅƒไธ‡|็™พไธ‡|ไธ‡|ๅƒ|็™พ|)ๅ…ƒ|(ไบฟ|ๅƒไธ‡|็™พไธ‡|ไธ‡|ๅƒ|็™พ|)ๅ—|่ง’|ๆฏ›|ๅˆ†)" +COM_QUANTIFIERS = ( + "(ๅŒน|ๅผ |ๅบง|ๅ›ž|ๅœบ|ๅฐพ|ๆก|ไธช|้ฆ–|้˜™|้˜ต|็ฝ‘|็‚ฎ|้กถ|ไธ˜|ๆฃต|ๅช|ๆ”ฏ|่ขญ|่พ†|ๆŒ‘|ๆ‹…|้ข—|ๅฃณ|็ช |ๆ›ฒ|ๅข™|็พค|่…”|" + "็ ฃ|ๅบง|ๅฎข|่ดฏ|ๆ‰Ž|ๆ†|ๅˆ€|ไปค|ๆ‰“|ๆ‰‹|็ฝ—|ๅก|ๅฑฑ|ๅฒญ|ๆฑŸ|ๆบช|้’Ÿ|้˜Ÿ|ๅ•|ๅŒ|ๅฏน|ๅ‡บ|ๅฃ|ๅคด|่„š|ๆฟ|่ทณ|ๆž|ไปถ|่ดด|" + "้’ˆ|็บฟ|็ฎก|ๅ|ไฝ|่บซ|ๅ ‚|่ฏพ|ๆœฌ|้กต|ๅฎถ|ๆˆท|ๅฑ‚|ไธ|ๆฏซ|ๅŽ˜|ๅˆ†|้’ฑ|ไธค|ๆ–ค|ๆ‹…|้“ข|็Ÿณ|้’ง|้”ฑ|ๅฟฝ|(ๅƒ|ๆฏซ|ๅพฎ)ๅ…‹|" + "ๆฏซ|ๅŽ˜|ๅˆ†|ๅฏธ|ๅฐบ|ไธˆ|้‡Œ|ๅฏป|ๅธธ|้“บ|็จ‹|(ๅƒ|ๅˆ†|ๅŽ˜|ๆฏซ|ๅพฎ)็ฑณ|ๆ’ฎ|ๅ‹บ|ๅˆ|ๅ‡|ๆ–—|็Ÿณ|็›˜|็ข—|็ขŸ|ๅ |ๆกถ|็ฌผ|็›†|" + "็›’|ๆฏ|้’Ÿ|ๆ–›|้”…|็ฐ‹|็ฏฎ|็›˜|ๆกถ|็ฝ|็“ถ|ๅฃถ|ๅฎ|็›|็ฎฉ|็ฎฑ|็…ฒ|ๅ•–|่ข‹|้’ต|ๅนด|ๆœˆ|ๆ—ฅ|ๅญฃ|ๅˆป|ๆ—ถ|ๅ‘จ|ๅคฉ|็ง’|ๅˆ†|ๆ—ฌ|" + "็บช|ๅฒ|ไธ–|ๆ›ด|ๅคœ|ๆ˜ฅ|ๅค|็ง‹|ๅ†ฌ|ไปฃ|ไผ|่พˆ|ไธธ|ๆณก|็ฒ’|้ข—|ๅนข|ๅ †|ๆก|ๆ น|ๆ”ฏ|้“|้ข|็‰‡|ๅผ |้ข—|ๅ—)" +) + + +# Punctuation information are based on Zhon project (https://github.com/tsroten/zhon.git) +CN_PUNCS_STOP = "๏ผ๏ผŸ๏ฝกใ€‚" +CN_PUNCS_NONSTOP = "๏ผ‚๏ผƒ๏ผ„๏ผ…๏ผ†๏ผ‡๏ผˆ๏ผ‰๏ผŠ๏ผ‹๏ผŒ๏ผ๏ผ๏ผš๏ผ›๏ผœ๏ผ๏ผž๏ผ ๏ผป๏ผผ๏ผฝ๏ผพ๏ผฟ๏ฝ€๏ฝ›๏ฝœ๏ฝ๏ฝž๏ฝŸ๏ฝ ๏ฝข๏ฝฃ๏ฝคใ€ใ€ƒใ€Šใ€‹ใ€Œใ€ใ€Žใ€ใ€ใ€‘ใ€”ใ€•ใ€–ใ€—ใ€˜ใ€™ใ€šใ€›ใ€œใ€ใ€žใ€Ÿใ€ฐใ€พใ€ฟโ€“โ€”โ€˜โ€™โ€›โ€œโ€โ€žโ€Ÿโ€ฆโ€ง๏นยทใ€ˆใ€‰-" +CN_PUNCS = CN_PUNCS_STOP + CN_PUNCS_NONSTOP + +PUNCS = CN_PUNCS + string.punctuation +PUNCS_TRANSFORM = str.maketrans(PUNCS, " " * len(PUNCS), "") # replace puncs with space + + +# https://zh.wikipedia.org/wiki/ๅ…จ่กŒๅ’ŒๅŠ่กŒ +QJ2BJ = { + "ใ€€": " ", + "๏ผ": "!", + "๏ผ‚": '"', + "๏ผƒ": "#", + "๏ผ„": "$", + "๏ผ…": "%", + "๏ผ†": "&", + "๏ผ‡": "'", + "๏ผˆ": "(", + "๏ผ‰": ")", + "๏ผŠ": "*", + "๏ผ‹": "+", + "๏ผŒ": ",", + "๏ผ": "-", + "๏ผŽ": ".", + "๏ผ": "/", + "๏ผ": "0", + "๏ผ‘": "1", + "๏ผ’": "2", + "๏ผ“": "3", + "๏ผ”": "4", + "๏ผ•": "5", + "๏ผ–": "6", + "๏ผ—": "7", + "๏ผ˜": "8", + "๏ผ™": "9", + "๏ผš": ":", + "๏ผ›": ";", + "๏ผœ": "<", + "๏ผ": "=", + "๏ผž": ">", + "๏ผŸ": "?", + "๏ผ ": "@", + "๏ผก": "A", + "๏ผข": "B", + "๏ผฃ": "C", + "๏ผค": "D", + "๏ผฅ": "E", + "๏ผฆ": "F", + "๏ผง": "G", + "๏ผจ": "H", + "๏ผฉ": "I", + "๏ผช": "J", + "๏ผซ": "K", + "๏ผฌ": "L", + "๏ผญ": "M", + "๏ผฎ": "N", + "๏ผฏ": "O", + "๏ผฐ": "P", + "๏ผฑ": "Q", + "๏ผฒ": "R", + "๏ผณ": "S", + "๏ผด": "T", + "๏ผต": "U", + "๏ผถ": "V", + "๏ผท": "W", + "๏ผธ": "X", + "๏ผน": "Y", + "๏ผบ": "Z", + "๏ผป": "[", + "๏ผผ": "\\", + "๏ผฝ": "]", + "๏ผพ": "^", + "๏ผฟ": "_", + "๏ฝ€": "`", + "๏ฝ": "a", + "๏ฝ‚": "b", + "๏ฝƒ": "c", + "๏ฝ„": "d", + "๏ฝ…": "e", + "๏ฝ†": "f", + "๏ฝ‡": "g", + "๏ฝˆ": "h", + "๏ฝ‰": "i", + "๏ฝŠ": "j", + "๏ฝ‹": "k", + "๏ฝŒ": "l", + "๏ฝ": "m", + "๏ฝŽ": "n", + "๏ฝ": "o", + "๏ฝ": "p", + "๏ฝ‘": "q", + "๏ฝ’": "r", + "๏ฝ“": "s", + "๏ฝ”": "t", + "๏ฝ•": "u", + "๏ฝ–": "v", + "๏ฝ—": "w", + "๏ฝ˜": "x", + "๏ฝ™": "y", + "๏ฝš": "z", + "๏ฝ›": "{", + "๏ฝœ": "|", + "๏ฝ": "}", + "๏ฝž": "~", +} +QJ2BJ_TRANSFORM = str.maketrans("".join(QJ2BJ.keys()), "".join(QJ2BJ.values()), "") + + +# 2013 China National Standard: https://zh.wikipedia.org/wiki/้€š็”จ่ง„่Œƒๆฑ‰ๅญ—่กจ, raw resources: +# https://github.com/mozillazg/pinyin-data/blob/master/kMandarin_8105.txt with 8105 chinese chars in total +CN_CHARS_COMMON = ( + 
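# --- Editorial note, not part of the patch: the two maketrans tables above are
# meant for str.translate; hand-checked illustrations (written with \u escapes
# so they stay ASCII-safe here):
#   "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13".translate(QJ2BJ_TRANSFORM)
#       -> "ABC123"                       # full-width letters/digits -> half-width
#   "\u4F60\u597D\uFF0C\u4E16\u754C\uFF01".translate(PUNCS_TRANSFORM)
#       -> "\u4F60\u597D \u4E16\u754C "   # each CJK or ASCII punctuation mark -> space
# CN_CHARS_COMMON below enumerates the 8105 characters of the 2013 China
# National Standard referenced in the comment above.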
"ไธ€ไธไธƒไธ‡ไธˆไธ‰ไธŠไธ‹ไธไธŽไธไธไธ‘ไธ“ไธ”ไธ•ไธ–ไธ˜ไธ™ไธšไธ›ไธœไธไธžไธขไธคไธฅไธงไธชไธซไธญไธฐไธฒไธดไธธไธนไธบไธปไธฝไธพ" + "ไน‚ไนƒไน…ไนˆไน‰ไน‹ไนŒไนไนŽไนไนไน’ไน“ไน”ไน–ไน˜ไน™ไนœไนไนžไนŸไน ไนกไนฆไนฉไนฐไนฑไนณไนธไนพไบ†ไบˆไบ‰ไบ‹ไบŒไบไบŽไบไบ‘ไบ’" + "ไบ“ไบ”ไบ•ไบ˜ไบšไบ›ไบŸไบกไบขไบคไบฅไบฆไบงไบจไบฉไบซไบฌไบญไบฎไบฒไบณไบตไบถไบธไบนไบบไบฟไป€ไปไป‚ไปƒไป„ไป…ไป†ไป‡ไป‰ไปŠไป‹ไปไปŽ" + "ไป‘ไป“ไป”ไป•ไป–ไป—ไป˜ไป™ไปไปžไปŸไปกไปฃไปคไปฅไปจไปชไปซไปฌไปฐไปฒไปณไปตไปถไปทไปปไปฝไปฟไผไผˆไผ‰ไผŠไผ‹ไผไผŽไผไผไผ‘ไผ—ไผ˜" + "ไผ™ไผšไผ›ไผžไผŸไผ ไผขไผฃไผคไผฅไผฆไผงไผชไผซไผญไผฏไผฐไผฒไผดไผถไผธไผบไผผไผฝไผพไฝไฝƒไฝ†ไฝไฝŽไฝไฝไฝ‘ไฝ“ไฝ•ไฝ–ไฝ—ไฝ˜ไฝ™ไฝš" + "ไฝ›ไฝœไฝไฝžไฝŸไฝ ไฝฃไฝคไฝฅไฝฉไฝฌไฝฏไฝฐไฝณไฝดไฝถไฝธไฝบไฝปไฝผไฝฝไฝพไฝฟไพไพ‚ไพƒไพ„ไพˆไพ‰ไพ‹ไพไพไพ‘ไพ”ไพ—ไพ˜ไพ›ไพไพ ไพฃ" + "ไพฅไพฆไพงไพจไพฉไพชไพฌไพฎไพฏไพดไพตไพนไพฟไฟƒไฟ„ไฟ…ไฟŠไฟไฟŽไฟไฟไฟ‘ไฟ—ไฟ˜ไฟ™ไฟšไฟœไฟไฟžไฟŸไฟกไฟฃไฟฆไฟจไฟฉไฟชไฟซไฟญไฟฎไฟฏ" + "ไฟฑไฟณไฟตไฟถไฟธไฟบไฟพๅ€Œๅ€ๅ€ๅ€’ๅ€“ๅ€”ๅ€•ๅ€˜ๅ€™ๅ€šๅ€œๅ€žๅ€Ÿๅ€กๅ€ฅๅ€ฆๅ€งๅ€จๅ€ฉๅ€ชๅ€ฌๅ€ญๅ€ฎๅ€ดๅ€บๅ€ปๅ€ผๅ€พๅๅƒๅ‡ๅˆๅŒ" + "ๅŽๅๅ“ๅ•ๅšๅœๅกๅฅๅฌๅญๅฐๅฒๅถๅทๅปๅพๅฟๅ‚€ๅ‚ƒๅ‚…ๅ‚ˆๅ‚‰ๅ‚ๅ‚’ๅ‚•ๅ‚ฃๅ‚ฅๅ‚งๅ‚จๅ‚ฉๅ‚ฌๅ‚ฒๅ‚บๅ‚ปๅƒ‡ๅƒŽๅƒๅƒ”ๅƒ–ๅƒš" + "ๅƒฆๅƒงๅƒฌๅƒญๅƒฎๅƒฐๅƒณๅƒตๅƒปๅ„†ๅ„‡ๅ„‹ๅ„’ๅ„กๅ„ฆๅ„ณๅ„ดๅ„ฟๅ…€ๅ…ๅ…ƒๅ…„ๅ……ๅ…†ๅ…ˆๅ…‰ๅ…‹ๅ…ๅ…‘ๅ…”ๅ…•ๅ…–ๅ…šๅ…œๅ…ขๅ…ฅๅ…จๅ…ซๅ…ฌๅ…ญ" + "ๅ…ฎๅ…ฐๅ…ฑๅ…ณๅ…ดๅ…ตๅ…ถๅ…ทๅ…ธๅ…นๅ…ปๅ…ผๅ…ฝๅ†€ๅ†ๅ†…ๅ†ˆๅ†‰ๅ†Œๅ†ๅ†ๅ†’ๅ†”ๅ†•ๅ†—ๅ†™ๅ†›ๅ†œๅ† ๅ†ขๅ†คๅ†ฅๅ†ฌๅ†ฎๅ†ฏๅ†ฐๅ†ฑๅ†ฒๅ†ณๅ†ต" + "ๅ†ถๅ†ทๅ†ปๅ†ผๅ†ฝๅ‡€ๅ‡„ๅ‡†ๅ‡‡ๅ‡‰ๅ‡‹ๅ‡Œๅ‡ๅ‡‘ๅ‡“ๅ‡˜ๅ‡›ๅ‡ๅ‡ ๅ‡กๅ‡คๅ‡ซๅ‡ญๅ‡ฏๅ‡ฐๅ‡ณๅ‡ถๅ‡ธๅ‡นๅ‡บๅ‡ปๅ‡ผๅ‡ฝๅ‡ฟๅˆ€ๅˆๅˆƒๅˆ†ๅˆ‡ๅˆˆ" + "ๅˆŠๅˆๅˆŽๅˆ‘ๅˆ’ๅˆ–ๅˆ—ๅˆ˜ๅˆ™ๅˆšๅˆ›ๅˆๅˆ ๅˆคๅˆจๅˆฉๅˆซๅˆฌๅˆญๅˆฎๅˆฐๅˆณๅˆถๅˆทๅˆธๅˆนๅˆบๅˆปๅˆฝๅˆฟๅ‰€ๅ‰ๅ‰‚ๅ‰ƒๅ‰…ๅ‰Šๅ‰‹ๅ‰Œๅ‰ๅ‰" + "ๅ‰‘ๅ‰”ๅ‰•ๅ‰–ๅ‰œๅ‰žๅ‰Ÿๅ‰กๅ‰ฅๅ‰งๅ‰ฉๅ‰ชๅ‰ฏๅ‰ฒๅ‰ฝๅ‰ฟๅŠๅŠ‚ๅŠ„ๅŠˆๅŠๅŠ“ๅŠ›ๅŠๅŠžๅŠŸๅŠ ๅŠกๅŠขๅŠฃๅŠจๅŠฉๅŠชๅŠซๅŠฌๅŠญๅŠฑๅŠฒๅŠณๅŠผ" + "ๅŠพๅŠฟๅ‹ƒๅ‹‡ๅ‹‰ๅ‹‹ๅ‹ๅ‹ๅ‹’ๅ‹”ๅ‹–ๅ‹˜ๅ‹šๅ‹Ÿๅ‹ ๅ‹คๅ‹ฐๅ‹บๅ‹พๅ‹ฟๅŒ€ๅŒ…ๅŒ†ๅŒˆๅŒๅŒๅŒๅŒ•ๅŒ–ๅŒ—ๅŒ™ๅŒœๅŒๅŒ ๅŒกๅŒฃๅŒฆๅŒชๅŒฎๅŒน" + "ๅŒบๅŒปๅŒผๅŒพๅŒฟๅๅƒๅ…ๅ‡ๅˆๅ‰ๅŠๅŽๅๅ‘ๅ’ๅ“ๅ•ๅ–ๅ—ๅšๅœๅžๅŸๅ ๅกๅขๅฃๅคๅฆๅงๅซๅฌๅฎๅฏๅฐๅฑๅณๅดๅต" + "ๅทๅธๅบๅฟๅŽ‚ๅŽ„ๅŽ…ๅŽ†ๅŽ‰ๅŽ‹ๅŽŒๅŽๅŽ•ๅŽ–ๅŽ˜ๅŽšๅŽๅŽŸๅŽขๅŽฃๅŽฅๅŽฆๅŽจๅŽฉๅŽฎๅŽปๅŽพๅŽฟๅๅ‚ๅ†ๅ‡ๅˆๅ‰ๅŠๅ‹ๅŒๅๅ‘ๅ”" + "ๅ•ๅ–ๅ—ๅ˜ๅ™ๅšๅ›ๅŸๅ ๅฃๅคๅฅๅฆๅจๅฉๅชๅซๅฌๅญๅฎๅฏๅฐๅฑๅฒๅณๅตๅถๅทๅธๅนๅปๅผๅฝๅๅƒๅ„ๅ†ๅˆๅ‰ๅŠ" + "ๅŒๅๅŽๅๅๅ‘ๅ’ๅ“ๅ•ๅ–ๅ—ๅ›ๅๅžๅŸๅ ๅกๅฃๅฆๅงๅจๅฉๅซๅฌๅญๅฎๅฏๅฑๅฒๅดๅตๅธๅนๅปๅผๅฝๅพๅ‘€ๅ‘ƒๅ‘†" + "ๅ‘‡ๅ‘ˆๅ‘Šๅ‘‹ๅ‘ๅ‘’ๅ‘“ๅ‘”ๅ‘•ๅ‘–ๅ‘—ๅ‘˜ๅ‘™ๅ‘›ๅ‘œๅ‘ขๅ‘ฃๅ‘คๅ‘ฆๅ‘จๅ‘ฑๅ‘ฒๅ‘ณๅ‘ตๅ‘ถๅ‘ทๅ‘ธๅ‘ปๅ‘ผๅ‘ฝๅ’€ๅ’‚ๅ’„ๅ’†ๅ’‡ๅ’‰ๅ’‹ๅ’Œๅ’ๅ’Ž" + "ๅ’ๅ’ๅ’’ๅ’”ๅ’•ๅ’–ๅ’™ๅ’šๅ’›ๅ’ๅ’กๅ’ฃๅ’คๅ’ฅๅ’ฆๅ’งๅ’จๅ’ฉๅ’ชๅ’ซๅ’ฌๅ’ฏๅ’ฑๅ’ณๅ’ดๅ’ธๅ’บๅ’ปๅ’ฝๅ’ฟๅ“€ๅ“ๅ“‚ๅ“ƒๅ“„ๅ“†ๅ“‡ๅ“ˆๅ“‰ๅ“Œ" + "ๅ“ๅ“Žๅ“ๅ“ๅ“‘ๅ“’ๅ““ๅ“”ๅ“•ๅ“—ๅ“™ๅ“šๅ“ๅ“žๅ“Ÿๅ“ขๅ“ฅๅ“ฆๅ“งๅ“จๅ“ฉๅ“ชๅ“ญๅ“ฎๅ“ฑๅ“ฒๅ“ณๅ“บๅ“ผๅ“ฝๅ“ฟๅ”ๅ”†ๅ”‡ๅ”‰ๅ”ๅ”ๅ”‘ๅ””ๅ”›" + "ๅ”ๅ” ๅ”ขๅ”ฃๅ”คๅ”งๅ”ชๅ”ฌๅ”ฎๅ”ฏๅ”ฐๅ”ฑๅ”ณๅ”ตๅ”ทๅ”ผๅ”พๅ”ฟๅ•ๅ•ƒๅ•„ๅ•†ๅ•‰ๅ•Šๅ•ๅ••ๅ•–ๅ•œๅ•กๅ•คๅ•ฅๅ•ฆๅ•งๅ•ชๅ•ซๅ•ฌๅ•ญๅ•ฎๅ•ฐๅ•ด" + "ๅ•ตๅ•ถๅ•ทๅ•ธๅ•ปๅ•ผๅ•พๅ–€ๅ–ๅ–‚ๅ–ƒๅ–„ๅ–†ๅ–‡ๅ–ˆๅ–‰ๅ–Šๅ–‹ๅ–ๅ–‘ๅ–”ๅ–˜ๅ–™ๅ–œๅ–ๅ–Ÿๅ–คๅ–งๅ–ฑๅ–ณๅ–ตๅ–ทๅ–นๅ–ปๅ–ฝๅ–พๅ—„ๅ—…ๅ—‰ๅ—Œ" + "ๅ—ๅ—ๅ—‘ๅ—’ๅ—“ๅ—”ๅ—–ๅ—œๅ—ๅ—žๅ—Ÿๅ—กๅ—ฃๅ—คๅ—ฅๅ—ฆๅ—จๅ—ชๅ—ซๅ—ฌๅ—ฏๅ—ฒๅ—ณๅ—ตๅ—ทๅ—ฝๅ—พๅ˜€ๅ˜ๅ˜ˆๅ˜‰ๅ˜Œๅ˜Žๅ˜ๅ˜˜ๅ˜šๅ˜›ๅ˜žๅ˜Ÿๅ˜ก" + "ๅ˜ฃๅ˜คๅ˜งๅ˜ฌๅ˜ญๅ˜ฑๅ˜ฒๅ˜ดๅ˜ถๅ˜นๅ˜ปๅ˜ฟๅ™€ๅ™‚ๅ™‡ๅ™Œๅ™ๅ™Žๅ™”ๅ™—ๅ™˜ๅ™™ๅ™œๅ™ขๅ™คๅ™จๅ™ฉๅ™ชๅ™ซๅ™ฌๅ™ฑๅ™ถๅ™ปๅ™ผๅš„ๅš…ๅš†ๅšŽๅšๅš“" + "ๅššๅšฃๅšญๅšฏๅšทๅšผๅ›Šๅ›”ๅ›šๅ››ๅ›žๅ›Ÿๅ› ๅ›กๅ›ขๅ›คๅ›ซๅ›ญๅ›ฐๅ›ฑๅ›ดๅ›ตๅ›ทๅ›นๅ›บๅ›ฝๅ›พๅ›ฟๅœƒๅœ„ๅœ†ๅœˆๅœ‰ๅœŠๅœŒๅœๅœ™ๅœœๅœŸๅœข" + "ๅœฃๅœจๅœฉๅœชๅœซๅœฌๅœญๅœฎๅœฏๅœฐๅœฒๅœณๅœนๅœบๅœปๅœพๅ€ๅ‚ๅ‡ๅ‰ๅŠๅ‹ๅŒๅๅŽๅๅๅ‘ๅ’ๅ—ๅšๅ›ๅœๅๅžๅŸๅ ๅกๅคๅฅ" + 
"ๅฆๅจๅฉๅชๅซๅฌๅญๅฏๅฐๅณๅทๅปๅผๅฝๅž‚ๅžƒๅž„ๅž†ๅžˆๅž‹ๅžŒๅžๅžŽๅžๅž’ๅž“ๅž•ๅž™ๅžšๅž›ๅžžๅžŸๅž ๅžกๅžขๅžฃๅžคๅžฆๅžงๅžฉ" + "ๅžซๅžญๅžฎๅžฏๅžฑๅžฒๅžดๅžตๅžธๅžบๅžพๅžฟๅŸ‚ๅŸƒๅŸ†ๅŸ‡ๅŸ‹ๅŸŒๅŸŽๅŸๅŸ’ๅŸ”ๅŸ•ๅŸ—ๅŸ˜ๅŸ™ๅŸšๅŸๅŸŸๅŸ ๅŸคๅŸชๅŸซๅŸญๅŸฏๅŸดๅŸตๅŸธๅŸนๅŸบ" + "ๅŸผๅŸฝๅ ‚ๅ ƒๅ †ๅ ‡ๅ ‰ๅ ‹ๅ Œๅ ๅ Žๅ ๅ ‘ๅ •ๅ ™ๅ žๅ  ๅ กๅ คๅ งๅ จๅ ชๅ ฐๅ ฒๅ ตๅ ผๅ ฝๅ พๅก„ๅก…ๅก†ๅกŒๅกๅก‘ๅก”ๅก˜ๅกๅกžๅกฅๅกซ" + "ๅกฌๅกฑๅกพๅข€ๅขๅขƒๅข…ๅขˆๅข‰ๅขๅข’ๅข“ๅข•ๅข˜ๅข™ๅขšๅขžๅขŸๅขกๅขฃๅขฆๅขจๅขฉๅขผๅฃๅฃ…ๅฃ‘ๅฃ•ๅฃคๅฃซๅฃฌๅฃฎๅฃฐๅฃณๅฃถๅฃธๅฃนๅค„ๅค‡ๅค" + "ๅคๅคๅค”ๅค•ๅค–ๅค™ๅคšๅคœๅคŸๅคคๅคฅๅคงๅคฉๅคชๅคซๅคฌๅคญๅคฎๅคฏๅคฑๅคดๅคทๅคธๅคนๅคบๅคผๅฅๅฅ‚ๅฅ„ๅฅ‡ๅฅˆๅฅ‰ๅฅ‹ๅฅŽๅฅๅฅ‘ๅฅ“ๅฅ”ๅฅ•ๅฅ–" + "ๅฅ—ๅฅ˜ๅฅšๅฅ ๅฅกๅฅขๅฅฅๅฅญๅฅณๅฅดๅฅถๅฅธๅฅนๅฅฝๅฆๅฆ‚ๅฆƒๅฆ„ๅฆ†ๅฆ‡ๅฆˆๅฆŠๅฆๅฆ’ๅฆ“ๅฆ–ๅฆ—ๅฆ˜ๅฆ™ๅฆžๅฆฃๅฆคๅฆฅๅฆงๅฆจๅฆฉๅฆชๅฆซๅฆญๅฆฎ" + "ๅฆฏๅฆฒๅฆนๅฆปๅฆพๅง†ๅงˆๅงŠๅง‹ๅงๅง‘ๅง’ๅง“ๅง”ๅง—ๅง˜ๅงšๅงœๅงๅงžๅงฃๅงคๅงฅๅงจๅงฌๅงฎๅงฑๅงถๅงนๅงปๅงฝๅงฟๅจ€ๅจๅจƒๅจ„ๅจ…ๅจ†ๅจ‡ๅจˆ" + "ๅจ‰ๅจŒๅจ‘ๅจ“ๅจ˜ๅจœๅจŸๅจ ๅจฃๅจฅๅจฉๅจฑๅจฒๅจดๅจตๅจถๅจผๅฉ€ๅฉ†ๅฉ‰ๅฉŠๅฉŒๅฉๅฉ•ๅฉ˜ๅฉšๅฉžๅฉ ๅฉขๅฉคๅฉงๅฉชๅฉซๅฉณๅฉดๅฉตๅฉถๅฉทๅฉบๅฉป" + "ๅฉผๅฉฟๅช‚ๅช„ๅช†ๅช’ๅช“ๅช–ๅชšๅช›ๅชžๅชชๅชญๅชฑๅชฒๅชณๅชตๅชธๅชพๅซๅซ‚ๅซ„ๅซ‰ๅซŒๅซ’ๅซ”ๅซ•ๅซ–ๅซ˜ๅซšๅซœๅซ ๅซกๅซฃๅซฆๅซฉๅซชๅซซๅซญๅซฑ" + "ๅซฝๅฌ‰ๅฌ–ๅฌ—ๅฌ›ๅฌฅๅฌฌๅฌดๅฌทๅฌฟๅญ€ๅญ…ๅญๅญ‘ๅญ“ๅญ”ๅญ•ๅญ–ๅญ—ๅญ˜ๅญ™ๅญšๅญ›ๅญœๅญๅญŸๅญขๅญฃๅญคๅญฅๅญฆๅญฉๅญชๅญฌๅญฐๅญฑๅญณๅญตๅญบๅญฝ" + "ๅฎๅฎƒๅฎ„ๅฎ…ๅฎ‡ๅฎˆๅฎ‰ๅฎ‹ๅฎŒๅฎๅฎ“ๅฎ•ๅฎ—ๅฎ˜ๅฎ™ๅฎšๅฎ›ๅฎœๅฎๅฎžๅฎ ๅฎกๅฎขๅฎฃๅฎคๅฎฅๅฎฆๅฎงๅฎชๅฎซๅฎฌๅฎฐๅฎณๅฎดๅฎตๅฎถๅฎธๅฎนๅฎฝๅฎพ" + "ๅฎฟๅฏๅฏ‚ๅฏ„ๅฏ…ๅฏ†ๅฏ‡ๅฏŒๅฏๅฏ’ๅฏ“ๅฏๅฏžๅฏŸๅฏกๅฏคๅฏฅๅฏจๅฏฎๅฏฐๅฏธๅฏนๅฏบๅฏปๅฏผๅฏฟๅฐๅฐ„ๅฐ†ๅฐ‰ๅฐŠๅฐๅฐ‘ๅฐ”ๅฐ•ๅฐ–ๅฐ˜ๅฐšๅฐœๅฐ" + "ๅฐขๅฐคๅฐฅๅฐงๅฐจๅฐชๅฐฌๅฐฑๅฐดๅฐธๅฐนๅฐบๅฐปๅฐผๅฐฝๅฐพๅฐฟๅฑ€ๅฑๅฑ‚ๅฑƒๅฑ…ๅฑˆๅฑ‰ๅฑŠๅฑ‹ๅฑŽๅฑๅฑๅฑ‘ๅฑ•ๅฑ™ๅฑžๅฑ ๅฑกๅฑฃๅฑฅๅฑฆๅฑฏๅฑฑ" + "ๅฑนๅฑบๅฑผๅฑพๅฑฟๅฒๅฒ‚ๅฒˆๅฒŠๅฒŒๅฒๅฒๅฒ‘ๅฒ”ๅฒ–ๅฒ—ๅฒ˜ๅฒ™ๅฒšๅฒ›ๅฒœๅฒžๅฒ ๅฒขๅฒฃๅฒจๅฒฉๅฒซๅฒฌๅฒญๅฒฑๅฒณๅฒตๅฒทๅฒธๅฒฝๅฒฟๅณๅณ‚ๅณƒ" + "ๅณ„ๅณ‹ๅณ’ๅณ—ๅณ˜ๅณ™ๅณ›ๅณกๅณฃๅณคๅณฅๅณฆๅณงๅณจๅณชๅณญๅณฐๅณฑๅณปๅณฟๅด€ๅดๅด‚ๅดƒๅด„ๅด†ๅด‡ๅดŒๅดŽๅด’ๅด”ๅด–ๅดšๅด›ๅดžๅดŸๅดกๅดคๅดฆๅดง" + "ๅดฉๅดญๅดฎๅดดๅดถๅดฝๅดพๅดฟๅตๅต…ๅต‡ๅตŠๅต‹ๅตŒๅตŽๅต–ๅต˜ๅตšๅต›ๅตๅตฉๅตซๅตฌๅตฏๅตฒๅตดๅถ‚ๅถ…ๅถๅถ’ๅถ“ๅถ™ๅถๅถŸๅถฆๅถฒๅถทๅท…ๅท‡ๅท‰" + "ๅทๅทๅทžๅทกๅทขๅทฅๅทฆๅทงๅทจๅทฉๅทซๅทฎๅทฏๅทฑๅทฒๅทณๅทดๅททๅทฝๅทพๅธๅธ‚ๅธƒๅธ…ๅธ†ๅธˆๅธŒๅธๅธๅธ‘ๅธ”ๅธ•ๅธ–ๅธ˜ๅธ™ๅธšๅธ›ๅธœๅธๅธก" + "ๅธฆๅธงๅธจๅธญๅธฎๅธฑๅธทๅธธๅธปๅธผๅธฝๅน‚ๅน„ๅน…ๅนŒๅน”ๅน•ๅน–ๅน›ๅนžๅนกๅนขๅนชๅนฒๅนณๅนดๅนถๅนธๅนบๅนปๅนผๅนฝๅนฟๅบ„ๅบ†ๅบ‡ๅบŠๅบ‹ๅบๅบ" + "ๅบ‘ๅบ“ๅบ”ๅบ•ๅบ–ๅบ—ๅบ™ๅบšๅบœๅบžๅบŸๅบ ๅบคๅบฅๅบฆๅบงๅบญๅบฑๅบณๅบตๅบถๅบทๅบธๅบนๅบผๅบพๅป†ๅป‰ๅปŠๅป‹ๅป‘ๅป’ๅป“ๅป–ๅป™ๅป›ๅปจๅปชๅปถๅปท" + "ๅปบๅปฟๅผ€ๅผๅผ‚ๅผƒๅผ„ๅผ†ๅผ‡ๅผˆๅผŠๅผ‹ๅผๅผ‘ๅผ“ๅผ•ๅผ—ๅผ˜ๅผ›ๅผŸๅผ ๅผขๅผฅๅผฆๅผงๅผจๅผฉๅผญๅผฏๅผฑๅผถๅผธๅผนๅผบๅผผๅฝ€ๅฝ’ๅฝ“ๅฝ•ๅฝ–" + "ๅฝ—ๅฝ˜ๅฝๅฝŸๅฝขๅฝคๅฝฆๅฝงๅฝฉๅฝชๅฝฌๅฝญๅฝฐๅฝฑๅฝณๅฝทๅฝนๅฝปๅฝผๅพ€ๅพๅพ‚ๅพ„ๅพ…ๅพ‡ๅพˆๅพ‰ๅพŠๅพ‹ๅพๅพ’ๅพ•ๅพ—ๅพ˜ๅพ™ๅพ›ๅพœๅพกๅพจๅพช" + "ๅพญๅพฎๅพตๅพทๅพผๅพฝๅฟƒๅฟ…ๅฟ†ๅฟ‰ๅฟŒๅฟๅฟๅฟๅฟ‘ๅฟ’ๅฟ–ๅฟ—ๅฟ˜ๅฟ™ๅฟๅฟžๅฟ ๅฟกๅฟคๅฟงๅฟชๅฟซๅฟญๅฟฎๅฟฑๅฟณๅฟตๅฟธๅฟบๅฟปๅฟฝๅฟพๅฟฟๆ€€" + "ๆ€ๆ€‚ๆ€ƒๆ€„ๆ€…ๆ€†ๆ€Šๆ€ๆ€Žๆ€ๆ€’ๆ€”ๆ€•ๆ€–ๆ€™ๆ€›ๆ€œๆ€ๆ€ ๆ€กๆ€ฅๆ€ฆๆ€งๆ€จๆ€ฉๆ€ชๆ€ซๆ€ฏๆ€ตๆ€ปๆ€ผๆ€ฟๆๆ‚ๆƒๆ‹ๆๆๆ’ๆ“" + "ๆ”ๆ•ๆ™ๆšๆๆขๆฃๆคๆงๆจๆฉๆชๆซๆฌๆญๆฏๆฐๆณๆถๆธๆนๆบๆปๆผๆฝๆฟๆ‚ƒๆ‚„ๆ‚†ๆ‚ˆๆ‚‰ๆ‚Œๆ‚ๆ‚’ๆ‚”ๆ‚–ๆ‚šๆ‚›ๆ‚ๆ‚Ÿ" + "ๆ‚ ๆ‚ขๆ‚ฃๆ‚ฆๆ‚จๆ‚ซๆ‚ฌๆ‚ญๆ‚ฏๆ‚ฐๆ‚ฑๆ‚ฒๆ‚ดๆ‚ธๆ‚ปๆ‚ผๆƒ…ๆƒ†ๆƒ‡ๆƒŠๆƒ‹ๆƒŽๆƒ‘ๆƒ”ๆƒ•ๆƒ˜ๆƒ™ๆƒšๆƒ›ๆƒœๆƒๆƒŸๆƒ ๆƒฆๆƒงๆƒจๆƒฉๆƒซๆƒฌๆƒญ" + "ๆƒฎๆƒฏๆƒฐๆƒณๆƒดๆƒถๆƒนๆƒบๆ„€ๆ„ๆ„ƒๆ„†ๆ„ˆๆ„‰ๆ„ๆ„Žๆ„ๆ„ๆ„”ๆ„•ๆ„šๆ„Ÿๆ„ ๆ„ฃๆ„คๆ„ฆๆ„งๆ„ซๆ„ญๆ„ฟๆ…†ๆ…ˆๆ…Šๆ…Œๆ…Žๆ…‘ๆ…•ๆ…ๆ…ขๆ…ฅ" + "ๆ…งๆ…จๆ…ฌๆ…ญๆ…ฐๆ…ตๆ…ทๆ†‹ๆ†Žๆ†”ๆ†•ๆ†™ๆ†งๆ†จๆ†ฉๆ†ฌๆ†ญๆ†ทๆ†บๆ†พๆ‡‚ๆ‡ˆๆ‡Šๆ‡‹ๆ‡‘ๆ‡’ๆ‡”ๆ‡ฆๆ‡ตๆ‡ฟๆˆ†ๆˆˆๆˆŠๆˆ‹ๆˆŒๆˆๆˆŽๆˆๆˆๆˆ‘" + "ๆˆ’ๆˆ•ๆˆ–ๆˆ—ๆˆ˜ๆˆšๆˆ›ๆˆŸๆˆกๆˆขๆˆฃๆˆคๆˆฅๆˆชๆˆฌๆˆญๆˆฎๆˆณๆˆดๆˆทๆˆฝๆˆพๆˆฟๆ‰€ๆ‰ๆ‰‚ๆ‰ƒๆ‰…ๆ‰†ๆ‰‡ๆ‰ˆๆ‰‰ๆ‰Šๆ‰‹ๆ‰ๆ‰Žๆ‰‘ๆ‰’ๆ‰“ๆ‰”" + 
"ๆ‰˜ๆ‰›ๆ‰žๆ‰ฃๆ‰ฆๆ‰งๆ‰ฉๆ‰ชๆ‰ซๆ‰ฌๆ‰ญๆ‰ฎๆ‰ฏๆ‰ฐๆ‰ณๆ‰ถๆ‰นๆ‰บๆ‰ผๆ‰ฝๆ‰พๆ‰ฟๆŠ€ๆŠƒๆŠ„ๆŠ‰ๆŠŠๆŠ‘ๆŠ’ๆŠ“ๆŠ”ๆŠ•ๆŠ–ๆŠ—ๆŠ˜ๆŠšๆŠ›ๆŠŸๆŠ ๆŠก" + "ๆŠขๆŠคๆŠฅๆŠจๆŠซๆŠฌๆŠฑๆŠตๆŠนๆŠปๆŠผๆŠฝๆŠฟๆ‹‚ๆ‹ƒๆ‹„ๆ‹…ๆ‹†ๆ‹‡ๆ‹ˆๆ‹‰ๆ‹Šๆ‹Œๆ‹ๆ‹Žๆ‹ๆ‹’ๆ‹“ๆ‹”ๆ‹–ๆ‹—ๆ‹˜ๆ‹™ๆ‹›ๆ‹œๆ‹Ÿๆ‹ขๆ‹ฃๆ‹คๆ‹ฅ" + "ๆ‹ฆๆ‹งๆ‹จๆ‹ฉๆ‹ฌๆ‹ญๆ‹ฎๆ‹ฏๆ‹ฑๆ‹ณๆ‹ดๆ‹ถๆ‹ทๆ‹ผๆ‹ฝๆ‹พๆ‹ฟๆŒๆŒ‚ๆŒ‡ๆŒˆๆŒ‰ๆŒŽๆŒ‘ๆŒ“ๆŒ–ๆŒšๆŒ›ๆŒๆŒžๆŒŸๆŒ ๆŒกๆŒฃๆŒคๆŒฅๆŒฆๆŒจๆŒชๆŒซ" + "ๆŒฏๆŒฒๆŒนๆŒบๆŒฝๆ‚ๆƒๆ…ๆ†ๆ‰ๆ‹ๆŒๆๆŽๆๆๆ•ๆžๆŸๆกๆขๆฃๆงๆฉๆญๆฎๆฏๆถๆทๆบๆปๆฝๆŽ€ๆŽ‚ๆŽ‡ๆŽˆๆŽ‰ๆŽŠๆŽŒๆŽŽ" + "ๆŽๆŽๆŽ’ๆŽ–ๆŽ˜ๆŽžๆŽ ๆŽขๆŽฃๆŽฅๆŽงๆŽจๆŽฉๆŽชๆŽฌๆŽญๆŽฎๆŽฐๆŽณๆŽดๆŽทๆŽธๆŽบๆŽผๆŽพๆ„ๆ†ๆ‰ๆๆๆๆ’ๆ•ๆ–ๆ ๆกๆฃๆฉๆชๆญ" + "ๆณๆดๆถๆธๆฝๆฟๆ€ๆๆ‚ๆ…ๆ‹ๆŒๆๆๆ’ๆ“ๆ”ๆ›ๆœๆžๆ ๆกๆฆๆชๆฌๆญๆดๆบๆฝๆ‘ๆ‘„ๆ‘…ๆ‘†ๆ‘‡ๆ‘ˆๆ‘Šๆ‘ๆ‘’ๆ‘”ๆ‘˜" + "ๆ‘›ๆ‘žๆ‘งๆ‘ฉๆ‘ญๆ‘ดๆ‘ธๆ‘นๆ‘ฝๆ’‚ๆ’„ๆ’…ๆ’‡ๆ’‘ๆ’’ๆ’•ๆ’–ๆ’™ๆ’žๆ’คๆ’ฉๆ’ฌๆ’ญๆ’ฎๆ’ฐๆ’ตๆ’ทๆ’ธๆ’บๆ’ผๆ“€ๆ“‚ๆ“…ๆ“ๆ“Žๆ“ๆ“’ๆ“˜ๆ“žๆ“ข" + "ๆ“คๆ“ฆๆ“ฟๆ”€ๆ”‰ๆ”’ๆ”˜ๆ”ฅๆ”ซๆ”ฎๆ”ฏๆ”ถๆ”ธๆ”นๆ”ปๆ”ฝๆ”พๆ”ฟๆ•…ๆ•ˆๆ•‰ๆ•Œๆ•ๆ•‘ๆ•”ๆ••ๆ•–ๆ•™ๆ•›ๆ•ๆ•žๆ•ขๆ•ฃๆ•ฆๆ•ฉๆ•ซๆ•ฌๆ•ฐๆ•ฒๆ•ด" + "ๆ•ทๆ–‡ๆ–‹ๆ–Œๆ–ๆ–‘ๆ–“ๆ–—ๆ–™ๆ–›ๆ–œๆ–ๆ–Ÿๆ– ๆ–กๆ–คๆ–ฅๆ–งๆ–ฉๆ–ซๆ–ญๆ–ฏๆ–ฐๆ–ถๆ–นๆ–ผๆ–ฝๆ—ๆ—ƒๆ—„ๆ—…ๆ—†ๆ—‹ๆ—Œๆ—Žๆ—ๆ—ๆ—’ๆ—–ๆ——" + "ๆ—žๆ— ๆ—ขๆ—ฅๆ—ฆๆ—งๆ—จๆ—ฉๆ—ฌๆ—ญๆ—ฎๆ—ฏๆ—ฐๆ—ฑๆ—ดๆ—ตๆ—ถๆ—ทๆ—ธๆ—บๆ—ปๆ—ฟๆ˜€ๆ˜‚ๆ˜ƒๆ˜„ๆ˜†ๆ˜‡ๆ˜ˆๆ˜‰ๆ˜Šๆ˜Œๆ˜Žๆ˜ๆ˜’ๆ˜“ๆ˜”ๆ˜•ๆ˜™ๆ˜" + "ๆ˜Ÿๆ˜ ๆ˜กๆ˜ฃๆ˜คๆ˜ฅๆ˜งๆ˜จๆ˜ชๆ˜ซๆ˜ญๆ˜ฏๆ˜ฑๆ˜ณๆ˜ดๆ˜ตๆ˜ถๆ˜บๆ˜ผๆ˜ฝๆ˜พๆ™ๆ™ƒๆ™…ๆ™Šๆ™‹ๆ™Œๆ™ๆ™ๆ™’ๆ™“ๆ™”ๆ™•ๆ™–ๆ™—ๆ™™ๆ™šๆ™žๆ™Ÿๆ™ก" + "ๆ™ขๆ™คๆ™ฆๆ™จๆ™ชๆ™ซๆ™ฎๆ™ฏๆ™ฐๆ™ฑๆ™ดๆ™ถๆ™ทๆ™บๆ™พๆš‚ๆš„ๆš…ๆš‡ๆšŒๆš‘ๆš•ๆš–ๆš—ๆšๆšงๆšจๆšฎๆšฒๆšดๆšตๆšถๆšนๆšพๆšฟๆ›ˆๆ›Œๆ›™ๆ››ๆ›œ" + "ๆ›ๆ›ฆๆ›ฉๆ›ฐๆ›ฒๆ›ณๆ›ดๆ›ทๆ›นๆ›ผๆ›พๆ›ฟๆœ€ๆœˆๆœ‰ๆœ‹ๆœๆœๆœๆœ“ๆœ”ๆœ•ๆœ—ๆœ›ๆœๆœŸๆœฆๆœจๆœชๆœซๆœฌๆœญๆœฏๆœฑๆœณๆœดๆœตๆœธๆœบๆœฝ" + "ๆ€ๆ‚ๆƒๆ„ๆ†ๆˆๆ‰ๆŒๆŽๆๆๆ‘ๆ“ๆ•ๆ–ๆ™ๆœๆžๆŸๆ ๆกๆฅๆงๆจๆฉๆชๆญๆฏๆฐๆฒๆณๆตๆทๆปๆผๆพๆฟๆžๆž„ๆž…" + "ๆž‡ๆž‰ๆž‹ๆžๆžๆž•ๆž—ๆž˜ๆžšๆžœๆžๆžžๆžขๆžฃๆžฅๆžงๆžจๆžชๆžซๆžญๆžฏๆžฐๆžฒๆžณๆžตๆžถๆžทๆžธๆžนๆŸๆŸƒๆŸ„ๆŸˆๆŸŠๆŸๆŸๆŸ‘ๆŸ’ๆŸ“ๆŸ”" + "ๆŸ–ๆŸ˜ๆŸ™ๆŸšๆŸœๆŸๆŸžๆŸ ๆŸขๆŸฅๆŸฉๆŸฌๆŸฏๆŸฐๆŸฑๆŸณๆŸดๆŸทๆŸฝๆŸฟๆ €ๆ …ๆ ‡ๆ ˆๆ ‰ๆ Šๆ ‹ๆ Œๆ Žๆ ๆ ๆ ‘ๆ ’ๆ “ๆ –ๆ —ๆ ๆ Ÿๆ กๆ ฉ" + "ๆ ชๆ ฒๆ ณๆ ดๆ ทๆ ธๆ นๆ ปๆ ผๆ ฝๆ พๆก€ๆกๆก‚ๆกƒๆก„ๆก…ๆก†ๆกˆๆก‰ๆกŠๆกŒๆกŽๆกๆก‘ๆก“ๆก”ๆก•ๆก ๆกกๆกขๆกฃๆกคๆกฅๆกฆๆกงๆกจๆกฉๆกซๆกฏ" + "ๆกฒๆกดๆกถๆกทๆกนๆขๆขƒๆข…ๆข†ๆขŒๆขๆข“ๆข—ๆข ๆขขๆขฃๆขฆๆขงๆขจๆขญๆขฏๆขฐๆขณๆขดๆขตๆขผๆขฝๆขพๆขฟๆฃ€ๆฃๆฃ‚ๆฃ‰ๆฃ‹ๆฃๆฃๆฃ’ๆฃ“ๆฃ•ๆฃ˜" + "ๆฃšๆฃ ๆฃฃๆฃคๆฃจๆฃชๆฃซๆฃฌๆฃฎๆฃฐๆฃฑๆฃตๆฃนๆฃบๆฃปๆฃผๆฃฝๆค€ๆคๆค…ๆค†ๆค‹ๆคๆคŽๆคๆค‘ๆค’ๆค“ๆคŸๆค ๆคคๆคชๆคญๆคฐๆคดๆคธๆคนๆคฝๆคฟๆฅ‚" + "ๆฅ’ๆฅ”ๆฅ—ๆฅ™ๆฅšๆฅๆฅžๆฅ ๆฅฃๆฅฆๆฅฉๆฅชๆฅซๆฅฎๆฅฏๆฅทๆฅธๆฅนๆฅผๆฆ‚ๆฆƒๆฆ„ๆฆ…ๆฆ†ๆฆ‡ๆฆˆๆฆ‰ๆฆๆฆ‘ๆฆ”ๆฆ•ๆฆ–ๆฆ›ๆฆœๆฆงๆฆจๆฆซๆฆญๆฆฐๆฆฑ" + "ๆฆดๆฆทๆฆปๆงๆงƒๆงŠๆงŒๆงŽๆงๆง”ๆงšๆง›ๆงœๆงŸๆง ๆงญๆงฑๆงฒๆงฝๆงฟๆจŠๆจ—ๆจ˜ๆจŸๆจกๆจจๆจชๆจฏๆจฑๆจตๆจฝๆจพๆฉ„ๆฉ‡ๆฉๆฉ‘ๆฉ˜ๆฉ™ๆฉ›ๆฉž" + "ๆฉกๆฉฅๆฉฆๆฉฑๆฉนๆฉผๆช€ๆช„ๆชŽๆชๆช‘ๆช—ๆชžๆช ๆชฉๆชซๆชฌๆซ†ๆฌ‚ๆฌ ๆฌกๆฌขๆฌฃๆฌคๆฌงๆฌฒๆฌธๆฌนๆฌบๆฌปๆฌพๆญƒๆญ…ๆญ†ๆญ‡ๆญ‰ๆญŒๆญ™ๆญขๆญฃ" + "ๆญคๆญฅๆญฆๆญงๆญชๆญนๆญปๆญผๆฎๆฎ‚ๆฎƒๆฎ„ๆฎ†ๆฎ‡ๆฎ‰ๆฎŠๆฎ‹ๆฎๆฎ’ๆฎ“ๆฎ–ๆฎšๆฎ›ๆฎกๆฎฃๆฎชๆฎณๆฎดๆฎตๆฎทๆฎฟๆฏๆฏ‚ๆฏ…ๆฏ‹ๆฏŒๆฏๆฏๆฏๆฏ’" + "ๆฏ“ๆฏ”ๆฏ•ๆฏ–ๆฏ—ๆฏ™ๆฏ›ๆฏกๆฏชๆฏซๆฏฏๆฏณๆฏตๆฏนๆฏฝๆฐ…ๆฐ†ๆฐ‡ๆฐๆฐๆฐๆฐ‘ๆฐ“ๆฐ”ๆฐ•ๆฐ–ๆฐ˜ๆฐ™ๆฐšๆฐ›ๆฐŸๆฐกๆฐขๆฐคๆฐฆๆฐงๆฐจๆฐฉๆฐชๆฐฎ" + "ๆฐฏๆฐฐๆฐฒๆฐดๆฐธๆฐพๆฐฟๆฑ€ๆฑๆฑ‚ๆฑ†ๆฑ‡ๆฑˆๆฑ‰ๆฑŠๆฑ‹ๆฑๆฑ”ๆฑ•ๆฑ—ๆฑ›ๆฑœๆฑๆฑžๆฑŸๆฑ ๆฑกๆฑคๆฑงๆฑจๆฑฉๆฑชๆฑซๆฑญๆฑฐๆฑฒๆฑดๆฑถๆฑนๆฑฝ" + "ๆฑพๆฒๆฒ‚ๆฒƒๆฒ„ๆฒ…ๆฒ†ๆฒ‡ๆฒˆๆฒ‰ๆฒŒๆฒๆฒๆฒ“ๆฒ”ๆฒ˜ๆฒ™ๆฒšๆฒ›ๆฒŸๆฒกๆฒฃๆฒคๆฒฅๆฒฆๆฒงๆฒจๆฒฉๆฒชๆฒซๆฒญๆฒฎๆฒฑๆฒณๆฒธๆฒนๆฒบๆฒปๆฒผๆฒฝ" + "ๆฒพๆฒฟๆณ‚ๆณƒๆณ„ๆณ…ๆณ‡ๆณ‰ๆณŠๆณŒๆณๆณ“ๆณ”ๆณ•ๆณ–ๆณ—ๆณ™ๆณšๆณ›ๆณœๆณžๆณ ๆณกๆณขๆณฃๆณฅๆณจๆณชๆณซๆณฎๆณฏๆณฐๆณฑๆณณๆณตๆณทๆณธๆณบๆณปๆณผ" + "ๆณฝๆณพๆดๆด„ๆด‡ๆดˆๆด‹ๆดŒๆดŽๆด‘ๆด’ๆด“ๆด—ๆด˜ๆด™ๆดšๆด›ๆดžๆดขๆดฃๆดฅๆดงๆดจๆดชๆดซๆดญๆดฎๆดฑๆดฒๆดณๆดดๆดตๆดธๆดนๆดบๆดปๆดผๆดฝๆดพๆดฟ" + 
"ๆตๆตƒๆต…ๆต†ๆต‡ๆตˆๆต‰ๆตŠๆต‹ๆตๆตŽๆตๆตๆต‘ๆต’ๆต“ๆต”ๆต•ๆต™ๆตšๆต›ๆตœๆตžๆตŸๆต ๆตกๆตฃๆตฅๆตฆๆตฉๆตชๆตฌๆตญๆตฎๆตฏๆตฐๆตฒๆตดๆตทๆตธ" + "ๆตผๆถ‚ๆถ„ๆถ…ๆถˆๆถ‰ๆถŒๆถๆถŽๆถๆถ‘ๆถ“ๆถ”ๆถ•ๆถ˜ๆถ›ๆถๆถžๆถŸๆถ ๆถกๆถขๆถฃๆถคๆถฆๆถงๆถจๆถฉๆถชๆถซๆถฎๆถฏๆถฒๆถดๆถตๆถธๆถฟๆท€ๆท„ๆท…" + "ๆท†ๆท‡ๆท‹ๆทŒๆทๆท‘ๆท–ๆท˜ๆท™ๆทœๆทๆทžๆทŸๆท ๆทกๆทคๆทฆๆทซๆทฌๆทฎๆทฏๆทฑๆทณๆทดๆททๆทนๆทปๆทผๆธ…ๆธŠๆธŒๆธๆธŽๆธๆธ‘ๆธ”ๆธ—ๆธšๆธๆธŸ" + "ๆธ ๆธกๆธฃๆธคๆธฅๆธฉๆธซๆธญๆธฏๆธฐๆธฒๆธดๆธธๆธบๆธผๆนƒๆน„ๆน‰ๆนๆนŽๆน‘ๆน“ๆน”ๆน–ๆน˜ๆน›ๆนœๆนๆนŸๆนฃๆนซๆนฎๆนฒๆนดๆนพๆนฟๆบๆบƒๆบ…ๆบ†" + "ๆบ‡ๆบ‰ๆบๆบๆบๆบ˜ๆบšๆบœๆบžๆบŸๆบ ๆบขๆบฅๆบฆๆบงๆบชๆบฏๆบฑๆบฒๆบดๆบตๆบถๆบทๆบนๆบบๆบปๆบฝๆปๆป‚ๆปƒๆป†ๆป‡ๆป‰ๆป‹ๆปๆปๆป‘ๆป“ๆป”ๆป•" + "ๆป—ๆป˜ๆปšๆปžๆปŸๆป ๆปกๆปขๆปคๆปฅๆปฆๆปงๆปจๆปฉๆปชๆปซๆปดๆปนๆผ‚ๆผ†ๆผˆๆผ‰ๆผ‹ๆผๆผ“ๆผ”ๆผ•ๆผ–ๆผ ๆผคๆผฆๆผฉๆผชๆผซๆผญๆผฏๆผฑๆผณๆผดๆผถ" + "ๆผทๆผนๆผปๆผผๆผพๆฝ†ๆฝ‡ๆฝ‹ๆฝๆฝๆฝ–ๆฝ˜ๆฝœๆฝžๆฝŸๆฝขๆฝฆๆฝฉๆฝญๆฝฎๆฝฒๆฝดๆฝตๆฝธๆฝบๆฝผๆฝฝๆฝพๆพ‚ๆพ„ๆพˆๆพ‰ๆพŒๆพๆพŽๆพ›ๆพœๆพกๆพฅๆพง" + "ๆพชๆพญๆพณๆพดๆพถๆพนๆพผๆพฝๆฟ€ๆฟ‚ๆฟ‰ๆฟ‹ๆฟ‘ๆฟ’ๆฟžๆฟ ๆฟกๆฟฉๆฟฎๆฟฏ็€Œ็€็€‘็€”็€š็€›็€ฃ็€ฑ็€ต็€น็€ผ็ˆ็Œ็็ž็ซ็ญ็ฏ็ฐ็ต" + "็ถ็ธ็ผ็พ็ฟ็‚€็‚…็‚†็‚‰็‚Š็‚Œ็‚Ž็‚’็‚”็‚•็‚–็‚˜็‚™็‚œ็‚็‚Ÿ็‚ฃ็‚ซ็‚ฌ็‚ญ็‚ฎ็‚ฏ็‚ฑ็‚ณ็‚ท็‚ธ็‚น็‚ป็‚ผ็‚ฝ็ƒ€็ƒ็ƒ‚็ƒƒ็ƒˆ" + "็ƒŠ็ƒ”็ƒ˜็ƒ™็ƒ›็ƒœ็ƒ็ƒŸ็ƒ ็ƒค็ƒฆ็ƒง็ƒจ็ƒฉ็ƒซ็ƒฌ็ƒญ็ƒฏ็ƒถ็ƒท็ƒน็ƒบ็ƒป็ƒฝ็„†็„‰็„Š็„Œ็„็„“็„•็„–็„—็„˜็„™็„š็„œ็„ž็„ฆ็„ฏ" + "็„ฐ็„ฑ็„ถ็…็…ƒ็……็…Š็…‹็…Œ็…Ž็…“็…œ็…ž็…Ÿ็…ค็…ฆ็…ง็…จ็…ฎ็…ฒ็…ณ็…ด็…ธ็…บ็…ฝ็†„็†‡็†Š็†็†”็†˜็†™็†›็†œ็†Ÿ็† ็†ฅ็†จ็†ฌ็†ต" + "็†น็†ป็‡ƒ็‡Š็‡‹็‡Ž็‡็‡”็‡•็‡š็‡ ็‡ฅ็‡ง็‡ฎ็‡น็ˆ†็ˆ‡็ˆ”็ˆš็ˆ็ˆŸ็ˆจ็ˆช็ˆฌ็ˆฐ็ˆฑ็ˆต็ˆถ็ˆท็ˆธ็ˆน็ˆป็ˆฝ็ˆฟ็‰็‰‚็‰‡็‰ˆ็‰Œ็‰" + "็‰’็‰–็‰™็‰š็‰›็‰็‰Ÿ็‰ก็‰ข็‰ค็‰ฅ็‰ฆ็‰ง็‰ฉ็‰ฎ็‰ฏ็‰ฒ็‰ต็‰น็‰บ็‰ป็‰พ็‰ฟ็Š€็Š็Š„็Š‡็ŠŠ็Š‹็Š็Š็Š’็ŠŸ็Šจ็Šฌ็Šฏ็Šฐ็Šด็Šถ็Šท" + "็Šธ็Šน็‹็‹‚็‹ƒ็‹„็‹ˆ็‹‰็‹็‹Ž็‹็‹’็‹—็‹™็‹็‹ž็‹ ็‹ก็‹จ็‹ฉ็‹ฌ็‹ญ็‹ฎ็‹ฏ็‹ฐ็‹ฑ็‹ฒ็‹ณ็‹ด็‹ท็‹ธ็‹บ็‹ป็‹ผ็Œ็Œƒ็Œ„็Œ‡็ŒŠ็ŒŽ" + "็Œ•็Œ–็Œ—็Œ›็Œœ็Œ็Œž็Œก็Œข็Œฅ็Œฉ็Œช็Œซ็Œฌ็Œฎ็Œฏ็Œฐ็Œฑ็Œด็Œท็Œน็Œบ็Œพ็Œฟ็็็’็—็ ็ฌ็ญ็ฏ็ด็พ็Žƒ็Ž„็Ž‡็Ž‰็Ž‹็ŽŽ" + "็Ž‘็Ž’็Ž“็Ž•็Ž–็Ž˜็Ž™็Žš็Ž›็Žž็ŽŸ็Ž ็Žก็Žข็Žค็Žฅ็Žฆ็Žฉ็Žซ็Žญ็Žฎ็Žฏ็Žฐ็Žฑ็Žฒ็Žณ็Žถ็Žท็Žน็Žบ็Žป็Žผ็Žฟ็€็‚็…็‡็ˆ็‰็Š" + "็‹็Œ็็็็‘็’็•็–็™็›็็ž็ ็ข็ฃ็ฅ็ฆ็ง็ฉ็ช็ซ็ญ็ฐ็ฒ็ต็ท็ธ็น็บ็ฝ็€็ƒ็„็…็†็‡็ˆ็‰็Š" + "็Ž็็็”็š็›็Ÿ็ก็ข็ค็ฅ็ฆ็จ็ช็ซ็ฌ็ญ็ฎ็ฏ็ฐ็ฒ็ณ็ด็ต็ถ็ผ็‘€็‘็‘‚็‘ƒ็‘„็‘…็‘†็‘‘็‘“็‘”็‘•็‘–็‘—็‘™" + "็‘š็‘›็‘œ็‘็‘ž็‘Ÿ็‘ข็‘ง็‘จ็‘ฌ็‘ญ็‘ฐ็‘ฑ็‘ณ็‘ถ็‘ท็‘พ็’€็’็’ƒ็’†็’‡็’ˆ็’‹็’Ž็’็’’็’˜็’œ็’ž็’Ÿ็’ ็’ฅ็’ง็’จ็’ฉ็’ช็’ฌ็’ฎ็’ฑ" + "็’ฒ็’บ็“€็“’็“–็“˜็“œ็“ž็“ ็“ข็“ฃ็“ค็“ฆ็“ฎ็“ฏ็“ด็“ถ็“ท็“ป็“ฟ็”„็”็”็”‘็”“็”—็”˜็”š็”œ็”Ÿ็”ก็”ฅ็”ฆ็”จ็”ฉ็”ช็”ซ็”ฌ็”ญ็”ฏ" + "็”ฐ็”ฑ็”ฒ็”ณ็”ต็”ท็”ธ็”บ็”ป็”พ็•€็•…็•ˆ็•‹็•Œ็•Ž็•็•”็•–็•™็•š็•›็•œ็•ค็•ฅ็•ฆ็•ช็•ฌ็•ฏ็•ฒ็•ด็•ธ็•น็•ฟ็–็–ƒ็–†็–็–็–" + "็–‘็–”็––็–—็–™็–š็–็–Ÿ็– ็–ก็–ข็–ฃ็–ค็–ฅ็–ซ็–ฌ็–ญ็–ฎ็–ฏ็–ฐ็–ฑ็–ฒ็–ณ็–ด็–ต็–ธ็–น็–ผ็–ฝ็–พ็—‚็—ƒ็—„็—…็—‡็—ˆ็—‰็—Š็—็—’" + "็—“็—”็—•็—˜็—›็—ž็—ข็—ฃ็—ค็—ฆ็—ง็—จ็—ช็—ซ็—ฐ็—ฑ็—ด็—น็—ผ็—ฟ็˜€็˜็˜ƒ็˜…็˜†็˜Š็˜Œ็˜็˜•็˜—็˜˜็˜™็˜›็˜Ÿ็˜ ็˜ข็˜ค็˜ฅ็˜ฆ็˜ฉ" + "็˜ช็˜ซ็˜ญ็˜ฐ็˜ณ็˜ด็˜ต็˜ธ็˜ผ็˜พ็˜ฟ็™€็™ƒ็™Œ็™็™”็™–็™—็™œ็™ž็™ฃ็™ซ็™ฏ็™ธ็™ป็™ฝ็™พ็™ฟ็š‚็š„็š†็š‡็šˆ็š‹็šŽ็š‘็š“็š•็š–็š™" + "็š›็šž็šค็šฆ็šญ็šฎ็šฑ็šฒ็šด็šฟ็›‚็›…็›†็›ˆ็›‰็›Š็›็›Ž็›็›็›‘็›’็›”็›–็›—็›˜็››็›Ÿ็›ฅ็›ฆ็›ฎ็›ฏ็›ฑ็›ฒ็›ด็›ท็›ธ็›น็›ผ็›พ" + "็œ็œ„็œ‡็œˆ็œ‰็œŠ็œ‹็œ็œ™็œš็œŸ็œ ็œข็œฆ็œจ็œฉ็œฌ็œญ็œฏ็œต็œถ็œท็œธ็œบ็œผ็€็็ƒ็„็‡็Ž็็‘็š็›็ก็ข็ฃ็ฅ็ฆ" + "็จ็ซ็ฌ็น็ฝ็พ็ฟ็ž€็ž„็ž…็ž‹็žŒ็ž็žŽ็ž‘็ž’็žŸ็ž ็žข็žฅ็žง็žฉ็žช็žซ็žฌ็žญ็žฐ็žณ็žต็žป็žฝ็žฟ็Ÿ็Ÿ—็Ÿ›็Ÿœ็Ÿž็Ÿข็Ÿฃ็Ÿฅ" + "็Ÿง็Ÿฉ็Ÿซ็Ÿฌ็Ÿญ็Ÿฎ็Ÿฐ็Ÿณ็Ÿถ็Ÿธ็Ÿป็Ÿผ็Ÿพ็Ÿฟ็ €็ ็ ‚็ „็ †็ ‰็ Œ็ ็ ‘็ ’็ ”็ –็ —็ ˜็ š็ œ็ ็ Ÿ็  ็ ฃ็ ฅ็ ง็ ซ็ ฌ็ ญ็ ฎ" + "็ ฐ็ ด็ ต็ ท็ ธ็ น็ บ็ ป็ ผ็ 
พ็ก€็ก็ก…็ก‡็กŠ็กŒ็ก็กŽ็ก็ก’็ก”็ก•็ก–็ก—็ก™็กš็ก็กช็กซ็กฌ็กญ็กฎ็กผ็กฟ็ขƒ็ข‡็ขˆ็ข‰็ขŒ็ข" + "็ขŽ็ข็ข‘็ข“็ข—็ข˜็ขš็ข›็ขœ็ขŸ็ขก็ขฃ็ขฅ็ขง็ขจ็ขฐ็ขฑ็ขฒ็ขณ็ขด็ขถ็ขน็ขพ็ฃ็ฃ…็ฃ‰็ฃŠ็ฃ‹็ฃ็ฃ็ฃ”็ฃ•็ฃ™็ฃœ็ฃก็ฃจ็ฃฌ็ฃฒ็ฃด็ฃท" + "็ฃน็ฃป็ค็ค…็คŒ็ค“็คž็คด็คต็คบ็คผ็คพ็ฅ€็ฅ็ฅƒ็ฅ†็ฅ‡็ฅˆ็ฅ‰็ฅŠ็ฅ‹็ฅŽ็ฅ็ฅ็ฅ“็ฅ•็ฅ–็ฅ—็ฅš็ฅ›็ฅœ็ฅ็ฅž็ฅŸ็ฅ ็ฅข็ฅฅ็ฅง็ฅจ็ฅญ" + "็ฅฏ็ฅฒ็ฅท็ฅธ็ฅบ็ฅผ็ฅพ็ฆ€็ฆ็ฆ„็ฆ…็ฆŠ็ฆ‹็ฆ็ฆ’็ฆ”็ฆ˜็ฆš็ฆ›็ฆค็ฆง็ฆณ็ฆน็ฆบ็ฆป็ฆฝ็ฆพ็ง€็ง็งƒ็ง†็ง‰็ง‹็ง็ง‘็ง’็ง•็ง˜็งŸ็งฃ" + "็งค็งฆ็งง็งฉ็งซ็งฌ็งญ็งฏ็งฐ็งธ็งป็งฝ็งพ็จ€็จ‚็จƒ็จ†็จ‹็จŒ็จ็จŽ็จ‘็จ”็จ—็จ™็จš็จž็จ ็จฃ็จณ็จท็จน็จป็จผ็จฝ็จฟ็ฉ„็ฉ†็ฉ‘็ฉ—" + "็ฉ™็ฉœ็ฉŸ็ฉฐ็ฉด็ฉถ็ฉท็ฉธ็ฉน็ฉบ็ฉฟ็ช€็ช็ชƒ็ช„็ช…็ชˆ็ชŠ็ช็ชŽ็ช‘็ช’็ช•็ช–็ช—็ช˜็ชœ็ช็ชŸ็ช ็ชฃ็ชฅ็ชฆ็ชจ็ชฌ็ชญ็ชณ็ชธ็ชฟ็ซ‹" + "็ซ‘็ซ–็ซ˜็ซ™็ซž็ซŸ็ซ ็ซฃ็ซฅ็ซฆ็ซซ็ซญ็ซฏ็ซน็ซบ็ซฝ็ซฟ็ฌƒ็ฌ„็ฌ†็ฌˆ็ฌŠ็ฌ‹็ฌ็ฌ‘็ฌ”็ฌ•็ฌ™็ฌ›็ฌž็ฌ ็ฌค็ฌฅ็ฌฆ็ฌจ็ฌช็ฌซ็ฌฌ็ฌฎ็ฌฏ" + "็ฌฑ็ฌณ็ฌธ็ฌบ็ฌผ็ฌพ็ญ€็ญ…็ญ‡็ญ‰็ญ‹็ญŒ็ญ็ญ็ญ‘็ญ’็ญ”็ญ–็ญ˜็ญš็ญ›็ญœ็ญ็ญ ็ญข็ญค็ญฅ็ญฆ็ญฎ็ญฑ็ญฒ็ญต็ญถ็ญท็ญน็ญป็ญผ็ญพ็ฎ€็ฎ…" + "็ฎ็ฎ็ฎ“็ฎ”็ฎ•็ฎ–็ฎ—็ฎœ็ฎก็ฎข็ฎฆ็ฎง็ฎจ็ฎฉ็ฎช็ฎซ็ฎฌ็ฎญ็ฎฑ็ฎด็ฎธ็ฏ็ฏ†็ฏ‡็ฏŒ็ฏ‘็ฏ“็ฏ™็ฏš็ฏ็ฏก็ฏฅ็ฏฆ็ฏช็ฏฎ็ฏฏ็ฏฑ็ฏท็ฏผ็ฏพ" + "็ฐƒ็ฐ‡็ฐ‰็ฐ‹็ฐŒ็ฐ็ฐ•็ฐ–็ฐ็ฐŸ็ฐ ็ฐง็ฐช็ฐฐ็ฐธ็ฐฟ็ฑ€็ฑ็ฑ็ฑฅ็ฑณ็ฑด็ฑป็ฑผ็ฑฝ็ฒ‰็ฒ‘็ฒ’็ฒ•็ฒ—็ฒ˜็ฒœ็ฒ็ฒž็ฒŸ็ฒข็ฒค็ฒฅ็ฒช็ฒฎ" + "็ฒฑ็ฒฒ็ฒณ็ฒน็ฒผ็ฒฝ็ฒพ็ฒฟ็ณ็ณ…็ณ‡็ณˆ็ณŠ็ณŒ็ณ็ณ’็ณ•็ณ–็ณ—็ณ™็ณœ็ณŸ็ณ ็ณจ็ณฏ็ณต็ณป็ดŠ็ด ็ดข็ดง็ดซ็ดฏ็ตœ็ตฎ็ตท็ถฆ็ถฎ็ธ ็ธข" + "็ธป็น็น„็น‡็บ‚็บ›็บ ็บก็บข็บฃ็บค็บฅ็บฆ็บง็บจ็บฉ็บช็บซ็บฌ็บญ็บฎ็บฏ็บฐ็บฑ็บฒ็บณ็บด็บต็บถ็บท็บธ็บน็บบ็บป็บผ็บฝ็บพ็บฟ็ป€็ป" + "็ป‚็ปƒ็ป„็ป…็ป†็ป‡็ปˆ็ป‰็ปŠ็ป‹็ปŒ็ป็ปŽ็ป็ป็ป‘็ป’็ป“็ป”็ป•็ป–็ป—็ป˜็ป™็ปš็ป›็ปœ็ป็ปž็ปŸ็ป ็ปก็ปข็ปฃ็ปค็ปฅ็ปฆ็ปง็ปจ็ปฉ" + "็ปช็ปซ็ปญ็ปฎ็ปฏ็ปฐ็ปฑ็ปฒ็ปณ็ปด็ปต็ปถ็ปท็ปธ็ปน็ปบ็ปป็ปผ็ปฝ็ปพ็ปฟ็ผ€็ผ็ผ‚็ผƒ็ผ„็ผ…็ผ†็ผ‡็ผˆ็ผ‰็ผŠ็ผŒ็ผŽ็ผ็ผ‘็ผ’็ผ“็ผ”็ผ•" + "็ผ–็ผ—็ผ˜็ผ™็ผš็ผ›็ผœ็ผ็ผž็ผŸ็ผ ็ผก็ผข็ผฃ็ผค็ผฅ็ผฆ็ผง็ผจ็ผฉ็ผช็ผซ็ผฌ็ผญ็ผฎ็ผฏ็ผฐ็ผฑ็ผฒ็ผณ็ผด็ผต็ผถ็ผธ็ผบ็ฝ‚็ฝ„็ฝ…็ฝ็ฝ" + "็ฝ‘็ฝ”็ฝ•็ฝ—็ฝ˜็ฝš็ฝŸ็ฝก็ฝข็ฝจ็ฝฉ็ฝช็ฝฎ็ฝฑ็ฝฒ็ฝด็ฝถ็ฝน็ฝฝ็ฝพ็พ็พŠ็พŒ็พŽ็พ‘็พ“็พ”็พ•็พ–็พš็พ็พž็พŸ็พก็พค็พง็พฏ็พฐ็พฑ็พฒ" + "็พธ็พน็พผ็พฝ็พฟ็ฟ€็ฟ็ฟ‚็ฟƒ็ฟ…็ฟˆ็ฟŠ็ฟŒ็ฟŽ็ฟ”็ฟ•็ฟ˜็ฟ™็ฟš็ฟ›็ฟŸ็ฟ ็ฟก็ฟฅ็ฟฆ็ฟฉ็ฟฎ็ฟฏ็ฟฐ็ฟฑ็ฟณ็ฟท็ฟป็ฟผ็ฟพ่€€่€่€ƒ่€„่€…" + "่€†่€‡่€‹่€Œ่€่€่€่€‘่€’่€”่€•่€–่€—่€˜่€™่€œ่€ ่€ข่€ค่€ฅ่€ฆ่€ง่€จ่€ฉ่€ช่€ฐ่€ฑ่€ณ่€ต่€ถ่€ท่€ธ่€ป่€ฝ่€ฟ่‚่ƒ่†่Š่‹" + "่Œ่่’่”่˜่š่ฉ่ช่ฑ่ฟ่‚ƒ่‚„่‚†่‚‡่‚‰่‚‹่‚Œ่‚“่‚–่‚˜่‚š่‚›่‚่‚Ÿ่‚ ่‚ก่‚ข่‚ค่‚ฅ่‚ฉ่‚ช่‚ซ่‚ญ่‚ฎ่‚ฏ่‚ฑ่‚ฒ่‚ด่‚ท่‚ธ" + "่‚บ่‚ผ่‚ฝ่‚พ่‚ฟ่ƒ€่ƒ่ƒ‚่ƒƒ่ƒ„่ƒ†่ƒˆ่ƒŒ่ƒ่ƒŽ่ƒ–่ƒ—่ƒ™่ƒš่ƒ›่ƒœ่ƒ่ƒž่ƒ ่ƒก่ƒฃ่ƒค่ƒฅ่ƒง่ƒจ่ƒฉ่ƒช่ƒซ่ƒฌ่ƒญ่ƒฏ่ƒฐ่ƒฑ่ƒฒ่ƒณ" + "่ƒด่ƒถ่ƒธ่ƒบ่ƒผ่ƒฝ่„‚่„†่„‰่„Š่„่„Ž่„่„่„‘่„’่„“่„”่„–่„˜่„š่„ž่„Ÿ่„ฉ่„ฌ่„ฏ่„ฑ่„ฒ่„ถ่„ธ่„พ่„ฟ่…†่…ˆ่…Š่…‹่…Œ่…่…‘่…’" + "่…“่…”่…•่…˜่…™่…š่… ่…ฅ่…ง่…จ่…ฉ่…ญ่…ฎ่…ฏ่…ฐ่…ฑ่…ด่…น่…บ่…ป่…ผ่…ฝ่…พ่…ฟ่†€่†‚่†ˆ่†Š่†่†‘่†˜่†™่†›่†œ่†่†ฆ่†จ่†ณ่†บ่†ป" + "่‡€่‡‚่‡ƒ่‡†่‡Š่‡Œ่‡‘่‡œ่‡ฃ่‡ง่‡ช่‡ฌ่‡ญ่‡ณ่‡ด่‡ป่‡ผ่‡พ่ˆ€่ˆ่ˆ‚่ˆ„่ˆ…่ˆ†่ˆŒ่ˆ่ˆ่ˆ’่ˆ”่ˆ›่ˆœ่ˆž่ˆŸ่ˆ ่ˆข่ˆฃ่ˆฅ่ˆช่ˆซ่ˆฌ" + "่ˆญ่ˆฏ่ˆฐ่ˆฑ่ˆฒ่ˆณ่ˆด่ˆต่ˆถ่ˆท่ˆธ่ˆน่ˆป่ˆพ่‰„่‰…่‰‡่‰‰่‰‹่‰Ž่‰่‰˜่‰š่‰Ÿ่‰จ่‰ฎ่‰ฏ่‰ฐ่‰ฒ่‰ณ่‰ด่‰บ่‰ฝ่‰พ่‰ฟ่Š‚่Šƒ่Š„่Šˆ่ŠŠ" + "่Š‹่Š่ŠŽ่Š่Š‘่Š’่Š—่Š˜่Š™่Šœ่Š่ŠŸ่Š ่Šก่Šฃ่Šค่Šฅ่Šฆ่Šจ่Šฉ่Šช่Šซ่Šฌ่Šญ่Šฎ่Šฏ่Šฐ่Šฑ่Šณ่Šด่Šท่Šธ่Šน่Šผ่Šฝ่Šพ่‹่‹„่‹‡่‹ˆ" + "่‹‰่‹Š่‹‹่‹Œ่‹่‹Ž่‹่‹‘่‹’่‹“่‹”่‹•่‹—่‹˜่‹›่‹œ่‹ž่‹Ÿ่‹ ่‹ก่‹ฃ่‹ค่‹ฅ่‹ฆ่‹ง่‹ซ่‹ฏ่‹ฑ่‹ด่‹ท่‹น่‹ป่‹พ่Œ€่Œ่Œ‚่Œƒ่Œ„่Œ…่Œ†" + "่Œˆ่Œ‰่Œ‹่ŒŒ่ŒŽ่Œ่Œ‘่Œ“่Œ”่Œ•่Œ—่Œš่Œ›่Œœ่Œ่Œง่Œจ่Œซ่Œฌ่Œญ่Œฏ่Œฑ่Œณ่Œด่Œต่Œถ่Œธ่Œน่Œบ่Œผ่Œฝ่€่่ƒ่„่†่‡่‰่่" + "่‘่’่“่”่–่™่š่›่œ่ž่Ÿ่ ่ก่ฃ่ค่ฅ่ฆ่ง่จ่ฉ่ช่ซ่ฌ่ญ่ฎ่ฏ่ท่ธ่ป่ผ่ฝ่Ž…่Ž†่Ž‰่ŽŽ่Ž’่Ž“่Ž˜่Ž™่Ž›" + "่Žœ่Ž่Žž่Ž 
่Žจ่Žฉ่Žช่Žซ่Žฐ่Žฑ่Žฒ่Žณ่Žด่Žถ่Žท่Žธ่Žน่Žบ่Žผ่Žฝ่Žฟ่€่่‚่…่‡่‰่Š่Œ่่่”่–่˜่œ่่Ÿ่ ่ก่ฅ" + "่ฉ่ช่ฐ่ฑ่ฒ่น่ผ่ฝ่่ƒ่„่†่‹่Œ่่Ž่่‘่˜่š่œ่่ฃ่ค่ฅ่ฆ่ง่จ่ฉ่ฑ่ณ่ธ่น่ผ่ฝ่‘†่‘Ž่‘‘่‘–่‘—" + "่‘™่‘š่‘›่‘œ่‘ก่‘ฃ่‘ฉ่‘ซ่‘ฌ่‘ญ่‘ฐ่‘ฑ่‘ณ่‘ด่‘ต่‘ถ่‘ธ่‘บ่’‚่’„่’‡่’ˆ่’‰่’‹่’Œ่’Ž่’่’—่’™่’œ่’Ÿ่’ก่’จ่’ฏ่’ฑ่’ฒ่’ด่’ธ่’น่’บ" + "่’ป่’ฝ่’ฟ่“่“‚่“„่“‡่“‰่“Š่“่“่“่“‘่““่“–่“่“Ÿ่“ ่“ข่“ฃ่“ฅ่“ฆ่“ฌ่“ฐ่“ผ่“ฟ่”€่”ƒ่”ˆ่”Š่”Œ่”‘่”“่”—่”š่”Ÿ่”ก่”ซ่”ฌ่”ท" + "่”ธ่”น่”บ่”ป่”ผ่”ฝ่•ƒ่•ˆ่•‰่•Š่•–่•—่•™่•ž่•ค่•จ่•ฐ่•ฒ่•ด่•น่•บ่•ป่•พ่–่–„่–…่–‡่–่–›่–œ่–ข่–ค่–จ่–ช่–ฎ่–ฏ่–ฐ่–ณ่–ท่–ธ" + "่–น่–ฟ่—่—‰่—่—่—“่—•่—œ่—Ÿ่— ่—ค่—ฆ่—จ่—ฉ่—ป่—ฟ่˜…่˜‘่˜–่˜˜่˜ง่˜ฉ่˜ธ่˜ผ่™Ž่™่™่™‘่™’่™“่™”่™š่™ž่™ข่™ค่™ซ่™ฌ่™ฎ่™ฑ" + "่™ท่™ธ่™น่™บ่™ป่™ผ่™ฝ่™พ่™ฟ่š€่š่š‚่š„่š†่šŠ่š‹่šŒ่š่š“่š•่šœ่š่šฃ่šค่šง่šจ่šฉ่šช่šฌ่šฏ่šฐ่šฑ่šฒ่šด่šถ่šบ่›€่›ƒ่›„่›†" + "่›‡่›‰่›Š่›‹่›Ž่›่›่›‘่›”่›˜่›™่››่›ž่›Ÿ่›ค่›ฉ่›ญ่›ฎ่›ฐ่›ฑ่›ฒ่›ณ่›ด่›ธ่›น่›พ่œ€่œ‚่œƒ่œ‡่œˆ่œ‰่œŠ่œ่œŽ่œ่œ’่œ“่œ•่œ—" + "่œ˜่œš่œœ่œž่œก่œข่œฃ่œฅ่œฉ่œฎ่œฑ่œด่œท่œป่œพ่œฟ่‡่ˆ่‰่Œ่Ž่“่—่˜่™่ ่ฃ่ค่ฅ่ฎ่ฐ่ฒ่ด่ถ่ป่ผ่ฝ่พ่ž‚่žƒ" + "่ž…่žˆ่ž‹่ž่ž—่žŸ่ž ่žฃ่žจ่žซ่žฌ่žญ่žฏ่žฑ่žณ่žต่žบ่žฝ่Ÿ€่Ÿ†่ŸŠ่Ÿ‹่Ÿ่Ÿ‘่Ÿ’่Ÿ›่Ÿ ่Ÿฅ่Ÿช่Ÿซ่Ÿฎ่Ÿน่Ÿพ่ ƒ่ Š่ ‹่ “่ •่ –่ ก" + "่ ข่ ฒ่ น่ ผ่ก€่กƒ่ก„่ก…่กŒ่ก่กŽ่ก’่ก”่ก—่ก™่ก ่กก่กข่กฃ่กฅ่กจ่กฉ่กซ่กฌ่กฎ่กฐ่กฒ่กท่กฝ่กพ่กฟ่ข่ข‚่ข„่ข…่ข†่ขˆ่ข‹่ข่ข’" + "่ข–่ข—่ขœ่ขข่ขค่ขช่ขซ่ขญ่ขฏ่ขฑ่ขท่ขผ่ฃ่ฃ‚่ฃ…่ฃ†่ฃˆ่ฃ‰่ฃŽ่ฃ’่ฃ”่ฃ•่ฃ˜่ฃ™่ฃ›่ฃŸ่ฃข่ฃฃ่ฃค่ฃฅ่ฃจ่ฃฐ่ฃฑ่ฃณ่ฃด่ฃธ่ฃน่ฃผ่ฃพ่ค‚" + "่คŠ่ค่ค’่ค“่ค•่ค™่คš่ค›่คŸ่คก่คฅ่คช่คซ่คฏ่คฐ่คด่คถ่ฅ่ฅ„่ฅ•่ฅš่ฅœ่ฅž่ฅŸ่ฅฆ่ฅซ่ฅป่ฅฟ่ฆ่ฆƒ่ฆ†่ง่ง‚่งƒ่ง„่ง…่ง†่ง‡่งˆ่ง‰" + "่งŠ่ง‹่งŒ่งŽ่ง่ง่ง‘่ง’่ง–่งš่งœ่งž่งŸ่งฃ่งฅ่งฆ่งซ่งญ่งฏ่งฑ่งณ่งฟ่จ€่จ„่จ‡่จš่จพ่ฉˆ่ฉŸ่ฉน่ช‰่ชŠ่ช“่ฌ‡่ญฆ่ญฌ่ฎก่ฎข่ฎฃ่ฎค" + "่ฎฅ่ฎฆ่ฎง่ฎจ่ฎฉ่ฎช่ฎซ่ฎญ่ฎฎ่ฎฏ่ฎฐ่ฎฑ่ฎฒ่ฎณ่ฎด่ฎต่ฎถ่ฎท่ฎธ่ฎน่ฎบ่ฎป่ฎผ่ฎฝ่ฎพ่ฎฟ่ฏ€่ฏ่ฏ‚่ฏƒ่ฏ„่ฏ…่ฏ†่ฏ‡่ฏˆ่ฏ‰่ฏŠ่ฏ‹่ฏŒ่ฏ" + "่ฏŽ่ฏ่ฏ่ฏ‘่ฏ’่ฏ“่ฏ”่ฏ•่ฏ–่ฏ—่ฏ˜่ฏ™่ฏš่ฏ›่ฏœ่ฏ่ฏž่ฏŸ่ฏ ่ฏก่ฏข่ฏฃ่ฏค่ฏฅ่ฏฆ่ฏง่ฏจ่ฏฉ่ฏซ่ฏฌ่ฏญ่ฏฎ่ฏฏ่ฏฐ่ฏฑ่ฏฒ่ฏณ่ฏด่ฏต่ฏท" + "่ฏธ่ฏน่ฏบ่ฏป่ฏผ่ฏฝ่ฏพ่ฏฟ่ฐ€่ฐ่ฐ‚่ฐƒ่ฐ„่ฐ…่ฐ†่ฐ‡่ฐˆ่ฐŠ่ฐ‹่ฐŒ่ฐ่ฐŽ่ฐ่ฐ่ฐ‘่ฐ’่ฐ“่ฐ”่ฐ•่ฐ–่ฐ—่ฐ™่ฐš่ฐ›่ฐœ่ฐ่ฐž่ฐŸ่ฐ ่ฐก" + "่ฐข่ฐฃ่ฐค่ฐฅ่ฐฆ่ฐง่ฐจ่ฐฉ่ฐช่ฐซ่ฐฌ่ฐญ่ฐฎ่ฐฏ่ฐฐ่ฐฑ่ฐฒ่ฐณ่ฐด่ฐต่ฐถ่ฐท่ฐผ่ฐฟ่ฑ่ฑ†่ฑ‡่ฑ‰่ฑŒ่ฑ•่ฑš่ฑก่ฑข่ฑจ่ฑช่ฑซ่ฑฎ่ฑณ่ฑธ่ฑน" + "่ฑบ่ฒ‚่ฒ…่ฒ†่ฒ‰่ฒŠ่ฒŒ่ฒ”่ฒ˜่ด่ดž่ดŸ่ดก่ดข่ดฃ่ดค่ดฅ่ดฆ่ดง่ดจ่ดฉ่ดช่ดซ่ดฌ่ดญ่ดฎ่ดฏ่ดฐ่ดฑ่ดฒ่ดณ่ดด่ดต่ดถ่ดท่ดธ่ดน่ดบ่ดป่ดผ" + "่ดฝ่ดพ่ดฟ่ต€่ต่ต‚่ตƒ่ต„่ต…่ต†่ต‡่ตˆ่ต‰่ตŠ่ต‹่ตŒ่ต่ตŽ่ต่ต่ต‘่ต’่ต“่ต”่ต•่ต–่ต—่ต˜่ต™่ตš่ต›่ตœ่ต่ตž่ตŸ่ต ่ตก่ตข่ตฃ่ตค" + "่ตฆ่ตง่ตช่ตซ่ตญ่ตฐ่ตณ่ตด่ตต่ตถ่ตท่ถ่ถ„่ถ…่ถŠ่ถ‹่ถ‘่ถ”่ถŸ่ถฃ่ถฏ่ถฑ่ถณ่ถด่ถต่ถธ่ถบ่ถผ่ถพ่ถฟ่ท‚่ทƒ่ท„่ท†่ท‹่ทŒ่ทŽ่ท่ท่ท‘" + "่ท–่ท—่ทš่ท›่ท่ทž่ทŸ่ทฃ่ทค่ทจ่ทช่ทฌ่ทฏ่ทฑ่ทณ่ทต่ทถ่ทท่ทธ่ทน่ทบ่ทป่ทฝ่ธ…่ธ‰่ธŠ่ธŒ่ธ่ธ’่ธ”่ธ่ธž่ธŸ่ธข่ธฃ่ธฆ่ธฉ่ธช่ธฌ่ธฎ" + "่ธฏ่ธฑ่ธต่ธถ่ธน่ธบ่ธฝ่น€่น่น‚่น„่น…่น‡่นˆ่น‰่นŠ่น‹่น่น‘่น’่น™่นš่นœ่นข่นฆ่นฉ่นฌ่นญ่นฏ่นฐ่นฒ่นด่นถ่นผ่นฝ่นพ่นฟ่บ่บ…่บ‡" + "่บ่บ่บ”่บœ่บž่บซ่บฌ่บฏ่บฒ่บบ่ฝฆ่ฝง่ฝจ่ฝฉ่ฝช่ฝซ่ฝฌ่ฝญ่ฝฎ่ฝฏ่ฝฐ่ฝฑ่ฝฒ่ฝณ่ฝด่ฝต่ฝถ่ฝท่ฝธ่ฝน่ฝบ่ฝป่ฝผ่ฝฝ่ฝพ่ฝฟ่พ€่พ่พ‚่พƒ" + "่พ„่พ…่พ†่พ‡่พˆ่พ‰่พŠ่พ‹่พŒ่พ่พŽ่พ่พ่พ‘่พ’่พ“่พ”่พ•่พ–่พ—่พ˜่พ™่พš่พ›่พœ่พž่พŸ่พฃ่พจ่พฉ่พซ่พฐ่พฑ่พน่พฝ่พพ่พฟ่ฟ่ฟ‚่ฟ„" + "่ฟ…่ฟ‡่ฟˆ่ฟŽ่ฟ่ฟ‘่ฟ“่ฟ”่ฟ•่ฟ˜่ฟ™่ฟ›่ฟœ่ฟ่ฟž่ฟŸ่ฟข่ฟค่ฟฅ่ฟฆ่ฟจ่ฟฉ่ฟช่ฟซ่ฟญ่ฟฎ่ฟฐ่ฟณ่ฟท่ฟธ่ฟน่ฟบ่ฟฝ้€€้€้€‚้€ƒ้€„้€…้€†" + "้€‰้€Š้€‹้€้€้€้€‘้€’้€”้€–้€—้€š้€›้€้€ž้€Ÿ้€ ้€ก้€ข้€ฆ้€ญ้€ฎ้€ฏ้€ด้€ต้€ถ้€ธ้€ป้€ผ้€พ้้‚้„้†้‡้้้้‘้’" + "้“้—้˜้›้ข้ฃ้ฅ้จ้ญ้ฎ้ด้ต้น้ฝ้ฟ้‚€้‚‚้‚ƒ้‚ˆ้‚‹้‚‘้‚“้‚•้‚—้‚˜้‚™้‚›้‚้‚ ้‚ก้‚ข้‚ฃ้‚ฆ้‚จ้‚ช้‚ฌ้‚ฎ้‚ฏ้‚ฐ้‚ฑ" + 
"้‚ฒ้‚ณ้‚ด้‚ต้‚ถ้‚ธ้‚น้‚บ้‚ป้‚ฝ้‚พ้‚ฟ้ƒ้ƒƒ้ƒ„้ƒ…้ƒ‡้ƒˆ้ƒŠ้ƒŽ้ƒ้ƒ้ƒ‘้ƒ“้ƒ—้ƒš้ƒ›้ƒœ้ƒ้ƒก้ƒข้ƒค้ƒฆ้ƒง้ƒจ้ƒช้ƒซ้ƒญ้ƒฏ้ƒด" + "้ƒธ้ƒฝ้ƒพ้ƒฟ้„€้„‚้„ƒ้„„้„…้„Œ้„‘้„—้„˜้„™้„š้„œ้„ž้„ ้„ข้„ฃ้„ซ้„ฏ้„ฑ้„น้…‚้…ƒ้……้…†้…‰้…Š้…‹้…Œ้…้…Ž้…้…้…’้…—้…š้…" + "้…ž้…ก้…ข้…ฃ้…ค้…ฅ้…ฆ้…ฉ้…ช้…ฌ้…ฎ้…ฏ้…ฐ้…ฑ้…ฒ้…ด้…ต้…ถ้…ท้…ธ้…น้…บ้…ฝ้…พ้…ฟ้†…้†‡้†‰้†‹้†Œ้†้†้†‘้†’้†š้†›้†ข้†จ้†ช้†ญ" + "้†ฎ้†ฏ้†ด้†ต้†บ้†พ้‡‡้‡‰้‡Š้‡Œ้‡้‡Ž้‡้‡้‡‘้‡œ้‰ด้ŠŽ้Šฎ้‹†้‹ˆ้Œพ้ช้Ž้Š้–้พ้‘ซ้’†้’‡้’ˆ้’‰้’Š้’‹้’Œ้’้’Ž้’้’้’’" + "้’“้’”้’•้’–้’—้’˜้’™้’š้’›้’œ้’้’ž้’Ÿ้’ ้’ก้’ข้’ฃ้’ค้’ฅ้’ฆ้’ง้’จ้’ฉ้’ช้’ซ้’ฌ้’ญ้’ฎ้’ฏ้’ฐ้’ฑ้’ฒ้’ณ้’ด้’ต้’ท้’น้’บ้’ป้’ผ" + "้’ฝ้’พ้’ฟ้“€้“้“‚้“ƒ้“„้“…้“†้“ˆ้“‰้“Š้“‹้“Œ้“้“Ž้“้“้“‘้“’้“•้“–้“—้“˜้“™้“š้“›้“œ้“้“ž้“Ÿ้“ ้“ก้“ข้“ฃ้“ค้“ฅ้“ง้“จ" + "้“ฉ้“ช้“ซ้“ฌ้“ญ้“ฎ้“ฏ้“ฐ้“ฑ้“ฒ้“ณ้“ด้“ต้“ถ้“ท้“ธ้“น้“บ้“ป้“ผ้“ฝ้“พ้“ฟ้”€้”้”‚้”ƒ้”„้”…้”†้”‡้”ˆ้”‰้”Š้”‹้”Œ้”้”Ž้”้”" + "้”‘้”’้”“้””้”•้”–้”—้”˜้”™้”š้”›้”œ้”้”ž้”Ÿ้”ก้”ข้”ฃ้”ค้”ฅ้”ฆ้”ง้”จ้”ฉ้”ช้”ซ้”ฌ้”ญ้”ฎ้”ฏ้”ฐ้”ฑ้”ฒ้”ณ้”ด้”ต้”ถ้”ท้”ธ้”น" + "้”บ้”ป้”ผ้”ฝ้”พ้”ฟ้•€้•้•‚้•ƒ้•„้•…้•†้•‡้•ˆ้•‰้•Š้•‹้•Œ้•้•Ž้•้•้•‘้•’้•“้•”้••้•–้•—้•˜้•š้•›้•œ้•้•ž้• ้•ก้•ข้•ฃ" + "้•ค้•ฅ้•ฆ้•ง้•จ้•ฉ้•ช้•ซ้•ฌ้•ญ้•ฎ้•ฏ้•ฐ้•ฑ้•ฒ้•ณ้•ด้•ต้•ถ้•ฟ้—จ้—ฉ้—ช้—ซ้—ญ้—ฎ้—ฏ้—ฐ้—ฑ้—ฒ้—ณ้—ด้—ต้—ถ้—ท้—ธ้—น้—บ้—ป้—ผ" + "้—ฝ้—พ้—ฟ้˜€้˜้˜‚้˜ƒ้˜„้˜…้˜†้˜‡้˜ˆ้˜‰้˜Š้˜‹้˜Œ้˜้˜Ž้˜้˜้˜‘้˜’้˜”้˜•้˜–้˜—้˜˜้˜™้˜š้˜œ้˜Ÿ้˜ก้˜ช้˜ฎ้˜ฑ้˜ฒ้˜ณ้˜ด้˜ต้˜ถ" + "้˜ป้˜ผ้˜ฝ้˜ฟ้™€้™‚้™„้™…้™†้™‡้™ˆ้™‰้™‹้™Œ้™้™Ž้™้™‘้™”้™•้™›้™ž้™Ÿ้™ก้™ข้™ค้™ง้™จ้™ฉ้™ช้™ฌ้™ฒ้™ด้™ต้™ถ้™ท้šƒ้š…้š†้šˆ" + "้š‹้š้š้š้š”้š—้š˜้š™้šœ้šง้šฉ้šฐ้šณ้šถ้šน้šบ้šผ้šฝ้šพ้›€้›้›„้›…้›†้›‡้›‰้›Š้›Œ้›้›Ž้›้›’้›•้› ้›จ้›ฉ้›ช้›ฏ้›ฑ้›ณ" + "้›ถ้›ท้›น้›พ้œ€้œ้œ„้œ…้œ†้œ‡้œˆ้œ‰้œ้œŽ้œ้œ“้œ–้œœ้œž้œจ้œช้œญ้œฐ้œฒ้œธ้œน้œพ้’้“้–้™้›้ž้ ้ก้ข้ฅ้ฉ้ฌ้ฐ" + "้ณ้ด้ถ้ธ้บ้ผ้ฝ้ฟ้ž้ž…้ž‹้ž้ž‘้ž’้ž”้ž˜้ž ้žก้žฃ้žง้žจ้žซ้žฌ้žญ้žฎ้žฏ้žฒ้žณ้žด้Ÿ‚้Ÿฆ้Ÿง้Ÿจ้Ÿฉ้Ÿช้Ÿซ้Ÿฌ้Ÿญ้Ÿณ้Ÿต" + "้Ÿถ้กต้กถ้กท้กธ้กน้กบ้กป้กผ้กฝ้กพ้กฟ้ข€้ข้ข‚้ขƒ้ข„้ข…้ข†้ข‡้ขˆ้ข‰้ขŠ้ข‹้ขŒ้ข้ขŽ้ข้ข้ข‘้ข“้ข”้ข–้ข—้ข˜้ข™้ขš้ข›้ขœ้ข" + "้ขž้ขŸ้ข ้ขก้ขข้ขค้ขฅ้ขฆ้ขง้ฃŽ้ฃ้ฃ้ฃ‘้ฃ’้ฃ“้ฃ”้ฃ•้ฃ—้ฃ˜้ฃ™้ฃž้ฃŸ้ฃง้ฃจ้ค้ค้คฎ้ฅ”้ฅ•้ฅฅ้ฅง้ฅจ้ฅฉ้ฅช้ฅซ้ฅฌ้ฅญ้ฅฎ้ฅฏ้ฅฐ" + "้ฅฑ้ฅฒ้ฅณ้ฅด้ฅต้ฅถ้ฅท้ฅธ้ฅน้ฅบ้ฅป้ฅผ้ฅฝ้ฅฟ้ฆ้ฆƒ้ฆ„้ฆ…้ฆ†้ฆ‡้ฆˆ้ฆ‰้ฆŠ้ฆ‹้ฆŒ้ฆ้ฆ้ฆ้ฆ‘้ฆ’้ฆ“้ฆ”้ฆ•้ฆ–้ฆ—้ฆ˜้ฆ™้ฆ้ฆž้ฆฅ" + "้ฆง้ฆจ้ฉฌ้ฉญ้ฉฎ้ฉฏ้ฉฐ้ฉฑ้ฉฒ้ฉณ้ฉด้ฉต้ฉถ้ฉท้ฉธ้ฉน้ฉบ้ฉป้ฉผ้ฉฝ้ฉพ้ฉฟ้ช€้ช้ช‚้ชƒ้ช„้ช…้ช†้ช‡้ชˆ้ช‰้ชŠ้ช‹้ชŒ้ช้ชŽ้ช้ช้ช‘" + "้ช’้ช“้ช•้ช–้ช—้ช˜้ช™้ชš้ช›้ชœ้ช้ชž้ชŸ้ช ้ชก้ชข้ชฃ้ชค้ชฅ้ชฆ้ชง้ชจ้ชฐ้ชฑ้ชถ้ชท้ชธ้ชบ้ชผ้ซ€้ซ้ซ‚้ซƒ้ซ…้ซ‹้ซŒ้ซŽ้ซ‘้ซ“้ซ˜" + "้ซก้ซข้ซฆ้ซซ้ซญ้ซฏ้ซน้ซป้ซฝ้ฌƒ้ฌˆ้ฌ้ฌ’้ฌ“้ฌ˜้ฌŸ้ฌฃ้ฌฏ้ฌฒ้ฌถ้ฌท้ฌป้ฌผ้ญ้ญ‚้ญƒ้ญ„้ญ…้ญ†้ญ‡้ญˆ้ญ‰้ญ‹้ญ้ญ้ญ‘้ญ”้ฑผ้ฑฝ้ฑพ" + "้ฑฟ้ฒ€้ฒ้ฒ‚้ฒƒ้ฒ…้ฒ†้ฒ‡้ฒˆ้ฒ‰้ฒŠ้ฒ‹้ฒŒ้ฒ้ฒŽ้ฒ้ฒ้ฒ‘้ฒ’้ฒ”้ฒ•้ฒ–้ฒ—้ฒ˜้ฒ™้ฒš้ฒ›้ฒœ้ฒ้ฒž้ฒŸ้ฒ ้ฒก้ฒข้ฒฃ้ฒค้ฒฅ้ฒฆ้ฒง้ฒจ" + "้ฒฉ้ฒช้ฒซ้ฒฌ้ฒญ้ฒฎ้ฒฏ้ฒฐ้ฒฑ้ฒฒ้ฒณ้ฒด้ฒต้ฒท้ฒธ้ฒน้ฒบ้ฒป้ฒผ้ฒฝ้ฒพ้ฒฟ้ณ€้ณ้ณ‚้ณƒ้ณ„้ณ…้ณ‡้ณˆ้ณ‰้ณŠ้ณŒ้ณ้ณŽ้ณ้ณ้ณ‘้ณ’้ณ“" + "้ณ”้ณ•้ณ–้ณ—้ณ˜้ณ™้ณš้ณ›้ณœ้ณ้ณž้ณŸ้ณ ้ณก้ณข้ณฃ้ณค้ธŸ้ธ ้ธก้ธข้ธฃ้ธค้ธฅ้ธฆ้ธง้ธจ้ธฉ้ธช้ธซ้ธฌ้ธญ้ธฎ้ธฏ้ธฐ้ธฑ้ธฒ้ธณ้ธต้ธถ" + "้ธท้ธธ้ธน้ธบ้ธป้ธผ้ธฝ้ธพ้ธฟ้น€้น้น‚้นƒ้น„้น…้น†้น‡้นˆ้น‰้นŠ้น‹้นŒ้น้นŽ้น้น้น‘้น’้น”้น•้น–้น—้น˜้น™้นš้น›้นœ้น้นž้นŸ" + "้น ้นก้นข้นฃ้นค้นฆ้นง้นจ้นฉ้นช้นซ้นฌ้นญ้นฎ้นฏ้นฐ้นฑ้นฒ้นณ้นด้นพ้นฟ้บ€้บ‚้บ‡้บˆ้บ‹้บ‘้บ’้บ“้บ–้บ้บŸ้บฆ้บธ้บน้บป้บฝ้บพ้ป„" + "้ป‡้ป‰้ป้ปŽ้ป้ป‘้ป”้ป˜้ป›้ปœ้ป้ปŸ้ป ้ปก้ปข้ปฅ้ปง้ปฉ้ปช้ปฏ้ปน้ปป้ปผ้ปพ้ผ‹้ผ้ผŽ้ผ้ผ’้ผ“้ผ—้ผ™้ผ ้ผข้ผฉ้ผซ้ผฌ้ผฏ้ผฑ้ผท" + "้ผน้ผป้ผฝ้ผพ้ฝ้ฝ‡้ฝ‰้ฝ้ฝ‘้ฝฟ้พ€้พ้พ‚้พƒ้พ„้พ…้พ†้พ‡้พˆ้พ‰้พŠ้พ‹้พŒ้พ™้พš้พ›้พŸ้พ ้พข้ฟ้ฟŽ้ฟใ‘‡ใ‘Šใ•ฎใ˜Žใ™ใ™˜ใ™ฆใ›ƒ" + "ใ›šใ›นใŸƒใ ‡ใ 
“ใค˜ใฅ„ใงใง‘ใงŸใซฐใฌŠใฌŽใฌšใญŽใญ•ใฎพใฐ€ใณ‡ใณ˜ใณšใด”ใตใถฒใธ†ใธŒใบ„ใปฌใฝใฟ ไ–ไ‚ฎไƒ…ไƒŽไ…ŸไŒนไŽƒไŽ–ไไก" + "ไฒไƒไ“–ไ“›ไ“จไ“ซไ“ฌไ—–ไ—›ไ—ชไ—ดไœฃไ™ไขบไขผไฃ˜ไฅฝไฆƒไฒŸไฒ ไฒขได“ได”ได•ได–ได—ได˜ได™ไถฎ๐ …ค๐ ™ถ๐ ณ๐กŽš๐ก“๐ฃ—‹๐ฃฒ—๐ฃฒ˜๐ฃธฃ๐คง›๐คฉฝ" + "๐คซ‰๐ฅ”ฒ๐ฅ•ข๐ฅ–จ๐ฅป—๐ฆˆก๐ฆ’๐ฆ™ถ๐ฆผ๐ฆญœ๐ฆฐก๐งฟน๐จˆ๐จ™ธ๐จš•๐จŸ ๐จญ‰๐จฑ‡๐จฑ๐จฑ‘๐จฑ”๐จบ™๐ฉฝพ๐ฉพƒ๐ฉพŒ๐ชŸ๐ชฃป๐ชค—๐ชจฐ๐ชจถ๐ชฉ˜๐ชพข๐ซ„ง๐ซ„จ๐ซ„ท๐ซ„ธ๐ซ‡ญ๐ซŒ€๐ซฃ๐ซฏ" + "๐ซฒ๐ซฝ๐ซ„๐ซ๐ซ“๐ซ‘ก๐ซ“ง๐ซ“ฏ๐ซ“ถ๐ซ“น๐ซ”๐ซ”Ž๐ซ”ถ๐ซ–ฎ๐ซ–ฏ๐ซ–ณ๐ซ—ง๐ซ—ด๐ซ˜œ๐ซ˜๐ซ˜ฆ๐ซ˜ง๐ซ˜จ๐ซ˜ช๐ซ˜ฌ๐ซš•๐ซš–๐ซšญ๐ซ›ญ๐ซžฉ๐ซŸ…๐ซŸฆ๐ซŸน๐ซŸผ๐ซ †๐ซ Š๐ซ œ๐ซขธ๐ซซ‡๐ซญŸ" + "๐ซญข๐ซญผ๐ซฎƒ๐ซฐ›๐ซตท๐ซถ‡๐ซทท๐ซธฉ๐ฌ€ฉ๐ฌ€ช๐ฌ‚ฉ๐ฌƒŠ๐ฌ‡•๐ฌ‡™๐ฌ‡น๐ฌ‰ผ๐ฌŠˆ๐ฌŠค๐ฌŒ—๐ฌ›๐ฌก๐ฌค๐ฌ’ˆ๐ฌ’”๐ฌ’—๐ฌ•‚๐ฌ˜“๐ฌ˜˜๐ฌ˜ก๐ฌ˜ฉ๐ฌ˜ซ๐ฌ˜ฌ๐ฌ˜ญ๐ฌ˜ฏ๐ฌ™‚๐ฌ™Š๐ฌ™‹๐ฌœฌ๐ฌœฏ๐ฌžŸ" + "๐ฌŸ๐ฌŸฝ๐ฌฃ™๐ฌฃž๐ฌฃก๐ฌฃณ๐ฌค‡๐ฌคŠ๐ฌค๐ฌจ‚๐ฌจŽ๐ฌฉฝ๐ฌชฉ๐ฌฌฉ๐ฌฌญ๐ฌฌฎ๐ฌฌฑ๐ฌฌธ๐ฌฌน๐ฌฌป๐ฌฌฟ๐ฌญ๐ฌญŠ๐ฌญŽ๐ฌญš๐ฌญ›๐ฌญค๐ฌญฉ๐ฌญฌ๐ฌญฏ๐ฌญณ๐ฌญถ๐ฌญธ๐ฌญผ๐ฌฎฑ๐ฌฎฟ๐ฌฏ€๐ฌฏŽ๐ฌฑ–๐ฌฑŸ" + "๐ฌณต๐ฌณถ๐ฌณฝ๐ฌณฟ๐ฌด‚๐ฌดƒ๐ฌดŠ๐ฌถ‹๐ฌถ๐ฌถ๐ฌถ๐ฌถŸ๐ฌถ ๐ฌถจ๐ฌถญ๐ฌถฎ๐ฌท•๐ฌธ˜๐ฌธš๐ฌธฃ๐ฌธฆ๐ฌธช๐ฌนผ๐ฌบˆ๐ฌบ“" +) +CN_CHARS_EXT = "ๅถ่ฏถๅฑŒๅ›ง้ฃšๅฑ„" + +CN_CHARS = CN_CHARS_COMMON + CN_CHARS_EXT +IN_CH_CHARS = {c: True for c in CN_CHARS} + +EN_CHARS = string.ascii_letters + string.digits +IN_EN_CHARS = {c: True for c in EN_CHARS} + +VALID_CHARS = CN_CHARS + EN_CHARS + " " +IN_VALID_CHARS = {c: True for c in VALID_CHARS} + + +# ================================================================================ # +# basic class +# ================================================================================ # +class ChineseChar(object): + """ + ไธญๆ–‡ๅญ—็ฌฆ + ๆฏไธชๅญ—็ฌฆๅฏนๅบ”็ฎ€ไฝ“ๅ’Œ็นไฝ“, + e.g. ็ฎ€ไฝ“ = '่ดŸ', ็นไฝ“ = '่ฒ ' + ่ฝฌๆขๆ—ถๅฏ่ฝฌๆขไธบ็ฎ€ไฝ“ๆˆ–็นไฝ“ + """ + + def __init__(self, simplified, traditional): + self.simplified = simplified + self.traditional = traditional + # self.__repr__ = self.__str__ + + def __str__(self): + return self.simplified or self.traditional or None + + def __repr__(self): + return self.__str__() + + +class ChineseNumberUnit(ChineseChar): + """ + ไธญๆ–‡ๆ•ฐๅญ—/ๆ•ฐไฝๅญ—็ฌฆ + ๆฏไธชๅญ—็ฌฆ้™ค็น็ฎ€ไฝ“ๅค–่ฟ˜ๆœ‰ไธ€ไธช้ขๅค–็š„ๅคงๅ†™ๅญ—็ฌฆ + e.g. 
'้™†' ๅ’Œ '้™ธ' + """ + + def __init__(self, power, simplified, traditional, big_s, big_t): + super(ChineseNumberUnit, self).__init__(simplified, traditional) + self.power = power + self.big_s = big_s + self.big_t = big_t + + def __str__(self): + return "10^{}".format(self.power) + + @classmethod + def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False): + if small_unit: + return ChineseNumberUnit(power=index + 1, simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1]) + elif numbering_type == NUMBERING_TYPES[0]: + return ChineseNumberUnit(power=index + 8, simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1]) + elif numbering_type == NUMBERING_TYPES[1]: + return ChineseNumberUnit(power=(index + 2) * 4, simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1]) + elif numbering_type == NUMBERING_TYPES[2]: + return ChineseNumberUnit(power=pow(2, index + 3), simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1]) + else: + raise ValueError("Counting type should be in {0} ({1} provided).".format(NUMBERING_TYPES, numbering_type)) + + +class ChineseNumberDigit(ChineseChar): + """ + ไธญๆ–‡ๆ•ฐๅญ—ๅญ—็ฌฆ + """ + + def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None): + super(ChineseNumberDigit, self).__init__(simplified, traditional) + self.value = value + self.big_s = big_s + self.big_t = big_t + self.alt_s = alt_s + self.alt_t = alt_t + + def __str__(self): + return str(self.value) + + @classmethod + def create(cls, i, v): + return ChineseNumberDigit(i, v[0], v[1], v[2], v[3]) + + +class ChineseMath(ChineseChar): + """ + ไธญๆ–‡ๆ•ฐไฝๅญ—็ฌฆ + """ + + def __init__(self, simplified, traditional, symbol, expression=None): + super(ChineseMath, self).__init__(simplified, traditional) + self.symbol = symbol + self.expression = expression + self.big_s = simplified + self.big_t = traditional + + +CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath + + +class NumberSystem(object): + """ + ไธญๆ–‡ๆ•ฐๅญ—็ณป็ปŸ + """ + + pass + + +class MathSymbol(object): + """ + ็”จไบŽไธญๆ–‡ๆ•ฐๅญ—็ณป็ปŸ็š„ๆ•ฐๅญฆ็ฌฆๅท (็น/็ฎ€ไฝ“), e.g. + positive = ['ๆญฃ', 'ๆญฃ'] + negative = ['่ดŸ', '่ฒ '] + point = ['็‚น', '้ปž'] + """ + + def __init__(self, positive, negative, point): + self.positive = positive + self.negative = negative + self.point = point + + def __iter__(self): + for v in self.__dict__.values(): + yield v + + +# class OtherSymbol(object): +# """ +# ๅ…ถไป–็ฌฆๅท +# """ +# +# def __init__(self, sil): +# self.sil = sil +# +# def __iter__(self): +# for v in self.__dict__.values(): +# yield v + + +# ================================================================================ # +# basic utils +# ================================================================================ # +def create_system(numbering_type=NUMBERING_TYPES[1]): + """ + ๆ นๆฎๆ•ฐๅญ—็ณป็ปŸ็ฑปๅž‹่ฟ”ๅ›žๅˆ›ๅปบ็›ธๅบ”็š„ๆ•ฐๅญ—็ณป็ปŸ๏ผŒ้ป˜่ฎคไธบ mid + NUMBERING_TYPES = ['low', 'mid', 'high']: ไธญๆ–‡ๆ•ฐๅญ—็ณป็ปŸ็ฑปๅž‹ + low: 'ๅ…†' = 'ไบฟ' * 'ๅ' = $10^{9}$, 'ไบฌ' = 'ๅ…†' * 'ๅ', etc. + mid: 'ๅ…†' = 'ไบฟ' * 'ไธ‡' = $10^{12}$, 'ไบฌ' = 'ๅ…†' * 'ไธ‡', etc. + high: 'ๅ…†' = 'ไบฟ' * 'ไบฟ' = $10^{16}$, 'ไบฌ' = 'ๅ…†' * 'ๅ…†', etc. 
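+    e.g. with the default mid system, chn2num('ไธ€ๅ…†ไบŒไบฟ') gives '1000200000000' (an illustrative value, worked out by hand)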
+    Returns the corresponding number system.
+    """
+
+    # chinese number units of 'ไบฟ' and larger
+    all_larger_units = zip(LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
+    larger_units = [CNU.create(i, v, numbering_type, False) for i, v in enumerate(all_larger_units)]
+    # chinese number units of 'ๅ, ็™พ, ๅƒ, ไธ‡'
+    all_smaller_units = zip(SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
+    smaller_units = [CNU.create(i, v, small_unit=True) for i, v in enumerate(all_smaller_units)]
+    # digits
+    chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS, BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
+    digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
+    digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
+    digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
+    digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
+
+    # symbols
+    positive_cn = CM(POSITIVE[0], POSITIVE[1], "+", lambda x: x)
+    negative_cn = CM(NEGATIVE[0], NEGATIVE[1], "-", lambda x: -x)
+    point_cn = CM(POINT[0], POINT[1], ".", lambda x, y: float(str(x) + "." + str(y)))
+    # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
+    system = NumberSystem()
+    system.units = smaller_units + larger_units
+    system.digits = digits
+    system.math = MathSymbol(positive_cn, negative_cn, point_cn)
+    # system.symbols = OtherSymbol(sil_cn)
+    return system
+
+
+def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
+    def get_symbol(char, system):
+        for u in system.units:
+            if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
+                return u
+        for d in system.digits:
+            if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
+                return d
+        for m in system.math:
+            if char in [m.traditional, m.simplified]:
+                return m
+
+    def string2symbols(chinese_string, system):
+        int_string, dec_string = chinese_string, ""
+        for p in [system.math.point.simplified, system.math.point.traditional]:
+            if p in chinese_string:
+                int_string, dec_string = chinese_string.split(p)
+                break
+        return [get_symbol(c, system) for c in int_string], [get_symbol(c, system) for c in dec_string]
+
+    def correct_symbols(integer_symbols, system):
+        """
+        ไธ€็™พๅ…ซ to ไธ€็™พๅ…ซๅ (make the elided final unit explicit)
+        ไธ€ไบฟไธ€ๅƒไธ‰็™พไธ‡ to ไธ€ไบฟ ไธ€ๅƒไธ‡ ไธ‰็™พไธ‡ (expand stacked units)
+        """
+
+        if integer_symbols and isinstance(integer_symbols[0], CNU):
+            if integer_symbols[0].power == 1:
+                integer_symbols = [system.digits[1]] + integer_symbols
+
+        if len(integer_symbols) > 1:
+            if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
+                integer_symbols.append(CNU(integer_symbols[-2].power - 1, None, None, None, None))
+
+        result = []
+        unit_count = 0
+        for s in integer_symbols:
+            if isinstance(s, CND):
+                result.append(s)
+                unit_count = 0
+            elif isinstance(s, CNU):
+                current_unit = CNU(s.power, None, None, None, None)
+                unit_count += 1
+
+            if unit_count == 1:
+                result.append(current_unit)
+            elif unit_count > 1:
+                for i in range(len(result)):
+                    if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
+                        result[-i - 1] = CNU(result[-i - 1].power + current_unit.power, None, None, None, None)
+        return result
+
+    def compute_value(integer_symbols):
+        """
+        Compute the value.
+        When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
+        e.g.
'ไธคๅƒไธ‡' = 2000 * 10000 not 2000 + 10000 + """ + value = [0] + last_power = 0 + for s in integer_symbols: + if isinstance(s, CND): + value[-1] = s.value + elif isinstance(s, CNU): + value[-1] *= pow(10, s.power) + if s.power > last_power: + value[:-1] = list(map(lambda v: v * pow(10, s.power), value[:-1])) + last_power = s.power + value.append(0) + return sum(value) + + system = create_system(numbering_type) + int_part, dec_part = string2symbols(chinese_string, system) + int_part = correct_symbols(int_part, system) + int_str = str(compute_value(int_part)) + dec_str = "".join([str(d.value) for d in dec_part]) + if dec_part: + return "{0}.{1}".format(int_str, dec_str) + else: + return int_str + + +def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False, traditional=False, alt_zero=False, alt_one=False, alt_two=True, use_zeros=True, use_units=True): + def get_value(value_string, use_zeros=True): + striped_string = value_string.lstrip("0") + + # record nothing if all zeros + if not striped_string: + return [] + + # record one digits + elif len(striped_string) == 1: + if use_zeros and len(value_string) != len(striped_string): + return [system.digits[0], system.digits[int(striped_string)]] + else: + return [system.digits[int(striped_string)]] + + # recursively record multiple digits + else: + result_unit = next(u for u in reversed(system.units) if u.power < len(striped_string)) + result_string = value_string[: -result_unit.power] + return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power :]) + + system = create_system(numbering_type) + + int_dec = number_string.split(".") + if len(int_dec) == 1: + int_string = int_dec[0] + dec_string = "" + elif len(int_dec) == 2: + int_string = int_dec[0] + dec_string = int_dec[1] + else: + raise ValueError("invalid input num string with more than one dot: {}".format(number_string)) + + if use_units and len(int_string) > 1: + result_symbols = get_value(int_string) + else: + result_symbols = [system.digits[int(c)] for c in int_string] + dec_symbols = [system.digits[int(c)] for c in dec_string] + if dec_string: + result_symbols += [system.math.point] + dec_symbols + + if alt_two: + liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t, system.digits[2].big_s, system.digits[2].big_t) + for i, v in enumerate(result_symbols): + if isinstance(v, CND) and v.value == 2: + next_symbol = result_symbols[i + 1] if i < len(result_symbols) - 1 else None + previous_symbol = result_symbols[i - 1] if i > 0 else None + if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))): + if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)): + result_symbols[i] = liang + + # if big is True, 'ไธค' will not be used and `alt_two` has no impact on output + if big: + attr_name = "big_" + if traditional: + attr_name += "t" + else: + attr_name += "s" + else: + if traditional: + attr_name = "traditional" + else: + attr_name = "simplified" + + result = "".join([getattr(s, attr_name) for s in result_symbols]) + + # if not use_zeros: + # result = result.strip(getattr(system.digits[0], attr_name)) + + if alt_zero: + result = result.replace(getattr(system.digits[0], attr_name), system.digits[0].alt_s) + + if alt_one: + result = result.replace(getattr(system.digits[1], attr_name), system.digits[1].alt_s) + + for i, p in enumerate(POINT): + if result.startswith(p): + return CHINESE_DIGIS[0] + result + + # ^10, 11, .., 19 + if len(result) >= 2 and result[1] in 
[SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0], SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]: + result = result[1:] + + return result + + +# ================================================================================ # +# different types of rewriters +# ================================================================================ # +class Cardinal: + """ + CARDINAL็ฑป + """ + + def __init__(self, cardinal=None, chntext=None): + self.cardinal = cardinal + self.chntext = chntext + + def chntext2cardinal(self): + return chn2num(self.chntext) + + def cardinal2chntext(self): + return num2chn(self.cardinal) + + +class Digit: + """ + DIGIT็ฑป + """ + + def __init__(self, digit=None, chntext=None): + self.digit = digit + self.chntext = chntext + + # def chntext2digit(self): + # return chn2num(self.chntext) + + def digit2chntext(self): + return num2chn(self.digit, alt_two=False, use_units=False) + + +class TelePhone: + """ + TELEPHONE็ฑป + """ + + def __init__(self, telephone=None, raw_chntext=None, chntext=None): + self.telephone = telephone + self.raw_chntext = raw_chntext + self.chntext = chntext + + # def chntext2telephone(self): + # sil_parts = self.raw_chntext.split('') + # self.telephone = '-'.join([ + # str(chn2num(p)) for p in sil_parts + # ]) + # return self.telephone + + def telephone2chntext(self, fixed=False): + if fixed: + sil_parts = self.telephone.split("-") + self.raw_chntext = "".join([num2chn(part, alt_two=False, use_units=False) for part in sil_parts]) + self.chntext = self.raw_chntext.replace("", "") + else: + sp_parts = self.telephone.strip("+").split() + self.raw_chntext = "".join([num2chn(part, alt_two=False, use_units=False) for part in sp_parts]) + self.chntext = self.raw_chntext.replace("", "") + return self.chntext + + +class Fraction: + """ + FRACTION็ฑป + """ + + def __init__(self, fraction=None, chntext=None): + self.fraction = fraction + self.chntext = chntext + + def chntext2fraction(self): + denominator, numerator = self.chntext.split("ๅˆ†ไน‹") + return chn2num(numerator) + "/" + chn2num(denominator) + + def fraction2chntext(self): + numerator, denominator = self.fraction.split("/") + return num2chn(denominator) + "ๅˆ†ไน‹" + num2chn(numerator) + + +class Date: + """ + DATE็ฑป + """ + + def __init__(self, date=None, chntext=None): + self.date = date + self.chntext = chntext + + # def chntext2date(self): + # chntext = self.chntext + # try: + # year, other = chntext.strip().split('ๅนด', maxsplit=1) + # year = Digit(chntext=year).digit2chntext() + 'ๅนด' + # except ValueError: + # other = chntext + # year = '' + # if other: + # try: + # month, day = other.strip().split('ๆœˆ', maxsplit=1) + # month = Cardinal(chntext=month).chntext2cardinal() + 'ๆœˆ' + # except ValueError: + # day = chntext + # month = '' + # if day: + # day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1] + # else: + # month = '' + # day = '' + # date = year + month + day + # self.date = date + # return self.date + + def date2chntext(self): + date = self.date + try: + year, other = date.strip().split("ๅนด", 1) + year = Digit(digit=year).digit2chntext() + "ๅนด" + except ValueError: + other = date + year = "" + if other: + try: + month, day = other.strip().split("ๆœˆ", 1) + month = Cardinal(cardinal=month).cardinal2chntext() + "ๆœˆ" + except ValueError: + day = date + month = "" + if day: + day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1] + else: + month = "" + day = "" + 
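# join whichever of year/month/day were recognized; unparsed parts stay empty strings +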
chntext = year + month + day + self.chntext = chntext + return self.chntext + + +class Money: + """ + MONEY็ฑป + """ + + def __init__(self, money=None, chntext=None): + self.money = money + self.chntext = chntext + + # def chntext2money(self): + # return self.money + + def money2chntext(self): + money = self.money + pattern = re.compile(r"(\d+(\.\d+)?)") + matchers = pattern.findall(money) + if matchers: + for matcher in matchers: + money = money.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext()) + self.chntext = money + return self.chntext + + +class Percentage: + """ + PERCENTAGE็ฑป + """ + + def __init__(self, percentage=None, chntext=None): + self.percentage = percentage + self.chntext = chntext + + def chntext2percentage(self): + return chn2num(self.chntext.strip().strip("็™พๅˆ†ไน‹")) + "%" + + def percentage2chntext(self): + return "็™พๅˆ†ไน‹" + num2chn(self.percentage.strip().strip("%")) + + +def normalize_nsw(raw_text): + text = "^" + raw_text + "$" + + # ่ง„่ŒƒๅŒ–ๆ—ฅๆœŸ + pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})ๅนด)?(\d{1,2}ๆœˆ(\d{1,2}[ๆ—ฅๅท])?)?)") + matchers = pattern.findall(text) + if matchers: + # print('date') + for matcher in matchers: + text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1) + + # ่ง„่ŒƒๅŒ–้‡‘้’ฑ + pattern = re.compile(r"\D+((\d+(\.\d+)?)[ๅคšไฝ™ๅ‡ ]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)") + matchers = pattern.findall(text) + if matchers: + # print('money') + for matcher in matchers: + text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1) + + # ่ง„่ŒƒๅŒ–ๅ›บ่ฏ/ๆ‰‹ๆœบๅท็  + # ๆ‰‹ๆœบ + # http://www.jihaoba.com/news/show/13680 + # ็งปๅŠจ๏ผš139ใ€138ใ€137ใ€136ใ€135ใ€134ใ€159ใ€158ใ€157ใ€150ใ€151ใ€152ใ€188ใ€187ใ€182ใ€183ใ€184ใ€178ใ€198 + # ่”้€š๏ผš130ใ€131ใ€132ใ€156ใ€155ใ€186ใ€185ใ€176 + # ็”ตไฟก๏ผš133ใ€153ใ€189ใ€180ใ€181ใ€177 + pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D") + matchers = pattern.findall(text) + if matchers: + # print('telephone') + for matcher in matchers: + text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1) + # ๅ›บ่ฏ + pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D") + matchers = pattern.findall(text) + if matchers: + # print('fixed telephone') + for matcher in matchers: + text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1) + + # ่ง„่ŒƒๅŒ–ๅˆ†ๆ•ฐ + pattern = re.compile(r"(\d+/\d+)") + matchers = pattern.findall(text) + if matchers: + # print('fraction') + for matcher in matchers: + text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1) + + # ่ง„่ŒƒๅŒ–็™พๅˆ†ๆ•ฐ + text = text.replace("๏ผ…", "%") + pattern = re.compile(r"(\d+(\.\d+)?%)") + matchers = pattern.findall(text) + if matchers: + # print('percentage') + for matcher in matchers: + text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1) + + # ่ง„่ŒƒๅŒ–็บฏๆ•ฐ+้‡่ฏ + pattern = re.compile(r"(\d+(\.\d+)?)[ๅคšไฝ™ๅ‡ ]?" 
+ COM_QUANTIFIERS) + matchers = pattern.findall(text) + if matchers: + # print('cardinal+quantifier') + for matcher in matchers: + text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1) + + # ่ง„่ŒƒๅŒ–ๆ•ฐๅญ—็ผ–ๅท + pattern = re.compile(r"(\d{4,32})") + matchers = pattern.findall(text) + if matchers: + # print('digit') + for matcher in matchers: + text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1) + + # ่ง„่ŒƒๅŒ–็บฏๆ•ฐ + pattern = re.compile(r"(\d+(\.\d+)?)") + matchers = pattern.findall(text) + if matchers: + # print('cardinal') + for matcher in matchers: + text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1) + + # restore P2P, O2O, B2C, B2B etc + pattern = re.compile(r"(([a-zA-Z]+)ไบŒ([a-zA-Z]+))") + matchers = pattern.findall(text) + if matchers: + # print('particular') + for matcher in matchers: + text = text.replace(matcher[0], matcher[1] + "2" + matcher[2], 1) + + return text.lstrip("^").rstrip("$") + + +def remove_erhua(text): + """ + ๅŽป้™คๅ„ฟๅŒ–้Ÿณ่ฏไธญ็š„ๅ„ฟ: + ไป–ๅฅณๅ„ฟๅœจ้‚ฃ่พนๅ„ฟ -> ไป–ๅฅณๅ„ฟๅœจ้‚ฃ่พน + """ + + new_str = "" + while re.search("ๅ„ฟ", text): + a = re.search("ๅ„ฟ", text).span() + remove_er_flag = 0 + + if ER_WHITELIST_PATTERN.search(text): + b = ER_WHITELIST_PATTERN.search(text).span() + if b[0] <= a[0]: + remove_er_flag = 1 + + if remove_er_flag == 0: + new_str = new_str + text[0 : a[0]] + text = text[a[1] :] + else: + new_str = new_str + text[0 : b[1]] + text = text[b[1] :] + + text = new_str + text + return text + + +def remove_space(text): + tokens = text.split() + new = [] + for k, t in enumerate(tokens): + if k != 0: + if IN_EN_CHARS.get(tokens[k - 1][-1]) and IN_EN_CHARS.get(t[0]): + new.append(" ") + new.append(t) + return "".join(new) + + +class TextNorm: + def __init__( + self, + to_banjiao: bool = False, + to_upper: bool = False, + to_lower: bool = False, + remove_fillers: bool = False, + remove_erhua: bool = False, + check_chars: bool = False, + remove_space: bool = False, + cc_mode: str = "", + ): + self.to_banjiao = to_banjiao + self.to_upper = to_upper + self.to_lower = to_lower + self.remove_fillers = remove_fillers + self.remove_erhua = remove_erhua + self.check_chars = check_chars + self.remove_space = remove_space + + self.cc = None + if cc_mode: + from opencc import OpenCC # Open Chinese Convert: pip install opencc + + self.cc = OpenCC(cc_mode) + + def __call__(self, text): + if self.cc: + text = self.cc.convert(text) + + if self.to_banjiao: + text = text.translate(QJ2BJ_TRANSFORM) + + if self.to_upper: + text = text.upper() + + if self.to_lower: + text = text.lower() + + if self.remove_fillers: + for c in FILLER_CHARS: + text = text.replace(c, "") + + if self.remove_erhua: + text = remove_erhua(text) + + text = normalize_nsw(text) + + text = text.translate(PUNCS_TRANSFORM) + + if self.check_chars: + for c in text: + if not IN_VALID_CHARS.get(c): + print(f"WARNING: illegal char {c} in: {text}", file=sys.stderr) + return "" + + if self.remove_space: + text = remove_space(text) + + return text + + +if __name__ == "__main__": + p = argparse.ArgumentParser() + + # normalizer options + p.add_argument("--to_banjiao", action="store_true", help="convert quanjiao chars to banjiao") + p.add_argument("--to_upper", action="store_true", help="convert to upper case") + p.add_argument("--to_lower", action="store_true", help="convert to lower case") + p.add_argument("--remove_fillers", action="store_true", help='remove filler chars such as "ๅ‘ƒ, ๅ•Š"') + 
p.add_argument("--remove_erhua", action="store_true", help='remove erhua chars such as "ไป–ๅฅณๅ„ฟๅœจ้‚ฃ่พนๅ„ฟ -> ไป–ๅฅณๅ„ฟๅœจ้‚ฃ่พน"') + p.add_argument("--check_chars", action="store_true", help="skip sentences containing illegal chars") + p.add_argument("--remove_space", action="store_true", help="remove whitespace") + p.add_argument("--cc_mode", choices=["", "t2s", "s2t"], default="", help="convert between traditional to simplified") + + # I/O options + p.add_argument("--log_interval", type=int, default=10000, help="log interval in number of processed lines") + p.add_argument("--has_key", action="store_true", help="will be deprecated, set --format ark instead") + p.add_argument("--format", type=str, choices=["txt", "ark", "tsv"], default="txt", help="input format") + p.add_argument("ifile", help="input filename, assume utf-8 encoding") + p.add_argument("ofile", help="output filename") + + args = p.parse_args() + + if args.has_key: + args.format = "ark" + + normalizer = TextNorm( + to_banjiao=args.to_banjiao, + to_upper=args.to_upper, + to_lower=args.to_lower, + remove_fillers=args.remove_fillers, + remove_erhua=args.remove_erhua, + check_chars=args.check_chars, + remove_space=args.remove_space, + cc_mode=args.cc_mode, + ) + + ndone = 0 + with open(args.ifile, "r", encoding="utf8") as istream, open(args.ofile, "w+", encoding="utf8") as ostream: + if args.format == "tsv": + reader = csv.DictReader(istream, delimiter="\t") + assert "TEXT" in reader.fieldnames + print("\t".join(reader.fieldnames), file=ostream) + + for item in reader: + text = item["TEXT"] + + if text: + text = normalizer(text) + + if text: + item["TEXT"] = text + print("\t".join([item[f] for f in reader.fieldnames]), file=ostream) + + ndone += 1 + if ndone % args.log_interval == 0: + print(f"text norm: {ndone} lines done.", file=sys.stderr, flush=True) + else: + for l in istream: + key, text = "", "" + if args.format == "ark": # KALDI archive, line format: "key text" + cols = l.strip().split(maxsplit=1) + key, text = cols[0], cols[1] if len(cols) == 2 else "" + else: + text = l.strip() + + if text: + text = normalizer(text) + + if text: + if args.format == "ark": + print(key + "\t" + text, file=ostream) + else: + print(text, file=ostream) + + ndone += 1 + if ndone % args.log_interval == 0: + print(f"text norm: {ndone} lines done.", file=sys.stderr, flush=True) + print(f"text norm: {ndone} lines done in total.", file=sys.stderr, flush=True) diff --git a/lmms_eval/tasks/librispeech/librispeech.yaml b/lmms_eval/tasks/librispeech/librispeech.yaml new file mode 100755 index 00000000..dd6be3bc --- /dev/null +++ b/lmms_eval/tasks/librispeech/librispeech.yaml @@ -0,0 +1,6 @@ +group: librispeech +task: + - librispeech_dev_clean + - librispeech_dev_other + - librispeech_test_clean + - librispeech_test_other diff --git a/lmms_eval/tasks/librispeech/librispeech_dev_clean.yaml b/lmms_eval/tasks/librispeech/librispeech_dev_clean.yaml new file mode 100644 index 00000000..237d955e --- /dev/null +++ b/lmms_eval/tasks/librispeech/librispeech_dev_clean.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/librispeech +dataset_kwargs: + token: True +task : "librispeech_dev_clean" +test_split: librispeech_dev_clean +dataset_name: librispeech_dev_clean +output_type: generate_until +doc_to_visual: !function utils.librispeech_doc_to_audio +doc_to_text: !function utils.librispeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function 
utils.librispeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.librispeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/librispeech/librispeech_dev_other.yaml b/lmms_eval/tasks/librispeech/librispeech_dev_other.yaml new file mode 100644 index 00000000..c4426db5 --- /dev/null +++ b/lmms_eval/tasks/librispeech/librispeech_dev_other.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/librispeech +dataset_kwargs: + token: True +task : "librispeech_dev_other" +test_split: librispeech_dev_other +dataset_name: librispeech_dev_other +output_type: generate_until +doc_to_visual: !function utils.librispeech_doc_to_audio +doc_to_text: !function utils.librispeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.librispeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.librispeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/librispeech/librispeech_test_clean.yaml b/lmms_eval/tasks/librispeech/librispeech_test_clean.yaml new file mode 100644 index 00000000..bf1c680e --- /dev/null +++ b/lmms_eval/tasks/librispeech/librispeech_test_clean.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/librispeech +dataset_kwargs: + token: True +task : "librispeech_test_clean" +test_split: librispeech_test_clean +dataset_name: librispeech_test_clean +output_type: generate_until +doc_to_visual: !function utils.librispeech_doc_to_audio +doc_to_text: !function utils.librispeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.librispeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.librispeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/librispeech/librispeech_test_other.yaml b/lmms_eval/tasks/librispeech/librispeech_test_other.yaml new file mode 100644 index 00000000..ccac2ec0 --- /dev/null +++ b/lmms_eval/tasks/librispeech/librispeech_test_other.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/librispeech +dataset_kwargs: + token: True +task : "librispeech_test_other" +test_split: librispeech_test_other +dataset_name: librispeech_test_other +output_type: generate_until +doc_to_visual: !function utils.librispeech_doc_to_audio +doc_to_text: !function utils.librispeech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.librispeech_process_result +metric_list: + - metric: wer + aggregation : !function utils.librispeech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/librispeech/utils.py 
b/lmms_eval/tasks/librispeech/utils.py
new file mode 100755
index 00000000..dfe4109b
--- /dev/null
+++ b/lmms_eval/tasks/librispeech/utils.py
@@ -0,0 +1,217 @@
+import os
+import re
+import unicodedata
+
+import editdistance as ed  # new dependency: pip install editdistance
+import zhconv  # new dependency: pip install zhconv
+
+from lmms_eval.tasks.librispeech.cn_tn import TextNorm
+from lmms_eval.tasks.librispeech.whisper_normalizer.basic import BasicTextNormalizer
+from lmms_eval.tasks.librispeech.whisper_normalizer.english import EnglishTextNormalizer
+
+# Note: decoding the audio files additionally requires the 'librosa' and 'soundfile' packages.
+english_normalizer = EnglishTextNormalizer()
+chinese_normalizer = TextNorm(
+    to_banjiao=False,
+    to_upper=False,
+    to_lower=False,
+    remove_fillers=False,
+    remove_erhua=False,
+    check_chars=False,
+    remove_space=False,
+    cc_mode="",
+)
+basic_normalizer = BasicTextNormalizer()
+
+dir_name = os.path.dirname(os.path.abspath(__file__))
+
+
+def librispeech_doc_to_audio(doc):
+    return [doc["audio"]]
+
+
+def librispeech_doc_to_text(doc, lmms_eval_specific_kwargs):
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}"
+
+
+def librispeech_process_result(doc, result):
+    pred = result[0] if len(result) > 0 else ""
+
+    gt = doc["gt"]
+    source = doc["source"]
+    task = doc["task"]
+
+    data_dict = {"gt": gt, "pred": pred, "source": source, "task": task}
+
+    return {"wer": data_dict}
+
+
+PUNCS = "!,.?;:"
+
+
+def remove_sp(text, language):
+    gt = re.sub(r"<\|.*?\|>", " ", text)
+    gt = re.sub(r"\s+", r" ", gt)  # Replace consecutive spaces in the text with a single space.
+    gt = re.sub(f" ?([{PUNCS}])", r"\1", gt)
+    gt = gt.lstrip(" ")
+    if language == "zh":
+        gt = re.sub(r"\s+", r"", gt)
+    return gt
+
+
+class EvaluationTokenizer(object):
+    """A generic evaluation-time tokenizer, which leverages built-in tokenizers
+    in sacreBLEU (https://github.com/mjpost/sacrebleu). It additionally provides
+    lowercasing, punctuation removal and character tokenization, which are
+    applied after sacreBLEU tokenization.
+
+    Args:
+        tokenizer_type (str): the type of sacreBLEU tokenizer to apply.
+        lowercase (bool): lowercase the text.
+        punctuation_removal (bool): remove punctuation (based on unicode
+            category) from text.
+        character_tokenization (bool): tokenize the text to characters.
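+
+    Example (an illustrative call; "none" applies no sacreBLEU tokenization):
+        EvaluationTokenizer(tokenizer_type="none", lowercase=True).tokenize("Hello World") -> "hello world"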
+ """ + + SPACE = chr(32) + SPACE_ESCAPE = chr(9601) + # ALL_TOKENIZER_TYPES = ChoiceEnum(["none", "13a", "intl", "zh", "ja-mecab"]) + + def __init__( + self, + tokenizer_type: str = "13a", + lowercase: bool = False, + punctuation_removal: bool = False, + character_tokenization: bool = False, + ): + # from sacrebleu.tokenizers import TOKENIZERS + # from sacrebleu.tokenizers import tokenizer_none + from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a + from sacrebleu.tokenizers.tokenizer_char import TokenizerChar + from sacrebleu.tokenizers.tokenizer_intl import TokenizerV14International + from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab + from sacrebleu.tokenizers.tokenizer_none import NoneTokenizer + from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh + + TOKENIZERS = { + "none": NoneTokenizer, + "13a": Tokenizer13a, + "intl": TokenizerV14International, + "zh": TokenizerZh, + "ja-mecab": TokenizerJaMecab, + "char": TokenizerChar, + } + + assert tokenizer_type in TOKENIZERS, f"{tokenizer_type}, {TOKENIZERS}" + self.lowercase = lowercase + self.punctuation_removal = punctuation_removal + self.character_tokenization = character_tokenization + self.tokenizer = TOKENIZERS[tokenizer_type] + # self.tokenizer = tokenizer_none + + @classmethod + def remove_punctuation(cls, sent: str): + """Remove punctuation based on Unicode category.""" + return cls.SPACE.join(t for t in sent.split(cls.SPACE) if not all(unicodedata.category(c)[0] == "P" for c in t)) + + def tokenize(self, sent: str): + tokenized = self.tokenizer()(sent) + + if self.punctuation_removal: + tokenized = self.remove_punctuation(tokenized) + + if self.character_tokenization: + tokenized = self.SPACE.join(list(tokenized.replace(self.SPACE, self.SPACE_ESCAPE))) + + if self.lowercase: + tokenized = tokenized.lower() + + return tokenized + + +def compute_wer(refs, hyps, language): + distance = 0 + ref_length = 0 + tokenizer = EvaluationTokenizer( + tokenizer_type="none", + lowercase=True, + punctuation_removal=True, + character_tokenization=False, + ) + for i in range(len(refs)): + ref = refs[i] + pred = hyps[i] + if language in ["yue"]: + ref = zhconv.convert(ref, "zh-cn") + pred = zhconv.convert(pred, "zh-cn") + if language in ["en"]: + ref = english_normalizer(ref) + pred = english_normalizer(pred) + if language in ["zh"]: + ref = chinese_normalizer(ref) + pred = chinese_normalizer(pred) + else: + ref = basic_normalizer(ref) + pred = basic_normalizer(pred) + ref_items = tokenizer.tokenize(ref).split() + pred_items = tokenizer.tokenize(pred).split() + if language in ["zh", "yue"]: + ref_items = [x for x in "".join(ref_items)] + pred_items = [x for x in "".join(pred_items)] + if i == 0: + print(f"ref: {ref}") + print(f"pred: {pred}") + print(f"ref_items:\n{ref_items}\n{len(ref_items)}\n{ref_items[0]}") + print(f"pred_items:\n{pred_items}\n{len(ref_items)}\n{ref_items[0]}") + distance += ed.eval(ref_items, pred_items) + ref_length += len(ref_items) + return distance / ref_length + + +def librispeech_wer(results, args): + # lan = args["language"] + refs, hyps = [], [] + # results_list = results_dict[source] + for result in results: + lan = result["task"][4:] + gt = result["gt"] + response = result["pred"] + gt = remove_sp(gt, lan) + response = remove_sp(response, lan) + refs.append(gt) + hyps.append(response) + wer = compute_wer(refs, hyps, lan) + # print(f"source: {source} cnt: {len(refs)} wer: {wer:.4f}") + return wer * 100 + + # for gt, response, source, audio_path in zip(merged_gts, merged_responses, 
diff --git a/lmms_eval/tasks/librispeech/whisper_normalizer/basic.py b/lmms_eval/tasks/librispeech/whisper_normalizer/basic.py
new file mode 100644
index 00000000..00a54dcc
--- /dev/null
+++ b/lmms_eval/tasks/librispeech/whisper_normalizer/basic.py
@@ -0,0 +1,58 @@
+import re
+import unicodedata
+
+import regex
+
+# non-ASCII letters that are not separated by "NFKD" normalization
+ADDITIONAL_DIACRITICS = {
+    "ล“": "oe",
+    "ล’": "OE",
+    "รธ": "o",
+    "ร˜": "O",
+    "รฆ": "ae",
+    "ร†": "AE",
+    "รŸ": "ss",
+    "แบž": "SS",
+    "ฤ‘": "d",
+    "ฤ": "D",
+    "รฐ": "d",
+    "ร": "D",
+    "รพ": "th",
+    "รž": "th",
+    "ล‚": "l",
+    "ล": "L",
+}
+
+
+def remove_symbols_and_diacritics(s: str, keep=""):
+    """
+    Replace any other markers, symbols, and punctuations with a space,
+    and drop any diacritics (category 'Mn' and some manual mappings)
+    """
+    return "".join(c if c in keep else ADDITIONAL_DIACRITICS[c] if c in ADDITIONAL_DIACRITICS else "" if unicodedata.category(c) == "Mn" else " " if unicodedata.category(c)[0] in "MSP" else c for c in unicodedata.normalize("NFKD", s))
+
+
+def remove_symbols(s: str):
+    """
+    Replace any other markers, symbols, punctuations with a space, keeping diacritics
+    """
+    return "".join(" " if unicodedata.category(c)[0] in "MSP" else c for c in unicodedata.normalize("NFKC", s))
+
+
+class BasicTextNormalizer:
+    def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
+        self.clean = remove_symbols_and_diacritics if remove_diacritics else remove_symbols
+        self.split_letters = split_letters
+
+    def __call__(self, s: str):
+        s = s.lower()
+        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)  # remove words between brackets
+        s = re.sub(r"\(([^)]+?)\)", "", s)  # remove words between parenthesis
+        s = self.clean(s).lower()
+
+        if self.split_letters:
+            s = " ".join(regex.findall(r"\X", s, regex.U))
+
+        s = re.sub(r"\s+", " ", s)  # replace any successive whitespace characters with a space
+
+        return s
diff --git a/lmms_eval/tasks/librispeech/whisper_normalizer/english.json b/lmms_eval/tasks/librispeech/whisper_normalizer/english.json
new file mode 100644
index 00000000..566e4812
--- /dev/null
+++ b/lmms_eval/tasks/librispeech/whisper_normalizer/english.json
@@ -0,0 +1,1741 @@
+{
+    "accessorise": "accessorize",
+    "accessorised": "accessorized",
+    "accessorises": "accessorizes",
+    "accessorising": "accessorizing",
+    "acclimatisation": "acclimatization",
+    "acclimatise": "acclimatize",
+    "acclimatised": "acclimatized",
+    "acclimatises": "acclimatizes",
+    "acclimatising": "acclimatizing",
+    "accoutrements": "accouterments",
+    "aeon": "eon",
+    "aeons": "eons",
+    "aerogramme": "aerogram",
+    "aerogrammes": "aerograms",
+ "aeroplane": "airplane", + "aeroplanes": "airplanes", + "aesthete": "esthete", + "aesthetes": "esthetes", + "aesthetic": "esthetic", + "aesthetically": "esthetically", + "aesthetics": "esthetics", + "aetiology": "etiology", + "ageing": "aging", + "aggrandisement": "aggrandizement", + "agonise": "agonize", + "agonised": "agonized", + "agonises": "agonizes", + "agonising": "agonizing", + "agonisingly": "agonizingly", + "almanack": "almanac", + "almanacks": "almanacs", + "aluminium": "aluminum", + "amortisable": "amortizable", + "amortisation": "amortization", + "amortisations": "amortizations", + "amortise": "amortize", + "amortised": "amortized", + "amortises": "amortizes", + "amortising": "amortizing", + "amphitheatre": "amphitheater", + "amphitheatres": "amphitheaters", + "anaemia": "anemia", + "anaemic": "anemic", + "anaesthesia": "anesthesia", + "anaesthetic": "anesthetic", + "anaesthetics": "anesthetics", + "anaesthetise": "anesthetize", + "anaesthetised": "anesthetized", + "anaesthetises": "anesthetizes", + "anaesthetising": "anesthetizing", + "anaesthetist": "anesthetist", + "anaesthetists": "anesthetists", + "anaesthetize": "anesthetize", + "anaesthetized": "anesthetized", + "anaesthetizes": "anesthetizes", + "anaesthetizing": "anesthetizing", + "analogue": "analog", + "analogues": "analogs", + "analyse": "analyze", + "analysed": "analyzed", + "analyses": "analyzes", + "analysing": "analyzing", + "anglicise": "anglicize", + "anglicised": "anglicized", + "anglicises": "anglicizes", + "anglicising": "anglicizing", + "annualised": "annualized", + "antagonise": "antagonize", + "antagonised": "antagonized", + "antagonises": "antagonizes", + "antagonising": "antagonizing", + "apologise": "apologize", + "apologised": "apologized", + "apologises": "apologizes", + "apologising": "apologizing", + "appal": "appall", + "appals": "appalls", + "appetiser": "appetizer", + "appetisers": "appetizers", + "appetising": "appetizing", + "appetisingly": "appetizingly", + "arbour": "arbor", + "arbours": "arbors", + "archeological": "archaeological", + "archaeologically": "archeologically", + "archaeologist": "archeologist", + "archaeologists": "archeologists", + "archaeology": "archeology", + "ardour": "ardor", + "armour": "armor", + "armoured": "armored", + "armourer": "armorer", + "armourers": "armorers", + "armouries": "armories", + "armoury": "armory", + "artefact": "artifact", + "artefacts": "artifacts", + "authorise": "authorize", + "authorised": "authorized", + "authorises": "authorizes", + "authorising": "authorizing", + "axe": "ax", + "backpedalled": "backpedaled", + "backpedalling": "backpedaling", + "bannister": "banister", + "bannisters": "banisters", + "baptise": "baptize", + "baptised": "baptized", + "baptises": "baptizes", + "baptising": "baptizing", + "bastardise": "bastardize", + "bastardised": "bastardized", + "bastardises": "bastardizes", + "bastardising": "bastardizing", + "battleax": "battleaxe", + "baulk": "balk", + "baulked": "balked", + "baulking": "balking", + "baulks": "balks", + "bedevilled": "bedeviled", + "bedevilling": "bedeviling", + "behaviour": "behavior", + "behavioural": "behavioral", + "behaviourism": "behaviorism", + "behaviourist": "behaviorist", + "behaviourists": "behaviorists", + "behaviours": "behaviors", + "behove": "behoove", + "behoved": "behooved", + "behoves": "behooves", + "bejewelled": "bejeweled", + "belabour": "belabor", + "belaboured": "belabored", + "belabouring": "belaboring", + "belabours": "belabors", + "bevelled": "beveled", + "bevvies": "bevies", 
+ "bevvy": "bevy", + "biassed": "biased", + "biassing": "biasing", + "bingeing": "binging", + "bougainvillaea": "bougainvillea", + "bougainvillaeas": "bougainvilleas", + "bowdlerise": "bowdlerize", + "bowdlerised": "bowdlerized", + "bowdlerises": "bowdlerizes", + "bowdlerising": "bowdlerizing", + "breathalyse": "breathalyze", + "breathalysed": "breathalyzed", + "breathalyser": "breathalyzer", + "breathalysers": "breathalyzers", + "breathalyses": "breathalyzes", + "breathalysing": "breathalyzing", + "brutalise": "brutalize", + "brutalised": "brutalized", + "brutalises": "brutalizes", + "brutalising": "brutalizing", + "busses": "buses", + "bussing": "busing", + "caesarean": "cesarean", + "caesareans": "cesareans", + "calibre": "caliber", + "calibres": "calibers", + "calliper": "caliper", + "callipers": "calipers", + "callisthenics": "calisthenics", + "canalise": "canalize", + "canalised": "canalized", + "canalises": "canalizes", + "canalising": "canalizing", + "cancelation": "cancellation", + "cancelations": "cancellations", + "cancelled": "canceled", + "cancelling": "canceling", + "candour": "candor", + "cannibalise": "cannibalize", + "cannibalised": "cannibalized", + "cannibalises": "cannibalizes", + "cannibalising": "cannibalizing", + "canonise": "canonize", + "canonised": "canonized", + "canonises": "canonizes", + "canonising": "canonizing", + "capitalise": "capitalize", + "capitalised": "capitalized", + "capitalises": "capitalizes", + "capitalising": "capitalizing", + "caramelise": "caramelize", + "caramelised": "caramelized", + "caramelises": "caramelizes", + "caramelising": "caramelizing", + "carbonise": "carbonize", + "carbonised": "carbonized", + "carbonises": "carbonizes", + "carbonising": "carbonizing", + "carolled": "caroled", + "carolling": "caroling", + "catalogue": "catalog", + "catalogued": "cataloged", + "catalogues": "catalogs", + "cataloguing": "cataloging", + "catalyse": "catalyze", + "catalysed": "catalyzed", + "catalyses": "catalyzes", + "catalysing": "catalyzing", + "categorise": "categorize", + "categorised": "categorized", + "categorises": "categorizes", + "categorising": "categorizing", + "cauterise": "cauterize", + "cauterised": "cauterized", + "cauterises": "cauterizes", + "cauterising": "cauterizing", + "cavilled": "caviled", + "cavilling": "caviling", + "centigramme": "centigram", + "centigrammes": "centigrams", + "centilitre": "centiliter", + "centilitres": "centiliters", + "centimetre": "centimeter", + "centimetres": "centimeters", + "centralise": "centralize", + "centralised": "centralized", + "centralises": "centralizes", + "centralising": "centralizing", + "centre": "center", + "centred": "centered", + "centrefold": "centerfold", + "centrefolds": "centerfolds", + "centrepiece": "centerpiece", + "centrepieces": "centerpieces", + "centres": "centers", + "channelled": "channeled", + "channelling": "channeling", + "characterise": "characterize", + "characterised": "characterized", + "characterises": "characterizes", + "characterising": "characterizing", + "cheque": "check", + "chequebook": "checkbook", + "chequebooks": "checkbooks", + "chequered": "checkered", + "cheques": "checks", + "chilli": "chili", + "chimaera": "chimera", + "chimaeras": "chimeras", + "chiselled": "chiseled", + "chiselling": "chiseling", + "circularise": "circularize", + "circularised": "circularized", + "circularises": "circularizes", + "circularising": "circularizing", + "civilise": "civilize", + "civilised": "civilized", + "civilises": "civilizes", + "civilising": "civilizing", + 
"clamour": "clamor", + "clamoured": "clamored", + "clamouring": "clamoring", + "clamours": "clamors", + "clangour": "clangor", + "clarinettist": "clarinetist", + "clarinettists": "clarinetists", + "collectivise": "collectivize", + "collectivised": "collectivized", + "collectivises": "collectivizes", + "collectivising": "collectivizing", + "colonisation": "colonization", + "colonise": "colonize", + "colonised": "colonized", + "coloniser": "colonizer", + "colonisers": "colonizers", + "colonises": "colonizes", + "colonising": "colonizing", + "colour": "color", + "colourant": "colorant", + "colourants": "colorants", + "coloured": "colored", + "coloureds": "coloreds", + "colourful": "colorful", + "colourfully": "colorfully", + "colouring": "coloring", + "colourize": "colorize", + "colourized": "colorized", + "colourizes": "colorizes", + "colourizing": "colorizing", + "colourless": "colorless", + "colours": "colors", + "commercialise": "commercialize", + "commercialised": "commercialized", + "commercialises": "commercializes", + "commercialising": "commercializing", + "compartmentalise": "compartmentalize", + "compartmentalised": "compartmentalized", + "compartmentalises": "compartmentalizes", + "compartmentalising": "compartmentalizing", + "computerise": "computerize", + "computerised": "computerized", + "computerises": "computerizes", + "computerising": "computerizing", + "conceptualise": "conceptualize", + "conceptualised": "conceptualized", + "conceptualises": "conceptualizes", + "conceptualising": "conceptualizing", + "connexion": "connection", + "connexions": "connections", + "contextualise": "contextualize", + "contextualised": "contextualized", + "contextualises": "contextualizes", + "contextualising": "contextualizing", + "cosier": "cozier", + "cosies": "cozies", + "cosiest": "coziest", + "cosily": "cozily", + "cosiness": "coziness", + "cosy": "cozy", + "councillor": "councilor", + "councillors": "councilors", + "counselled": "counseled", + "counselling": "counseling", + "counsellor": "counselor", + "counsellors": "counselors", + "crenelated": "crenellated", + "criminalise": "criminalize", + "criminalised": "criminalized", + "criminalises": "criminalizes", + "criminalising": "criminalizing", + "criticise": "criticize", + "criticised": "criticized", + "criticises": "criticizes", + "criticising": "criticizing", + "crueller": "crueler", + "cruellest": "cruelest", + "crystallisation": "crystallization", + "crystallise": "crystallize", + "crystallised": "crystallized", + "crystallises": "crystallizes", + "crystallising": "crystallizing", + "cudgelled": "cudgeled", + "cudgelling": "cudgeling", + "customise": "customize", + "customised": "customized", + "customises": "customizes", + "customising": "customizing", + "cypher": "cipher", + "cyphers": "ciphers", + "decentralisation": "decentralization", + "decentralise": "decentralize", + "decentralised": "decentralized", + "decentralises": "decentralizes", + "decentralising": "decentralizing", + "decriminalisation": "decriminalization", + "decriminalise": "decriminalize", + "decriminalised": "decriminalized", + "decriminalises": "decriminalizes", + "decriminalising": "decriminalizing", + "defence": "defense", + "defenceless": "defenseless", + "defences": "defenses", + "dehumanisation": "dehumanization", + "dehumanise": "dehumanize", + "dehumanised": "dehumanized", + "dehumanises": "dehumanizes", + "dehumanising": "dehumanizing", + "demeanour": "demeanor", + "demilitarisation": "demilitarization", + "demilitarise": "demilitarize", + 
"demilitarised": "demilitarized", + "demilitarises": "demilitarizes", + "demilitarising": "demilitarizing", + "demobilisation": "demobilization", + "demobilise": "demobilize", + "demobilised": "demobilized", + "demobilises": "demobilizes", + "demobilising": "demobilizing", + "democratisation": "democratization", + "democratise": "democratize", + "democratised": "democratized", + "democratises": "democratizes", + "democratising": "democratizing", + "demonise": "demonize", + "demonised": "demonized", + "demonises": "demonizes", + "demonising": "demonizing", + "demoralisation": "demoralization", + "demoralise": "demoralize", + "demoralised": "demoralized", + "demoralises": "demoralizes", + "demoralising": "demoralizing", + "denationalisation": "denationalization", + "denationalise": "denationalize", + "denationalised": "denationalized", + "denationalises": "denationalizes", + "denationalising": "denationalizing", + "deodorise": "deodorize", + "deodorised": "deodorized", + "deodorises": "deodorizes", + "deodorising": "deodorizing", + "depersonalise": "depersonalize", + "depersonalised": "depersonalized", + "depersonalises": "depersonalizes", + "depersonalising": "depersonalizing", + "deputise": "deputize", + "deputised": "deputized", + "deputises": "deputizes", + "deputising": "deputizing", + "desensitisation": "desensitization", + "desensitise": "desensitize", + "desensitised": "desensitized", + "desensitises": "desensitizes", + "desensitising": "desensitizing", + "destabilisation": "destabilization", + "destabilise": "destabilize", + "destabilised": "destabilized", + "destabilises": "destabilizes", + "destabilising": "destabilizing", + "dialled": "dialed", + "dialling": "dialing", + "dialogue": "dialog", + "dialogues": "dialogs", + "diarrhoea": "diarrhea", + "digitise": "digitize", + "digitised": "digitized", + "digitises": "digitizes", + "digitising": "digitizing", + "disc": "disk", + "discolour": "discolor", + "discoloured": "discolored", + "discolouring": "discoloring", + "discolours": "discolors", + "discs": "disks", + "disembowelled": "disemboweled", + "disembowelling": "disemboweling", + "disfavour": "disfavor", + "dishevelled": "disheveled", + "dishonour": "dishonor", + "dishonourable": "dishonorable", + "dishonourably": "dishonorably", + "dishonoured": "dishonored", + "dishonouring": "dishonoring", + "dishonours": "dishonors", + "disorganisation": "disorganization", + "disorganised": "disorganized", + "distil": "distill", + "distils": "distills", + "dramatisation": "dramatization", + "dramatisations": "dramatizations", + "dramatise": "dramatize", + "dramatised": "dramatized", + "dramatises": "dramatizes", + "dramatising": "dramatizing", + "draught": "draft", + "draughtboard": "draftboard", + "draughtboards": "draftboards", + "draughtier": "draftier", + "draughtiest": "draftiest", + "draughts": "drafts", + "draughtsman": "draftsman", + "draughtsmanship": "draftsmanship", + "draughtsmen": "draftsmen", + "draughtswoman": "draftswoman", + "draughtswomen": "draftswomen", + "draughty": "drafty", + "drivelled": "driveled", + "drivelling": "driveling", + "duelled": "dueled", + "duelling": "dueling", + "economise": "economize", + "economised": "economized", + "economises": "economizes", + "economising": "economizing", + "edoema": "edema", + "editorialise": "editorialize", + "editorialised": "editorialized", + "editorialises": "editorializes", + "editorialising": "editorializing", + "empathise": "empathize", + "empathised": "empathized", + "empathises": "empathizes", + "empathising": 
"empathizing", + "emphasise": "emphasize", + "emphasised": "emphasized", + "emphasises": "emphasizes", + "emphasising": "emphasizing", + "enamelled": "enameled", + "enamelling": "enameling", + "enamoured": "enamored", + "encyclopaedia": "encyclopedia", + "encyclopaedias": "encyclopedias", + "encyclopaedic": "encyclopedic", + "endeavour": "endeavor", + "endeavoured": "endeavored", + "endeavouring": "endeavoring", + "endeavours": "endeavors", + "energise": "energize", + "energised": "energized", + "energises": "energizes", + "energising": "energizing", + "enrol": "enroll", + "enrols": "enrolls", + "enthral": "enthrall", + "enthrals": "enthralls", + "epaulette": "epaulet", + "epaulettes": "epaulets", + "epicentre": "epicenter", + "epicentres": "epicenters", + "epilogue": "epilog", + "epilogues": "epilogs", + "epitomise": "epitomize", + "epitomised": "epitomized", + "epitomises": "epitomizes", + "epitomising": "epitomizing", + "equalisation": "equalization", + "equalise": "equalize", + "equalised": "equalized", + "equaliser": "equalizer", + "equalisers": "equalizers", + "equalises": "equalizes", + "equalising": "equalizing", + "eulogise": "eulogize", + "eulogised": "eulogized", + "eulogises": "eulogizes", + "eulogising": "eulogizing", + "evangelise": "evangelize", + "evangelised": "evangelized", + "evangelises": "evangelizes", + "evangelising": "evangelizing", + "exorcise": "exorcize", + "exorcised": "exorcized", + "exorcises": "exorcizes", + "exorcising": "exorcizing", + "extemporisation": "extemporization", + "extemporise": "extemporize", + "extemporised": "extemporized", + "extemporises": "extemporizes", + "extemporising": "extemporizing", + "externalisation": "externalization", + "externalisations": "externalizations", + "externalise": "externalize", + "externalised": "externalized", + "externalises": "externalizes", + "externalising": "externalizing", + "factorise": "factorize", + "factorised": "factorized", + "factorises": "factorizes", + "factorising": "factorizing", + "faecal": "fecal", + "faeces": "feces", + "familiarisation": "familiarization", + "familiarise": "familiarize", + "familiarised": "familiarized", + "familiarises": "familiarizes", + "familiarising": "familiarizing", + "fantasise": "fantasize", + "fantasised": "fantasized", + "fantasises": "fantasizes", + "fantasising": "fantasizing", + "favour": "favor", + "favourable": "favorable", + "favourably": "favorably", + "favoured": "favored", + "favouring": "favoring", + "favourite": "favorite", + "favourites": "favorites", + "favouritism": "favoritism", + "favours": "favors", + "feminise": "feminize", + "feminised": "feminized", + "feminises": "feminizes", + "feminising": "feminizing", + "fertilisation": "fertilization", + "fertilise": "fertilize", + "fertilised": "fertilized", + "fertiliser": "fertilizer", + "fertilisers": "fertilizers", + "fertilises": "fertilizes", + "fertilising": "fertilizing", + "fervour": "fervor", + "fibre": "fiber", + "fibreglass": "fiberglass", + "fibres": "fibers", + "fictionalisation": "fictionalization", + "fictionalisations": "fictionalizations", + "fictionalise": "fictionalize", + "fictionalised": "fictionalized", + "fictionalises": "fictionalizes", + "fictionalising": "fictionalizing", + "fillet": "filet", + "filleted": "fileted", + "filleting": "fileting", + "fillets": "filets", + "finalisation": "finalization", + "finalise": "finalize", + "finalised": "finalized", + "finalises": "finalizes", + "finalising": "finalizing", + "flautist": "flutist", + "flautists": "flutists", + "flavour": 
"flavor", + "flavoured": "flavored", + "flavouring": "flavoring", + "flavourings": "flavorings", + "flavourless": "flavorless", + "flavours": "flavors", + "flavoursome": "flavorsome", + "flyer / flier": "flier / flyer", + "foetal": "fetal", + "foetid": "fetid", + "foetus": "fetus", + "foetuses": "fetuses", + "formalisation": "formalization", + "formalise": "formalize", + "formalised": "formalized", + "formalises": "formalizes", + "formalising": "formalizing", + "fossilisation": "fossilization", + "fossilise": "fossilize", + "fossilised": "fossilized", + "fossilises": "fossilizes", + "fossilising": "fossilizing", + "fraternisation": "fraternization", + "fraternise": "fraternize", + "fraternised": "fraternized", + "fraternises": "fraternizes", + "fraternising": "fraternizing", + "fulfil": "fulfill", + "fulfilment": "fulfillment", + "fulfils": "fulfills", + "funnelled": "funneled", + "funnelling": "funneling", + "galvanise": "galvanize", + "galvanised": "galvanized", + "galvanises": "galvanizes", + "galvanising": "galvanizing", + "gambolled": "gamboled", + "gambolling": "gamboling", + "gaol": "jail", + "gaolbird": "jailbird", + "gaolbirds": "jailbirds", + "gaolbreak": "jailbreak", + "gaolbreaks": "jailbreaks", + "gaoled": "jailed", + "gaoler": "jailer", + "gaolers": "jailers", + "gaoling": "jailing", + "gaols": "jails", + "gasses": "gases", + "gage": "gauge", + "gaged": "gauged", + "gages": "gauges", + "gaging": "gauging", + "generalisation": "generalization", + "generalisations": "generalizations", + "generalise": "generalize", + "generalised": "generalized", + "generalises": "generalizes", + "generalising": "generalizing", + "ghettoise": "ghettoize", + "ghettoised": "ghettoized", + "ghettoises": "ghettoizes", + "ghettoising": "ghettoizing", + "gipsies": "gypsies", + "glamorise": "glamorize", + "glamorised": "glamorized", + "glamorises": "glamorizes", + "glamorising": "glamorizing", + "glamor": "glamour", + "globalisation": "globalization", + "globalise": "globalize", + "globalised": "globalized", + "globalises": "globalizes", + "globalising": "globalizing", + "glueing": "gluing", + "goitre": "goiter", + "goitres": "goiters", + "gonorrhoea": "gonorrhea", + "gramme": "gram", + "grammes": "grams", + "gravelled": "graveled", + "grey": "gray", + "greyed": "grayed", + "greying": "graying", + "greyish": "grayish", + "greyness": "grayness", + "greys": "grays", + "grovelled": "groveled", + "grovelling": "groveling", + "groyne": "groin", + "groynes": "groins", + "gruelling": "grueling", + "gruellingly": "gruelingly", + "gryphon": "griffin", + "gryphons": "griffins", + "gynaecological": "gynecological", + "gynaecologist": "gynecologist", + "gynaecologists": "gynecologists", + "gynaecology": "gynecology", + "haematological": "hematological", + "haematologist": "hematologist", + "haematologists": "hematologists", + "haematology": "hematology", + "haemoglobin": "hemoglobin", + "haemophilia": "hemophilia", + "haemophiliac": "hemophiliac", + "haemophiliacs": "hemophiliacs", + "haemorrhage": "hemorrhage", + "haemorrhaged": "hemorrhaged", + "haemorrhages": "hemorrhages", + "haemorrhaging": "hemorrhaging", + "haemorrhoids": "hemorrhoids", + "harbour": "harbor", + "harboured": "harbored", + "harbouring": "harboring", + "harbours": "harbors", + "harmonisation": "harmonization", + "harmonise": "harmonize", + "harmonised": "harmonized", + "harmonises": "harmonizes", + "harmonising": "harmonizing", + "homoeopath": "homeopath", + "homoeopathic": "homeopathic", + "homoeopaths": "homeopaths", + "homoeopathy": 
"homeopathy", + "homogenise": "homogenize", + "homogenised": "homogenized", + "homogenises": "homogenizes", + "homogenising": "homogenizing", + "honour": "honor", + "honourable": "honorable", + "honourably": "honorably", + "honoured": "honored", + "honouring": "honoring", + "honours": "honors", + "hospitalisation": "hospitalization", + "hospitalise": "hospitalize", + "hospitalised": "hospitalized", + "hospitalises": "hospitalizes", + "hospitalising": "hospitalizing", + "humanise": "humanize", + "humanised": "humanized", + "humanises": "humanizes", + "humanising": "humanizing", + "humour": "humor", + "humoured": "humored", + "humouring": "humoring", + "humourless": "humorless", + "humours": "humors", + "hybridise": "hybridize", + "hybridised": "hybridized", + "hybridises": "hybridizes", + "hybridising": "hybridizing", + "hypnotise": "hypnotize", + "hypnotised": "hypnotized", + "hypnotises": "hypnotizes", + "hypnotising": "hypnotizing", + "hypothesise": "hypothesize", + "hypothesised": "hypothesized", + "hypothesises": "hypothesizes", + "hypothesising": "hypothesizing", + "idealisation": "idealization", + "idealise": "idealize", + "idealised": "idealized", + "idealises": "idealizes", + "idealising": "idealizing", + "idolise": "idolize", + "idolised": "idolized", + "idolises": "idolizes", + "idolising": "idolizing", + "immobilisation": "immobilization", + "immobilise": "immobilize", + "immobilised": "immobilized", + "immobiliser": "immobilizer", + "immobilisers": "immobilizers", + "immobilises": "immobilizes", + "immobilising": "immobilizing", + "immortalise": "immortalize", + "immortalised": "immortalized", + "immortalises": "immortalizes", + "immortalising": "immortalizing", + "immunisation": "immunization", + "immunise": "immunize", + "immunised": "immunized", + "immunises": "immunizes", + "immunising": "immunizing", + "impanelled": "impaneled", + "impanelling": "impaneling", + "imperilled": "imperiled", + "imperilling": "imperiling", + "individualise": "individualize", + "individualised": "individualized", + "individualises": "individualizes", + "individualising": "individualizing", + "industrialise": "industrialize", + "industrialised": "industrialized", + "industrialises": "industrializes", + "industrialising": "industrializing", + "inflexion": "inflection", + "inflexions": "inflections", + "initialise": "initialize", + "initialised": "initialized", + "initialises": "initializes", + "initialising": "initializing", + "initialled": "initialed", + "initialling": "initialing", + "instal": "install", + "instalment": "installment", + "instalments": "installments", + "instals": "installs", + "instil": "instill", + "instils": "instills", + "institutionalisation": "institutionalization", + "institutionalise": "institutionalize", + "institutionalised": "institutionalized", + "institutionalises": "institutionalizes", + "institutionalising": "institutionalizing", + "intellectualise": "intellectualize", + "intellectualised": "intellectualized", + "intellectualises": "intellectualizes", + "intellectualising": "intellectualizing", + "internalisation": "internalization", + "internalise": "internalize", + "internalised": "internalized", + "internalises": "internalizes", + "internalising": "internalizing", + "internationalisation": "internationalization", + "internationalise": "internationalize", + "internationalised": "internationalized", + "internationalises": "internationalizes", + "internationalising": "internationalizing", + "ionisation": "ionization", + "ionise": "ionize", + "ionised": "ionized", + 
"ioniser": "ionizer", + "ionisers": "ionizers", + "ionises": "ionizes", + "ionising": "ionizing", + "italicise": "italicize", + "italicised": "italicized", + "italicises": "italicizes", + "italicising": "italicizing", + "itemise": "itemize", + "itemised": "itemized", + "itemises": "itemizes", + "itemising": "itemizing", + "jeopardise": "jeopardize", + "jeopardised": "jeopardized", + "jeopardises": "jeopardizes", + "jeopardising": "jeopardizing", + "jewelled": "jeweled", + "jeweller": "jeweler", + "jewellers": "jewelers", + "jewellery": "jewelry", + "judgement": "judgment", + "kilogramme": "kilogram", + "kilogrammes": "kilograms", + "kilometre": "kilometer", + "kilometres": "kilometers", + "labelled": "labeled", + "labelling": "labeling", + "labour": "labor", + "laboured": "labored", + "labourer": "laborer", + "labourers": "laborers", + "labouring": "laboring", + "labours": "labors", + "lacklustre": "lackluster", + "legalisation": "legalization", + "legalise": "legalize", + "legalised": "legalized", + "legalises": "legalizes", + "legalising": "legalizing", + "legitimise": "legitimize", + "legitimised": "legitimized", + "legitimises": "legitimizes", + "legitimising": "legitimizing", + "leukaemia": "leukemia", + "levelled": "leveled", + "leveller": "leveler", + "levellers": "levelers", + "levelling": "leveling", + "libelled": "libeled", + "libelling": "libeling", + "libellous": "libelous", + "liberalisation": "liberalization", + "liberalise": "liberalize", + "liberalised": "liberalized", + "liberalises": "liberalizes", + "liberalising": "liberalizing", + "licence": "license", + "licenced": "licensed", + "licences": "licenses", + "licencing": "licensing", + "likeable": "likable", + "lionisation": "lionization", + "lionise": "lionize", + "lionised": "lionized", + "lionises": "lionizes", + "lionising": "lionizing", + "liquidise": "liquidize", + "liquidised": "liquidized", + "liquidiser": "liquidizer", + "liquidisers": "liquidizers", + "liquidises": "liquidizes", + "liquidising": "liquidizing", + "litre": "liter", + "litres": "liters", + "localise": "localize", + "localised": "localized", + "localises": "localizes", + "localising": "localizing", + "louvre": "louver", + "louvred": "louvered", + "louvres": "louvers", + "lustre": "luster", + "magnetise": "magnetize", + "magnetised": "magnetized", + "magnetises": "magnetizes", + "magnetising": "magnetizing", + "manoeuvrability": "maneuverability", + "manoeuvrable": "maneuverable", + "manoeuvre": "maneuver", + "manoeuvred": "maneuvered", + "manoeuvres": "maneuvers", + "manoeuvring": "maneuvering", + "manoeuvrings": "maneuverings", + "marginalisation": "marginalization", + "marginalise": "marginalize", + "marginalised": "marginalized", + "marginalises": "marginalizes", + "marginalising": "marginalizing", + "marshalled": "marshaled", + "marshalling": "marshaling", + "marvelled": "marveled", + "marvelling": "marveling", + "marvellous": "marvelous", + "marvellously": "marvelously", + "materialisation": "materialization", + "materialise": "materialize", + "materialised": "materialized", + "materialises": "materializes", + "materialising": "materializing", + "maximisation": "maximization", + "maximise": "maximize", + "maximised": "maximized", + "maximises": "maximizes", + "maximising": "maximizing", + "meagre": "meager", + "mechanisation": "mechanization", + "mechanise": "mechanize", + "mechanised": "mechanized", + "mechanises": "mechanizes", + "mechanising": "mechanizing", + "mediaeval": "medieval", + "memorialise": "memorialize", + "memorialised": 
"memorialized", + "memorialises": "memorializes", + "memorialising": "memorializing", + "memorise": "memorize", + "memorised": "memorized", + "memorises": "memorizes", + "memorising": "memorizing", + "mesmerise": "mesmerize", + "mesmerised": "mesmerized", + "mesmerises": "mesmerizes", + "mesmerising": "mesmerizing", + "metabolise": "metabolize", + "metabolised": "metabolized", + "metabolises": "metabolizes", + "metabolising": "metabolizing", + "metre": "meter", + "metres": "meters", + "micrometre": "micrometer", + "micrometres": "micrometers", + "militarise": "militarize", + "militarised": "militarized", + "militarises": "militarizes", + "militarising": "militarizing", + "milligramme": "milligram", + "milligrammes": "milligrams", + "millilitre": "milliliter", + "millilitres": "milliliters", + "millimetre": "millimeter", + "millimetres": "millimeters", + "miniaturisation": "miniaturization", + "miniaturise": "miniaturize", + "miniaturised": "miniaturized", + "miniaturises": "miniaturizes", + "miniaturising": "miniaturizing", + "minibusses": "minibuses", + "minimise": "minimize", + "minimised": "minimized", + "minimises": "minimizes", + "minimising": "minimizing", + "misbehaviour": "misbehavior", + "misdemeanour": "misdemeanor", + "misdemeanours": "misdemeanors", + "misspelt": "misspelled", + "mitre": "miter", + "mitres": "miters", + "mobilisation": "mobilization", + "mobilise": "mobilize", + "mobilised": "mobilized", + "mobilises": "mobilizes", + "mobilising": "mobilizing", + "modelled": "modeled", + "modeller": "modeler", + "modellers": "modelers", + "modelling": "modeling", + "modernise": "modernize", + "modernised": "modernized", + "modernises": "modernizes", + "modernising": "modernizing", + "moisturise": "moisturize", + "moisturised": "moisturized", + "moisturiser": "moisturizer", + "moisturisers": "moisturizers", + "moisturises": "moisturizes", + "moisturising": "moisturizing", + "monologue": "monolog", + "monologues": "monologs", + "monopolisation": "monopolization", + "monopolise": "monopolize", + "monopolised": "monopolized", + "monopolises": "monopolizes", + "monopolising": "monopolizing", + "moralise": "moralize", + "moralised": "moralized", + "moralises": "moralizes", + "moralising": "moralizing", + "motorised": "motorized", + "mould": "mold", + "moulded": "molded", + "moulder": "molder", + "mouldered": "moldered", + "mouldering": "moldering", + "moulders": "molders", + "mouldier": "moldier", + "mouldiest": "moldiest", + "moulding": "molding", + "mouldings": "moldings", + "moulds": "molds", + "mouldy": "moldy", + "moult": "molt", + "moulted": "molted", + "moulting": "molting", + "moults": "molts", + "moustache": "mustache", + "moustached": "mustached", + "moustaches": "mustaches", + "moustachioed": "mustachioed", + "multicoloured": "multicolored", + "nationalisation": "nationalization", + "nationalisations": "nationalizations", + "nationalise": "nationalize", + "nationalised": "nationalized", + "nationalises": "nationalizes", + "nationalising": "nationalizing", + "naturalisation": "naturalization", + "naturalise": "naturalize", + "naturalised": "naturalized", + "naturalises": "naturalizes", + "naturalising": "naturalizing", + "neighbour": "neighbor", + "neighbourhood": "neighborhood", + "neighbourhoods": "neighborhoods", + "neighbouring": "neighboring", + "neighbourliness": "neighborliness", + "neighbourly": "neighborly", + "neighbours": "neighbors", + "neutralisation": "neutralization", + "neutralise": "neutralize", + "neutralised": "neutralized", + "neutralises": 
"neutralizes", + "neutralising": "neutralizing", + "normalisation": "normalization", + "normalise": "normalize", + "normalised": "normalized", + "normalises": "normalizes", + "normalising": "normalizing", + "odour": "odor", + "odourless": "odorless", + "odours": "odors", + "oesophagus": "esophagus", + "oesophaguses": "esophaguses", + "oestrogen": "estrogen", + "offence": "offense", + "offences": "offenses", + "omelette": "omelet", + "omelettes": "omelets", + "optimise": "optimize", + "optimised": "optimized", + "optimises": "optimizes", + "optimising": "optimizing", + "organisation": "organization", + "organisational": "organizational", + "organisations": "organizations", + "organise": "organize", + "organised": "organized", + "organiser": "organizer", + "organisers": "organizers", + "organises": "organizes", + "organising": "organizing", + "orthopaedic": "orthopedic", + "orthopaedics": "orthopedics", + "ostracise": "ostracize", + "ostracised": "ostracized", + "ostracises": "ostracizes", + "ostracising": "ostracizing", + "outmanoeuvre": "outmaneuver", + "outmanoeuvred": "outmaneuvered", + "outmanoeuvres": "outmaneuvers", + "outmanoeuvring": "outmaneuvering", + "overemphasise": "overemphasize", + "overemphasised": "overemphasized", + "overemphasises": "overemphasizes", + "overemphasising": "overemphasizing", + "oxidisation": "oxidization", + "oxidise": "oxidize", + "oxidised": "oxidized", + "oxidises": "oxidizes", + "oxidising": "oxidizing", + "paederast": "pederast", + "paederasts": "pederasts", + "paediatric": "pediatric", + "paediatrician": "pediatrician", + "paediatricians": "pediatricians", + "paediatrics": "pediatrics", + "paedophile": "pedophile", + "paedophiles": "pedophiles", + "paedophilia": "pedophilia", + "palaeolithic": "paleolithic", + "palaeontologist": "paleontologist", + "palaeontologists": "paleontologists", + "palaeontology": "paleontology", + "panelled": "paneled", + "panelling": "paneling", + "panellist": "panelist", + "panellists": "panelists", + "paralyse": "paralyze", + "paralysed": "paralyzed", + "paralyses": "paralyzes", + "paralysing": "paralyzing", + "parcelled": "parceled", + "parcelling": "parceling", + "parlour": "parlor", + "parlours": "parlors", + "particularise": "particularize", + "particularised": "particularized", + "particularises": "particularizes", + "particularising": "particularizing", + "passivisation": "passivization", + "passivise": "passivize", + "passivised": "passivized", + "passivises": "passivizes", + "passivising": "passivizing", + "pasteurisation": "pasteurization", + "pasteurise": "pasteurize", + "pasteurised": "pasteurized", + "pasteurises": "pasteurizes", + "pasteurising": "pasteurizing", + "patronise": "patronize", + "patronised": "patronized", + "patronises": "patronizes", + "patronising": "patronizing", + "patronisingly": "patronizingly", + "pedalled": "pedaled", + "pedalling": "pedaling", + "pedestrianisation": "pedestrianization", + "pedestrianise": "pedestrianize", + "pedestrianised": "pedestrianized", + "pedestrianises": "pedestrianizes", + "pedestrianising": "pedestrianizing", + "penalise": "penalize", + "penalised": "penalized", + "penalises": "penalizes", + "penalising": "penalizing", + "pencilled": "penciled", + "pencilling": "penciling", + "personalise": "personalize", + "personalised": "personalized", + "personalises": "personalizes", + "personalising": "personalizing", + "pharmacopoeia": "pharmacopeia", + "pharmacopoeias": "pharmacopeias", + "philosophise": "philosophize", + "philosophised": "philosophized", + 
"philosophises": "philosophizes", + "philosophising": "philosophizing", + "philtre": "filter", + "philtres": "filters", + "phoney": "phony", + "plagiarise": "plagiarize", + "plagiarised": "plagiarized", + "plagiarises": "plagiarizes", + "plagiarising": "plagiarizing", + "plough": "plow", + "ploughed": "plowed", + "ploughing": "plowing", + "ploughman": "plowman", + "ploughmen": "plowmen", + "ploughs": "plows", + "ploughshare": "plowshare", + "ploughshares": "plowshares", + "polarisation": "polarization", + "polarise": "polarize", + "polarised": "polarized", + "polarises": "polarizes", + "polarising": "polarizing", + "politicisation": "politicization", + "politicise": "politicize", + "politicised": "politicized", + "politicises": "politicizes", + "politicising": "politicizing", + "popularisation": "popularization", + "popularise": "popularize", + "popularised": "popularized", + "popularises": "popularizes", + "popularising": "popularizing", + "pouffe": "pouf", + "pouffes": "poufs", + "practise": "practice", + "practised": "practiced", + "practises": "practices", + "practising": "practicing", + "praesidium": "presidium", + "praesidiums": "presidiums", + "pressurisation": "pressurization", + "pressurise": "pressurize", + "pressurised": "pressurized", + "pressurises": "pressurizes", + "pressurising": "pressurizing", + "pretence": "pretense", + "pretences": "pretenses", + "primaeval": "primeval", + "prioritisation": "prioritization", + "prioritise": "prioritize", + "prioritised": "prioritized", + "prioritises": "prioritizes", + "prioritising": "prioritizing", + "privatisation": "privatization", + "privatisations": "privatizations", + "privatise": "privatize", + "privatised": "privatized", + "privatises": "privatizes", + "privatising": "privatizing", + "professionalisation": "professionalization", + "professionalise": "professionalize", + "professionalised": "professionalized", + "professionalises": "professionalizes", + "professionalising": "professionalizing", + "programme": "program", + "programmes": "programs", + "prologue": "prolog", + "prologues": "prologs", + "propagandise": "propagandize", + "propagandised": "propagandized", + "propagandises": "propagandizes", + "propagandising": "propagandizing", + "proselytise": "proselytize", + "proselytised": "proselytized", + "proselytiser": "proselytizer", + "proselytisers": "proselytizers", + "proselytises": "proselytizes", + "proselytising": "proselytizing", + "psychoanalyse": "psychoanalyze", + "psychoanalysed": "psychoanalyzed", + "psychoanalyses": "psychoanalyzes", + "psychoanalysing": "psychoanalyzing", + "publicise": "publicize", + "publicised": "publicized", + "publicises": "publicizes", + "publicising": "publicizing", + "pulverisation": "pulverization", + "pulverise": "pulverize", + "pulverised": "pulverized", + "pulverises": "pulverizes", + "pulverising": "pulverizing", + "pummelled": "pummel", + "pummelling": "pummeled", + "pyjama": "pajama", + "pyjamas": "pajamas", + "pzazz": "pizzazz", + "quarrelled": "quarreled", + "quarrelling": "quarreling", + "radicalise": "radicalize", + "radicalised": "radicalized", + "radicalises": "radicalizes", + "radicalising": "radicalizing", + "rancour": "rancor", + "randomise": "randomize", + "randomised": "randomized", + "randomises": "randomizes", + "randomising": "randomizing", + "rationalisation": "rationalization", + "rationalisations": "rationalizations", + "rationalise": "rationalize", + "rationalised": "rationalized", + "rationalises": "rationalizes", + "rationalising": "rationalizing", + 
"ravelled": "raveled", + "ravelling": "raveling", + "realisable": "realizable", + "realisation": "realization", + "realisations": "realizations", + "realise": "realize", + "realised": "realized", + "realises": "realizes", + "realising": "realizing", + "recognisable": "recognizable", + "recognisably": "recognizably", + "recognisance": "recognizance", + "recognise": "recognize", + "recognised": "recognized", + "recognises": "recognizes", + "recognising": "recognizing", + "reconnoitre": "reconnoiter", + "reconnoitred": "reconnoitered", + "reconnoitres": "reconnoiters", + "reconnoitring": "reconnoitering", + "refuelled": "refueled", + "refuelling": "refueling", + "regularisation": "regularization", + "regularise": "regularize", + "regularised": "regularized", + "regularises": "regularizes", + "regularising": "regularizing", + "remodelled": "remodeled", + "remodelling": "remodeling", + "remould": "remold", + "remoulded": "remolded", + "remoulding": "remolding", + "remoulds": "remolds", + "reorganisation": "reorganization", + "reorganisations": "reorganizations", + "reorganise": "reorganize", + "reorganised": "reorganized", + "reorganises": "reorganizes", + "reorganising": "reorganizing", + "revelled": "reveled", + "reveller": "reveler", + "revellers": "revelers", + "revelling": "reveling", + "revitalise": "revitalize", + "revitalised": "revitalized", + "revitalises": "revitalizes", + "revitalising": "revitalizing", + "revolutionise": "revolutionize", + "revolutionised": "revolutionized", + "revolutionises": "revolutionizes", + "revolutionising": "revolutionizing", + "rhapsodise": "rhapsodize", + "rhapsodised": "rhapsodized", + "rhapsodises": "rhapsodizes", + "rhapsodising": "rhapsodizing", + "rigour": "rigor", + "rigours": "rigors", + "ritualised": "ritualized", + "rivalled": "rivaled", + "rivalling": "rivaling", + "romanticise": "romanticize", + "romanticised": "romanticized", + "romanticises": "romanticizes", + "romanticising": "romanticizing", + "rumour": "rumor", + "rumoured": "rumored", + "rumours": "rumors", + "sabre": "saber", + "sabres": "sabers", + "saltpetre": "saltpeter", + "sanitise": "sanitize", + "sanitised": "sanitized", + "sanitises": "sanitizes", + "sanitising": "sanitizing", + "satirise": "satirize", + "satirised": "satirized", + "satirises": "satirizes", + "satirising": "satirizing", + "saviour": "savior", + "saviours": "saviors", + "savour": "savor", + "savoured": "savored", + "savouries": "savories", + "savouring": "savoring", + "savours": "savors", + "savoury": "savory", + "scandalise": "scandalize", + "scandalised": "scandalized", + "scandalises": "scandalizes", + "scandalising": "scandalizing", + "sceptic": "skeptic", + "sceptical": "skeptical", + "sceptically": "skeptically", + "scepticism": "skepticism", + "sceptics": "skeptics", + "sceptre": "scepter", + "sceptres": "scepters", + "scrutinise": "scrutinize", + "scrutinised": "scrutinized", + "scrutinises": "scrutinizes", + "scrutinising": "scrutinizing", + "secularisation": "secularization", + "secularise": "secularize", + "secularised": "secularized", + "secularises": "secularizes", + "secularising": "secularizing", + "sensationalise": "sensationalize", + "sensationalised": "sensationalized", + "sensationalises": "sensationalizes", + "sensationalising": "sensationalizing", + "sensitise": "sensitize", + "sensitised": "sensitized", + "sensitises": "sensitizes", + "sensitising": "sensitizing", + "sentimentalise": "sentimentalize", + "sentimentalised": "sentimentalized", + "sentimentalises": "sentimentalizes", + 
"sentimentalising": "sentimentalizing", + "sepulchre": "sepulcher", + "sepulchres": "sepulchers", + "serialisation": "serialization", + "serialisations": "serializations", + "serialise": "serialize", + "serialised": "serialized", + "serialises": "serializes", + "serialising": "serializing", + "sermonise": "sermonize", + "sermonised": "sermonized", + "sermonises": "sermonizes", + "sermonising": "sermonizing", + "sheikh": "sheik", + "shovelled": "shoveled", + "shovelling": "shoveling", + "shrivelled": "shriveled", + "shrivelling": "shriveling", + "signalise": "signalize", + "signalised": "signalized", + "signalises": "signalizes", + "signalising": "signalizing", + "signalled": "signaled", + "signalling": "signaling", + "smoulder": "smolder", + "smouldered": "smoldered", + "smouldering": "smoldering", + "smoulders": "smolders", + "snivelled": "sniveled", + "snivelling": "sniveling", + "snorkelled": "snorkeled", + "snorkelling": "snorkeling", + "snowplough": "snowplow", + "snowploughs": "snowplow", + "socialisation": "socialization", + "socialise": "socialize", + "socialised": "socialized", + "socialises": "socializes", + "socialising": "socializing", + "sodomise": "sodomize", + "sodomised": "sodomized", + "sodomises": "sodomizes", + "sodomising": "sodomizing", + "solemnise": "solemnize", + "solemnised": "solemnized", + "solemnises": "solemnizes", + "solemnising": "solemnizing", + "sombre": "somber", + "specialisation": "specialization", + "specialisations": "specializations", + "specialise": "specialize", + "specialised": "specialized", + "specialises": "specializes", + "specialising": "specializing", + "spectre": "specter", + "spectres": "specters", + "spiralled": "spiraled", + "spiralling": "spiraling", + "splendour": "splendor", + "splendours": "splendors", + "squirrelled": "squirreled", + "squirrelling": "squirreling", + "stabilisation": "stabilization", + "stabilise": "stabilize", + "stabilised": "stabilized", + "stabiliser": "stabilizer", + "stabilisers": "stabilizers", + "stabilises": "stabilizes", + "stabilising": "stabilizing", + "standardisation": "standardization", + "standardise": "standardize", + "standardised": "standardized", + "standardises": "standardizes", + "standardising": "standardizing", + "stencilled": "stenciled", + "stencilling": "stenciling", + "sterilisation": "sterilization", + "sterilisations": "sterilizations", + "sterilise": "sterilize", + "sterilised": "sterilized", + "steriliser": "sterilizer", + "sterilisers": "sterilizers", + "sterilises": "sterilizes", + "sterilising": "sterilizing", + "stigmatisation": "stigmatization", + "stigmatise": "stigmatize", + "stigmatised": "stigmatized", + "stigmatises": "stigmatizes", + "stigmatising": "stigmatizing", + "storey": "story", + "storeys": "stories", + "subsidisation": "subsidization", + "subsidise": "subsidize", + "subsidised": "subsidized", + "subsidiser": "subsidizer", + "subsidisers": "subsidizers", + "subsidises": "subsidizes", + "subsidising": "subsidizing", + "succour": "succor", + "succoured": "succored", + "succouring": "succoring", + "succours": "succors", + "sulphate": "sulfate", + "sulphates": "sulfates", + "sulphide": "sulfide", + "sulphides": "sulfides", + "sulphur": "sulfur", + "sulphurous": "sulfurous", + "summarise": "summarize", + "summarised": "summarized", + "summarises": "summarizes", + "summarising": "summarizing", + "swivelled": "swiveled", + "swivelling": "swiveling", + "symbolise": "symbolize", + "symbolised": "symbolized", + "symbolises": "symbolizes", + "symbolising": "symbolizing", + 
"sympathise": "sympathize", + "sympathised": "sympathized", + "sympathiser": "sympathizer", + "sympathisers": "sympathizers", + "sympathises": "sympathizes", + "sympathising": "sympathizing", + "synchronisation": "synchronization", + "synchronise": "synchronize", + "synchronised": "synchronized", + "synchronises": "synchronizes", + "synchronising": "synchronizing", + "synthesise": "synthesize", + "synthesised": "synthesized", + "synthesiser": "synthesizer", + "synthesisers": "synthesizers", + "synthesises": "synthesizes", + "synthesising": "synthesizing", + "syphon": "siphon", + "syphoned": "siphoned", + "syphoning": "siphoning", + "syphons": "siphons", + "systematisation": "systematization", + "systematise": "systematize", + "systematised": "systematized", + "systematises": "systematizes", + "systematising": "systematizing", + "tantalise": "tantalize", + "tantalised": "tantalized", + "tantalises": "tantalizes", + "tantalising": "tantalizing", + "tantalisingly": "tantalizingly", + "tasselled": "tasseled", + "technicolour": "technicolor", + "temporise": "temporize", + "temporised": "temporized", + "temporises": "temporizes", + "temporising": "temporizing", + "tenderise": "tenderize", + "tenderised": "tenderized", + "tenderises": "tenderizes", + "tenderising": "tenderizing", + "terrorise": "terrorize", + "terrorised": "terrorized", + "terrorises": "terrorizes", + "terrorising": "terrorizing", + "theatre": "theater", + "theatregoer": "theatergoer", + "theatregoers": "theatergoers", + "theatres": "theaters", + "theorise": "theorize", + "theorised": "theorized", + "theorises": "theorizes", + "theorising": "theorizing", + "tonne": "ton", + "tonnes": "tons", + "towelled": "toweled", + "towelling": "toweling", + "toxaemia": "toxemia", + "tranquillise": "tranquilize", + "tranquillised": "tranquilized", + "tranquilliser": "tranquilizer", + "tranquillisers": "tranquilizers", + "tranquillises": "tranquilizes", + "tranquillising": "tranquilizing", + "tranquillity": "tranquility", + "tranquillize": "tranquilize", + "tranquillized": "tranquilized", + "tranquillizer": "tranquilizer", + "tranquillizers": "tranquilizers", + "tranquillizes": "tranquilizes", + "tranquillizing": "tranquilizing", + "tranquilly": "tranquility", + "transistorised": "transistorized", + "traumatise": "traumatize", + "traumatised": "traumatized", + "traumatises": "traumatizes", + "traumatising": "traumatizing", + "travelled": "traveled", + "traveller": "traveler", + "travellers": "travelers", + "travelling": "traveling", + "travelog": "travelogue", + "travelogs": "travelogues", + "trialled": "trialed", + "trialling": "trialing", + "tricolour": "tricolor", + "tricolours": "tricolors", + "trivialise": "trivialize", + "trivialised": "trivialized", + "trivialises": "trivializes", + "trivialising": "trivializing", + "tumour": "tumor", + "tumours": "tumors", + "tunnelled": "tunneled", + "tunnelling": "tunneling", + "tyrannise": "tyrannize", + "tyrannised": "tyrannized", + "tyrannises": "tyrannizes", + "tyrannising": "tyrannizing", + "tyre": "tire", + "tyres": "tires", + "unauthorised": "unauthorized", + "uncivilised": "uncivilized", + "underutilised": "underutilized", + "unequalled": "unequaled", + "unfavourable": "unfavorable", + "unfavourably": "unfavorably", + "unionisation": "unionization", + "unionise": "unionize", + "unionised": "unionized", + "unionises": "unionizes", + "unionising": "unionizing", + "unorganised": "unorganized", + "unravelled": "unraveled", + "unravelling": "unraveling", + "unrecognisable": "unrecognizable", + 
"unrecognised": "unrecognized", + "unrivalled": "unrivaled", + "unsavoury": "unsavory", + "untrammelled": "untrammeled", + "urbanisation": "urbanization", + "urbanise": "urbanize", + "urbanised": "urbanized", + "urbanises": "urbanizes", + "urbanising": "urbanizing", + "utilisable": "utilizable", + "utilisation": "utilization", + "utilise": "utilize", + "utilised": "utilized", + "utilises": "utilizes", + "utilising": "utilizing", + "valour": "valor", + "vandalise": "vandalize", + "vandalised": "vandalized", + "vandalises": "vandalizes", + "vandalising": "vandalizing", + "vaporisation": "vaporization", + "vaporise": "vaporize", + "vaporised": "vaporized", + "vaporises": "vaporizes", + "vaporising": "vaporizing", + "vapour": "vapor", + "vapours": "vapors", + "verbalise": "verbalize", + "verbalised": "verbalized", + "verbalises": "verbalizes", + "verbalising": "verbalizing", + "victimisation": "victimization", + "victimise": "victimize", + "victimised": "victimized", + "victimises": "victimizes", + "victimising": "victimizing", + "videodisc": "videodisk", + "videodiscs": "videodisks", + "vigour": "vigor", + "visualisation": "visualization", + "visualisations": "visualizations", + "visualise": "visualize", + "visualised": "visualized", + "visualises": "visualizes", + "visualising": "visualizing", + "vocalisation": "vocalization", + "vocalisations": "vocalizations", + "vocalise": "vocalize", + "vocalised": "vocalized", + "vocalises": "vocalizes", + "vocalising": "vocalizing", + "vulcanised": "vulcanized", + "vulgarisation": "vulgarization", + "vulgarise": "vulgarize", + "vulgarised": "vulgarized", + "vulgarises": "vulgarizes", + "vulgarising": "vulgarizing", + "waggon": "wagon", + "waggons": "wagons", + "watercolour": "watercolor", + "watercolours": "watercolors", + "weaselled": "weaseled", + "weaselling": "weaseling", + "westernisation": "westernization", + "westernise": "westernize", + "westernised": "westernized", + "westernises": "westernizes", + "westernising": "westernizing", + "womanise": "womanize", + "womanised": "womanized", + "womaniser": "womanizer", + "womanisers": "womanizers", + "womanises": "womanizes", + "womanising": "womanizing", + "woollen": "woolen", + "woollens": "woolens", + "woollies": "woolies", + "woolly": "wooly", + "worshipped": "worshiped", + "worshipping": "worshiping", + "worshipper": "worshiper", + "yodelled": "yodeled", + "yodelling": "yodeling", + "yoghourt": "yogurt", + "yoghourts": "yogurts", + "yoghurt": "yogurt", + "yoghurts": "yogurts", + "mhm": "hmm", + "mmm": "hmm" +} \ No newline at end of file diff --git a/lmms_eval/tasks/librispeech/whisper_normalizer/english.py b/lmms_eval/tasks/librispeech/whisper_normalizer/english.py new file mode 100644 index 00000000..7102f2ec --- /dev/null +++ b/lmms_eval/tasks/librispeech/whisper_normalizer/english.py @@ -0,0 +1,529 @@ +import json +import os +import re +from fractions import Fraction +from typing import Iterator, List, Match, Optional, Union + +from more_itertools import windowed # TODO: new package + +from .basic import remove_symbols_and_diacritics + + +class EnglishNumberNormalizer: + """ + Convert any spelled-out numbers into arabic numbers, while handling: + + - remove any commas + - keep the suffixes such as: `1960s`, `274th`, `32nd`, etc. + - spell out currency symbols after the number. e.g. 
`$20 million` -> `20000000 dollars` + - spell out `one` and `ones` + - interpret successive single-digit numbers as nominal: `one oh one` -> `101` + """ + + def __init__(self): + super().__init__() + + self.zeros = {"o", "oh", "zero"} + self.ones = { + name: i + for i, name in enumerate( + [ + "one", + "two", + "three", + "four", + "five", + "six", + "seven", + "eight", + "nine", + "ten", + "eleven", + "twelve", + "thirteen", + "fourteen", + "fifteen", + "sixteen", + "seventeen", + "eighteen", + "nineteen", + ], + start=1, + ) + } + self.ones_plural = {"sixes" if name == "six" else name + "s": (value, "s") for name, value in self.ones.items()} + self.ones_ordinal = { + "zeroth": (0, "th"), + "first": (1, "st"), + "second": (2, "nd"), + "third": (3, "rd"), + "fifth": (5, "th"), + "twelfth": (12, "th"), + **{name + ("h" if name.endswith("t") else "th"): (value, "th") for name, value in self.ones.items() if value > 3 and value != 5 and value != 12}, + } + self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal} + + self.tens = { + "twenty": 20, + "thirty": 30, + "forty": 40, + "fifty": 50, + "sixty": 60, + "seventy": 70, + "eighty": 80, + "ninety": 90, + } + self.tens_plural = {name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()} + self.tens_ordinal = {name.replace("y", "ieth"): (value, "th") for name, value in self.tens.items()} + self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal} + + self.multipliers = { + "hundred": 100, + "thousand": 1_000, + "million": 1_000_000, + "billion": 1_000_000_000, + "trillion": 1_000_000_000_000, + "quadrillion": 1_000_000_000_000_000, + "quintillion": 1_000_000_000_000_000_000, + "sextillion": 1_000_000_000_000_000_000_000, + "septillion": 1_000_000_000_000_000_000_000_000, + "octillion": 1_000_000_000_000_000_000_000_000_000, + "nonillion": 1_000_000_000_000_000_000_000_000_000_000, + "decillion": 1_000_000_000_000_000_000_000_000_000_000_000, + } + self.multipliers_plural = {name + "s": (value, "s") for name, value in self.multipliers.items()} + self.multipliers_ordinal = {name + "th": (value, "th") for name, value in self.multipliers.items()} + self.multipliers_suffixed = { + **self.multipliers_plural, + **self.multipliers_ordinal, + } + self.decimals = {*self.ones, *self.tens, *self.zeros} + + self.preceding_prefixers = { + "minus": "-", + "negative": "-", + "plus": "+", + "positive": "+", + } + self.following_prefixers = { + "pound": "£", + "pounds": "£", + "euro": "€", + "euros": "€", + "dollar": "$", + "dollars": "$", + "cent": "¢", + "cents": "¢", + } + self.prefixes = set(list(self.preceding_prefixers.values()) + list(self.following_prefixers.values())) + self.suffixers = { + "per": {"cent": "%"}, + "percent": "%", + } + self.specials = {"and", "double", "triple", "point"} + + self.words = set( + [ + key + for mapping in [ + self.zeros, + self.ones, + self.ones_suffixed, + self.tens, + self.tens_suffixed, + self.multipliers, + self.multipliers_suffixed, + self.preceding_prefixers, + self.following_prefixers, + self.suffixers, + self.specials, + ] + for key in mapping + ] + ) + self.literal_words = {"one", "ones"} + + def process_words(self, words: List[str]) -> Iterator[str]: + prefix: Optional[str] = None + value: Optional[Union[str, int]] = None + skip = False + + def to_fraction(s: str): + try: + return Fraction(s) + except ValueError: + return None + + def output(result: Union[str, int]): + nonlocal prefix, value + result = str(result) + if prefix is not None: + result = prefix + result + value =
None + prefix = None + return result + + if len(words) == 0: + return + + for prev, current, next in windowed([None] + words + [None], 3): + if skip: + skip = False + continue + + next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next) + has_prefix = current[0] in self.prefixes + current_without_prefix = current[1:] if has_prefix else current + if re.match(r"^\d+(\.\d+)?$", current_without_prefix): + # arabic numbers (potentially with signs and fractions) + f = to_fraction(current_without_prefix) + assert f is not None + if value is not None: + if isinstance(value, str) and value.endswith("."): + # concatenate decimals / ip address components + value = str(value) + str(current) + continue + else: + yield output(value) + + prefix = current[0] if has_prefix else prefix + if f.denominator == 1: + value = f.numerator # store integers as int + else: + value = current_without_prefix + elif current not in self.words: + # non-numeric words + if value is not None: + yield output(value) + yield output(current) + elif current in self.zeros: + value = str(value or "") + "0" + elif current in self.ones: + ones = self.ones[current] + + if value is None: + value = ones + elif isinstance(value, str) or prev in self.ones: + if prev in self.tens and ones < 10: # replace the last zero with the digit + assert value[-1] == "0" + value = value[:-1] + str(ones) + else: + value = str(value) + str(ones) + elif ones < 10: + if value % 10 == 0: + value += ones + else: + value = str(value) + str(ones) + else: # eleven to nineteen + if value % 100 == 0: + value += ones + else: + value = str(value) + str(ones) + elif current in self.ones_suffixed: + # ordinal or cardinal; yield the number right away + ones, suffix = self.ones_suffixed[current] + if value is None: + yield output(str(ones) + suffix) + elif isinstance(value, str) or prev in self.ones: + if prev in self.tens and ones < 10: + assert value[-1] == "0" + yield output(value[:-1] + str(ones) + suffix) + else: + yield output(str(value) + str(ones) + suffix) + elif ones < 10: + if value % 10 == 0: + yield output(str(value + ones) + suffix) + else: + yield output(str(value) + str(ones) + suffix) + else: # eleven to nineteen + if value % 100 == 0: + yield output(str(value + ones) + suffix) + else: + yield output(str(value) + str(ones) + suffix) + value = None + elif current in self.tens: + tens = self.tens[current] + if value is None: + value = tens + elif isinstance(value, str): + value = str(value) + str(tens) + else: + if value % 100 == 0: + value += tens + else: + value = str(value) + str(tens) + elif current in self.tens_suffixed: + # ordinal or cardinal; yield the number right away + tens, suffix = self.tens_suffixed[current] + if value is None: + yield output(str(tens) + suffix) + elif isinstance(value, str): + yield output(str(value) + str(tens) + suffix) + else: + if value % 100 == 0: + yield output(str(value + tens) + suffix) + else: + yield output(str(value) + str(tens) + suffix) + elif current in self.multipliers: + multiplier = self.multipliers[current] + if value is None: + value = multiplier + elif isinstance(value, str) or value == 0: + f = to_fraction(value) + p = f * multiplier if f is not None else None + if f is not None and p.denominator == 1: + value = p.numerator + else: + yield output(value) + value = multiplier + else: + before = value // 1000 * 1000 + residual = value % 1000 + value = before + residual * multiplier + elif current in self.multipliers_suffixed: + multiplier, suffix = self.multipliers_suffixed[current] + if 
value is None: + yield output(str(multiplier) + suffix) + elif isinstance(value, str): + f = to_fraction(value) + p = f * multiplier if f is not None else None + if f is not None and p.denominator == 1: + yield output(str(p.numerator) + suffix) + else: + yield output(value) + yield output(str(multiplier) + suffix) + else: # int + before = value // 1000 * 1000 + residual = value % 1000 + value = before + residual * multiplier + yield output(str(value) + suffix) + value = None + elif current in self.preceding_prefixers: + # apply prefix (positive, minus, etc.) if it precedes a number + if value is not None: + yield output(value) + + if next in self.words or next_is_numeric: + prefix = self.preceding_prefixers[current] + else: + yield output(current) + elif current in self.following_prefixers: + # apply prefix (dollars, cents, etc.) only after a number + if value is not None: + prefix = self.following_prefixers[current] + yield output(value) + else: + yield output(current) + elif current in self.suffixers: + # apply suffix symbols (percent -> '%') + if value is not None: + suffix = self.suffixers[current] + if isinstance(suffix, dict): + if next in suffix: + yield output(str(value) + suffix[next]) + skip = True + else: + yield output(value) + yield output(current) + else: + yield output(str(value) + suffix) + else: + yield output(current) + elif current in self.specials: + if next not in self.words and not next_is_numeric: + # apply special handling only if the next word can be numeric + if value is not None: + yield output(value) + yield output(current) + elif current == "and": + # ignore "and" after hundreds, thousands, etc. + if prev not in self.multipliers: + if value is not None: + yield output(value) + yield output(current) + elif current == "double" or current == "triple": + if next in self.ones or next in self.zeros: + repeats = 2 if current == "double" else 3 + ones = self.ones.get(next, 0) + value = str(value or "") + str(ones) * repeats + skip = True + else: + if value is not None: + yield output(value) + yield output(current) + elif current == "point": + if next in self.decimals or next_is_numeric: + value = str(value or "") + "." 
+ else: + # should all have been covered at this point + raise ValueError(f"Unexpected token: {current}") + else: + # all should have been covered at this point + raise ValueError(f"Unexpected token: {current}") + + if value is not None: + yield output(value) + + def preprocess(self, s: str): + # replace " and a half" with " point five" + results = [] + + segments = re.split(r"\band\s+a\s+half\b", s) + for i, segment in enumerate(segments): + if len(segment.strip()) == 0: + continue + if i == len(segments) - 1: + results.append(segment) + else: + results.append(segment) + last_word = segment.rsplit(maxsplit=2)[-1] + if last_word in self.decimals or last_word in self.multipliers: + results.append("point five") + else: + results.append("and a half") + + s = " ".join(results) + + # put a space at number/letter boundary + s = re.sub(r"([a-z])([0-9])", r"\1 \2", s) + s = re.sub(r"([0-9])([a-z])", r"\1 \2", s) + + # but remove spaces which could be a suffix + s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s) + + return s + + def postprocess(self, s: str): + def combine_cents(m: Match): + try: + currency = m.group(1) + integer = m.group(2) + cents = int(m.group(3)) + return f"{currency}{integer}.{cents:02d}" + except ValueError: + return m.string + + def extract_cents(m: Match): + try: + return f"ยข{int(m.group(1))}" + except ValueError: + return m.string + + # apply currency postprocessing; "$2 and ยข7" -> "$2.07" + s = re.sub(r"([โ‚ฌยฃ$])([0-9]+) (?:and )?ยข([0-9]{1,2})\b", combine_cents, s) + s = re.sub(r"[โ‚ฌยฃ$]0.([0-9]{1,2})\b", extract_cents, s) + + # write "one(s)" instead of "1(s)", just for the readability + s = re.sub(r"\b1(s?)\b", r"one\1", s) + + return s + + def __call__(self, s: str): + s = self.preprocess(s) + s = " ".join(word for word in self.process_words(s.split()) if word is not None) + s = self.postprocess(s) + + return s + + +class EnglishSpellingNormalizer: + """ + Applies British-American spelling mappings as listed in [1]. + + [1] https://www.tysto.com/uk-us-spelling-list.html + """ + + def __init__(self): + mapping_path = os.path.join(os.path.dirname(__file__), "english.json") + self.mapping = json.load(open(mapping_path)) + + def __call__(self, s: str): + return " ".join(self.mapping.get(word, word) for word in s.split()) + + +class EnglishTextNormalizer: + def __init__(self): + self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um)\b" + self.replacers = { + # common contractions + r"\bwon't\b": "will not", + r"\bcan't\b": "can not", + r"\blet's\b": "let us", + r"\bain't\b": "aint", + r"\by'all\b": "you all", + r"\bwanna\b": "want to", + r"\bgotta\b": "got to", + r"\bgonna\b": "going to", + r"\bi'ma\b": "i am going to", + r"\bimma\b": "i am going to", + r"\bwoulda\b": "would have", + r"\bcoulda\b": "could have", + r"\bshoulda\b": "should have", + r"\bma'am\b": "madam", + # contractions in titles/prefixes + r"\bmr\b": "mister ", + r"\bmrs\b": "missus ", + r"\bst\b": "saint ", + r"\bdr\b": "doctor ", + r"\bprof\b": "professor ", + r"\bcapt\b": "captain ", + r"\bgov\b": "governor ", + r"\bald\b": "alderman ", + r"\bgen\b": "general ", + r"\bsen\b": "senator ", + r"\brep\b": "representative ", + r"\bpres\b": "president ", + r"\brev\b": "reverend ", + r"\bhon\b": "honorable ", + r"\basst\b": "assistant ", + r"\bassoc\b": "associate ", + r"\blt\b": "lieutenant ", + r"\bcol\b": "colonel ", + r"\bjr\b": "junior ", + r"\bsr\b": "senior ", + r"\besq\b": "esquire ", + # prefect tenses, ideally it should be any past participles, but it's harder.. 
+ r"'d been\b": " had been", + r"'s been\b": " has been", + r"'d gone\b": " had gone", + r"'s gone\b": " has gone", + r"'d done\b": " had done", # "'s done" is ambiguous + r"'s got\b": " has got", + # general contractions + r"n't\b": " not", + r"'re\b": " are", + r"'s\b": " is", + r"'d\b": " would", + r"'ll\b": " will", + r"'t\b": " not", + r"'ve\b": " have", + r"'m\b": " am", + } + self.standardize_numbers = EnglishNumberNormalizer() + self.standardize_spellings = EnglishSpellingNormalizer() + + def __call__(self, s: str): + s = s.lower() + + s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets + s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parenthesis + s = re.sub(self.ignore_patterns, "", s) + s = re.sub(r"\s+'", "'", s) # when there's a space before an apostrophe + + for pattern, replacement in self.replacers.items(): + s = re.sub(pattern, replacement, s) + + s = re.sub(r"(\d),(\d)", r"\1\2", s) # remove commas between digits + s = re.sub(r"\.([^0-9]|$)", r" \1", s) # remove periods not followed by numbers + s = remove_symbols_and_diacritics(s, keep=".%$ยขโ‚ฌยฃ") # keep numeric symbols + + s = self.standardize_numbers(s) + s = self.standardize_spellings(s) + + # now remove prefix/suffix symbols that are not preceded/followed by numbers + s = re.sub(r"[.$ยขโ‚ฌยฃ]([^0-9])", r" \1", s) + s = re.sub(r"([^0-9])%", r"\1 ", s) + + s = re.sub(r"\s+", " ", s) # replace any successive whitespaces with a space + + return s diff --git a/lmms_eval/tasks/mix_evals/audio2text/_default_template_yaml b/lmms_eval/tasks/mix_evals/audio2text/_default_template_yaml new file mode 100644 index 00000000..459386ab --- /dev/null +++ b/lmms_eval/tasks/mix_evals/audio2text/_default_template_yaml @@ -0,0 +1,10 @@ +dataset_kwargs: + token: true +dataset_path: lmms-lab/MixEval-X-audio2text +lmms_eval_specific_kwargs: + default: + post_prompt: "" + pre_prompt: "" +metadata: + gpt_eval_model_name: gpt-4o-mini + version: 0 \ No newline at end of file diff --git a/lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2_text_freeform.yaml b/lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2_text_freeform.yaml new file mode 100644 index 00000000..8d69fd2b --- /dev/null +++ b/lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2_text_freeform.yaml @@ -0,0 +1,24 @@ +task: "mix_evals_audio2text_freeform" +test_split: free_form +output_type: generate_until +doc_to_visual: !function utils.mix_evals_audio2text_doc_to_audio +doc_to_text: !function utils.mix_evals_audio2text_doc_to_text +doc_to_target: !function utils.mix_evals_audio2text_doc_to_target +process_results: !function utils.mix_evals_audio2text_process_results_freeform +metric_list: + - metric: gpt_eval + aggregation: !function utils.mix_evals_audio2text_gpt_eval + higher_is_better: true + +generation_kwargs: + max_new_tokens: 64 + +include: _default_template_yaml + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Answer the question using a single word or phrase." + gpt4v: + pre_prompt: "These are frames from a video. Please answer the following questions about the video with a short phrase." 
+ post_prompt: "" diff --git a/lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2_text_freeform_hard.yaml b/lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2_text_freeform_hard.yaml new file mode 100644 index 00000000..8fdbd3c1 --- /dev/null +++ b/lmms_eval/tasks/mix_evals/audio2text/mix_evals_audio2_text_freeform_hard.yaml @@ -0,0 +1,24 @@ +task: "mix_evals_audio2text_freeform_hard" +test_split: free_form_hard +output_type: generate_until +doc_to_visual: !function utils.mix_evals_audio2text_doc_to_audio +doc_to_text: !function utils.mix_evals_audio2text_doc_to_text +doc_to_target: !function utils.mix_evals_audio2text_doc_to_target +process_results: !function utils.mix_evals_audio2text_process_results_freeform +metric_list: + - metric: gpt_eval + aggregation: !function utils.mix_evals_audio2text_gpt_eval + higher_is_better: true + +generation_kwargs: + max_new_tokens: 64 + +include: _default_template_yaml + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Answer the question using a single word or phrase." + gpt4v: + pre_prompt: "These are frames from a video. Please answer the following questions about the video with a short phrase." + post_prompt: "" diff --git a/lmms_eval/tasks/mix_evals/audio2text/utils.py b/lmms_eval/tasks/mix_evals/audio2text/utils.py new file mode 100644 index 00000000..6210a51b --- /dev/null +++ b/lmms_eval/tasks/mix_evals/audio2text/utils.py @@ -0,0 +1,159 @@ +import datetime +import json +import os +import re +import sys +import time +from pathlib import Path + +import requests +import yaml +from loguru import logger as eval_logger + +import lmms_eval.tasks._task_utils.file_utils as file_utils +from lmms_eval.filters.extraction import ExtendedRegexFilter + +with open(Path(__file__).parent / "_default_template_yaml", "r") as f: + raw_data = f.readlines() + safe_data = [] + for i, line in enumerate(raw_data): + # remove function definition since yaml load cannot handle it + if "!function" not in line: + safe_data.append(line) + + config = yaml.safe_load("".join(safe_data)) + +NUM_SECONDS_TO_SLEEP = 5 +GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"] +API_TYPE = os.getenv("API_TYPE", "openai") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } + +eval_prompt = """You are an AI assistant who will help me to evaluate the quality of a model response to a few candidate ground truth answers. + +Some criterion +- Response that perfectly reflect the meaning of the ground truth: 1 point +- Response that reflect none of the key points in the ground truth: 0 point +- Some part in the response are correct but some parts in the ground truth are not mentioned in the response: 0.5 point +- Some part in the response are correct but other parts in the response are not mentioned in the ground truth: 0.5 point + +Here're some examples about the scoring criterion and format: +model response: Steam Cleaning Services +ground truth: ["steam clean", "steam clean", "cleaning", "car", "steam clean"], +Point: 1 + +model response: A cowboy action shooter. 
+ground truth: ["man"] +Point: 1 + +model response: I'm sorry, but I can't assist with that request. +ground truth: ["quality"] +Point: 0 + +Let's begin this task: +model response: {model_response} +ground truth: {ground_truth} +Point:""" + + +def get_eval(model_response: str, ground_truth: str, max_tokens: int, retries: int = 5): + global headers + content = eval_prompt.format(model_response=model_response, ground_truth=ground_truth) + + messages = [ + {"role": "user", "content": content}, + ] + + payload = { + "model": GPT_EVAL_MODEL_NAME, + "messages": messages, + "temperature": 0.2, + "max_tokens": max_tokens, + } + + for attempt in range(retries): + try: + response = requests.post(API_URL, headers=headers, json=payload, timeout=60) + response.raise_for_status() + response_data = response.json() + + content = response_data["choices"][0]["message"]["content"].strip() + if content != "": + return content, response_data["model"] + break # If successful, break out of the loop + + except Exception as e: + eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}") + if attempt < retries: # If we have retries left, sleep and then continue to next attempt + time.sleep(NUM_SECONDS_TO_SLEEP) + else: # If this was the last attempt, log and return empty + eval_logger.error(f"All {retries} attempts failed. Last error message: {e}") + return "0", "" + return "", "" + + +def mix_evals_audio2text_doc_to_audio(doc): + return [doc["audio"]] + + +def mix_evals_audio2text_doc_to_target(doc): + return doc["reference_answer"][0] + + +# This is the place where you format your question +def mix_evals_audio2text_doc_to_text(doc, lmms_eval_specific_kwargs=None): + if lmms_eval_specific_kwargs is None: + lmms_eval_specific_kwargs = {} + pre_prompt = "" + post_prompt = "" + if "pre_prompt" in lmms_eval_specific_kwargs: + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + if "post_prompt" in lmms_eval_specific_kwargs: + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + + user_prompt = doc["query"] + + if pre_prompt: + user_prompt = f"{pre_prompt}\n{user_prompt}" + + if post_prompt: + user_prompt = f"{user_prompt}\n{post_prompt}" + return user_prompt + + +def mix_evals_audio2text_process_results_freeform(doc, result): + pred = result[0] + ground_truth_str = doc["reference_answer"][0] + content = eval_prompt.format(model_response=pred, ground_truth=ground_truth_str) + eval_answer, model_name = get_eval(model_response=pred, ground_truth=ground_truth_str, max_tokens=1024) + return { + "gpt_eval": {"pred": pred, "id": doc["id"], "target": ground_truth_str, "eval_answer": eval_answer, "gpt_prompt": content}, + } + + +def mix_evals_audio2text_gpt_eval(results, args): + score = 0 + for result in results: + eval_answer = result["eval_answer"] + eval_score = re.search(r"([0-9.]+)", eval_answer).group(1) + try: + eval_score = float(eval_score) + except Exception as e: + eval_logger.error(f"Error parsing eval_score: {e}") + eval_score = 0.0 + score += eval_score + + return score / len(results) diff --git a/lmms_eval/tasks/muchomusic/muchomusic.yaml b/lmms_eval/tasks/muchomusic/muchomusic.yaml new file mode 100644 index 00000000..6876914d --- /dev/null +++ b/lmms_eval/tasks/muchomusic/muchomusic.yaml @@ -0,0 +1,25 @@ +dataset_path: lmms-lab/muchomusic +dataset_kwargs: + token: True + +task: "muchomusic" +test_split: test +doc_to_target: !function utils.muchomusic_doc_to_target +doc_to_visual: !function utils.muchomusic_doc_to_audio +doc_to_text: !function utils.muchomusic_doc_to_text +doc_to_choice: !function 
utils.muchomusic_doc_to_choice
+
+lmms_eval_specific_kwargs:
+  default:
+    pre_prompt: ""
+    post_prompt: "\nAnswer with the option's letter from the given choices directly: "
+metric_list:
+  - metric: accuracy
+    aggregation: mean
+    higher_is_better: true
+
+process_results: !function utils.muchomusic_process_results
+
+metadata:
+  gpt_eval_model_name: gpt-4o
+  version: 0.0
diff --git a/lmms_eval/tasks/muchomusic/utils.py b/lmms_eval/tasks/muchomusic/utils.py
new file mode 100644
index 00000000..a01fd59f
--- /dev/null
+++ b/lmms_eval/tasks/muchomusic/utils.py
@@ -0,0 +1,87 @@
+import datetime
+import json
+import os
+import random
+import re
+import sys
+import time
+from pathlib import Path
+
+import numpy as np
+import requests
+import yaml
+from loguru import logger as eval_logger
+
+from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
+
+
+def muchomusic_doc_to_audio(doc):
+    return [doc["context"]]
+
+
+def muchomusic_doc_to_text(doc, lmms_eval_specific_kwargs):
+    question = doc["instruction"]
+    answers = doc["choices"]
+    question = f"{question}\n{answers}"
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    return f"{pre_prompt}{question}{post_prompt}"
+
+
+def muchomusic_doc_to_target(doc):
+    return doc["answer"][1]
+
+
+def muchomusic_doc_to_choice(doc):
+    return ["A", "B", "C", "D"]
+
+
+def muchomusic_process_results(doc, result):
+    response = result[0].strip()
+    all_choices = ["A", "B", "C", "D"]
+    pred = parse_multi_choice_response(response, all_choices)  # adapted from MMMU
+    gt_ans = doc["answer"][1]
+    score = 1.0 if pred == gt_ans else 0.0
+    return {"accuracy": score}
+
+
+def parse_multi_choice_response(response, all_choices):
+    """
+    Parse the prediction from the generated response.
+    Return the predicted choice letter, e.g., A, B, C, D.
+    """
+    # Strip unwanted punctuation from the ends of the response
+    for char in [",", ".", "!", "?", ";", ":", "'"]:
+        response = response.strip(char)
+    response = " " + response + " "  # Pad with spaces to avoid partial matches
+
+    candidates = []
+    # Look for choices with parentheses, e.g., (A)
+    for choice in all_choices:
+        if f"({choice})" in response:
+            candidates.append(choice)
+
+    # Look for simple choices, e.g., A, B, C
+    if len(candidates) == 0:
+        for choice in all_choices:
+            if f" {choice} " in response:
+                candidates.append(choice)
+
+    # Look for choices with periods, e.g., A., B., C.
+    if len(candidates) == 0:
+        for choice in all_choices:
+            if f"{choice}." 
in response: + candidates.append(choice) + + # If no candidates, randomly choose one + if len(candidates) == 0: + pred_index = random.choice(all_choices) + elif len(candidates) > 1: + # If more than one candidate, choose the last one found + start_indexes = [response.rfind(f" {can} ") for can in candidates] + pred_index = candidates[np.argmax(start_indexes)] + else: + # If only one candidate, use it + pred_index = candidates[0] + + return pred_index diff --git a/lmms_eval/tasks/openhermes/openhermes.yaml b/lmms_eval/tasks/openhermes/openhermes.yaml new file mode 100644 index 00000000..0e94b6f7 --- /dev/null +++ b/lmms_eval/tasks/openhermes/openhermes.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/openhermes_instruction +dataset_kwargs: + token: True + +task: "openhermes" +test_split: test +doc_to_target: "answer" +doc_to_visual: !function utils.doc_to_audio +doc_to_text: !function utils.doc_to_text + +generation_kwargs: + max_new_tokens: 1024 + temperature: 0.2 + top_p: 1.0 + num_beams: 1 + +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "\nPlease give a detail answer to the question in the audio." +metric_list: + - metric: gpt_eval + aggregation: !function utils.openhermes_aggregate_results + higher_is_better: true + +process_results: !function utils.openhermes_process_results + +metadata: + gpt_eval_model_name: gpt-4o + version: 0.0 diff --git a/lmms_eval/tasks/openhermes/utils.py b/lmms_eval/tasks/openhermes/utils.py new file mode 100644 index 00000000..e5522c00 --- /dev/null +++ b/lmms_eval/tasks/openhermes/utils.py @@ -0,0 +1,139 @@ +import datetime +import json +import os +import random +import re +import sys +import time +from pathlib import Path + +import requests +import yaml +from loguru import logger as eval_logger + +import lmms_eval.tasks._task_utils.file_utils as file_utils +from lmms_eval.filters.extraction import ExtendedRegexFilter + + +def doc_to_audio(doc): + return [doc["context"]] + + +def doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{post_prompt}" + + +with open(Path(__file__).parent / "openhermes.yaml", "r") as f: + raw_data = f.readlines() + safe_data = [] + for i, line in enumerate(raw_data): + # remove function definition since yaml load cannot handle it + if "!function" not in line: + safe_data.append(line) + + config = yaml.safe_load("".join(safe_data)) + +# specify api type and key in .env +GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"] +API_TYPE = os.getenv("API_TYPE", "azure") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } + +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } + +eval_prompt = """ + [Question] + {question} + + [Reference Answer] + {ground_truth} + + [Model Answer] + {model_response} + + [Task] + Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details. 
+    Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
+    Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference.
+    Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
+    Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
+    Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
+    Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
+    Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.
+
+    Your response should be formatted as follows:
+    Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
+    Rating: (int)"""
+
+
+retries = 3
+NUM_SECONDS_TO_SLEEP = 5
+
+
+def get_eval(max_tokens: int, content: str, retries: int = retries):
+    global headers
+
+    messages = [
+        {"role": "user", "content": content},
+    ]
+
+    payload = {"model": GPT_EVAL_MODEL_NAME, "messages": messages, "temperature": 0.7, "max_tokens": max_tokens, "top_p": 0.95, "frequency_penalty": 0, "presence_penalty": 0, "stop": None}
+
+    for attempt in range(retries):
+        try:
+            response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
+            response.raise_for_status()
+            response_data = response.json()
+
+            content = response_data["choices"][0]["message"]["content"].strip()
+            if content != "":
+                return content, response_data["model"]
+            break  # successful but empty response; do not retry
+
+        except Exception as e:
+            eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}")
+            if attempt < retries - 1:  # retries remain, so sleep and continue to the next attempt
+                time.sleep(NUM_SECONDS_TO_SLEEP)
+            else:  # this was the last attempt, so log and return empty
+                eval_logger.error(f"All {retries} attempts failed. 
Last error message: {e}") + return "", "" + return "", "" + + +def openhermes_process_results(doc, result): + pred = result[0] + ground_truth_str = doc["answer"] + content = eval_prompt.format(model_response=pred, ground_truth=ground_truth_str, question=doc["speech_instruction"]) + eval_answer, model_name = get_eval(max_tokens=1024, content=content) + return { + "gpt_eval": {"eval_answer": eval_answer, "model_name": model_name}, + } + + +def openhermes_aggregate_results(results): + score = 0 + for result in results: + try: + eval_answer = result["eval_answer"] + eval_score = re.search(r"([0-5])", eval_answer).group(1) + eval_score = float(eval_score) + except Exception as e: + eval_logger.error(f"Error parsing eval_score: {e}") + eval_score = 0.0 + score += eval_score + + return score / len(results) * 20 diff --git a/lmms_eval/tasks/people_speech/people_speech_val.yaml b/lmms_eval/tasks/people_speech/people_speech_val.yaml new file mode 100644 index 00000000..b680549d --- /dev/null +++ b/lmms_eval/tasks/people_speech/people_speech_val.yaml @@ -0,0 +1,29 @@ +dataset_path: lmms-lab/peoples_speech +dataset_kwargs: + token: True +task : "people_speech_val" +test_split: val +output_type: generate_until +doc_to_visual: !function utils.people_speech_doc_to_audio +doc_to_text: !function utils.people_speech_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.people_speech_process_result +metric_list: + - metric: wer + aggregation : !function utils.people_speech_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" diff --git a/lmms_eval/tasks/people_speech/utils.py b/lmms_eval/tasks/people_speech/utils.py new file mode 100644 index 00000000..af26f832 --- /dev/null +++ b/lmms_eval/tasks/people_speech/utils.py @@ -0,0 +1,182 @@ +import os +import re +import unicodedata + +import editdistance as ed +import zhconv + +from lmms_eval.tasks.librispeech.cn_tn import TextNorm +from lmms_eval.tasks.librispeech.whisper_normalizer.basic import BasicTextNormalizer +from lmms_eval.tasks.librispeech.whisper_normalizer.english import EnglishTextNormalizer + +# ImportError: To support decoding audio files, please install 'librosa' and 'soundfile'. 
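+# Text normalizers applied to references and hypotheses before WER scoring:
+# Whisper's English normalizer for "en", the configurable Chinese normalizer
+# for "zh", and the basic normalizer as the fallback for other languages.
+# e.g., english_normalizer("Mr. Smith paid $20") -> "mister smith paid $20"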
+english_normalizer = EnglishTextNormalizer() +chinese_normalizer = TextNorm( + to_banjiao=False, + to_upper=False, + to_lower=False, + remove_fillers=False, + remove_erhua=False, + check_chars=False, + remove_space=False, + cc_mode="", +) +basic_normalizer = BasicTextNormalizer() + +dir_name = os.path.dirname(os.path.abspath(__file__)) + + +def people_speech_doc_to_audio(doc): + return [doc["audio"]] + + +def people_speech_doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}" + + +def people_speech_process_result(doc, result): + pred = result[0] if len(result) > 0 else "" + + gt = doc["gt"] + source = doc["source"] + task = doc["task"] + + data_dict = {"gt": gt, "pred": pred, "source": source, "task": task} + + return {"wer": data_dict} + + +PUNCS = "!,.?;:" + + +def remove_sp(text, language): + gt = re.sub(r"<\|.*?\|>", " ", text) + gt = re.sub(rf"\s+", r" ", gt) # Replace consecutive spaces in the text with a single space. + gt = re.sub(f" ?([{PUNCS}])", r"\1", gt) + gt = gt.lstrip(" ") + if language == "zh": + gt = re.sub(rf"\s+", r"", gt) + return gt + + +class EvaluationTokenizer(object): + """A generic evaluation-time tokenizer, which leverages built-in tokenizers + in sacreBLEU (https://github.com/mjpost/sacrebleu). It additionally provides + lowercasing, punctuation removal and character tokenization, which are + applied after sacreBLEU tokenization. + + Args: + tokenizer_type (str): the type of sacreBLEU tokenizer to apply. + lowercase (bool): lowercase the text. + punctuation_removal (bool): remove punctuation (based on unicode + category) from text. + character_tokenization (bool): tokenize the text to characters. 
+ """ + + SPACE = chr(32) + SPACE_ESCAPE = chr(9601) + # ALL_TOKENIZER_TYPES = ChoiceEnum(["none", "13a", "intl", "zh", "ja-mecab"]) + + def __init__( + self, + tokenizer_type: str = "13a", + lowercase: bool = False, + punctuation_removal: bool = False, + character_tokenization: bool = False, + ): + from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a + from sacrebleu.tokenizers.tokenizer_char import TokenizerChar + from sacrebleu.tokenizers.tokenizer_intl import TokenizerV14International + from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab + from sacrebleu.tokenizers.tokenizer_none import NoneTokenizer + from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh + + TOKENIZERS = { + "none": NoneTokenizer, + "13a": Tokenizer13a, + "intl": TokenizerV14International, + "zh": TokenizerZh, + "ja-mecab": TokenizerJaMecab, + "char": TokenizerChar, + } + + assert tokenizer_type in TOKENIZERS, f"{tokenizer_type}, {TOKENIZERS}" + self.lowercase = lowercase + self.punctuation_removal = punctuation_removal + self.character_tokenization = character_tokenization + self.tokenizer = TOKENIZERS[tokenizer_type] + # self.tokenizer = tokenizer_none + + @classmethod + def remove_punctuation(cls, sent: str): + """Remove punctuation based on Unicode category.""" + return cls.SPACE.join(t for t in sent.split(cls.SPACE) if not all(unicodedata.category(c)[0] == "P" for c in t)) + + def tokenize(self, sent: str): + tokenized = self.tokenizer()(sent) + + if self.punctuation_removal: + tokenized = self.remove_punctuation(tokenized) + + if self.character_tokenization: + tokenized = self.SPACE.join(list(tokenized.replace(self.SPACE, self.SPACE_ESCAPE))) + + if self.lowercase: + tokenized = tokenized.lower() + + return tokenized + + +def compute_wer(refs, hyps, language): + distance = 0 + ref_length = 0 + tokenizer = EvaluationTokenizer( + tokenizer_type="none", + lowercase=True, + punctuation_removal=True, + character_tokenization=False, + ) + for i in range(len(refs)): + ref = refs[i] + pred = hyps[i] + if language in ["yue"]: + ref = zhconv.convert(ref, "zh-cn") + pred = zhconv.convert(pred, "zh-cn") + if language in ["en"]: + ref = english_normalizer(ref) + pred = english_normalizer(pred) + if language in ["zh"]: + ref = chinese_normalizer(ref) + pred = chinese_normalizer(pred) + else: + ref = basic_normalizer(ref) + pred = basic_normalizer(pred) + ref_items = tokenizer.tokenize(ref).split() + pred_items = tokenizer.tokenize(pred).split() + if language in ["zh", "yue"]: + ref_items = [x for x in "".join(ref_items)] + pred_items = [x for x in "".join(pred_items)] + if i == 0: + print(f"ref: {ref}") + print(f"pred: {pred}") + print(f"ref_items:\n{ref_items}\n{len(ref_items)}\n{ref_items[0]}") + print(f"pred_items:\n{pred_items}\n{len(ref_items)}\n{ref_items[0]}") + distance += ed.eval(ref_items, pred_items) + ref_length += len(ref_items) + return distance / ref_length + + +def people_speech_wer(results, args): + refs, hyps = [], [] + for result in results: + lan = result["task"][4:] + gt = result["gt"] + response = result["pred"] + gt = remove_sp(gt, lan) + response = remove_sp(response, lan) + refs.append(gt) + hyps.append(response) + wer = compute_wer(refs, hyps, lan) + return wer * 100 diff --git a/lmms_eval/tasks/tedlium/tedlium_dev_test.yaml b/lmms_eval/tasks/tedlium/tedlium_dev_test.yaml new file mode 100644 index 00000000..18810f4c --- /dev/null +++ b/lmms_eval/tasks/tedlium/tedlium_dev_test.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/tedlium +dataset_kwargs: + token: True +task : 
"tedlium_dev_test" +test_split: val +dataset_name: tedlium_dev_test +output_type: generate_until +doc_to_visual: !function utils.tedlium_doc_to_audio +doc_to_text: !function utils.tedlium_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.tedlium_process_result +metric_list: + - metric: wer + aggregation : !function utils.tedlium_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/tedlium/tedlium_long_form.yaml b/lmms_eval/tasks/tedlium/tedlium_long_form.yaml new file mode 100644 index 00000000..f72983b7 --- /dev/null +++ b/lmms_eval/tasks/tedlium/tedlium_long_form.yaml @@ -0,0 +1,30 @@ +dataset_path: lmms-lab/tedlium +dataset_kwargs: + token: True +task : "tedlium_long_form" +test_split: val +dataset_name: tedlium_long_form +output_type: generate_until +doc_to_visual: !function utils.tedlium_doc_to_audio +doc_to_text: !function utils.tedlium_doc_to_text +doc_to_target: "gt" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.tedlium_process_result +metric_list: + - metric: wer + aggregation : !function utils.tedlium_wer + higher_is_better : false +metadata: + - version: 0.0 +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" + qwen2_audio: + pre_prompt: "" + post_prompt: " <|en|>" \ No newline at end of file diff --git a/lmms_eval/tasks/tedlium/utils.py b/lmms_eval/tasks/tedlium/utils.py new file mode 100644 index 00000000..606b15e3 --- /dev/null +++ b/lmms_eval/tasks/tedlium/utils.py @@ -0,0 +1,182 @@ +import os +import re +import unicodedata + +import editdistance as ed +import zhconv + +from lmms_eval.tasks.librispeech.cn_tn import TextNorm +from lmms_eval.tasks.librispeech.whisper_normalizer.basic import BasicTextNormalizer +from lmms_eval.tasks.librispeech.whisper_normalizer.english import EnglishTextNormalizer + +# ImportError: To support decoding audio files, please install 'librosa' and 'soundfile'. +english_normalizer = EnglishTextNormalizer() +chinese_normalizer = TextNorm( + to_banjiao=False, + to_upper=False, + to_lower=False, + remove_fillers=False, + remove_erhua=False, + check_chars=False, + remove_space=False, + cc_mode="", +) +basic_normalizer = BasicTextNormalizer() + +dir_name = os.path.dirname(os.path.abspath(__file__)) + + +def tedlium_doc_to_audio(doc): + return [doc["audio"]] + + +def tedlium_doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}Please recognize the speech and only output the recognized content:{post_prompt}" + + +def tedlium_process_result(doc, result): + pred = result[0] if len(result) > 0 else "" + + gt = doc["gt"] + source = doc["source"] + task = doc["task"] + + data_dict = {"gt": gt, "pred": pred, "source": source, "task": task} + + return {"wer": data_dict} + + +PUNCS = "!,.?;:" + + +def remove_sp(text, language): + gt = re.sub(r"<\|.*?\|>", " ", text) + gt = re.sub(rf"\s+", r" ", gt) # Replace consecutive spaces in the text with a single space. 
+ gt = re.sub(f" ?([{PUNCS}])", r"\1", gt) + gt = gt.lstrip(" ") + if language == "zh": + gt = re.sub(rf"\s+", r"", gt) + return gt + + +class EvaluationTokenizer(object): + """A generic evaluation-time tokenizer, which leverages built-in tokenizers + in sacreBLEU (https://github.com/mjpost/sacrebleu). It additionally provides + lowercasing, punctuation removal and character tokenization, which are + applied after sacreBLEU tokenization. + + Args: + tokenizer_type (str): the type of sacreBLEU tokenizer to apply. + lowercase (bool): lowercase the text. + punctuation_removal (bool): remove punctuation (based on unicode + category) from text. + character_tokenization (bool): tokenize the text to characters. + """ + + SPACE = chr(32) + SPACE_ESCAPE = chr(9601) + # ALL_TOKENIZER_TYPES = ChoiceEnum(["none", "13a", "intl", "zh", "ja-mecab"]) + + def __init__( + self, + tokenizer_type: str = "13a", + lowercase: bool = False, + punctuation_removal: bool = False, + character_tokenization: bool = False, + ): + from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a + from sacrebleu.tokenizers.tokenizer_char import TokenizerChar + from sacrebleu.tokenizers.tokenizer_intl import TokenizerV14International + from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab + from sacrebleu.tokenizers.tokenizer_none import NoneTokenizer + from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh + + TOKENIZERS = { + "none": NoneTokenizer, + "13a": Tokenizer13a, + "intl": TokenizerV14International, + "zh": TokenizerZh, + "ja-mecab": TokenizerJaMecab, + "char": TokenizerChar, + } + + assert tokenizer_type in TOKENIZERS, f"{tokenizer_type}, {TOKENIZERS}" + self.lowercase = lowercase + self.punctuation_removal = punctuation_removal + self.character_tokenization = character_tokenization + self.tokenizer = TOKENIZERS[tokenizer_type] + # self.tokenizer = tokenizer_none + + @classmethod + def remove_punctuation(cls, sent: str): + """Remove punctuation based on Unicode category.""" + return cls.SPACE.join(t for t in sent.split(cls.SPACE) if not all(unicodedata.category(c)[0] == "P" for c in t)) + + def tokenize(self, sent: str): + tokenized = self.tokenizer()(sent) + + if self.punctuation_removal: + tokenized = self.remove_punctuation(tokenized) + + if self.character_tokenization: + tokenized = self.SPACE.join(list(tokenized.replace(self.SPACE, self.SPACE_ESCAPE))) + + if self.lowercase: + tokenized = tokenized.lower() + + return tokenized + + +def compute_wer(refs, hyps, language): + distance = 0 + ref_length = 0 + tokenizer = EvaluationTokenizer( + tokenizer_type="none", + lowercase=True, + punctuation_removal=True, + character_tokenization=False, + ) + for i in range(len(refs)): + ref = refs[i] + pred = hyps[i] + if language in ["yue"]: + ref = zhconv.convert(ref, "zh-cn") + pred = zhconv.convert(pred, "zh-cn") + if language in ["en"]: + ref = english_normalizer(ref) + pred = english_normalizer(pred) + if language in ["zh"]: + ref = chinese_normalizer(ref) + pred = chinese_normalizer(pred) + else: + ref = basic_normalizer(ref) + pred = basic_normalizer(pred) + ref_items = tokenizer.tokenize(ref).split() + pred_items = tokenizer.tokenize(pred).split() + if language in ["zh", "yue"]: + ref_items = [x for x in "".join(ref_items)] + pred_items = [x for x in "".join(pred_items)] + if i == 0: + print(f"ref: {ref}") + print(f"pred: {pred}") + print(f"ref_items:\n{ref_items}\n{len(ref_items)}\n{ref_items[0]}") + print(f"pred_items:\n{pred_items}\n{len(ref_items)}\n{ref_items[0]}") + distance += 
ed.eval(ref_items, pred_items) + ref_length += len(ref_items) + return distance / ref_length + + +def tedlium_wer(results, args): + refs, hyps = [], [] + for result in results: + lan = result["task"][4:] + gt = result["gt"] + response = result["pred"] + gt = remove_sp(gt, lan) + response = remove_sp(response, lan) + refs.append(gt) + hyps.append(response) + wer = compute_wer(refs, hyps, lan) + return wer * 100 diff --git a/lmms_eval/tasks/vocalsound/_default_template_yaml b/lmms_eval/tasks/vocalsound/_default_template_yaml new file mode 100644 index 00000000..73a813a1 --- /dev/null +++ b/lmms_eval/tasks/vocalsound/_default_template_yaml @@ -0,0 +1,14 @@ +dataset_path: lmms-lab/vocalsound +dataset_kwargs: + token: True +doc_to_target: "answer" +doc_to_visual: !function utils.doc_to_audio +doc_to_text: !function utils.doc_to_text +doc_to_choice: !function utils.doc_to_choice +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "Classify the human vocal sound to VocalSound in English: " + +metadata: + version: 0.0 diff --git a/lmms_eval/tasks/vocalsound/utils.py b/lmms_eval/tasks/vocalsound/utils.py new file mode 100644 index 00000000..33f89389 --- /dev/null +++ b/lmms_eval/tasks/vocalsound/utils.py @@ -0,0 +1,81 @@ +import datetime +import json +import os +import random +import re +import sys +import time +from collections import defaultdict +from pathlib import Path + +import numpy as np +import requests +import yaml +from loguru import logger as eval_logger + +from lmms_eval.tasks._task_utils.file_utils import generate_submission_file + + +def doc_to_audio(doc): + return [doc["audio"]] + + +def doc_to_text(doc, lmms_eval_specific_kwargs): + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{post_prompt}" + + +classes = ["Laughter", "Sniff", "Throat", "Cough", "Sigh", "Sneeze"] + + +def doc_to_choice(doc): + return ["Laughter", "Sniff", "Throat clearing", "Cough", "Sigh", "Sneeze"] + + +def vocalsound_process_results(doc, result): + response = result[0].strip() + gt_ans = doc["answer"] + pred = get_answer(response) + score = 1.0 if pred == gt_ans else 0.0 + return {"accuracy": {"overall": score, "age": doc["age_group"], "spk_id": doc["spk_id"]}} + + +def vocalsound_aggregate_results(results): + total_correct = 0 + group_totals = defaultdict(int) + group_correct = defaultdict(int) + + for result in results: + accuracy = result["overall"] + total_correct += accuracy + + # Gender grouping + if result["spk_id"][0] == "f": + group_totals["female"] += 1 + group_correct["female"] += accuracy + else: + group_totals["male"] += 1 + group_correct["male"] += accuracy + + # Age grouping + age_group = f"age{str(result['age'])}" + group_totals[age_group] += 1 + group_correct[age_group] += accuracy + + return { + "overall_accuracy": total_correct / len(results), + "categorical_accuracy": { + "male_accuracy": round(group_correct["male"] / group_totals.get("male", 1), 5), # Avoid division by zero + "female_accuracy": round(group_correct["female"] / group_totals.get("female", 1), 5), + "age_18_25_accuracy": round(group_correct["age1"] / group_totals.get("age1", 1), 5), + "age_26_48_accuracy": round(group_correct["age2"] / group_totals.get("age2", 1), 5), + "age_49_80_accuracy": round(group_correct["age3"] / group_totals.get("age3", 1), 5), + }, + } + + +def get_answer(response): + for temp in classes: + if temp.lower() in response.lower(): + return temp if temp != "Throat" else "Throat clearing" diff --git 
a/lmms_eval/tasks/vocalsound/vocalsound_test.yaml b/lmms_eval/tasks/vocalsound/vocalsound_test.yaml new file mode 100644 index 00000000..6e2f24a5 --- /dev/null +++ b/lmms_eval/tasks/vocalsound/vocalsound_test.yaml @@ -0,0 +1,13 @@ +task: "vocalsound_test" +test_split: test +metric_list: + - metric: accuracy + aggregation: !function utils.vocalsound_aggregate_results + higher_is_better: true + # - metric: submission + # aggregation: !function utils.vocalsound_aggregate_results_for_submission + # higher_is_better: true + +process_results: !function utils.vocalsound_process_results + +include: _default_template_yaml diff --git a/lmms_eval/tasks/vocalsound/vocalsound_val.yaml b/lmms_eval/tasks/vocalsound/vocalsound_val.yaml new file mode 100644 index 00000000..22e83b9f --- /dev/null +++ b/lmms_eval/tasks/vocalsound/vocalsound_val.yaml @@ -0,0 +1,13 @@ +task: "vocalsound_val" +test_split: val +metric_list: + - metric: accuracy + aggregation: !function utils.vocalsound_aggregate_results + higher_is_better: true + # - metric: submission + # aggregation: !function utils.vocalsound_aggregate_results_for_submission + # higher_is_better: true + +process_results: !function utils.vocalsound_process_results + +include: _default_template_yaml diff --git a/lmms_eval/tasks/wavcaps/utils.py b/lmms_eval/tasks/wavcaps/utils.py new file mode 100644 index 00000000..09a4333f --- /dev/null +++ b/lmms_eval/tasks/wavcaps/utils.py @@ -0,0 +1,134 @@ +import os +import re +import time +from pathlib import Path + +import requests +import yaml +from loguru import logger as eval_logger + + +def wavcaps_doc_to_audio(doc): + return [doc["context"]] + + +def wavcaps_doc_to_text(doc, lmms_eval_specific_kwargs): + question = doc["instruction"] + pre_prompt = lmms_eval_specific_kwargs["pre_prompt"] + post_prompt = lmms_eval_specific_kwargs["post_prompt"] + return f"{pre_prompt}{question}{post_prompt}" + + +# functions for the clotho_asqa_v2 task, need to be tested later + +with open(Path(__file__).parent / "wavcaps.yaml", "r") as f: + raw_data = f.readlines() + safe_data = [] + for i, line in enumerate(raw_data): + # remove function definition since yaml load cannot handle it + if "!function" not in line: + safe_data.append(line) + + config = yaml.safe_load("".join(safe_data)) + + +NUM_SECONDS_TO_SLEEP = 5 +GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"] +API_TYPE = os.getenv("API_TYPE", "azure") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } + +eval_prompt = """ + [Question] + {question} + + [Reference Answer] + {ground_truth} + + [Model Answer] + {model_response} + + [Task] + Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details. + Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance. + Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference. 
+    Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
+    Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
+    Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
+    Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
+    Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.
+
+    Your response should be formatted as follows:
+    Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
+    Rating: (int)"""
+
+
+# gpt-4
+def get_eval(max_tokens: int, content: str):
+    global headers
+
+    messages = [
+        {"role": "user", "content": content},
+    ]
+
+    payload = {"model": GPT_EVAL_MODEL_NAME, "messages": messages, "temperature": 0, "max_tokens": max_tokens, "n": 1}
+
+    for attempt in range(5):
+        try:
+            response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
+            response.raise_for_status()
+            response_data = response.json()
+
+            content = response_data["choices"][0]["message"]["content"].strip()
+            if content != "":
+                return content, response_data["model"]
+            break  # successful but empty response; do not retry
+
+        except Exception as e:
+            eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}")
+            if attempt < 4:  # retries remain, so sleep and continue to the next attempt
+                time.sleep(NUM_SECONDS_TO_SLEEP)
+            else:  # this was the last attempt, so log and return empty
+                eval_logger.error(f"All 5 attempts failed. 
Last error message: {e}") + return "", "" + return "", "" + + +def wavcaps_process_results(doc, results): + pred = results[0] + ground_truth_str = doc["answer"] + content = eval_prompt.format(model_response=pred, ground_truth=ground_truth_str, question=doc["instruction"]) + eval_answer, model_name = get_eval(max_tokens=1024, content=content) + return { + "gpt_eval": {"eval_answer": eval_answer, "model_name": model_name}, + } + + +def wavcaps_aggregate_results(results): + score = 0 + for result in results: + eval_answer = result["eval_answer"] + + try: + match = re.search(r"Rating:\s*([0-5])\s*$", eval_answer) + eval_score = match.group(1) if match else 0 + eval_score = float(eval_score) + except Exception as e: + eval_logger.error(f"Error parsing eval_score: {e}") + eval_score = 0.0 + score += eval_score + + return score / len(results) diff --git a/lmms_eval/tasks/wavcaps/wavcaps.yaml b/lmms_eval/tasks/wavcaps/wavcaps.yaml new file mode 100644 index 00000000..c142e2bb --- /dev/null +++ b/lmms_eval/tasks/wavcaps/wavcaps.yaml @@ -0,0 +1,27 @@ +dataset_path: AudioLLMs/wavcaps_test +dataset_kwargs: + token: True +task : "wavcaps" +test_split: test +output_type: generate_until +doc_to_visual: !function utils.wavcaps_doc_to_audio +doc_to_text: !function utils.wavcaps_doc_to_text +doc_to_target: "answer" +lmms_eval_specific_kwargs: + default: + pre_prompt: "" + post_prompt: "" +generation_kwargs: + max_new_tokens: 256 + temperature: 0 + top_p: 1.0 + num_beams: 1 + do_sample: false +process_results: !function utils.wavcaps_process_results +metric_list: + - metric: gpt_eval + aggregation: !function utils.wavcaps_aggregate_results + higher_is_better: true +metadata: + gpt_eval_model_name: gpt-4o + version: 0.0 diff --git a/lmms_eval/utils.py b/lmms_eval/utils.py index 1df086ae..1604f9a1 100755 --- a/lmms_eval/utils.py +++ b/lmms_eval/utils.py @@ -641,6 +641,7 @@ def load_yaml_config(yaml_path=None, yaml_config=None, yaml_dir=None, mode="full yaml_dir = os.path.dirname(yaml_path) assert yaml_dir is not None + assert yaml_config is not None if "include" in yaml_config: include_path = yaml_config["include"] diff --git a/pyproject.toml b/pyproject.toml index bccee680..0ae7a953 100755 --- a/pyproject.toml +++ b/pyproject.toml @@ -74,6 +74,13 @@ dependencies = [ ] [project.optional-dependencies] +audio = [ + "more-itertools", + "editdistance", + "zhconv", + "librosa", + "soundfile" +] metrics = [ "pywsd", "spacy", diff --git a/test_parse.py b/test_parse.py new file mode 100644 index 00000000..ae69e337 --- /dev/null +++ b/test_parse.py @@ -0,0 +1,20 @@ +from lmms_eval.filters.extraction import MultiChoiceRegexFilter + + +def parse_multi_choice_answer(answer): + # Example responses and documents + model_responses = [["The answer is (B)", "I believe it is (A)", "(C) seems correct"], ["Answer is: B!", "Answer: B", "Answer: B"]] # Model response set 1 # Model response set 2 + + documents = [{"choices": ["A. Apple", "B. Banana", "C. Cherry"]}, {"choices": ["A. Alpha", "B. Beta", "C. 
Gamma"]}] # Multiple choice options for question 1 # Multiple choice options for question 2 + + # Instantiate the filter + multi_choice_filter = MultiChoiceRegexFilter(regex_pattern=r"\(([A-D])\)", group_select=0, ignore_case=False, ignore_punctuation=True) + + filtered_responses = multi_choice_filter.apply(model_responses, documents) + + # Print the filtered answers + for i, filtered in enumerate(filtered_responses): + print(f"Question {i+1} filtered responses: {filtered}") + + +parse_multi_choice_answer("a") diff --git a/tools/make_audio_hf_dataset.ipynb b/tools/make_audio_hf_dataset.ipynb new file mode 100644 index 00000000..ad939f87 --- /dev/null +++ b/tools/make_audio_hf_dataset.ipynb @@ -0,0 +1,139 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import datasets\n", + "import json\n", + "\n", + "# Define the dataset features (audio, text, and source)\n", + "# change the data structure according to your needs, only important changes here is using datasets.Audio to load audio file\n", + "# And provide audio path in the data construction\n", + "# once loaded through datasets.Audio, we can access audio data, in the form of np.array(float32) using doc[\"audio\"][\"array\"]\n", + "features = datasets.Features(\n", + " {\n", + " \"audio\": datasets.Audio(sampling_rate=16000),\n", + " \"prompt\": datasets.Value(\"string\"),\n", + " \"gt\": datasets.Value(\"string\"),\n", + " \"source\": datasets.Value(\"string\"),\n", + " \"task\": datasets.Value(\"string\"),\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# loading data into dict form\n", + "def load_audio_data(data_path):\n", + " with open(data_path, 'r') as f:\n", + " data_lines = f.readlines()\n", + "\n", + " audio_list = []\n", + " prompt_list = []\n", + " gt_list = []\n", + " source_list = []\n", + " task_list = []\n", + "\n", + " for line in data_lines:\n", + " json_data = json.loads(line.strip())\n", + "\n", + " audio_list.append(json_data['audio']) # Path to the actual audio file\n", + " prompt_list.append(\"<|audio_bos|><|AUDIO|><|audio_eos|>\" + json_data['prompt'])\n", + " gt_list.append(json_data['gt'])\n", + " source_list.append(json_data['source'])\n", + " task_list.append(json_data['task'])\n", + "\n", + " # Return a dictionary where keys are features and values are lists of data\n", + " return {\n", + " 'audio': audio_list,\n", + " 'prompt': prompt_list,\n", + " 'gt': gt_list,\n", + " 'source': source_list,\n", + " 'task': task_list\n", + " }" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# load data according to different task\n", + "def load_audio_data_task(data_path, task):\n", + " with open(data_path, 'r') as f:\n", + " data_lines = f.readlines()\n", + "\n", + " audio_list = []\n", + " prompt_list = []\n", + " gt_list = []\n", + " source_list = []\n", + " task_list = []\n", + "\n", + " for line in data_lines:\n", + " json_data = json.loads(line.strip())\n", + " if json_data['source'] == task: \n", + "\n", + " \n", + " audio_list.append(json_data['audio']) # Path to the actual audio file\n", + " prompt_list.append(\"<|audio_bos|><|AUDIO|><|audio_eos|>\" + json_data['prompt'])\n", + " gt_list.append(json_data['gt'])\n", + " source_list.append(json_data['source'])\n", + " task_list.append(json_data['task'])\n", + "\n", + " # Return a dictionary where keys are features and values are lists of 
data\n", + " return {\n", + " 'audio': audio_list,\n", + " 'prompt': prompt_list,\n", + " 'gt': gt_list,\n", + " 'source': source_list,\n", + " 'task': task_list\n", + " }\n", + "\n", + "\n", + "tasks = ['librispeech_test_other', 'librispeech_dev_other', 'librispeech_test_clean', 'librispeech_dev_clean']\n", + "\n", + "# description_root\n", + "data_description_path = \"./librispeech_eval.jsonl\"\n", + "\n", + "data_dict = {}\n", + "for task in tasks:\n", + "\n", + " # Load the dataset into a Hugging Face Dataset object\n", + " data = load_audio_data_task(data_description_path, task)\n", + "\n", + " # Create a Dataset from the data and features\n", + " dataset = datasets.Dataset.from_dict(data, features=features)\n", + "\n", + " # Verify the dataset structure\n", + " print(dataset)\n", + "\n", + " data_dict[task] = dataset\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = datasets.DatasetDict(data_dict)\n", + "data.push_to_hub(\"Alarak/librispeech\")" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}