A Python toolkit for creating a personalized AI clone of yourself using WhatsApp chat data. This tool analyzes your chat patterns, writing style, and personality traits to create a fine-tuned language model that can interact just like you.
Advanced Style Analysis:
- Message length patterns
- Emoji usage frequency
- Slang and abbreviation patterns
- Capitalization habits
- Punctuation patterns
- Common phrases and expressions
- Response patterns
Multiple LLM Support:
- Llama-2
- Llama-3
- Mistral
- Falcon
- GPT
- Qwen
Optimization Features:
- Parameter-Efficient Fine-Tuning (LoRA)
- 8-bit quantization support
- Gradient accumulation
- Mixed precision training
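For reference, LoRA fine-tuning with the peft library usually boils down to something like the sketch below; the rank, alpha, and target modules here are illustrative assumptions, not necessarily what finetune.py uses.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values only -- finetune.py may choose different ranks/targets.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all weights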
Install the dependencies:
pip install pandas transformers torch datasets tensorboard peft
Your WhatsApp chat export should follow this format:
[MM/DD/YY, HH:MM:SS AM/PM] Author: Message text here
Example:
[01/13/24, 12:24:48 AM] Alex: Have you finished that project for work?
[01/13/24, 12:52:48 AM] Jamie: I love those! Send me the title later.
Supported timestamp formats:
- Standard WhatsApp format: [MM/DD/YY, HH:MM:SS AM/PM]
- Without seconds: [MM/DD/YY, HH:MM AM/PM]
- International format: [DD/MM/YY, HH:MM:SS AM/PM]
- ISO-like format: [YYYY-MM-DD, HH:MM:SS AM/PM]
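A minimal parsing sketch for the standard format is shown below; the regex and field names are illustrative and not necessarily what parser.py uses internally.

import re

# Matches e.g. "[01/13/24, 12:24:48 AM] Alex: Have you finished that project for work?"
LINE_RE = re.compile(
    r"^\[(?P<date>[\d/.-]+), (?P<time>\d{1,2}:\d{2}(?::\d{2})? ?[AP]M)\] "
    r"(?P<author>[^:]+): (?P<message>.*)$"
)

def parse_line(line):
    match = LINE_RE.match(line.strip())
    if not match:
        return None  # continuation line or unrecognized system message
    return match.groupdict()

print(parse_line("[01/13/24, 12:24:48 AM] Alex: Have you finished that project for work?"))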
To export a chat from WhatsApp:
- Open the WhatsApp chat
- Tap ⋮ (three dots) > More > Export chat
- Choose 'Without media'
- Save the .txt file
✅ Include:
- Regular text messages
- Emoji messages
- URLs (they will be cleaned automatically)
- Normal conversation text
❌ Exclude:
- Media messages (images, videos, documents)
- System messages
- Group settings changes
- Contact cards
- Location shares
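The filtering this implies can be sketched as below; the marker strings are assumptions, since they vary with WhatsApp version and language.

# Illustrative filter; actual markers depend on WhatsApp version and locale.
SKIP_MARKERS = (
    "<Media omitted>",                                # media placeholder
    "Messages and calls are end-to-end encrypted",    # system notice
    "created group",                                  # group settings changes
    "changed the subject",
    "(file attached)",                                # documents / contact cards
    "location: https://maps.google.com",              # location shares
)

def keep_message(text):
    return not any(marker in text for marker in SKIP_MARKERS)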
Basic usage:
python parser.py chat.txt "YourName" "OtherPerson" "YourName"
Advanced usage:
python parser.py chat.txt "YourName" "OtherPerson" "YourName" \
--llm_format mistral \
--context_length 5
This generates:
- output_YYYYMMDD_HHMMSS.csv: Original conversation pairs
- formatted_YourName.jsonl: Training data
- style_metrics_YourName.json: Style analysis
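The exact JSONL schema depends on the --llm_format you pick; purely as an illustration, a record could resemble the following (field names are assumptions, so check your generated file):

{"prompt": "Have you finished that project for work?", "response": "I love those! Send me the title later."}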
After parsing your chat data, you can fine-tune a model:
python finetune.py \
--data_path formatted_YourName.jsonl \
--model_name "mistralai/Mistral-7B-v0.1" \
--output_dir "./my_chatbot" \
--style_metrics_path style_metrics_YourName.json \
--use_peft \
--use_8bit
For Personal Use (Lower Resources):
- Llama3-8B (16GB VRAM)
- Qwen-7B (16GB VRAM)
- Mistral 7B (16GB VRAM)
- Llama-2 7B (16GB VRAM)
For Better Quality (Higher Resources):
- Llama3-70B (80GB VRAM)
- Qwen-14B (28GB VRAM)
- Llama-2 13B (24GB VRAM)
- Falcon 40B (Multiple GPUs)
The tool performs a comprehensive analysis of your chat style through multiple layers:
Length Patterns
- Average message length
- Message length distribution
- Typical response lengths
Emoji Usage
- Detection and frequency analysis
- Favorite emoji patterns
- Contextual emoji usage
Slang and Abbreviations
- Common internet slang (e.g., 'lol', 'omg', 'idk')
- Personal abbreviations
- Informal language patterns
Capitalization
- Sentence start patterns
- Stylistic caps usage
- Name/proper noun capitalization
Punctuation
- End-of-sentence patterns
- Multiple punctuation usage (!!!, ???)
- Informal punctuation style
Common Phrases
- Frequent expressions
- Conversation starters/enders
- Personal catchphrases
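As a rough illustration of what the length, emoji, and punctuation layers involve, here is a minimal sketch (assuming the third-party emoji package; the real analyzer is more thorough and may differ):

from collections import Counter
import emoji  # assumption: pip install emoji

def quick_style_stats(messages):
    # Tiny subset of the analysis: message length, emoji frequency, punctuation habits.
    lengths = [len(m.split()) for m in messages]
    emoji_counts = Counter(ch for m in messages for ch in m if ch in emoji.EMOJI_DATA)
    exclamations = sum(m.count("!") for m in messages)
    return {
        "avg_words_per_message": sum(lengths) / max(len(lengths), 1),
        "top_emojis": emoji_counts.most_common(5),
        "exclamations_per_message": exclamations / max(len(messages), 1),
    }

print(quick_style_stats(["Have you finished that project for work?", "I love those!! 😂"]))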
8-bit Quantization
python finetune.py \
  --data_path formatted_YourName.jsonl \
  --use_8bit \
  --batch_size 2
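Under the hood, 8-bit loading with transformers is typically configured along these lines (a sketch using the bitsandbytes integration; not necessarily how finetune.py wires it up):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package and a CUDA GPU.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quant_config,
    device_map="auto",
)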
Gradient Accumulation
python finetune.py \
  --gradient_accumulation_steps 8 \
  --batch_size 1
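With these settings the effective batch size is batch_size × gradient_accumulation_steps = 1 × 8 = 8: gradients from eight micro-batches are accumulated before each optimizer step, so you keep the statistical benefits of a larger batch while only holding one sample's activations in memory at a time.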
Llama-3 specific options:
- BF16 precision training
- Flash Attention 2
- 8192 token context
- Example:
python finetune.py \
  --model_name "meta-llama/Llama-3-8b" \
  --use_flash_attention \
  --bf16
Qwen specific options:
- Custom chat template
- Flash Attention
- Example:
python finetune.py \
  --model_name "Qwen/Qwen-7B" \
  --use_flash_attention
Memory Errors
- Reduce batch size
- Enable 8-bit training
- Use gradient accumulation
python finetune.py \
  --batch_size 1 \
  --use_8bit \
  --gradient_accumulation_steps 8
Training Issues
- Adjust learning rate
- Increase training data
- Try different models
python finetune.py \
  --learning_rate 1e-5 \
  --num_epochs 5
Parsing Issues
- Check date format
- Verify chat export
- Clean input data
GPU Memory Requirements
- 7B models: 16GB VRAM
- 13B models: 24GB VRAM
- 70B models: 80GB VRAM
- Falcon 40B: multiple GPUs
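As a rough rule of thumb, the weights alone take about 2 bytes per parameter in FP16/BF16 (e.g., 7B parameters ≈ 14 GB) or about 1 byte in 8-bit, with the rest of the budget going to activations, gradients, and optimizer state; LoRA and 8-bit loading are what make the single-GPU figures above practical.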
Recommended Setup
- NVIDIA GPU with CUDA support
- 32GB+ System RAM
- SSD for faster data loading
Tips for best results:
- Use at least 1000 messages
- Include diverse conversations
- Clean irrelevant messages
- Maintain conversation context
- Start with smaller models
- Use default hyperparameters
- Monitor with TensorBoard
- Save checkpoints regularly
- Keep emoji patterns
- Maintain punctuation style
- Preserve message length patterns
- Retain personal phrases
Known limitations:
Data Format
- WhatsApp format changes
- Regional date formats
- Media message handling
Resource Requirements
- High VRAM usage
- Long training times
- System RAM needs
Model Access
- Llama model approval
- Usage restrictions
- License requirements
Quality Factors
- Short conversation impact
- Mixed language handling
- Group chat dynamics
Contributions are welcome. Feel free to:
- Submit issues
- Propose features
- Share improvements
- Report bugs
This project is licensed under the MIT License. See LICENSE file for details.
Acknowledgments:
- HuggingFace Transformers
- Meta AI (Llama models)
- WhatsApp chat format documentation