Speech recognition is an important component of Natural Language Processing. It involves training a model on audio files and their corresponding transcriptions. In this task we combine few-shot learning and transfer learning to train a model on our Urdu dataset. To do this, we fine-tune the pretrained XLSR-Wav2Vec2 model available on HuggingFace, which has learned cross-lingual speech representations from audio in more than 50 languages. After fine-tuning, the model will be tested on the provided dataset.
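
As a rough illustration of the setup, fine-tuning Wav2Vec2 for transcription is commonly done with a CTC head over a character-level vocabulary built from the transcriptions. The sketch below is a minimal version of that idea, not the exact pipeline used here. The sample sentences, the helper names `build_vocab` and `load_pretrained_for_ctc`, and the checkpoint id `facebook/wav2vec2-large-xlsr-53` are assumptions made for illustration; the real transcriptions come from the Urdu dataset.

```python
# Hypothetical sample transcriptions, used only to illustrate the vocabulary step.
SAMPLE_TRANSCRIPTIONS = ["یہ ایک مثال ہے", "آپ کا شکریہ"]


def build_vocab(transcriptions):
    """Map every character seen in the transcriptions to an integer id."""
    chars = sorted(set("".join(transcriptions)))
    vocab = {ch: i for i, ch in enumerate(chars)}
    # Convention from the Hugging Face examples: "|" stands in for the space
    # so that word boundaries stay visible in the CTC output.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Special tokens: unknown character and padding (also used as the CTC blank).
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


def load_pretrained_for_ctc(vocab, checkpoint="facebook/wav2vec2-large-xlsr-53"):
    """Load the pretrained encoder with a new CTC head sized to `vocab`.

    Assumes the `transformers` and `torch` packages are installed and that the
    checkpoint id above is the intended XLSR-Wav2Vec2 model. Calling this
    downloads the model weights.
    """
    from transformers import Wav2Vec2ForCTC  # imported lazily: heavy dependency

    model = Wav2Vec2ForCTC.from_pretrained(
        checkpoint,
        ctc_loss_reduction="mean",
        pad_token_id=vocab["[PAD]"],
        vocab_size=len(vocab),
    )
    # Transfer learning: keep the convolutional feature encoder fixed and
    # train only the transformer layers and the new output head.
    model.freeze_feature_encoder()
    return model


if __name__ == "__main__":
    vocab = build_vocab(SAMPLE_TRANSCRIPTIONS)
    print(len(vocab), "tokens in the vocabulary")
```

Freezing the feature encoder is one common choice when labelled data is scarce, since the low-level acoustic features learned during pretraining tend to transfer across languages; whether it helps on this dataset would need to be checked empirically.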