This is the repository for the SemEval-2024 Task 2 paper "Evaluating Clinical Inference Capabilities of Large Language Models". The work is largely a survey: it evaluates how well LLMs perform in the clinical domain by dissecting their results on the dev set. We also classified some interesting examples, both those related to the medical domain (such as medical abbreviations) and those involving general NLU (such as numerical expression evaluation). All details can be found in our paper.