This repository contains the code for building The Stack v2 dataset, as well as the extra sources used to make StarCoder2data, the training corpus of the StarCoder2 family of models.
This repository is a follow-up to the work in bigcode-dataset, which was used for The Stack v1 and StarCoderData.