Submitted by Huu Nguyen
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Ontocord.AI