Guided by Surprisal: Active Curriculum Language Modeling over a Hybrid Pre-training Method. Contributions to the BabyLM Challenge
Abstract
This study investigates language modeling on developmentally plausible corpora under low-resource constraints, conducted within the BabyLM shared task with a focus on the Strict-small track. We examine Active Curriculum Language Modeling (ACLM) and its impact on a hybrid model, GPT-BERT. A range of models trained on fewer than 100 million tokens are evaluated on both zero-shot and fine-tuning tasks. The experiments reveal the impact of key architectural and hyperparameter choices, including causal-to-masked objective ratios, tokenizer vocabulary sizes, batch sizes, and sequence-length schedules. Results indicate that ACLM consistently enhances performance, while also revealing the strengths and limitations of this training paradigm. Overall, the findings suggest that ACLM is a promising pre-training strategy for data- and parameter-limited language modeling, producing models that exhibit some human-like generalization patterns while remaining computationally efficient.
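To make the surprisal-guided selection concrete, the following is a minimal, hypothetical sketch of one ACLM step: score candidate examples by how surprising the current model finds them and draw the next training batch from the most surprising ones. The unigram scorer, the function names (`unigram_surprisal`, `select_next_batch`), and the toy data are illustrative assumptions, not the paper's implementation; in the actual setup, surprisal would come from the partially trained GPT-BERT model itself.

```python
import math
from collections import Counter

def unigram_surprisal(corpus, candidates):
    """Score each candidate by mean per-token surprisal, -log2 p(token),
    under an add-one-smoothed unigram model fit on `corpus`.
    (Toy stand-in for the training model's own surprisal.)"""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens

    def surprisal(tok):
        p = (counts[tok] + 1) / (total + vocab)
        return -math.log2(p)

    return [sum(surprisal(t) for t in sent.split()) / max(len(sent.split()), 1)
            for sent in candidates]

def select_next_batch(corpus, candidates, batch_size):
    """One active-curriculum step: pick the `batch_size` candidates the
    current model finds most surprising (highest mean surprisal)."""
    scores = unigram_surprisal(corpus, candidates)
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [sent for _, sent in ranked[:batch_size]]

# Toy usage: unfamiliar sentences score higher and are selected first.
seen = ["the cat sat on the mat", "the dog sat on the rug"]
pool = ["the cat sat", "quantum decoherence limits qubit fidelity",
        "the dog ran", "perplexity tracks surprisal closely"]
print(select_next_batch(seen, pool, batch_size=2))
```

In a full training loop, the scoring model and the trained model coincide, so the curriculum adapts as training progresses: examples that were once surprising drop out of the high-priority pool once the model has learned them.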