Invited Talk: Data quality in the era of Foundational Models

  • Saúl Marcelo Calderón Ramírez, Instituto Tecnológico de Costa Rica

Abstract

Deep learning models usually need extensive amounts of data, and these data have to be labeled, which becomes a concern in real-world applications. Labeling a dataset is costly in terms of time, money, and resources. Foundational models are becoming a strong trend across application fields, from natural language processing to image analysis. Commonly, foundational models are pre-trained on very large, often multi-modal datasets (text, images, audio, etc.) in a self-supervised fashion. Using these models in target domains and tasks reduces the need to label very large target datasets, especially when combined with scarce-label training regimes: semi-supervised, self-supervised, few-shot learning, etc. However, even in these settings, the quality of the data used to train and evaluate the model remains important. In this talk, we address different data quality attributes for both training and evaluation that are still relevant for systems built upon foundational models.
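As a minimal illustrative sketch of the scarce-label regime mentioned in the abstract (not material from the talk itself): a lightweight linear probe is trained on embeddings assumed to come from a frozen, self-supervised pre-trained encoder, so only a small labeled set is needed. Synthetic random vectors stand in for real embeddings so the snippet runs on its own; all sizes, dimensions, and variable names are placeholder assumptions.

```python
# Minimal sketch: linear probe on frozen foundation-model embeddings
# using only a small labeled set. Random vectors stand in for the
# embeddings a real frozen encoder would produce (assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder sizes: 100 labeled examples, 500 evaluation examples,
# 512-dimensional embeddings, 5 target classes.
n_labeled, n_test, dim, n_classes = 100, 500, 512, 5
X_train = rng.normal(size=(n_labeled, dim))    # few labeled embeddings
y_train = rng.integers(0, n_classes, size=n_labeled)
X_test = rng.normal(size=(n_test, dim))        # held-out evaluation embeddings
y_test = rng.integers(0, n_classes, size=n_test)

# Only this small classifier is trained; the encoder stays frozen,
# which is what keeps the labeling cost low.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

print("linear-probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

Even with such a lightweight probe, the held-out evaluation labels still have to be trustworthy, which is precisely the data-quality concern the abstract raises for both training and evaluation sets.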

Published
2023-10-12
How to Cite
Calderón Ramírez, S. (2023). Conferencia Invitada: Data quality in the era of Foundational Models. Proceedings of JAIIO, 9(12). Retrieved from https://ojs.sadio.org.ar/index.php/JAIIO/article/view/783
Section
SAIV - Simposio Argentino de Imágenes y Visión