Please use this identifier to cite or link to this item: http://repositorio.ufla.br/jspui/handle/1/56065
Title: Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça
Other Titles: Data labeling for the task of named entity recognition in the domain of cachaça beverage
Authors: Pereira, Denilson Alves
Merschmann, Luiz Henrique de Campos
Brito, Mozar José de
Dalip, Daniel Hasan
Keywords: Reconhecimento de entidades nomeadas
Cachaça
Aprendizagem de máquina
Processamento de Linguagem Natural (PLN)
Named Entity Recognition (NER)
Machine learning
Natural Language Processing (NLP)
Issue Date: 27-Feb-2023
Publisher: Universidade Federal de Lavras
Citation: SILVA, P. de S. Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça. 2022. 111 p. Dissertação (Mestrado em Ciência da Computação)–Universidade Federal de Lavras, Lavras, 2022.
Abstract: Named Entity Recognition (NER) is the task of identifying tokens in free text and classifying them according to a set of predefined categories such as person name, organization, and location. Datasets labeled for this task are essential for training supervised machine learning models. However, while many labeled datasets exist for English text, they remain scarce for Portuguese. This work therefore contributes the creation and evaluation of a manually labeled dataset for the NER task, with texts written in Brazilian Portuguese, in the specific domain of the distilled beverage cachaça. Cachaça is a popular beverage in Brazil and of great economic importance. The dataset proposed in this work is the first in Portuguese in the field of beverages and may be useful for other types of beverages with entity categories similar to those of cachaça, such as wine and beer. This work describes the process of textual data collection and extraction, the creation and labeling of the NER dataset, and its experimental evaluation. The result is a dataset called cachacaNER, which contains more than 180,000 tokens labeled in 17 categories of named entities, some specific to the cachaça context and some generic. According to the Fleiss’ Kappa metric, the agreement obtained between the different labelers (0.857) was almost perfect, which attests to the reliability of the dataset's manual labeling. The size of the dataset and the results of its experimental evaluation are comparable to those of other datasets in Portuguese, although this dataset has a greater number of named entity categories. In addition to manual labeling, an automatic entity labeling technique was also evaluated on the cachacaNER data, with the aim of enabling faster labeling with less manual work. The NER model trained with automatically labeled data performed well (F1 of 0.808) relative to the same model trained with manually labeled data (F1 of 0.899).
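For readers unfamiliar with the agreement metric cited in the abstract, the following is a minimal sketch of how Fleiss' Kappa can be computed from a table of per-item label counts. It is not the procedure used in the dissertation, which this record does not describe; the function name `fleiss_kappa` and the toy counts are assumptions made purely for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for a table where counts[i][j] is the number of
    annotators who assigned item i (e.g. a token) to category j.

    Illustrative sketch only; assumes every item was labeled by the
    same number of annotators (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 tokens, 3 annotators, 2 categories.
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy_counts), 3))
```

On the commonly used Landis and Koch scale, values above 0.80 are described as "almost perfect" agreement, which is the range the reported 0.857 falls in.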
URI: http://repositorio.ufla.br/jspui/handle/1/56065
Appears in Collections: Ciência da Computação - Mestrado (Dissertações)