Please use this identifier to cite or link to this item:
http://repositorio.ufla.br/jspui/handle/1/56065
Title: | Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça |
Other Titles: | Data labeling for the task of named entity recognition in the domain of cachaça beverage |
Authors: | Pereira, Denilson Alves; Merschmann, Luiz Henrique de Campos; Brito, Mozar José de; Dalip, Daniel Hasan |
Keywords: | Reconhecimento de entidades nomeadas; Cachaça; Aprendizagem de máquina; Processamento de Linguagem Natural (PLN); Processamento de Linguagem Natural; Named Entity Recognition (NER); Machine learning; Natural Language Processing (NLP) |
Issue Date: | 27-Feb-2023 |
Publisher: | Universidade Federal de Lavras |
Citation: | SILVA, P. de S. Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça. 2022. 111 p. Dissertação (Mestrado em Ciência da Computação)–Universidade Federal de Lavras, Lavras, 2022. |
Abstract: | Named Entity Recognition (NER) is the task of identifying tokens in free text and classifying them according to a set of predefined categories such as person name, organization and location. Datasets labeled for this task are essential for training supervised machine learning models. However, although many labeled datasets exist for English texts, they are still scarce for Portuguese. This work therefore contributes the creation and evaluation of a manually labeled dataset for the NER task, with texts written in Brazilian Portuguese, in the specific domain of the distilled beverage cachaça. Cachaça is a popular beverage in Brazil and of great economic importance. The dataset proposed in this work is the first in Portuguese in the field of beverages and may be useful for other types of beverages whose entity categories are similar to those of cachaça, such as wine and beer. This work describes the process of textual data collection and extraction, the creation and labeling of the NER dataset, and its experimental evaluation. The result is a dataset called cachacaNER, which contains more than 180,000 tokens labeled in 17 categories of named entities, covering both categories specific to the cachaça context and generic categories. According to the Fleiss' Kappa metric, the agreement among the different labelers (0.857) was almost perfect, indicating the reliability of the dataset with respect to manual labeling. The size of the dataset, as well as the result of its experimental evaluation, is comparable to other datasets in Portuguese, although the one in this work has a greater number of named entity categories. In addition to manual labeling, an automatic entity labeling technique was also evaluated on the cachacaNER data, with the aim of enabling faster labeling with less manual work.
The NER model trained with automatically labeled data performed well (F1 of 0.808) compared with the same model trained with manually labeled data (F1 of 0.899). |
URI: | http://repositorio.ufla.br/jspui/handle/1/56065 |
Appears in Collections: | Ciência da Computação - Mestrado (Dissertações) |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
DISSERTAÇÃO_Rotulação de dados para a tarefa de reconhecimento de entidades nomeadas no domínio da bebida cachaça.pdf | | 5.58 MB | Adobe PDF |