CachacaNER: a dataset for named entity recognition in texts about the cachaça beverage

Silva, Priscilla; Franco, Arthur; Santos, Thiago; Brito, Mozar; Pereira, Denilson

Please use this identifier to cite or link to this item: http://repositorio.ufla.br/jspui/handle/1/58410

Full metadata record

DC Field	Value	Language
dc.creator	Silva, Priscilla	-
dc.creator	Franco, Arthur	-
dc.creator	Santos, Thiago	-
dc.creator	Brito, Mozar	-
dc.creator	Pereira, Denilson	-
dc.date.accessioned	2023-10-11T17:30:40Z	-
dc.date.available	2023-10-11T17:30:40Z	-
dc.date.issued	2023	-
dc.identifier.citation	SILVA, P. et al. CachacaNER: a dataset for named entity recognition in texts about the cachaça beverage. Language Resources and Evaluation, [S.l.], 2023.	pt_BR
dc.identifier.uri	https://link.springer.com/article/10.1007/s10579-023-09665-0#citeas	pt_BR
dc.identifier.uri	http://repositorio.ufla.br/jspui/handle/1/58410	-
dc.description.abstract	Named Entity Recognition (NER) is the task of identifying and classifying tokens in texts corresponding to a set of pre-defined categories, such as names of people, organizations and locations. Datasets labeled for this task are essential for training supervised machine learning models. Although many labeled datasets exist for English texts, they are scarcer for Portuguese. This work contributes the creation and evaluation of a manually labeled dataset for the NER task, with texts in Brazilian Portuguese, in the specific domain of the beverage called Cachaça. This is a popular drink in Brazil, and of great economic importance. This is the first NER dataset in the beverage domain, and can be useful for other types of beverages with similar entity categories, such as wine and beer. We describe the process of data collection, creation of the dataset and its experimental evaluation. As a result, we created a dataset containing over 180,000 tokens labeled in 17 entity categories. The labeling obtained an agreement coefficient of 0.857 among the labelers, according to the Fleiss’ Kappa metric, which is considered almost perfect agreement. In our experimental evaluation, we obtained a micro-F1 value equal to 0.933 in the test set. The size of the dataset and the result of its experimental evaluation are comparable to those of other datasets in the Portuguese language, even though ours has a greater number of entity categories.	pt_BR
dc.language	en_US	pt_BR
dc.publisher	Springer	pt_BR
dc.rights	restrictAccess	pt_BR
dc.source	Language Resources and Evaluation	pt_BR
dc.subject	Named Entity Recognition (NER)	pt_BR
dc.subject	Dataset	pt_BR
dc.subject	Labeled data	pt_BR
dc.subject	Cachaça	pt_BR
dc.title	CachacaNER: a dataset for named entity recognition in texts about the cachaça beverage	pt_BR
dc.type	Article	pt_BR
Appears in Collections:	DQI - Artigos publicados em periódicos

Files in This Item:

There are no files associated with this item.
