Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format

Julian Risch, Philipp Schmidt, and Ralf Krestel

Anthology ID: 2021.woah-1.17
Volume: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Month: August
Year: 2021
Address: Online
Venue: WOAH
Publisher: Association for Computational Linguistics
Pages: 157–163
DOI: 10.18653/v1/2021.woah-1.17
Bibkey: risch-etal-2021-toxic
Code: julian-risch/toxic-comment-collection

Cite (ACL): Julian Risch, Philipp Schmidt, and Ralf Krestel. 2021. Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 157–163, Online. Association for Computational Linguistics.

Cite (Informal): Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in One Unified Format (Risch et al., WOAH 2021)

Abstract

With the rise of research on toxic comment classification, more and more annotated datasets have been released. The wide variety of the task (different languages, different labeling processes and schemes) has led to a large amount of heterogeneous datasets that can be used for training and testing very specific settings. Despite recent efforts to create web pages that provide an overview, most publications still use only a single dataset. They are not stored in one central database, they come in many different data formats and it is difficult to interpret their class labels and how to reuse these labels in other projects. To overcome these issues, we present a collection of more than thirty datasets in the form of a software tool that automatizes downloading and processing of the data and presents them in a unified data format that also offers a mapping of compatible class labels. Another advantage of that tool is that it gives an overview of properties of available datasets, such as different languages, platforms, and class labels to make it easier to select suitable training and test data.
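To make the idea of a unified data format with a mapping of compatible class labels more concrete, here is a minimal Python sketch. It is not the API of the authors' tool: the dataset names (`dataset_a`, `dataset_b`), their source labels, the target labels, and the `UnifiedComment` and `unify` names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical mapping from dataset-specific labels to a shared label set.
# Dataset names and labels are invented for this sketch.
LABEL_MAP = {
    "dataset_a": {"hate": "toxic", "none": "non-toxic"},
    "dataset_b": {"OFF": "toxic", "NOT": "non-toxic"},
}


@dataclass(frozen=True)
class UnifiedComment:
    """One comment in the (assumed) unified format."""
    text: str
    label: str      # label after mapping to the shared label set
    language: str
    source: str     # name of the dataset the comment came from


def unify(source, language, rows):
    """Convert (text, original_label) pairs from one dataset to the unified format."""
    mapping = LABEL_MAP[source]
    return [
        UnifiedComment(text=text, label=mapping[label], language=language, source=source)
        for text, label in rows
    ]


if __name__ == "__main__":
    combined = (
        unify("dataset_a", "en", [("first example", "hate")])
        + unify("dataset_b", "en", [("second example", "NOT")])
    )
    for comment in combined:
        print(comment)
```

Once every dataset is converted this way, records from different sources can be concatenated and filtered by `language`, `source`, or `label`, which is the kind of selection of training and test data the abstract mentions.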