String Kernels Software Package

by Radu Tudor Ionescu and Marius Popescu

The String Kernels 1.0 software package is available for download and is released under the GNU Public License. It contains a Java implementation of the efficient algorithm for computing string kernels presented in [1, 2, 3, 4, 7].

If you use this software (or a modified version of it) in any scientific work, please cite at least one of the corresponding works:

[1] Marius Popescu, Radu Tudor Ionescu. The Story of the Characters, the DNA and the Native Language. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 270-278, 2013.

[2] Radu Tudor Ionescu, Marius Popescu, Aoife Cahill. Can characters reveal your native language? A language-independent approach to native language identification. In Proceedings of EMNLP, pp. 1363-1373, 2014.

[3] Radu Tudor Ionescu, Marius Popescu. Knowledge Transfer between Computer Vision and Text Mining: Similarity-based Learning Approaches. Springer, 2016.

[4] Radu Tudor Ionescu, Marius Popescu, Aoife Cahill. String Kernels for Native Language Identification: Insights from Behind the Curtains. Computational Linguistics, vol. 42, no. 3, pp. 491-525, 2016.

[5] Radu Tudor Ionescu, Marius Popescu. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. In Proceedings of the VarDial Workshop, pp. 135-144, 2016.

[6] Radu Tudor Ionescu, Andrei M. Butnaru. Learning to Identify Arabic and German Dialects using Multiple Kernels. In Proceedings of the VarDial Workshop of EACL, pp. 200-209, 2017.

[7] Marius Popescu, Cristian Grozea, Radu Tudor Ionescu. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages. In Proceedings of KES, pp. 1755-1763, 2017.

[8] Radu Tudor Ionescu, Marius Popescu. Can string kernels pass the test of time in Native Language Identification? In Proceedings of the BEA-12 Workshop of EMNLP, pp. 224-234, 2017.

[9] Mădălina Cozma, Andrei M. Butnaru, Radu Tudor Ionescu. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pp. 503-509, 2018.

[10] Andrei M. Butnaru, Radu Tudor Ionescu. UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row. In Proceedings of the VarDial Workshop, pp. 77-87, 2018.

[11] Radu Tudor Ionescu, Andrei M. Butnaru. Improving the results of string kernels in sentiment analysis and Arabic dialect identification by adapting them to your test set. In Proceedings of EMNLP, 2018.

[12] Andrei M. Butnaru, Radu Tudor Ionescu. MOROCO: The Moldavian and Romanian dialectal corpus. In Proceedings of ACL, pp. 688-698, 2019.

[13] Mihaela Găman, Sebastian Cojocariu, Radu Tudor Ionescu. UnibucKernel: Geolocating Swiss German Jodels Using Ensemble Learning. In Proceedings of the VarDial Workshop, pp. 84-95, 2021.

[14] Mihaela Găman, Radu Tudor Ionescu. The unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification. International Journal of Intelligent Systems, 2021.

The blended spectrum string kernel and the blended presence bits string kernel were used for native language identification in [1]. The same paper also used a kernel based on Local Rank Distance. The blended intersection string kernel was introduced in [2]. An extensive presentation of string kernels is provided in [3, 4]. As shown in [3, 4], string kernels obtain state-of-the-art native language identification performance for several L2 languages: English, Arabic and Norwegian. In [7], we present HASKER, our efficient algorithm for computing string kernels, and show that it is about four times faster than a suffix trie implementation. More recently, string kernels have also been shown to obtain state-of-the-art results in automated essay scoring [9] and in cross-domain settings [11].
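
As a rough illustration of the three blended kernels named above, the Java sketch below computes them with plain hash maps of character n-gram counts. It is not the HASKER algorithm from [7] and it does not reflect the API of the String Kernels 1.0 package. The class name `BlendedKernelSketch`, its methods and the `Kind` enum are hypothetical names chosen for this example. It assumes the usual definitions: for every n-gram shared by the two strings, the spectrum kernel adds the product of the two counts, the presence bits kernel adds 1, and the intersection kernel adds the smaller of the two counts. "Blended" means summing over a range of n-gram lengths.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a naive hash-map computation, not HASKER and
// not the package's actual API. All names here are hypothetical.
final class BlendedKernelSketch {

    /** Which per-n-gram similarity is accumulated. */
    enum Kind { SPECTRUM, PRESENCE_BITS, INTERSECTION }

    /** Counts every character n-gram of length p in s (UTF-16 code units). */
    static Map<String, Integer> countNGrams(String s, int p) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + p <= s.length(); i++) {
            counts.merge(s.substring(i, i + p), 1, Integer::sum);
        }
        return counts;
    }

    /** Blended kernel: sums the chosen similarity over lengths minP..maxP. */
    static long kernel(String s, String t, int minP, int maxP, Kind kind) {
        long total = 0;
        for (int p = minP; p <= maxP; p++) {
            Map<String, Integer> countsS = countNGrams(s, p);
            Map<String, Integer> countsT = countNGrams(t, p);
            for (Map.Entry<String, Integer> e : countsS.entrySet()) {
                Integer other = countsT.get(e.getKey());
                if (other == null) {
                    continue; // n-gram absent from t contributes nothing
                }
                int a = e.getValue();
                int b = other;
                switch (kind) {
                    case SPECTRUM:
                        total += (long) a * b;     // product of counts
                        break;
                    case PRESENCE_BITS:
                        total += 1;                // shared n-gram, counted once
                        break;
                    case INTERSECTION:
                        total += Math.min(a, b);   // smaller of the two counts
                        break;
                }
            }
        }
        return total;
    }

    /** Normalized value k(s,t) / sqrt(k(s,s) * k(t,t)); 0 if undefined. */
    static double normalized(String s, String t, int minP, int maxP, Kind kind) {
        double denom = Math.sqrt((double) kernel(s, s, minP, maxP, kind)
                * kernel(t, t, minP, maxP, kind));
        return denom == 0.0 ? 0.0 : kernel(s, t, minP, maxP, kind) / denom;
    }

    public static void main(String[] args) {
        String s = "abab";
        String t = "abba";
        System.out.println(kernel(s, t, 1, 2, Kind.SPECTRUM));      // prints 11
        System.out.println(kernel(s, t, 1, 2, Kind.PRESENCE_BITS)); // prints 4
        System.out.println(kernel(s, t, 1, 2, Kind.INTERSECTION));  // prints 6
    }
}
```

For "abab" and "abba" with n-gram lengths 1 to 2, the shared n-grams are "a", "b", "ab" and "ba", which gives 4 for presence bits, 2+2+1+1 = 6 for intersection and 4+4+2+1 = 11 for spectrum. The sketch rebuilds the count maps on every call, which is fine for illustration but far slower than a dedicated algorithm on large corpora.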

Using string kernels, we obtained good rankings at several international competitions:

3rd place in the Native Language Identification Shared Task of the BEA-8 Workshop of NAACL 2013. Our system is described in [1].
2nd place in the Arabic Dialect Identification Shared Task, which is part of the DSL 2016 Challenge of the COLING 2016 VarDial Workshop. Our system is described in [5].
1st place in the Arabic Dialect Identification Shared Task, which is part of the DSL 2017 Challenge of the EACL 2017 VarDial Workshop. Our system is described in [6].
1st place in all three tracks (essay, speech and fusion) of the Native Language Identification Shared Task of the BEA-12 Workshop of EMNLP 2017. Our system is described in [8].
1st place in the Arabic Dialect Identification Shared Task, which is part of the DSL 2018 Challenge of the COLING 2018 VarDial Workshop. Our system is described in [10].