String Kernels Software Package

by Radu Tudor Ionescu and Marius Popescu

Download the String Kernels 1.0 software package, released under the GNU General Public License. The package contains a Java implementation of the efficient algorithm for computing string kernels presented in [1, 2, 3, 4, 7].

If you use this software (or a modified version of it) in any scientific work, please cite at least one of the corresponding works:

[1] Marius Popescu and Radu Tudor Ionescu. The Story of the Characters, the DNA and the Native Language. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 270-278, 2013. [BibTeX]

[2] Radu Tudor Ionescu, Marius Popescu, and Aoife Cahill. Can characters reveal your native language? A language-independent approach to native language identification. Proceedings of EMNLP, pp. 1363-1373, 2014. [BibTeX]

[3] Radu Tudor Ionescu and Marius Popescu. Knowledge Transfer between Computer Vision and Text Mining: Similarity-based Learning Approaches. Springer, 2016. [BibTeX]

[4] Radu Tudor Ionescu, Marius Popescu, Aoife Cahill. String Kernels for Native Language Identification: Insights from Behind the Curtains. Computational Linguistics, vol. 42, no. 3, pp. 491-525, 2016. [BibTeX]

[5] Radu Tudor Ionescu, Marius Popescu. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. In Proceedings of the VarDial Workshop, pp. 135-144, 2016. [BibTeX]

[6] Radu Tudor Ionescu, Andrei M. Butnaru. Learning to Identify Arabic and German Dialects using Multiple Kernels. In Proceedings of the VarDial Workshop of EACL, pp. 200-209, 2017. [BibTeX]

[7] Marius Popescu, Cristian Grozea, Radu Tudor Ionescu. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages. In Proceedings of KES, pp. 1755-1763, 2017. [BibTeX]

[8] Radu Tudor Ionescu, Marius Popescu. Can string kernels pass the test of time in Native Language Identification? In Proceedings of the BEA-12 Workshop of EMNLP, pp. 224-234, 2017. [BibTeX]

[9] Mădălina Cozma, Andrei M. Butnaru, Radu Tudor Ionescu. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pp. 503-509, 2018. [BibTeX]

[10] Andrei M. Butnaru, Radu Tudor Ionescu. UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row. In Proceedings of the VarDial Workshop, pp. 77-87, 2018. [BibTeX]

[11] Radu Tudor Ionescu, Andrei M. Butnaru. Improving the results of string kernels in sentiment analysis and Arabic dialect identification by adapting them to your test set. In Proceedings of EMNLP, 2018. [BibTeX]

[12] Andrei M. Butnaru, Radu Tudor Ionescu. MOROCO: The Moldavian and Romanian dialectal corpus. In Proceedings of ACL, pp. 688-698, 2019.

[13] Mihaela Găman, Sebastian Cojocariu, Radu Tudor Ionescu. UnibucKernel: Geolocating Swiss German Jodels Using Ensemble Learning. In Proceedings of the VarDial Workshop, pp. 84-95, 2021.

[14] Mihaela Găman, Radu Tudor Ionescu. The unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification. International Journal of Intelligent Systems, 2021.

The blended spectrum string kernel and the blended presence bits string kernel were used for native language identification in [1], alongside a kernel based on Local Rank Distance. The blended intersection string kernel was introduced in [2]. An extensive presentation of string kernels is provided in [3, 4]. As shown in [3, 4], string kernels obtain state-of-the-art native language identification performance for several L2 languages: English, Arabic, and Norwegian. In [7], we present our efficient string kernel algorithm, named HASKER, and show that it is about 4 times faster than a suffix trie implementation. More recently, string kernels have also been shown to obtain state-of-the-art results in automated essay scoring [9] and in cross-domain settings [11].
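To make the three blended kernels named above concrete, here is a minimal Java sketch. It assumes the usual formulations: over all character n-grams with lengths in a chosen range, the spectrum kernel sums the products of the n-gram counts in the two strings, the presence bits kernel counts the n-grams the two strings share, and the intersection kernel sums the minimum of the two counts. The class and method names (BlendedKernelSketch, ngramCounts, spectrum, presenceBits, intersection) are hypothetical and are not taken from the String Kernels 1.0 package. The sketch uses plain hash-map counting; it is not the HASKER algorithm from [7] and it omits kernel normalization.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; NOT part of the String Kernels 1.0 package.
final class BlendedKernelSketch {

    // Counts every character n-gram of s whose length lies in [minN, maxN].
    static Map<String, Integer> ngramCounts(String s, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Blended spectrum kernel: sum over shared n-grams of count_s * count_t.
    static long spectrum(String s, String t, int minN, int maxN) {
        Map<String, Integer> cs = ngramCounts(s, minN, maxN);
        Map<String, Integer> ct = ngramCounts(t, minN, maxN);
        long k = 0;
        for (Map.Entry<String, Integer> e : cs.entrySet()) {
            Integer other = ct.get(e.getKey());
            if (other != null) {
                k += (long) e.getValue() * other;
            }
        }
        return k;
    }

    // Blended presence bits kernel: number of distinct n-grams found in both strings.
    static long presenceBits(String s, String t, int minN, int maxN) {
        Map<String, Integer> cs = ngramCounts(s, minN, maxN);
        Map<String, Integer> ct = ngramCounts(t, minN, maxN);
        long k = 0;
        for (String gram : cs.keySet()) {
            if (ct.containsKey(gram)) {
                k++;
            }
        }
        return k;
    }

    // Blended intersection kernel: sum over shared n-grams of min(count_s, count_t).
    static long intersection(String s, String t, int minN, int maxN) {
        Map<String, Integer> cs = ngramCounts(s, minN, maxN);
        Map<String, Integer> ct = ngramCounts(t, minN, maxN);
        long k = 0;
        for (Map.Entry<String, Integer> e : cs.entrySet()) {
            Integer other = ct.get(e.getKey());
            if (other != null) {
                k += Math.min(e.getValue(), other);
            }
        }
        return k;
    }

    public static void main(String[] args) {
        // "abab" has n-grams a:2, b:2, ab:2, ba:1; "bab" has a:1, b:2, ab:1, ba:1.
        System.out.println(spectrum("abab", "bab", 1, 2));     // prints 9
        System.out.println(presenceBits("abab", "bab", 1, 2)); // prints 4
        System.out.println(intersection("abab", "bab", 1, 2)); // prints 5
    }
}
```

The three functions differ only in how they combine the two counts for a shared n-gram, which is why the presence bits and intersection variants are less sensitive than the spectrum kernel to n-grams that repeat many times in one string.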

Using string kernels, we obtained good rankings at several international competitions: