String Kernels Software Package

by Radu Tudor Ionescu and Marius Popescu

Download the String Kernels 1.0 software package, released under the GNU Public License. The package contains a Java implementation of the efficient algorithm for computing string kernels presented in [1, 2, 3].

If you use this software (or a modified version of it) in any scientific work, please cite the corresponding works:

[1] Radu Tudor Ionescu, Marius Popescu, and Aoife Cahill. Can characters reveal your native language? A language-independent approach to native language identification. Proceedings of EMNLP, pp. 1363–1373, 2014.

[2] Marius Popescu and Radu Tudor Ionescu. The Story of the Characters, the DNA and the Native Language. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 270–278, 2013.

[3] Radu Tudor Ionescu and Marius Popescu. Knowledge Transfer between Computer Vision and Text Mining: Similarity-based Learning Approaches. Springer, 2016.

[4] Radu Tudor Ionescu, Marius Popescu, and Aoife Cahill. String Kernels for Native Language Identification: Insights from Behind the Curtains. Computational Linguistics, 2016.

[5] Radu Tudor Ionescu and Marius Popescu. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. Proceedings of the VarDial Workshop of COLING, 2016.

The blended spectrum string kernel and the blended presence bits string kernel were used for native language identification in [2]. The same paper also used a kernel based on Local Rank Distance. The blended intersection string kernel was introduced in [1]. An extensive presentation of string kernels is provided in [3, 4]. As shown in [3, 4], string kernels obtain state-of-the-art native language identification performance for several L2 languages: English, Arabic, and Norwegian.
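
The three blended kernels named above differ only in how they combine the counts of character n-grams shared by two strings. As a rough illustration, the sketch below follows the usual definitions: the spectrum kernel sums the products of the counts, the presence bits kernel counts each shared n-gram once, and the intersection kernel sums the minimum of the two counts. "Blended" means the result is summed over a range of n-gram lengths. This is a naive reference computation, not the efficient algorithm shipped in the package. The class, enum, and method names are hypothetical, and kernel normalization is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names do not come from the String Kernels package.
final class BlendedKernelSketch {

    enum Kind { SPECTRUM, PRESENCE, INTERSECTION }

    // Counts every character n-gram of length n in s
    // (n-grams are taken over Java chars, i.e. UTF-16 code units).
    static Map<String, Integer> ngramCounts(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Sums the chosen kernel over all n-gram lengths in [minN, maxN].
    static long blended(String s, String t, int minN, int maxN, Kind kind) {
        long total = 0;
        for (int n = minN; n <= maxN; n++) {
            Map<String, Integer> countsS = ngramCounts(s, n);
            Map<String, Integer> countsT = ngramCounts(t, n);
            for (Map.Entry<String, Integer> entry : countsS.entrySet()) {
                Integer other = countsT.get(entry.getKey());
                if (other == null) {
                    continue; // n-gram not shared, contributes nothing
                }
                int a = entry.getValue();
                int b = other;
                switch (kind) {
                    case SPECTRUM:
                        total += (long) a * b;    // product of occurrence counts
                        break;
                    case PRESENCE:
                        total += 1;               // shared n-gram counted once
                        break;
                    case INTERSECTION:
                        total += Math.min(a, b);  // overlap of occurrence counts
                        break;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(kind + ": " + blended("abab", "abab", 1, 2, kind));
        }
    }
}
```

For the strings "abab" and "abab" with n-gram lengths 1 to 2, this sketch gives 13 for the spectrum kernel, 4 for the presence bits kernel, and 7 for the intersection kernel. The gap between them shows how the presence bits and intersection variants damp the influence of frequently repeated n-grams.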

Using string kernels, we ranked second in the Arabic Dialect Identification Shared Task, which is part of the DSL 2016 Challenge of the COLING 2016 VarDial Workshop. Our system is described in [5].