There are many NLP libraries out there, but basic tasks such as stemming, tokenization, and stopword filtering are not available for the Nepali language, and without these basic but important steps most Nepali NLP work is far less fruitful. So I have developed a library that can be used for tokenization, stopword removal, and stemming.
- The stemming algorithm is based on the one published in my report “Text Stemming in Nepali” (Download pdf report: Text Stemming in Nepali). Currently, only suffixes are removed; prefix removal has not been implemented yet.
- Tokenization treats every non-Nepali character as a split point.
- Stopword removal uses a list of stop words that I have collected. See details here.
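To make the three steps concrete, here is a minimal Java sketch of such a pipeline. It is not the library's code or API: the class and method names (`NepaliTextSketch`, `tokenize`, `removeStopWords`, `stem`) are hypothetical, the stop word and suffix lists are tiny illustrative samples rather than the collected lists or the rules from the report, and "Nepali character" is assumed to mean the Devanagari Unicode block with the danda signs treated as punctuation. Stripping suffixes repeatedly and keeping a minimum stem length are also assumptions made for this illustration. The Devanagari strings are written as Unicode escapes so that the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NepaliTextSketch {
    // Illustrative sample only (assumption): "ra" (and), "ko" (of), "ma" (in).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "\u0930", "\u0915\u094B", "\u092E\u093E"));

    // Illustrative sample only (assumption), longest first:
    // "haru" (plural), "lai", "le", "ko", "ma".
    static final List<String> SUFFIXES = Arrays.asList(
            "\u0939\u0930\u0942", "\u0932\u093E\u0908",
            "\u0932\u0947", "\u0915\u094B", "\u092E\u093E");

    // Assumption: never strip a suffix if fewer than 2 characters would remain.
    static final int MIN_STEM_LENGTH = 2;

    // Split on every run of characters outside the Devanagari block
    // (U+0900-U+097F); the danda signs U+0964/U+0965 also act as split points.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\u0900-\\u0963\\u0966-\\u097F]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Drop every token that appears in the stop word set.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Suffix-only stemming: strip known suffixes until none applies.
    static String stem(String word) {
        String stem = word;
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : SUFFIXES) {
                int rest = stem.length() - suffix.length();
                if (stem.endsWith(suffix) && rest >= MIN_STEM_LENGTH) {
                    stem = stem.substring(0, rest);
                    stripped = true;
                    break;
                }
            }
        }
        return stem;
    }

    public static void main(String[] args) {
        // "gharharuma, abc ra 123 kitab" followed by a danda.
        String text = "\u0918\u0930\u0939\u0930\u0942\u092E\u093E, abc "
                + "\u0930 123 \u0915\u093F\u0924\u093E\u092C\u0964";
        List<String> stems = new ArrayList<>();
        for (String token : removeStopWords(tokenize(text))) {
            stems.add(stem(token));
        }
        System.out.println(stems); // two stems: "ghar" and "kitab"
    }
}
```

In this sketch the Latin letters, digits, comma, spaces, and danda all act as separators, the stop word "ra" is dropped, and "gharharuma" loses first "ma" and then "haru". A real stemmer needs a far larger suffix list and rules for exceptions, which is what the report describes.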
The library is available for Java and as an extension for the RapidMiner platform. You can download the Java library if you intend to develop a text processing application in Java. After downloading the library, you might want to check this post about the classes and functions in the library.
Or download the RapidMiner extension for processing Nepali text in RapidMiner. Please note that the extension depends on the “Text Processing Extension”; if you have not installed it already, install it first. To install the Nepali extension, copy the jar file to the “plugins” directory of RapidMiner, which is usually located at “C:\Program Files (x86)\Rapid-I\RapidMiner5\lib\plugins”. The path may vary depending on your OS and the folder where you installed RapidMiner.