Firoj Alam, Shammur Absar Chowdhury, Sheak Rashed Haider Noori
ICCIT 2016
Publication year: 2016

Part-of-speech (POS) information is one of the fundamental components in the natural language processing pipeline, which helps in extracting higher-level information such as named entities, discourse, and syntactic structure of a sentence. For some languages, such as English, Dutch, and Chinese, it is considered as a solved problem due to the higher accuracy (97%) of the predicted system. Significant efforts have been made for such languages in terms of making the data publicly accessible and also organizing evaluation campaigns. Compared to that there are very fewer efforts for Bangla (ethnonym: Bangla; exonym: Bengali). In this paper, we present a knowledge poor approach for POS tagging, which we evaluated using publicly accessible dataset from LDC. The motivation of our approach is that we did not want to rely on any existing resources such as lexicon or named entity recognizer for designing the system as they are not publicly available and difficult to develop. We have not used any hand-crafted features, rather we employed distributed representations of word and characters. We designed the system using Long Short Term Memory (LSTM) neural networks followed by Conditional Random Fields (CRFs) for designing the model with an inclusion of pre-trained word embedded model. We obtained promising results with an accuracy of 86.0%.

Leave a Reply

Your email address will not be published. Required fields are marked *

eighteen − 9 =