Report on POS-Tagger for Nepali Text - 06/23/07
INTRODUCTION
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, morphosyntactic categorization or syntactic word-class tagging, is the process of marking up the words in a text as corresponding to a particular part of speech. The decision is based on both a word's definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A POS analysis is the basic grammatical task of assigning every word in a sentence or text to the correct morphosyntactic category: noun, verb, adjective, adverb, and so on. In POS tagging, a label or tag is added to every word in a text to indicate its category.
While it is possible to assign these tags manually, it is highly desirable to automate the process, because applying a POS analysis by hand to a large corpus is prohibitively labor-intensive. Some of the available POS taggers are:
● Stanford POS tagger
● TreeTagger
● TnT - A Statistical Part-of-Speech Tagger
● Unitag
● Brill's Tagger
● Memory-based tagger
TOOLS USED
We first used Unitag for Nepali POS tagging. Because of complications we encountered while using Unitag, we then switched to Brill's tagger.
UNITAG
This unified tagging system, originally developed to tag Urdu, is now entirely language-independent, and based entirely on Unicode. It consists of a powerful morphological and lexical analysis system, and twin disambiguation modules, one based on hand-written rules and the other using a probabilistic system based on a Markov model.
We tried to use Unitag for Nepali POS tagging, but it was unable to tag the Nepali corpus properly.
USAGE:
Lexicon File
i400006 अभषक NP
i400007 क ट NN
i400008 ह CN
i400009 । YF
Input : अभषक क ट ह ।
Output:
s00001 w007 अभषक A50 FX
s00001 w001 क ट A10 NN
s00001 w008 ह A10 CN
s00001 w009 । A50 FX
The tags we used in the lexicon file above were NN, CN, YF and NP. The tag FX was never defined in the lexicon file, yet it appeared in the output file. Since Unitag was designed for the Urdu language, we were not able to track down the cause of this problem. Brill's tagger was therefore our next alternative.
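One way to spot this kind of mismatch is to compare the tags that occur in the tagger output with the tags defined in the lexicon. The Python sketch below assumes that the tag is the last whitespace-separated field on every line of both files. That matches the sample lines above but is our assumption, not something taken from Unitag documentation, and the function name undefined_tags is ours.

```python
def undefined_tags(lexicon_lines, output_lines):
    """Return tags that occur in the output but are not defined in the lexicon."""
    # Assumption: the tag is the last whitespace-separated field of each line.
    defined = {line.split()[-1] for line in lexicon_lines if line.strip()}
    used = {line.split()[-1] for line in output_lines if line.strip()}
    return sorted(used - defined)


# Sample lines from the Unitag experiment described above.
lexicon = [
    "i400006 अभषक NP",
    "i400007 क ट NN",
    "i400008 ह CN",
    "i400009 । YF",
]
output = [
    "s00001 w007 अभषक A50 FX",
    "s00001 w001 क ट A10 NN",
    "s00001 w008 ह A10 CN",
    "s00001 w009 । A50 FX",
]

print(undefined_tags(lexicon, output))
```

For the sample above, the only tag reported is FX, the one that was never defined in the lexicon file.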
BRILL'S TAGGER
The Brill tagger is a method for doing part-of-speech tagging. It can be summarized as an "error-driven transformation-based tagger". It is:
● error-driven in the sense that it relies on supervised learning
● transformation-based in the sense that a tag is assigned to each word and then changed using a set of predefined rules. Note: If the word is known, the tagger first assigns its most frequent tag; if the word is unknown, it naively assigns the tag "noun". Applying these rules repeatedly to correct wrong tags yields fairly high accuracy.
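The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of Brill's tagger itself. The lexicon is assumed to list each word's tags with the most frequent one first, NN is assumed to be the "noun" tag given to unknown words, the romanized tokens are placeholders, and the single contextual rule is invented for the example rather than taken from our rule files. The condition names PREVTAG and NEXTTAG are used here only as labels.

```python
def initial_tags(tokens, lexicon, unknown_tag="NN"):
    """Known words get their most frequent tag; unknown words get the noun tag."""
    # Assumption: each lexicon entry lists the word's tags, most frequent first.
    return [lexicon[t][0] if t in lexicon else unknown_tag for t in tokens]


def apply_rules(tags, rules):
    """Apply each contextual rule in order, scanning the sentence left to right."""
    tags = list(tags)
    for from_tag, to_tag, condition, context_tag in rules:
        for i, tag in enumerate(tags):
            if tag != from_tag:
                continue
            prev_tag = tags[i - 1] if i > 0 else None
            next_tag = tags[i + 1] if i + 1 < len(tags) else None
            if (condition == "PREVTAG" and prev_tag == context_tag) or (
                condition == "NEXTTAG" and next_tag == context_tag
            ):
                tags[i] = to_tag
    return tags


# Hypothetical toy data: "abhishek" is deliberately missing from the lexicon.
lexicon = {"keto": ["NN"], "ho": ["CN"], "।": ["YF"]}
# Hypothetical rule: change NN to NP when the next tag is NN.
rules = [("NN", "NP", "NEXTTAG", "NN")]

tokens = ["abhishek", "keto", "ho", "।"]
tags = apply_rules(initial_tags(tokens, lexicon), rules)
print(" ".join(f"{word}/{tag}" for word, tag in zip(tokens, tags)))
```

In this toy run the unknown first word starts out as NN and is then corrected to NP by the contextual rule, which is the kind of repair the transformation stage is meant to perform.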
USAGE:
Lexicon file:
अभषक NP
क ट NN
ह CN
। YF
Input : अभषक क ट ह ।
Output: अभषक/NP क ट /NN ह /CN ।/YF
TAGSET
The first prerequisite for an automated POS tagger is a tagset – that is, a set of exhaustive categories into which any token in the language can be placed. While the nature of language is such that there will always be words that are hard to classify or ambiguous between two categories, the tagset categories should be designed in such a way as to minimize these problems.
We have used the Nelralec tagset for the purpose of our project.
THE NELRALEC TAGSET
The Nepali tagset used on the Nelralec project was developed by a team of linguists from Tribhuvan University (especially Yogendra Yadava, Ram Lohani, and Bhim Regmi) and Lancaster University (Andrew Hardie).
The tagset is fully hierarchical - that is, in a tag such as VVYN1F, the first letter (V-) indicates the class of all verbs, the first two letters (VV-) indicate finite verbs, the first three letters (VVY-) indicate third person finite verbs, and so on, until at the lowest level of the hierarchy the fully specific tag VVYN1F indicates a very tightly defined, narrow category (feminine singular non-honorific third person finite verbs, such as che).
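Because the hierarchy is encoded in the tag string itself, membership in a broader category reduces to a prefix test. The Python sketch below assumes that every additional character adds one level. The description above implies this ("and so on") but does not spell it out, so only the levels it names are labelled. The helper names are ours.

```python
def tag_levels(tag):
    """Return every prefix of a hierarchical tag, most general first."""
    return [tag[:i] for i in range(1, len(tag) + 1)]


def in_category(tag, prefix):
    """True if the tag belongs to the broader category named by the prefix."""
    return tag.startswith(prefix)


# Labels taken from the description above; other levels are left unlabelled.
LABELS = {
    "V": "verb",
    "VV": "finite verb",
    "VVY": "third person finite verb",
    "VVYN1F": "feminine singular non-honorific third person finite verb",
}

for level in tag_levels("VVYN1F"):
    print(level, LABELS.get(level, ""))
```

A query such as in_category("VVYN1F", "VV") then selects all finite verbs regardless of their person, number or gender marking.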
The tagset has two main structural features of note. Firstly, the Nepali postpositions, which are preferentially written as affixes on the noun or other word that they govern, are treated as separate tokens in this scheme of analysis. This gives the tagset the flexibility needed to handle the very large array of potentially possible configurations of case.
Secondly, tense, aspect and modality are not marked up on finite verbs, which are classified solely according to their agreement marking - a necessary simplification for dealing with the very complex verbal inflections of Nepali, which, together with the use of compound verbs, could not be indicated by the tagset without the use of thousands of additional categories.
Our Brill tagger setup uses four text files, namely:
a. lexicon.project – contains the lexicon entries with their tags.
b. LEXICALRULEFILE.project – for writing lexical rules of Nepali grammar.
c. CONTEXTUALRULEFILE.project – for writing contextual rules of Nepali grammar.
d. test.txt – our custom corpus.
All four files have to be encoded in UTF-8.
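A quick way to confirm the encoding before running the tagger is to try decoding each file strictly as UTF-8. The Python sketch below is one possible check, not part of Brill's tagger, and the helper names is_utf8 and check_files are ours. Note that a file which decodes cleanly is only known to be valid UTF-8, not necessarily to contain the intended text.

```python
def is_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 (decoding is strict by default)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_files(paths):
    """Map each path to True or False depending on whether it decodes as UTF-8."""
    result = {}
    for path in paths:
        with open(path, "rb") as handle:
            result[path] = is_utf8(handle.read())
    return result


# Example call, assuming the four files sit in the current directory:
# check_files(["lexicon.project", "LEXICALRULEFILE.project",
#              "CONTEXTUALRULEFILE.project", "test.txt"])
```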
CONCLUSION
We built a POS tagger for Nepali text using Brill's tagger together with the Nelralec tagset. Our custom corpus consists of 58 words, and the lexicon contains 209 entries with their corresponding tags.
REFERENCES
Nepali Grammar Project [http://www.lancs.ac.uk/staff/hardiea/nepali/index.php]
Brill's Tagger [http://en.wikipedia.org/wiki/Brill_Tagger]
Part-of-speech tagging [http://en.wikipedia.org/wiki/Part-of-speech_tagging]
Group Members:
Prabin Gautam
Pravab Dhakal
CS-Batch:2003