Tagging numbers in different languages

A morphosyntactic descriptor for a morphologically rich language is commonly expressed using very short mnemonics, such as Ncmsan for Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English.
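The positional reading of a mnemonic such as Ncmsan can be sketched in a few lines of Python. This is only an illustration: the function name `decode_msd` and the lookup tables are hypothetical, and they cover just the attribute values named in the example above. A real tagset defines many more categories, positions, and codes.

```python
from typing import Dict

# Illustrative tables only: they cover just the "Ncmsan" example from the text.
# The attribute order for nouns is taken from that example.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case", "Animate"]

CATEGORY_CODES = {"N": "Noun"}

NOUN_VALUE_CODES = {
    "Type": {"c": "common"},
    "Gender": {"m": "masculine"},
    "Number": {"s": "singular"},
    "Case": {"a": "accusative"},
    "Animate": {"n": "no"},
}


def decode_msd(tag: str) -> Dict[str, str]:
    """Expand a positional mnemonic into attribute/value pairs."""
    if not tag:
        raise ValueError("empty tag")
    category = CATEGORY_CODES.get(tag[0])
    if category is None:
        raise ValueError(f"unknown category code: {tag[0]!r}")
    codes = tag[1:]
    if len(codes) > len(NOUN_ATTRIBUTES):
        raise ValueError(f"too many positions in tag: {tag!r}")
    features = {"Category": category}
    for attribute, code in zip(NOUN_ATTRIBUTES, codes):
        # Codes missing from the toy table are kept verbatim rather than guessed.
        features[attribute] = NOUN_VALUE_CODES[attribute].get(code, code)
    return features


print(decode_msd("Ncmsan"))
# {'Category': 'Noun', 'Type': 'common', 'Gender': 'masculine',
#  'Number': 'singular', 'Case': 'accusative', 'Animate': 'no'}
```

Because each character position carries one attribute, a single short string can stand for a full feature bundle, which is how such tagsets reach hundreds or thousands of distinct tags.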

Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech. In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags: for example, NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus).

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other things.

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare: in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb: "The sailor dogs the hatch." Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and 2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely").

POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.
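The way grammatical context resolves the "dogs" ambiguity can be sketched as a tiny rule-based tagger in Python. Everything here is an assumption made for illustration: the toy lexicon, the Brown-style tag names (AT, NN, NNS, VB, VBZ), the single hand-written rule, and the example sentence, which was chosen to match the "sailor" and "hatch" context described above. It is not Brill's tagger, which learns its rules from an annotated corpus rather than having them written by hand.

```python
# Hypothetical toy lexicon: each word form maps to its candidate tags,
# with the most common reading listed first.
LEXICON = {
    "the": ["AT"],
    "sailor": ["NN"],
    "hatch": ["NN", "VB"],
    "dogs": ["NNS", "VBZ"],
}


def initial_tags(words):
    """Baseline pass: give every word its first-listed (most common) tag."""
    return [LEXICON[word.lower()][0] for word in words]


def apply_rule(words, tags):
    """Apply one hand-written contextual rule (an assumption, not a learned rule):
    retag NNS as VBZ when it stands between a singular noun and an article,
    provided the lexicon lists a verb reading for that word."""
    new_tags = list(tags)
    for i in range(1, len(tags) - 1):
        if (
            tags[i] == "NNS"
            and tags[i - 1] == "NN"
            and tags[i + 1] == "AT"
            and "VBZ" in LEXICON[words[i].lower()]
        ):
            new_tags[i] = "VBZ"
    return new_tags


def tag(sentence):
    """Tag a whitespace-separated sentence made of words from the toy lexicon."""
    words = sentence.rstrip(".").split()
    return list(zip(words, apply_rule(words, initial_tags(words))))


print(tag("The sailor dogs the hatch."))
# [('The', 'AT'), ('sailor', 'NN'), ('dogs', 'VBZ'), ('the', 'AT'), ('hatch', 'NN')]
```

The baseline pass alone would label "dogs" as a plural noun; only the contextual rule recovers the verb reading. A stochastic tagger would reach the same decision by comparing the probabilities of whole tag sequences instead of applying explicit rules.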

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags.
