Construction and Operation of a Russian Language Database: The Case of a Lemmatized Concordance

Abstract

The Russian language has an inflectional system that is generally considered too complicated for systematic processing by computer. For this reason, it is necessary to attach the proper lexical item to each individual word-form in the text (lemmatization). This paper reports on the construction of three independent but mutually complementary database tables, operated with the relational database software dBXL for personal computers.

kwic.dbf (the contextual database, cf. example 3). From this database we can retrieve both the left and the right context of each word by specifying its absolute address (REF: page, line, and ordinal number of the item), which never overlaps with another within the same work. These data are used to understand or verify information that is difficult to formalize, such as the meaning or usage of a word, by showing its contextual environment.

lex.dbf (the vocabulary database, cf. example 4). This database is an electronic dictionary containing the lexical items and information on their inflection. These data can be retrieved by the combination of each lexical item with a number that identifies homonyms (LEX2). Lexical meaning and stylistic value are also to be registered here later.

main.dbf (the main database, cf. example 7). All word-forms in the processed text are supplied with their absolute addresses, the corresponding lexical items, parts of speech, morphological information (INF), and so on. This database is built by applying morphological analysis to the text and by resolving homonyms with the aid of the contextual and lexical databases. It can be used widely as a dynamic index.

The paper also reports on the compilation of a so-called lemmatized concordance of A. S. Pushkin's "The Captain's Daughter", making use of two databases (cf. example 11). The concordance is obtained by setting a multiple index on main.dbf by LEX2 and INF, and then setting a relation between main.dbf and kwic.dbf by means of REF. Generally speaking, in the computerized analysis of a text, a frequency list of the vocabulary used is important basic material. Until now, however, there has been no method of verifying the numerical values of such data, even when the statistics were published. A reliable frequency list can be obtained on the basis of this concordance, because the frequency of each lexical item is shown together with all its word-forms in their context.
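The relation described in the abstract can be sketched in a few lines. This is only an illustration, not the paper's dBXL implementation: it swaps in Python's standard sqlite3 module, and apart from REF, LEX2 and INF every column name (form, pos, left_ctx, right_ctx), the address format, and all sample rows are assumptions invented for the example. The contexts are placeholders and are not quoted from Pushkin's text.

```python
import sqlite3

# In-memory stand-ins for kwic.dbf, lex.dbf and main.dbf.
# Only REF, LEX2 and INF come from the abstract; the rest is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kwic (ref TEXT PRIMARY KEY, left_ctx TEXT, right_ctx TEXT);
CREATE TABLE lex  (lex2 TEXT PRIMARY KEY, pos TEXT);
CREATE TABLE main (ref TEXT PRIMARY KEY, form TEXT, lex2 TEXT, inf TEXT);
-- Counterpart of the multiple index on main.dbf by LEX2 and INF.
CREATE INDEX main_lex2_inf ON main (lex2, inf);
""")

# REF is modelled as a zero-padded "page.line.item" string (an assumption).
conn.executemany("INSERT INTO kwic VALUES (?, ?, ?)", [
    ("001.01.1", "<left context A>", "<right context A>"),
    ("001.01.2", "<left context B>", "<right context B>"),
    ("001.02.1", "<left context C>", "<right context C>"),
    ("002.05.3", "<left context D>", "<right context D>"),
])
# LEX2 is modelled as lemma plus a homonym number (an assumption).
conn.executemany("INSERT INTO lex VALUES (?, ?)", [
    ("дочь1", "noun"),
    ("капитан1", "noun"),
])
conn.executemany("INSERT INTO main VALUES (?, ?, ?, ?)", [
    ("001.01.1", "капитана", "капитан1", "gen.sg"),
    ("001.01.2", "дочь", "дочь1", "nom.sg"),
    ("001.02.1", "дочери", "дочь1", "gen.sg"),
    ("002.05.3", "дочь", "дочь1", "nom.sg"),
])


def build_concordance(conn):
    """Word-forms grouped by lemma (LEX2), then by INF, each joined to its context via REF."""
    return conn.execute("""
        SELECT m.lex2, m.inf, m.form, m.ref, k.left_ctx, k.right_ctx
        FROM main AS m
        JOIN kwic AS k ON k.ref = m.ref
        ORDER BY m.lex2, m.inf, m.ref
    """).fetchall()


def frequency_list(conn):
    """Number of occurrences per lexical item, most frequent first."""
    return conn.execute("""
        SELECT lex2, COUNT(*) FROM main
        GROUP BY lex2
        ORDER BY COUNT(*) DESC, lex2
    """).fetchall()


for row in build_concordance(conn):
    print(row)
print(frequency_list(conn))
```

Because every counted occurrence also appears as a concordance row with its address and context, each figure in the frequency list can be checked against the entries it summarizes, which is the verifiability the abstract points to.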
Paper of the Japan Association for the Study of Russian Language and Literature (日本ロシア文学会)
1996-10-01

Author

浦井康男
Hokkaido University
