Institute of Information Theory and Automation

Machine learning for language description

2019-03-11 14:00
Name of External Lecturer: 
František Kratochvíl, MA, Ph.D.
Affiliation of External Lecturer: 
Filozofická fakulta Univerzity Palackého v Olomouci

Linguistics as a discipline faces a great challenge in the rapid loss of global linguistic diversity. The loss is driven by forces similar to climate change, global urbanisation, or the disappearance of traditional knowledge and ways of life. Linguistics has responded with systematic language documentation, so that at least a snapshot of the present diversity is preserved for posterity. Nevertheless, the task is overwhelming, and optimisation using the newest technologies is therefore necessary; that is the aim of this project.

We introduce tools from computational linguistics and machine learning to process the collected language data and mine it for information about language structure that is not otherwise obtainable. Linguistic documentation produces three types of data. The first is elicited or prompted, through various methods, with a particular goal in mind; there is some discussion in linguistics about the reliability of this data. The second is natural, unprompted speech. Both types are either manually parsed or left unparsed, because of the high manpower and time costs of this process. The third type is natural speech without annotation.

We create a workflow in which computational methods are used to derive automatic parsers that parse all the data. We then use machine learning to overlay the three types of data (elicited, natural, and natural-without-annotation) and to discover patterns in them, while factoring in the differences in reliability across our dataset.
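One simple way to picture "factoring in reliability" is to let each observation count in proportion to how much its data type is trusted. The sketch below is only an illustration under stated assumptions: the weight values, the function name `weighted_pattern_counts`, and the stem and prefix labels are all hypothetical placeholders and are not taken from the project.

```python
from collections import defaultdict

# Hypothetical reliability weights per data type (assumed values,
# chosen only for illustration; not taken from the project).
RELIABILITY = {
    "elicited": 0.5,
    "natural": 1.0,
    "natural_unannotated": 0.75,
}


def weighted_pattern_counts(observations, weights=RELIABILITY):
    """Sum reliability-weighted counts for each (stem, affix) pattern.

    observations: iterable of (source_type, stem, affix) tuples.
    """
    counts = defaultdict(float)
    for source, stem, affix in observations:
        # Each attestation contributes the weight of its data type.
        counts[(stem, affix)] += weights[source]
    return dict(counts)


# Placeholder observations; the stems and prefixes are invented labels.
observations = [
    ("elicited", "stemA", "pfx1-"),
    ("natural", "stemA", "pfx1-"),
    ("natural_unannotated", "stemA", "pfx2-"),
    ("elicited", "stemB", "pfx2-"),
]
scores = weighted_pattern_counts(observations)
```

A pattern attested in natural speech then outweighs one attested only in elicitation; a real model would presumably estimate such weights rather than fix them by hand.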

The workflow also enables us to take a position on several longstanding methodological questions. The first is the question of sample size: how much language material must be collected to capture the complexity of the language in question. The second question concerns metrics, i.e. whether a sample can be assessed in terms of information yield or saturation in relation to a particular phenomenon, in our case the morphological system of the language. Using our methods, we can calculate this type of information for any dataset or text. We will apply these methods to three under-resourced languages familiar to us: Abui and Sawila (Papuan, Timor-Alor-Pantar) and Malay/Indonesian (Austronesian).
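The idea of saturation can be illustrated with a type-accumulation curve: count how many distinct morphological forms have been seen after each token, and check how often recent tokens still add something new. This is a minimal sketch of that generic idea, not the metric used in the project; the function names and the placeholder forms are assumptions made for illustration.

```python
def saturation_curve(forms):
    """Return the cumulative number of distinct forms after each token."""
    seen = set()
    curve = []
    for form in forms:
        seen.add(form)
        curve.append(len(seen))
    return curve


def new_type_rate(curve, window):
    """Share of the last `window` tokens that introduced an unseen form."""
    if window <= 0 or window > len(curve):
        raise ValueError("window must be between 1 and len(curve)")
    # Number of distinct forms known just before the window starts.
    start = curve[-window - 1] if window < len(curve) else 0
    return (curve[-1] - start) / window


# Placeholder affix labels, not real language data.
forms = ["pfx1-", "pfx2-", "pfx1-", "pfx3-", "pfx2-", "pfx1-", "pfx2-", "pfx1-"]
curve = saturation_curve(forms)
rate = new_type_rate(curve, window=4)
```

A rate near zero over a long window suggests that further material of the same kind is yielding few new forms for the phenomenon being counted; a rate that stays high suggests the sample is not yet saturated.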

Keywords: language description, morphology, morphological analyser, inflectional classes, machine learning, uncertainty

2019-03-08 11:59