
A simple grammar learner

Joachim Pimiskern, September 16th, 2013

Version  Date        Remarks
1        2013-09-16  First draft

Grammars for natural languages appear to be too complicated to write down completely. Nevertheless, humans are able to tell whether a sentence is grammatical or not. The idea behind the lightweight language learner program NatLan is to present invented sentences to a human and to ask whether they are grammatically sound or not, thereby acquiring the grammar of a language.

A corpus of training sentences has to be provided. In order to make it easier to find grammatical categories, sections of a sentence can be enclosed in square brackets []. Nesting of brackets is allowed. In principle the brackets could be inserted randomly, but the number of opening and closing brackets must be equal, and, reading from left to right, the number of closing brackets must never exceed the number of opening ones.
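
As a sketch, this constraint could be checked with a few lines of code (a minimal C# fragment for illustration; it is not taken from NatLan itself):

static bool BracketsAreBalanced(string sentence)
{
    // Count bracket depth while reading left to right.
    int depth = 0;
    foreach (char c in sentence)
    {
        if (c == '[')
            depth++;
        else if (c == ']')
        {
            depth--;
            if (depth < 0)
                return false; // a closing bracket without a matching opening one
        }
    }
    return depth == 0; // equal numbers of opening and closing brackets
}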

Grammars

A grammar consists of a repository of symbols, usually divided into nonterminals and terminals, and of rules that replace sequences of those symbols with sequences of other symbols. Such a process begins with a start symbol (here: with one of the start symbols), applies the rules a number of times, and ends when no nonterminal is left in the sequence; all remaining symbols are then terminals.

The difference between a nonterminal symbol and a terminal symbol is merely that a nonterminal will be replaced by a rule, whereas a terminal stays. Terminals are usually strings or characters. An example:

sentence -> nominalphrase verbphrase
nominalphrase -> article noun
verbphrase -> verb nominalphrase
article -> 'the'
article -> 'a'
noun -> 'dog'
noun -> 'cat'
verb -> 'chases'
verb -> 'teases'

In this example, the nonterminals are sentence, nominalphrase, verbphrase, article, noun, and verb. Terminals are 'the', 'a', 'dog', 'cat', 'chases', and 'teases'. The above (context-free) grammar defines a language: the set of all sequences of terminals that can be derived by starting with the nonterminal sentence and repeatedly replacing a nonterminal by the right side of a rule.
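
For example, the sentence the dog chases the cat is derived as follows (always expanding the leftmost nonterminal):

sentence
-> nominalphrase verbphrase
-> article noun verbphrase
-> 'the' noun verbphrase
-> 'the' 'dog' verbphrase
-> 'the' 'dog' verb nominalphrase
-> 'the' 'dog' 'chases' nominalphrase
-> 'the' 'dog' 'chases' article noun
-> 'the' 'dog' 'chases' 'the' noun
-> 'the' 'dog' 'chases' 'the' 'cat'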

Grammatical Categories

A property of a grammatical category is that its members may be exchanged without changing whether the sentence is grammatical or ungrammatical. The example grammar would allow the dog chases the cat but also the dog chases the dog. Here one of the nouns was replaced by another one.

Such a replacement, when performed randomly, would most probably lead to a failure, an ungrammatical sentence. So, if exchanging two symbols of a grammar still allows the derivation of a sound sentence, this is an argument for the hypothesis that the symbols belong to the same grammatical category. If an ungrammatical sentence is derived, we should downrate the replacement, but not dismiss it completely, because the replacement might be valid while the grammar as a whole contains errors.
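
A minimal sketch of such a rating, assuming a simple numeric score per hypothesis (the class name and the update scheme are invented for illustration and are not NatLan's actual code):

// Hypothesis: two grammatical categories are interchangeable.
// Human judgements of generated sentences raise or lower the score;
// a negative judgement downrates the hypothesis without dismissing it.
class EquivalenceHypothesis
{
    public string CategoryA;
    public string CategoryB;
    public double Score = 0.5; // neutral initial confidence

    public void RegisterJudgement(bool sentenceWasGrammatical)
    {
        if (sentenceWasGrammatical)
            Score += 0.1 * (1.0 - Score); // move toward 1
        else
            Score *= 0.9;                 // downrate, but keep the hypothesis alive
    }
}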

Extraction of grammar rules from sample sentences

A sentence like Susan loves Peter can be interpreted as a grammar rule. Every word in an example sentence can be regarded as a grammatical category. Susan stands for everything that might be placed in this word's stead and still results in a valid sentence, e.g. {Susan, Mary, The girl from the seventh floor, A shoe}. A shoe loves Peter makes no sense, but sense is not required here. All that matters is that a produced sentence is grammatically correct.
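
Read this way, the sentence Susan loves Peter could be turned into rules of the following kind (the category names are chosen arbitrarily for illustration):

sentence  -> category1 category2 category3
category1 -> 'Susan'
category2 -> 'loves'
category3 -> 'Peter'

category1 then stands for everything that may later turn out to be exchangeable with 'Susan', such as 'Mary'.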

Introducing square brackets

Consider the sentence the dog chases a cat. If we took the to stand for the grammatical category article, the example sentence would only stand for sentences beginning with the, a, or an. How could a sentence like Susan chases Peter be brought forth? One answer is: gather a bigger text corpus. Another idea, proposed by Frank, Bod & Christiansen [1], is to enclose parts of a sentence in square brackets '['...']'.

In the paper by Frank, Bod, and Christiansen, the case is made for a rather sequential structure of natural language that doesn't rely too much on nesting.

Since human working memory has a capacity of around seven plus or minus two items (some authors suggest only four), it should be clear that humans are very bad "stack machines" and very bad at executing recursive functions.

With a system like NatLan such theories can be tested: NatLan semi-automatically tries to find an optimal way of producing a variety of sentences that are then evaluated by a human. How the grammar evolves, whether toward a sequential or a hierarchical architecture, can then be observed.

Generating random sentences

The program NatLan invents sentences in order to present them to the user and to receive an evaluation. At first, one of the start symbols is chosen; there can be many of them. Traditional grammars have only one start symbol, but that is not a serious restriction: given S->S1, S->S2, S->S3, we could have a single start symbol S or, equally well, the start symbols S1, S2, S3, ... In a repeated process each nonterminal is then replaced, either by applying its production rule and inserting the right side, or by substituting an equivalent nonterminal.
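
A minimal sketch of such a generator, assuming that every nonterminal has exactly one production and that equivalent nonterminals are stored in a separate table (class and member names are illustrative, not NatLan's):

using System;
using System.Collections.Generic;

class SentenceGenerator
{
    // One production per nonterminal: its right-hand side as a symbol sequence.
    // Symbols without a production are treated as terminals.
    readonly Dictionary<string, List<string>> productions;
    // For each nonterminal, the nonterminals currently believed to be equivalent.
    readonly Dictionary<string, List<string>> equivalents;
    readonly Random random = new Random();

    public SentenceGenerator(Dictionary<string, List<string>> productions,
                             Dictionary<string, List<string>> equivalents)
    {
        this.productions = productions;
        this.equivalents = equivalents;
    }

    public string Generate(string startSymbol)
    {
        var words = new List<string>();
        Expand(startSymbol, words);
        return string.Join(" ", words.ToArray());
    }

    void Expand(string symbol, List<string> words)
    {
        // Sometimes replace the nonterminal by an equivalent one.
        List<string> candidates;
        if (equivalents.TryGetValue(symbol, out candidates) && random.NextDouble() < 0.5)
            symbol = candidates[random.Next(candidates.Count)];

        List<string> rightSide;
        if (!productions.TryGetValue(symbol, out rightSide))
        {
            words.Add(symbol); // terminal: emit as-is
            return;
        }
        foreach (string child in rightSide)
            Expand(child, words);
    }
}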

The most important classes of NatLan are the following:

CorpusTokenizer: splits a sentence into smaller units like words and punctuation marks.
CorpusParser: builds a tree structure from the tokens.
EquivalenceOfCategories: expresses that a pair of nonterminals is in the same grammatical category. The reason why something as simple as an equivalence deserves a separate object instance is that it may be applicable only with a certain likelihood and might depend on a certain context. The class EquivalenceOfCategories makes it possible to introduce context dependency.
Nonterminal: represents a grammatical category. Since we deal with context-free grammars, one corresponding production rule is associated with a Nonterminal as well. If alternative productions are desired, create another Nonterminal and express with an EquivalenceOfCategories instance that the categories are equal.
Form1: simple user interface.

Software

NatLan was developed in C#, with MS Visual Studio 2010 Express Edition. The C# programming language is very similar to Java. The source code is available from here.

The program expects a file corpus.txt in its executable directory. In it, sample sentences have to be provided, for example:

[[the] [little red robot]] [observed] [a] [[strange] [phenomenon]]
[[a] [lucky cat]] [chased] [[the] [[curious] dog]]

Square brackets suggest which parts of a sentence should be produced by a single nonterminal.
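
From the first line above, for example, rules of roughly the following shape could be extracted (the nonterminal names N1, N2, ... are invented here for illustration):

sentence -> N1 N2 N3 N4
N1 -> N5 N6
N5 -> 'the'
N6 -> 'little' 'red' 'robot'
N2 -> 'observed'
N3 -> 'a'
N4 -> N7 N8
N7 -> 'strange'
N8 -> 'phenomenon'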

Literature

[1] Frank, S.L., Bod, R. & Christiansen, M.H. How hierarchical is language use?
Proceedings of the Royal Society B, 2012
Online: http://rspb.royalsocietypublishing.org/content/early/2012/09/05/rspb.2012.1741.full.pdf (accessed Sept. 16th, 2013)
