Regular Expressions and Transducers for Natural Language Processing
The Artificial Intelligence Laboratory developped SLPToolkit, a toolkit for natural language processing,containing various routines for the treatment of natural language. One essential part of this toolkit is the finite state automaton (FSA) library, FSAs being are a very efficient way of representing and manipulating lexical data.
The automaton library in the toolkit was nearly complete, offering all necessary operations on FSAs one wants to perform when treating natural language. However, what was missing was a compact way of creating automata accepting a given language: specifying them as regular expressions.
Goal: integrate a regular expression module into SLPToolkit allowing to
Flex is a powerful tool for constructing lexical analyzers out of regular
expression language specifications.
I implemented a way of extracting the C-code-embedded FSA it produces and transforming it into a SLPToolkit automaton, so one has the full Flex regular expression support (character classes, aliases to build complex expression in a readable way, etc.) when creating automata in SLPToolkit. Just specify the desired language in Flex, run it and import its result as an automaton for SLPToolkit with my routine.
On the other hand, I implemented exportation of automata in form of regular expressions with a classical algorithm that is covered in (almost) every book dealing with automata theory. However, the regular expressions it produces are ones that are not quite so easy to read, especially if they have large subexpressions that are repeatedly used... Finding a suitable postreatment for making them "hierarchical", such as in Flex for example, would improve readability greatly and may be an idea for future work.
A fundamental concept in the treatment of natural language are the transducers, which are used for representing language morphology. Since they are a special kind of FSA, it is only natural to implement them when having access to a FSA library.
Goal: extend SLPToolkit by a finite state transducer (FST) module
I developped an appropriate FST data structure and started building a FST library based on the existing FSA routines in SLPToolkit. The current situation of the library is: