Regular Expressions and Transducers for Natural Language Processing

Author Martin Keschenau
Supervisor Dr. Jean-Cédric Chappelier
Location Winter semester 2000/2001
Artificial Intelligence Laboratory, Department of Computer Science, EPFL
Part I: Regular Expressions Part II: Transducers

Part I: Regular Expressions


The Artificial Intelligence Laboratory developped SLPToolkit, a toolkit for natural language processing,containing various routines for the treatment of natural language. One essential part of this toolkit is the finite state automaton (FSA) library, FSAs being are a very efficient way of representing and manipulating lexical data.

The automaton library in the toolkit was nearly complete, offering all necessary operations on FSAs one wants to perform when treating natural language. However, what was missing was a compact way of creating automata accepting a given language: specifying them as regular expressions.

Goal: integrate a regular expression module into SLPToolkit allowing to

Project Results

Flex is a powerful tool for constructing lexical analyzers out of regular expression language specifications.
I implemented a way of extracting the C-code-embedded FSA it produces and transforming it into a SLPToolkit automaton, so one has the full Flex regular expression support (character classes, aliases to build complex expression in a readable way, etc.) when creating automata in SLPToolkit. Just specify the desired language in Flex, run it and import its result as an automaton for SLPToolkit with my routine.

On the other hand, I implemented exportation of automata in form of regular expressions with a classical algorithm that is covered in (almost) every book dealing with automata theory. However, the regular expressions it produces are ones that are not quite so easy to read, especially if they have large subexpressions that are repeatedly used... Finding a suitable postreatment for making them "hierarchical", such as in Flex for example, would improve readability greatly and may be an idea for future work.


Part II: Transducers


A fundamental concept in the treatment of natural language are the transducers, which are used for representing language morphology. Since they are a special kind of FSA, it is only natural to implement them when having access to a FSA library.

Goal: extend SLPToolkit by a finite state transducer (FST) module

Project Results

I developped an appropriate FST data structure and started building a FST library based on the existing FSA routines in SLPToolkit. The current situation of the library is:

As we can see, there is still a large amount of work to do for the FST library which can be took up in future projects.

Martin Keschenau
Last modified: Mon Feb 12 15:49:32 MET 2001