Keywords: textual information retrieval, textual data visualization, distributional semantics, textual data analysis, curvilinear component analysis, probabilistic natural language processing, knowledge discovery, text mining
Contact Person: Martin Rajman
Phone: (+41 21) 693-5277
E-mail: Martin.Rajman@epfl.ch
Partners:
The main objective of this project is the study of efficient techniques for automatic user-sensitive structuring of large collections of textual data.
The general framework of the project is the extension of an existing distributional semantics based textual information retrieval system, the D-SIR system, jointly developed at the Ecole Nationale Supérieure des Télécommunications (ENST-Paris) and the Ecole Polytechnique Fédérale (EPF-Lausanne). The project will especially concentrate on the following issues:
- structuring of a textual information base according to implicit neighborhood properties (e.g. proximities) automatically extracted from the information base itself;
- dynamic adaptation of the produced structures to an implicit model of the user, automatically derived from the analysis of the user's interactions with the information system.
The long-term objective of the project is to progressively integrate enhanced structuring and visualization techniques, designed in the domains of Information Retrieval and Textual Data Analysis, into the framework of Text Mining and Knowledge Discovery in textual databases. Information structuring is essential because it is a necessary basis for efficient visualization of large amounts of textual data, which would otherwise be unmanageable in practice. Efficient visualization in turn strongly conditions the navigation possibilities offered to the user within the information base.
Moreover, since structuring is a central step towards information clustering and filtering, it is also necessary to give the user efficient access to information and to keep the user from being overwhelmed by huge amounts of retrieved data.
The proposed approach will focus on the automatic identification of neighborhood properties in the high-dimensional vector space that represents the information base in the distributional semantics model used in the D-SIR system.
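As an illustration of what a neighborhood property in such a vector space can look like, the following minimal sketch computes the nearest neighbors of each document under cosine similarity. It is not taken from D-SIR: the function names, the toy vectors, and the choice of cosine similarity are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(vectors, index, k):
    """Indices of the k documents most similar to vectors[index]."""
    scored = [
        (cosine(vectors[index], v), j)
        for j, v in enumerate(vectors)
        if j != index
    ]
    # Highest similarity first; ties are broken by the lower document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:k]]

# Hypothetical document vectors (for instance, co-occurrence based weights).
docs = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
]

# Neighborhood structure: each document mapped to its single closest neighbor.
neighbors = {i: nearest_neighbors(docs, i, 1) for i in range(len(docs))}
```

On this toy data the first two documents are each other's nearest neighbor, and so are the last two, which is the kind of implicit proximity structure the project aims to extract automatically.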
Techniques based on curvilinear analysis will be considered for extracting the intrinsic topological information from the textual base. Global topological properties will be used for the visualization of the base as a whole, whereas local topological properties will be considered for the visualization of portions of the base selected after retrieval or during navigation. Adaptation to the user will be introduced by integrating knowledge about interactions between the user and the information system into the global and local topology extraction. For example, knowledge such as lists of previously selected texts or a priori structured portions of the base will be taken into account.
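To make the idea of preserving local topology concrete, the sketch below evaluates a cost of the kind used in curvilinear component analysis: the squared mismatch between distances in the original space and distances in the projection, counted only for pairs that are close in the projection. It assumes the standard formulation with a step neighborhood function of radius `lam`; the function names and data are hypothetical, and the sketch only scores a given projection rather than optimizing one.

```python
import math

def pairwise_distances(points):
    """Matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def cca_stress(high, low, lam):
    """0.5 * sum over i != j of (X_ij - Y_ij)^2, restricted to pairs with Y_ij <= lam.

    X_ij are distances in the original space, Y_ij distances in the projection.
    The step function (Y_ij <= lam) is an assumed choice of neighborhood weighting.
    """
    X = pairwise_distances(high)
    Y = pairwise_distances(low)
    n = len(high)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and Y[i][j] <= lam:
                total += (X[i][j] - Y[i][j]) ** 2
    return 0.5 * total

# Three points on a line in 3-D, and two candidate 1-D projections.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
faithful = [(0.0,), (1.0,), (2.0,)]   # preserves every distance
scrambled = [(0.0,), (2.0,), (1.0,)]  # swaps the last two points

stress_faithful = cca_stress(high, faithful, 10.0)
stress_scrambled_global = cca_stress(high, scrambled, 10.0)
stress_scrambled_local = cca_stress(high, scrambled, 1.0)
```

A large radius scores the projection globally, while a small radius only penalizes mismatches between points that end up close together, which mirrors the distinction between global and local topological properties.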
User sensitivity brings considerable flexibility to the information structuring process, which can then integrate not only static structures identified a priori in the data but also dynamic knowledge representing user specificities.
The proposed approach will rely on the design of techniques that automatically translate knowledge about user/system interactions into a weighting scheme for the similarity measure used in the high-dimensional vector space representing the information base. In addition to the aspects mentioned above, Natural Language Processing techniques will also be used to pre-process the textual information base (lemmatization, identification of syntactic structures, ...) in order to improve the quality of the basic textual units manipulated by the information system.
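One possible reading of such a weighting scheme is sketched below: dimensions that are prominent in texts the user previously selected receive a larger weight in a weighted cosine similarity. The weighting rule, the `alpha` parameter, and all names are assumptions made for illustration, not the scheme designed in the project.

```python
import math

def feedback_weights(vectors, selected, alpha=1.0):
    """Hypothetical rule: boost dimensions that are prominent in the selected texts.

    Each weight is 1 + alpha * (mean value of the dimension over the selected
    texts, scaled so that the largest mean equals 1).
    """
    dims = len(vectors[0])
    if not selected:
        return [1.0] * dims
    mean = [sum(vectors[i][d] for i in selected) / len(selected) for d in range(dims)]
    top = max(mean)
    if top == 0.0:
        return [1.0] * dims
    return [1.0 + alpha * m / top for m in mean]

def weighted_cosine(u, v, w):
    """Cosine similarity in which each dimension d contributes with weight w[d]."""
    dot = sum(wd * a * b for wd, a, b in zip(w, u, v))
    norm_u = math.sqrt(sum(wd * a * a for wd, a in zip(w, u)))
    norm_v = math.sqrt(sum(wd * b * b for wd, b in zip(w, v)))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical base of two texts and a query equally close to both.
base = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
query = [1.0, 1.0, 1.0]

uniform = feedback_weights(base, [])   # no interaction history yet
adapted = feedback_weights(base, [0])  # the user previously selected text 0

before = [weighted_cosine(query, doc, uniform) for doc in base]
after = [weighted_cosine(query, doc, adapted) for doc in base]
```

Without interaction history the query is equally similar to both texts; once text 0 has been selected, the adapted weights rank it above text 1, which shows how the same vector space can yield a user-dependent structure.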
This project is funded by grant FNRS #21-50766.97.