From document to entity retrieval : improving precision and performance of focused text search


Rode, Henning (2008) From document to entity retrieval : improving precision and performance of focused text search. thesis.

open access
Abstract: Text retrieval has been an active area of research for decades. Several issues have
been studied over the entire period, such as the development of statistical models
for the estimation of relevance, or the challenge of keeping retrieval tasks efficient with ever-growing text collections. Especially in the last decade, we have also seen a diversification of retrieval tasks. Passage or XML retrieval systems allow a more focused search. Question answering or expert search systems
do not even return a ranked list of text units, but, for instance, persons with expertise on a given topic. This situation forms the starting point of this thesis, which presents a number of task-specific search solutions and tries to place them in more generic frameworks. In particular, we look at three areas: (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.
In the first case, we show how different types of context information can
be incorporated in the retrieval of documents. When users are searching for
information, the search task is typically part of a wider working process. This
search context, however, is often not reflected by the few search keywords
stated to the retrieval system, though it can contain valuable information for
query refinement. This work addresses two research questions related
to the aim of developing context-aware retrieval systems. First, we show
how already available information about the user’s context can be employed
effectively to obtain highly precise search results. Second, we investigate how
such meta-data about the search context can be gathered. The proposed
“query profiles” play a central role in the query refinement process. They
automatically detect necessary context information and help the user to explicitly
express context-dependent search constraints. The effectiveness of
the approach is tested with retrieval experiments on newspaper data.
When documents are not regarded as a simple sequence of words, but their content is structured in a machine-readable form, it is attractive to
try to develop retrieval systems that make use of the additional structure
information. Structured retrieval first requires the design of a suitable language
that enables the user to express queries on content and structure. We
examine whether and how existing query languages support
the basic needs of structured querying. However, our main focus lies on the
efficiency of structured retrieval systems. Conventional inverted indices for
document retrieval systems are not suitable for maintaining structure indices.
We identify base operations involved in the execution of structured queries
and show how they can be supported by new indices and algorithms on a
database system. Efficient query processing has to be concerned with the
optimization of query plans as well. We investigate low-level query plans of
physical database operators for the execution of simple query patterns. Furthermore,
we demonstrate how complex queries benefit from higher-level
query optimization.
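
To make the notion of a base operation in structured querying concrete, the following is a minimal sketch of a containment test between elements and term occurrences. It is not the index or algorithm developed in the thesis: the region encoding of elements as (start, end) positions, the function name `elements_containing_term`, and the sample data are all assumptions chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def elements_containing_term(elements, term_positions):
    """Return (name, count) for each element whose region holds the term.

    elements: list of (name, start, end) tuples; start and end are word
        positions delimiting the element (hypothetical region encoding).
    term_positions: sorted list of word positions where the term occurs.
    """
    result = []
    for name, start, end in elements:
        # Binary search for the slice of term positions inside [start, end].
        lo = bisect_left(term_positions, start)
        hi = bisect_right(term_positions, end)
        if hi > lo:
            result.append((name, hi - lo))
    return result


# Hypothetical document: one article with two sections and one paragraph.
elements = [
    ("article", 0, 20),
    ("sec", 1, 9),
    ("sec", 10, 19),
    ("p", 2, 5),
]
positions = [3, 12, 15]  # assumed occurrences of one query term
matches = elements_containing_term(elements, positions)
```

Here every element is probed separately with a binary search. A database-backed system would more likely evaluate such a containment condition as a join over sorted element and term tables, which is where dedicated indices and query plan optimization become relevant.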
New search tasks and interfaces for the presentation of search results,
like faceted search applications, question answering, expert search, and automatic
timeline construction, come with the need to rank entities instead of
documents. By entities we mean unique (named) existences, such as persons,
organizations or dates. Modern language processing tools are able to automatically
detect and categorize named entities in large text collections. In
order to estimate their relevance to a given search topic, we develop retrieval
models for entities which are based on the relevance of texts that mention the
entity. For this purpose, we introduce a graph-based relevance propagation framework
that makes it possible to derive the relevance of entities. Several options for
the modeling of entity containment graphs and different relevance propagation
approaches are tested, demonstrating the usefulness of the graph-based
ranking framework.
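
The idea of deriving entity relevance from the texts that mention an entity can be illustrated with a minimal one-step propagation sketch. It is only one simple possibility under stated assumptions, not the specific models evaluated in the thesis: the function name `propagate_relevance`, the weighting by mention count, and the sample scores are hypothetical.

```python
from collections import defaultdict


def propagate_relevance(doc_scores, containment):
    """Rank entities by the relevance of the documents that mention them.

    doc_scores: dict mapping document id to a retrieval score.
    containment: dict mapping document id to {entity: mention count},
        i.e. a bipartite document-entity containment graph.
    Each entity receives the sum of the scores of its containing
    documents, weighted by mention count (an assumed weighting).
    """
    entity_scores = defaultdict(float)
    for doc_id, mentions in containment.items():
        score = doc_scores.get(doc_id, 0.0)
        for entity, count in mentions.items():
            entity_scores[entity] += score * count
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(entity_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores and containment graph.
doc_scores = {"d1": 0.5, "d2": 0.25, "d3": 0.0}
containment = {
    "d1": {"Alice": 2, "Acme": 1},
    "d2": {"Alice": 1, "Bob": 1},
    "d3": {"Bob": 4},
}
ranking = propagate_relevance(doc_scores, containment)
```

In this example "Bob" is mentioned most often overall, but mainly in a document with zero relevance, so it ranks last. This shows why propagating document relevance differs from simply counting mentions.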
Item Type: Thesis
Electrical Engineering, Mathematics and Computer Science (EEMCS)

Metis ID: 250850