'Re: Natural language processing tech for the desktop!'

List:       kde-devel
Subject:    Re: Natural language processing tech for the desktop!
From:       Mario Fux <foxman () lugo ! ch>
Date:       2008-10-22 19:52:39
Message-ID: 200810222152.39374.foxman () lugo ! ch

On Tuesday, 21 October 2008 19:01, Jordi Polo wrote:

Good morning Jordi

> I __might__ be able to choose a project to work on for one year (I mean one
> year, every day, several hours a day). The project must be related to NLP
> (Natural Language Processing). I love KDE and have done some KDE programming
> before (still some todo items in the KDE 4.2 feature list... cough cough).
> I am pretty sure that something on the KDE desktop can take advantage of
> language processing. I just need to convince my teacher that working
> on that can be called research.
> Any idea will be welcome, crazy ideas very welcome (one year is a lot of
> time)...
>
> So far, asking in the IRC channel, I got a couple of them
>
> - KRunner as a natural language CLI. I already thought about this some time
> ago, but for most practical purposes a simple verb + object system would be
> enough IMHO, and that is not a one-year project for sure.
> - A semantic map: how semantically close pieces of information are; for
> instance, songs by the same artist are closer than songs from the same year.
> I think this is cool and will go into Nepomuk sometime if it isn't already
> in, but I fail to see the language processing (except for emails, text,
> PDFs, web pages and the like).
>
> - A conversational agent. A small guy in the corner of your Plasma desktop
> that makes conversation and recommends items related to what you are doing,
> or tips, etc. Maybe it could even learn what you like from what you do and
> then suggest music, blogs, or whatever other information.
>
> Any opinion about these ideas is very welcome.

That sounds very interesting. I study computer science (main focus: AI,
ALife, Semantic Web) and computational linguistics at the University of
Zurich (BTW, my main subject is education, but that's not relevant here)
and have several ideas for NLP stuff in KDE and Co.

See also my Summer of Code 2008 proposal, an NL file query system.

Here are some of my ideas:
- a general speech recognition infrastructure for KDE (or freedesktop.org?)
- a general speech synthesis infrastructure (KTTSD)
- a personalization infrastructure which learns about the user's preferences
and abilities (e.g. when the user only surfs English web pages, he or she
seems to speak English or can at least read it)
- natural language file or document queries
- an interaction agent based on AIML or ELIZA

These are only some of them.
Greetings
Mario

["GSoC2008-Proposal-NL-Queries.txt" (text/plain)]

Title:
------
Natural language (file) queries

Project:
--------
KDE

Abstract:
---------
As our desktops accumulate more and more files (documents, pictures, music,
movies, etc.) and it gets harder to find them, we need something to master
this growing chaos. For this purpose KDE provides Strigi (an indexer) and
NEPOMUK (a social semantic desktop framework). But wouldn't it be simpler to
"ask" the computer, and these tools, questions? I think so. But this is just
the user interface part.

The technical part is much more difficult. As computational linguistics has
shown over its decades-long history, there are many easier tasks than
understanding what a natural language sentence or question really means. You
need to recognize the word boundaries and analyse the sentence on both the
syntactic and the semantic level.

But I think (and want to prove ;-) that if we restrict ourselves to a certain
domain we can get good enough results with just pattern or word matching and
a sophisticated statistical model. The domain is set: (file) queries. We
don't need to understand the world, just things like "contains", "is
older/younger", "last Monday", "rated as good", etc.

So my proposal for a GSoC project is to develop and test a natural language
query system for KDE. As such systems depend largely on user feedback, I want
to begin with a user questionnaire that yields approx. 100 possible "NL
queries". Based on these results I want to develop a statistical word
matching model, which I'll test at the end of the period with another group
of people to see whether the model gives good enough search results for
everyday usage.

The work will consist of developing a tokenizer (if Qt has none, or none that
is good enough) and a parser or analyser. The analyser will have two parts:
first the word matching part, and second the part that checks how many words
were matched. Depending on these values the system starts the search,
reformulates the query and/or asks for another query.
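
A minimal sketch of the two analyser parts described above, written in Python
for brevity rather than in C++/Qt. The vocabulary and the function names are
hypothetical; the proposal intends to derive the real word list from the user
questionnaire.

```python
import re

# Hypothetical domain vocabulary, for illustration only.
KNOWN_WORDS = {
    "files", "contains", "older", "younger", "than",
    "last", "monday", "rated", "good",
}

def tokenize(query):
    """Split a query into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", query.lower())

def match_rate(tokens, vocabulary=KNOWN_WORDS):
    """Return the fraction of tokens that appear in the vocabulary."""
    if not tokens:
        return 0.0
    matched = sum(1 for token in tokens if token in vocabulary)
    return matched / len(tokens)
```

The rate returned by `match_rate` is the value the system would use to decide
whether to search, reformulate or ask for another query.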

Of course the whole framework will be modularized and pluggable, so that
tokenizers and analysers can be added for languages other than English and
German, which will be the first two supported ones.

Idea/proposal:
--------------

Additional information on the model:
------------------------------------

After evaluating or developing a tokenizer and a parser or analyser, I'll
concentrate on the word matching model. It will be based on the following
assumption: if the analyser matches more than a certain share of the words,
it searches for the files; if the matching rate lies in a band below that, it
reformulates the question and searches anyway; if the rate is lower still, it
reformulates and waits for user acknowledgment; and if the matching rate is
too low, it asks the user to rephrase the query. At a later date the analyser
should be able to learn, but this is not planned for this SoC.
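
The four cases above could be sketched roughly as follows. The threshold
values (0.8, 0.6, 0.4), the names, and the choice to treat each boundary as
inclusive are assumptions made for illustration; the proposal leaves the
actual levels open.

```python
from enum import Enum

class Action(Enum):
    SEARCH = "search"
    REFORMULATE_AND_SEARCH = "reformulate, then search"
    REFORMULATE_AND_CONFIRM = "reformulate, then wait for acknowledgment"
    ASK_AGAIN = "ask the user to rephrase"

def decide(rate, search_at=0.8, auto_at=0.6, confirm_at=0.4):
    """Map a matching rate in [0, 1] to one of the four actions.

    The default thresholds are hypothetical placeholders.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate >= search_at:
        return Action.SEARCH
    if rate >= auto_at:
        return Action.REFORMULATE_AND_SEARCH
    if rate >= confirm_at:
        return Action.REFORMULATE_AND_CONFIRM
    return Action.ASK_AGAIN
```

Keeping the thresholds as parameters would let them be tuned against the
questionnaire results instead of being fixed in code.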

The word matching will have several configurable layers. Beginning with the
recognition of punctuation and words like "and" and "or", it goes on to
quantifiers like "one", "all" or "7". After the first two layers the analyser
goes on to the verbs and nouns, which can be something like: "text contains
Psychology".
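
One possible way to sketch the layers is an ordered list of (label, test)
pairs, tried in sequence for each token. The layer labels, the word sets and
the "unmatched" fallback are hypothetical; only the examples "and", "or",
"one", "all", "7" and "text contains Psychology" come from the text above.

```python
import re

# Hypothetical verb/noun list for the third layer.
CONTENT_WORDS = {"text", "contains", "file", "picture"}

def is_structure(token):
    """Layer 1: punctuation and the connectives 'and' / 'or'."""
    is_punctuation = not any(ch.isalnum() for ch in token)
    return is_punctuation or token.lower() in {"and", "or"}

def is_quantifier(token):
    """Layer 2: quantifier words and plain numbers."""
    return token.lower() in {"one", "all"} or token.isdigit()

def is_content(token):
    """Layer 3: known verbs and nouns of the query domain."""
    return token.lower() in CONTENT_WORDS

# Layers are tried in order; reordering or extending this list is the
# "configurable" part of the sketch.
LAYERS = [
    ("structure", is_structure),
    ("quantifier", is_quantifier),
    ("content", is_content),
]

def analyse(query):
    """Label each token with the first layer that recognizes it."""
    tokens = re.findall(r"\w+|[^\w\s]", query)
    result = []
    for token in tokens:
        label = next((name for name, test in LAYERS if test(token)), "unmatched")
        result.append((token, label))
    return result
```

In this sketch a word such as "Psychology" stays "unmatched", which is where
it could be handed on as a literal search term.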

I've already been in contact with Sebastian Trueg of KDE and NEPOMUK; he is
interested in the idea and agreed to be my mentor (hi Sebastian ;-). About
working on my own and with a mentor: I'm used to working independently, as
most of my jobs (and my studies) have been like this. I plan to have regular
chat meetings (e.g. IRC) with my mentor, weekly status reports and perhaps
even a meeting at Akademy.

Schedule:
---------
April-May: As soon as I know that my proposal is accepted I'll begin to
develop the questionnaire and afterwards run the survey. The goal is to have
the results at the beginning of the SoC period. Another goal before the SoC
period begins is to have a working KDE 4 development environment and to
finish reading the necessary docs such as RDF, sources, Xesam documentation,
etc.
Available time: As I still have lectures at the end of April and in May, I'll
only work on the project in my spare time (see below for a definition of my
"spare time" ;-).

June:  Development of the model based on the questionnaire results. Afterwards
       the implementation of the parts (tokenizer, parser, analyser, etc.)
       begins.
       End of June: A working proof-of-concept query feature in Dolphin or
       Nepomuk's own search client.
       Available time: As I have tests and the start of a practical training
       this month, I plan to spend most of my spare time plus 10-20
       additional hours a week on the project.

July:  Implementation of a modularized natural language query framework for
       English and German.
       Available time: After the first two weeks of July (the last weeks of
       the practical training) I'll spend all my time on the project.

August: Evaluation of the new framework with "normal" users (no power users
        or (KDE) developers).
        Available time: The whole of August is planned for the project (with
        a possible Akademy visit ;-)

Disclaimer: Even though the European Football Championship takes place in
Switzerland and Austria (I'm Swiss) in June this year, this won't be a
problem as I'm no football fan ;-).

Definition of my spare time: 10-20 hours a week

Vision:
-------

After the SoC I'd like to contribute further to KDE, particularly in the area
of natural language processing. Possible extensions to the system would be a
way to configure the analyser model, and syntactic and semantic processing
and analysis.
Further ideas are a linguistic annotation tool with Wiktionary support, ELIZA
integration and other natural language frameworks such as speech recognition
or things built with NLTK.

Biography:
----------

I'm a 29-year-old Swiss guy who grew up in the Alps and studies in Zurich.
My studies at the University of Zurich are my second education, with
education as the main subject and minors in computer science (artificial life
and intelligence) and computational linguistics. I've been working with and
advocating Free Software for almost 10 years in different areas, with Free
Software in education as my area of expertise (see gg: +Mario +Fux ;-). I've
known KDE since the 1.x times and have been reading almost 20 KDE mailing
lists (technical ones as well, of course) for months. My first real-life
contact with the KDE crowd was at last year's Akademy in Glasgow, where I
gave a presentation about a small primary school which I migrated to the KDE
desktop. My programming skills are broad but not yet very deep (I've worked
with Basic, Pascal, Bash, C, C++, Java, etc.). At the moment I'm relearning
C++ and Qt and will use them in a university project as well. I see this
proposal and project as a serious start to developing KDE software.
