Skåne Sjælland Linux User Group - http://www.sslug.dk

A new way to handle the word lists



I am working on a new system for storing and manipulating word lists.
The work is not going very fast, though, so I would be grateful if
anyone would like to help me. A description of the system is attached
below. The code I have written so far (Perl+MySQL) can be downloaded from
http://217.215.183.103/XDtools-0.1.tar.gz

--
Göran


Document: eXtensible Dictionary (XD) Specification
Version: 0 -- USE ONLY TO CREATE SMALL PROTOTYPES
Status: Draft -- UNFINISHED, NOT READY FOR USE
Date: July 21, 2002
Author: Göran Andersson <sslug@sslug>


About This Version
==================

The purpose of this version of the specification (XDSv0) is to
provide a prototype system which can be used in small scale
projects until the first official version (XDSv1) is released.
The intention is that it shall be possible to automatically
convert databases created according to XDSv0 into databases
that comply with XDSv1; this, however, cannot be guaranteed.

This version of the specification describes only the core features,
and it is deliberately unspecific.  This keeps it flexible, so that
it can easily be reworked into something that is ready for
production work, i.e. for building real, large databases.


What Is XD?
===========

The XD specification describes how to store and manipulate
natural language data.

An "XD database" is a collection of data, describing a specific
natural language, ordered according to the XD specification.

An "XD implementation" is a set of programs/tools that complies
with the specification.  An implementation is used to manipulate
XD databases.  There may exist several implementations; databases
can be converted automatically from one implementation to
another without any (significant) loss of information.

XD databases will be usable by all sorts of programs that handle
text in natural languages: spell checkers, grammar checkers,
translation programs etc.  Of course they may also be used as
ordinary dictionaries.

Features
========

o Independent of any specific language.

o Extensible:

    Whenever there is a gap between the expressive power of
    the database and the information conveyed by a particular
    piece of data, the gap is easily detected.
    Thus a database stored according to the XD specification
    will be easily extensible. Data can be added in small
    pieces that are neither very precise nor complete.  Pieces
    of data that are not fully precise or complete can easily
    be refined.  The expressive power of the database can be
    increased at any time without invalidating the current
    data.

o Forked databases can easily be merged.

o XD databases require an awful lot of storage space.
  However, the database need not be installed locally;
  it may be used over the network.

XD Databases
============

An XD database is separated into two parts: on one hand,
the grammar data; on the other hand, the dictionary data.
The intention is to keep the grammar more stable than the
dictionary data.  Database forks that use the same grammar can
be merged automatically with few conflicts; forks that use
different grammars can be merged after first updating them
to use the same grammar.
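
As a rough illustration of why forks that share a grammar should merge
with few conflicts, here is a minimal Python sketch.  It assumes
(this is not part of the specification) that the dictionary data of
each fork can be reduced to a mapping from (word, node) pairs to a
boolean: True for "belongs", False for "known not to belong".  The
name merge_forks and the sample data are hypothetical.

```python
def merge_forks(fork_a, fork_b):
    """Merge the classification facts of two forks.

    Each fork is a dict mapping (word, node_name) -> bool.  This
    representation is an assumption made for illustration only.
    """
    merged = dict(fork_a)
    conflicts = []
    for key, value in fork_b.items():
        if key in merged and merged[key] != value:
            # The forks disagree: keep fork_a's value for now and
            # report the key so that a human can decide.
            conflicts.append(key)
        else:
            merged[key] = value
    return merged, conflicts


fork_a = {("visa", "noun"): True, ("long", "noun"): True}
fork_b = {("visa", "noun"): True, ("long", "noun"): False,
          ("long", "verb"): True}

merged, conflicts = merge_forks(fork_a, fork_b)
print(conflicts)
```

Facts present in only one fork are simply taken over; only direct
contradictions about the same word and node need manual attention.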


The Grammar
===========

The grammar consists of classification trees and inflectional paradigms. 

Classification Trees
--------------------
A "classification tree" is used to classify words.
Each tree has a name, and it may (and should!)
have an accompanying description.
There must exist at least one classification tree,
named "grammar", which is used to classify words according to
part of speech (e.g. noun or adjective).

Each classification tree contains a hierarchically ordered set of
"nodes".  Each node may be thought of as a class to which a word
may or may not belong.  Each node either has a "parent node"
which is contained in the same tree, or it is called a
root node.  There may exist several root nodes in each tree.
A node is not allowed to be an ancestor of itself.
Among all the children of the same parent node (and also among all
root nodes), there shall be defined an order, to be used e.g. when
presenting the list of children to a "user" of the database.
Each node may (and should) have an accompanying description.
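
The rules above can be sketched in Python as follows.  This is only
one possible in-memory representation, not something the
specification prescribes; the class and method names, and the
"proper noun" node, are made up for the example.

```python
class Node:
    """One class in a classification tree (hypothetical representation)."""

    def __init__(self, name, description=""):
        self.name = name
        self.description = description
        self.parent = None      # None means this is a root node
        self.children = []      # list order is the presentation order

    def ancestors(self):
        """Yield the parent, the grandparent, and so on up to a root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def add_child(self, child):
        # Reject links that would make a node an ancestor of itself.
        if child is self or child in self.ancestors():
            raise ValueError("a node may not be an ancestor of itself")
        if child.parent is not None:
            raise ValueError("node already has a parent")
        child.parent = self
        self.children.append(child)


class ClassificationTree:
    def __init__(self, name, description=""):
        self.name = name
        self.description = description
        self.roots = []         # several ordered root nodes are allowed

    def add_root(self, node):
        self.roots.append(node)


grammar = ClassificationTree("grammar", "classifies words by part of speech")
noun = Node("noun")
proper = Node("proper noun")
grammar.add_root(noun)
noun.add_child(proper)
```

Keeping children in a plain list gives the required order for free,
and the check in add_child enforces the no-cycles rule at the moment
a link is created.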


For example, an English XD database might contain a classification
tree named "vintage" containing nodes named "current", "contemporary",
"modern", "old-fashion", "archaic", and "historic". It may be ordered
like this:

       current         ("currently correct word")
         contemporary  ("ordinary, contemporary word")
           modern      ("recently invented word")
         old-fashion   ("old-fashioned and no longer frequently used")
         archaic       ("correct but archaic and nowadays seldom used")
       historic        ("word that has been, but is no longer, correct")

I.e. "current" is the first root node, "historic" the second; "contemporary",
"old-fashion", and "archaic" are the first, second and third children
of "current"; finally, "modern" is the first and only child of the
"contemporary" node.
A certain word might at first be classified as "current". Then it may
be decided that it is not "archaic" -- it could still be "contemporary"
or "old-fashion".  Finally it might be classified as "contemporary" but
not "modern". Now assume a new child node of "contemporary" is
created.  The word is still "contemporary" but not "modern"; however
it is yet undecided whether the word belongs to the newly created
class.
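
The story of this word can be replayed in a small Python sketch.  It
assumes that an implementation stores, per word, one set of nodes the
word belongs to and one set of nodes it is known not to belong to;
the specification does not prescribe this.  The node name "slang" is
made up to play the newly created child.

```python
# Parent of each node in the "vintage" tree; None marks a root node.
parent = {
    "current": None,
    "contemporary": "current",
    "modern": "contemporary",
    "old-fashion": "current",
    "archaic": "current",
    "historic": None,
}


def undecided_children(node, parent, belongs, excluded):
    """List the children of `node` that the word is neither known
    to belong to nor known not to belong to."""
    return [name for name, p in parent.items()
            if p == node and name not in belongs and name not in excluded]


# State of the word after the decisions described above.
belongs = {"current", "contemporary"}
excluded = {"archaic", "modern"}

before = undecided_children("contemporary", parent, belongs, excluded)

# A new child of "contemporary" is created; the word's data is untouched.
parent["slang"] = "contemporary"
after = undecided_children("contemporary", parent, belongs, excluded)

print(before, after)
```

Because negative decisions are stored explicitly, extending the tree
does not invalidate anything: the new node simply shows up as
undecided for every existing word.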

Inflectional Paradigms
----------------------
The grammar may contain "inflectional classes". Each inflectional
class has a name (e.g. "genitive singular" or "plural") and,
optionally, a description.
For each node in the grammar classification tree, there may exist
an ordered list of references to different inflectional classes;
such a reference means that a word belonging to the node will
(normally) have a wordform (an inflection) in that inflectional
class.

The grammar may contain "inflection templates" which can be used
to generate wordforms from a root word. E.g. in Swedish, we
might have a template named "(-a)an,or" which may be used to
generate the wordforms "visa", "visas", "visan", "visans", "visor",
"visors", "visorna", and "visornas" from the noun "visa".
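
The specification does not define the template syntax, so the
following Python sketch rests on a guessed reading of "(-a)an,or":
"(-a)" removes a final "a" to give the stem, "an" is the definite
singular ending, and "or" is the indefinite plural ending.  The
definite plural ending "na" and the genitive "s" are hard-coded
assumptions, and expand_template is a hypothetical name.

```python
import re


def expand_template(headword, template):
    """Expand a template such as "(-a)an,or" into eight noun forms.

    The interpretation of the template is an assumption; see the
    text above.
    """
    match = re.fullmatch(r"\(-(\w*)\)(\w+),(\w+)", template)
    if match is None:
        raise ValueError(f"unsupported template: {template!r}")
    strip, definite_sg, plural = match.groups()
    if not headword.endswith(strip):
        raise ValueError(f"{headword!r} does not end in {strip!r}")
    stem = headword[:len(headword) - len(strip)]

    bases = [
        headword,                  # indefinite singular
        stem + definite_sg,        # definite singular
        stem + plural,             # indefinite plural
        stem + plural + "na",      # definite plural (assumed ending)
    ]
    forms = []
    for base in bases:
        forms.append(base)         # basic form
        forms.append(base + "s")   # genitive (assumed ending)
    return forms


print(expand_template("visa", "(-a)an,or"))
```

Under these assumptions the call reproduces the eight wordforms
listed above, in the same order.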



Dictionary Data
===============

The dictionary data provides (i) words, and (ii) classification
of words according to classification trees in the grammar.

Words
-----
To be somewhat precise, we will speak of "HeadWords",
"AbstractWords", "Wordforms", and "DefinedWords".

A HeadWord is the main form of the word as would be listed in a
dictionary, not an inflection.

An AbstractWord is a variant of a HeadWord.  It contains a
reference to a node in the grammar classification tree.
E.g. the HeadWord "long" may be common to four AbstractWords:
a noun, a verb, an adjective, and an adverb.
Each set of AbstractWords that have a common HeadWord shall
be ordered in the XD database.  In fact, this is exactly what
ordinary dictionaries do:
	 long (I) noun: ... (II) verb: ... 

An AbstractWord may have a Wordform in each of the inflectional
classes that are applicable to the grammar node it belongs to.
It shall be possible to mark a specific Wordform as illegal
(i.e. to state that a particular AbstractWord does not have a
Wordform in a particular inflectional class).

Two different AbstractWords with a common HeadWord may
belong to the same grammar node if they have
different Wordforms or some other syntactic difference.

An AbstractWord is abstract in that it has only syntax, not
semantics.

A DefinedWord is an AbstractWord and a definition.
E.g. long (II): "To yearn for". Each AbstractWord may
have several definitions.
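
The relations between these terms might be modelled as in the Python
sketch below.  The field names are invented for the example, grammar
nodes are referred to by name for brevity, and the inflectional class
name "past tense" is an assumption; none of this is fixed by the
specification.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class HeadWord:
    text: str
    # Ordered like the (I), (II), ... entries of a printed dictionary.
    abstract_words: list = field(default_factory=list)


@dataclass(eq=False)
class AbstractWord:
    head: HeadWord
    grammar_node: str              # name of a node in the "grammar" tree
    # Inflectional class name -> Wordform; None marks an illegal form.
    wordforms: dict = field(default_factory=dict)
    definitions: list = field(default_factory=list)

    def __post_init__(self):
        self.head.abstract_words.append(self)


@dataclass(eq=False)
class DefinedWord:
    abstract: AbstractWord
    definition: str

    def __post_init__(self):
        self.abstract.definitions.append(self)


long = HeadWord("long")
long_noun = AbstractWord(long, "noun")
long_verb = AbstractWord(long, "verb", wordforms={"past tense": "longed"})
yearn = DefinedWord(long_verb, "To yearn for")

print([word.grammar_node for word in long.abstract_words])
```

Syntax (grammar node and wordforms) lives on the AbstractWord, while
each definition is a separate DefinedWord, so one AbstractWord can
carry several definitions.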

Classification
--------------

A classification of a word is a pair of references; one reference
to a word in the dictionary data, and one reference to a node
in some classification tree.

The grammar classification tree shall be used to classify
AbstractWords. Other classification trees will in most cases
classify DefinedWords.

Implementations shall (at least try to) keep track not only of
which class a word belongs to, but also of which child classes
(including grandchildren) it is known not to belong to.


 