dbcreator command line

DESCRIPTION

This perl script creates database or tables for Cyclotis database. This is not a cgi/module : it must be run in the machine hosting the db by a user (or a script) with the correct rights.

USAGE

The general format for the command line is perl dbcreator.pl [command] [database-name] [other parameters]

MODES

First, you must decide in which mode the database will work.

The mode decides which field Postgres may add to every line in order to make the distinction between tables.

To do this, the best solution is to define CYCLOTIS_MODE environment variable. You can also give the mode as first parameter of dbcreator.pl, before command name. However, do not forget that this must be decided before you create the first table, and using dbcreator.pl on an already created database with a different mode may cause errors.

Existing modes:

by_void

Tables contain no discriminant at all, so takes less space on disk. This mode is ideal for direct access with your favorite plugin (since in this mode each connected client always access to the same table) unless you make use of inheritance and want to know, when you retreive a result from a parent table, from which child table it comes from.

by_id::noRules

Tables contain an integer as discriminant. All lines contain this value, so it takes more space on disk.

This mode is ideal if you make direct connections but with use of inheritance : when you do a query on a parent table, data coming from child tables are identified by their ID and joining with meta_info you can retreive all info such as child table name.

by_id::withRules

Tables contain an integer as discriminant. All lines contain this value, so it takes more space on disk.

This mode is ideal if you use the REST server, with a lot of tables: thanks to the id, the server can minimize the number of different SQL queries it has in RAM. Inheritance with personal tables is supported but implies lot of work at database side, so it is not ideal. Note that you must use Postgresql 9.2 or later to use this.

by_path:noLangs

Tables contain a string as discriminant. All lines contain this value, so it takes more space on disk.

This mode is ideal if you use the REST server, with a frequent use of personal table inheritance: when querying a parent table, the path is used to make the distinction between entries coming from different children, even if you call it from the REST service.

Path contains only qualified table names.

by_path:withLangs

Tables contain a string as discriminant. All lines contain this value, so it takes more space on disk.

Compared to "by_path::noLangs", it adds also languages in the path. Useful only if you use inheritance by language (for example table for en-fr inherits from en-all)

by_code

Tables contain a 64-bit integer as discriminant. You can see it as an intermediate between by_id::withRules (which takes less space but has poor support for inheritance of personal tables) and by_path (which supports inheritance of personal tables but takes lot of space in the disk)

Note: by_code is an abstract mode, you have to write a subclass depending on the tree you want to create. The two following modes are samples of by_code subclasses.

by_code::theme

In this mode, 3 themes are defined : agriculture, transport and technology. Each memory inherits from one of these themes. Each theme can contain up to 1 billion memories.

You can change the list of themes by looking in the code and changing @THEMES array. But do it before any table creation.

by_code::lang

In this mode, the tree is built upon source and target language : any new table inherits from MEM_%SRC_%TRA where %SRC and %TRA are replaced by source and target language.

Limited to 5 languages and 1 billion tables per language pair, but you can easily change this inside the code (remember that this mode is only an example).

Note : this mode will die if you did not specify source or target language.

Database name

By default the name you type is considered as the catalog name, and we suppose that server is on localhost, port 5432, connecting as current user without password.

If it is no true, you can use the following format:

  • a host, separated by '/' (for example, h/base means 'dbname = base', host = 'h')

  • a port, separated by ':'

    For example:

    ':5433/base' means 'dbname = 'base', port=5433;

    'h:5433/base' means 'dbname = base', host = 'h', port=5433

  • a user, separated by @

    For example

    me@base means 'user = me, dbname = base'

    me@h/base means 'user = me, dbname = base, host = h'

  • user password, separated from user by :

    me:123@base means 'user = me, password = 123, dbname = base'

COMMANDS

perl dbcreator.pl create-db [database name]

Creates an empty database with only procedures loaded

perl dbcreator.pl create-schema [database] [schema-name]

Creates a Postgresql schema (i.e. namespace for tables)

Since any DBI pool (dancer's, Apache's, etc.) use one connection per database, even in the same server, it is more efficient to have only one database with all tables accessed by the same server via HTTP (the situation is different if you use direct Postgresql access).

On the other hand, all are accessed by the same Postgresql user, so it makes impossible to define privileges so they must be defined at HTTP level.

perl dbcreator.pl create-table [database] [table] [parents]

Creates the table, and its parents if not yet created.

[database] uses the same format as previous command.

Values for parent tables (you can inherit from two tables, separated by ','):

MEM

(standard, without context or modification data) : by default.

MEMUP

(memory with modification author/date) : may be associated to a rule to keep one version of a segment

MEMX

(memory with properties and note): stores same content as a TMX

MEMID

(memory with context): stores an information to distinguish segments Context is not an identifier: the identifier is (src,context)

GLOS

(with context and note, but no author/date): contains a glossary, for terminology

or

any already created table

In case you use MEMX or MEMID, you have also the possibility to add one of the two above parameters :

props=(json|hstore|text)

Use another type for properties. Possible values are:

text (default)

Properties are stored in a TEXT column. Format is similar to URL. This format preserves order and supports multiple properties.

hstore

Uses Postgresql's type HSTORE. Loads the extension in the database, it not yet existing. This binary format has better performances than pure text, but does not preserve order, and does support only one value per property.

json

Uses Postgresql's tpe JSON. Requires Postgresql 9.2 or more. This format preserves order and supports multiple properties (this is non-strict JSON value). It is a text format, but more readable than the first type.

tool_info=(json|hstore|text|null)

Adds a column which will contain data specific to CAT clients. Format is the same as for properties, except that default value is null, meaning that we do not want to create this column at all.

context=(text|char\d|varchar\d|int)

Type for the context field.

For context, you can use "char" or "varchar", with a number of characters. Be sure your client uses the correct size.

You can also use an int, in which case the context will be reduced to its Java hash code.

Structured types (json or hstore) are not usable because Postgresql does not implement "=" operator for them.

In this case, if the standard tables already exist, and if almost one column type does not match parameters, new tables with names suffixed by the first letter of type names will be created.

You can also specify scrLang and traLang which will be stored in meta_info table.

perl dbcreator.pl create-rule [database] [table] [keyfield] [action]

Creates a rule associated to the given table, saying in which situation we consider a segment as duplicate which should be treated as an update (instead of an insertion).

[database] uses the same format as previous command. Table name can contain a schema prefix, else we use 'public'.

Possible keyfields:

src

keeps only one version for each source text (i.e. if anybody inserts a segment with same source, it will replace the previous one)

author

keeps only one version for each source+author pair (i.e. if the same author inserts a segment with same source, it will replace the previous one)

context

keeps only one version for each source+context pair (i.e. if anybody inserts a segment with same source and same context, it will replace the previous one)

Possible actions:

replace-only

A new line with same value for key fields will replace the old one. Deletions are physically applied.

replace-null

A new line with same value for key fields will replace the old one. Deletions are equivalent to setting translation to null.

This option is useful for applications which use 'all' query to build a cache (like OmegaT in Project mode) as it enables to retreive deleted segments as well.

version

Keep both versions of the segment (old and new one), with changedate as a discriminant. Deletions are equivalent to affect a changedate without creating a new version.

Next parameters indicate what to do with properties or notes in case the rule implies to update a line instead of insertion.

Default is to replace the properties/notes as well, so what the user will read is the very last inserted properties and notes. Alternatives include:

concat

Only if the column is of type TEXT or VARCHAR, the new value is the concatenation of old and new value.

merge

Only if the value is of type HSTORE, the new value is a key-per-key replacement. Values which are not in the new entry are kept as is.

keep

Do not replace the value, keep the old one.

perl dbcreator.pl create-function [database] [function-name]

Creates one of the standard Cyclotis SQL functions. As they are not necessary all the time they are not created by default.

Existing functions:

tokenize

cut string by words, and transform them to stems (without grammatical suffixes)

score

alternative score based on levenshtein distance

perl dbcreator.pl create-index [database] [table] [idxname] [lang]

Creates an index for the given table

Loads GIST extension if necessary

Values for idxname:

fuzzy
trigram GIST index in the source
concSrc
stemming index of the source
concTra
stemming index of the translation

operations indicate if you want to index not the field itself but a calculated value. Possibilities include (concatenable using ',')

untag
Removes tags in text; next parameter can indicate which regular expression you consider as a tag; defaults to <.+?>
tokenize
Postgresql will replace string by succession of stems (terms without grammatical suffixes)

lang parameter used only for concordance indexes: it contains language name in english; by default the value is 'simple'.

perl dbcreator.pl create-(ispell-)?dict [database] [langName] [dict-file]? [affix-file]? [stop-words]?

Loads an ispell dictionnary

The files must already be in tsearch_data directory from Postgres.

perl dbcreator.pl create-text-search-config [database] [langName] [origin] [token=dictionnaries]+

Create a text search configuration, usable in concordance queries

perl dbcreator.pl exec-script [database] [script file]

The file is a succession of commands, similar to the previous ones, except that we do not repeat database name. create-db is not scriptable.

Sample script:

    create-schema mySchema
    create-table mySchema.table1 memx
    create-rule mySchema.table1 author replace-only
    create-index mySchema.table1 date
    create-index mySchema.table1 concTra fr
    create-table mySchema.table2 memup