Culter is a library that implements segmentation algorithms:

  1. A very simple segmenter which handles the usual sentence endings (dot, ! or ?) but supports no rules
  2. An implementation of the SRX format
  3. The Culter Segmentation Compatible Format
  4. The Culter Segmentation Extended Format
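To illustrate the first item, here is a minimal sketch of a rule-less segmenter in plain Ruby. It is not Culter's actual API: the method name `simple_segments` is hypothetical, and the sketch assumes a segment ends at a dot, ! or ? followed by whitespace.

```ruby
# Hypothetical helper, not part of Culter: split a paragraph after
# ".", "!" or "?" whenever whitespace follows. No rules, no exceptions.
def simple_segments(paragraph)
  # The lookbehind keeps the punctuation attached to the preceding segment.
  paragraph.split(/(?<=[.!?])\s+/)
end

simple_segments("Hello world. How are you? Fine!")
# => ["Hello world.", "How are you?", "Fine!"]

# Without rules, abbreviations are split too:
simple_segments("Dr. Smith arrived.")
# => ["Dr.", "Smith arrived."]
```

The second call shows why rule-based formats such as SRX are needed: an exception rule is required to prevent the break after "Dr.".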

The package contains the libraries, written in Ruby, which you can use in your own projects under the terms of the EUPL. It also includes some small programs:

  • culter : a small script which acts as a filter between standard input and standard output, treating each line as a paragraph;
    this script has a verbose mode which reports which rule is applied at each position, which helps to understand why an SRX or CS[CE] file gives unexpected results;
  • culter-conv : a script which converts between SRX and CSCX. During conversion you can also
    • Uncascade a file, i.e. expand every language map by copying in the entries which would otherwise have come later in the cascade. This is useful for compatibility with CAT tools which do not support cascading, or simply to check that the rules are applied in the order you expected!
    • When converting from CS formats to SRX, select between
      • Human mode: produces multiple small rules, which are easier for a human to read and modify
      • Machine mode: puts all items in a single rule, which is harder to read but which most SRX segmenters process faster
    • When converting to CSCX, supply a model containing rule templates: the script then checks an existing large set of rules and tries to convert them into applications of a rule template
  • ensis : a GUI for testing segmentation rules.
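The difference between the two SRX output modes can be sketched with plain Ruby regular expressions. This is only an analogy under stated assumptions, not the output of culter-conv: the constants and method names below are hypothetical, and each "rule" is reduced to a pattern which forbids a break after an abbreviation.

```ruby
# Hypothetical "human mode": one small no-break pattern per abbreviation,
# easy to read and to edit individually.
HUMAN_RULES = [/\bMr\.\z/, /\bDr\.\z/, /\betc\.\z/]

# Hypothetical "machine mode": the same items merged into one alternation,
# so a segmenter has a single pattern to evaluate.
MACHINE_RULE = Regexp.union(HUMAN_RULES)

# `before` is the text preceding a candidate break position.
def no_break_human?(before)
  HUMAN_RULES.any? { |rule| rule.match?(before) }
end

def no_break_machine?(before)
  MACHINE_RULE.match?(before)
end

no_break_human?("See Dr.")    # => true
no_break_machine?("See Dr.")  # => true
no_break_machine?("The end.") # => false
```

Both forms accept exactly the same inputs; only readability and the number of patterns to evaluate differ.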

About the license: the Ruby code is under EUPL 1.1, like most of our programs. The schemas are under the Creative Commons Attribution-NoDerivatives license: feel free to make your own implementation of our formats, but if you plan to improve the schemas, please discuss it with us first.

Source code: 
