Segmentation

Segmentation (Wikipedia) is the process of dividing text into meaningful units. In most cases the meaningful unit is a sentence, which is why Soluling uses sentences as the default segments. Segmentation rules specify how text is split into segments. Soluling uses the SRX (Wikipedia) standard to represent segmentation rules. There are three kinds of rules:

Rule kind Description
Break A rule that specifies a segment boundary. If the rule matches, all exception rules are checked as well. If no exception matches, the break rule sets a segment break.
Break no exception Like a break rule, but no exceptions are possible. This rule is much faster than a standard break rule, which must also check every exception rule.
Exception An exception to a standard break rule. For example, "He is Dr. Muller." is one segment, but without exception rules a period rule would break it into two segments: "He is Dr." and "Muller.". To prevent this, specify an exception rule that stops "Dr." from being treated as a segment boundary.
Exception rules are most often language specific, so each language has its own set of exception rules.
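
The interplay of the three rule kinds can be illustrated with a minimal Python sketch. It is not Soluling's implementation: the `Rule` class, the `segment` function and the regular expressions are hypothetical names chosen for illustration. They only loosely mirror the SRX idea of a pattern before and a pattern after a candidate break position.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: patterns that must match around a candidate position."""
    before: str      # regex that must match the text ending at the position
    after: str = ""  # regex that must match the text starting at the position


def _matches(rule, text, pos):
    """Return True if the rule matches around position pos."""
    before_ok = re.search(f"(?:{rule.before})\\Z", text[:pos]) is not None
    after_ok = re.match(rule.after, text[pos:]) is not None
    return before_ok and after_ok


def segment(text, breaks, exceptions=(), breaks_no_exception=()):
    """Split text into segments using break, exception and break-no-exception rules."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        # "Break no exception" rules break immediately; exceptions are never checked.
        is_break = any(_matches(r, text, pos) for r in breaks_no_exception)
        if not is_break and any(_matches(r, text, pos) for r in breaks):
            # A standard break counts only if no exception rule matches here.
            is_break = not any(_matches(r, text, pos) for r in exceptions)
        if is_break:
            segments.append(text[start:pos].strip())
            start = pos
    segments.append(text[start:].strip())
    return [s for s in segments if s]


# Illustrative rules (assumptions, not Soluling's built-in rules).
period = Rule(before=r"[.?!]", after=r"\s")  # break after sentence punctuation
dr = Rule(before=r"\bDr\.", after=r"\s")     # exception: "Dr." is not a boundary
newline = Rule(before=r"\n")                 # break no exception: always break after a newline

text = "He is Dr. Muller. He lives in Berlin."
print(segment(text, [period]))
# ['He is Dr.', 'Muller.', 'He lives in Berlin.']
print(segment(text, [period], exceptions=[dr]))
# ['He is Dr. Muller.', 'He lives in Berlin.']
```

The sketch checks every position against every rule, which is simple but slow. It also shows why a "break no exception" rule is cheaper: once it matches, the exception list is skipped entirely.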

Soluling has built-in segmentation rules. The rules cover the most common languages and the most common exceptions. These rules act as a base for your own rules. You can add new ones on a product or source level.

Soluling uses segmentation in the following situations: translation memory, text localization and subtitle localization.

Translation memory

When you create a translation memory, you specify whether the memory should use segmentation. By default it does. If segmentation is used, each text is first broken into segments, and the segments are then added to the translation memory one by one. Using segmentation makes your translation memory much more usable: because it stores short segments instead of long texts, the translation memory can find much better matches, which helps you reuse translations.
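
Why short segments give better matches can be seen in a small Python sketch. It is only an illustration under stated assumptions: `TranslationMemory` and `split_sentences` are hypothetical names, the splitter is a naive stand-in for real segmentation rules, only exact matches are looked up, and the source and target texts are assumed to contain the same number of sentences in the same order.

```python
import re


def split_sentences(text):
    """Naive stand-in for segmentation rules: break after ., ? or ! plus whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


class TranslationMemory:
    """Hypothetical exact-match translation memory."""

    def __init__(self, use_segmentation=True):
        self.use_segmentation = use_segmentation
        self.entries = {}  # source unit -> target unit

    def _units(self, text):
        return split_sentences(text) if self.use_segmentation else [text.strip()]

    def add(self, source, target):
        src_units, tgt_units = self._units(source), self._units(target)
        if len(src_units) != len(tgt_units):
            # Assumption: if the segment counts differ, store the pair as a whole.
            src_units, tgt_units = [source.strip()], [target.strip()]
        for src, tgt in zip(src_units, tgt_units):
            self.entries[src] = tgt

    def lookup(self, text):
        """Return each unit of text with its stored translation, or None."""
        return {unit: self.entries.get(unit) for unit in self._units(text)}


source = "Open the file. Click Save."
target = "Öffnen Sie die Datei. Klicken Sie auf Speichern."
new_text = "Click Save. Close the window."

segmented = TranslationMemory(use_segmentation=True)
segmented.add(source, target)
print(segmented.lookup(new_text))
# {'Click Save.': 'Klicken Sie auf Speichern.', 'Close the window.': None}

whole = TranslationMemory(use_segmentation=False)
whole.add(source, target)
print(whole.lookup(new_text))
# {'Click Save. Close the window.': None}
```

With segmentation the sentence "Click Save." is found again even though the surrounding text differs. Without segmentation the whole text is the only key, so nothing can be reused.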

Text data

If you localize plain text, you can either read the whole text as a single row or break it into several segments and add each segment as its own row. If your text is very short, either way is usable. However, if your text is long, having it all in a single row can make translation difficult. In that case it is better to turn on segmentation.

Subtitles

When localizing subtitle files, you have the option to use segmentation.