
Segmentation

Segmentation (Wikipedia) is the process of dividing text into meaningful units. In most cases, the meaningful unit is a sentence, so Soluling can break text into sentence-level segments. Segmentation rules specify how text is split into segments. The SRX (Wikipedia) standard is used to represent segmentation rules. There are three kinds of rules:

Rule kind Description
Break A rule that specifies a segment boundary. If the rule matches, all exception rules are also checked. If no exception matches, the break rule sets a segment break.
Break no exception Like Break, but exceptions are not checked. This rule is much faster than the standard break rule, which must also check every exception rule.
Exception An exception to a standard break rule. For example, "He is Dr. Muller." is one segment, but without exception rules, a period rule would break it into two segments: "He is Dr." and "Muller.". This is not what we want, so we specify an exception rule that prevents "Dr." from being treated as a segment boundary.
Exception rules are most often language-specific, so each language has its own set of exception rules.
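
The interaction between the three rule kinds can be illustrated with a minimal Python sketch. This is not Soluling's implementation: the `Rule` class, the `segment` function, and the regular expressions in `RULES` are hypothetical names and patterns chosen for illustration, loosely modeled on the before-break and after-break patterns that SRX rules use. Trimming whitespace around segments is also a simplification.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule model; kind is 'break', 'break_no_exception' or 'exception'."""
    kind: str
    before: str      # regex that must match the text ending at the candidate position
    after: str = ""  # regex that must match the text starting at the candidate position

    def matches(self, text: str, pos: int) -> bool:
        before_ok = re.search(f"(?:{self.before})\\Z", text[:pos]) is not None
        after_ok = re.match(self.after, text[pos:]) is not None
        return before_ok and after_ok


def segment(text: str, rules: list[Rule]) -> list[str]:
    breaks = [r for r in rules if r.kind in ("break", "break_no_exception")]
    exceptions = [r for r in rules if r.kind == "exception"]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in breaks:
            if not rule.matches(text, pos):
                continue
            # Only the standard break rule pays the cost of checking exceptions.
            if rule.kind == "break" and any(e.matches(text, pos) for e in exceptions):
                continue
            piece = text[start:pos].strip()
            if piece:
                segments.append(piece)
            start = pos
            break
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


# Illustrative rules only: break after sentence punctuation followed by
# whitespace, except when the punctuation belongs to the abbreviation "Dr.".
RULES = [
    Rule("exception", r"\bDr\.", r"\s"),
    Rule("break", r"[.?!]", r"\s"),
]

print(segment("He is Dr. Muller. She is an engineer.", RULES))
# ['He is Dr. Muller.', 'She is an engineer.']
```

Removing the exception rule from `RULES` makes the same text break after "Dr.", which is the unwanted result described above.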

Soluling has built-in segmentation rules. The rules support the most common languages and the most common exceptions. These rules act as a base for your own rules. You can add new ones at the product or source level.

Soluling uses segmentation in the following situations: translation memory, text localization, and subtitle localization.

Translation memory

When you create a translation memory, you specify whether the memory should use segmentation. The default is to use segmentation. If segmentation is used, each text is first broken into segments, and only then are the segments added to the translation memory one by one. Using segmentation makes your translation memory much more usable. Because it stores short segments instead of long texts, the translation memory can find much better matches, which helps you reuse translations.

Text data

If you localize plain text, you have the option to read the whole text as a single row or to break the text into several segments and add each segment as its own row. If your text is very short, either way is usable. However, if your text is long, having it all in a single element could make translation difficult. In that case, it is better to turn on segmentation.

Subtitles

When localizing subtitle files, you have an option to use segmentation.