Text File Localization

A text file (Wikipedia) is a file that contains characters. It can contain any number of lines, and each line can contain any number of characters. A text file contains either plain text or structured text. A plain text file can contain any characters in any order; in most cases such files are documents that contain running text. A structured text file contains one or more records, and each record contains one or more fields. Soluling can localize both types.

The simplest way to scan a plain text file is to read the complete file into one Soluling row. This works well when the file is small and the string in the row stays relatively short. When the file gets bigger, storing everything in a single row becomes impractical, and it is better to use segmentation.

Unlike plain text files, defined (structured) text files are more like databases. They contain records, and each record contains fields. Some fields, such as string fields, should be localized. Others, such as id fields, should not. Soluling uses text definitions to specify the structure of a defined text file. The Soluling localization tool and service support text files and data.

Localization Process of Plain Text Files

When we localize a plain text file, Soluling reads the whole file into one string and inserts that string into the project. Here is an example.

Sports

1. Soccer
Soccer is a sport played between two teams of eleven players with a spherical ball. Team contains 10 field players and a goalie. Soccer originates from England.
2. Ice hockey
Ice hockey is a team sport played on ice, in which skaters use sticks to direct a puck into the opposing team's goal. Team contains 5 field players and a goalie. Ice hockey originates from Canada.
3. Basketball
Basketball is a team sport in which two teams of five players try to score points by throwing a ball through the top of a basketball hoop while following a set of rules. Team contains 5 field players. Basketball originates from United States.

When you create a new project that contains the above file, leave Scan mode set to Scan as single string, which is the default value.

Project Wizard

When Soluling creates the project there is only one row that contains the complete text file.

Project

This file is quite small, so everything fits into a single cell. However, if the file gets larger, this kind of localization becomes impractical. In that case it is better to use segmentation.
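As a rough illustration of plain scanning, the sketch below turns a complete file into a single row. The `Row` class and the `scan_as_single_string` function are hypothetical names made up for this example; they are not part of Soluling.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One translatable row in a project (hypothetical structure)."""
    context: str  # where the string came from
    source: str   # the text to translate


def scan_as_single_string(text: str, context: str) -> list[Row]:
    """Plain scanning: the complete file content becomes one row."""
    return [Row(context=context, source=text)]


sample = (
    "Sports\r\n\r\n"
    "1. Soccer\r\n"
    "Soccer is a sport played between two teams of eleven players.\r\n"
)
rows = scan_as_single_string(sample, context="Plain.txt")
print(len(rows), len(rows[0].source))
```

The length of the single source string grows with the file, which is why this mode only suits small files.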

Localization Process of Segmented Text Files

When we localize a segmented text file, Soluling reads the whole file into one string and then breaks it into segments. Each segment is added as a separate row in the project. Here is an example.

Sports

1. Soccer
Soccer is a sport played between two teams of eleven players with a spherical ball. Team contains 10 field players and a goalie. Soccer originates from England.
2. Ice hockey
Ice hockey is a team sport played on ice, in which skaters use sticks to direct a puck into the opposing team's goal. Team contains 5 field players and a goalie. Ice hockey originates from Canada.
3. Basketball
Basketball is a team sport in which two teams of five players try to score points by throwing a ball through the top of a basketball hoop while following a set of rules. Team contains 5 field players. Basketball originates from United States.

When you create a new project that contains the above file set Scan mode to Use segmentation.

Project Wizard

When Soluling creates the project, the text file is split into segments, and each segment gets its own row.

Project

Instead of the single row of the plain scan mode, we now have several rows. At first this looks right, but the segmentation is not completely correct. There are two errors. First, there should be a segment boundary after each new line. Second, the header numbers (1., 2., and 3.) generate a segment break where in this case there should not be one. Fortunately, Soluling uses the standard segmentation rules (SRX), and we can easily modify the rules. Initially, Soluling assigns the default rules that catch most breaks and handle most exceptions. To edit the rules, right-click Plain.txt in the project tree and choose Properties.

A segmented text project

The dialog shows the current rules. Soluling contains a set of so-called default rules. Some of those rules are language independent and some are language specific. Because the original language of Plain.txt is English, the rules are a combination of the language-independent rules and the English rules. To add a rule, click the Add button. There are two kinds of rules. The first is a normal break. The second is an exception to a normal break. When a normal break matches, the segmentation engine also checks whether there is an active exception for that case. If there is, no break is made. Exception checking takes some time. To avoid it, you can mark the break as being without exception. In our sample the new line break has no exceptions, so let's select Break no exception. The Before break and After break fields contain the regular expressions that specify the break or exception. Enter \r\n into the Before break field. Finally, you can give the rule a description by entering any text in the Description field.

New line rule
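To make the effect of the new line rule concrete, here is a minimal sketch in Python of a single break rule with a before expression of \r\n and an empty after expression. The function `split_on_break` is a hypothetical helper written for this example, not Soluling's engine. Trimming whitespace and dropping empty segments are assumptions of the sketch.

```python
import re


def split_on_break(text: str, before: str, after: str = "") -> list[str]:
    """Break after every match of `before` that is followed by `after`."""
    # The lookahead keeps the after expression out of the consumed text.
    rule = re.compile(f"(?:{before})(?=(?:{after}))")
    segments, start = [], 0
    for match in rule.finditer(text):
        segments.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        segments.append(text[start:])
    # Assumption: surrounding whitespace is trimmed and empty segments dropped.
    return [s.strip() for s in segments if s.strip()]


text = "Sports\r\n\r\n1. Soccer\r\nSoccer is a sport.\r\n"
segments = split_on_break(text, before=r"\r\n")
print(segments)
```

With this rule alone, every line of the file ends up in its own segment.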

Adding an exception is similar. The exception must catch <beginning of line>N.<white space>, where N is a number. The following image shows the rule settings.

Header number rule
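The interplay between break rules and an exception can be sketched as follows. This is a simplified model that assumes SRX-style evaluation, where rules are tried in order at each position and the first rule that matches decides. The `Rule` and `segment` names are hypothetical, the sentence rule is only a stand-in for the default rules, and the header expression `^\d+\.` is one possible way to express "beginning of line, number, period"; the exact expression in the dialog may differ.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True = break, False = exception (no break)
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position


def segment(text: str, rules: list[Rule]) -> list[str]:
    compiled = [
        (r.is_break,
         re.compile(f"(?:{r.before})\\Z", re.MULTILINE),
         re.compile(r.after))
        for r in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        head = text[:pos]  # quadratic, but fine for a small sketch
        for is_break, before, after in compiled:
            if before.search(head) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


header_exception = Rule(False, r"^\d+\.", r"\s")
sentence_break = Rule(True, r"[.?!]", r"\s")   # stand-in for a default rule
new_line_break = Rule(True, r"\r\n", r"")

text = ("1. Soccer\r\nSoccer is a sport. Team contains 10 field players.\r\n"
        "2. Ice hockey\r\n")
with_exception = segment(text, [header_exception, sentence_break, new_line_break])
without_exception = segment(text, [sentence_break, new_line_break])
print(with_exception)
print(without_exception)
```

Without the exception the header number is split off as its own segment; with the exception placed before the break rules, the header line stays together.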

Now our rules are complete, and the source dialog looks like this.

Segmented project

The two rules we added are custom rules, which is why they are shown at the top of the list and drawn in a bold typeface. Finally, click OK to close the source dialog. Soluling now detects that you have changed the properties of the source in a way that affects scanning, so it shows a message box that recommends rescanning. Click Yes to rescan, and our file is finally segmented correctly.

The final segmented project

Fine-tuning the segmentation rules to match your file requires some work, but compared to plain scanning, segmented scanning gives you many advantages. The project is easier to edit, and the translator is less likely to make an error.

Localization Process of Defined Text Files

Some text files do not contain plain text but data records. Such a file contains one or more records, and each record contains one or more fields. Here is an example.

soccer      Soccer      10  1  England        Soccer is a sport played between two teams of eleven players with a spherical ball.
hockey      Ice hockey  5   1  Canada         Ice hockey is a team sport played on ice, in which skaters use sticks to direct a puck into the opposing team's goal.
basketball  Basketball  5   0  United States  Basketball is a team sport in which two teams of five players try to score points by throwing a ball through the top of a basketball hoop while following a set of rules.

Each line contains one record. Each record contains six fields separated by a tab character. The first field is an id field. It must not be localized. The second field is the name field. It should be localized. The next two fields are the player count and goalie count fields. These must not be localized. The last two fields are the origin and description fields. They should be localized. Of course, we could scan this as plain text or even as segmented text. The problem is that each string would then contain more than one piece of data, and translation would be difficult. It is very likely that the translator would break the format when translating. This is why it is better to use defined scanning. It means that we teach Soluling the format of the file by entering definition rules that define the format. Each rule specifies one field. Together the rules define the format of a record, and using them the parser can correctly parse the file. Like segmentation rules, the definition rules use regular expressions.

When you create a new project that contains the above file, set Scan mode to Use definition. Soluling reads your file to analyze its format. Finally, Soluling populates the definition rules based on the result of the file analysis.

A defined text project

A definition is a set of field rules. Each rule specifies one field in the record. A rule can have a before expression, an after expression, or both. The expressions specify the character(s) that delimit the field values. Our sample record uses the tab character as a field delimiter. The first field contains the context value. It does not need a before or after expression. The second field is the name of the sport. Here we need a before expression: \t. This regular expression specifies a tab character. The rest of the fields also need a before expression. The last field ends with new line characters. This is why we add an after expression: \r\n.

We could also use only after expressions instead of before expressions. The following definition handles the same record.

The defined text project
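The before and after expressions can be modeled with a small rule-driven parser. The sketch below is only an illustration under stated assumptions: `FieldRule`, `parse`, and the field names are hypothetical, and a field value is assumed to end at the rule's own after expression or, when there is none, at the next rule's before expression. It shows that a definition built from before expressions and one built from after expressions read the same records.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldRule:
    name: str
    before: str = ""  # regex that must precede the field value
    after: str = ""   # regex that terminates the field value


def parse(text: str, rules: list[FieldRule]) -> list[dict[str, str]]:
    records, pos = [], 0
    while pos < len(text):
        record = {}
        for i, rule in enumerate(rules):
            if rule.before:
                lead = re.compile(rule.before).match(text, pos)
                if lead is None:
                    raise ValueError(f"expected {rule.before!r} at offset {pos}")
                pos = lead.end()
            # The value ends at this rule's after expression or, failing
            # that, at the next rule's before expression.
            next_before = rules[i + 1].before if i + 1 < len(rules) else ""
            stop = rule.after or next_before
            tail = re.compile(stop).search(text, pos) if stop else None
            end = tail.start() if tail else len(text)
            record[rule.name] = text[pos:end]
            # Only an after expression is consumed here; a following before
            # expression is consumed by the next rule.
            pos = tail.end() if (tail and rule.after) else end
        records.append(record)
    return records


NAMES = ["id", "name", "players", "goalies", "origin", "description"]
before_rules = (
    [FieldRule(NAMES[0])]
    + [FieldRule(n, before=r"\t") for n in NAMES[1:-1]]
    + [FieldRule(NAMES[-1], before=r"\t", after=r"\r\n")]
)
after_rules = (
    [FieldRule(n, after=r"\t") for n in NAMES[:-1]]
    + [FieldRule(NAMES[-1], after=r"\r\n")]
)

data = (
    "soccer\tSoccer\t10\t1\tEngland\tSoccer is a sport.\r\n"
    "hockey\tIce hockey\t5\t1\tCanada\tIce hockey is a team sport.\r\n"
)
records = parse(data, before_rules)
print(records[0]["name"], records[1]["origin"])
```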

The default definition that Soluling populates localizes all columns but the first. Our sample file also contains integer numbers (the number of field players and goalies). We do not have to localize them. Select the third column, right-click, and choose Ignore. Repeat the same for the fourth column.

The final defined text project

Now our definition is complete. When Soluling creates the project, it correctly scans each string field in our file.
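The result of the finished definition can be pictured with a short sketch. It assumes tab-separated records as in the sample, uses the id field as the context, and skips the ignored columns. The field names, the `rows_for` function, and the `id.field` context format are hypothetical choices for this example, not Soluling's actual output format.

```python
# (field name, localize?) for each column of the sample record.
FIELDS = [
    ("id", False),          # context value, never localized
    ("name", True),
    ("players", False),     # ignored: integer
    ("goalies", False),     # ignored: integer
    ("origin", True),
    ("description", True),
]


def rows_for(line: str) -> list[tuple[str, str]]:
    """Return (context, source) pairs for the localizable fields of one record."""
    values = line.rstrip("\r\n").split("\t")
    record_id = values[0]
    return [
        (f"{record_id}.{name}", value)
        for (name, localize), value in zip(FIELDS, values)
        if localize
    ]


rows = rows_for("soccer\tSoccer\t10\t1\tEngland\tSoccer is a sport.\r\n")
for context, source in rows:
    print(context, "=", source)
```

Only the name, origin, and description values reach the translator, so the ids, the numbers, and the tab structure cannot be broken during translation.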

Samples

<data-dir>\Samples\Text contains the following text file sample directories:

Directory  Description
Simple     A simple defined text file. Study this first.
Comment    A defined text file that contains comment fields.
Escape     A defined text file that contains escape characters.
Japanese   A defined text file that contains Shift JIS encoded Japanese text.
Plain      A sample text file. The directory contains two project files:
           Plain.ntp scans the file as plain text.
           Segmented.ntp scans the file using segmentation.
Sport      A defined text file that contains sport data.

Configuring Text File Localization

You can configure how to localize your text file by selecting the item in the project tree, right-clicking, and choosing the Options menu. A source dialog appears that lets you edit the options. This source uses the following option sheets.

Settings

Read more about other data files such as XML, JSON, YAML, INI, Excel, SVG, TMX, XLIFF, text and binary files.