Under the hood: TM technology run locally by CATs, anything new?
Thread author: Philippe Locquet
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 04:23
English -> French
+ ...
Sep 22, 2021

Hi all,
I was chatting with a colleague the other day, and the question of Translation Memory technology improvements came up.
Although TM had been considered earlier, it took off when four commercial TM systems appeared on the market in the early 1990s: the TranslationManager from IBM, the Transit system from Star, the Eurolang Optimizer and the Translator’s Workbench from Trados (according to a paper I dug up).
Since then, improvements have been made to the way a TM is searched, leveraged etc. But have there been significant improvements?
TMX has been widely used for sharing TMs, and it’s quite a good format. But that doesn’t mean that the CAT runs the local TM in that format.
So, I decided to start this thread to see what the current landscape is: which format or approach each tool uses natively, locally on the machine (not for import/export).
If you know what’s under the hood of what you’re using, please share 😊


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 04:23
English -> French
+ ...
TOPIC STARTER
Wordfast Sep 22, 2021

So here are the native formats that are run locally by each CAT:

_Wordfast Pro 6: Solr/Lucene (originally an Apache technology. Not a single file: it’s a collection of files with an index.)
_Wordfast Pro 3: txt
_Wordfast Classic: txt
_Wordfast Server: txt
_Wordfast Anywhere: txt


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 05:23
Member (2006)
English -> Afrikaans
+ ...
TMX Sep 22, 2021

Philippe Locquet wrote:
TMX has been widely used for sharing TMs, and it’s quite a good format.


TMX is a terrible format. Firstly, it's not very extensible. Secondly, it's an extremely wasteful format.

I took a random TMX file off of my computer and did some math:
- file size: approx. 1.5 million characters
- number of TUs: 3250
- meta data (date, time, user ID, language codes, client codes etc.): 170 000 characters
- actual content: 650 000 characters
- unnecessary codes and stuff: 630 000 characters

The same TM in Wordfast Classic's TXT format (with no data loss): 820 000 characters (45% smaller than the TMX file).
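The figures above can be checked with a few lines of arithmetic. This is only a sketch using the character counts quoted in the post; the variable and function names are made up for illustration. Note that the three parts add up to 1.45 million, so the saving comes out at about 43% against the exact sum and about 45% against the rounded 1.5 million.

```python
# Character counts quoted in the post above (all approximate).
metadata = 170_000
content = 650_000
markup_overhead = 630_000

tmx_total = metadata + content + markup_overhead  # the "approx. 1.5 million"
txt_total = metadata + content                    # what the TXT version keeps


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Saving relative to the exact sum of the three parts.
saving_exact = share(tmx_total - txt_total, tmx_total)
# Saving relative to the rounded file size of 1.5 million characters.
saving_rounded = share(1_500_000 - txt_total, 1_500_000)

print(tmx_total, txt_total, saving_exact, saving_rounded)
```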

For your list:
* Wordfast Classic: Tab-delimited TXT file with header in first line and thereafter one TU per line. Each column has a specific function (e.g. one column is the date/time, another column is the source language code, another is the source language text, etc.)
* Wordfast Pro 3: The format is practically identical to that of Wordfast Classic.
* Wordfast Pro 6: It's a set of folders and subfolders with various files in them that only Wordfast Pro can read, as far as I can tell. I have not been successful in figuring out how to read a WFP6 TM in any other application.
* Wordfast Anywhere: Unknown (it's an online TM, so from a user's perspective, it doesn't really have a TM format -- only various import and export formats)
* Trados 2007: It's a TMW file, with four additional files MTF, MWF, IIX and MDF. They are all binary files and none of them are zip files.
* Trados 2009+: It's an SDLTM file. Some kind of database, possibly "SQLite format 3".
* MemoQ: It is a mystery where MemoQ stores its TM files.
* OmegaT: TMX file only.
* CafeTran: AFAIK it's a TMX file.
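The "possibly SQLite" guess for SDLTM can be tested without any CAT tool, because every SQLite 3 database file starts with the 16-byte string "SQLite format 3" followed by a zero byte. A minimal sketch, assuming only the Python standard library; the function names and the file name in the usage comment are hypothetical, and it is safest to run this on a copy of the TM.

```python
import sqlite3

# Every SQLite 3 database file begins with these 16 bytes.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Return True if the file starts with the SQLite 3 header string."""
    with open(path, "rb") as fh:
        return fh.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


def list_tables(path):
    """Return the table names of an SQLite database, sorted alphabetically."""
    con = sqlite3.connect(path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


# Usage (hypothetical file name):
# if looks_like_sqlite("copy_of_my_memory.sdltm"):
#     print(list_tables("copy_of_my_memory.sdltm"))
```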

[Edited at 2021-09-22 20:45 GMT]


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 04:23
English -> French
+ ...
TOPIC STARTER
Yes and no Sep 22, 2021

Thanks! To keep this easy to follow, it's better if we stick to the topic: what the CAT runs natively on the machine.

Samuel Murray wrote:
* Trados 2007: It's a TMW file, with four additional files MTF, MWF, IIX and MDF. They are all binary files and none of them are zip files.
* Trados 2009+: It's an SDLTM file. Some kind of database, possibly "SQLite format 3".
* MemoQ: It is a mystery where MemoQ stores its TM files.
* OmegaT: TMX file only.
* CafeTran: AFAIK it's a TMX file.

[Edited at 2021-09-22 20:45 GMT]

Thanks for that. Hopefully someone will know more about MemoQ.

Samuel Murray wrote:
* Wordfast Anywhere: Unknown

As my post above states, it runs on txt. It can also be accessed via Wordfast Pro, in which case it is read and written as txt.

Samuel Murray wrote:
* Wordfast Pro 6: I have not been successful in figuring out how to read a WFP6 TM in any other application.

Please refer to my post above. The format is Solr/Lucene. There are tools that can read the data, but they are not CATs. The format also allows context-match information to be stored with the data.

Samuel Murray wrote:
an extremely wasteful format.

There have been different generations of TMX. But in my experience, you lose more when converting to txt (I've seen that with creation and modification dates). The issue with txt is that not every tool puts the metadata in the same columns (except for the most important data, of course), which means either editing columns or losing data. TMX seems more resilient to this between tools, but it's heavy. Cleaning up TMs for my customers has never been a problem with TMX files up to 1 GB in size. Beyond that, either the TMX editing tool reaches its limits or the computer chokes. This can of course be overcome using other tools or power text editors, but that's leaving the realm of what a vast number of translators do.
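One of the "other tools" routes for very large TMX files is to stream the file instead of loading it whole. The following is only a sketch under assumptions: it uses the Python standard library, `iter_units` and `SAMPLE` are made-up names, and it reads the language from `xml:lang` with a fallback to a plain `lang` attribute, which some older files may use.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_units(source):
    """Yield one {language: segment text} dict per <tu>, one unit at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            unit = {}
            for tuv in elem.iter("tuv"):
                seg = tuv.find("seg")
                lang = tuv.get(XML_LANG) or tuv.get("lang")
                if seg is not None:
                    # itertext() also collects text nested inside inline tags.
                    unit[lang] = "".join(seg.itertext())
            yield unit
            elem.clear()  # release this unit's children before reading on


# Tiny invented sample standing in for a large file on disk.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu creationdate="20210922T204500Z">
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = list(iter_units(io.BytesIO(SAMPLE)))
print(units)
```

For a real file, a path or an open binary file object can be passed to `iter_units` in place of the in-memory sample.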

Be well


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German -> Dutch
CafeTran Sep 23, 2021

Samuel Murray wrote:

* CafeTran: AFAIK it's a TMX file.

[Edited at 2021-09-22 20:45 GMT]


Correct. It’s a valid TMX file, with every TU in a separate paragraph. No line breaks after the individual items of the TU. Looks messy. Saves a little space and thus RAM.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German -> Dutch
Waste Sep 23, 2021

Samuel Murray wrote:

TMX is a terrible format. Firstly, it's not very extensible. Secondly, it's an extremely wasteful format.

...

The same TM in Wordfast Classic's TXT format (with no data loss): 820 000 characters (45% smaller than the TMX file).



The waste only occurs when you want to store additional info in properties. When you limit TUs only to the source and target segment content (no formatting etc. stored), the waste isn't that large. BTW: You can also create separate TMX files per co-worker, subject field, document version etc. (kind of the Transit approach), thus making the TMX properties unnecessary.

But, true, Wordfast Classic's TXT format is beautiful. I've even suggested that CafeTran Espresso should get a way to store TMs in such a compact format (which, I guess, would reduce the RAM load significantly). The fact that you can run operations on the source and target columns independently is very handy.
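To illustrate what "operations on one column" can look like with a tab-delimited TM, here is a small sketch. The column layout, the header line and the sample rows are invented for the example and are not the actual Wordfast layout; `read_units` and `normalise_spaces` are hypothetical helper names.

```python
import csv
import io

# Hypothetical, simplified column layout; the real header and column order may differ.
COLUMNS = ["date", "user", "src_lang", "source", "tgt_lang", "target"]

# Invented sample: one header line, then one TU per line, tab-delimited.
SAMPLE = (
    "header line (skipped)\n"
    "20210922~204500\tSM\tEN-US\tHello  world\tFR-FR\tBonjour le monde\n"
    "20210923~081500\tHL\tEN-US\tGood morning\tFR-FR\tBonjour\n"
)


def read_units(text):
    """Parse tab-delimited text into a list of dicts, skipping the header line."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # the first line is the header
    return [dict(zip(COLUMNS, row)) for row in reader]


def normalise_spaces(units, column):
    """Collapse runs of whitespace in one column only, leaving the others untouched."""
    for unit in units:
        unit[column] = " ".join(unit[column].split())
    return units


units = normalise_spaces(read_units(SAMPLE), "source")
print(units[0]["source"], "|", units[0]["target"])
```

Because each TU is one line and each field is one column, the same pattern works for a search-and-replace limited to the target column, or for filtering on the user or date column.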

Transit has a different approach: source and target are saved in XML files, a TM is created on the fly and it is binary.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German -> Dutch
MSJet? Sep 23, 2021

Philippe Locquet wrote:

Thanks for that Hopefully someone will know more about MemoQ



Isn't that, just like with Déjà Vu, an Access database, created with MSJet technology?

If I'm not mistaken, these databases can be opened with Ms Access.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 05:23
Member (2006)
English -> Afrikaans
+ ...
@Philippe Sep 23, 2021

Philippe Locquet wrote:
Samuel Murray wrote:
* Wordfast Anywhere: Unknown

As my above post states, it runs on txt.

How do you know this?


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 04:23
English -> French
+ ...
TOPIC STARTER
Dev Sep 23, 2021

Samuel Murray wrote:

Philippe Locquet wrote:
Samuel Murray wrote:
* Wordfast Anywhere: Unknown

As my above post states, it runs on txt.

How do you know this?


I never reveal my sources :)
I'm in touch with the devs, so, as good as it gets in terms of reliability...


 
Giovanni Guarnieri MITI, MIL
Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 04:23
Member (2004)
English -> Italian
Devs Sep 23, 2021

is that short for "devils"?

 
expressisverbis
expressisverbis
Portugal
Local time: 04:23
Member (2015)
English -> Portuguese
+ ...
Yes, Sep 23, 2021

Giovanni Guarnieri MITI, MIL wrote:

is that short for "devils"?


the devil developers!

[Edited at 2021-09-23 12:09 GMT]


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 04:23
English -> French
+ ...
TOPIC STARTER
jjj Sep 23, 2021

Giovanni Guarnieri MITI, MIL wrote:

is that short for "devils"?


Looks like there are some nasty programmers out there... XD


 

