How to create DGT translation memories

In November 2011, the European Commission’s Directorate-General for Translation (DGT) published parallel texts (bitexts) from which huge translation memories can be generated in any language pair (23 supported languages). This video shows how to download the necessary files (28 ZIP files totaling 2.3 GB) and how to use the provided Windows utility to create a Finnish-Slovenian TMX translation memory with 2 million translation units. Using the free Wordfast Converter, the TMX memory is converted to the Wordfast format and the resulting memory is displayed in Wordfast Classic’s data editor.
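For readers curious about what generating a language pair from the multilingual files involves, here is a minimal Python sketch. It is not the DGT utility and makes no claim about how that utility works internally. It assumes a TMX file in which each translation unit (`tu`) holds one `tuv` element per language, with the language stored in a `lang` or `xml:lang` attribute. The function name `extract_pairs` and the language codes in the usage line are hypothetical; check the codes actually used in your files.

```python
import xml.etree.ElementTree as ET

# Attribute name ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_source, source_lang, target_lang):
    """Yield (source, target) segment pairs from a multilingual TMX.

    tmx_source may be a file name or a binary file object.
    Translation units lacking either language are skipped.
    """
    source_lang = source_lang.upper()
    target_lang = target_lang.upper()
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline tags.
                segments[lang.upper()] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            yield segments[source_lang], segments[target_lang]
        # Drop the processed unit's children to limit memory use.
        elem.clear()
```

A call such as `for fi, sl in extract_pairs("sample.tmx", "FI-FI", "SL-SI"): ...` would then walk through the Finnish-Slovenian pairs one at a time.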

Relevant links:
DGT-Translation memories
Wordfast Converter
How to create a DGT TMX on a Mac

18 comments to How to create DGT translation memories

  • Thank you, Dominique! I do have and regularly use the previous batches of DGT TMs (2007 and 2011), but your post reminded me that I still need to download and index the 2012 batch. 🙂 This is a really useful resource.

  • Very helpful video – thank you. I got through all the steps OK on my (old) PC, but when it comes to opening the .txt file with Word (on my Mac; my PC doesn’t have Word) I don’t get a decent file – it’s gobbledygook. Do I need a special add-on for Word? Or am I doing something wrong?

    • Dominique

      Text files often cause problems when traveling between PC and Mac, due to encoding differences. This is the case with Wordfast TM’s and glossaries. Try the following: 1) open your TXT in Word for Windows, 2) save it as DOCX, 3) move the DOCX from PC to Mac, 4) open the DOCX in Word for Mac, 5) save it as TXT.

      Note that 1) may stress your PC a lot, if it’s old and under-powered (by today’s standards)!

      Try first with a smallish TM, for instance one created from a single ZIP (as opposed to the full version created from the 28 ZIP’s).

      • The file I’m trying with is smallish (20MB), product of a single ZIP. Like I said though, I don’t have Word on my Windows PC. On the PC I’ve tried opening with LibreOffice and OpenOffice but it’s no better. On my Mac I’ve tried saving it as a RTF file, but that doesn’t work either.

        • Dominique

          This is strange: I tried with the smallest ZIP and my Finnish-Slovenian TM was displayed just fine on my Mac (with Word 2011 14.2.5). You could try creating the TMX directly on your Mac. You need to use the Java utility provided on the DGT site.
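If Word is not available on one of the machines, the encoding differences mentioned in this thread can also be handled with a short script instead of the Word round trip. The following is a minimal Python sketch, not a Wordfast feature. The default encodings (UTF-16 in, UTF-8 out) are assumptions, since the thread does not say which encodings are involved; use whichever pair matches your files. The function name `reencode_text_file` is hypothetical.

```python
def reencode_text_file(src_path, dst_path,
                       src_encoding="utf-16", dst_encoding="utf-8"):
    """Copy a text file, changing only its character encoding.

    The default encodings are assumptions, not known facts about
    any particular TM file.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        # Read in 1 MB chunks so that large TMs do not fill the memory.
        for chunk in iter(lambda: src.read(1 << 20), ""):
            dst.write(chunk)
```

Writing to a new file rather than overwriting the original means the untouched TM is still there if the guess about the source encoding turns out to be wrong.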

  • I use Wordfast Classic. I was not aware of the conversion tool that you mention. The instructions say that you simply select the TMX file as your translation memory and it does the conversion automatically. So why is the conversion tool required? Where is it, by the way?

    Other issues:
    Searching the data editor using Ctrl F or clicking on the Find-Replace tab is non-functioning.
    These files use the EN-UK rather than the EN-US language tag, and I am assuming that I have no control over which language tag to assign the English output when producing the TMX from the AC. This is important when merging these files with files that I already have that use the EN-US tag.

    • Dominique

      One advantage of the converter is that it is very fast. This is especially noticeable with very large TM’s such as the DGT TM.

      Wordfast’s data editor was never intended to deal with TM’s with more than 1.5 million TU’s. In fact, Wordfast isn’t even supposed to support TM’s that large. If you need to search the TM for certain words, you can always use the concordance search feature.

      Wordfast Classic doesn’t really care about language codes, so from its point of view, EN-UK is just as good as EN-GB. Wordfast Pro, on the other hand, only accepts valid language codes, and it would therefore reject EN-UK as invalid. You can use a text editor’s find/replace function to change the language code. Use an editor that is capable of dealing with large text files, e.g. UltraEdit or EmEditor.
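As an alternative to a text editor, the language code can be changed with a few lines of Python that read the file line by line, so the whole TM never has to fit in memory. This is a minimal sketch of a plain find/replace, not a Wordfast or DGT tool. The function name `replace_language_code` is hypothetical, and the default encoding (UTF-16) is an assumption that should be checked against the actual file.

```python
def replace_language_code(src_path, dst_path,
                          old="EN-UK", new="EN-GB", encoding="utf-16"):
    """Write a copy of src_path with every occurrence of old replaced by new.

    Returns the number of replacements made. This is plain text
    substitution: it also changes the string if it appears inside a
    segment, not only where it is used as a language code.
    """
    count = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            count += line.count(old)
            dst.write(line.replace(old, new))
    return count
```

The returned count gives a quick sanity check: a result of zero suggests either the wrong encoding or a different code in the file.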

  • Tks for prompt reply.
    I was trying to find the Wfconverter at Wordfast.net or in the Yahoo group (files). I even googled it. Do you have this URL for download?

    The two files that I merged with the language tags EN-UK and EN-US, respectively (this time around it worked, a couple of years ago, it did not), show these tags: , , etc., and is in fact pretty bumpy when concordance searching.

    Here is my MO: I’d rather not use this huge TM for ordinary tasks, because doing a concordance search actually then produces too many results, but when I decide I would like to use it for certain words or phrases, I’d have to open it, after closing the current TM. Was wondering if this could be made easier somehow. So it becomes step-intensive. Hope I explained this okay.

    • Dominique

      The link to the converter was mentioned in this blog post.

      The best way to use large TM’s such as the DGT TM in Wordfast Classic is to copy it to the same folder where your main TM is located. Then you can enable the following setting in Terminology > Other: Search concordance in all sibling translation memories. This way, you don’t have to merge it into your main working TM.

  • Oh, I did find the exe file here on this site, clicking my way through.
    Let me type these tags again, they did not show up properly before.
    TAG-A, TAG_C, TAG-D, etc., enclosed in less-than and greater-than signs, dot the concordance lookup file.
    Sorry..

  • Good, concise videos!
    I now have three AC files, they seem to be grouped that way by EU, i.e., the 2007 release, the 2011, and 2012 releases respectively. So it makes sense to make three TMX files from those, and hence three Wf memories. These can all be placed in the same folder and accessed during CS as sibling memories.
    Q. Won’t it slow down the process to have three or more sibling TMs for CS to run through?

  • Christophe

    Ooops! Looks like the link (http://langtech.jrc.ec.europa.eu/ECDC-TM.html) is dead 🙁
    Do you have the TM available? Could you upload it through Dropbox or any other system?
    Thanx in advance
    Cris

  • Christophe

    Ooops! Sorry…thank you for the repost. Btw the link’s been repaired 🙂

  • Robert

    Thank you Dominique. I’ve downloaded and converted all the files. I’ve made one big TM. There is just one question I’d like to ask: Is it possible to use 2 million TU’s (1.2 Gigabytes) in my Wordfast Classic? I think I can upload “only” 1 million. Is using this large TM as my BTM a good solution? How large can a BTM be? I would appreciate any suggestions as to how to make the most of this huge TM. I hope you have time to send me an answer. Thank you. Robert

    • Dominique

      Hi Robert,
      The best use for such TM’s in Wordfast Classic is for concordance search, via the sibling TM feature. No need to define the TM as BTM: just copy it to the same folder as your main TM, and tick Search concordances in all sibling TM’s under Terminology > Other. You may also want to tick Prompt for each TM, as searching for concordances from TM’s that large can take quite some time, and you don’t necessarily need/want to do it with all searches. Hope this helps, Dominique

  • Robert

    Hi Dominique. First of all, thank you for taking the time to send me a reply. I’ve copied the DGT file in the same folder as my main TM. Now I’m using it for my Concordance searches. I’ve also ticked “Prompt for each TM” so after the first search is done on the main TM, the search continues with the remaining files in the same folder. The problem is that WF scans all the TM’s one after another non-stop, without asking me if I want to scan them individually. Is that how it is supposed to work? In addition to that, while the search is in progress, my esc button doesn’t seem to be working when I want to interrupt the search (very odd). What do you think?

  • Robert

    Hi Dominique. First of all, thank you for taking the time to send me a reply. I did exactly as you told me and I really believe it’s the best solution. Just a couple of things: although “Prompt for each TM” is ticked, the concordance search continues through all the TM’s located in the same folder NONSTOP (without asking whether I want to proceed with the next TM or not). This means that I have to wait until all TM’s are scanned before I can continue my work (which can take a very long time). What’s more, I can’t seem to interrupt the search when I hit the ESC key. What do you think?
