Who is This Tutorial For?

The intended audience for this tutorial is members of production environments for language services who are not necessarily professional translators, but are interested in learning about the processes and tools involved in evaluating finished translations. We include links to one possible system of tools that can be used to this end. The initial intended audience for this tutorial is the machine translation community at the 20th Machine Translation Summit in Geneva, Switzerland, taking place in June 2025.

Part A of this tutorial will aid anyone in understanding the basic principles of TQE and different roles in a TQE system (in particular the project manager’s and the technician’s) via a guided, concrete example, even if they will never fill the role of evaluator. Part B enables readers to contribute to TQE systems by preparing tools, specifications, and metrics for evaluators, and by producing quality ratings based on the error data received from evaluators.

After following this tutorial, a non-translator who wishes to conduct a TQE must find a professional translator to at least fill the role of evaluator. If the reader considers themselves already well-versed in the principles and theory behind conducting a TQE, they may proceed directly to Part B, referring to Part A as needed for examples.

Part A: TQE Basics with a Concrete Example

Overview

Although there are many methods with which to evaluate the quality of translations, this tutorial is focused on conducting a TQE (Translation Quality Evaluation) based on MQM (Multidimensional Quality Metrics). It consists of the following sections:

What is Translation Quality (TQ)?
Roles in TQE
What is TQE?
A concrete example for Part A
A system of tools for all steps in a TQE

Note that this tutorial will provide information for using MQM in a TQE, but a full description of MQM is available at the MQM website.

A1 What is Translation Quality (TQ)?

If you are not familiar with the idea of translation quality (TQ) and its measurement, the following key terms are foundational:

What is translation? “Translation … is a cover term for the creation of written output that corresponds to source content according to agreed-upon specifications” (International Federation of Translators).
What are specifications? They are a detailed blueprint produced by a pre-production dialog between stakeholders (the requester and the provider’s project manager):
- Translation parameters are standardized aspects of a translation project to be addressed in the pre-production dialog.
- Translation specifications are the use-case-specific requirements developed, agreed upon, and documented before a translation project begins. They are used to guide the production of a translation, as well as to guide evaluation of the translation in a post-production phase.
- It may be useful to look at this list of translation parameters.
What is translation quality? The degree to which a translation meets the agreed-on specifications.

For more information, see the following articles on the definition of translation, the general measurement of quality, and the application of quality measurement to the translation industry:

A2 Roles in TQE

A clear distinction should be made between professional translators and other personnel involved in the language service industry. Relevant to TQE are:

Project managers, who oversee the translation, the translation evaluation, or both.
Technicians, who set up and run the mechanics of a TQE, creating and maintaining tools that are critical to all parts of the process.
Evaluators, who examine the source and target texts to identify errors relative to the specifications and produce annotation metadata.

One person may fill more than one of these roles, but it is critical that an evaluator have at least the competence of a professional translator for the languages, subject field, and type of text in question. Non-translators might fill the roles of technician or project manager.

A3 What is TQE?

TQE refers to the quality management activity of inspecting and measuring a translation product in order to determine whether stakeholder requirements, that is, specifications, have been fulfilled (ASTM WK46396 New Practice for Analytic Evaluation of Translation Quality §3.2.4 and §3.2.5).

The goal of the evaluation process is to identify errors in the text. The most obvious way to do this is with an annotation tool. In our example, we annotate using the TRG Annotation Tool. This tool must first be configured with a bitext, a metric, and an optional but recommended specifications file. The files are prepared in a preliminary step and handed off to an evaluator, who can then use the tool as a visual interface to highlight and assign errors to portions of the text. The annotation data is then exported.

A4 A Concrete Example

Below is a very short French text, sampled from a larger document, which discusses several aspects of European Union law, as well as its English translation. Both source and target text are segmented into translation units (TUs) that are aligned side-by-side. Together, these segments make up the text that will be used to illustrate the principal parts of a TQE in Part A of this tutorial. Many intermediate files are generated over the course of a TQE. The files for Part A are provided for the reader in a public GitHub repository.

As will be discussed in Part B, the concrete example in Part A is far too short for a real TQE. While such a small sample size is not sufficient for a statistically significant evaluation, it is used here to demonstrate the basic principles of TQE.

TU	Source Text (French)	Target Text (English)
1	Le Parlement européen a adopté, le 11 novembre 2015, une résolution sur la réforme de la loi électorale de l’Union européenne.	On November 11th, 2015, the European Parliament adopted a resolution on the reform of the laws of the European Union.
2	Plusieurs principes ont alors été retenus:	Several principles were retained:
3	(1) l’organisation des élections sur la base d’un scrutin de liste ou d’un vote unique transférable de type proportionnel;	(1) conducting elections on the basis of proportional representation; using a list system or a single transferable ballot system;
4	(2) la suppression du cumul de tout mandat national avec celui de député européen;	(2) prohibiting the cumulation of any national office with one as Member of the European Parliament;
5	(3) la liberté pour les États membres de constituer des circonscriptions au niveau national;	(3) upholding the freedom of Member States to draw up constituencies at national level;

In this concrete example, the evaluator works in a webapp to annotate segments of the above text for errors. For the sake of simplicity, this tutorial provides a read-only HTML version of what this webapp environment would look like after the evaluator has finished error annotation. It retains a level of interactivity, so that the user may inspect errors by clicking on one of the yellow or orange error buttons to see the erroneous text highlighted.

It contains a source text on the left side and a target text on the right side. They have been segmented, meaning that they have been split into translation units, which are manageable chunks of text that correspond roughly to the size of a clause or short sentence. They have also been aligned, meaning that the source and target texts’ segments have been set side-by-side. Together, they form a text that has been annotated for translation errors and is now ready to be assigned a quality rating.

After annotation, the exported data can be scored. In our example, we calculate scores using an automatic tool that receives the annotation webapp’s exports and returns both a numeric quality score and a pass/fail quality rating. It also exports a summary of its calculations, as well as an error count table, in the form of an Excel spreadsheet.

The scoring process concludes the TQE. Each of these steps will be broken down further in Section A5.

A5 Putting the Concrete Example Together

This section guides the reader through each step of the TQE process described in Section A4. It includes all of the files and artifacts generated throughout the process, which can be found in this public GitHub repository. Part B of this tutorial gives a deeper explanation of how each of these files is created on the pathway towards calculating a quality rating, and why they are necessary in TQE.

As a matter of practicality, and as implied in Section A4, a TQE is split into three stages:

Preliminary Stage
Error-Annotation Stage
Automatic Calculation & Follow-Up Stage

A5.1 Preliminary Stage

The purpose of the Preliminary Stage is to:

Formalize the specifications negotiated when the translation was created,
Select a metric to be used in later scoring, and
Prepare the sample for evaluation by segmenting and aligning it.

This is all done in preparation for the Error-Annotation Stage. For the concrete example, this stage produces

a structured translation specifications (STS) .xml file, which contains detailed information on the source and target texts, as well as the translation process,
a metric .xml file, which contains a list of errors that the evaluator will look for. It also includes information that will later be used to score the translation based on the evaluator’s reporting of those errors, such as the cutscore (the minimum passing score), and
a bitext .txt file, which includes the source and target texts, concatenated line-by-line, with tab characters as delimiters.

A5.2 Error-Annotation Stage

The purpose of the Error-Annotation Stage is for the evaluator to annotate the bitext using errors from the metric file. The three files from the Preliminary Stage are uploaded to the TRG Annotation Tool, and the evaluator highlights and annotates the errors. In this concrete example, the evaluator determined that there were six errors:

A minor Organizational Style error in TU1: The style guide for this project requires that the date be formatted “11 November 2015,” not “November 11th, 2015.”
A major Omission error in TU1: The target text omits the word “electoral” when describing the types of laws mentioned in the EU resolution.
A major Mistranslation error in TU2: “Retained” is not an acceptable translation of “retenus.” It is a false cognate. An acceptable translation here would be “included.”
A minor Punctuation error in TU3: The semicolon after “representation” should be a comma.
A major Unidiomatic Style error in TU4: Although it accurately conveys the intended meaning of the source text, the target text is unwieldy because of the literal word-for-word translation of the source material.
- The evaluator offered an idiomatic alternative: “Prohibiting individuals from holding office as a member of a national parliament and as a Member of the European Parliament at the same time.”
A minor Awkward Style error in TU5: The phrase “draw up” is unclear. A better alternative might be “establish.”

Once the annotations are complete, the TRG Annotation Tool exports its data as a JSON file. This data can be converted into a TEI file (TEI is a widely used XML format in the Digital Humanities), then converted to read-only HTML so that it can be visually inspected for data integrity. This read-only HTML was used in Section A4 to introduce the concrete example text with error annotations.

A5.3 Automatic Calculation & Follow-Up Stage

The purpose of the Automatic Calculation Stage is to obtain a quality rating by comparing a calculated overall quality score to a minimum threshold value called the cutscore. For this concrete example, we use a tool called the TQE Calculator, which takes in two files as inputs:

the TEI file containing error data exported by the TRG Annotation Tool, and
the metric file, containing scoring model parameters.

The TQE Calculator automatically parses these files and generates an error-count table, which summarizes the number and type of errors that were annotated in the TRG Annotation Tool, as well as those error types’ weights and the penalty multipliers for each severity type.

The error-count table contains all the information necessary to calculate the Absolute Penalty Total (APT), where each error, initially worth one penalty point, is first multiplied by its weight (here, all weights are one), and then by its severity multiplier (here, we use ×1 for minor errors and ×5 for major errors). The TQE Calculator automatically computes the APT using the error count table.
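The APT arithmetic for the six errors listed in Section A5.2 can be sketched in a few lines of Python. This only illustrates the calculation described above and is not the TQE Calculator's actual code; the variable names, the function name, and the data layout are assumptions made for this sketch.

```python
# Illustrative sketch only; names and data layout are assumptions,
# not the TQE Calculator's internals.

# Severity multipliers used in the concrete example.
SEVERITY_MULTIPLIER = {"minor": 1, "major": 5}

# The six annotated errors from Section A5.2: (TU, error type, severity).
errors = [
    (1, "Organizational Style", "minor"),
    (1, "Omission", "major"),
    (2, "Mistranslation", "major"),
    (3, "Punctuation", "minor"),
    (4, "Unidiomatic Style", "major"),
    (5, "Awkward Style", "minor"),
]

# All error-type weights are one in this example.
weights = {error_type: 1 for _, error_type, _ in errors}


def absolute_penalty_total(errors, weights, multipliers):
    """Sum (1 penalty point x weight x severity multiplier) over all errors."""
    return sum(
        1 * weights[error_type] * multipliers[severity]
        for _, error_type, severity in errors
    )


apt = absolute_penalty_total(errors, weights, SEVERITY_MULTIPLIER)
print(apt)  # 18: three minor errors (3 x 1) plus three major errors (3 x 5)
```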

Then, the TQE Calculator automatically extracts the inputs to be used in the scoring model. The only parameter that needs to be input is the length of the target-text portion of the document in words, which is 74. The tool calculates an overall quality score, which is then compared to the cutscore to provide a quality rating:

The Overall Quality Score (OQS) falls short of the established cutscore of 80 and thus receives a quality rating of “fail.”
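As a rough check on this result, the sketch below applies the raw linear formula as we read it from the definitions in Part B: the penalty total is divided by the word count, the result is subtracted from one, and the difference is scaled to a maximum of 100. The formula and the function names are assumptions of this sketch, not output or code from the TQE Calculator. The penalty total of 18 follows from the six errors in Section A5.2 (three minor at ×1, three major at ×5).

```python
def raw_quality_score(apt, word_count, max_score=100.0):
    """Raw (noncalibrated) linear score, assumed to be (1 - APT / word count) * max score."""
    per_word_penalty = apt / word_count
    return (1.0 - per_word_penalty) * max_score


def quality_rating(score, cutscore):
    """Pass when the score reaches the cutscore (the minimum passing score)."""
    return "pass" if score >= cutscore else "fail"


apt = 18         # 3 minor errors x 1 + 3 major errors x 5 (Section A5.2)
word_count = 74  # target-text length given above
cutscore = 80

score = raw_quality_score(apt, word_count)
print(round(score, 2), quality_rating(score, cutscore))  # 75.68 fail
```

Under these assumptions the score lands a few points below the cutscore, which is consistent with the “fail” rating reported above.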

It is possible to download a summary of this whole process as an Excel spreadsheet. (The spreadsheet summary for the concrete example is here.)

In this example, we used a linear noncalibrated scoring model. The exact formulas used in the different scoring models are outside the scope of Part A. To learn about scoring models, including calibrated scoring models, see Part B.

Again, this concrete example is too short a sample to be statistically meaningful, but is provided as a demonstration of the TQE process.

Aside: MQM vs. BLEU

TQE using MQM focuses on producing analytic quality scores based on annotated, individual errors and in relation to human perceptions of quality. The type of evaluation explained here is different from that normally used to evaluate incremental versions of a machine-translation engine or separate engines using the same source text and reference translation (e.g., a BLEU score¹).

In brief, MQM is a framework for conducting TQEs in a way that is both analytic and reference-free. Analytic means that it produces a quality rating by tabulating penalties for individual, annotated errors (as opposed to a non-analytic approach, which would produce a quality rating for the translation as a whole). Reference-free means that it does not require translations to be evaluated in comparison to another translation, such as a gold standard or previous translation of the same source text.

Although it is technically possible to conduct a TQE with MQM in many settings (anywhere from a Python script to the back of an envelope), there are advantages to using tools specifically designed to work with it. Such tools highlight the analytic approach of MQM, which gives immediate feedback that is both 1) actionable enough to be used in human-centric quality improvement, and 2) specific enough to preserve the root cause of errors and enable a “stack trace” for a more thorough evaluation of a human’s or system’s translation competency.

Part B describes how to expand these procedures to perform a TQE on any real-world translation.


Part B: Expanded Theory & Your Own TQE

After studying the critical components of a TQE and their applications to the concrete example provided, the reader may proceed to their own TQE.

Overview

This section shows how to conduct a TQE on a text of the reader’s choosing by putting together the same steps and tools as in Section A5. It explains not only how to make each of the files, but also why each is necessary from a theoretical standpoint, and how the tools used to generate the files apply that theory. A technician’s view of each tool in this tutorial is at TQE Tools.

The subsections of Part B are analogous to the structure of Section A5:

Preliminary Stage
1. Formalizing Specifications
2. Selecting a TQE Metric
3. Preparing the Document for Annotation
Error-Annotation Stage
Automatic Calculation & Follow-Up Stage

B1 Preliminary Stage

In the Preliminary Stage, we prepare for the Error-Annotation Stage by producing the following resources:

The specifications of the translation, in the format of an STS XML file
The metric selected for evaluation, compatible with the specifications, in the format of a metric XML file, including:
- Error types and severity levels
- A scoring model
- A cutscore
A bitext of source text and target text, aligned and segmented. This may be the whole document, or a sample thereof.

B1.1 Formalizing Specifications: Making an STS File

In order to ensure a transparent TQE process, the first step of a TQE is to review and ensure access to the translation specifications that were originally negotiated between the requester of the translation and the project manager for the undertaking entity. For ease of data transfer, this is done by means of a Structured Translation Specifications (STS) file, which is an XML file with data fields corresponding to the structured parameters in ASTM F2575. Once the appropriate specifications are filled in, they may be viewed on a variety of different platforms, including the TRG Annotation Tool that we will use in Section B2.

If the original specifications of the project are unavailable or were never created, it is still possible to reconstruct the specifications by making educated guesses about the situation under which the translation was commissioned. Even such hypothetical specifications may still be important and useful in a TQE.

Formalizing recorded or reconstructed negotiations is a nontrivial task that should be conducted by a professional translation project manager, just as error annotation should be conducted by a professional translator. It requires intimate knowledge of the translation process and of stakeholder expectations.

A tool that provides a visual interface for producing an STS file is available here.

B1.2 Selecting a TQE Metric: Making a Metric File

The metric file is an XML file that contains information essential to the Error-Annotation Stage (see Section B2). This information is known as the TQE metric; it is stored as XML for ease of data interchange.

It is the responsibility of the translation project manager to produce a metric that is tailored to the specifications of the translation project that they are managing. Like specifications, a metric must be reconstructed if one was never created or made available. Also as with specifications, it should be a professional project manager who creates, selects, or verifies the metric to be used in a TQE.

A key assumption in the MQM framework is that the design of any measure of goodness must be justified in relation to the subject whose goodness is to be measured (see the article What is quality? under Section A1). In other words: Because the translation project specifications vary by use case, there is no single, universal TQE metric.

For example, an error may be considered critical in the context of one translation project, but minor or neutral in another. That being said, metrics may be reused if two translation projects share the same specifications (such as two translations that are part of a series and differ only in due date).

A metric is composed of three components, as stated in ASTM WK46396 New Practice for Analytic Evaluation of Translation Quality §4.5:

an appropriate selection of error types,
a scoring model, and
a mechanism to determine whether the evaluated translation has passed or failed.

The MQM implementation of each of these components is described in the subsections below (B1.2.1 to B1.2.3), followed by a description of a webapp that can be used to generate a metric XML file (B1.2.4). The way in which each component is embedded in XML can be found in the related sections on the tools page.
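To make the three components concrete, here is a hypothetical in-memory representation of a metric in Python. The class and field names are invented for illustration and do not reflect the schema of the metric XML file. Only the values (a default weight of one, the ×1 and ×5 multipliers and the cutscore of 80 from Part A, a Reference Word Count of 1000, and a Maximum Score Value of 100) come from this tutorial.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ErrorType:
    """One selected error type and its weight (hypothetical structure)."""
    name: str
    weight: float = 1.0  # weights default to one


@dataclass
class ScoringModel:
    """Scoring-model parameters carried by the metric (hypothetical structure)."""
    calibrated: bool = False                            # raw model unless stated
    reference_word_count: int = 1000                    # RWC
    max_score_value: float = 100.0                      # MSV
    acceptable_penalty_points: Optional[float] = None   # APP, calibrated only


@dataclass
class Metric:
    """The three components: error types, scoring model, pass/fail mechanism."""
    error_types: List[ErrorType]
    severity_multipliers: Dict[str, float]
    scoring_model: ScoringModel
    cutscore: float

    def passes(self, quality_score: float) -> bool:
        # The cutscore is treated as the minimum passing score.
        return quality_score >= self.cutscore


# A metric resembling the one used in Part A (abbreviated error-type list).
metric = Metric(
    error_types=[ErrorType("Omission"), ErrorType("Mistranslation"), ErrorType("Punctuation")],
    severity_multipliers={"minor": 1, "major": 5},
    scoring_model=ScoringModel(),
    cutscore=80,
)
print(metric.passes(85.0))  # True
```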

B1.2.1 An Appropriate Selection of Error Types

Instead of a universal metric, the MQM framework provides a universal typology of errors that may occur in translation (https://themqm.org/error-types-2/typology/). Each MQM TQE metric pulls its error selection from this typology. These errors are then used to annotate segments of a translation. Having every MQM metric be a subset of the MQM error typology ensures transparency and minimizes data loss between TQE systems, as well as between evaluators, translators, and clients. In this example, we use the MQM Full error typology, rather than the MQM Core error typology; MQM Core is a subset of MQM Full.

In addition to the error types themselves, a metric indicates the weight of each error type selected. This is the number by which the error counts of this type are multiplied during scoring. By default, this number is one.

Furthermore, a metric indicates the multipliers for each severity level. MQM provides four error severity levels: neutral, minor, major, and critical. The severity level penalty point multiplier is the number by which the error counts of each severity level are multiplied during scoring. By default, these numbers are 0, 1, 5, and 25, respectively.

After annotation is complete (see Section B3), the products of each error count, multiplied by the relevant error weights and severity level multipliers, are summed to give an Error Type Penalty Total (ETPT) for each error type. More on this will be given in Section B3.
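A minimal sketch of this per-type calculation follows, using the error counts from the concrete example in Section A5.2. The function name and the nested-dictionary layout for error counts are assumptions of this example, not the format used by any of the tools.

```python
# Default multipliers for three of the four severity levels; the highest level
# (default multiplier 25) is left out because the example data does not use it.
SEVERITY_MULTIPLIER = {"neutral": 0, "minor": 1, "major": 5}


def error_type_penalty_totals(counts, weights, multipliers):
    """Return {error type: ETPT}.

    counts maps error type -> {severity level: number of errors}.
    weights maps error type -> weight (falls back to 1 when absent).
    """
    totals = {}
    for error_type, by_severity in counts.items():
        weight = weights.get(error_type, 1)
        totals[error_type] = sum(
            n * weight * multipliers[severity]
            for severity, n in by_severity.items()
        )
    return totals


# Error counts from the concrete example in Part A (one error per type).
counts = {
    "Organizational Style": {"minor": 1},
    "Omission": {"major": 1},
    "Mistranslation": {"major": 1},
    "Punctuation": {"minor": 1},
    "Unidiomatic Style": {"major": 1},
    "Awkward Style": {"minor": 1},
}
weights = {}  # every error type uses the default weight of one

etpt = error_type_penalty_totals(counts, weights, SEVERITY_MULTIPLIER)
print(etpt["Omission"], etpt["Punctuation"])  # 5 1
print(sum(etpt.values()))                     # 18
```

Summing the per-type totals gives the same 18 penalty points as adding up the six errors one at a time.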

B1.2.2 A Scoring Model

While error annotation is a crucial part of the TQE process—and also the most time- and effort-intensive—it only produces raw annotation data, not a quality score. The method to calculate the quality rating from the raw annotation data is called a scoring model.

This is a series of formulas to convert error information (how many errors occurred, of what kind and where, how severe they were, etc.) into a single number that reliably reflects a human judgement of quality. In MQM, this is a conversion of the ETPT into the Quality Score (QS).

In this tutorial, we will focus on two linear scoring models, with and without calibration. The linear scoring model without calibration, also called the “raw” scoring model, is well known in the industry. The calibrated linear scoring model uses a quality scale that is different from the raw model’s error count scale. This is done for ease of human use and application. Both models are described in the paper The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control (13 authors).

The TQE Calculator that will be used in Section B3 automatically applies the scoring model, which is specified to be raw (non-calibrated) or calibrated in the metric. Certain calculation values are also part of the scoring model and must be included in the metric:

The Reference Word Count (RWC) is an arbitrary number of words in a hypothetical reference evaluation text. Implementers use this uniform word count to compare results across different projects. The RWC is often set at 1000.
The Maximum Score Value (MSV) of 100 is the maximum possible QS that a translation may obtain—the score that the QS is “out of.” In a calibrated scoring model, it is also an arbitrary value designed to manipulate the QS in order to shift its value into a range which is easier to understand. It converts the score to a percentage-like value.
For a calibrated scoring model, the number of Acceptable Penalty Points (APP) is the number of penalty points that stakeholders would deem as still acceptable for the Reference Word Count. So, this is usually the maximum acceptable number of penalty points per 1000 words.

These values are used to calculate intermediate values before a QS is actually obtained.

For all linear scoring models:

The Per-Word Penalty Total (PWPT) is determined by dividing the APT by the Evaluation Word Count (EWC; see Section B1.3). It is the key value in the raw QS calculation.

For calibrated linear scoring models:

The Defined Passing Interval (DPI) is the interval between the MSV and the PT defined in B1.2.3.
The Normed Penalty Total (NPT) represents the PWPT relative to the RWC. Typically, 1000 is used as the RWC; therefore NPT is sometimes referred to as the Error Penalty Total per Thousand Words. It is obtained by multiplying the PWPT by RWC.
The Scaling Factor (SF) is the parameter used to scale the NPT into the DPI in order to give it a meaningful relationship to the PT.

These definitions are taken from sections 2.1 and 2.3 of the MQM website’s description of scoring models.
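The following Python sketch strings these intermediate values together. The definitions above are given only in words, so the formulas for the Scaling Factor (DPI divided by APP) and for the two final scores are our reading of those definitions and should be checked against the MQM description of scoring models before use. The function names and the input numbers are hypothetical.

```python
def raw_score(apt, ewc, msv=100.0):
    """Raw linear QS, assumed to be (1 - PWPT) * MSV."""
    pwpt = apt / ewc  # Per-Word Penalty Total
    return (1.0 - pwpt) * msv


def calibrated_score(apt, ewc, rwc, msv, pt, app):
    """Calibrated linear QS, assumed to be MSV - NPT * SF."""
    pwpt = apt / ewc   # Per-Word Penalty Total
    npt = pwpt * rwc   # Normed Penalty Total (per RWC words)
    dpi = msv - pt     # Defined Passing Interval
    sf = dpi / app     # Scaling Factor (assumed: DPI / APP)
    return msv - npt * sf


# Hypothetical inputs: 30 penalty points in a 2000-word sample, with
# 20 acceptable penalty points per 1000 words and a passing threshold of 90.
params = dict(ewc=2000, rwc=1000, msv=100.0, pt=90.0, app=20.0)

print(round(raw_score(30, 2000), 2))           # 98.5
print(round(calibrated_score(30, **params), 2))  # 92.5

# Sanity check on the assumed formula: a sample carrying exactly the acceptable
# number of penalty points (40 in 2000 words) scores exactly the threshold.
print(round(calibrated_score(40, **params), 2))  # 90.0
```

The last line shows why the scaling is useful under this reading: the acceptable penalty level maps directly onto the passing threshold, so the score can be interpreted without reference to the sample size.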

For simplicity, a linear scoring model is usually applied—even though the world is profoundly non-linear—because any non-linear function can be approximated by a linear one over a small interval. Therefore, if the scoring model is set up and verified for a certain sample size, it will perform reasonably well for nearby sample sizes at producing a reliable QS. Sampling is discussed in further detail in Section B1.3.

B1.2.3 A Mechanism to Determine Whether the Evaluated Translation Has Passed or Failed

Because numeric scores can be open to interpretation, a TQE metric must specify a method of converting the QS output by the scoring model into a quality rating, i.e. a binary verdict on whether the translation is acceptable or not. The most straightforward way to determine whether a translation has passed or failed a TQE, and the way implemented in this tutorial, is with a cutscore, also known as a Passing Threshold (PT). If the QS is at or above the cutscore, the translation passes. Otherwise, it fails.

While a cutscore takes the unassuming form of a simple number, there is actually wide discussion in the quality management community around how to best select one. A good cutscore reflects real-world human perception. And a metric is validated based on how well it predicts what experts will say about a translation. However, the process of setting a cutscore in a real-life project is beyond the scope of this tutorial.

B1.2.4 Generating a Metric XML File

This webapp is a simple tool that allows a user to upload a typology, and then select which error types they wish to include in their metric, along with the weights for each error type. In the “Scoring Information” box, the user inputs severity level multipliers, all scoring model information and the cutscore.

This tool provides the user with the XML version of the MQM Full error typology. If a user is conducting a TQE with another error typology, they may upload their own, but it must conform to the same XML structure as the provided mqm_typology_full.xml file.

If you have more questions about the MQM framework used for selecting metrics and defining error types, you can read about it here: https://themqm.org/.

With the selected metric’s XML file created, the project manager delivers it to the evaluator, who then proceeds with the next step in the TQE: preparing the text.

B1.3 Preparing the Document for Annotation: Making a New Bitext File

Once the translation specifications and the TQE metric are finalized, the last step before the Error-Annotation Stage is the preparation of the source text and the target text, or a sample thereof, for evaluation.

If it were possible to annotate the entire document for translation errors, then we could be confident that the evaluation would be as reliable as the TQE process itself, which has its own inherent reliability.

The reliability of any particular evaluation hangs upon the qualifications of the evaluator, who needs to be a highly proficient and experienced linguist.

Such highly qualified resources are scarce, expensive, and have limited availability. This is the primary reason why TQE is typically performed on samples of limited size, rather than on an entire large document, or on all small documents in the case of a series of self-sufficient texts within a project. (Not to mention that TQE is usually done for someone else; if the goal is to fix the errors, the linguist can simply edit the text without annotating it.)

It is therefore common practice to evaluate only a sample and extrapolate the score to estimate the quality of the full document. In this sense, a TQE works with a sample of a certain size taken from a larger document or content stream.

Both sample size and document size significantly affect the validity and reliability of a TQE, for two main reasons:

the mathematics of defect measurement is rooted in statistics, and
there is strong practical evidence that human perception is non-linear.

Statistics tells us that very small samples (under 250 words) introduce high uncertainty into quality measurements, regardless of the method, making them unsuitable for MQM’s linear analytic scoring and requiring alternative approaches.

Medium-sized samples (500 to 5000 words) are better suited for linear assumptions and work well with traditional MQM-based evaluation. For instance, a scoring model calibrated for a 2000-word reference sample can be applied, with reasonable accuracy, to samples between 1500 and 2500 words.

Note that, unfortunately, the non-linearity of human perception shows itself across a wide range of sample sizes, because human readers tolerate fewer errors per unit as the text grows, due to cognitive effects like priming. In order to produce scores that accurately reflect human perception over a wide range of sample sizes, it may be necessary to use a non-linear scoring model. More information about non-linear scoring models is forthcoming.

Once a sample has been selected, the evaluator determines the Evaluation Word Count (EWC), usually by means of a software app such as a CAT (computer assisted translation) tool. Usually, this is the word count of the source text.

Finally, the sample of the source text and the sample of the target text are combined into a tab-delimited bitext TXT file, where they must be segmented (split into translation units) and aligned (each translation unit appears on the same line as its corresponding source or target segment, separated by a tab character), just like the example in Section A4. This may be done in a CAT tool, using a command line interface or text editor, or with this webapp.
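For readers who take the scripting route, the following Python sketch writes aligned segments to a tab-delimited bitext file and reads them back. The file name and helper names are assumptions of this example, and any further conventions the TRG Annotation Tool may expect of the file (such as encoding or header rows) are not covered here.

```python
from pathlib import Path


def write_bitext(pairs, path):
    """Write (source, target) segment pairs, one pair per line, tab-separated."""
    lines = []
    for source, target in pairs:
        for segment in (source, target):
            # A tab or line break inside a segment would corrupt the alignment.
            if "\t" in segment or "\n" in segment:
                raise ValueError(f"segment contains a tab or newline: {segment!r}")
        lines.append(f"{source}\t{target}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_bitext(path):
    """Read a tab-delimited bitext back into a list of (source, target) pairs."""
    pairs = []
    text = Path(path).read_text(encoding="utf-8")
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(fields)}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


# One aligned translation unit from the concrete example in Section A4.
pairs = [
    ("Plusieurs principes ont alors été retenus:", "Several principles were retained:"),
]
write_bitext(pairs, "sample_bitext.txt")  # hypothetical file name
print(read_bitext("sample_bitext.txt") == pairs)  # True
```

Rejecting segments that contain tabs or line breaks is a deliberate choice in this sketch: a stray delimiter inside a segment would silently shift the alignment, which is harder to spot later than an early error.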

B2 Error-Annotation Stage: Annotating Errors with the TRG Annotation Tool

In the Error-Annotation Stage, a professional translator produces error annotations by assigning issues from the metric to textual segments in the bitext.

The TRG Annotation Tool was originally created as a webapp in 2015 as part of a master’s thesis on TQE, using the PHP webapp framework Symfony. It was used in other student theses (such as Marshall Martins, page 17) under the name of “MQM Scorecard.” It was further developed by a software engineer at the German Research Center for Artificial Intelligence (DFKI) as part of the EU-funded QT21 project. Upon the release of PHP7, LTAC Global, along with the Translation Research Group (TRG) at BYU, undertook the major task of rewriting the webapp in a React framework. Nowadays, the TRG Annotation Tool is being used by researchers to annotate bitexts in low-resource languages, and the TRG continues to use it in research on a corpus of ATA translator certification exams.

The TRG Annotation Tool is a self-hosted webapp. Anyone creating a TQE system can create their own instance, add users and projects, and upload custom typologies. The technical details for setting up an instance are outside the scope of this tutorial, but are detailed on the tools page.

Once an instance is set up and a user account is created for the evaluator, they may log in and create a project, uploading the files prepared in the Preliminary Stage as necessary. Note that each account must upload a general error typology to be shared across all projects. This should be whatever error typology was used to create the metric (see Sections B1.2.1 and B1.2.4). With a project created, the user, who is the evaluator, completes annotation of the aligned, segmented text and exports the data, which is downloaded as a JSON file.

B2.1 Creating a Project

Navigate to the “Create project” tab.
Choose a name.
Upload the bitext, specifications, and metric file as created in Section B1.

Any of these parameters may be changed at any time from the “View projects” tab by selecting the project’s “Edit” button. An account’s uploaded error typology can only be changed if there are no active projects on that account.

B2.2 The Project Editor

Once a project is created and opened, there will be multiple tabs in the project editor.

Scorecard

This is the main annotation interface. The evaluator scrolls through the bitext and annotates errors. The “Filter” pane at the bottom of the interface allows the evaluator to search for strings in the bitext.

Once an error is identified, the evaluator annotates it:

Select the segment where the error is found. Double-click anywhere in a segment to select it, use the sidebar arrows to navigate through the segments, or enter a segment number in the Navigation pane in the lower right corner of the interface. The selected segment is highlighted in red.
With the segment selected, click the pencil icon to enable or disable highlighting. If it is orange, then highlighting is enabled.
With highlighting enabled, click and drag (or double-click a word) as usual to highlight text in the source or target column of the selected segment to create an error.
Select the type and severity level of the error to add. The selection available comes from the uploaded metric. Further information for each error type comes from the uploaded typology, and can be inspected by mousing over the error types in the dropdown menu.
Add any freetext notes on the error.
Click “Add New Error” to finalize the annotation of this error.

To select an annotation, click the associated error button underneath the associated text segment. The associated error text will be highlighted. On the right-hand side of the interface, any notes associated with the annotation will appear, along with buttons to deselect, edit, or delete the error annotation.

Project Specifications

Here, the evaluator can consult the translation project specifications, as they were formalized in the uploaded STS file.

Reports

Here, the evaluator can see a summary of the error count in the translation, split by type and severity. This is similar to the error count table that will be created in Section B3.1. There is also a button to export the project data as a JSON file.
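A summary of this kind can be sketched in a few lines of Python. The record layout below (dictionaries with `type` and `severity` keys) is a hypothetical stand-in chosen for illustration; the actual structure of the TRG Annotation Tool export may differ and should be checked against a real export.

```python
from collections import Counter

SEVERITIES = ("neutral", "minor", "major", "critical")


def count_errors(annotations):
    """Return {error type: {severity: count}} for a list of annotation records."""
    tally = Counter((a["type"], a["severity"]) for a in annotations)
    error_types = sorted({a["type"] for a in annotations})
    return {
        etype: {sev: tally[(etype, sev)] for sev in SEVERITIES}
        for etype in error_types
    }


# Hypothetical annotation records, for illustration only.
sample_annotations = [
    {"type": "Mistranslation", "severity": "minor"},
    {"type": "Mistranslation", "severity": "major"},
    {"type": "Mistranslation", "severity": "minor"},
    {"type": "Spelling", "severity": "minor"},
]

summary = count_errors(sample_annotations)
for etype, counts in summary.items():
    print(etype, counts)
```

Severity levels with no annotations are reported as zero, so every error type has a full row of counts.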

Training and Help

In-app tutorials are available in this tab.

About

This is the same throughout every project, and gives credit to the contributors and supporters of the TRG Annotation Tool, as well as contact and bug-reporting information.

B2.3 Exporting Project Data

Once the evaluator is satisfied that they have found and annotated all relevant errors in the bitext, they can export the error annotation data as a JSON file.

The JSON file can then be converted into a TEI file. This tutorial provides a webapp that converts TRG Annotation Tool JSON exports into TEI.

To ensure that no data was lost on export or conversion, this tutorial also provides a “reconstructor” for data inspection. This is a webapp that, given a TRG Annotation Tool project’s TEI file, shows a pared-down view of that project’s text and error annotations as an HTML document formatted to resemble the TRG Annotation Tool’s annotation interface.

B3 Automatic Calculation & Follow-Up Stage: Computing the Score

In the Automatic Calculation and Follow-Up Stage, we don’t need to create any more resources. The evaluator, informed by resources produced by the project manager, has created the error annotation data. The project manager has designed the scoring model to be used. Thus, we upload the necessary inputs to an automatic tool that runs the scoring model and compares the resulting overall quality score to a cutscore, delivering a final quality rating. This pass/fail quality rating is the ultimate goal of a TQE.

The automatic tool from Part A that we used for scoring is the TQE Calculator. It walks us through the automatic calculation process in three steps:

Creating an error count table.
Calculating the APT.
Calculating the QS using the selected scoring model.

Any TQE system will need to be able to replicate these steps.

B3.1 Creating an Error Count Table

An MQM approach does more than simply tally up errors. Each error annotated by the evaluator is assigned a type (with an associated weight) and a severity level (with an associated multiplier). These are the factors needed to calculate the actual number of “penalty points” incurred by the translation for each error. In terms of workflow, the error count data, prepared during the annotation stage, must be combined with the data describing the error type weights and the severity level multipliers. A mechanism for doing this is an error count table.

An error count table has a row for each error type, and a column for each severity level (usually, these are neutral, minor, major, and critical). Each row also has an error type weight, and each column has a severity level multiplier.

From there, each error type has its counts populated. The counts should be split between the severity levels. Thus, an example for data presentation in an error count table would look something like this:

TQE Error Count Table

Error Type	Error Type Weight	Severity Levels & Multiplier
		Neutral	Minor	Major	Critical
		0	1	5	25
Error Type 1	1	0	3	0	0
Error 1 Subtype	2	0	0	2	0
Error Type 2	1	0	7	0	0
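As an illustration, the table above can be held in a small data structure. The sketch below is one possible Python representation; the class name `ErrorTypeRow` and the `render` helper are hypothetical and do not reflect how the TQE Calculator stores its data.

```python
from dataclasses import dataclass, field

# Severity level multipliers, in column order, as in the table above.
SEVERITY_MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


@dataclass
class ErrorTypeRow:
    name: str
    weight: float
    counts: dict = field(default_factory=dict)  # severity level -> error count

    def count(self, severity):
        # A severity level with no recorded errors counts as zero.
        return self.counts.get(severity, 0)


error_count_table = [
    ErrorTypeRow("Error Type 1", 1, {"minor": 3}),
    ErrorTypeRow("Error 1 Subtype", 2, {"major": 2}),
    ErrorTypeRow("Error Type 2", 1, {"minor": 7}),
]


def render(table, multipliers):
    """Render the table as tab-separated text, one row per error type."""
    header = ["Error Type", "Weight"] + [f"{s} (x{m})" for s, m in multipliers.items()]
    rows = ["\t".join(header)]
    for row in table:
        cells = [row.name, str(row.weight)] + [str(row.count(s)) for s in multipliers]
        rows.append("\t".join(cells))
    return "\n".join(rows)


print(render(error_count_table, SEVERITY_MULTIPLIERS))
```

Keeping the weights on the rows and the multipliers on the columns mirrors the layout of the table, which makes the later penalty calculation a simple loop over rows and columns.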

The first step in using the TQE Calculator is to create the error count table. Upon loading the tool, there will be a blank table present in the first section:

This table can be populated automatically or manually.

To populate it automatically, upload a metric file and TEI file (a “scorecard”) as output by the TRG Annotation Tool.

To populate it manually, modify the input text boxes for the error names, weights, and counts, as well as the severity point multipliers. To add more rows (you will likely have more than two error types), use the “+ error type” button in the upper left of the table.

A populated error table should look something like this:

Note that an error count of zero may be expressly input as zero or simply left blank.
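A TQE system that accepts manual input needs the same leniency. The following sketch shows one way to read a count cell so that a blank is treated as zero; `parse_count` is a hypothetical helper, not a function of the TQE Calculator.

```python
def parse_count(cell):
    """Convert the text of a count cell to an integer, treating blank as zero."""
    text = cell.strip()
    if text == "":
        return 0
    value = int(text)  # raises ValueError for non-numeric input
    if value < 0:
        raise ValueError("an error count cannot be negative")
    return value


print([parse_count(c) for c in ["", "0", " 3 "]])
```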

B3.2 Calculating the APT

The Absolute Penalty Total (APT) is the most important value used in QS calculation. It is the sum of the Error Type Penalty Totals (ETPTs) for each error type in the project.

The ETPT of an error type is the sum of the penalty points calculated for that error type. To calculate it, the error count at each severity level is multiplied by that severity level’s multiplier. These products are summed, and the sum is multiplied by the error type’s weight to obtain that error type’s ETPT. For error type i:

\(ETPT_{i} = \displaystyle \sum_{severity\,level\,j} (ErrorCount_{i,\,j} \times SeverityMultiplier_{j} \times ErrorTypeWeight_{i})\)

Each of these ETPTs is added together to obtain the APT for the whole annotated translation sample:

\(APT = \displaystyle \sum_{error\,type\,i} ETPT_{i}\)

Or, written out in full,

\(APT = \displaystyle \sum_{error\,type\,i,\,severity\,level\,j} (ErrorCount_{i,\,j} \times SeverityMultiplier_{j} \times ErrorTypeWeight_{i})\)

These calculations can be applied to an error count table:

TQE Error Count Table with Totals

Error Type	Error Type Weight	Severity Levels & Multiplier				ETPT
		Neutral	Minor	Major	Critical
		0	1	5	25
Error Type 1	1	0	3	0	0	3
Error 1 Subtype	2	0	0	2	0	20
Error Type 2	1	0	7	0	0	7
APT						30
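The totals in the table above can be reproduced with a short Python sketch. The function names `etpt` and `apt` are illustrative only; the data is copied from the table.

```python
SEVERITY_MULTIPLIERS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}

# (error type, error type weight, error counts per severity level)
ERROR_COUNT_TABLE = [
    ("Error Type 1", 1, {"neutral": 0, "minor": 3, "major": 0, "critical": 0}),
    ("Error 1 Subtype", 2, {"neutral": 0, "minor": 0, "major": 2, "critical": 0}),
    ("Error Type 2", 1, {"neutral": 0, "minor": 7, "major": 0, "critical": 0}),
]


def etpt(weight, counts, multipliers):
    """Error Type Penalty Total: weighted sum of count x severity multiplier."""
    return weight * sum(count * multipliers[sev] for sev, count in counts.items())


def apt(table, multipliers):
    """Absolute Penalty Total: sum of the ETPTs of all error types."""
    return sum(etpt(weight, counts, multipliers) for _, weight, counts in table)


for name, weight, counts in ERROR_COUNT_TABLE:
    print(name, etpt(weight, counts, SEVERITY_MULTIPLIERS))  # 3, 20, 7
print("APT", apt(ERROR_COUNT_TABLE, SEVERITY_MULTIPLIERS))  # 30
```

For example, “Error 1 Subtype” has two major errors, so its ETPT is 2 errors × 5 (major multiplier) × 2 (error type weight) = 20.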

The error count table created in B3.1 gives all the necessary information for the TQE Calculator to perform these calculations automatically, which occurs during its second step:

B3.3 Calculating the QS

Once the APT is calculated, it is automatically passed to the third step of the TQE Calculator, which applies a linear scoring model of the user’s choice (set through the “Toggle Raw vs. Calibrated” button). Besides the APT, there are other parameters that need to be entered into the tool at this point. All except the APT and the EWC may be expressly entered into the text fields or extracted from the uploaded metric.

Name (Full Name)	Needed for which scoring models?	Can be imported from metric?
APT (Absolute Penalty Total)	Both	No (but import from Calculator Step 2)
EWC (Evaluation Word Count)	Both	No
PT (cutscore/Passing Threshold)	Both	Yes
MSV (Maximum Score Value)	Only calibrated	Yes
RWC (Reference Word Count)	Only calibrated	Yes
APP (Acceptable Penalty Points)	Only calibrated	Yes

There are intermediate values calculated by both noncalibrated and calibrated linear scoring models (see B1.2.2), but these are handled behind the scenes by the TQE Calculator.

B3.3.1 Noncalibrated (Raw) Linear Scoring Models

Noncalibrated (raw) linear scoring models are described in sections 2.4.1 and 2.5.1 of The MQM Scoring Models – MQM (Multidimensional Quality Metrics).
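As a rough illustration only, the sketch below assumes one common form of a raw linear model: the APT is divided by the EWC to give a per-word penalty, and the QS is 100 × (1 − APT / EWC). This formula, the function names, the rule that a QS equal to the PT passes, and the sample values for EWC and PT are all assumptions made for this example; the MQM sections cited above remain the authoritative definition.

```python
def raw_quality_score(apt, ewc):
    """Assumed raw linear model: QS = 100 * (1 - APT / EWC)."""
    if ewc <= 0:
        raise ValueError("EWC must be positive")
    per_word_penalty = apt / ewc
    return 100 * (1 - per_word_penalty)


def quality_rating(qs, pt):
    """Compare the QS to the cutscore (assumes a QS equal to the PT passes)."""
    return "PASS" if qs >= pt else "FAIL"


# APT taken from the example table in B3.2; EWC and PT are hypothetical values.
APT, EWC, PT = 30, 500, 90
qs = raw_quality_score(APT, EWC)
print(f"QS = {qs:.2f}, rating = {quality_rating(qs, PT)}")
```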

B3.3.2 Calibrated Linear Scoring Models

Calibrated linear scoring models are described in sections 2.4.2 and 2.5.2 of The MQM Scoring Models – MQM (Multidimensional Quality Metrics).
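As a rough illustration only, the sketch below assumes one possible form of a calibrated linear model. The penalty total is first normed to the reference word count (APT × RWC / EWC), then scaled so that a normed penalty equal to the APP yields a QS exactly equal to the PT. This formula, the function name, and all sample parameter values are assumptions made for this example; the MQM sections cited above remain the authoritative definition.

```python
def calibrated_quality_score(apt, ewc, rwc, msv, pt, app):
    """Assumed calibrated linear model.

    normed penalty = APT * RWC / EWC
    scaling factor = (MSV - PT) / APP
    QS             = MSV - normed penalty * scaling factor
    """
    if ewc <= 0 or app <= 0:
        raise ValueError("EWC and APP must be positive")
    normed_penalty = apt * rwc / ewc
    scaling_factor = (msv - pt) / app
    return msv - normed_penalty * scaling_factor


# APT taken from the example table in B3.2; all other values are hypothetical.
params = dict(apt=30, ewc=500, rwc=1000, msv=100, pt=90, app=50)
calibrated_qs = calibrated_quality_score(**params)
rating = "PASS" if calibrated_qs >= params["pt"] else "FAIL"
print(f"QS = {calibrated_qs:.2f}, rating = {rating}")
```

Under these assumed values, the sample incurs 60 normed penalty points against 50 acceptable ones, so the QS falls below the cutscore.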

Once all parameters have been uploaded, the “Run Scoring Model” button will compute the QS and report a quality rating of PASS or FAIL.

The TQE Calculator also supplies the option to generate an Excel spreadsheet. This spreadsheet recreates the error count table and shows all calculations performed, and updates the corresponding cells if the error count table is changed. It also shows intermediate values not displayed in the webapp interface.

References

Lommel, A. (2016). Blues for BLEU: Reconsidering the Validity of Reference-Based MT Evaluation. Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem”, 63-70 (PDF pages 73-80).