In the case of terms with several meanings (polysemous words or homonyms), getting concordance search results from the wrong topic can be misleading, or at the very least annoying. For example, researching the legal term “resolution” and getting results from chemistry-related regulations can easily feel like a waste of time, write Tímea Palotai-Torzsás and Robin Palotai, co-founders of Juremy.
“As we are not only developers but also active users of Juremy.com, we understand the importance of choosing the right context first-hand. So we looked for a way to solve this issue by focusing our searches on specific domains.
In this article we’ll give a concise overview of the EuroVoc thesaurus, its existing practical uses in EU legal and professional translation, but also a challenge we faced while using it to customize searches in Juremy, and how we solved that.
1. What is EuroVoc classification and why was it created?
EuroVoc is the multilingual thesaurus maintained by the Publications Office of the European Union, covering the activities of the EU. EuroVoc organizes concepts into 21 domains and 127 sub-domains in all 24 official languages of the European Union and in three languages of countries which are candidates for EU accession: Albanian, Macedonian and Serbian. The covered fields relate to EU and parliamentary activities, and encompass both European Union and national points of view.
The original aim of the thesaurus was to provide the information management and dissemination services with a coherent indexing tool for the effective management of their documentary resources. Today, it enables users to carry out documentary searches using a controlled vocabulary and with the benefit of semantic networks between concepts. Let’s have a closer look at EuroVoc’s semantic network structure:
The 21 domains are different fields of knowledge which are used to categorize concepts. For example, EUROPEAN UNION, LAW, TRADE or TRANSPORT are domains. The domains are further subdivided into 127 sub-domains (also called microthesauri). For example, see the listing of all 7 sub-domains of the domain TRADE below:
Each such microthesaurus further expands to a tree of concepts (also called terms). See a part of the concept tree under microthesaurus “2026 Consumption” listed below:
As illustrated above, immediately below a microthesaurus are top terms (TT), and further narrower terms (NT). Narrower terms are indexed by their distance from some reference term. In the example above, NT1 is one step apart from the top term, while NT2 is two steps apart.
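The TT/NT numbering can be sketched in a few lines of Python. This is only an illustration of the labelling scheme: the `Concept` class, the `label_levels` function and the concept labels in the sample tree are assumptions made up for this example, not an excerpt of the real EuroVoc data.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept with its narrower terms (hypothetical structure)."""
    label: str
    narrower: list["Concept"] = field(default_factory=list)


def label_levels(concept, depth=0):
    """Yield (marker, label) pairs: TT for the top term itself,
    NTn for a term that is n steps below the top term."""
    marker = "TT" if depth == 0 else f"NT{depth}"
    yield marker, concept.label
    for child in concept.narrower:
        yield from label_levels(child, depth + 1)


# Illustrative labels only, not copied from the actual microthesaurus.
tree = Concept("consumption", [
    Concept("consumer behaviour", [Concept("consumer motivation")]),
    Concept("household consumption"),
])

levels = list(label_levels(tree))
for marker, label in levels:
    print(f"{marker:<4}{label}")
```

Because the marker is derived purely from the distance to the top term, two NT1 terms under the same top term are siblings, and an NT2 term always sits under some NT1 term.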
The EuroVoc concept tree is rather broad and shallow. In the latest release of EuroVoc, published in December 2021, out of the 7301 concepts, 545 are top terms, and the bulk of the remaining terms are just one or two steps apart from a top term.
The concepts are hierarchically rather sparsely connected as well, due to an important principle of EuroVoc: apart from the GEOGRAPHY domain, concepts only belong to a single domain, even if logically they could belong to multiple. Taking the example of the EuroVoc handbook, the concept “common agricultural policy” could logically either belong to the domain EUROPEAN UNION, or the domain “AGRICULTURE, FORESTRY AND FISHERIES”. But in practice, it is found under the latter, more specific domain.
This limitation of hierarchical relations simplifies categorization, but also introduces some ambiguity for users (“Did I select the right domain?”). We’ll see later how we addressed this within Juremy’s domain-based search functionality.
2. Who uses EuroVoc and for which purposes? Why is it useful for linguists?
- Indexing large document collections: usage by EU and national organizations
EuroVoc is one of the means for indexing the EUR-Lex documentary collection. It is used by:
− the European Parliament
− the Publications Office of the European Union
− the national and regional parliaments in Europe
− some national government departments and European organizations, and Documentation Centres.
The assigned descriptors (domains, microthesauri or concepts) can then be used either to guide search and retrieval of documents in the collection, or to quickly convey a summary of the document contents to users.
- EUR-Lex advanced document search filter
EuroVoc descriptors appear to be a useful tool for capturing the thematic content of EU legislation. It is possible to search EUR-Lex documents by “Theme” in the advanced search page, which allows users to pick one or more of the EuroVoc descriptors used to describe the content of a document. This is more useful for document search than for terminology lookup.
However, in EUR-Lex advanced search, it is not possible to filter on a complete domain or microthesaurus with a single click. If the user wishes to only retrieve documents belonging to a given domain, she would need to tick all top terms in that category (see the image below). Consequently, this use case is better suited to very specific document research in a narrower field.
- Domain filtering of IATE terms
IATE (Inter-Active Terminology for Europe) is the EU’s interinstitutional terminology database, administered by the EU Translation Centre. Currently IATE contains 934 thousand entries and around 8 million terms in the 24 official EU languages, and also Latin. The EuroVoc thesaurus serves as the basis for the IATE classification of terms.
The “Expanded search” surface on the IATE website supports domain filtering across all EuroVoc domains and descriptors.
Another use case of applying domain filters to IATE terms is the Term Recognition plugin in the Trados Studio CAT tool. When activated in the Termbase Settings, the IATE Terminology Provider plugin will automatically extract all IATE terms in the active segment of the document and display them in the target language as well. With an additional setting, translators can filter IATE terms by choosing one or more domains in the domain filter page.
However, in the case of legal language or more general expressions, it is hard to decide which domain(s) should be chosen to find the right terminology. In order not to exclude possibly accurate translations, one feels an urge to tick all boxes.
As EuroVoc is used to categorize both EUR-Lex documents and IATE terms, which are coincidentally the corpora on which Juremy offers blazing-fast bilingual concordance search, it is a natural choice for focusing search in Juremy as well.
But as seen above, the problem is that filtering reduces the number of potentially useful query hits, by not surfacing terms which don’t belong to the chosen domain, sub-domain or concept. Now let’s see how Juremy resolves this issue!
3. How does Juremy surface more relevant results by using the EuroVoc classification?
Focusing the topic is very important in EU institutional and legal translation. The risk of choosing a target language equivalent from the wrong context can be high, particularly for shorter query expressions.
For example, if we would like to find the German equivalent of the term “duty”, first we have to determine which field of knowledge our search term belongs to. If the document topic, or surroundings of the expression suggests a financial context, we should search for equivalents within the EuroVoc domain “Finance”. In this case, the accurate translation would be a variant of “Steuer” or “Zoll” (see the following illustration from Juremy’s new interface):
On the other hand, if the context suggests that the term “duty” is rather related to an industry sector specification, we should be looking for the correct translation in the “Industry” domain. It is always the linguist’s understanding of the translation project which will help choose the right terminology. Specifically in case of this example, “duty” has at least a dozen different meanings depending on the sector in which it is to be interpreted, as illustrated below:
However, there are two main issues which arise when we try to filter topics, or look for the equivalent term only in a given domain or sub-domain.
– First, it is often hard for a translator to decide which EuroVoc topic to choose when trying to find the most accurate translation of a term. The classification is quite complex; furthermore, a given IATE term or EUR-Lex document often belongs to multiple domains.
– Second, it is a top priority for linguists to find as many relevant hits for a term query as possible. This allows them to choose the most accurate term from a larger pool of alternatives.
What if we added EuroVoc topics to Juremy as filters? As a user, if we didn’t get any matches for our query, we could never be sure whether there were any relevant matches that had been filtered away. That thought would prompt us to retry the search with the filters turned off. Toggling the filters on and off is inefficient at best, and it carries the risk that we would stop using topic filtering altogether.
To resolve this contradiction, we implemented a user-friendly and efficient way to focus searches: instead of setting strict filters, we can set topic preferences! When preferences are set, Juremy will return preferred results first, but will still fall back to other topics if there aren’t any preferred ones. This way we can’t miss any relevant results.
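To make the difference between filtering and preferring concrete, here is a minimal Python sketch. It is not Juremy’s actual implementation: the hit records, the `domains` field and the function names `strict_filter` and `prefer` are assumptions made up for illustration.

```python
def strict_filter(hits, preferred):
    """Keep only hits tagged with a preferred domain; everything else is lost."""
    return [hit for hit in hits if hit["domains"] & preferred]


def prefer(hits, preferred):
    """Move hits from preferred domains to the front and keep all the others.

    sorted() is stable, so hits keep their original relative order
    within the preferred group and within the fallback group.
    """
    return sorted(hits, key=lambda hit: not (hit["domains"] & preferred))


# Made-up hits for the query "duty"; the domain tags are illustrative.
hits = [
    {"target": "Pflicht", "domains": {"LAW"}},
    {"target": "Zoll", "domains": {"FINANCE", "TRADE"}},
    {"target": "Steuer", "domains": {"FINANCE"}},
]

finance_first = [hit["target"] for hit in prefer(hits, {"FINANCE"})]
filtered_out = strict_filter(hits, {"TRANSPORT"})
fallback = [hit["target"] for hit in prefer(hits, {"TRANSPORT"})]
```

With a FINANCE preference, “Zoll” and “Steuer” move ahead of “Pflicht”. With a TRANSPORT preference, the strict filter returns nothing at all, while the preference-based ordering simply falls back to the full, unchanged list of hits.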
Currently we support setting topic preferences in terms of EuroVoc domains. Preferring a domain will automatically prefer all IATE entries or EUR-Lex documents tagged with that domain or any of its constituent concepts. Clicking a domain label on any result entry (after enabling the Topic view) will let us set our preference for that domain.
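One way to picture how a domain preference extends to concepts is a lookup from each concept to its single parent domain. The sketch below is an assumption about how this could work, not a description of Juremy’s internals: the `CONCEPT_DOMAIN` table is a hypothetical one-entry excerpt, and `domains_of` and `is_preferred` are illustrative names.

```python
# Hypothetical excerpt of a concept -> domain table. In EuroVoc, a concept
# outside GEOGRAPHY belongs to exactly one domain, so a plain dict suffices.
CONCEPT_DOMAIN = {
    "common agricultural policy": "AGRICULTURE, FORESTRY AND FISHERIES",
}


def domains_of(tags):
    """Resolve a set of tags (domain names or concept labels) to domains.

    Tags that are not known concepts are assumed to be domain names already.
    """
    return {CONCEPT_DOMAIN.get(tag, tag) for tag in tags}


def is_preferred(tags, preferred_domains):
    """True if any tag is a preferred domain or a concept inside one."""
    return bool(domains_of(tags) & preferred_domains)


cap_in_agriculture = is_preferred(
    {"common agricultural policy"}, {"AGRICULTURE, FORESTRY AND FISHERIES"}
)
cap_in_eu = is_preferred({"common agricultural policy"}, {"EUROPEAN UNION"})
```

Here an entry tagged only with the concept “common agricultural policy” counts as preferred when the agriculture domain is preferred, but not when only EUROPEAN UNION is, which mirrors the single-domain principle described earlier.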
Also, we found it is often easier to tell what domain we likely won’t need. So in addition to positive preferences, we can also set negative ones – domains from which (at least given better options) we would rather not get results. And once we have set up some preferences, they can be fine-tuned further with the topic preferences editor:
A bit about the visuals: As topics are treated more prominently on Juremy’s user interface, we wanted the topic labels to be quickly distinguishable and informative, but not obtrusive. To help with that, we had a custom icon set designed to represent the 21 EuroVoc domains, and assigned well-differentiated colors to the domains as well.
We also found these objectives to be somewhat contradictory. So we leave the choice of markup style to our users, and provide two customization options:
- Choice of detail exposed on the label: just the domain icon, or also the domain name, or even showing the specific concept within the domain.
- Ability to turn off colors. However mesmerizing nice colors are, we wanted an option to keep the topic labels in the background as much as possible.
The topic preferences feature has just been released and is available to users after a short registration on our website. Sign up for the 30-day free trial to benefit from all of Juremy’s functionalities. If you have any questions or remarks relating to this article, please feel free to email us at firstname.lastname@example.org.
 the EuroVoc Handbook issued by the Publications Office, http://publications.europa.eu/resource/cellar/a2723a83-574f-11eb-b59f-01aa75ed71a1.0001.01/DOC_1
 Apart from the hierarchical term relations, EuroVoc also contains various links that allow horizontal movement between related terms (RT) even across domains.
 Or in fact, a concept can only belong to a single parent concept (broader term, BT), so there aren’t any dense hierarchical relations even within a single domain, except GEOGRAPHY.
 Michal Ovádek: Facilitating access to data on European Union laws, https://www.tandfonline.com/doi/full/10.1080/2474736X.2020.1870150