Glossary

A

API

An API, or Application Programming Interface, is a set of rules and protocols that allows different software applications to communicate with each other. It defines the methods and data formats that applications can use to request and exchange information. APIs play a crucial role in enabling the integration of different software systems, allowing them to work together seamlessly.
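
As a minimal sketch, the snippet below calls a hypothetical HTTP API with Python's requests library; the URL, query parameter and response structure are placeholders, not a real EDITO endpoint.

```python
import requests

# Hypothetical API call: the URL and "keyword" parameter are placeholders.
response = requests.get(
    "https://api.example.org/v1/datasets",
    params={"keyword": "temperature"},
    timeout=30,
)
response.raise_for_status()

# Assuming the (hypothetical) API returns a JSON list of datasets.
for dataset in response.json():
    print(dataset)
```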

Analysis-ready and cloud-optimized

Analysis-ready and cloud-optimized (ARCO) data refers to data that has been processed, cleaned, and formatted to meet the requirements of analytical tools or processes, and that is structured or stored in a way that takes advantage of cloud computing capabilities for storage, processing, and scalability.

Asset

An asset is a piece of data or information, or an artifact, that does not evolve over time and that is directly usable by users or software (e.g. as part of software pipelines, research studies, thematic or local applications, etc.). This includes, but is not limited to, data products, model outputs, or any object produced as part of an analytical process (tables, maps, graphics, etc.).

C

Climate and Forecast conventions

The Climate and Forecast (CF) conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from various sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.

CF standard name table

The Climate and Forecast standard name table (part of the CF Conventions), defines strings that identify physical quantities and variables.
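
As an illustration, the sketch below annotates a synthetic xarray variable with CF metadata; `sea_surface_temperature` is an example entry from the standard name table, and the data values are made up.

```python
import numpy as np
import xarray as xr

# Synthetic sea surface temperature field annotated with CF attributes.
sst = xr.DataArray(
    np.random.rand(10, 20).astype("float32"),
    dims=("lat", "lon"),
    attrs={
        "standard_name": "sea_surface_temperature",  # from the CF standard name table
        "units": "K",
        "long_name": "Sea surface temperature",
    },
)
print(sst.attrs["standard_name"])
```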

CI/CD

CI/CD stands for Continuous Integration and Continuous Delivery (or Continuous Deployment), and it represents a set of principles and practices in software development aimed at improving the efficiency, reliability, and speed of the development and delivery process.

Computing cluster

A computing cluster is a set of computing nodes (computers, Virtual Machines, HPC nodes, etc.) that work together so that they can be viewed as a single system, independent of the infrastructure.

D

Data lake

A Data Lake is a repository of raw or refined data, structured or not, without format constraints, as opposed to a data warehouse, which is a repository of highly structured and refined data. Data lakes leverage the development of high-speed and scalable cloud-based storage solutions (object, file and block storage).

Data orchestrator

A Data Computing Orchestrator, or a Data Orchestrator is a workflow manager specialized in asset production. It is responsible for creating/dispatching pipelines on the target infrastructure/computing clusters.

Digital Ocean Forum

The Digital Ocean Forum (DOF) is an annual event to collaborate with stakeholders in defining the goals and aspirations that can be accomplished by setting up a robust EU Digital Twin Ocean (DTO).

Digital twin of the ocean

A Digital Twin of the Ocean (DTO) is a virtual representation of the real ocean that has a two-way connection with it. Observations from the real ocean, in combination with models, data science and artificial intelligence, are used to create a digital twin that adapts as the real world changes. Manipulating the twin to address ‘what if’ scenarios can provide information for decision-making and highlight regions of the real ocean in need of better or different observations. A well-constructed digital twin of the ocean will enable a wide range of users to interact with ocean data and information to improve understanding, inform decisions, and support ocean literacy. DTOs can be used to explore ways in which the ocean will respond to a changing set of conditions, providing a powerful tool for decision-making. DTOs will provide ocean researchers, professionals, citizen scientists, educators, policy makers, and the general public alike with the capability to visualise and explore ocean knowledge, data, models and forecasts.

The construction of a Digital Twin of the Ocean should be based on existing European Union public services, programs and projects, including existing infrastructures, observations and tools, to provide improved products, improved ocean information, improved user services and models and tools in a Digital Co-Working Environment framework that will be used on demand. Co-design, flexibility and interactivity for users of the Digital Twin of the Ocean will be the novelties, but operationality, robustness and scientific assessment are the constraints to provide useful information. The EDITO projects aim to fulfil that need.

E

EDITO

EDITO refers to both:

- The European Digital Twin of the Ocean (EDITO) project
- The European Digital Twin Ocean Core Infrastructure platform, developed as part of the EDITO-Infra project

EDITO computing cluster

An instantiated computing cluster that can be used to run services and execute processes.

EDITO Data API

The API exposed by the EDITO platform allowing users to store, reference and access data, metadata and assets.

EDITO data catalog

The EDITO data catalog references collections on a common topic with assets that are stored either in EDITO data storage or in external platforms (such as Copernicus Marine Service or EMODnet). It offers APIs to add assets to the catalog and APIs to access the assets.

EDITO data lake

The “data access” component of EDITO, composed of both the EDITO data storage and the EDITO data catalog.

EDITO data orchestrator

A data orchestrator exposing APIs to define and execute pipelines that produce assets on EDITO computing cluster or external infrastructure (cloud-based computing clusters, HPC centres, etc.).

EDITO data storage

The EDITO data storage is a cloud-based repository of assets.

EDITO datalab

The EDITO datalab is a collaborative platform for oceanographic data science and near-data computing to build Digital Twin Oceans. It is a subcomponent of the EDITO platform dedicated to data scientists, developers and integrators. It is a web portal allowing users to graphically exploit all the capabilities of the EDITO data lake and engine and their APIs.

EDITO engine

The “computing” component of EDITO, composed of both the EDITO computing cluster and the EDITO data orchestrator.

EDITO platform

The EDITO platform is the European Digital Twin Ocean Core Infrastructure developed as part of the EDITO-Infra project. It aims at providing a good user experience to explore and share ocean data, models and services in an open and collaborative way.

EDITO process

A process defined and executed with the EDITO Process API. A process is a remote function that generates data, such as data transformations, pre/post-processing, reanalyses, forecasts, detections, What-If scenarios, or quality controls. It can be piped, scheduled, or triggered on demand. Its inputs and output locations can be configured at runtime. Unlike a service, it is not interactive during execution (it does not host a web server or UI).

EDITO Process API

The API exposed by the EDITO platform allowing users to define, execute and manage processes.

EDITO project

The European Digital Twin of the Ocean (EDITO) is composed of two projects, the “EU Public Infrastructure backbone for the European Digital Twin of the Ocean” (EDITO-Infra) and the “European Digital Twin of the Ocean Underlying Models” (EDITO-Model-Lab). You can learn more about these projects here.

EDITO service

A service defined and executed with the EDITO Service API. A service is an interactive application (serving API and/or a graphical web interface). It can be a data science tool or an end-user application (decision making applications, What-If applications or focus applications) dedicated to anyone or to a specific community of users.

EDITO Service API

The API exposed by the EDITO platform allowing users to define, execute and manage services.

EDITO user software

Software is a set of computer programs and their associated documentation and data. In the context of the EDITO-Infra project, a “user software” is a piece of software uploaded by a user to the EDITO platform. Two types of software are allowed: containerized software (i.e. Docker images) and scripts/code in languages that are handled by the EDITO Service API and Process API. Both are foreseen to define end-user applications, tools, ad-hoc tasks and pipelines.

EDITO virtual co-working environment

As part of the EDITO-Infra project, a virtual co-working environment is a web-based portal for users to connect, interact, work together in communities and share knowledge. It leverages the capabilities of a Virtual Development Environment (VDE) and a Virtual Research Environment (VRE), with a focus on social networking and collaborative tools and an advanced ergonomic approach to data visualisation and interactivity (e.g. smart viewer). In particular, it includes a set of tools for intermediate users to co-develop scientific toolboxes and to perform simulations, using orchestrators and GUIs to optimise workflows, the connection with the data lake, and the exploitation of a large diversity of computing resources.

On the EDITO platform, the virtual co-working environment is materialized by its sub-components: the EDITO datalab, the integrator viewer, and the trainings and tutorials.

EDITO-Infra

The main aim of EDITO-Infra is to build the EU Public Infrastructure backbone for the European Digital Twin of the Ocean by upgrading, combining and integrating key service components of the existing EU ocean observing, monitoring and data programmes Copernicus Marine Service and European Marine Observation and Data Network (EMODnet) into a single digital framework.

EuroHPC

EuroHPC stands for European High-Performance Computing. EuroHPC is a joint collaboration and initiative among European countries to develop and deploy a pan-European supercomputing infrastructure. The primary goal of EuroHPC is to enhance Europe’s competitiveness in the field of high-performance computing, support scientific and industrial research, and address complex challenges that require significant computational power.

F

FAIR principles

FAIR is an acronym for Findable, Accessible, Interoperable, and Reusable. These principles are designed to improve the usability and value of digital assets, especially research data. The FAIR Data Principles were introduced to address challenges related to the discovery, accessibility, and usability of data in scientific research.

G

GitLab

GitLab is a web-based platform that provides a complete DevOps lifecycle tool for managing source code repositories, continuous integration, and deployment pipelines. It offers features for version control using Git, issue tracking, code review, continuous integration, and delivery (CI/CD), and collaboration among development teams, making it a comprehensive solution for software development and project management.

H

HPC

HPC stands for High-Performance Computing. It refers to the use of powerful computing systems to solve complex computational problems or perform tasks that demand substantial processing power and computational resources. High-Performance Computing involves the use of parallel processing and supercomputers to execute tasks quickly and efficiently.

I

IAM

The term “Identity and Access Management (IAM) system” refers to a comprehensive framework or set of processes and technologies designed to manage and secure digital identities within an organization. IAM systems play a crucial role in controlling and regulating access to various resources, systems, and data based on the users’ identities and their associated permissions. This includes activities such as user authentication, authorization, and the management of user credentials, roles, and privileges. The primary goal of an IAM system is to ensure that the right individuals have appropriate access to the right resources at the right time, while also maintaining security and compliance standards.

J

JSON Web Tokens

JSON (JavaScript Object Notation) Web Tokens (JWTs) are a compact, self-contained means of representing information between two parties. They are commonly used for authentication and authorization purposes in web development.
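
To make the structure concrete, the sketch below builds an unsigned JWT-like token in plain Python and decodes its payload; real JWTs carry a cryptographic signature that must be verified, which is omitted here for brevity, and the claims are illustrative.

```python
import base64
import json

# A JWT is three base64url-encoded segments: header.payload.signature.
def b64url(data: dict) -> str:
    raw = json.dumps(data, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

header = {"alg": "none", "typ": "JWT"}          # no signature, illustration only
payload = {"sub": "user-123", "name": "Jane Doe", "exp": 1735689600}
token = f"{b64url(header)}.{b64url(payload)}."

# Decoding: split on '.', restore padding, base64url-decode the payload segment.
segment = token.split(".")[1]
segment += "=" * (-len(segment) % 4)
print(json.loads(base64.urlsafe_b64decode(segment)))
```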

K

Keycloak

Keycloak is an open-source identity and access management (IAM) solution developed by Red Hat. It provides a robust and flexible platform for securing applications and services through features such as authentication, authorization, and user management. It supports various identity protocols, including OAuth 2.0 and OpenID Connect, making it interoperable with a wide range of applications and services. Keycloak also includes features like role-based access control, multi-factor authentication, and social login integration.

M

Metadata

Metadata is “data that provides information about other data”, but not the content of the data, such as the text of a message or the image itself.

N

NetCDF

Network Common Data Form (NetCDF) is a standardized file format used for storing multidimensional scientific datasets. It is commonly employed in various scientific fields like geosciences, meteorology, and oceanography due to its capacity to manage complex data structures efficiently.
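
A minimal sketch of working with NetCDF in Python, assuming xarray and a NetCDF backend (e.g. netCDF4) are installed; the variable, coordinates and file name are illustrative.

```python
import numpy as np
import xarray as xr

# Build a small synthetic dataset and write it to a NetCDF file.
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(3, 4, 5))},
    coords={
        "time": np.arange(3),
        "lat": np.linspace(-60, 60, 4),
        "lon": np.linspace(0, 360, 5),
    },
)
ds.to_netcdf("example.nc")

# Read it back; dimensions and metadata are preserved by the format.
with xr.open_dataset("example.nc") as reopened:
    print(reopened["temperature"].shape)  # (3, 4, 5)
```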

O

OAuth2

OAuth 2.0, short for “Open Authorization 2.0,” is an authorization framework that enables a third-party application to obtain limited access to a user’s resources on a server, such as an API (Application Programming Interface), without exposing the user’s credentials to the application. It is widely used for enabling secure access to APIs and web services.
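
The sketch below walks through a hypothetical OAuth 2.0 "client credentials" flow with Python's requests library; the token endpoint, client credentials and protected API URL are placeholders, not real EDITO values.

```python
import requests

# Placeholder token endpoint and client credentials.
TOKEN_URL = "https://auth.example.org/oauth/token"

response = requests.post(
    TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
    },
    timeout=30,
)
access_token = response.json()["access_token"]

# The access token is then sent as a bearer token when calling a protected API.
api_response = requests.get(
    "https://api.example.org/protected/resource",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
print(api_response.status_code)
```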

OGC

OGC stands for the Open Geospatial Consortium. It is an international consortium that develops and publishes open standards for geospatial information and technologies. The OGC is focused on promoting interoperability and collaboration in the field of geospatial data and services.

OGC API – Features

A multi-part standard that offers the capability to create, modify, and query spatial data on the Web and specifies requirements and recommendations for APIs that want to follow a standard way of sharing feature data.
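
For instance, a client can list features in a bounding box through the standard `/collections/{collectionId}/items` endpoint, as sketched below; the server URL and collection identifier are placeholders.

```python
import requests

# Hypothetical OGC API - Features server and collection.
BASE_URL = "https://features.example.org"
collection = "ports"

# The items endpoint returns a GeoJSON FeatureCollection, filtered here by bbox.
resp = requests.get(
    f"{BASE_URL}/collections/{collection}/items",
    params={"bbox": "-10,40,5,55", "limit": 10},
    timeout=30,
)
for feature in resp.json()["features"]:
    print(feature["id"], feature["geometry"]["type"])
```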

OpenID Connect

OpenID Connect (OIDC) is an identity layer built on top of the OAuth 2.0 protocol, designed to provide authentication and authorization capabilities for web and mobile applications. Developed to address the need for a standardized identity layer on the internet, OpenID Connect enables clients (applications or websites) to verify the identity of end-users based on authentication performed by an authorization server.

P

Pipeline

A pipeline, or computing pipeline, or data pipeline, or ETL (Extract, Transform, Load) pipeline, is a chain of tools (software scripts, tasks, operations, etc.) that takes assets from one location, storage or format and processes them into a different location, storage or format. For instance, data external to the data lake can be processed to be stored in an optimised format within the data lake. Pipelines can be triggered either automatically by an orchestrator or explicitly by a user.
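
A minimal sketch of an ETL pipeline in Python; the file names and the temperature conversion are purely illustrative.

```python
import csv
import json

# Extract: read rows from a (hypothetical) CSV file of observations.
def extract(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: e.g. convert temperature from degrees Celsius to Kelvin.
def transform(rows: list[dict]) -> list[dict]:
    return [{**row, "temperature_k": float(row["temperature_c"]) + 273.15} for row in rows]

# Load: write the transformed rows to a JSON file.
def load(rows: list[dict], path: str) -> None:
    with open(path, "w") as f:
        json.dump(rows, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("raw_observations.csv")), "observations.json")
```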

S

S3 storage

S3 storage refers to object storage accessed through the Amazon S3 (Simple Storage Service) API, a de facto standard interface for cloud object storage. Data is stored as objects within buckets and accessed over HTTP(S), which makes S3-compatible storage well suited for scalable, cloud-based data lakes.
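
The sketch below lists objects in a bucket on an S3-compatible endpoint with boto3; the endpoint URL, bucket name and credentials are placeholders.

```python
import boto3

# S3-compatible client; endpoint and credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.org",
    aws_access_key_id="MY_ACCESS_KEY",
    aws_secret_access_key="MY_SECRET_KEY",
)

# List the first objects in a bucket.
response = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```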

STAC

The SpatioTemporal Asset Catalog (STAC) specification provides a common structure for describing and cataloging spatiotemporal assets. A spatiotemporal asset is any file that represents information about the earth captured in a certain space and time.

STAC Catalog

A simple, flexible JSON (JavaScript Object Notation) file of links that provides a structure to organize and browse STAC Items.

STAC Collection

A STAC object extending the STAC Catalog object with additional information such as the extents, license, keywords, providers, etc. that describe STAC Items that fall within the Collection.

STAC Item

The core atomic unit, representing a single spatiotemporal asset as a GeoJSON feature plus datetime and links.
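
A minimal STAC Item built with the pystac Python library is sketched below; the identifier, geometry and asset href are placeholders.

```python
from datetime import datetime, timezone

import pystac

# A STAC Item is a GeoJSON Feature with a datetime, links and assets.
item = pystac.Item(
    id="example-scene-20240101",
    geometry={"type": "Point", "coordinates": [-4.5, 48.4]},
    bbox=[-4.5, 48.4, -4.5, 48.4],
    datetime=datetime(2024, 1, 1, tzinfo=timezone.utc),
    properties={},
)
item.add_asset(
    "data",
    pystac.Asset(href="https://example.org/data/scene.nc", media_type="application/x-netcdf"),
)
print(item.to_dict()["type"])  # "Feature"
```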

Storage tiering

Storage tiering refers to the strategy of placing data on different classes of storage according to how frequently it is accessed and how quickly it must be retrieved (e.g. frequently accessed “hot” data on fast storage, rarely accessed “cold” data on cheaper archive storage), in order to optimize data storage-related cost.

V

Vault

HashiCorp Vault is an open-source tool designed for secure secret management, data protection, and access control in modern infrastructure. Vault addresses the challenges associated with managing sensitive information, such as passwords, encryption keys, and API tokens, in a secure and scalable manner.

Virtual Machine

A Virtual Machine (VM) is a software-based emulation of a physical computer that operates in an isolated environment. It allows multiple operating systems to run on a single physical machine, enabling users to run diverse applications and services without affecting the underlying hardware. Each VM must be configured with a dedicated operating system and software as if it were a separate computer.

W

WEkEO

WEkEO is an online platform designed to provide access to a wide range of Earth observation data and related information. It offers a collaborative environment for researchers, scientists, and the broader Earth observation community to access, analyze, and share data. The platform is developed in partnership with the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT), the European Centre for Medium-Range Weather Forecasts (ECMWF), and the European Commission. WEkEO provides users with a centralized hub for accessing a diverse set of Earth observation datasets, including satellite imagery, climate data, and other environmental data.

WMS

Web Map Service is a standard protocol for serving georeferenced map images over the internet. It allows users to access and retrieve map images from a server and overlay them onto their own maps or applications.
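
As an example, a WMS GetMap request can be issued with plain HTTP query parameters, as sketched below; the server URL and layer name are placeholders.

```python
import requests

# Hypothetical WMS endpoint and layer.
WMS_URL = "https://wms.example.org/wms"

params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "sea_surface_temperature",
    "CRS": "EPSG:4326",
    "BBOX": "40,-10,55,5",   # lat/lon axis order for EPSG:4326 in WMS 1.3.0
    "WIDTH": 800,
    "HEIGHT": 600,
    "FORMAT": "image/png",
}

# The server returns a rendered map image for the requested extent.
response = requests.get(WMS_URL, params=params, timeout=60)
with open("map.png", "wb") as f:
    f.write(response.content)
```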

Workflow

A workflow is the representation of a series of pipelines.

Workflow manager/engine

A workflow manager or engine is the executor of defined workflows.

Z

Zarr

A data storage format for multi-dimensional arrays, optimized for efficient handling of large datasets in scientific computing. It supports compression, chunking, and is commonly used in Python for tasks involving arrays and data analysis.
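
A minimal sketch of writing and reopening a Zarr store with xarray, assuming the xarray, zarr and dask packages are installed; the variable, chunking and store path are illustrative.

```python
import numpy as np
import xarray as xr

# Build a small synthetic dataset and write it to a Zarr store,
# choosing one time step per chunk on disk.
ds = xr.Dataset(
    {"salinity": (("time", "lat", "lon"), np.random.rand(10, 180, 360).astype("float32"))},
)
ds.to_zarr(
    "salinity.zarr",
    mode="w",
    encoding={"salinity": {"chunks": (1, 180, 360)}},
)

# Reopen lazily: only the chunks needed for a computation are read.
reopened = xr.open_zarr("salinity.zarr")
print(reopened["salinity"].isel(time=0).mean().values)
```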