Data glossary

Definitions of standard data terms

FAIR RDM Glossary

Terminology taken from the PATTERN Training material (see footnote 1).

Definitions of common data terms
Term Definition Source
Data Data, in its simplest form, refers to information that can be collected, stored, analyzed, and shared. https://geo-data-support.sites.uu.nl/what-is-research-data/
Dataset A dataset is any organized collection of data. The most basic dataset is composed of data elements in a table. Each column represents a particular variable. Each row corresponds to a given value of that column’s variable. A dataset may also present information in a variety of non-tabular formats, such as an Extensible Markup Language (XML) file, a geospatial data file, or an image file. Dataset is a flexible term and may refer to an entire database, a spreadsheet or other data file, or a related collection of data resources. https://data.ca.gov/pages/open-data-glossary
Data Management Plan A Data Management Plan (DMP) is a formal document you develop at the start of your research project which outlines all aspects of managing your data, both during and after your project. It contains, among other things: https://www.uu.nl/en/research/research-data-management/guides/data-management-planning
●      what data you will collect (type, size)
●      where you will store them, who will have access to them and how access is regulated
●      how often you will back up your data
●      how you will document your data
●      the versioning strategy and folder structure you will use
●      whether there are any privacy and ownership issues
●      whether and how you are planning to archive and share your data.
FAIR Findable, Accessible, Interoperable, Reusable
Research Data Management (RDM) Research data management refers to the handling of research data (collection, organisation, storage, and documentation) during and after a research activity. Good data management helps ensure that researchers share their data in a FAIR way (findable, accessible, interoperable, and reusable). https://www.scienceeurope.org/our-priorities/open-science/research-data-management/
Metadata Data associated with an information object for purposes of description, administration, legal requirements, technical functionality, use and usage, and preservation. Getty Research Institute (https://www.getty.edu/research/publications/electronic_publications/intrometadata/setting.pdf)
Metadata is information about a dataset that makes the data easier to find or identify. Metadata includes the title and description, method of collection, limitations, author, publisher, area and time period covered, license, date and frequency of release. Metadata describes the dataset’s structure, data elements, its creation, access, format, and content.
Persistent identifier A persistent identifier (PID) is a long-lasting reference to a resource that provides the information required to reliably identify, verify and locate the resource. In a digital environment, PIDs have the form of URLs. When pasted in a browser, they take users to the resource. https://www.openaire.eu/guides/
CC0 1.0 CC0 (aka CC Zero) is a public dedication tool, which enables creators to give up their copyright and put their works into the worldwide public domain. About CC Licenses - Creative Commons
Preferred file formats Preferred formats are file formats of which DANS – based on international agreements – is confident that they will offer the best long-term guarantees in terms of usability, accessibility and sustainability. Deposits of research data in preferred formats will always be accepted by DANS. https://dans.knaw.nl/en/file-formats/
Open file formats Open file formats are file formats that are published and freely available for anyone to use. A file format is a standard way of encoding information for storage in a computer file. Open file formats can be contrasted with proprietary, protected file formats. Open file formats are often recommended for preservation purposes because they typically do not require special software to open. https://www.nnlm.gov/guides/data-glossary/open-file-formats
FAIR-Aware tool The FAIR-Aware tool helps you assess your knowledge of the FAIR Principles, and better understand how making your data(set) FAIR can increase the potential value and impact of your data. https://fairaware.dans.knaw.nl/
Open data Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose (http://opendefinition.org/). For Data.ca.gov, open data is regularly updated and comes from an authoritative source. https://data.ca.gov/pages/open-data-glossary
Data documentation Data documentation includes various types of information that can help find, assess, understand/interpret, and (re)use research data – e.g. information about methods, protocols, datasets to be used and data files, preliminary findings, etc. Documentation helps understand the context in which data were created, as well as the structure and the content of data. Data should be documented through all stages of the research data lifecycle. Detailed and rich documentation ensures reproducibility and upholds research integrity. Documentation also includes metadata. https://www.openaire.eu/rdm-glossary
ReadMe file A readme file provides information about a data file and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. https://data.research.cornell.edu/data-management/sharing/readme/
Tidy Data principles ●      Tidy data is data that is structured so that it is easy to work with on a computer https://datacarpentry.org/semester-biology/materials/tidy-data/
●      Creating tidy data as you collect it will make it much easier to analyze it later
Knowledge Organisation System (KOS) Generic term used for referring to a wide range of items (e.g. subject headings, thesauri, classification schemes and ontologies), which have been conceived with respect to different purposes, in distinct historical moments. They are characterized by different specific structures and functions, varied ways of relating to technology, and used in a plurality of contexts by diverse communities. However, what they all have in common is that they have been designed to support the organization of knowledge and information in order to make their management and retrieval easier. https://www.isko.org/cyclo/kos
MARC MARC stands for machine-readable cataloging. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form. https://www.loc.gov/marc/
Linked Open Data Data whose relationships (connections) to other data are made available, allowing easy data access. CODATA Research Data Management Terminology
A typical case of a large linked dataset is DBpedia (http://dbpedia.org/), which essentially makes the content of Wikipedia available in RDF. This collection of interrelated datasets is stored on the Web and available in a common format, RDF. https://doi.org/10.5281/zenodo.10626170
API - Application Programming Interface An API is a collection of definitions, protocols and tools that allows software components to interact and communicate with each other.
Schema Metadata schemas (or metadata standards) are compilations of categories for describing data. A distinction is made between interdisciplinary or independent standards and discipline-specific or dependent standards. Metadata schemas are intended to ensure that all researchers use the same descriptive vocabulary in order to guarantee interoperability and thus comparability of data sets. https://www.uni-marburg.de/en/hefdi/data-information-consulting/frequently-asked-questions-faq/publishing-and-sharing-research-data/what-are-metadata-metadata-schemas-controlled-vocabularies-and-documentation
RDF Triple As its name indicates, a triple is a sequence of three entities that codifies a statement about semantic data in the form of subject–predicate–object expressions (e.g., “Bob is 35”, or “Bob knows John”). Wikipedia
Repository Physical or digital storage location that can house, preserve, manage, and provide access to many types of digital and physical materials in a variety of formats. Materials in online repositories are curated to enable search, discovery, and reuse. There must be sufficient control for the physical and digital material to be authentic, reliable, accessible and usable on a continuing basis. CODATA RDM Terminology Working Group. (2024). CODATA RDM Terminology (2023 version): overview (Version 2023). Zenodo. https://doi.org/10.5281/zenodo.10626170
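The subject, predicate, object pattern described under RDF Triple can be made concrete with a short sketch. It models the two example statements from that entry ("Bob is 35" and "Bob knows John") as plain Python tuples. The `Triple` class, the `objects_of` helper, and the plain-string names are assumptions made for illustration only; they are not part of RDF or of any RDF library, and real RDF data would normally identify resources with IRIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str  # a literal value or the name of another resource


# Hypothetical plain-string names; real RDF would identify resources with IRIs.
graph = [
    Triple("Bob", "age", "35"),      # "Bob is 35"
    Triple("Bob", "knows", "John"),  # "Bob knows John"
]


def objects_of(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


print(objects_of(graph, "Bob", "knows"))  # prints ['John']
```

Because every statement has the same three-part shape, a collection of triples can be queried or merged with other collections without agreeing on a table layout first, which is one reason the format suits linked data.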

Footnotes

  1. PATTERN Training, DOI 10.5281/zenodo.11093950