Data documentation guidelines

This page describes how energy research data should be documented in the Tampere University energy research data catalog. It is intended both for potential users of the data and for people adding data to the catalog: it helps users understand the data documentation, and it helps documenters document their data. The catalog is based on the open source CKAN data management platform, so in addition to this page, the CKAN user guide can be consulted for more details on using the catalog.

Available data is documented as datasets. A dataset should be a suitably sized logical entity: its data should relate to a single topic that is narrow enough to make the dataset easy to find. For example, weather observations including temperature and humidity from Tampere and Helsinki could be documented in several ways: as one dataset, as two datasets (one for each city), or as four datasets (one for each combination of measurement and city). Here the best solution could be the two-dataset option. So even if your data could work as one dataset, consider whether it would make sense to split it into smaller datasets.
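The two-dataset option from the weather example can be sketched in Python as follows. This is only an illustration under stated assumptions: the observation rows and the helper name split_by_city are hypothetical and are not part of the catalog or of CKAN.

```python
from collections import defaultdict

# Hypothetical weather observations: (city, measurement, value).
observations = [
    ("Tampere", "temperature", -3.5),
    ("Tampere", "humidity", 81.0),
    ("Helsinki", "temperature", 1.2),
    ("Helsinki", "humidity", 76.0),
]


def split_by_city(rows):
    """Group observation rows into one candidate dataset per city."""
    datasets = defaultdict(list)
    for city, measurement, value in rows:
        datasets[city].append((measurement, value))
    return dict(datasets)


datasets_by_city = split_by_city(observations)
for city, rows in datasets_by_city.items():
    print(f"Weather observations from {city}: {len(rows)} rows")
```

Each key of datasets_by_city would then correspond to one dataset in the catalog, with both measurements kept together for that city.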

Datasets are organised using organizations and groups. An organization should be the party that provides the data, is responsible for maintaining it, and can help in using it. It can be, for example, a research project, a research group or a university unit. If a suitable organization does not exist, the CKAN administrator (otto.hylli@tuni.fi) can create it. If the organization exists, its administrator or the CKAN administrator can add users as members, which enables them to create new datasets for the organization. Groups are a handy additional way to categorise datasets, for example when the same organization has data related to multiple topics. Groups can also include datasets from multiple organizations. A dataset can also belong to multiple groups, so the weather observations from Tampere dataset could, for example, belong to both a group for weather and a group for Tampere.
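For documenters who prefer scripting over the web form, the relationship between a dataset, its organization and its groups can be sketched as a request body for CKAN's package_create API action. This is a minimal sketch, not the catalog's prescribed workflow: the dataset name, the organization name "example-research-group" and the group names are hypothetical, and the payload is only built and printed here, not sent anywhere.

```python
import json


def build_dataset_payload(name, title, organization, groups):
    """Build a request body shaped for CKAN's package_create action."""
    return {
        "name": name,               # URL-friendly dataset identifier
        "title": title,             # human-readable title
        "owner_org": organization,  # name or id of the owning organization
        "groups": [{"name": group} for group in groups],
    }


payload = build_dataset_payload(
    name="weather-observations-tampere",      # hypothetical
    title="Weather observations from Tampere",
    organization="example-research-group",    # hypothetical
    groups=["weather", "tampere"],            # hypothetical
)
print(json.dumps(payload, indent=2))
```

The dataset has exactly one owning organization but can list several groups, which mirrors the rule that a dataset can belong to multiple groups.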

A dataset consists of various fields that give information about it, such as title, description, source and maintainer. These are described in more detail later on this page. A dataset also has resources, which can be files or links. Resources can contain the data itself if it can be provided publicly. A resource can also provide additional information about the data, such as a list of available measurements. File resources can be of any type, since they can be downloaded and viewed on the user's computer, though common file formats should be favored to make them easier to use. CKAN has special features for tabular data stored in Excel or CSV files. When this kind of file is added, CKAN is able to extract the data from the file and upload it to the CKAN datastore. This makes it possible to preview the data on the resource's web page without downloading the resource file. This feature has some limitations; notably, Excel files consisting of multiple sheets are not supported.
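As an illustration of preparing tabular data for a file resource, the following sketch writes a few rows as CSV text with a single header row using Python's standard csv module. The column names and values are hypothetical, and whether a given file is accepted by the catalog's datastore still has to be checked in the catalog itself.

```python
import csv
import io

# Hypothetical measurement rows; column names are illustrative only.
rows = [
    {"timestamp": "2023-01-01T00:00:00", "temperature_c": -3.5, "humidity_percent": 81.0},
    {"timestamp": "2023-01-01T00:30:00", "temperature_c": -3.7, "humidity_percent": 82.5},
]


def to_csv_text(records):
    """Serialise a list of dicts as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

Keeping all the data in one flat table of this kind also avoids the multiple-sheet limitation mentioned above for Excel files.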

All of the fields that can be used to give information about a dataset are listed below, with some notes about them. Some of these fields are natively offered by CKAN and some are custom fields added specifically to the energy research data catalog. Of these, only title is required, but giving as much information as possible is highly recommended.

title: Short descriptive title that gives a good idea of what the data is about.
description: Longer description of the data. It should give all relevant information needed to use the data that is not apparent from the dataset resources. The description should give a general overview of the data: where it comes from and what it covers. It can also explain what kind of usage the data is suited for. It should explain the format of the data and, if the data is not available as resources, how it can be accessed. It should also mention any restrictions on how the data can be used. If multiple datasets share common features, such as the data format or the way the data can be accessed, those things can be explained once in a related group or organization, and the dataset description can then just provide a link to that information.
tags: Keywords used to describe and categorise the dataset and make it easier to find. Tags can be chosen freely, but existing tags should be reused where possible. When you start typing a tag, the system suggests existing tags. When viewing a dataset, its tags are links that show all datasets using that tag.
license: Tells under which conditions the data can be used. If the data is open, a suitable open data license can be selected from the list of licenses. If the data is meant only for use within the university, the option "For internal research and education use" can be selected. More details about how the data can be used can be given in the dataset description. You can learn more about open data licenses on opendefinition.org.
visibility: Public datasets are visible to everyone. Private datasets are visible only to members of the organization that owns the dataset. Datasets in the energy research catalog should be public so that researchers can find them. However, while a dataset is being worked on, it can be marked as private until it is ready.
source: Link to the source of the dataset, if applicable.
version: Version number for the dataset. Probably not often needed.
author: Author of the dataset. Probably not needed.
author email: Email address of the dataset author.
maintainer: Person responsible for maintaining the dataset. This should be a person who can answer questions about the dataset and, if the data is not public, provide access to it.
maintainer email: Contact email address for the maintainer.
Data owner: The party the data actually belongs to and who has the final say in how the data can be used. Can be different from the organization that provides the data.
measurement start date: If the data covers a period of time, this should indicate the start date of that period.
measurement end date: If the data covers a period of time, this should indicate the end date of that period.
temporal resolution: If the data is a time series, this should indicate the shortest possible time interval between data points, for example 1 second or 30 minutes.
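When the data itself is at hand, the three time-related fields above can be estimated from its timestamps. The sketch below assumes ISO 8601 timestamps and at least two data points; the sample timestamps and the dictionary keys are hypothetical and are not the catalog's actual field identifiers. The shortest interval observed in the data is only an estimate of the shortest possible interval, so the result should be checked against what is known about the measurement setup.

```python
from datetime import datetime

# Hypothetical time series timestamps in ISO 8601 format.
timestamps = [
    "2023-01-01T00:00:00",
    "2023-01-01T00:30:00",
    "2023-01-01T01:00:00",
    "2023-01-01T02:00:00",
]


def describe_time_series(iso_timestamps):
    """Estimate start date, end date and shortest interval of a time series."""
    times = sorted(datetime.fromisoformat(t) for t in iso_timestamps)
    # Differences between consecutive data points, as timedelta objects.
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "measurement_start_date": times[0].date().isoformat(),
        "measurement_end_date": times[-1].date().isoformat(),
        "temporal_resolution": min(intervals),
    }


summary = describe_time_series(timestamps)
print(summary)
```

In this sample the gap between the last two points is one hour, but the shortest observed interval is 30 minutes, which is the value that would go into the temporal resolution field.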