Workshop 2

Data formats and tools

What's it about?

When starting a new project, it can be easier to stick with the tools and file formats we know than to add something else to the pile of things to learn, or indeed to put the project at risk by choosing an untried or unfamiliar approach. What reduces that risk is a recommendation from someone who has done it and has first-hand experience of the pros and cons.

That’s what the round table discussion of DM4T/TEDDINET workshop 1 decided to tackle for the second workshop in the series: to report on experiences with “next level” formats and tools applied to the energy research domain, with the overall goal of enabling the delivery of accessible data legacies for the TEDDINET portfolio. Topics shortlisted were: RDF, Mongo, HDF5, JSON, XML DTDs, Matlab, R and Python.

What we learned from workshop 2

(Personal perspective by Julian Padget)

The list of talks and the slides from workshop two are available on the project website below. The aim of the second workshop was to bring together people working on creating open data and representatives of TEDDINET projects to exchange views and practice on how to store, query and publish data during and following the project research phase.

Jo Barratt described Open Knowledge International, which is developing a tool-chain for "frictionless" data that works regardless of the original format and uses open formats to enable long-term access, because "the best use for your data is the one you have not thought of". The OKI perspective was complemented by the TEDDINET and other EPSRC project reports, which made clear not only the variety of formats (relational, CSV, hierarchical data format) and tools (the R statistics package and Matlab) being used, but also the potential post-project problems in providing continued access to the data. Mat Colmer echoed this last issue, identifying the key challenge for Digital Catapult/InnovateUK's Building Data Exchange as how to describe, associate and integrate data while hosting not the data itself but a collection of data endpoints and metadata references. This last point underlined the importance of Alex Ball's talk about metadata authoring, but also the challenge facing projects in finding suitable ontologies (rather than making their own) and tools that the STEM community is comfortable with.

Key points for me were:

  1. Not knowing what ontologies and tools are even available, let alone being able to choose between them, with inertia stemming from not wanting to make the "wrong" choice.
  2. Optimism that tool-chains such as that being developed by OKI could offer some help in making the first step from private to query-able publication of data.
  3. The usefulness of knowing that a range of formats and tools are in effective use across a variety of projects, and that inter-operation is not a significant problem.
  4. Hearing how the data preservation community is addressing and solving some of the problems that STEM researchers are starting to be aware of, and realising that we need to talk more!
  5. We need more TEDDINET projects to participate in the preparation for long-term data publication.

Who should attend?

This is a 24-hour workshop, running Tuesday afternoon and Wednesday morning (14th and 15th June), with a networking dinner in the evening. It will take place on campus at the University of Bath. Subsistence is covered by DM4T (for TEDDINET projects we'd like to limit this to 2 delegates per project, but talk to us if need be). If you are not part of a TEDDINET project, DM4T will also cover your subsistence costs.

To register, please email Caroline Hughes with the following information:

  1. Name, affiliation and project (if any)
  2. TEDDINET partner status
  3. Dietary requirements
  4. Attendance (14th or 15th or both)
  5. Accommodation: required or not

Who's speaking?

Draft programme

Day 1, 14th June. Location: 8W2.4
12:00 - 13:00 Arrival + lunch
13:00 - 13:15 Introduction to DM4T and workshop 2 (slides) Julian Padget, Computer Science, Bath
13:15 - 14:15 Towards Frictionless Data: The Frictionless Data Project (slides) Jo Barratt, Open Knowledge Foundation
14:15 - 14:30 Overview of candidate data formats (slides) Julian Padget, Computer Science, Bath
14:30 - 15:00 MySQL and ENLITEN (slides) Sukumar Natarajan, Bath
15:00 - 15:30 Data formats: Hierarchical Data Format (slides) Daniel Fosas de Pando, Bath
15:30 - 16:00 Coffee Break
16:00 - 16:30 Data formats: Resource Description Framework (slides) Julian Padget, Bath
16:30 - 17:00 Data collection, storage and analysis in the APAtSCHE project (slides) Bruce Stephen, Strathclyde
16:30 - 17:00 Using R for energy data analysis (slides) Sukumar Natarajan, Bath
19:30 - 21:30 Dinner, Abbey Hotel, Bath
Day 2, 15th June. Location: 8W2.4
09:30 - 10:30 Making data better by adding some meta(data) (slides) Alex Ball, Library, Bath
10:30 - 11:00 Metadata for Energy and working with Python (slides) Jack Kelly, Imperial College
11:00 - 11:15 Coffee Break
11:15 - 11:45 Matlab (slides) Alfonso Ramallo Gonzalez, Bath
11:45 - 12:15 The Building Data Exchange (slides) Mat Colmer, Digital Catapult
12:15 - 13:30 Round table + lunch: critique of this workshop, planning for future workshops