Teaching Research Data Management with DataLad: A Multi-Year, Multi-Domain Effort

Multi-Year, Multi-Disciplinary Efforts in Scientific Research Data Management Education

Research Background

With the development of modern neuroscience, Research Data Management (RDM) has become an indispensable skill for scientists. Despite its importance, RDM is often neglected in domain-specific graduate education. As a result, a growing number of communities are making concerted efforts to provide organized training and self-study materials that help early-career researchers acquire these skills.

The Massachusetts Institute of Technology (MIT) course “The Missing Semester of Your CS Education” is a prominent response to this educational gap. Furthermore, the convenience of modern computers and applications has reduced users’ familiarity with the underlying technical concepts, leaving many scientists without the basic technical skills needed to manage research data and results effectively.

In response, the authors of the paper adopted a multimodal teaching approach built around the DataLad ecosystem. They combined an online and printed manual, modular courses, and a flexible RDM knowledge base to deliver a series of RDM training sessions.

Source of the Paper

This paper was co-authored by Michał Szczepanik, Adina S. Wagner, Stephan Heunis, Laura K. Waite, Simon B. Eickhoff, and Michael Hanke, who are from the Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7) in Jülich, Germany, and the Institute of Systems Neuroscience at Heinrich Heine University Düsseldorf in Düsseldorf, Germany. The paper was published on April 22, 2024, in the journal “Neuroinformatics.”

Introduction to DataLad

DataLad is a Python-based software tool released under the MIT license, designed to jointly manage code, data, and the relationships between them. It builds on git-annex (a versatile system for data logistics) and Git (the industry standard for distributed version control) and adapts both to scientific workflows. The software is developed and distributed according to open source principles. In such a project, good user documentation and close interaction with users also help developers improve software quality.
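To make the division of labor between Git and git-annex more concrete, the following is a minimal sketch of the idea behind content-addressed file tracking: git-annex identifies file content by a key derived from a checksum, so Git only needs to version a small pointer rather than the data itself. The function name `annex_style_key` is hypothetical, and the extension handling is a simplification of what git-annex actually does. This illustrates the concept only; it is not DataLad or git-annex code.

```python
import hashlib
from pathlib import PurePosixPath


def annex_style_key(content: bytes, filename: str) -> str:
    """Build a key in the style of git-annex's SHA256E backend.

    Simplification (assumption): only the last file extension is kept,
    whereas git-annex applies its own rules for preserving extensions.
    """
    digest = hashlib.sha256(content).hexdigest()
    extension = PurePosixPath(filename).suffix  # e.g. ".csv", or "" if none
    # Layout: backend name, content size in bytes, checksum, extension.
    return f"SHA256E-s{len(content)}--{digest}{extension}"


if __name__ == "__main__":
    data = b"subject,score\n01,0.93\n"
    # Identical content yields the identical key, whatever the file is called.
    print(annex_style_key(data, "results.csv"))
```

Because the key depends only on the content (plus size and extension), two copies of the same file map to the same key, which is what allows data to be located and verified independently of where it is stored.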

Research Objectives and Methods

The main goal of this paper is to create and evaluate a multimodal teaching method that helps researchers master research data management skills within the DataLad ecosystem, and to analyze the strengths and weaknesses of this training method. The approach aims to enable technical novices to start using DataLad quickly and efficiently, guided by the actual needs of users such as early-career researchers in research consortia. Additionally, the authors want the training materials to be entirely open source, easily accessible, flexible, directly applicable to different research environments, and maintainable.

The DataLad Research Data Management Manual

Since its first release (version 0.0.1) in 2015, DataLad has had technical documentation that includes design overviews and reference documents. Although any form of documentation is better than none, documentation that does not meet the needs of its target users may still be inadequate. To address this issue, the authors created the DataLad Manual project to supplement the existing technical documentation.

Design Considerations

The goals of the manual project include content suitable for a broad audience, practical experience, understandable language for technical novices, low entry barriers, and integrated workflows. The manual is structured into four sections:

1. Introduction: Contains a high-level description of the software and its features, along with detailed installation instructions for all operating systems.
2. Basics: Presented as code-driven tutorials, covering all stable software features.
3. Advanced: Covers functions beyond the basics, with independent chapters.
4. Use Cases: Contains brief descriptions and step-by-step instructions of actual use cases.

Technical Backbone

The manual’s development environment uses Sphinx (a documentation generator) in combination with the reStructuredText markup language to generate multiple output formats (e.g., HTML, PDF, LaTeX, ePub). Through Sphinx’s extension mechanism, the authors added custom admonitions and design elements, such as optional detail boxes, all of which are bundled in a Python package. The authors also developed a separate Python package, autorunrecord, which executes code sequentially in a specified environment and records its output, so that the examples shown in the manual come from actually running the commands.
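The idea of executing code sequentially in a specified environment and recording its output can be sketched with the standard library alone. The function `run_and_record` and the transcript layout below are hypothetical and chosen for illustration; they do not reproduce autorunrecord's actual interface.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_record(commands, workdir):
    """Run each command in order inside workdir and return a transcript.

    Each entry of `commands` is an argument list, as accepted by
    subprocess.run. Later commands see files created by earlier ones
    because all of them share the same working directory.
    """
    transcript = []
    for argv in commands:
        result = subprocess.run(
            argv, cwd=workdir, capture_output=True, text=True, check=False
        )
        transcript.append(
            {
                "command": argv,
                "returncode": result.returncode,
                "output": result.stdout + result.stderr,
            }
        )
    return transcript


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        steps = [
            # Step 1 creates a file; step 2 depends on it.
            [sys.executable, "-c", "open('note.txt', 'w').write('hello')"],
            [sys.executable, "-c", "print(open('note.txt').read())"],
        ]
        for entry in run_and_record(steps, Path(tmp)):
            print("$", " ".join(entry["command"]))
            print(entry["output"], end="")
```

Recording the real output at build time, rather than pasting it by hand, means that a change in the software's behavior shows up in the documentation as soon as it is rebuilt.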

Impact and Scope

The online manual has been in continuous development for over four years, with an average of two releases per year. Releases are coordinated with those of the DataLad core package, so users have access to the manual version that matches their software. According to the authors, user documentation has facilitated improvements in software quality, validated the effectiveness of development efforts, and significantly increased the number of users and package downloads. For instance, from December 2022 to July 2023, the online manual averaged 22,000 visits per 30 days, more than three times the 6,600 visits to the technical documentation. Overall, the development of the DataLad manual has had measurable positive impacts on the number of users, the package’s popularity, and software quality.

Courses and Workshops

In addition to the manual, the authors have designed a short RDM course based on DataLad, covering everything from dataset creation and local version control to data publication, collaboration, and dataset reuse. The course website uses the Carpentries course template. Its content is written in Markdown, and the site is built with the static site generator Jekyll.
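The arc of such a course, from creating a dataset to reusing a published one, can be sketched as a sequence of DataLad command lines. The sketch below only assembles and prints the commands (a dry run); it does not execute them. The function name `course_workflow`, the dataset names, the commit message, and the example URL are all hypothetical, and the push step assumes that a sibling with the given name has already been configured.

```python
import shlex


def course_workflow(dataset, clone_url, sibling, clone_path="reused-dataset"):
    """Return (working directory, command) pairs for one pass through
    dataset creation, local versioning, publication, and reuse."""
    return [
        # Create a new, empty dataset.
        (".", ["datalad", "create", dataset]),
        # Record the current state of the dataset with a message.
        (dataset, ["datalad", "save", "-m", "Add raw data"]),
        # Publish to a previously configured sibling (assumed to exist).
        (dataset, ["datalad", "push", "--to", sibling]),
        # Someone else obtains the dataset ...
        (".", ["datalad", "clone", clone_url, clone_path]),
        # ... and retrieves the file content on demand.
        (clone_path, ["datalad", "get", "."]),
    ]


if __name__ == "__main__":
    steps = course_workflow(
        dataset="my-dataset",
        clone_url="https://example.com/my-dataset.git",  # placeholder URL
        sibling="origin",
    )
    for cwd, argv in steps:
        print(f"(in {cwd}) $ {shlex.join(argv)}")
```

To actually run the steps on a machine with DataLad installed, each pair could be passed to `subprocess.run(argv, cwd=cwd, check=True)`.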

The course modules include basic DataLad commands, data structure optimization, remote collaboration, and dataset management. To keep the teaching materials open, all content is hosted in a public repository and released under the Creative Commons Attribution license.

Online Office Hours and Knowledge Base

Besides the manual and courses, the authors have also set up a knowledge base and online office hours to provide flexible support and to document technical issues and their solutions. The technical framework of the knowledge base is a simplified version of the manual’s: all knowledge base entries are written in reStructuredText, hosted in a Git repository, and rendered to HTML with Sphinx.
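A reStructuredText knowledge base rendered to HTML with Sphinx needs very little configuration. The following `conf.py` is a minimal sketch under that assumption; the project name, author, and theme are illustrative placeholders, not the settings of the authors' knowledge base.

```python
# conf.py: minimal Sphinx configuration for a small reStructuredText
# knowledge base. All values are illustrative placeholders.

project = "Example RDM knowledge base"  # hypothetical project name
author = "Example maintainers"          # hypothetical author entry

extensions = []                                # plain reST pages need none
source_suffix = {".rst": "restructuredtext"}   # treat .rst files as reST
root_doc = "index"                # top-level page with the toctree (Sphinx >= 4.0)
html_theme = "alabaster"          # Sphinx's built-in default theme
exclude_patterns = ["_build"]     # keep build output out of the sources
```

With this file next to an `index.rst`, running `sphinx-build -b html . _build/html` produces the HTML site.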

Summary and Outlook

This research demonstrates the effectiveness of a multimodal teaching approach in enhancing researchers’ research data management skills. The paper details the design and technical requirements of the manual, courses, and knowledge base and shares experiences and lessons learned during development and teaching processes. These efforts not only improved the user experience and software quality of DataLad but also provided valuable references for other research software development and data management education projects.