Differences in Workflow Systems: A Use-Case Driven Comparison

Abstract

Scientific Workflow Management Systems (SWMSs) are software systems designed to enable the scalable, distributed, and reproducible execution of complex data analysis workflows on large datasets. Due to the importance of such analyses, a plethora of different systems have been built over the last decades. All of them share the same core functions: allowing workflow specification, controlling task dependencies, and steering the correct execution of workflows on a given computational infrastructure. They differ notably, however, in how they implement these functions, and many offer additional features that can benefit workflow developers. The differences are often subtle yet impactful, and they are frequently poorly documented, which leads to unpleasant surprises when porting a workflow developed for one SWMS to another or when reimplementing a stand-alone application with an SWMS. In this chapter, we highlight some of the main differences between workflow systems by comparing the properties and features of four SWMSs: Nextflow, Airflow, Argo, and Snakemake. We conduct the comparison by using each SWMS to reimplement a complex workflow from the remote sensing domain that analyzes satellite images to study the development of vegetation over the years on the island of Crete. We find and describe important distinctions between these systems in numerous aspects, including file handling, scheduling, parallelization strategies, and language elements. We believe that awareness of these differences, and of the difficulties they might cause in a specific setting, is important for making informed decisions when choosing an SWMS for a new research project.

Publication
Workflow Systems for Large-Scale Scientific Data Analysis
Fabian Lehmann
Doctoral student

I am interested in distributed systems and scientific workflows, in particular their scheduling.