Biography

I am Fabian Lehmann, a Ph.D. candidate in computer science at the Knowledge Management in Bioinformatics Lab at the Humboldt-Universität zu Berlin. My research is funded by FONDA, a Collaborative Research Center of the German Research Foundation (DFG).

Since my bachelor studies, I have been fascinated by complex distributed systems. I love to understand and push their limits. In my Ph.D. research, I focus on workflow engines, improving the execution of distributed workflows that analyze large amounts of data. In particular, I aim to improve scheduling and data management. To ground this work in real-world requirements, I work closely with the Earth Observation Lab at the Humboldt-Universität zu Berlin.

Interests
  • Distributed Systems
  • Scientific Workflows
  • Workflow Scheduling
Education
  • Master of Science in Information Systems Management, 2020

    Thesis: Design and Implementation of a Processing Pipeline for High Resolution Blood Pressure Sensor Data

    Technical University of Berlin

  • Bachelor of Science in Information Systems Management, 2019

    Thesis: Performance-Benchmarking in Continuous-Integration-Processes

    Technical University of Berlin

  • Abitur (comparable to A Levels), 2015

    Hannah-Arendt-Gymnasium (Berlin)

Professional Experience

Knowledge Management in Bioinformatics Lab (Humboldt-Universität zu Berlin)
Ph.D. candidate (computer science)
Nov 2020 – Present · Berlin, Germany
In my Ph.D. studies, I focus on improving the execution of large scientific workflows processing hundreds of gigabytes of data.
DAI-Labor (Technical University of Berlin)
Student Assistant
May 2018 – Oct 2020 · Berlin, Germany
In my student job, I worked with time-series data in the DIGINET-PS project; for example, we predicted parking-spot occupancy.
University of Oxford
GeoTripNet - Case Study
Oct 2019 – Mar 2020 · Oxford, England, United Kingdom
For the case study, we crawled restaurant reviews on Google Maps to analyze the relations between restaurants and examine gentrification in Berlin districts. One challenge was processing and analyzing the large amount of data in real time.
Einstein Center Digital Future
Fog Computing Project
Apr 2019 – Sep 2020 · Berlin, Germany
This project aimed to analyze bicycle rides recorded by SimRa. To this end, we developed a distributed analysis pipeline and visualized the track information on an interactive web map. We were able to identify risk hotspots on Berlin cyclists' tracks.
Conrad Connect
Application Systems Project
Oct 2017 – Mar 2018 · Berlin, Germany
For Conrad Connect, we analyzed hundreds of gigabytes of IoT data. Moreover, I uncovered security vulnerabilities in their software.
Reflect IT Solutions GmbH
Semester Term Work
Mar 2016 – Apr 2016 & Sep 2016 – Oct 2016 · Berlin, Germany
In my semester term work, I helped develop the backend of a construction progress management system.
SPP Schüttauf und Persike Planungsgesellschaft mbH
Gap work between school and studies
May 2015 – Sep 2015 · Berlin, Germany
Before starting my bachelor studies, I worked for a few months helping to manage a large construction project, gaining experience in dealing with the different trades involved.

Computer skills

A small excerpt

Java
Python
Docker
Kubernetes
Spring Boot
LaTeX
SQL
React
JavaScript
Nextflow
Haskell
Excel

Software

Common Workflow Scheduler

Resource managers can enhance their scheduling capabilities by leveraging the Common Workflow Scheduler interface to receive workflow graph information from workflow systems. This enables the resource manager’s scheduler to make more advanced decisions.
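To illustrate why graph information matters, here is a minimal, hypothetical sketch (the names are illustrative, not the actual Common Workflow Scheduler API): once the scheduler knows the workflow DAG, it can run ready tasks that unblock the most downstream work first.

```python
# Hypothetical sketch: a resource manager that knows the workflow DAG
# can prioritize ready tasks by how much downstream work they unlock.
# This is NOT the Common Workflow Scheduler API, only the underlying idea.

def count_descendants(dag, task):
    """Number of tasks that transitively depend on `task`."""
    seen = set()
    stack = [task]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)

def prioritize(dag, ready_tasks):
    """Schedule tasks that unblock the most successors first."""
    return sorted(ready_tasks, key=lambda t: count_descendants(dag, t),
                  reverse=True)

# Diamond-shaped workflow: a fans out to b and c; both feed d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(prioritize(dag, ["b", "c", "a"]))  # → ['a', 'b', 'c']
```

Without the DAG, a plain resource manager sees only independent pods or jobs and cannot make this distinction.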

Benchmark Evaluator

The Benchmark Evaluator is a plugin for the Jenkins automation server to load benchmark data and decide on the success of a build accordingly.
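The decision rule such a plugin applies can be sketched as follows; the function and threshold names here are made up for illustration, not the plugin's real configuration:

```python
# Illustrative sketch of a benchmark gate in CI: fail the build if any
# benchmark regresses beyond a tolerance relative to its baseline.
# Names and the 10 % default are assumptions, not the plugin's API.

def build_succeeds(baseline, current, tolerance=0.10):
    """Return False if any benchmark is more than `tolerance`
    (e.g. 10 %) slower than its baseline, or is missing."""
    for name, base_ms in baseline.items():
        if current.get(name, float("inf")) > base_ms * (1 + tolerance):
            return False
    return True

baseline = {"parse": 120.0, "render": 80.0}
print(build_succeeds(baseline, {"parse": 125.0, "render": 82.0}))  # True
print(build_succeeds(baseline, {"parse": 140.0, "render": 82.0}))  # False
```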

Publications

WOW: Workflow-Aware Data Movement and Task Scheduling for Dynamic Scientific Workflows

Scientific workflows process extensive data sets over clusters of independent nodes, which requires a complex stack of infrastructure components, especially a resource manager (RM) for task-to-node assignment, a distributed file system (DFS) for data exchange between tasks, and a workflow engine to control task dependencies. To enable a decoupled development and installation of these components, current architectures place intermediate data files during workflow execution independently of the future workload. In data-intensive applications, this separation results in suboptimal schedules, as tasks are often assigned to nodes lacking input data, causing network traffic and bottlenecks.
This paper presents WOW, a new scheduling approach for dynamic scientific workflow systems that steers both data movement and task scheduling to reduce network congestion and overall runtime. For this, WOW creates speculative copies of intermediate files to prepare the execution of subsequently scheduled tasks. WOW supports modern workflow systems that gain flexibility through the dynamic construction of execution plans. We prototypically implemented WOW for the popular workflow engine Nextflow using Kubernetes as a resource manager. In experiments with 16 synthetic and real workflows, WOW reduced makespan in all cases, with improvement of up to 94.5 % for workflow patterns and up to 53.2 % for real workflows, at a moderate increase of temporary storage space. It also has favorable effects on CPU allocation and scales well with increasing cluster size.
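The core idea behind WOW can be sketched in a few lines; this is a simplified illustration under assumed data structures, not WOW's actual implementation: place a task on the node that already holds most of its input bytes, and speculatively copy the missing inputs there before the task starts.

```python
# Simplified sketch of workflow-aware placement (not WOW's real code):
# pick the node with the most input bytes already local, then copy the
# remaining inputs there ahead of execution.

def pick_node(task_inputs, node_files):
    """Choose the node holding the most of the task's input bytes.
    `task_inputs`: file -> size; `node_files`: node -> set of files."""
    def local_bytes(node):
        return sum(size for f, size in task_inputs.items()
                   if f in node_files[node])
    return max(node_files, key=local_bytes)

def speculative_copies(task_inputs, node_files, node):
    """Inputs that must be copied to `node` before the task runs."""
    return [f for f in task_inputs if f not in node_files[node]]

inputs = {"tile1.tif": 400, "tile2.tif": 300}
nodes = {"n1": {"tile1.tif"}, "n2": {"tile2.tif", "other.dat"}}
target = pick_node(inputs, nodes)  # "n1" holds 400 of 700 input bytes
print(target, speculative_copies(inputs, nodes, target))
```

In the real system, these decisions interact with the dynamic execution plan of the workflow engine and the state of the distributed file system.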

Impact of data density and endmember definitions on long-term trends in ground cover fractions across European grasslands

Long-term monitoring of grasslands is pivotal for ensuring continuity of many environmental services and for supporting food security and environmental modeling. Remote sensing provides an irreplaceable source of information for studying changes in grasslands. Specifically, Spectral Mixture Analysis (SMA) allows for quantification of physically meaningful ground cover fractions of grassland ecosystems (i.e., green vegetation, non-photosynthetic vegetation, and soil), which is crucial for our understanding of change processes and their drivers. However, although popular due to straightforward implementation and low computational cost, ‘classical’ SMA relies on a single endmember definition for each targeted ground cover component, thus offering limited suitability and generalization capability for heterogeneous landscapes. Furthermore, the impact of irregular data density on SMA-based long-term trends in grassland ground cover has also not yet been critically addressed.
We conducted a systematic assessment of i) the impact of data density on long-term trends in ground cover fractions in grasslands; and ii) the effect of endmember definition used in ‘classical’ SMA on pixel- and map-level trends of grassland ground cover fractions. We performed our study for 13 sites across European grasslands and derived the trends based on the Cumulative Endmember Fractions calculated from monthly composites. We compared three different data density scenarios, i.e., 1984–2021 Landsat data record as is, 1984–2021 Landsat data record with the monthly probability of data after 2014 adjusted to the pre-2014 levels, and the combined 1984–2021 Landsat and 2015–2021 Sentinel-2 datasets. For each site we ran SMA using a selection of site-specific and generalized endmembers, and compared the pixel- and map-level trends. Our results indicated no significant impact of varying data density on the long-term trends from Cumulative Endmember Fractions in European grasslands. Conversely, the use of different endmember definitions led in some regions to significantly different pixel- and map-level long-term trends, raising questions about the suitability of the ‘classical’ SMA for complex landscapes and large territories. Therefore, we caution against using the ‘classical’ SMA for remote-sensing-based applications across broader scales or in heterogeneous landscapes, particularly for trend analyses, as the results may lead to erroneous conclusions.

Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows

Research Projects

FONDA

Foundations of Workflows for Large-Scale Scientific Data Analysis

Contact