Fabian Lehmann

PhD Student

Humboldt-Universität zu Berlin

About Me

I am Fabian Lehmann, a PhD candidate in computer science at the Chair of Knowledge Management in Bioinformatics at Humboldt-Universität zu Berlin. My position is funded through FONDA, a Collaborative Research Center of the German Research Foundation (DFG).

During my bachelor's studies, I discovered my fascination with complex distributed systems. I enjoy probing the limits of such systems and finding ways to overcome them. In my PhD, I focus on optimizing workflow systems for analyzing very large datasets, in particular on scheduling. To understand real-world requirements, I collaborate closely with the Earth Observation Lab at Humboldt-Universität zu Berlin.

Interests

Distributed Systems
Scientific Workflows
Workflow Scheduling

Education

Master's in Business Informatics (Wirtschaftsinformatik), 2020

Thesis: Design and Implementation of a Processing Pipeline for High Resolution Blood Pressure Sensor Data

Technische Universität Berlin

Bachelor's in Business Informatics (Wirtschaftsinformatik), 2019

Thesis: Performance-Benchmarking in Continuous-Integration-Prozessen (Performance Benchmarking in Continuous Integration Processes)

Technische Universität Berlin

Abitur (university entrance qualification), 2015

Hannah-Arendt-Gymnasium (Berlin)

Experience

PhD Student (Computer Science)

Knowledge Management in Bioinformatics (Humboldt-Universität zu Berlin)

Nov. 2020 – Present, Berlin, Germany

In my PhD project, I focus on optimizing the execution of large scientific workflows that process hundreds of gigabytes of data.

Student Research Assistant

DAI-Labor (Technische Universität Berlin)

May 2018 – Oct. 2020, Berlin, Germany

As a student assistant, I carried out time series analyses as part of DIGINET-PS. Among other things, we forecast the occupancy of parking spaces on Straße des 17. Juni.

GeoTripNet Case Study

University of Oxford

Oct. 2019 – Mar. 2020, Oxford, England, United Kingdom

In this case study, we crawled the Google Maps reviews of all restaurants in Berlin. We then analyzed the relationships between restaurants to study gentrification in Berlin's districts. One challenge was processing, analyzing, and visualizing the large volume of data in real time.

Fog Computing Project

Einstein Center Digital Future

Apr. 2019 – Sept. 2020, Berlin, Germany

In this project, we analyzed bicycle rides recorded by SimRa. We set up a distributed analysis pipeline and presented the results in an interactive web app, which allowed us to identify danger spots for cyclists in Berlin.

Application Systems Project

Conrad Connect

Oct. 2017 – Mar. 2018, Berlin, Germany

For Conrad Connect, we analyzed hundreds of gigabytes of IoT data. I also found security vulnerabilities on their website.

Semester Break Job

Reflect IT Solutions GmbH

Mar. 2016 – Apr. 2016 & Sept. 2016 – Oct. 2016, Berlin, Germany

During my semester breaks, I helped develop the backend of a software product that supports construction supervision.

Work Between Abitur and University

SPP Schüttauf und Persike Planungsgesellschaft mbH

May 2015 – Sept. 2015, Berlin, Germany

Before starting my bachelor's degree, I spent several months supporting the construction supervision for the renovation of an 18-story building.

IT Skills

(A small selection)

Java

Python

Docker

Kubernetes

Spring Boot

LaTeX

SQL

React

JavaScript

Nextflow

Haskell

Excel

Software

Common Workflow Scheduler

With the Common Workflow Scheduler, resource managers can provide an interface through which workflow systems pass along information about the workflow graph. This information enables the resource manager's scheduler to make better decisions.

Benchmark Evaluator

The Benchmark Evaluator is a plugin for the Jenkins automation server that loads and evaluates benchmark results.

Publications

Fabian Lehmann, Jonathan Bader, Friedrich Tschirpke, Ninon De Mecquenem, Ansgar Lößer, Soeren Becker, Katarzyna Ewa Lewińska, Lauritz Thamsen, Ulf Leser

23 May 2025, 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

WOW: Workflow-Aware Data Movement and Task Scheduling for Dynamic Scientific Workflows

Scientific workflows process extensive data sets over clusters of independent nodes, which requires a complex stack of infrastructure components, especially a resource manager (RM) for task-to-node assignment, a distributed file system (DFS) for data exchange between tasks, and a workflow engine to control task dependencies. To enable a decoupled development and installation of these components, current architectures place intermediate data files during workflow execution independently of the future workload. In data-intensive applications, this separation results in suboptimal schedules, as tasks are often assigned to nodes lacking input data, causing network traffic and bottlenecks.
This paper presents WOW, a new scheduling approach for dynamic scientific workflow systems that steers both data movement and task scheduling to reduce network congestion and overall runtime. For this, WOW creates speculative copies of intermediate files to prepare the execution of subsequently scheduled tasks. WOW supports modern workflow systems that gain flexibility through the dynamic construction of execution plans. We prototypically implemented WOW for the popular workflow engine Nextflow using Kubernetes as a resource manager. In experiments with 16 synthetic and real workflows, WOW reduced makespan in all cases, with improvement of up to 94.5 % for workflow patterns and up to 53.2 % for real workflows, at a moderate increase of temporary storage space. It also has favorable effects on CPU allocation and scales well with increasing cluster size.

Kathleen West, Fabian Lehmann, Vasilis Bountris, Ulf Leser, Yehia Elkhatib, Lauritz Thamsen

22 May 2025, 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Exploring the Potential of Carbon-Aware Execution for Scientific Workflows

Scientific workflows are widely used to automate scientific data analysis and often involve processing large quantities of data on compute clusters. As such, their execution tends to be long-running and resource intensive, leading to significant energy consumption and carbon emissions.
Meanwhile, a wealth of carbon-aware computing methods have been proposed, yet little work has focused specifically on scientific workflows, even though they present a substantial opportunity for carbon-aware computing because they are inherently delay tolerant, efficiently interruptible, and highly scalable.
In this study, we demonstrate the potential for carbon-aware workflow execution. For this, we estimate the carbon footprint of two real-world Nextflow workflows executed on cluster infrastructure. We use a linear power model for energy consumption estimates and real-world average and marginal CI data for two regions. We evaluate the impact of carbon-aware temporal shifting, pausing and resuming, and resource scaling. Our findings highlight significant potential for reducing emissions of workflows and workflow tasks.

Katarzyna Ewa Lewińska, Akpona Okujeni, Katja Kowalski, Fabian Lehmann, Volker C. Radeloff, Ulf Leser, Patrick Hostert

1 January 2025, Remote Sensing of Environment

Impact of data density and endmember definitions on long-term trends in ground cover fractions across European grasslands

Long-term monitoring of grasslands is pivotal for ensuring continuity of many environmental services and for supporting food security and environmental modeling. Remote sensing provides an irreplaceable source of information for studying changes in grasslands. Specifically, Spectral Mixture Analysis (SMA) allows for quantification of physically meaningful ground cover fractions of grassland ecosystems (i.e., green vegetation, non-photosynthetic vegetation, and soil), which is crucial for our understanding of change processes and their drivers. However, although popular due to straightforward implementation and low computational cost, ‘classical’ SMA relies on a single endmember definition for each targeted ground cover component, thus offering limited suitability and generalization capability for heterogeneous landscapes. Furthermore, the impact of irregular data density on SMA-based long-term trends in grassland ground cover has also not yet been critically addressed.
We conducted a systematic assessment of i) the impact of data density on long-term trends in ground cover fractions in grasslands; and ii) the effect of endmember definition used in ‘classical’ SMA on pixel- and map-level trends of grassland ground cover fractions. We performed our study for 13 sites across European grasslands and derived the trends based on the Cumulative Endmember Fractions calculated from monthly composites. We compared three different data density scenarios, i.e., 1984–2021 Landsat data record as is, 1984–2021 Landsat data record with the monthly probability of data after 2014 adjusted to the pre-2014 levels, and the combined 1984–2021 Landsat and 2015–2021 Sentinel-2 datasets. For each site we ran SMA using a selection of site-specific and generalized endmembers, and compared the pixel- and map-level trends. Our results indicated no significant impact of varying data density on the long-term trends from Cumulative Endmember Fractions in European grasslands. Conversely, the use of different endmember definitions led in some regions to significantly different pixel- and map-level long-term trends raising questions about the suitability of the ‘classical’ SMA for complex landscapes and large territories. Therefore, we caution against using the ‘classical’ SMA for remote-sensing-based applications across broader scales or in heterogeneous landscapes, particularly for trend analyses, as the results may lead to erroneous conclusions.

Jonathan Bader, Kathleen West, Soeren Becker, Svetlana Kulagina, Fabian Lehmann, Lauritz Thamsen, Henning Meyerhenke, Odej Kao

1 January 2025

Predicting the Performance of Scientific Workflow Tasks for Cluster Resource Management: An Overview of the State of the Art

Scientific workflow management systems support large-scale data analysis on cluster infrastructures. For this, they interact with resource managers which schedule workflow tasks onto cluster nodes. In addition to workflow task descriptions, resource managers rely on task performance estimates such as main memory consumption and runtime to efficiently manage cluster resources. Such performance estimates should be automated, as user-based task performance estimates are error-prone.
In this book chapter, we describe key characteristics of methods for workflow task runtime and memory prediction, provide an overview and a detailed comparison of state-of-the-art methods from the literature, and discuss how workflow task performance prediction is useful for scheduling, energy-efficient and carbon-aware computing, and cost prediction.

Rafael Ferreira da Silva, Deborah Bard, Kyle Chard, de Witt Shaun, Ian T. Foster, Tom Gibbs, Carole Goble, William Godoy, Johan Gustafsson, Utz-Uwe Haus, Stephen Hudson, Shantenu Jha, Laila Los, Drew Paine, Frederic Suter, Logan Ward, Sean Wilkinson, Marcos Amaris, Yadu Babuji, Jonathan Bader, Riccardo Balin, Daniel Balouek, Sarah Beecroft, Khalid Belhajjame, Rajat Bhattarai, Wes Brewer, Paul Brunk, Silvina Caino-Lores, Henri Casanova, Daniela Cassol, Jared Coleman, Taina Coleman, Iacopo Colonnelli, Anderson Andrei Da Silva, Daniel de Oliveira, Pascal Elahi, Nour Elfaramawy, Wael Elwasif, Brian Etz, Thomas Fahringer, Wesley Ferreira, Rosa Filgueira, Jacob Fosso Tande, Luiz Gadelha, Andy Gallo, Daniel Garijo, Yiannis Georgiou, Philipp Gritsch, Patricia Grubel, Amal Gueroudji, Quentin Guilloteau, Carlo Hamalainen, Rolando Hong Enriquez, Lauren Huet, Kevin Hunter Kesling, Paula Iborra, Shiva Jahangiri, Jan Janssen, Joe Jordan, Sehrish Kanwal, Liliane Kunstmann, Fabian Lehmann, Ulf Leser, Chen Li, Peini Liu, Jakob Luettgau, Richard Lupat, Jose M. Fernandez, Ketan Maheshwari, Tanu Malik, Jack Marquez, Motohiko Matsuda, Doriana Medic, Somayeh Mohammadi, Alberto Mulone, John-Luke Navarro, Kin Wai Ng, Klaus Noelp, Bruno P. Kinoshita, Ryan Prout, Michael R. Crusoe, Sashko Ristov, Stefan Robila, Daniel Rosendo, Billy Rowell, Jedrzej Rybicki, Hector Sanchez, Nishant Saurabh, Sumit Kumar Saurav, Tom Scogland, Dinindu Senanayake, Woong Shin, Raul Sirvent, Tyler Skluzacek, Barry Sly-Delgado, Stian Soiland-Reyes, Abel Souza, Renan Souza, Domenico Talia, Nathan Tallent, Lauritz Thamsen, Mikhail Titov, Benjamin Tovar, Karan Vahi, Eric Vardar-Irrgang, Edite Vartina, Yuandou Wang, Merridee Wouters, Qi Yu, Ziad Al Bkhetan, Mahnoor Zulfiqar

1 October 2024, Zenodo

Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows

More Publications

Talks

FORCE on Nextflow: Scalable Analysis of Earth Observation data on Commodity Clusters

Modern Earth Observation (EO) often analyses hundreds of gigabytes of data from thousands of satellite images. This data usually is …

Fabian Lehmann

1 November 2021

Projects

FONDA

Foundations of Workflows for Large-Scale Scientific Data Analysis

Contact

fabian.lehmann@hu-berlin.de
+49 (0)30-2093-41285
Haus 4, Raum IV.426, Rudower Chaussee 25, Berlin, 12489