Biography

I am Fabian Lehmann, a Ph.D. candidate in computer science at the Knowledge Management in Bioinformatics Lab at the Humboldt-Universität zu Berlin. My research is funded by FONDA, a Collaborative Research Center of the German Research Foundation (DFG).

Since my bachelor studies, I have been fascinated by complex distributed systems. I love to understand and push their limits. In my Ph.D. research, I focus on workflow engines, improving the execution of distributed workflows that analyze large amounts of data. In particular, I aim to improve scheduling and data management. To ground this work in real-world requirements, I work closely with the Earth Observation Lab at the Humboldt-Universität zu Berlin.

Interests
  • Distributed Systems
  • Scientific Workflows
  • Workflow Scheduling
Education
  • Master of Science in Information Systems Management, 2020

    Thesis: Design and Implementation of a Processing Pipeline for High Resolution Blood Pressure Sensor Data

    Technical University of Berlin

  • Bachelor of Science in Information Systems Management, 2019

    Thesis: Performance-Benchmarking in Continuous-Integration-Processes

    Technical University of Berlin

  • Abitur (comparable to A Levels), 2015

    Hannah-Arendt-Gymnasium (Berlin)

Professional Experience

Knowledge Management in Bioinformatics Lab (Humboldt-Universität zu Berlin)
Ph.D. candidate (computer science)
Nov 2020 – Present · Berlin, Germany
In my Ph.D. studies, I focus on improving the execution of large scientific workflows processing hundreds of gigabytes of data.
DAI-Labor (Technical University of Berlin)
Student Assistant
May 2018 – Oct 2020 · Berlin, Germany
In my student job, I worked with time-series data in the DIGINET-PS project; for example, we predicted parking-spot occupancy.
University of Oxford
GeoTripNet - Case Study
Oct 2019 – Mar 2020 · Oxford, England, United Kingdom
For the case study, we crawled restaurant reviews on Google Maps to analyze the relations between restaurants and examine gentrification in Berlin districts. One challenge was processing and analyzing the large amount of data in real time.
Einstein Center Digital Future
Fog Computing Project
Apr 2019 – Sep 2020 · Berlin, Germany
This project aimed to analyze bicycle rides recorded by SimRa. To this end, we developed a distributed analysis pipeline and visualized the track information on an interactive web map. We were able to identify risk hotspots on Berlin cyclists' tracks.
Conrad Connect
Application Systems Project
Oct 2017 – Mar 2018 · Berlin, Germany
For Conrad Connect, we analyzed hundreds of gigabytes of IoT data. Moreover, I uncovered security vulnerabilities in their software.
Reflect IT Solutions GmbH
Semester Term Work
Mar 2016 – Apr 2016 & Sep 2016 – Oct 2016 · Berlin, Germany
In my semester term work, I helped develop the backend of a construction progress management system.
SPP Schüttauf und Persike Planungsgesellschaft mbH
Gap work between school and studies
May 2015 – Sep 2015 · Berlin, Germany
Before starting my bachelor studies, I worked for a few months helping to manage a large construction project, gaining experience in dealing with the different trades involved.

Computer skills

A small excerpt

Java
Python
Docker
Kubernetes
Spring Boot
LaTeX
SQL
React
JavaScript
Nextflow
Haskell
Excel

Software

Common Workflow Scheduler

Resource managers can enhance their scheduling capabilities by leveraging the Common Workflow Scheduler interface to receive workflow graph information from workflow systems. This enables the resource manager’s scheduler to make more advanced decisions.
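To illustrate why graph information matters, here is a minimal, hypothetical sketch (the names are illustrative, not the actual Common Workflow Scheduler API): once the scheduler knows the workflow DAG, it can run ready tasks that unblock the most downstream work first.

```python
# Hypothetical sketch: a resource manager that knows the workflow DAG
# can prioritize ready tasks by how much downstream work they unlock.
# This is NOT the Common Workflow Scheduler API, only the underlying idea.

def count_descendants(dag, task):
    """Number of tasks that transitively depend on `task`."""
    seen = set()
    stack = [task]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)

def prioritize(dag, ready_tasks):
    """Schedule tasks that unblock the most successors first."""
    return sorted(ready_tasks, key=lambda t: count_descendants(dag, t),
                  reverse=True)

# Diamond-shaped workflow: a fans out to b and c; both feed d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(prioritize(dag, ["b", "c", "a"]))  # → ['a', 'b', 'c']
```

Without the DAG, a plain resource manager sees only independent pods or jobs and cannot make this distinction.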

Benchmark Evaluator

The Benchmark Evaluator is a plugin for the Jenkins automation server to load benchmark data and decide on the success of a build accordingly.
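The decision rule such a plugin applies can be sketched as follows; the function and threshold names here are made up for illustration, not the plugin's real configuration:

```python
# Illustrative sketch of a benchmark gate in CI: fail the build if any
# benchmark regresses beyond a tolerance relative to its baseline.
# Names and the 10 % default are assumptions, not the plugin's API.

def build_succeeds(baseline, current, tolerance=0.10):
    """Return False if any benchmark is more than `tolerance`
    (e.g. 10 %) slower than its baseline, or is missing."""
    for name, base_ms in baseline.items():
        if current.get(name, float("inf")) > base_ms * (1 + tolerance):
            return False
    return True

baseline = {"parse": 120.0, "render": 80.0}
print(build_succeeds(baseline, {"parse": 125.0, "render": 82.0}))  # True
print(build_succeeds(baseline, {"parse": 140.0, "render": 82.0}))  # False
```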

Publications

WOW: Workflow-Aware Data Movement and Task Scheduling for Dynamic Scientific Workflows

Scientific workflows process extensive data sets over clusters of independent nodes, which requires a complex stack of infrastructure components, especially a resource manager (RM) for task-to-node assignment, a distributed file system (DFS) for data exchange between tasks, and a workflow engine to control task dependencies. To enable a decoupled development and installation of these components, current architectures place intermediate data files during workflow execution independently of the future workload. In data-intensive applications, this separation results in suboptimal schedules, as tasks are often assigned to nodes lacking input data, causing network traffic and bottlenecks.
This paper presents WOW, a new scheduling approach for dynamic scientific workflow systems that steers both data movement and task scheduling to reduce network congestion and overall runtime. For this, WOW creates speculative copies of intermediate files to prepare the execution of subsequently scheduled tasks. WOW supports modern workflow systems that gain flexibility through the dynamic construction of execution plans. We prototypically implemented WOW for the popular workflow engine Nextflow using Kubernetes as a resource manager. In experiments with 16 synthetic and real workflows, WOW reduced makespan in all cases, with improvement of up to 94.5 % for workflow patterns and up to 53.2 % for real workflows, at a moderate increase of temporary storage space. It also has favorable effects on CPU allocation and scales well with increasing cluster size.
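The core idea behind WOW can be sketched in a few lines; this is a simplified illustration under assumed data structures, not WOW's actual implementation: place a task on the node that already holds most of its input bytes, and speculatively copy the missing inputs there before the task starts.

```python
# Simplified sketch of workflow-aware placement (not WOW's real code):
# pick the node with the most input bytes already local, then copy the
# remaining inputs there ahead of execution.

def pick_node(task_inputs, node_files):
    """Choose the node holding the most of the task's input bytes.
    `task_inputs`: file -> size; `node_files`: node -> set of files."""
    def local_bytes(node):
        return sum(size for f, size in task_inputs.items()
                   if f in node_files[node])
    return max(node_files, key=local_bytes)

def speculative_copies(task_inputs, node_files, node):
    """Inputs that must be copied to `node` before the task runs."""
    return [f for f in task_inputs if f not in node_files[node]]

inputs = {"tile1.tif": 400, "tile2.tif": 300}
nodes = {"n1": {"tile1.tif"}, "n2": {"tile2.tif", "other.dat"}}
target = pick_node(inputs, nodes)  # "n1" holds 400 of 700 input bytes
print(target, speculative_copies(inputs, nodes, target))
```

In the real system, these decisions interact with the dynamic execution plan of the workflow engine and the state of the distributed file system.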

Impact of data density and endmember definitions on long-term trends in ground cover fractions across European grasslands

Long-term monitoring of grasslands is pivotal for ensuring continuity of many environmental services and for supporting food security and environmental modeling. Remote sensing provides an irreplaceable source of information for studying changes in grasslands. Specifically, Spectral Mixture Analysis (SMA) allows for quantification of physically meaningful ground cover fractions of grassland ecosystems (i.e., green vegetation, non-photosynthetic vegetation, and soil), which is crucial for our understanding of change processes and their drivers. However, although popular due to straightforward implementation and low computational cost, ‘classical’ SMA relies on a single endmember definition for each targeted ground cover component, thus offering limited suitability and generalization capability for heterogeneous landscapes. Furthermore, the impact of irregular data density on SMA-based long-term trends in grassland ground cover has also not yet been critically addressed.
We conducted a systematic assessment of i) the impact of data density on long-term trends in ground cover fractions in grasslands; and ii) the effect of endmember definition used in ‘classical’ SMA on pixel- and map-level trends of grassland ground cover fractions. We performed our study for 13 sites across European grasslands and derived the trends based on the Cumulative Endmember Fractions calculated from monthly composites. We compared three different data density scenarios, i.e., 1984–2021 Landsat data record as is, 1984–2021 Landsat data record with the monthly probability of data after 2014 adjusted to the pre-2014 levels, and the combined 1984–2021 Landsat and 2015–2021 Sentinel-2 datasets. For each site we ran SMA using a selection of site-specific and generalized endmembers, and compared the pixel- and map-level trends. Our results indicated no significant impact of varying data density on the long-term trends from Cumulative Endmember Fractions in European grasslands. Conversely, the use of different endmember definitions led in some regions to significantly different pixel- and map-level long-term trends, raising questions about the suitability of the ‘classical’ SMA for complex landscapes and large territories. Therefore, we caution against using the ‘classical’ SMA for remote-sensing-based applications across broader scales or in heterogeneous landscapes, particularly for trend analyses, as the results may lead to erroneous conclusions.

Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows

Research Projects

FONDA

Foundations of Workflows for Large-Scale Scientific Data Analysis

Contact