Special Call results
We thank everyone who applied for computation time within the Special Call on Important Societal Challenges: Access for AI and GPU Accelerated Applications.
Within this call, 115,007 node hours on the accelerated partition of the LUMI supercomputer were distributed among nine projects.
HPC experts from IT4Innovations decided on the allocations within this Special Call as follows:
Researcher: Oldřich Plchot
Project: Speech Anonymization Using Self-Supervised Models and Synthetic Data
Allocation: 10,000 node hours
Abstract: Our work focuses on speech anonymization. Besides the obvious use case of preserving a speaker's privacy, e.g., in a phone call, these systems can also protect the speaker's privacy when speech is further processed by a third party (cloud, phone app, etc.). Current systems often use a two-stage approach in which the text is first extracted by automatic speech recognition (ASR), and a text-to-speech (TTS) system, conditioned on pseudo-speaker information, then synthesizes the resulting speech. While these models achieve impressive results in terms of naturalness and privacy, their drawbacks are latency and the requirement of a transcribed dataset for ASR and high-quality data for TTS.
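For illustration, a minimal sketch of such a two-stage cascade is shown below; the ASR, TTS, and pseudo-speaker components are hypothetical placeholders, not the models used in the project.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for an ASR model, a pseudo-speaker pool, and a TTS
# model; the concrete systems used in the project are not specified here.

@dataclass
class PseudoSpeaker:
    embedding: tuple  # conditioning vector that defines the synthetic voice

class StubASR:
    def transcribe(self, waveform) -> str:
        return "placeholder transcript"  # stand-in for a real ASR model

class StubTTS:
    def synthesize(self, text: str, speaker: PseudoSpeaker) -> str:
        return f"<audio of '{text}' in voice {speaker.embedding}>"  # stand-in

def anonymize(waveform, asr, tts, speaker_pool):
    """Two-stage anonymization: ASR strips speaker identity by reducing the
    signal to text; TTS re-synthesizes it in a pseudo-speaker's voice."""
    text = asr.transcribe(waveform)               # stage 1: speech -> text
    pseudo = random.choice(speaker_pool)          # sample an artificial voice
    return tts.synthesize(text, speaker=pseudo)   # stage 2: text -> speech

if __name__ == "__main__":
    pool = [PseudoSpeaker((0.1, 0.2)), PseudoSpeaker((0.7, 0.4))]
    print(anonymize(waveform=None, asr=StubASR(), tts=StubTTS(), speaker_pool=pool))
```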
Researcher: Radim Špetlík
Project: Towards Explainability in Image-to-image Problems with Patch-oriented IADs
Allocation: 24,000 node hours
Abstract: The responsible use of image-to-image technology entails considering the potential for misuse, such as generating deepfakes for malicious purposes or creating non-consensual explicit content. Ensuring accountability in AI practices involves developing safeguards and regulations to mitigate these risks, including transparent disclosures when images are altered or generated by AI systems. Techniques like interpretable machine learning models or explainable AI methods can help elucidate how image-to-image models make decisions. This transparency is crucial for understanding and potentially rectifying undesirable outputs, aiding responsible deployment and decision-making. My research in style transfer, which combines iterative alpha deblending models (IAD) [2] with patch-based approaches (such as [3]), offers a substantial leap forward in enhancing the interpretability of AI-generated content (cf. [1]). By emphasizing patch-based methods, my work allows a clearer understanding of how specific elements within an image contribute to the overall style transfer process. This increased interpretability not only aids in grasping the nuances of style transformation but also facilitates the identification and manipulation of localized features, enhancing control and customization in image generation.
Researcher: Jan Brezina
Project: Surrogate and Generative Models for Effective Properties of 3D Fractured Media
Allocation: 6,007 node hours
Abstract: A deep geological repository (DGR) of radioactive waste for the Czech Republic will be located in crystalline rock, where the dominant water flow is through a fracture network. Despite all engineered barriers, the long-term barrier is the surrounding rock itself. The need to describe the transport of contaminants across a wide range of fracture scales has motivated us to investigate groundwater processes with discrete fracture-matrix (DFM) models, which combine a continuum with a discrete fracture network. Robust and efficient methods for upscaling the physical properties of real 3D fractured rock are necessary for constructing a DGR digital twin and performing safety assessment calculations. Building on our previously developed convolutional neural network models, we plan to extend them to the upscaling of 3D fractured media. The goal is to efficiently find a tensor field approximating a network of discrete random fractures. As a complementary approach, we plan to investigate diffusion models for the direct generation of the tensor field from the stochastic parameters of the fracture network.
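As a rough illustration of the upscaling idea, the sketch below maps a voxelized fracture indicator field to the six independent components of a symmetric conductivity tensor per coarse cell with a small 3D convolutional network; the architecture and tensor parametrization are assumptions made for illustration only, not the project's models.

```python
import torch
import torch.nn as nn

class UpscaleNet(nn.Module):
    """Illustrative 3D CNN: a voxelized fracture indicator field goes in, and
    six components of a symmetric 3x3 conductivity tensor come out for every
    coarse cell (spatial resolution reduced 4x by strided convolutions)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(channels, 6, kernel_size=1),  # xx, yy, zz, xy, xz, yz
        )

    def forward(self, fractures: torch.Tensor) -> torch.Tensor:
        return self.net(fractures)

# Example: a random 64^3 binary fracture indicator grid for a single sample.
grid = (torch.rand(1, 1, 64, 64, 64) > 0.95).float()
tensor_field = UpscaleNet()(grid)   # shape (1, 6, 16, 16, 16)
print(tensor_field.shape)
```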
Researcher: Ondřej Bouček
Project: Machine Learning for Lipid-Protein Docking
Allocation: 10,000 node hours
Abstract: Molecular docking is a key tool in structural molecular biology and computer-assisted drug design. The goal of ligand-protein docking is to predict the predominant binding mode(s) of a ligand with a protein of known three-dimensional structure. Machine learning tools, such as DiffDock, are approaching the performance of classical ligand-protein docking methods and offer significant potential for further improvement. Lipids are larger molecules with more degrees of freedom than most ligands, making lipid-protein docking more challenging. Furthermore, the study of lipid-protein interactions is highly relevant, as it is crucial for understanding Alzheimer's disease. In our project, we aim to extend current state-of-the-art methods for ligand-protein docking and create a machine-learning framework for lipid-protein docking. Specifically, we aspire to combine DiffDock with quantum-mechanical methods such as the SQM/COSMO scoring function to curate additional accurate lipid-protein interaction data, which are scarce.
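A minimal sketch of such a data-curation loop is given below; `propose_poses` and `sqm_cosmo_score` are hypothetical placeholders and do not reproduce the real DiffDock or SQM/COSMO interfaces.

```python
# Hypothetical sketch: a generative docking model proposes candidate lipid
# poses, a quantum-mechanics-based scoring function re-ranks them, and only
# the top-scoring poses are kept as additional training data.

def curate_lipid_poses(lipid, protein, propose_poses, sqm_cosmo_score,
                       n_candidates: int = 40, keep: int = 5):
    candidates = propose_poses(lipid, protein, n=n_candidates)
    scored = [(sqm_cosmo_score(pose, protein), pose) for pose in candidates]
    scored.sort(key=lambda pair: pair[0])       # assume lower score = more favorable
    return [pose for _, pose in scored[:keep]]  # curated training examples
```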
Researcher: Anton Bushuiev
Project: Generative modeling of protein-protein interactions
Allocation: 16,000 node hours
Abstract: Proteins are large molecules that drive nearly all processes in living cells [1]. The analysis of protein-protein interactions (PPIs) and their design unlocks application areas of tremendous importance, most notably in healthcare and biotechnology [2]. Recently, we have developed PPIformer, a new self-supervised method for designing protein-protein interactions, trained on a large number of (potentially all) known PPI structures [3]. PPIformer was shown to outperform existing methods in detecting favorable protein modifications and therefore opens up an exciting space for further development. In this project, we will extend PPIformer into a generative protein design assistant via self-play reinforcement learning.
Researcher: Tomas Soucek
Project: Learning to Generate Actions and State Transformations from Instructional Videos
Allocation: 8,000 node hours
Abstract: Learning robot policies is one of the central problems in artificial intelligence. Automating mundane or dangerous tasks could have a profound impact on applications in manufacturing, food production, or everyday household chores. Recent works in robotics aim to learn policies by defining the goals using images. Image-defined tasks have the advantage of providing detailed information, such as the desired stiffness of whipped cream or the thickness of avocado slices, which can be difficult to convey by other means, such as language. However, images of the goal states are rarely available in advance for arbitrary tasks and environments. Motivated by this limitation, we aim to generate realistic images of goal states for a variety of tasks defined by text prompts. While recent works excel at generating realistic, high-fidelity images from textual descriptions, these methods fail to generate images that transform objects while preserving the environment the objects are located in. We propose to leverage image sequences from large-scale video data for training generative models. These models will manipulate input images according to text prompts to show the goal states.
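One way such training data could be mined from narrated instructional videos is sketched below, assuming each clip comes with start/end times and a caption: a frame before the action serves as the input image, a frame after the action as the goal image, and the narration as the text prompt. `load_frame` and the clip metadata format are hypothetical.

```python
# Hypothetical sketch of mining (input image, goal image, text) triplets
# from temporally annotated video clips for training an image-editing
# generative model.

def build_triplets(clips, load_frame, offset_s: float = 3.0):
    """clips: iterable of dicts with 'video', 'start', 'end', 'caption'."""
    triplets = []
    for clip in clips:
        before = load_frame(clip["video"], max(0.0, clip["start"] - offset_s))
        after = load_frame(clip["video"], clip["end"] + offset_s)
        triplets.append({"input": before, "goal": after, "text": clip["caption"]})
    return triplets
```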
Researcher: Antonín Vobecký
Project: Vision-language understanding for autonomous driving
Allocation: 8,000 node hours
Abstract: Natural language descriptions of driving scenes will enable identifying and describing unusual events and corner cases, which are important in safety-critical applications such as autonomous driving. The overall goal of the project is to develop an automatic pipeline for dense captioning of driving scenes. This will in turn help improve current captioning/Q&A models in automotive contexts, which currently struggle with more complex scenes and with spatially localizing descriptions in the image and the scene. We will tackle this goal by automatically preparing more descriptive captions for autonomous driving scenes and using them to fine-tune vision-language models for captioning and question-answering applications.
Researcher: Evangelos Kazakos
Project: Spatio-temporal grounding in long untrimmed videos
Allocation: 25,000 node hours
Abstract: Spatio-temporal grounding aims at localizing a natural language expression in a video, both in space and time. Unlike object detection, where the goal is to detect a pre-determined number of classes, a spatio-temporal grounding model should be able to associate any sentence with a spatio-temporal tube in a video, as long as the entity described in the sentence is visible. Potential applications include text-to-video retrieval, where grounding can enhance retrieval by learning object-level instead of image-level representations. Another important application is learning for embodied perception in robotics. Imagine robots that can learn how to perform a specific task, e.g., how to cook a specific meal, just by watching instructional videos from YouTube. Spatio-temporal grounding makes a step in that direction. Yet progress on this task is hindered by the small scale of existing datasets [1,2]. In this project, our first goal is to automatically build a large-scale dataset for spatio-temporal grounding. We will use the HowToCaption dataset [3], a dataset of 25M high-quality video-caption pairs that provides temporal annotations (start/end times) along with text. We will lift HowToCaption into a spatio-temporal grounding dataset by automatically augmenting it with bounding-box annotations using a state-of-the-art spatio-temporal grounding model, namely TubeDETR [4]. To assess the effectiveness of our dataset, we will train TubeDETR on it and evaluate it on other spatio-temporal grounding datasets [1,2] in both zero-shot and fine-tuning settings. We will also pre-train the text-to-video retrieval baseline from [3] and compare it with the one trained on HowToCaption [3] to assess the importance of grounding for video retrieval.
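A minimal sketch of the pseudo-labeling step is shown below, assuming a hypothetical `grounding_model` callable and `load_clip` helper that stand in for TubeDETR and the video loading code (whose real interfaces differ).

```python
# Hypothetical sketch: every temporally localized caption from HowToCaption
# is passed through a spatio-temporal grounding model to obtain a tube of
# per-frame bounding boxes, yielding a large pseudo-labeled dataset.

def pseudo_label(pairs, load_clip, grounding_model):
    """pairs: iterable of (video_id, start_s, end_s, caption) tuples."""
    dataset = []
    for video_id, start_s, end_s, caption in pairs:
        frames = load_clip(video_id, start_s, end_s)   # trimmed clip
        tube = grounding_model(frames, caption)        # per-frame boxes
        dataset.append({"video": video_id, "start": start_s, "end": end_s,
                        "caption": caption, "boxes": tube})
    return dataset
```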
Researcher: Jan Hůla
Project: Efficient Language Models
Allocation: 8,000 node hours
Abstract: Large language models (LLMs) can produce results comparable to human annotators for many text-related tasks. Nevertheless, using these costly models to extract information from large text corpora can be impractical. Therefore, we aim to produce a set of LLM-based pipelines that drive the cost of information extraction as low as possible while retaining high accuracy. Our goal is to extend and consolidate the latest techniques for improving the accuracy of LLMs together with those for model compression.
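One possible pipeline pattern consistent with this cost/accuracy goal, sketched purely for illustration and not the project's actual design, is a confidence-based cascade in which a cheap model handles easy queries and a costly model is consulted only for uncertain ones; `small_model` and `large_model` are placeholders returning (answer, confidence) pairs.

```python
# Hypothetical cascade sketch: escalate to the expensive model only when the
# cheap model reports low confidence, trading a small accuracy loss on easy
# documents for a large reduction in extraction cost.

def extract(document, query, small_model, large_model, threshold: float = 0.8):
    answer, confidence = small_model(document, query)
    if confidence >= threshold:
        return answer                        # cheap path covers most documents
    answer, _ = large_model(document, query) # escalate the hard cases
    return answer
```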