Second International Workshop on Distributed and Parallel Programming for Extreme-scale AI

Location: Mines Paris - PSL University, 60 Boulevard Saint-Michel, Paris
Main Room: L108-A

DP2E-AI 2026 PROGRAM

Program Chair: Serge G. Petiton

Monday, June 29th

8:30-9:00 | Registration, welcome coffee

9:00-9:30 | Opening session

Welcome address from Ecole des Mines de Paris
Paolo Stringari, Director of Research at MINES Paris – PSL
Corinne Ancourt, Fabien Coelho
Introduction to the workshop
Serge G. Petiton,
[Bio]
Serge G. Petiton received the B.S. degree in mathematics, the Ph.D. degree in computer science, and the “Habilitation à diriger des recherches” from Sorbonne University, Pierre and Marie Curie Campus. He was a postdoctoral researcher and junior research scientist at Yale University in 1989-1990, and a researcher at the “Site Experimental en Hyperparallelisme” (supported by CNRS and CEA) from 1991 to 1994. During the period 1991-1994 he was also an affiliate research scientist at Yale and a visiting research fellow in several US laboratories (NASA/ICASE, AHPCRC,..). From September 1994 to February 2026, Serge G. Petiton was a tenured full Professor at the University of Lille in France, and he held associated senior positions with CNRS and/or INRIA in several laboratories (LIFL and CRISTAL in Lille, and ASCI, LRI and the “Maison de la Simulation” in Paris-Saclay). He was P.I. of several international projects with Japan and Germany (ANR, CNRS, SPPEXA,..) and has had many industrial collaborations (TOTAL, CEA, Thales, Airbus, IBM, Nvidia, Intel, Huawei…). He has been scientific director of 27 Ph.D.s and has authored more than 150 articles in international journals, books, and conferences. His main current research interests are in “Parallel and Distributed Computing”, “Sparse Linear Algebra”, “Language and Programming Paradigms”, and “AI methods”.
Since 2019, he has been working as an independent scientific consultant for industrial companies, and since April 2026, he has been affiliated with the University of Tokyo’s Information Technology Center.
[Slides]

9:30-12:40 | Keynotes

Chair: Nahid Emad

HPC in Transition. [9:30-10:10]
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory, USA, University of Manchester, UK
[Bio]
Jack Dongarra specializes in numerical algorithms in linear algebra, parallel computing, the use of advanced computer architectures, programming methodology, and tools for parallel computers. He holds appointments at the University of Manchester, Oak Ridge National Laboratory, and the University of Tennessee. In 2019 he received the ACM/SIAM Computational Science and Engineering Prize. In 2020 he received the IEEE-CS Computer Pioneer Award. In 2021 he received the ACM A.M. Turing Award for his pioneering contributions to numerical algorithms and software that have driven decades of extraordinary progress in computing performance and applications. He is a Fellow of the AAAS, ACM, IEEE, and SIAM; a foreign member of the British Royal Society and a member of the U.S. National Academy of Sciences and the U.S. National Academy of Engineering.
[Abstract]
High-performance computing is entering a decisive transition driven by forces that are largely external to traditional scientific HPC. The economics of AI and hyperscale cloud now shape leading-edge silicon, system architectures, and software ecosystems, while energy and data movement have become the dominant constraints on performance, facility design, and long-term sustainability. This talk examines how these dynamics shift HPC’s center of gravity from a primarily FP64, node-centric worldview toward accelerator-heavy, rack-scale, and workflow-defined systems.
We argue that the next era of scientific capability will be measured less by peak floating-point rates and more by time–energy–fidelity trade-offs across end-to-end pipelines. The most plausible path to “effective zettascale” is not brute-force FP64, but certified mixed-precision algorithms, communication-avoiding methods, AI-augmented reduced-order models, and hybrid AI+simulation workflows with rigorous error control and uncertainty quantification. We also outline an emerging reference architecture for platforms comprising integrated simulation, AI, and data/workflow partitions, federated with secure cloud resources and instruments.
[Slides]
HPC+AI Converged Supercomputing on LineShine Exascale System [10:10-10:50]
Yutong Lu, Sun Yat-Sen University, China
[Bio]
Dr. Yutong Lu is a Professor in the Department of Computer Science and Engineering at Sun Yat-sen University, China. She also serves as Director of the National Supercomputing Centers in Guangzhou and Shenzhen. Professor Lu specializes in high-performance computing, with research interests spanning advanced computer architecture, programming models, and parallel computing environments. She has extensive research and development experience across multiple generations of China’s supercomputing systems. She played a key role as Deputy Chief Designer of the Tianhe-2 supercomputer, which ranked No. 1 on the TOP500 list for six consecutive releases. She currently serves as Chief Designer of the LineShine exascale supercomputer. Throughout her career, Professor Lu has been dedicated to bridging high-performance computing with major scientific and engineering applications. She has helped make advanced computing capabilities more accessible to a broader community of researchers and industry users.
Professor Lu has been recognized as an ISC Fellow and a CCF Fellow. She is a member of the Expert Committee of China’s National Key R&D Program. She has also led multiple major research projects supported by the MOST and the NSFC. Her current research focuses on cutting-edge computer architectures and the convergence of advanced AI and HPC systems and applications.
[Abstract]
Supercomputing and artificial intelligence are advancing rapidly. Their convergence is driving scientific and engineering applications toward deeper integration of simulation, data, and AI models. This trend is reshaping system architectures toward heterogeneous acceleration, mixed-precision computing, high-bandwidth memory, and software–hardware co-design. This report presents LineShine, a new-generation AI–HPC converged supercomputing system. It introduces the system design philosophy and the balanced implementation of computing, communication, storage, and software capabilities. Large-scale application cases in atmosphere–ocean modeling, industrial simulation, advanced materials, and biomedicine are discussed to demonstrate the value of converged supercomputing for scientific discovery and engineering innovation. The report also outlines future directions for AI–HPC converged systems and application ecosystems.
[Slides]

10:50-11:20 | Coffee break

AI-for-Science by Integrations of Simulations/Data/Learning on Heterogeneous Supercomputers [11:20-12:00]
Kengo Nakajima, University of Tokyo, Japan
[Bio]
Kengo Nakajima has been a professor in the Supercomputing Research Division of the Information Technology Center at the University of Tokyo since 2008. Prior to joining the University of Tokyo in 2004, he spent 19 years in industry. He has also been a deputy director of RIKEN Center for Computational Science (R-CCS) since 2018. His research interests cover computational mechanics, parallel numerical algorithms and high performance computing (HPC). Kengo holds a B.Eng in aeronautics (University of Tokyo, 1985), an MS in aerospace engineering (University of Texas at Austin, 1993), and a PhD in engineering mechanics (University of Tokyo, 2003).
[Abstract]
Since 2015, we have advanced the BDEC initiative to integrate Simulation, Data, and Learning (S+D+L). To fully leverage heterogeneous supercomputers at the University of Tokyo, such as Wisteria/BDEC‑01 and Miyabi, both equipped with CPUs and GPUs, we have developed the advanced software infrastructure h3‑Open‑BDEC. This platform enables AI-for-Science workflows and has produced internationally recognized achievements across fields including earth and life sciences, while also supporting hybrid Quantum-HPC (QC-HPC) computing. This presentation introduces AI-for-Science applications on heterogeneous supercomputers, and examples of hybrid QC-HPC computing. Recently, RIKEN launched an international initiative with Fujitsu and NVIDIA for "FugakuNEXT" development, where FugakuNEXT is based on heterogeneous compute nodes consisting of CPUs by Fujitsu and GPUs by NVIDIA. This presentation will also introduce the prospects for developing AI-for-Science type applications on the FugakuNEXT system based on the experiences with Wisteria/BDEC-01 and Miyabi.
[Slides]
AI in Science: From Myth to Reality [12:00-12:40]
Christophe Calvin, CEA, France
[Bio]
Christophe Calvin, PhD in applied mathematics from Grenoble, is a research director at the CEA. He began his career developing massively parallel CFD codes, then led major simulation projects for new generation nuclear reactors. He later headed laboratories and research programs in reactor physics, thermohydraulics and supercomputing, with strong international HPC collaborations. Since 2015, he has overseen digital simulation, intensive computing, AI, data and IT in fundamental research. He now serves as Research Data and Code Administrator and recognized senior fellow at the CEA.
[Abstract]
Artificial intelligence has permeated many areas of society, from everyday life to the business world and numerous fields of research.
AI has become a strategic international issue, a source of competition among nations, and a key challenge for global industries. This competition has numerous economic and environmental impacts and sparks major societal debates, even raising questions about the future of humanity.
Nations and major digital players have embarked on a frantic race marked by an explosion in the construction of data centers, whose power is no longer measured in PFlops but in gigawatts. This race is also reflected in the sheer scale of machine learning models, which are measured in trillions of parameters, requiring ever-increasing amounts of data. There is also a race regarding the performance and reasoning capabilities of these models: we speak of agent-based AI, world models…
But beyond this global frenzy, what is the reality on the ground for a research organization like the CEA, which, to remain among the world’s most innovative organizations, must embrace this shift toward AI for science?
During this presentation, using examples and illustrations drawn from real-world cases, I will discuss the practical challenges related to infrastructure, security policy, governance, and changes in research processes.
[Slides]

12:40-13:40 | Lunch (Ecole des Mines de Paris)

13:40-15:40 | Keynotes

Chair: Corinne Ancourt

Dataflow Execution: The Future of Compute for Agentic AI [13:40-14:20]
Kunle Olukotun, Stanford University and SambaNova, USA
[Bio]
Kunle Olukotun is the Cadence Design Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. He founded Afara Websystems to develop high-throughput, low-power multicore processors for server systems. The Afara multi-core multi-thread processor, called Niagara, was acquired by Sun Microsystems and now powers Oracle's SPARC-based servers. Olukotun co-founded SambaNova Systems, a Machine Learning and Artificial Intelligence company, and continues to lead as their Chief Technologist.
Olukotun is the Director of the Pervasive Parallel Lab and a member of the Data Analytics for What's Next (DAWN) Lab, developing infrastructure for usable machine learning. He is a member of the National Academy of Engineering, an ACM Fellow, and an IEEE Fellow for contributions to multiprocessors on a chip design and the commercialization of this technology. He has received the ACM-IEEE CS Eckert-Mauchly Award and the IEEE Harry H. Goode Memorial Award.
[Abstract]
Agentic AI is fundamentally changing the requirements for AI infrastructure. These systems orchestrate complex workflows involving large reasoning models, specialized domain-specific models, retrieval systems, and external tools, often requiring many sequential rounds of token generation before producing a final result. As a result, the user experience and productivity of agentic AI are increasingly determined by token generation speed. Current GPU-based architectures were designed primarily for high-throughput training and batch-oriented inference. While they excel at dense matrix multiplication, they are fundamentally limited during autoregressive token generation, where memory access and inter-chip communication increasingly dominate performance.
In this talk, I will argue that agentic AI demands a fundamentally different computation model based on Dataflow Execution. By directly mapping computation and communication onto a spatially distributed, hardware-synchronized architecture, dataflow systems eliminate much of the overhead of instruction-driven execution, reduce data movement, and sustain high utilization across compute and memory resources for both large and small models. This enables significantly higher token generation rates, lower latency, and more efficient execution of multi-model agentic workflows.
Using examples from SambaNova’s Reconfigurable Dataflow Architecture (RDA) and recent research on dynamic RDAs, I will describe how dataflow-based AI systems can deliver the responsiveness, flexibility, scalability, and efficiency required for the next generation of agentic AI applications.
[Slides]
Scaling Processes Rather than Models: Foundations and Challenges for Agentic AI [14:20-15:00]
Ian Foster, University of Chicago and ANL, USA
[Bio]
Ian Foster is Senior Scientist and Distinguished Fellow, and director of the Data Science and Learning Division, at Argonne National Laboratory, and the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. He has a BSc degree from the University of Canterbury, New Zealand, and a PhD from Imperial College, United Kingdom, both in computer science. His research is in distributed, parallel, and data-intensive computing technologies, and their applications to scientific problems. The Globus software that he co-designed is widely used as scientific infrastructure. He is convinced that intelligent agents are going to transform science and scientific infrastructure.
[Abstract]
The past decade of AI has been characterized by scaling laws relating model performance to parameters, data, and compute. Agentic systems introduce a different notion of scale: the complexity, duration, and consequence of the processes that AI systems can execute. Such processes may involve planning, experimentation, tool use, interaction with humans, and adaptation to evolving environments. This talk explores the implications of this shift from model-centric to process-centric AI. I examine how concepts such as reasoning, memory, evaluation, and reliability must be reconsidered when intelligence is expressed through extended computational processes rather than isolated predictions. Using examples from scientific discovery and autonomous experimentation, I discuss emerging approaches to representing agent state, managing uncertainty, tracking provenance, and evaluating long-horizon behavior. The resulting perspective suggests a research agenda in which the primary object of study is not the model itself but the computational process that the model helps create and execute.
[Slides]
Distributed and Parallel Sparse Matrix Computation in the Age of AI [15:00-15:40]
Serge Petiton,
[Bio]
Serge G. Petiton received the B.S. degree in mathematics, the Ph.D. degree in computer science, and the “Habilitation à diriger des recherches” from Sorbonne University, Pierre and Marie Curie Campus. He was a postdoctoral researcher and junior research scientist at Yale University in 1989-1990, and a researcher at the “Site Experimental en Hyperparallelisme” (supported by CNRS and CEA) from 1991 to 1994. During the period 1991-1994 he was also an affiliate research scientist at Yale and a visiting research fellow in several US laboratories (NASA/ICASE, AHPCRC,..). From September 1994 to February 2026, Serge G. Petiton was a tenured full Professor at the University of Lille in France, and he held associated senior positions with CNRS and/or INRIA in several laboratories (LIFL and CRISTAL in Lille, and ASCI, LRI and the “Maison de la Simulation” in Paris-Saclay). He was P.I. of several international projects with Japan and Germany (ANR, CNRS, SPPEXA,..) and has had many industrial collaborations (TOTAL, CEA, Thales, Airbus, IBM, Nvidia, Intel, Huawei…). He has been scientific director of 27 Ph.D.s and has authored more than 150 articles in international journals, books, and conferences. His main current research interests are in “Parallel and Distributed Computing”, “Sparse Linear Algebra”, “Language and Programming Paradigms”, and “AI methods”.
Since 2019, he has been working as an independent scientific consultant for industrial companies, and since April 2026, he has been affiliated with the University of Tokyo’s Information Technology Center.
[Abstract]
Artificial intelligence, particularly large language models (LLMs) and machine learning applications, is redefining architectures as well as approaches to distributed and parallel programming, from chip design to new data structures and programming paradigms, including new arithmetic methods and accelerators. Nevertheless, expertise in linear algebra and high-performance computing remains essential, particularly for sparse nonsymmetric matrix computations.
In this talk we focus on computations from various fields, such as certain aspects of LLM algorithms, graph neural networks, and several other AI methods, as well as, of course, more classical problems in computational science. We notably study the computation of sequences of multiplications of large, irregular, and nonsymmetric sparse matrices by a vector or a skinny dense matrix. The most efficient solutions are often obtained using clusters of multicore nodes, composed of different sets of cores, sometimes without accelerators. We present several experiments conducted on various multi-core node clusters at the Flatiron Institute and on RIKEN’s Fugaku supercomputer. We particularly consider cases where the vectors or the skinny dense matrices are not necessarily loaded onto every node and must therefore be distributed. We evaluate the impact of several methods for distributing data among nodes and core sets, of two Fugaku network topologies, and of various algorithmic strategies. We also present two distributed and parallel sparse matrix generators that require no I/O operations, enabling a wide range of experiments.
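As a reader's aid, the following minimal sketch illustrates the kind of kernel the abstract refers to: a sparse matrix-vector product in CSR form, with rows split into blocks so that each simulated "node" receives only the vector entries its block needs. It is not the speaker's implementation; the function names, the row-block partitioning, and the small example matrix are assumptions made purely for illustration, and a real distributed code would use message passing rather than a loop.

```python
def csr_spmv(indptr, indices, data, x):
    """Reference y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        y.append(sum(data[k] * x[indices[k]] for k in range(start, end)))
    return y


def row_block(indptr, indices, data, lo, hi):
    """Extract rows lo..hi-1 as a local CSR block, plus the x indices it needs."""
    start, end = indptr[lo], indptr[hi]
    local_indptr = [p - start for p in indptr[lo:hi + 1]]
    local_indices = indices[start:end]
    local_data = data[start:end]
    needed = sorted(set(local_indices))
    return local_indptr, local_indices, local_data, needed


def distributed_spmv(indptr, indices, data, x, bounds):
    """Simulate a row-block distribution (hypothetical helper, sequential loop)."""
    y = []
    for lo, hi in bounds:
        lp, li, ld, needed = row_block(indptr, indices, data, lo, hi)
        # Only these entries of x would have to be sent to this node.
        x_local = {j: x[j] for j in needed}
        for r in range(hi - lo):
            y.append(sum(ld[k] * x_local[li[k]] for k in range(lp[r], lp[r + 1])))
    return y


# Small nonsymmetric 4x4 example (illustrative data only).
indptr = [0, 2, 3, 5, 6]
indices = [0, 3, 1, 0, 2, 1]
data = [2.0, 1.0, 3.0, 4.0, 5.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0]
bounds = [(0, 2), (2, 4)]  # two row blocks, one per simulated node

y_ref = csr_spmv(indptr, indices, data, x)
y_dist = distributed_spmv(indptr, indices, data, x, bounds)
```

In this toy case each block needs three of the four vector entries; for large irregular matrices the size of that set is what drives the communication volume.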
[Slides]

15:40-16:10 | Break

16:10-17:10 | Session: Arithmetic in the Age of AI

Chair: Georges-André Silber

GPU's lower precision tensor cores: will they help my code run faster? [16:10-16:40]
Géraud Krawezik, Flatiron Institute, USA
[Bio]
Géraud Krawezik is a member of the Scientific Computing Core at the Flatiron Institute, the internal research division of the Simons Foundation in NYC. His work involves helping researchers make the best use of the resources available to them, which includes following and benchmarking the advances in computing, networking, and storage. While his background is in HPC (PhD from University of Paris Orsay and post-doc at UIUC), he also works closely with ML researchers to provide them with the technical expertise needed to run large-scale training of scientific models and, lately, to develop agentic tools.
[Abstract]
As GPU vendors are shifting their focus from High Performance Computing to Machine Learning, they are pushing more lower precision units (BF16, FP8, NVFP4) in their products at the expense of higher precision ones (double and single floating point).
For this project we have first benchmarked the matrix-matrix capabilities of the GPUs with all the different precisions available, as it is at the core of these optimizations, including the recently added emulation modes. We have also tested how the "sparsity features" of these units can be used. Then we have studied their application to machine learning, by benchmarking the quantized inference of different Large Language Models through publicly available frameworks. Future work will focus on scientific foundation models, to see if these techniques can be useful for both training and inference.
[Slides]
Impact of MxP and emulating GEMM technologies [16:40-17:10]
Toshiyuki Imamura, RIKEN, Japan
[Bio]
Toshiyuki Imamura leads the Large-Scale Parallel Numerical Computing Technology Team at RIKEN R-CCS, specializing in numerical libraries for emerging HPC systems such as FugakuNEXT. He earned his diploma and doctorate in Applied Systems and Sciences from Kyoto University in 1993 and 2000, respectively. Previously, he was a researcher at CCSE, JAERI (1996-2003), a visiting scientist at HLRS (2002), and an associate professor at the University of Electro-Communications (2003-2012). His research interests include HPC, performance autotuning, parallel eigenvalue computation, and the high-performance EigenExa library for GPU systems. His group achieved an HPL-MxP ranking of 2.0 EFLOPS on the full Fugaku system in 2020-2021. He was a Gordon Bell Prize finalist at SC05, SC06, and SC20 for his contributions to large-scale parallel eigensolvers. He aims to nurture young HPC talent and has participated nearly 10 times as the Japanese representative in the IHPCSS series.
[Abstract]
We are developing the Ozaki scheme, which approximates high-precision matrix multiplication using low-precision methods such as the Error-Free Transform, polynomial decomposition, and CRT. The Ozaki-2 scheme achieves a linear cost of O(P) with CRT, where P is the number of moduli used for truncation. Low-precision multiplication (e.g., int8 or FP8 input, int32 output) seamlessly integrates with existing matrix engines, enabling high performance on hardware like GPUs, TPUs, and NPUs. On platforms such as NVIDIA Blackwell, emulation reaches about 100-150 TFLOPS, comparable to or surpassing DGEMM's 40 TFLOPS, which is limited by FP64. Using a CRT-capable accumulator and data formats enables controlled output based on input. This work covers recent improvements, results, and how variable precision affects the cost and accuracy of algorithms that employ the Ozaki-2 scheme on AI hardware.
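To make the CRT idea concrete for readers, here is a minimal sketch of an exact integer matrix product assembled from P small-modulus products via the Chinese Remainder Theorem. It shows only the CRT recombination step on integer inputs and is not the Ozaki-2 scheme itself, which also has to scale and truncate floating-point inputs. The moduli, the function names, and the example matrices are assumptions chosen for illustration.

```python
from math import prod

# Pairwise coprime moduli whose residues fit in 8 bits (an assumed choice).
MODULI = [251, 241, 239]


def matmul_mod(a, b, m):
    """Multiply the residues of a and b modulo m with plain integer accumulation."""
    inner, cols = len(b), len(b[0])
    return [
        [sum((row[t] % m) * (b[t][j] % m) for t in range(inner)) % m
         for j in range(cols)]
        for row in a
    ]


def crt_combine(residues, moduli):
    """Recover x from its residues, assuming |x| is at most half the modulus product."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    x %= big_m
    return x - big_m if x > big_m // 2 else x


def crt_matmul(a, b, moduli=MODULI):
    """Exact integer product from one small-modulus product per modulus."""
    parts = [matmul_mod(a, b, m) for m in moduli]
    rows, cols = len(a), len(b[0])
    return [
        [crt_combine([part[i][j] for part in parts], moduli) for j in range(cols)]
        for i in range(rows)
    ]


# Illustrative integer inputs; every entry of the true product stays in range.
A = [[1000, -2000], [300, 4]]
B = [[7, -8], [9, 10]]
C = crt_matmul(A, B)
```

The work grows with the number of moduli, one residue product each, which is the linear O(P) behaviour the abstract mentions; adding moduli widens the range of results that can be recovered exactly.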
[Slides]

17:10-17:50 | Panel and general discussion

Moderator: Ian Foster.

Participants: Christophe Calvin, Jack Dongarra, Yutong Lu, Kunle Olukotun, and Kengo Nakajima

17:50-18:00 | Closing remarks of the first day

19:00-21:00 | Workshop Dinner

Restaurant Procope, https://www.procope.com/

Tuesday, June 30th

8:30-9:00 | Registration, welcome coffee

09:00-10:30 | AI and HPC

Chair: Albert Cohen

Scaling RL at Hugging Face: A Systems View of Inference-Training Co-Design, NCCL Weight Sync, and Long-Context MoE [09:00-09:30]
Amine Dirhoussi, Hugging Face, France
[Bio]
Amine Dirhoussi is a machine learning engineer at Hugging Face, working on the TRL library. His focus is on scaling large-model post-training: distributed asynchronous RL, MoE training at scale, and long-context parallelism. He contributed to the MoE pretraining work presented at DP2E-AI 2025 by Ferdinand Mom.
[Abstract]
Reinforcement Learning is now the dominant post-training regime for frontier LLMs, and it has turned the open training stack into an HPC workload. A single GRPO step interleaves a large-scale inference rollout, a reward computation, and a synchronous gradient update on a Mixture-of-Experts model that may need to attend to 100k+ tokens. Making this loop fast on the Hugging Face open-source stack (TRL, transformers, and accelerate) forced us to revisit every parallelism axis the library exposes.
This talk gives a systems view of what we built and what we had to fix. We start from the GRPO loop and the inference split, including NCCL weight synchronization between the trainer and the rollout engine. We then walk through the parallelism stack one axis at a time (DP, ZeRO/FSDP, TP, EP, SP, CP), explaining why each is needed, how it lives inside transformers, and where MoE models broke the existing abstractions. We focus on the new Expert Parallel implementation in transformers, built on top of TP, and on combining EP with SP and CP for long-context RL.
[Slides]
Distributed AI job using Ultra-Ethernet (UEC) networking for scale-out [09:30-10:00]
Benjamin Maze, AMD, France
[Bio]
Benjamin Maze comes from network integration (Dimension Data / NTT) and then moved to security as a Systems Engineer (ADC, DDoS protection, WAF at Radware). In 2022, he joined Pensando, which was acquired by AMD, as an SE for distributed services on DPUs (SmartNICs, smart switches, and now AINIC). AMD Pensando is now part of AMD's networking business unit, the Networking Technology and Solutions Group (NTSG).
[Abstract]
AMD provides CPUs, GPUs, and DPUs (AINIC) for building large-scale AI clusters. This presentation covers the AINIC, which is based on the AMD Pensando DPU and provides a UEC-ready feature set to get the best RDMA performance. The AINIC is used for inter-node GPU-to-GPU RDMA communication, without going through the CPU, over 400/800 Gigabit Ethernet interfaces.
[Slides]
Fleets Without a Kernel: The Distributed Systems Substrate Agentic AI Is Missing [10:00-10:30]
Laurent Bindschaedler, Max Planck Institute, Germany
[Bio]
Laurent Bindschaedler is a Research Group Leader at the Max Planck Institute for Software Systems, where he leads the Data Systems Group. He is also an Associate Fellow at the University of Saarland. His research sits at the intersection of operating systems, databases, and machine learning, with recent work on abstractions for long-horizon LLM agents, transactional semantics for agent tool use, and benchmarks for agentic workflows. He holds a PhD from EPFL and was a postdoctoral fellow at MIT CSAIL. His work has been published at SOSP, ASPLOS, EuroSys, ICML, EMNLP, and NDSS.
[Abstract]
Large language models have made code generation cheap. They have also, more quietly, made action cheap: a growing share of software is now produced and executed by agents that do not merely return outputs but act on the world, writing code, restarting services, and moving money. Extreme-scale AI no longer runs a single agent; it coordinates thousands of them at once. A fleet of agents that act, speculate in parallel, and share state is, in every meaningful sense, a distributed system. Yet it is a distributed system without a kernel: it has no commit protocol to decide when a speculative action becomes real, no coherence over the state its agents share, and no scheduler or consensus to coordinate them. Classical parallel and distributed computing built these guarantees for processes, cores, and nodes (commit protocols for transactions, cache coherence for shared memory, scheduling and consensus for distributed nodes), but their agentic counterparts do not yet exist. This talk argues that the substrate beneath agentic AI rests on three systems contracts, effect (when an action commits, and who agrees), state (what persists, and who owns it), and composition (how independently planned agents coordinate without a coordinator), and that each becomes a distributed and parallel problem at extreme scale. The community that built commit protocols, cache coherence, schedulers, and consensus is best positioned to build their agentic counterparts. The primitives already exist but the kernel does not.
[Slides]

10:30-11:00 | Coffee break

11:00-12:00 | AI and HPC

Chair: TBA

From Agent Framework to Evolvable Agent Systems: How openJiuwen Turns Multi-Agent Work into Reusable Skills [11:00-11:30]
Jeff Pan, University of Edinburgh, UK
[Bio]
Jeff Pan is a professor of knowledge computing in the School of Informatics at the University of Edinburgh. He is also chairing the Knowledge Graphs Group at the Alan Turing Institute. He is the Chief Editor and main author of the first book on Knowledge Graphs. His work sits at the intersection of large language models, knowledge graphs, and autonomous agent systems, with a focus on making AI systems more reliable, interpretable, and useful in complex real-world settings. He has contributed to research on, among others, retrieval-augmented generation, agent memory, multi-agent coordination, and evaluation for long-context and long-horizon tasks and has collaborated with industry partners to translate these ideas into deployable systems. In his talk, Prof Pan will examine how agentic AI is moving beyond isolated tool use toward systems that can retain experience, coordinate specialised capabilities, and improve themselves over time.
[Abstract]
Agentic AI is moving from conversation and workflow routing toward grounded tool use, multi-agent collaboration, and ultimately evolvable systems. This talk presents openJiuwen as an open agentic AI stack for that transition: a platform that connects structured search, persistent state, memory, tools, protocols, evaluation, and orchestration. Rather than treating these as separate modules, openJiuwen aims to turn them into a coherent substrate through which agents can ground actions in evidence, manage what should be retained or forgotten, recover from changing execution state, and learn from experience.
The talk then focuses on JiuwenSwarm as openJiuwen’s L4-to-L5 collaboration layer. It argues that the goal is not to run more agents, but to select collaboration when task structure justifies it, coordinate agents through explicit state and replanning, and convert successful collaboration into reusable skills. Through task-aware execution modes, dynamic team formation, shared workflows, skill evolution, and governed human intervention, JiuwenSwarm points toward agent systems in which every task leaves behind more than an answer: it leaves behind a better capability for the next task.
[Slides]
Disaggregated inference in Huawei Ascend SuperNode: from LLM to Omni-modality models [11:30-12:00]
Liu Hongsheng, Huawei, China
[Bio]
I received my PhD in Statistics and Operations Research from UNC Chapel Hill in 2020. Currently, I serve as an LLM/omni-modality model serving expert at Huawei 2012 Laboratories. I am also a member of the vllm-project and founder of the vLLM-Omni omni-modal serving engine, and one of the major contributors to the development of the disaggregated inference architecture in the vLLM community, including AFD and EPDG.
[Abstract]
The shift from text LLMs to omni-modality models introduces massive workload asymmetry and heavy computational strain that traditional unified serving paradigms cannot handle efficiently.
This talk presents the architectural design, implementation practices, and production insights of disaggregated inference within the Huawei Ascend SuperNode platform. For high-throughput, low-latency text LLM serving, we deep-dive into our core intra-model optimization: Attention-FFN (Attn-FFN) disaggregation, which decouples memory-bound attention layers from compute-bound feed-forward networks across specialized hardware pools.
Scaling this foundation to multi-media workloads, we introduce the Encoder-Prefill-Decode-Generation (EPDG) disaggregation framework natively integrated with vLLM and vLLM-Omni. We detail how the SuperNode orchestrates this specialized four-stage pipeline—isolating heavy multi-media encoding, decoupling variable-length pre-fills, and optimizing cross-node KV cache transfers via the high-bandwidth Ascend interconnect fabric during decoding and text/media generation. Finally, we share concrete performance data from large-scale deployments, offering a production-proven infrastructure blueprint for next-generation omni-modality AI.
[Slides]

12:00-12:40 | Poster presentations

Chairs: Chong Li and Soraya Zertal

12:40-14:00 | Lunch (Ecole des Mines de Paris)

13:30-14:30 | Poster Session

14:30-15:50 | Keynotes

Chair: Denis Barthou

Building Agent Societies: A revolution in software engineering. First applications to hardware and software optimization for large scale AI computations [14:30-15:10]
Ludovic Denoyer, IMEC, Belgium
[Bio]
Ludovic Denoyer is a French AI researcher specializing in deep reinforcement learning, natural language processing, and agent-based systems. Currently the Scientific Director at IMEC France, leading the imec AI Labs in Paris (http://ailabs.imec-int.com), he has a career spanning both premier academic and industrial labs. He is a full Professor at Sorbonne University (on sabbatical), where he has authored over 150 papers on sequential learning and efficient neural networks. He was previously a Research Scientist at Meta AI (FAIR), a Staff Scientist at Ubisoft (focusing on gaming AI bots), and a key researcher at H Company, leading the agent team.
[Abstract]
The software engineering landscape is undergoing a fundamental paradigm shift, moving away from human-written code and rigid, complex data structures toward "Agent Societies." In this new era, natural language serves as a universal multimodal communication protocol, while machine learning and open-ended foundational models enable systems to learn, adapt, and reason dynamically. Rather than continuing to build ever-larger individual models, the future of computing lies in scaling collective, distributed intelligence.
In this talk, we introduce the vision and research of the newly established imec AI Labs (Paris) and explore the foundational pillars of this software revolution. Using first results, we illustrate how this agentic revolution goes beyond traditional software to reshape the hardware-software boundary, with examples from kernel co-design and silicon photonics.
[Slides]
Pre-training in the era of large language models [15:10-15:50]
Konstantin Mishchenko, META, France
[Bio]
Konstantin Mishchenko is a Research Scientist at Meta Paris. He was previously a member of the Distributed AI team at Samsung AI Center in the UK and a postdoc at Inria Sierra, and he did a PhD at KAUST under the supervision of Peter Richtarik. His research spans optimization theory, pre-training of large language models, and post-training of language models for long-context data and math generation.
[Abstract]
In this talk, we'll discuss the impact of optimizers in LLM pre-training and how we can design more efficient methods that take into account later stages of mid-training and post-training. We will discuss strategies used to scale training across thousands of GPUs, and recipes for learning rate scheduling and model averaging.
[Slides]

15:50-16:10 | Break

16:10-17:10 | Session Agentic AI

Chair: TBA

Tackle the interesting, automate the rest: Building the next generation of computer use agents [16:10-16:40]
Kai Yuan, H Company, Germany
[Bio] [Abstract]
While frontier models have demonstrated the potential for "computer use," the practical reality is often throttled by prohibitive inference costs and inconsistent reliability. This talk explores the development of a best-in-class agent, Holo3, that breaks this bottleneck, outperforming leading models in UI navigation and task completion at 1/10th the cost. Beyond raw performance, we will discuss the technical framework required to make agents truly viable: robust reliability in "messy" interfaces, human-in-the-loop governance for high-stakes tasks, and the auditability required for transparent execution. This talk features a live demo of the agent navigating complex, multi-step workflows in real time and will conclude with an outlook on the next frontier, where agents move beyond reactive automation toward proactive, cross-platform intelligence in the enterprise.
[Slides]
Using Agentic AI to enhance Machine Learning Researchers productivity [16:40-17:10]
Maxime Hugues, NVIDIA, USA
[Bio]
Dr. Maxime Hugues is a Principal Applied Scientist in GenAI at AWS, which he joined in 2020. He focuses on training performance, system reliability, and large-scale simulation. Prior to joining AWS, he worked as an HPC Research Scientist at TotalEnergies and as an HPC Consultant at Google. He holds an M.E. from the French National Engineer School “ISEN-Toulon”, an M.S. degree from the University of Science, and a Ph.D. degree in Computer Science, obtained in 2011, from the University of Lille 1.
[Abstract]
Artificial intelligence, and LLMs in particular, have started to transform our daily lives. These models take months to build, from data preparation through pre-training and post-training, and to become more accurate they rely on increasingly complex techniques such as Mixture-of-Experts, trillions of parameters, and mixed precision. While those are frontier models, machine learning engineers work on a variety of models (LLM, Vision, PhysicalAI) and experiment across various types of systems such as GB200, B300, and H100. Their expertise and focus are on ML, while they use supercomputers for their work. In this talk, we present how our team supports those users in a scalable manner using agents. We’ll start by explaining the challenges of supporting a wide variety of workloads and making researcher time productive. We will define the essential metrics collected to improve productivity, the metrics analyzed by agents, and the automated actions taken. To conclude, we will present the challenges of using agents in production, from access management to hallucinations.
[Slides]

17:10-17:50 | Panel and general discussion: HPC-AI

Moderator: TBA.

Participants: Ludovic Denoyer, Konstantin Mishchenko, + TBA

17:50-18:00 | Closing remarks and date for the next workshop

Contact Us

Please send any questions related to the DP2E-AI 2026 workshop to dp2eai2026@gmail.com
