Democratizing Large Model Training on Smaller GPUs with FSDP

By Preethi Srinivasan

November 24, 2025

Summary

Preethi Srinivasan, a Solution Consultant at Sahaj Software, demonstrated how combining QLoRA and Fully Sharded Data Parallel (FSDP) enabled large model training on smaller, consumer-grade GPUs, making advanced AI research more accessible.
The talk traced the evolution of parallelism strategies, from traditional data, model, and pipeline parallelism to FSDP, which was presented as reducing memory usage, training time, and communication overhead while improving GPU utilization.
Preethi discussed quantization, mixed precision, CPU offloading, and activation checkpointing, highlighting the trade-offs between speed, accuracy, and resource usage for each optimization.
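QLoRA reduces the memory footprint of large models by keeping the frozen base weights in 4-bit precision and training only small low-rank adapters, while FSDP shards parameters, gradients, and optimizer state across GPUs to improve scalability and training efficiency. Together they make large-model training feasible on smaller GPUs.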
The talk provided practical insights into democratizing large model training, bridging the gap between enterprise hardware and accessible AI research.
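
To put the memory savings in rough numbers, here is a back-of-envelope estimate written for this summary rather than taken from the talk. The model size, adapter size, GPU count, and bytes-per-parameter figures are all illustrative assumptions, and the function names are hypothetical. The sketch counts only weights, gradients, and optimizer state; it ignores activations, quantization constants, and the temporary buffers FSDP needs when it gathers a layer's full parameters.

```python
# Back-of-envelope GPU memory estimate. Every constant is an assumption
# chosen for illustration, not a figure reported in the talk.

GIB = 1024 ** 3


def full_finetune_bytes(n_params: int) -> int:
    """Mixed-precision Adam, assumed at 16 bytes per parameter:
    2 B bf16 weight + 2 B bf16 gradient + 12 B for the fp32 master
    weight and the two fp32 Adam moments."""
    return n_params * 16


def fsdp_per_gpu_bytes(total_bytes: float, n_gpus: int) -> float:
    """Full sharding: weights, gradients, and optimizer state are
    split evenly, so each GPU holds 1/n_gpus of the total."""
    return total_bytes / n_gpus


def qlora_bytes(n_params: int, n_adapter_params: int) -> float:
    """Frozen base weights stored at 4 bits (0.5 B) each; only the
    adapter parameters carry gradients and optimizer state."""
    return n_params * 0.5 + n_adapter_params * 16


n_params = 7_000_000_000   # hypothetical 7B-parameter model
n_adapter = 40_000_000     # hypothetical LoRA adapter parameter count
n_gpus = 2                 # hypothetical pair of consumer GPUs

full = full_finetune_bytes(n_params)
full_sharded = fsdp_per_gpu_bytes(full, n_gpus)
qlora = qlora_bytes(n_params, n_adapter)
qlora_sharded = fsdp_per_gpu_bytes(qlora, n_gpus)

rows = [
    ("Full fine-tune, 1 GPU", full),
    ("Full fine-tune, FSDP", full_sharded),
    ("QLoRA, 1 GPU", qlora),
    ("QLoRA + FSDP", qlora_sharded),
]
for label, n_bytes in rows:
    print(f"{label:<24}{n_bytes / GIB:8.1f} GiB per GPU")
```

Under these assumptions, full fine-tuning needs over 100 GiB of training state, while the QLoRA variants need only a few GiB. In practice activations often dominate at long sequence lengths, which is where activation checkpointing and CPU offloading come in.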

Generated using GPT-4o-mini.