Democratizing Large Model Training on Smaller GPUs with FSDP
FSDP, Large Model Training, LLMs, Parallelism Techniques, QLoRA
By Preethi Srinivasan
November 24, 2025
Summary
Preethi Srinivasan, a Solution Consultant at Sahaj Software, demonstrated how combining QLoRA and Fully Sharded Data Parallel (FSDP) enabled large model training on smaller, consumer-grade GPUs, making advanced AI research more accessible.
The talk covered the evolution of parallelism strategies, from traditional data, model, and pipeline parallelism to advanced techniques like FSDP, which significantly reduced memory usage, training time, and communication overhead while improving GPU utilization.
Preethi discussed quantization, mixed precision, CPU offloading, and activation checkpointing, highlighting the trade-offs between speed, accuracy, and resource usage for each optimization.
QLoRA reduces the memory footprint of large models, while FSDP improves scalability and training efficiency; together they make large-model training feasible on smaller GPUs.
The talk provided practical insights into democratizing large model training, bridging the gap between enterprise hardware and accessible AI research.
Generated using GPT-4o-mini.
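To make the memory argument in the summary concrete, here is a rough back-of-the-envelope sketch. It is not taken from the talk. It assumes the commonly cited figure of about 16 bytes per trainable parameter for mixed-precision Adam training, half a byte per parameter for 4-bit quantized frozen weights, and an even split of model state across GPUs under FSDP. The function names, the 7-billion-parameter model, and the 40-million-parameter adapter are all hypothetical, and activations, quantization constants, and temporary all-gather buffers are ignored.

```python
# Rough memory estimate. Every byte count here is an assumption, not a measurement.

# fp16 weight (2) + fp16 gradient (2) + fp32 master weight (4)
# + two fp32 Adam moments (4 + 4) = 16 bytes per trainable parameter.
BYTES_PER_TRAINABLE_PARAM = 16


def full_finetune_bytes(num_params: int) -> int:
    """Model-state memory when every parameter is trained with mixed-precision Adam."""
    return num_params * BYTES_PER_TRAINABLE_PARAM


def qlora_bytes(num_params: int, adapter_params: int) -> int:
    """Model-state memory with a frozen 4-bit base model plus trainable adapters."""
    frozen_base = num_params // 2  # 4 bits = half a byte per frozen weight
    trainable = adapter_params * BYTES_PER_TRAINABLE_PARAM
    return frozen_base + trainable


def per_gpu_bytes(total_bytes: int, num_gpus: int) -> float:
    """Idealized per-GPU share when FSDP shards model state evenly across GPUs."""
    return total_bytes / num_gpus


if __name__ == "__main__":
    params = 7_000_000_000   # hypothetical 7B-parameter model
    adapters = 40_000_000    # hypothetical adapter size
    gb = 1e9
    print(f"full fine-tune:       {full_finetune_bytes(params) / gb:.2f} GB")
    print(f"QLoRA:                {qlora_bytes(params, adapters) / gb:.2f} GB")
    print(f"QLoRA + FSDP, 2 GPUs: {per_gpu_bytes(qlora_bytes(params, adapters), 2) / gb:.2f} GB per GPU")
```

Under these assumptions, quantizing the frozen base model accounts for most of the reduction, and sharding then divides what remains across the available GPUs. Real usage will be higher once activations and communication buffers are included.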