Description
Elevate your AI system's performance with this definitive guide to maximizing efficiency across every layer of your AI infrastructure. In today's era of ever-larger generative models, AI Systems Performance Engineering gives engineers, researchers, and developers a hands-on set of actionable optimization strategies. Learn to co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems that excel in both training and inference. Authored by Chris Fregly, a performance-focused engineering and product leader, this resource transforms complex AI systems into streamlined, high-impact solutions.
Inside, you’ll discover step-by-step methodologies for fine-tuning CUDA GPU kernels, PyTorch-based algorithms, and multinode training and inference systems. You’ll also master the art of scaling GPU clusters for high-performance distributed model training jobs and inference servers. The book ends with a checklist of more than 175 proven, ready-to-use optimizations.
- Codesign and optimize hardware, software, and algorithms to achieve maximum throughput and cost savings
- Implement cutting-edge inference strategies that reduce latency and boost throughput in real-world settings
- Utilize industry-leading scalability tools and frameworks
- Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
- Integrate full stack optimization techniques for robust, reliable AI system performance
From the Preface
In the vibrant streets of San Francisco, where innovation is as common as autonomous vehicle traffic on US Route 101, we find ourselves surrounded by an amazing world of artificial intelligence. Rapid advancements in AI are redefining every aspect of our daily lives. Over the last few decades, we’ve seen recommendation engines (2000s), AI assistants (2010s), and fully autonomous vehicles (2020s). The 2030s are going to be even more exciting, as AI is progressing extremely quickly and with massive societal influence.
My personal journey into the fast-moving AI systems performance engineering field was driven by a curiosity to understand the delicate balance and codesign between cutting-edge hardware, highly optimized software, and clever algorithms that power such complex systems and impactful use cases. This curiosity inspired me to dive deep into the realm of “full-stack” AI performance engineering. I wanted to understand how components like processors, memory architectures, network interconnects, operating systems, and software frameworks all work together in harmony. The complexity of these interactions presented the challenges—and opportunities—that fueled my desire to explore this unique combination of technologies.
This book is a realization of my explorations throughout the years as a hands-on ML and AI performance engineer. I created this book for engineers, researchers, practitioners, and enthusiasts who are eager to understand the underpinnings of AI systems performance at all levels. Readers might be building AI applications, optimizing neural network training strategies, or designing and managing scalable inference servers; they may also simply be fascinated by the mechanics of modern AI systems. Overall, this book provides insights that bridge theory and practice across multiple disciplines.
The reader of this book likely has a foundational understanding of neural networks and a basic familiarity with Python and ML. However, even without these fundamentals, a curious reader can follow the multidimensional codesign performance narrative rooted in first principles across hardware, software, and algorithms. I promise there is something in this book for every type of reader—and every reader will learn a few new things in these pages.
Throughout the chapters, we examine the evolution of hardware architectures, dive into the nuances of software optimization, and explore real-world case studies that highlight the patterns and best practices of building both high-performance and cost-efficient AI systems. Each section is designed to build upon the last, covering everything from foundational concepts to advanced applications.
Review
“AI systems are layered and fast-moving. Chris breaks the complexity down into a reference that will set the standard for years.”
–Chris Lattner, CEO at Modular
“CUDA kernels, distributed training, compilers, disaggregated inference—finally in one place. An encyclopedia of ML systems.”
–Mark Saroufim, PyTorch at Meta (and Founder of GPU MODE Community)
“Squeezing the most performance out of your AI system is what separates the good from the great. This is the missing manual.”
–Sebastian Raschka, ML/AI Researcher
“An essential guide to modern ML systems—grounded in vLLM and distributed systems—with deep insight into inference optimization and open source.”
–Michael Goin, vLLM Maintainer and Principal Engineer at Red Hat
“A definitive field guide that connects silicon to application, giving AI engineers the full-stack wisdom to turn raw compute into high-performance models.”
–Harsh Banwait, Director of Product at CoreWeave
Book details
- Author : Chris Fregly
- Publisher : O’Reilly Media
- Publication date : December 16, 2025
- Edition : 1st
- Print length : 1058 pages
- Language : English
- Format : Paperback