Disconnected, Not Disadvantaged [Part 1]
Making GPUs available in a secured environment to support Generative AI efforts
Who Should Read This?
This article is for you if your organization prioritizes security and you are exploring how to establish GPU infrastructure to support large language models (LLMs) within a secure environment. Whether you are dealing with compliance-driven constraints, working in defense, finance, or other highly sensitive industries, or simply curious about deploying generative AI in isolated environments, this article offers insights into our approach, challenges, and solutions.
Introduction
CSIT operates in a highly secure environment where direct internet access is not possible. Our platform teams have built an internal cloud-like ecosystem that supports complex computational needs, allowing us to operate efficiently despite the limitations of being disconnected from the internet.
When OpenAI launched ChatGPT in November 2022, its potential to revolutionize work processes was apparent. While many could simply access this powerful tool without worrying about hardware or model deployments, our security posture meant we had to adopt the technology without relying on external, internet-connected services. Instead, we decided to seek out large language models (LLMs) to deploy internally. Fortunately, the open-source community has been a game-changer, progressively releasing LLMs comparable to commercial offerings.
Many security-conscious organizations might prefer to host their own LLMs too. Some depend on cloud providers for data storage and computation, but those offerings may not satisfy organizations with the most stringent security requirements. For us, the decision was made to invest in building our own hardware and platform to support Generative AI efforts within our secured environment.
This challenge presented a unique opportunity to tackle interesting problems, such as how to provide both GPUs and LLMs as a service to the entire organization.
This two-part series delves into our journey of making LLMs available in a disconnected environment. In this first article, we focus on the hardware and platform layer; the second will explore the LLM inference layer.
Challenges with Existing Infrastructure
Before Generative AI gained popularity, we were already offering GPU-as-a-Service in the form of VMware virtual machines (VMs) with NVIDIA GPUs. This allowed our engineers to easily experiment and deploy models for proof-of-concept purposes. However, the existing infrastructure also came with its own challenges, some of which are described below.
GPU Virtualization: Initially, we used VMware vSphere Bitfusion to provide GPUs as a shared remote resource to multiple VMs. However, End-of-Life was announced for Bitfusion, and we subsequently switched to the virtual GPU (vGPU) approach. We chose not to split the GPUs, as splitting was a manual and rigid process and LLM deployments consumed large amounts of GPU memory anyway. This gave us performance close to that of DirectPath I/O (PCI Passthrough), with the ability to vMotion VMs without interrupting the workloads.
Day 2 Operations: While these vGPUs served their intended purpose, they came with a set of day 2 operational challenges for our engineers, who had to manage multiple deployments across multiple VMs. Software and driver upgrades, as well as scaling operations, were also tedious.
GPU Access: Besides the VMware vGPU service mentioned above, our engineers were also asked to explore ways to consolidate and centrally manage GPU resources available on other platforms in the organization. Specific access patterns and authorizations were required to use these GPUs, making it difficult for teams to share and consolidate the GPU computing power.
Performance: Our infrastructure was not designed to support LLM workloads well. Model sizes were getting larger and required a huge amount of GPU memory, which exceeded the capacity of our servers. LLM workloads could not get sufficient GPUs to run larger models and suffered from lower performance due to the lack of fast networking between GPUs.
Datacenter: The power and cooling requirements for GPU servers were much higher than typical servers, and these requirements trended upwards with each new generation of GPUs. Datacenters must provide significantly more power and be equipped with adequate cooling capabilities to support modern AI infrastructure.
1st Generation GPU Farm: Repurposing Hardware for Generative AI Workloads
To better support Generative AI, we needed to upgrade our infrastructure. We planned to procure hardware to build a new platform. However, this incurred a long lead time, as GPUs were scarce due to the surge in global demand. Given that we already had GPU resources spread across our organization, we decided to consolidate and centralize them into a single farm while, in parallel, procuring new, higher-spec GPU hardware. This gave us early access to open-source large language models on our internal GPU infrastructure to enable Gen AI exploration, while achieving better utilization of the existing GPU resources.
This consolidation provided us with a mix of servers with 40GB and 80GB NVIDIA A100 GPUs. We chose to deploy a Kubernetes cluster to support modern cloud-native architecture and reap its benefits of scalability, reliability, and ease of deployment with CI/CD. We used Bright Cluster Manager (now known as Base Command Manager) to provision the Kubernetes cluster, and Rancher to manage it. Persistent storage was provided through a Container Storage Interface (CSI) driver backed by our enterprise storage appliance.
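To make this concrete, below is a minimal sketch of how a workload on such a cluster can request an NVIDIA GPU and CSI-backed persistent storage through the official Kubernetes Python client. The namespace, storage class, and image names are illustrative placeholders, not our actual configuration.

```python
# Minimal sketch: request one NVIDIA GPU and CSI-backed persistent storage
# on a Kubernetes cluster, using the official Python client.
# Namespace, storage class and image names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

# PersistentVolumeClaim served by the storage appliance's CSI driver.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="llm-models"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="enterprise-csi",  # assumed storage class name
        resources=client.V1ResourceRequirements(requests={"storage": "200Gi"}),
    ),
)

# Pod that claims one whole GPU via the NVIDIA device plugin resource.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="server",
                image="registry.internal/llm-server:latest",  # placeholder image
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
                volume_mounts=[client.V1VolumeMount(name="models", mount_path="/models")],
            )
        ],
        volumes=[
            client.V1Volume(
                name="models",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="llm-models"
                ),
            )
        ],
    ),
)

core.create_namespaced_persistent_volume_claim(namespace="genai", body=pvc)
core.create_namespaced_pod(namespace="genai", body=pod)
```

In practice, requests like this are usually expressed as declarative manifests in CI/CD pipelines; the client calls above simply show the same resources programmatically.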
Our 1st generation GPU farm was a success and well-received by the engineers, who had a better experience obtaining GPU resources and deploying and managing their workloads. The consolidated hardware also let us stack more GPUs in a single server, so workloads could obtain more GPUs and deploy larger LLMs.
2nd Generation GPU Farm: Expansion and Optimizing GPU Utilization
The arrival of our new GPU hardware marked the beginning of our 2nd generation GPU farm, which would help us overcome some of the challenges posed by the previous hardware's limitations. This was a small NVIDIA DGX H100 cluster following the DGX SuperPOD architecture, which provided best-in-class performance for LLM workloads. The key benefits were its scalability, fast compute with DGX H100 systems, fast networking with NVLink and NDR InfiniBand between GPUs, and fast storage (still a work in progress at this stage). Due to power and cooling limitations, we had to spread the systems over a larger number of racks. With this new hardware, we were able to deploy larger models for inference, such as Llama 3.1 70B at FP16 with its full 128K context length, and run more demanding jobs faster. We also migrated our 1st generation GPU farm into the same Kubernetes cluster under different node pools, providing a better user experience through a single control plane.
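With both generations behind a single control plane, a workload can be pinned to a specific pool of machines through node labels. The sketch below shows one way to do this with a plain node selector; the label key and values are hypothetical, and the same effect can be achieved with whatever node labels or node pools a cluster defines.

```python
# Minimal sketch: pin a workload to a specific GPU generation when both farms
# share one Kubernetes control plane. The "gpu-farm" label and its values are
# hypothetical; substitute the node labels or node pools used in your cluster.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-llm"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={"gpu-farm": "dgx-h100"},  # e.g. "a100" for the 1st-gen pool
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.internal/llm-trainer:latest",  # placeholder image
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "8"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="genai", body=pod)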
When operating our 1st generation GPU farm, we found that while GPU allocation was high, GPU utilization was low. There were several reasons for this. LLMs used for inference required large amounts of GPU memory. A range of LLMs was deployed to give users options for different use cases, but not all models saw constant active usage. Overall usage was low because most product teams were still in the experimental stage of using AI and had not yet created production flows involving LLMs. Allocation was also inefficient: since we did not split GPUs, small workloads tied up far more GPU resources than they needed.
In our 2nd generation GPU farm, we deployed Run:ai to optimize our GPU utilization. The Run:ai Kubernetes scheduler, which specializes in GPU-based high-performance computing workloads, helped overcome the limitations of the default Kubernetes scheduler, which is optimized for hyperscale workloads. With Run:ai, we could use quotas to manage the GPU usage of multiple tenants. Our engineers could easily consume partial GPU resources by specifying fractions (e.g. 0.1 GPU) or the specific memory needed (e.g. 8 GB) without administrator intervention. They could also overcommit and make use of all idle GPU resources by running preemptible workloads. If idle resources were contested by other workloads, GPU allocations would be dynamically reassigned in a fair-sharing mode based on the ratio of the projects' quotas. This was particularly useful for overcommitting on training jobs, which could progress when GPU resources were idle and back off when higher-priority workloads came online.
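As an illustration, the sketch below shows one way a fractional request can be expressed on a plain pod handed to the Run:ai scheduler. The annotation keys ("gpu-fraction", "gpu-memory") and the scheduler name follow Run:ai's public documentation but vary by version, so treat them as assumptions to verify against your installation; in day-to-day use, engineers typically submit the same request through the Run:ai CLI or UI instead.

```python
# Minimal sketch: request a fraction of a GPU for a small workload under Run:ai.
# The annotation keys and scheduler name below are assumptions based on Run:ai's
# public documentation; verify them against the installed Run:ai version.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="notebook",
        annotations={"gpu-fraction": "0.1"},  # or {"gpu-memory": "8G"} for an absolute amount
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # hand the pod to the Run:ai scheduler
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="jupyter",
                image="registry.internal/notebook:latest",  # placeholder image
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="genai", body=pod)
```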
We are also excited to start trying Run:ai's new GPU Memory Swap capability, released in version 2.18. This feature lets GPU workloads treat CPU memory as swap space, enabling a higher degree of overcommitment and, in turn, higher GPU utilization. It is especially useful for notebook sessions or for serving rarely used models, where the reserved GPU is typically idle.
Where Do We Go from Here?
The AI landscape has been evolving quickly and we need to constantly keep pace and ensure that our infrastructure continues to evolve to power innovation. Our engagement with industry partners like NVIDIA and Run:ai has been invaluable in keeping us up to date with the latest advancements in the industry and has helped augment our capabilities. As our GPU usage patterns become more defined and as industry offerings become more competitive, we may also evaluate alternative platforms that can be more cost effective and optimized for specific workload types.
With the means and ability to manage and scale our GPU-as-a-Service in our secured environment, we are no longer disadvantaged by disconnection from the internet. In the next part of this series, we will dive into how we provide LLMs as a service within our secured environment.
If you enjoyed this article and would like to learn more, don’t hesitate to get in touch with us!
PS: If you are passionate about software engineering like we are, CSIT is hiring! Find out more about our .
Key Contributors
- Jerold Tan, Staff Software Engineer at Centre for Strategic Infocomm Technologies with expertise in building infrastructure platforms. Advocate of modernizing infrastructure management with Infrastructure as Code.
- Jason Cheng, Senior Software Engineer at Centre for Strategic Infocomm Technologies specializing in serving AI services at scale. Passionate about building impactful applications to users.
- Raymond Tay, AI Engineering Manager at the Centre for Strategic Infocomm Technologies. Passionate about practical applications of AI with a focus on creating secure, scalable solutions that drive organizational transformation.