Developer Experience Behind the Air Gap: CSIT’s Journey
The interconnected nature of modern software development, with its reliance on cloud services and external tools, poses unique challenges in environments where security demands complete network isolation. At CSIT, we tackled this challenge head-on, transforming our developer platforms to deliver cloud-native capabilities within air-gapped networks. This article shares our six-year journey of evolving from manual, time-consuming processes to sophisticated development platforms that serve over a thousand engineers across multiple secure networks.
Our transformation unfolded in three phases, each building on robust foundations established by our core infrastructure teams. We began by introducing foundational automation that reduced common tasks from days to minutes, enabled by infrastructure engineers who created secure, automated pipelines across network boundaries. Next, we integrated our services to create standardised practices across teams, replacing fragmented toolchains with cohesive platforms. Finally, we developed sophisticated solutions that brought modern capabilities like enterprise LLMs and cloud development environments into our secure networks.
Throughout this journey, the partnership between platform teams driving developer experience initiatives, and infrastructure teams maintaining operational excellence has been crucial — enabling us to improve developer productivity while upholding the high security standards our mission requires.
Challenges for Developer Experience in an Air-Gap
In 2018, our air-gapped environment faced three connected challenges that reduced developer productivity. CSIT recognised the need to break down legacy team barriers across multiple air-gapped networks. Our initial collaborative effort revealed how technical limits and historical organisational structures created three main problems: limited access to external resources, fragmented toolchains, and time-consuming procedures.
Limited Access to External Resources
The air-gap slowed development by blocking access to modern development infrastructure and tools. Although engineers could access the internet on separate PCs, development machines could not use external services that modern teams rely on. This isolation forced teams to host every component of their development stack internally.
Development tools faced similar restrictions. IDEs like Visual Studio Code or IntelliJ ran with limited features — many extensions either failed or needed complex fixes to work offline. Language servers, code completion, and other productivity features that need external connections performed poorly. Basic tasks like adding a new dependency or checking a package version required switching between machines.
This isolation affected more than individual engineers — it created problems throughout our development process. Teams spent significant time maintaining internal systems that could have been managed by external services (e.g. setting up a GitLab instance instead of using github.com / gitlab.com). This burden of maintaining infrastructure led to our second major challenge: the spread of different toolchains.
Fragmented Toolchains Creating Barriers
Within our air-gapped networks, each team needed to create their own internal solutions to replace commonly available external developer tools and services. Without access to standardised cloud services or a centralised internal platform, different teams independently implemented and maintained their own versions of essential development tools. This resulted in varied toolchains across our environment — from source control and CI/CD pipelines to build tools and deployment platforms — creating barriers when teams tried to work together.
Some teams saw these problems and worked to improve the developer experience by adding tools like Artifactory for repository management or sharing a GitLab instance to standardise practices across several teams. However, these improvements remained small in scope, as they lacked proper product ownership — relying instead on volunteer maintainers who managed these services alongside their primary responsibilities. Without dedicated teams, clear ownership, or standards across networks, these efforts led to shadow infrastructure that could not scale to organisational needs. The lack of common practices meant teams could not share knowledge and resources effectively, leading to repeated work and inconsistent methods — a problem that made our third challenge worse: time-consuming procedures.
Time-Consuming Procedures
Routine procedures in our environment required extensive manual work and multiple hand-offs between teams, leading to significant delays and frustrated engineers. These delays created vicious cycles — teams would batch changes to minimise process overhead, leading to larger and riskier updates that required even more scrutiny and time.
The time taken to complete common tasks illustrated the scale of this challenge.
This had a substantial impact on productivity. Engineers spent significant time managing processes rather than delivering value. For example, teams delayed routine maintenance, infrastructure changes, and system improvements to avoid these time-consuming procedures, which led to technical debt accumulating across our systems. The procedural overhead also discouraged experimentation and cross-team collaboration, as teams sought to minimise activities that would trigger these lengthy processes.
CSIT’s Journey
By 2019, these challenges had reached a critical point. Engineers across multiple teams were spending significant portions of their time managing infrastructure rather than solving meaningful problems. Developer experience had emerged as the most frequently cited pain point across CSIT’s engineering organisation, highlighting the urgent need for systematic changes.
The turning point came when CSIT’s leadership recognised that:
- Improving developer experience was essential for CSIT to effectively advance Singapore’s national security interests
- This improvement required coordinated action across CSIT and could not be done by isolated teams
This recognition led to significant investment and senior leadership commitment to transform CSIT’s developer experience. This started a journey that continues today — one that fundamentally reshapes how we approach software development in our air-gapped environment, by first solving our most pressing challenges and then building toward increasingly sophisticated capabilities.
Foundation Building: 2019–2020
Our initial focus in 2019 was on automating manual processes and introducing managed services to improve developer productivity. This period marked a crucial organisational shift, combining existing infrastructure teams that were specialised in self-service and automation with new teams formed from software engineers who brought expertise in developer tooling.
Key initiatives during this phase included:
- Automated File Transfer Pipelines: We replaced manual USB-based transfers with pipelines that scan content automatically and securely, reducing transfer times from days to minutes. This cut process overhead while strengthening security, enabling engineers to perform routine updates and experiment with new tooling more efficiently.
- Self-Service Infrastructure: We built automated VM provisioning and firewall configuration workflows, reducing infrastructure deployment time from weeks to minutes for most use cases. The automation was particularly effective for VMs within the same network segment, though cross-segment automation remains an ongoing challenge. This enabled teams to rapidly deploy infrastructure without lengthy approval processes.
- Managed Developer Tools: We introduced centrally managed versions of essential tools including Artifactory, Atlassian Suite, GitLab, and Mattermost. By centralising management of these tools, we freed up engineering resources previously dedicated to separately maintaining these tools within individual teams. The managed services approach enabled dedicated teams to maximise tool capabilities, serving all engineering teams with over 150 repositories.
- Managed Container Service: We deployed Rancher to provide teams with managed container infrastructure, encompassing compute, storage, networking, DNS, and PKI. This service enabled quick deployment of containerised applications while standardising deployment patterns through kubectl and helm, promoting consistency across teams.
These foundational improvements significantly reduced manual overhead and established patterns for managed services that we would build upon in subsequent phases. The combination of organisational changes and technical initiatives created a strong base for future platform development while immediately improving daily developer experiences.
Service Integration and Standardisation: 2021–2022
Building on our foundation of automation, we focused on deeper service integration and standardisation across our environment. This second phase marked a shift from basic automation to sophisticated platform services, with teams collaborating to create standardised patterns that would improve service discovery and integration across CSIT.
Key initiatives during this phase included:
- Service integration layer: Built using Kong, Envoy, OPA, and Keycloak to standardise API communication across CSIT. This layer provided centralised identity management and coarse-grained authorisation, while improving API discoverability through centralised registration. This established consistent patterns for inter-service communication, enabling teams to easily consume and integrate with each other’s services.
- Kubernetes as a Service: We introduced VMware Tanzu with automated cluster provisioning and GitOps-based configuration management. Through integration with NSX-T, teams can now self-manage network policies within predefined security boundaries, eliminating lengthy approval processes for routine network changes. Teams could now deploy and scale their Kubernetes environments on demand, with automated integration of core services like DNS, PKI, and ingress controllers.
- Data Exploration Platform (DEP): We launched our first managed platform service through JupyterHub, providing a secure and comprehensive environment for data exploration and analysis. The platform came pre-configured with essential data science tools and libraries while maintaining access to CSIT’s data stores. This reduced environment setup overhead from a significant portion of users’ time to less than 5%, enabling our domain experts, whose missions range from counter-terrorism to countering hostile information, to work more productively.
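The coarse-grained authorisation step in the integration layer can be sketched as a role check against claims in an identity-provider-issued JWT. The sketch below decodes only — in a real deployment the gateway must first verify the token's signature against the identity provider's keys (e.g. via a JOSE library); the `realm_access` claim shape follows Keycloak's token format, and the function names are hypothetical.

```python
import base64
import json

def decode_claims(token: str) -> dict:
    """Decode a JWT payload segment WITHOUT verifying its signature.

    Signature verification, mandatory in production, is elided here;
    this sketch only illustrates the claim-inspection step.
    """
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

def authorise(token: str, required_role: str) -> bool:
    """Coarse-grained check: does the token carry the required realm role?"""
    claims = decode_claims(token)
    return required_role in claims.get("realm_access", {}).get("roles", [])
```

Centralising this check at the gateway, rather than in each service, is what made it possible for teams to consume one another's APIs without re-implementing authentication.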
These initiatives marked a significant evolution in our approach to platform services, moving beyond basic automation to sophisticated, integrated solutions. The success of DEP in particular validated our Platform-as-a-Service approach, demonstrating how managed services could simultaneously enhance security, improve productivity, and accelerate innovation across CSIT.
Platform Evolution: 2023–Present
Finally in 2023, we shifted focus to platform services that eliminate repetitive work across projects. This phase represents our most ambitious effort yet, bringing cloud-native capabilities into our air-gapped environment while maintaining strict security standards. Our teams are leveraging experience from previous phases to deliver sophisticated platform services that accelerate innovation across CSIT.
Key initiatives during this phase include:
- Enterprise LLM Infrastructure: Built two generations of on-premise LLM infrastructure, progressing from A100 clusters to NVIDIA DGX H100 SuperPOD architecture. Through Run:ai’s specialised Kubernetes scheduler, we enabled flexible GPU resource management including fractional allocation and dynamic reassignment. This infrastructure supports various AI workloads from experimental to production use cases, enabling teams to securely integrate LLM capabilities into their applications. For more details on our GPU infrastructure and LLM serving architecture, check out our detailed two-part technical series on the CSIT blog.
- Cloud Development Environments: Deployed Coder to provide templated development environments that come pre-configured with essential development tools and security controls. This particularly improved onboarding experiences and enabled quick, secured access for non-engineering staff to development tools. While currently focused on pod-based environments, we are expanding to VM-based solutions to provide more complete development capabilities.
- Streamlit Platform: Implemented our managed Streamlit platform to democratise data application development across CSIT. The platform integrates seamlessly with our existing authentication framework while providing streamlined deployment pipelines and robust resource management. What began as a tool for engineering teams has expanded to support analysts and corporate users in business process automation, now hosting 250+ applications and serving more than 65% of CSIT staff weekly. This broad adoption demonstrates the success of our platform-centric approach in supporting diverse user needs across the organisation.
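The fractional GPU allocation mentioned above can be illustrated with a toy first-fit allocator. This is a conceptual sketch of the idea only — Run:ai's scheduler additionally handles memory isolation, preemption, and dynamic reassignment, none of which appear here.

```python
class GpuPool:
    """Toy allocator illustrating fractional GPU sharing.

    Each physical GPU exposes capacity 1.0; jobs may request fractions
    (e.g. 0.25 for a notebook workload) and are packed first-fit.
    """

    def __init__(self, num_gpus: int):
        self.free = [1.0] * num_gpus        # remaining fraction per GPU

    def allocate(self, fraction: float):
        """Return the index of the GPU serving this request, or None."""
        for i, capacity in enumerate(self.free):
            if capacity >= fraction - 1e-9:  # tolerate float rounding
                self.free[i] = round(capacity - fraction, 6)
                return i
        return None                          # no GPU has enough headroom
```

Packing several quarter-GPU notebook sessions onto one device is what lets experimental workloads coexist with production inference without dedicating a full H100 to each user.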
These initiatives showcase our evolution toward sophisticated platform services that can support diverse user needs while maintaining security standards. By focusing on managed platforms that eliminate repetitive work, teams across CSIT are empowered to focus on their core missions while taking advantage of modern development capabilities.
Key learnings
Leadership support
Our transformation required continuous support from various levels of leadership. While CSIT had numerous ground-up efforts over many years, these initiatives could not scale to the entire organisation. When senior leadership committed to improving developer experience, infrastructure teams received additional headcount, while software engineering teams contributed headcount to form platform teams. Department heads and team leads gave their teams time to experiment and migrate to the newly established managed services, understanding that it would cause a short-term loss of output for their departments. This commitment continues today, with leadership increasing our platform engineering headcount from 12 to 15 percent of our total engineering workforce to strengthen our platform capabilities.
This support proved essential because the initial stages of improving developer experience often demand significant resources and investment in team formation and software acquisition. It takes time for newly formed teams to develop proficiency with their products, and even more time for existing teams to enhance their services while handling operations. When we successfully roll out a new managed service, a critical mass of teams must migrate to these new services before benefits appear. Additionally, every new service or platform brings its own operational challenges and requires ongoing resources for maintenance. Our leaders recognised that the immediate and visible “loss” of allocating a team to a managed service is outweighed by the long-term benefits of standardisation, product expertise, and reduced shadow infrastructure across the organisation.
Product mindset
Our organisational structure evolved to align with our journey, with infrastructure teams maintaining network-centric alignment for operational stability while platform teams organised around specific products for consistent feature delivery. Both infrastructure and platform teams are increasingly adopting product thinking in their service delivery. For infrastructure teams, this means designing their services with a focus on user experience and automation, and making core infrastructure more accessible and configurable for engineering teams. Platform teams similarly maintain clear roadmaps, regular release cycles, and continuous user engagement. This approach requires teams to develop a deeper understanding of user needs, establish feedback loops, and take long-term ownership of their services.
We’ve observed clear benefits where product thinking has taken root, with teams showing improved service reliability, more consistent feature delivery, and higher user satisfaction. Team members have developed deep expertise in their respective products while maintaining essential knowledge of network operations — a combination crucial for building solutions that are both technically sophisticated and practically implementable within our network constraints. This success is reinforced by our practice of “dogfooding,” where teams use their own services, helping identify usability issues early and ensure teams understand the developer experience firsthand.
The transition to a product mindset remains an ongoing journey, with teams at different stages of adoption depending on their service maturity and operational constraints. While we’ve seen encouraging results from our initial implementations, we recognise that this transformation takes time and requires sustained effort. Our goal is to extend product thinking across all our services, learning from successful examples while acknowledging that different services may require different approaches to this transition.
Infrastructure and Configuration as Code
Managing infrastructure across multiple air-gapped networks presents a unique challenge. Each network requires its own infrastructure, but manual management meant identical changes had to be repeated across each network — a process where small misconfigurations could create subtle differences that might only surface as unexpected failures on specific networks months later. This led us to adopt Infrastructure as Code (IaC) as a fundamental solution, defining our infrastructure through code that could be version controlled, tested, and automatically deployed.
This transition required significant investment in both automation and process changes. We first established automated workflows for infrastructure management, including secure connectivity to infrastructure control planes and streamlined approval processes. With automation in place, we focused on upskilling teams in declarative configuration approaches and infrastructure design patterns.
We then established a standardised repository structure that enabled consistent deployments while maintaining network-specific configurations where needed. This structure supports our multi-network deployment strategy.
Our approach facilitates deployments across network boundaries through mirror synchronisation and repository synchronisation. This enables teams to deploy consistently across networks using standardised core configurations while maintaining necessary network-specific customisations. All changes are tracked and version-controlled, with approved changes automatically synced across network boundaries.
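The core-plus-overlay pattern behind this repository structure amounts to a recursive merge of shared settings with per-network overrides. The configuration keys below are hypothetical examples, not our actual schema.

```python
def merge(base: dict, overlay: dict) -> dict:
    """Merge a network-specific overlay onto the shared core config;
    overlay values win, and nested dicts merge key-by-key."""
    result = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Shared core settings, identical on every network (hypothetical values).
core = {"registry": "mirror.internal",
        "gitops": {"sync_interval": "5m", "prune": True}}

# Per-network overrides layered on top, e.g. a slower sync on one segment.
network_a = {"gitops": {"sync_interval": "15m"}}

effective = merge(core, network_a)
```

Because only the overlay differs between networks, a fix to the core configuration propagates everywhere on the next sync instead of being hand-applied per network — exactly the drift the IaC effort set out to eliminate.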
This systematic approach has transformed infrastructure management at CSIT by significantly reducing time spent on manual operations and troubleshooting. While the initial investment in automation and standardisation was substantial, it has paid off as our systems have grown more complex and our teams have evolved.
Documentation Culture
Creating a documentation culture was essential to our platform transformation. We focused on two areas: supporting our users and preserving our technical knowledge.
Our user documentation strategy shifted significantly. Where we once relied on training videos that were hard to reference, we now focus on text-based documentation with tested code examples. The documentation is available on all relevant networks, making commands easy to copy and use. To keep improving, we encourage users to give feedback on unclear or missing information through simple feedback forms and regular reviews. We are using the Diátaxis framework to reorganise our documentation into clear categories like tutorials and technical guides.
The increased emphasis on documentation extends to internal documentation as well. Teams are writing design documents to capture the reasoning behind specific technical decisions. We invested in operational documentation through detailed runbooks that standardise how we handle both routine tasks and incidents. When incidents occur, we write blameless post-mortems that focus on learning rather than fault-finding. These post-mortems are detailed troubleshooting articles that document the timeline, impact, root cause analysis, resolution steps and future preventative measures. This approach has created a valuable knowledge base that reduces onboarding time for new team members and improves system reliability through standardised operations.
Measuring Success
We take a balanced approach to evaluating our developer experience, combining both qualitative and quantitative measures of developer productivity with an emphasis on qualitative feedback. While we initially attempted to implement DORA metrics across CSIT, we quickly discovered that only deployment frequency could be consistently measured across teams. The unique constraints of our environment made other metrics challenging to collect — for instance, calculating change failure rate becomes impossible when deployments are automated across a one-way diode.
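Deployment frequency, the one DORA metric we could measure consistently, reduces to counting deploy events per period. A minimal sketch, assuming the CI system can emit a list of deploy dates (the event format here is hypothetical):

```python
from collections import Counter
from datetime import date

def weekly_deploy_frequency(deploy_dates):
    """Count deployments per ISO (year, week) from a list of deploy dates.

    `deploy_dates` stands in for whatever event log the CI system emits;
    a real pipeline would pull these from deployment records.
    """
    return dict(Counter(d.isocalendar()[:2] for d in deploy_dates))
```

Grouping by ISO week keeps the metric comparable across teams regardless of which network their pipeline runs on.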
Given these limitations, our annual developer survey is our most valuable measurement tool. The survey framework incorporates elements from both DORA and SPACE methodologies, enabling us to gather comprehensive feedback directly from our engineering teams. This approach has proven particularly effective in understanding the real impact of our developer experience initiatives.
One particularly telling indicator of success has emerged from tracking the nature of service requests over time. We have observed a clear trend toward increasing complexity — engineers who once requested improvements to foundational infrastructure services (faster file transfer pipelines) are now seeking sophisticated solutions (managed databases). This evolution in request complexity suggests growing confidence and satisfaction with our existing developer experience, as teams feel empowered to tackle more ambitious technical challenges. Given that engineers are typically quick to voice concerns about productivity bottlenecks, the high utilisation of our existing services combined with a lack of negative feedback is another strong signal of success — it indicates that these services have become reliable, friction-free parts of our developers’ workflow.
What’s next
As CSIT continues to evolve our developer platforms, we are focusing on standardising common patterns that have emerged across teams. Three key areas stand out for centralisation:
- Observability: Teams across CSIT have implemented various monitoring solutions, leading to fragmented tooling and inconsistent practices. Our new centralised observability platform, which has completed initial validation, addresses these challenges by providing a unified approach to monitoring, logging and tracing via OpenTelemetry. This platform will eliminate the need for teams to maintain separate monitoring stacks while enabling powerful cross-service debugging capabilities.
- Secrets management: As our services grow more interconnected, teams have developed diverse approaches to secrets management, resulting in duplicated effort and maintenance overhead. Our forthcoming centralised secrets management service will standardise tooling and processes across key areas: automatic secret rotation, consistent delivery methods into application runtime environments, and comprehensive tracking with versioning capabilities that record access patterns and change history. By consolidating these capabilities, we will enhance our security posture while freeing development teams from managing separate implementations.
- Access control management: The growing sophistication of our platforms has highlighted the need for more granular and manageable access controls. Our new fine-grained permissions service (built on OpenFGA), introduces a flexible framework that supports both organisation-wide policies and team-specific requirements. This service will simplify access management while allowing teams to implement granular permissions that meet their specific security requirements.
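The fine-grained permissions model above can be illustrated with a toy version of the (user, relation, object) relationship tuples that OpenFGA popularised. This is a conceptual sketch, not OpenFGA's API — real deployments also support relation rewrites, usersets, and contextual tuples, and the policy and object names here are hypothetical.

```python
# Toy relationship-based access control: each tuple grants one user
# one relation on one object.
tuples = {
    ("alice", "owner", "repo:platform-docs"),
    ("bob", "viewer", "repo:platform-docs"),
}

# Hypothetical org-wide policy: holding "owner" implies "viewer".
IMPLIED = {"viewer": {"viewer", "owner"}, "owner": {"owner"}}

def check(user: str, relation: str, obj: str) -> bool:
    """Allow if the user holds the relation directly or via an implied one."""
    return any((user, rel, obj) in tuples for rel in IMPLIED[relation])
```

Expressing both the organisation-wide implication rules and team-specific tuples in one model is what lets a central service serve granular, team-defined permissions without per-team ACL implementations.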
Looking ahead, we continue to explore emerging technologies to further enhance our capabilities while maintaining our security requirements. These initiatives continue our journey of bringing cloud-native capabilities to air-gapped environments.
Our experience has shown that with the right combination of leadership support, technical innovation, and focus on developer needs, air-gapped environments can provide a modern, efficient development experience while upholding strict security standards. Building and maintaining these systems requires engineers with deep expertise who enjoy tackling complex technical challenges. For those interested in contributing to Singapore’s national security through this work, we would love to hear from you!
Acknowledgements
This article builds upon the foundation of presentation content created by Jerold Tan. Significant contributions and review insights were provided by CSIT’s platform and infrastructure teams, whose input proved invaluable in shaping this piece.