
Making Deep Learning Climate-Friendly

5 min read · Mar 17, 2022



Deep learning is bringing many benefits to the world: solving the 50-year-old protein folding problem, detecting cancer, and improving the power grid. While there is so much that deep learning is powering, we also need to consider the costs.

In the quest for more accurate and generalizable models, there has been an explosion in network size and, with it, power consumption. Model sizes seem to skyrocket every year. In 2020, GPT-3 was released with 175 billion parameters and a theoretical training cost of $4.6M on the lowest-priced GPU cloud on the market.

Less than a year later, Google released the Switch Transformer with over a trillion parameters. OpenAI found that the compute used to train the largest models has been doubling roughly every 3.4 months. With the exponential increase in parameters comes an exponential increase in energy consumption.

How are AI researchers and engineers tackling this issue?


Neural Architectures

Common wisdom dictates that larger models provide better results. Yet that is not always the case. Google Research recently released a paper showing how higher pre-training accuracy reaches a saturation point: as pre-training accuracy increases, the model's ability to be fine-tuned for specific tasks (i.e., transfer learning) does not improve as quickly, and past a certain point downstream accuracy can even decrease.

When you start questioning whether larger models are always the answer, the opportunity opens to improve current architectures. More effort is being placed on architecture selection, with tools that search for an architecture suited to a given task, often resulting in many times less computation.
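To make the idea concrete, here is a toy sketch of architecture selection, assuming nothing beyond PyTorch: plain random search over a few hypothetical MLP configurations on synthetic data, keeping the candidate with the best validation accuracy. Real tools search far larger spaces and weigh compute cost, but the principle is the same.

```python
# Toy random architecture search: sample a few candidate MLPs, train each
# briefly on synthetic data, and keep the most accurate one. A sketch only;
# real architecture-search tools explore much larger spaces.
import random
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 20)                        # toy inputs
y = (X[:, 0] * X[:, 1] > 0).long()               # toy binary labels
X_train, y_train, X_val, y_val = X[:768], y[:768], X[768:], y[768:]

def build_mlp(width, depth):
    layers, in_dim = [], 20
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 2))
    return nn.Sequential(*layers)

def train_and_eval(model, epochs=30):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X_train), y_train).backward()
        opt.step()
    with torch.no_grad():
        return (model(X_val).argmax(1) == y_val).float().mean().item()

best = None
for _ in range(5):                               # small search budget
    width, depth = random.choice([8, 32, 128]), random.choice([1, 2, 3])
    model = build_mlp(width, depth)
    acc = train_and_eval(model)
    params = sum(p.numel() for p in model.parameters())
    print(f"width={width} depth={depth} params={params} val_acc={acc:.2f}")
    if best is None or acc > best[0]:
        best = (acc, width, depth)
print("selected:", best)
```

In practice, the selection criterion would also account for parameter count or measured energy, not accuracy alone.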

Another area gaining interest is the Lottery Ticket Hypothesis: the idea that a dense neural network contains a sparse subnetwork (a "lottery ticket") that can be trained on its own to match the performance of the full network. Previously, these tickets could only be identified after training the full network; newer approaches aim to find them earlier, which I discuss in detail in a separate article.
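As a rough illustration of the pruning idea behind lottery tickets, here is a minimal sketch using PyTorch's built-in pruning utilities. The full Lottery Ticket procedure also rewinds the surviving weights to their original initialization and retrains them, which is omitted here.

```python
# Minimal sketch: global magnitude pruning to expose a sparse subnetwork.
# (The full Lottery Ticket procedure would rewind the surviving weights to
# their initial values and retrain them in isolation.)
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
# ... train `model` here as usual ...

# Remove the 80% of weights with the smallest magnitudes across both layers.
params_to_prune = [(model[0], "weight"), (model[2], "weight")]
prune.global_unstructured(
    params_to_prune, pruning_method=prune.L1Unstructured, amount=0.8
)

remaining = sum(int((m.weight != 0).sum()) for m, _ in params_to_prune)
total = sum(m.weight.numel() for m, _ in params_to_prune)
print(f"surviving subnetwork: {remaining}/{total} weights ({remaining / total:.0%})")
```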

These approaches challenge the notion that bigger is better. At the same time, they avoid the cost of training a large model only to prune most of it away afterward. Instead, they skip straight to an efficient architecture, making training cheaper, faster, and more energy-efficient.

Data Movement

While training large models consumes a great deal of energy, the bulk of a model's lifetime energy use goes toward inference. Ideally, a model like ResNet-50 is trained once, then used many times across applications. As a result, much effort goes into the inference stage. The biggest area here is data transfer: reducing the size of our networks helps, but after that, the remaining data must still be pushed through. There are software and hardware approaches to this.


Software

On the software side, datatype selection is key. While higher-precision datatypes may be necessary for some applications, they do not improve model performance in others. Lower-precision datatypes require less data to be transferred: a 16-bit floating-point number takes half the space of a 32-bit one. Companies are also creating new datatypes that further reduce data size and processing requirements, such as Microsoft's MSFP.
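As a minimal sketch (assuming PyTorch and hardware with 16-bit support), casting a layer and its inputs from 32-bit to 16-bit floats halves the bytes that have to be stored and moved:

```python
# Minimal sketch: 16-bit values take half the space of 32-bit ones, so half
# as many bytes need to move. Whether reduced precision is safe depends on
# the model and task.
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)
x = torch.randn(256, 1024)

print(x.element_size() * x.nelement())              # fp32: 1,048,576 bytes
print(x.half().element_size() * x.nelement())       # fp16:   524,288 bytes

# Run the layer in bfloat16 (well supported on recent CPUs and GPUs).
with torch.no_grad():
    y = layer.to(torch.bfloat16)(x.to(torch.bfloat16))
print(y.dtype)                                       # torch.bfloat16
```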

Models that are run often can also leverage batching. Sending requests in batches is usually far more efficient than sending each request or data packet individually.
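A minimal sketch of the difference, with a toy model standing in for a deployed service:

```python
# Minimal sketch: one batched forward pass versus many single-item calls.
# Batching amortizes per-call overhead and keeps the hardware busy.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
requests = [torch.randn(128) for _ in range(64)]     # 64 incoming requests

with torch.no_grad():
    # Inefficient: 64 separate forward passes.
    singles = [model(r.unsqueeze(0)) for r in requests]
    # Better: stack into one (64, 128) batch and run a single forward pass.
    batched = model(torch.stack(requests))

print(batched.shape)                                 # torch.Size([64, 10])
```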

Hardware

On the hardware side, keeping data close to the compute that uses it is important. Shared memory between your deep learning server and client is more efficient than sending data back and forth. Doing edge machine learning on your iPhone is more efficient (and safer!) than sending that data over to Apple's data center one state over. The closer together your data and compute are, the less transfer needs to occur.

There are also specialized chips for different deep learning applications, including accelerators that maximize data reuse via a reconfigurable on-chip network.

Compression is also commonly used to minimize the data transferred, since compressing data can require fewer resources than sending it uncompressed. This is particularly true of sparse networks, where many tricks take advantage of the sparsity. Sparsity is also a common motivation for pruning, as zero-valued entries can often be skipped in storage and computation.
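For instance, a heavily pruned weight matrix can be kept in a sparse format that stores only the nonzero values and their indices. A minimal PyTorch sketch:

```python
# Minimal sketch: store a mostly-zero weight matrix in sparse COO form
# (indices + nonzero values) instead of as a dense array.
import torch

dense = torch.randn(1000, 1000)
dense[torch.rand_like(dense) < 0.95] = 0.0           # ~95% zeros, as after pruning

sparse = dense.to_sparse().coalesce()                # COO: indices + values only
dense_bytes = dense.element_size() * dense.nelement()
sparse_bytes = (sparse.values().element_size() * sparse.values().nelement()
                + sparse.indices().element_size() * sparse.indices().nelement())
print(f"dense: {dense_bytes:,} bytes  sparse: {sparse_bytes:,} bytes")
```

The savings depend on the sparsity level and index format; at low sparsity, a sparse layout can actually take more space than the dense one.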


Data Storage

Beyond data movement, data storage is another rich hardware area for energy savings. As deep learning draws inspiration from the brain, researchers look to the brain's energy efficiency: it is orders of magnitude more efficient than the hardware deep learning runs on today. Unlike our digital computer world of 1s and 0s, the brain's synapses behave more like analog systems; as the brain learns and forgets, signals get stronger and weaker. Using this more continuous analog spectrum, researchers are building analog memory devices that store and process network weights more efficiently.

In a similar fashion, other forms of computing are being explored: for example, using photons instead of electrons (photonic computing) and representing data in multiple states at once (quantum computing). These may prove to be more efficient ways of storing and processing data.

Closing Thoughts

Deep learning has done a great deal of good for the world, and it has the potential to do even more. As AI usage grows, making it more efficient becomes critical. Researchers are looking at neural architecture and model size as a key lever. At the same time, they are exploring ways to reduce the cost of data storage and data transfer, work that impacts most technology, including the device you are reading this on.

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Written by David Yastremsky

Technologist. Dreamer. Innovator.
