
Q&A

CS Concepts

DRAM vs SRAM vs flip-flops

DRAM stores data using the charge on a capacitor. The charge slowly leaks away, so it needs to be “refreshed” periodically by reading and re-writing the data (in practice, the read alone also does the re-write for reasons we don’t need to go into yet). That’s why it’s “dynamic”: if you don’t refresh it actively, the data goes away. DRAM requires 1 transistor per bit. DRAM is extremely dense, but very slow. It has a single port that’s used for reading and for writing.

  • rows and columns of capacitors, random access

  • even a read-only access must read and re-write the data, recharging the capacitors (reads are destructive)

  • switching rows is slow and costly, so scattered reads hurt (see the sketch after this list)

    • On A100, memory bandwidth for widely-spaced reads is only 8% of peak bandwidth
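
A minimal sketch of why scattered reads cost so much: the same copy kernel launched once with unit stride (coalesced, so a warp’s reads fall in one DRAM row) and once with a large stride (each read lands in a different row). The kernel name, array size, and stride value are illustrative assumptions, not measurements; reproducing a bandwidth number would require timing each launch with CUDA events.

```cuda
// Same copy kernel, run once coalesced (stride 1) and once scattered.
#include <cuda_runtime.h>

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // stride == 1: a warp reads 32 adjacent floats from one DRAM row.
        // large stride: every thread in the warp lands in a different row.
        long j = ((long)i * stride) % n;
        out[i] = in[j];
    }
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copy_strided<<<grid, block>>>(in, out, n, 1);     // coalesced
    copy_strided<<<grid, block>>>(in, out, n, 4097);  // scattered (row switches dominate)
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```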

SRAM is “static” RAM: it holds its data without any refresh for as long as power is applied. The data is stored in two inverters driving each other in a loop, and to write new data, stronger access transistors overpower the tiny ones in the inverters. It requires 6 or 8 transistors per bit depending on the design you choose. SRAM is fairly dense and medium speed. It’s often either single-ported or has one read port and one write port, but you can build SRAMs with more ports.

Flip-flops store their data in different ways, but it’s often “tristate” inverters driving each other in a loop, with two of those loops per bit. That’s roughly 20 transistors per bit - extremely poor density but extremely fast. Flip-flop arrays can have arbitrary numbers of read and write ports. Flip-flops are also used individually, not as part of arrays, to store random bits of state all over the place.

Why is CUDA the way it is?

Key fact:

  • memory system can only feed about one-sixth of what the execution resources can request
  • the primary limiting factor is memory

Workstream (sketched in code below):

  1. Copy to GPU
  2. Process
  3. Copy from GPU
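
A minimal sketch of the copy → process → copy workstream, assuming a trivial scaling kernel; the kernel name and sizes are illustrative.

```cuda
// 1. Copy to GPU  ->  2. Process  ->  3. Copy from GPU
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                        // 2. process on the GPU
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);          // 1. copy to GPU

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);          // 3. copy from GPU
    cudaFree(dev);
    return 0;
}
```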

For optimal performance, CUDA:

  • customize hardware for programming language
    • oversubscription: feed it a lot of concurrent work, so that it can pack things efficiently
  • customize programming language for hardware
    • occupancy is the most powerful tool for tuning a program (see the occupancy query sketch after this list)
      • by changing memory layout
    • e.g., going from 3 blocks/SM to 4 blocks/SM
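
A minimal sketch of querying occupancy with the CUDA runtime: cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks fit per SM for a given block size and dynamic shared-memory footprint, which is one concrete way a memory-layout change moves you from, say, 3 to 4 blocks/SM. The kernel and the two shared-memory budgets are illustrative assumptions.

```cuda
// Ask the runtime how blocks/SM changes as a kernel's shared-memory
// footprint shrinks (a memory-layout change that raises occupancy).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float* data) {
    extern __shared__ float tile[];             // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x] * 2.0f;
}

int main() {
    const int blockSize = 256;
    // Two hypothetical shared-memory budgets per block, in bytes.
    for (size_t smem : {32 * 1024ul, 16 * 1024ul}) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work,
                                                      blockSize, smem);
        printf("%zu B shared/block -> %d blocks/SM\n", smem, blocksPerSM);
    }
    return 0;
}
```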

CUDA SM: Compute Unified Device Architecture Streaming Multiprocessor

CUDA’s GPU Execution Hierarchy

  • grid of work

  • divided into many CUDA thread blocks

    • A100 max blocks per SM: 32
    • a block is broken up into “warps” of 32 threads
    • a warp is the GPU’s basic execution unit, analogous to a SIMD vector
  • many threads in each block

    • block size: the number of threads which must be concurrent
      • resource allocation unit
    • shared memory: common to all threads in a block
    • little cache to lean on, but a large register file (on the order of 100 registers per thread)
    • all threads run exactly the same program (SIMT)
    • each thread knows its own thread index and block index (threadIdx, blockIdx); see the kernel sketch below
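
A minimal kernel sketch of the hierarchy above: a grid of blocks, a shared-memory tile common to one block, and every thread locating its work from blockIdx and threadIdx. The block-local reversal is just an illustrative task (and it assumes n is a multiple of the block size).

```cuda
// Grid -> blocks -> warps -> threads: every thread runs this same program
// and finds its element from blockIdx and threadIdx; the __shared__ tile
// is visible only to the threads of one block.
#include <cuda_runtime.h>

#define TILE 256   // block size: threads that must be co-resident on one SM

__global__ void reverse_tiles(const float* in, float* out, int n) {
    __shared__ float tile[TILE];                    // shared by this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index from block + thread idx

    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                                // the whole block has filled the tile

    // Read a slot written by a *different* thread of the same block.
    // Assumes n is a multiple of TILE, so every slot was written.
    if (i < n) out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(TILE);
    dim3 grid((n + TILE - 1) / TILE);               // the grid of work, divided into blocks
    reverse_tiles<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```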

SIMD (Single Instruction, Multiple Data)

  • one main thread controls all the lanes
  • explicit control of divergence (the programmer writes masks or if/else around the vector code)

CUDA is SIMT (Single Instruction, Multiple Threads)

  • each thread works independently and maintains its own state
  • implicit thread control (the hardware handles divergence; see the sketch below)
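
A minimal sketch of what “implicit thread control” looks like in CUDA: every thread runs the same kernel, each branches on its own data, and the hardware masks off the warp lanes that don’t take a path. The clamp-or-double kernel is an illustrative assumption.

```cuda
// SIMT: one kernel, but each thread branches on its own data; the hardware
// masks off the lanes of a warp that don't take a path (no explicit masks).
#include <cuda_runtime.h>

__global__ void clamp_or_double(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // per-thread state
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;        // some lanes of the warp take this path...
        else
            x[i] *= 2.0f;       // ...others take this one; they reconverge after
    }
}

int main() {
    const int n = 1 << 16;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    clamp_or_double<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```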