GPU
Graphics Processing Unit
A graphics processing unit is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles.
GPUs are really good at cracking passwords. Why?
A GPU has hundreds of cores that can compute mathematical functions in parallel. A CPU usually has 4-8 cores. Although a CPU core is much faster than a GPU core, password hashing parallelizes very easily, because each candidate password can be hashed independently of all the others. This is what gives GPUs a massive edge in cracking passwords.
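To make the "easily parallel" point concrete, here is a minimal sketch in Python. It is not how a real GPU cracker is written: it assumes an unsalted SHA-256 hash, a tiny made-up wordlist, and CPU threads standing in for GPU cores. The names sha256_hex, search_chunk, and crack are hypothetical helpers chosen for this illustration. The only thing it is meant to show is that each worker can test its own slice of candidates without ever talking to the others.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_hex(password: str) -> str:
    """Hash one candidate. The result depends only on this candidate."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def search_chunk(target_hash: str, chunk: list) -> "str | None":
    """Test every candidate in one slice; no shared state with other slices."""
    for candidate in chunk:
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


def crack(target_hash: str, candidates: list, workers: int = 4) -> "str | None":
    """Split the wordlist into `workers` slices and search them concurrently.

    Assumption: threads are used only to illustrate the independence of the
    work units; they are not a realistic model of GPU speed.
    """
    chunks = [candidates[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for found in pool.map(lambda c: search_chunk(target_hash, c), chunks):
            if found is not None:
                return found
    return None


# Hypothetical wordlist and target, for illustration only.
wordlist = ["123456", "password", "letmein", "hunter2", "qwerty"]
target = sha256_hex("hunter2")
recovered = crack(target, wordlist)
```

Every worker runs the same instructions on different data, which is exactly the kind of workload a GPU is built for. On a GPU the same idea scales from 4 workers to thousands of threads.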
More details...
Each core is basically able to compute one 32-bit arithmetic operation per clock cycle, as a pipeline. Indeed, GPUs work well with extreme parallelism: when there are many identical work units to perform, in fact many more than there are cores ("identical" meaning "same instructions", but not "same data").
Some details, for a somewhat old NVidia card (a 9800 GTX+, from early 2009): there are 128 cores, split into 16 "multicore units". Each multicore unit can initiate 8 operations per cycle (hence the idea of 128 cores: that's 16 times 8). A multicore unit handles work units ("threads") in groups of 32, so that when it has an instruction to run, it actually issues that instruction to its 8 cores over 4 clock cycles. This is operation initiation: each individual operation takes up to 22 clock cycles to complete. You can imagine the instruction and its operands walking into the circuit as an advancing front line, like a wave in a pool: a given wave will take some time to reach the other end of the pool, but you can send several waves one after another.
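The arithmetic in this paragraph can be checked in a few lines of Python. The constant names are made up for this sketch; the numbers are the ones quoted above for this particular card.

```python
# Figures quoted in the text for this card; names are illustrative only.
MULTICORE_UNITS = 16        # "multicore units" on the chip
CORES_PER_UNIT = 8          # operations each unit can initiate per cycle
THREADS_PER_GROUP = 32      # threads that share one instruction
LATENCY_CYCLES = 22         # worst-case cycles for one operation to complete

# Total cores: 16 units times 8 cores each.
total_cores = MULTICORE_UNITS * CORES_PER_UNIT

# One instruction for a group of 32 threads is fed to 8 cores,
# so issuing it takes 32 / 8 clock cycles.
issue_cycles_per_group = THREADS_PER_GROUP // CORES_PER_UNIT

print(total_cores)             # 128
print(issue_cycles_per_group)  # 4
```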
So you can maintain the rhythm of "128 32-bit operations per cycle" only as long as you have at least 22 times as many "threads" to run (i.e. a minimum of 22·128 = 2816), such that threads can be grouped by packs of 32 "identical" threads which execute the same instructions at the same time. In practice, there are some internal thresholds and constraints which require more threads to achieve the optimal bandwidth, up to about 4096.
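The thread-count reasoning can be sketched as a small calculation. This is a deliberately simplified model, not a description of the real scheduler: it assumes every thread has exactly one operation in flight at a time and that each operation takes the full 22 cycles. The function names are hypothetical.

```python
import math


def min_threads_for_full_rate(cores: int, latency_cycles: int,
                              group_size: int = 32) -> tuple:
    """Threads needed so that every core can start a new operation each cycle.

    Simplifying assumption: a thread cannot start its next operation until
    the previous one has finished, `latency_cycles` later.
    Returns (threads, groups of `group_size` threads).
    """
    threads = cores * latency_cycles
    groups = math.ceil(threads / group_size)
    return threads, groups


def utilization(threads: int, cores: int = 128, latency_cycles: int = 22) -> float:
    """Fraction of the peak issue rate reached with `threads` in flight."""
    return min(1.0, threads / (cores * latency_cycles))


threads, groups = min_threads_for_full_rate(cores=128, latency_cycles=22)
print(threads, groups)    # 2816 88
print(utilization(1408))  # 0.5
print(utilization(4096))  # 1.0
```

Under this model, half the minimum thread count leaves the pipeline half empty. The model does not capture the internal thresholds mentioned above, which is why the practical figure is closer to 4096 than to 2816.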