Lecture #5 – Floating Point Arithmetic

 

·        Optimizing Integer Arithmetic

 

Ripple adders chain a series of 1-bit full adders to create a multi-bit adder.  Ripple addition has a long propagation delay because each stage's carry-in depends on the carry-out of the previous stage.
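
The carry dependency can be sketched in software. This is a minimal illustration, assuming unsigned operands and a default width of 8 bits; the names full_adder and ripple_add are hypothetical and do not come from the lecture.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=8):
    """Add two unsigned integers one bit at a time, least significant bit first.

    Returns (sum modulo 2**width, final carry out).
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        # Stage i cannot finish until the carry from stage i-1 is known;
        # this serial chain is the source of the long propagation delay.
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result, carry
```

For example, ripple_add(255, 1) returns (0, 1): the carry has to travel through all eight stages before the final carry out is known.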

 

 

 

 

·        Floating Point Number

 

numbers with fractions, e.g., 3.1416

 

very small numbers, e.g., .000000001

 

very large numbers, e.g., 3.15576 x 10^9

 

·        Floating Point Representation:

 

sign, exponent, significand:    (-1)^sign  x  significand  x  2^exponent

 

more bits for the significand give more accuracy

 

more bits for the exponent increase the range
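
The sign / significand / exponent split can be tried out on ordinary Python floats. This is a rough sketch, assuming a nonzero finite input; the helper name decompose is hypothetical, and the significand is rescaled to lie in [1, 2).

```python
import math


def decompose(x):
    """Split a nonzero finite x into (sign, significand, exponent) so that
    x == (-1)**sign * significand * 2**exponent with 1 <= significand < 2."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e, with 0.5 <= m < 1
    return sign, m * 2, e - 1   # rescale so the significand is in [1, 2)
```

For example, decompose(-0.75) returns (1, 1.5, -1), i.e. -0.75 = (-1)^1 x 1.5 x 2^-1.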

 

·        IEEE 754 Floating Point Standard

 

single precision:  1 sign bit, 8-bit exponent, 23-bit significand

 

double precision:  1 sign bit, 11-bit exponent, 52-bit significand

 

Leading “1” bit of significand is implicit

 

Exponent is “biased” to make sorting easier

 

–        all 0s is the smallest exponent; all 1s is the largest

 

–        bias of 127 for single precision and 1023 for double precision

 

–        summary:   (-1)^sign x (1 + significand) x 2^(exponent - bias)
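
The summary formula can be checked by pulling the three fields out of a single precision bit pattern. This is a sketch under stated assumptions: it handles normalized numbers only (no zeros, denormals, infinities or NaNs), it uses Python's standard struct module to obtain the bits, and the name decode_single is hypothetical.

```python
import struct


def decode_single(x):
    """Return (sign, biased_exponent, fraction_bits, value) for x stored
    as an IEEE single precision number. Normalized numbers only."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 bits, leading 1 is implicit
    # (-1)^sign x (1 + significand) x 2^(exponent - bias), with bias = 127
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (biased_exponent - 127)
    return sign, biased_exponent, fraction, value
```

For example, 6.5 = 1.625 x 2^2, so decode_single(6.5) gives a biased exponent of 2 + 127 = 129 and rebuilds the value 6.5 from the fields.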

 

 

 

 

 

·        Example

 

decimal:  -.75 = - ( ½ + ¼ )

 

binary:  -.11 = -1.1 x 2^-1

 

floating point:  biased exponent = -1 + 127 = 126 = 01111110

 

IEEE single precision:  1 01111110  10000000000000000000000
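
The worked example can be confirmed in software. This is a small sketch, assuming Python's standard struct module and big-endian packing so that the sign bit comes first; the helper name single_bits is hypothetical.

```python
import struct


def single_bits(x):
    """Format x as IEEE single precision, split into sign, exponent, fraction."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"


print(single_bits(-0.75))  # prints: 1 01111110 10000000000000000000000
```

The printed fields match the hand-derived pattern: sign 1, biased exponent 126, and a fraction whose first bit is the ".1" left over after the implicit leading 1.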

 

·        Floating Point Addition