IEEE 754-1985 Float Calculation

Reclaimer Shawn · 1 IEEE 754-1985 Float Calculation 2/25/2016, 1:53 pm


Alright, so first off: what exactly does IEEE 754-1985 stand for anyways? Well, IEEE stands for Institute of Electrical and Electronics Engineers, 754 is the number of its floating-point standard, and 1985 is the year that standard was adopted. Later revisions exist (2008 and 2019), but the 32-bit and 64-bit binary formats covered here are still what modern computers use to represent fractional numbers in binary. Before I go on any further, this tutorial assumes you already know how to perform written calculations in both binary and hexadecimal, without the use of a calculator. If you cannot do these things, please leave while you still can. Alright, now onto the good stuff.

Float values are stored in 32 bits, meaning they use 32 0's and 1's (bits) to represent a number. Here's the IEEE 754 format:

[Image: diagram of the IEEE 754 float format]

Alright, time to break it apart. The sign bit uses sign-magnitude representation: a 1 signifies a negative number, while a 0 signifies a positive one. Now, to cover the rest, let's choose a random number... Let's try -6.125 for instance. The first thing to do is place a 1 in the sign bit to represent a negative number:

Code::
1    | XXXXXXXX | XXXXXXXXXXXXXXXXXXXXXXX
Sign | Exponent | Mantissa/Significand

Now, we find the mantissa. First, we work out the whole-number part of the number. We know by now that 6 = 110 in binary, so we'll place this down here:

Code:: 110.XXX

Now, you know how with binary we do powers of 2? Well, now we'll use negative powers to represent the fractional part, like in scientific notation. The first place after the point is worth 2^-1, or 1/2^1. The next is 1/2^2, and so on. Now, we check if 1/2^1 goes into the remaining .125... Does .5 go in? Nope, so we place a zero. Now, we try 1/2^2. Does .25 go in? Nope, we place another zero. Now, we try 1/2^3, which is .125. That goes in and leaves a remainder of zero, so we place a one and stop there. Our not-yet-normalized number (I'll call it "denormalized" here, though IEEE 754 also uses that word for a special class of very small numbers) is as such:

Code:: 110.001
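If you'd rather let a script do the "does it go in?" checks, here's a minimal Python sketch of the same steps. The function name to_binary_fixed and the 23-bit cutoff are my own choices for illustration, not part of the standard, and it only handles non-negative numbers.

```python
def to_binary_fixed(value, max_frac_bits=23):
    """Write a non-negative number in binary as 'whole.fraction'."""
    whole = int(value)
    frac = value - whole
    bits = []
    power = 0.5                      # 2^-1, the first place after the point
    while frac > 0 and len(bits) < max_frac_bits:
        if frac >= power:            # does this power of 2 "go in"?
            bits.append("1")
            frac -= power
        else:
            bits.append("0")
        power /= 2                   # move on to the next negative power
    return format(whole, "b") + "." + ("".join(bits) or "0")

print(to_binary_fixed(6.125))  # 110.001
```

The loop stops as soon as the remainder hits zero, just like the written method.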

Now, we have to "normalize" this to make it work. Sure, we write binary fractions in the denormalized format, but a computer does not. What we do is move the binary point to the left as many times as it takes to place it right after the first (leftmost) 1-bit. It turns out like this:

Code:: 1.10001

Now, we drop the leading 1 and the binary point (every normalized number starts with 1, so the format doesn't bother storing it), and get this:

Code:: 10001

"Pad" the remaining 18 bits with 0's to fill the 23-bit mantissa. We get 18 from (number of mantissa bits) - (number of bits in the "normalized strand"): 23 - 5 = 18 remaining bits. We get this:

Code::
1    | XXXXXXXX | 10001000000000000000000
Sign | Exponent | Significand/Mantissa
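The normalize, drop-the-1, and pad steps can be sketched in Python like this. The helper name normalize is made up for this post, it assumes the number is at least 1 (so there is a 1-bit left of the point), and it simply cuts off anything past 23 bits instead of rounding.

```python
def normalize(binary_text, field_width=23):
    """Return (shift, mantissa_field) for a string like '110.001' (value >= 1)."""
    whole, _, frac = binary_text.partition(".")
    whole = whole.lstrip("0")
    shift = len(whole) - 1           # how many places the point moved left
    digits = whole[1:] + frac        # everything after the dropped leading 1
    # Pad with 0's on the right (or cut off extra bits) to fill the field.
    return shift, digits.ljust(field_width, "0")[:field_width]

shift, mantissa = normalize("110.001")
print(shift)     # 2
print(mantissa)  # 10001000000000000000000
```

Hang on to the shift count, since that is what goes into the exponent.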

Now, we find the exponent. We have to remember how many times we moved the binary point to the left to "normalize" the number: we moved it two places, so the exponent is 2. (If we had to move the point to the right instead, as with numbers smaller than 1, the exponent would be negative.) To that we add what is called the "bias". The bias is the highest number you can get in a signed (+/-) system of that many bits; for the 8 exponent bits that is 2^7 - 1 = 127. Adding the exponent (2) to 127 gives 129. Now, all we do is work out 129 in binary and load it into the exponent bits. 129 = 10000001 in binary, so our full float notation number is:

Code:: 11000000110001000000000000000000
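To double-check the hand calculation, here's a small Python sketch that glues the three fields together and compares the result against what Python's own struct module produces for -6.125. The variable names are just for illustration.

```python
import struct

sign = "1"                                  # negative number
exponent = format(2 + 127, "08b")           # shift of 2 plus the bias
mantissa = "10001" + "0" * 18               # 5 bits found by hand, 18 bits of padding
bits = sign + exponent + mantissa
print(bits)                                 # 11000000110001000000000000000000
print(hex(int(bits, 2)))                    # 0xc0c40000

# Cross-check: pack -6.125 as a big-endian 32-bit float and compare.
packed = struct.pack(">f", -6.125)
print(packed.hex())                         # c0c40000

# And decode our hand-built bits back into a number.
decoded = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]
print(decoded)                              # -6.125
```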

And we're done! Next up is double notation. I included it in the same lesson because it works the same way. The only difference is the layout:

[Image: diagram of the IEEE 754 double format]

The bias is now 1023 (with 11 exponent bits, 1023 is the highest signed number), and the significand holds 52 bits, allowing for precision down to 1/2^52 relative to the leading 1, instead of 1/2^23 in float. Keep in mind that this is God-awful for fractions that are not sums of powers of 2 (0.1, for example): their binary expansion never ends, so they have to be rounded, and EVERY bit gets used just to represent that rounded number. I'll put some extra terminology in the next post, but for now, this is how you can calculate in float! Enjoy!
A float Calculator to check your work:
http://www.h-schmidt.net/FloatConverter/IEEE754.html
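For doubles, a rough Python sketch like the one below can stand in for the calculator. Python's float is a 64-bit double on the usual platforms, and double_fields is a name I made up for this example.

```python
import struct
from decimal import Decimal

def double_fields(x):
    """Split a Python float (a 64-bit double) into its three bit fields."""
    raw = int.from_bytes(struct.pack(">d", x), "big")
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]   # sign, 11-bit exponent, 52-bit fraction

sign, exponent, fraction = double_fields(-6.125)
print(sign)                    # 1
print(exponent)                # 10000000001  (2 + 1023 = 1025)
print(fraction.rstrip("0"))    # 10001

# 0.1 is not a sum of powers of 2, so what gets stored is a rounded neighbour.
print(Decimal(0.1) == Decimal("0.1"))      # False
print(Decimal(6.125) == Decimal("6.125"))  # True
```

Same steps as the float example, just with a bias of 1023 and 52 bits to pad out.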

Last edited by Reclaimer Shawn on 3/24/2018, 8:56 pm; edited 2 times in total

Reclaimer Shawn · 2 Terminology and Other Factoids 2/25/2016, 2:15 pm


Truncation: Dropping the fractional part of a number, which always rounds toward zero.

Rounding to nearest: Rounding a number to the closest whole number (a fraction of .1 to .4 goes toward zero, .5 and up goes away from zero).

Flooring: Rounding a value down. (Bringing it to the floor as I like to think)

Ceiling: Rounding a value up. (Raising it up to the ceiling)

For example

Number EX:            -12.4   12.6   -12.6   12.4
Rounding Methods:
Flooring              -13     12     -13     12
Ceiling               -12     13     -12     13
Truncating            -12     12     -12     12
Rounding to nearest   -12     13     -13     12
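Here's a quick Python sketch that runs the same four example numbers through the standard library's rounding functions. One thing to watch: Python's built-in round() sends exact ties (like 12.5) to the nearest even number, which doesn't matter here because none of the examples are ties.

```python
import math

values = [-12.4, 12.6, -12.6, 12.4]

results = {
    "floor": [math.floor(v) for v in values],   # always down
    "ceil":  [math.ceil(v) for v in values],    # always up
    "trunc": [math.trunc(v) for v in values],   # toward zero
    "round": [round(v) for v in values],        # to the nearest whole number
}

for name, row in results.items():
    print(name, row)
# floor [-13, 12, -13, 12]
# ceil [-12, 13, -12, 13]
# trunc [-12, 12, -12, 12]
# round [-12, 13, -13, 12]
```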

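Not a Number (NaN)
Types of NaNs
Quiet NaN (QNaN): A NaN that simply results from an undefined or erroneous calculation and passes through later calculations without raising anything. Say, the hexadecimal number 0x7FFFFFFF, which in a signed 32-bit system is usually the highest number, but here it's a NaN. The usual default QNaN is 0x7FC00000.
Signalling NaN (SNaN): Used for either debugging purposes or flagging illegal program operations; using one in a calculation raises an invalid-operation exception. On most platforms a SNaN has the top mantissa bit clear and at least one other mantissa bit set, so a SNaN might be 0x7FA00000.

Special Operations in IEEE 754:
Finite number/(+/-)Infinity = (+/-)0
(+/-)Infinity*(+/-)Infinity = (+/-)Infinity
(+/-)Nonzero number/(+/-)0 = (+/-)Infinity
(+/-)Infinity/(+/-)0 = (+/-)Infinity
(+/-)0/(+/-)0 = NaN
Infinity-Infinity = NaN
(+/-)Infinity/(+/-)Infinity = NaN
(+/-)0*(+/-)Infinity = NaN

Special Numbers in IEEE 754:
0x7F800000 = Infinity
0xFF800000 = -Infinity
0x7FC00000 = QNaN (any pattern with all exponent bits set and a nonzero mantissa is a NaN, so there are many more than this)
0x80000000 = Negative Zero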
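A few of these can be poked at from Python. This is only a sketch: from_bits is a helper name I made up, and Python itself raises ZeroDivisionError when you divide a float by zero, so the divide-by-zero rules are left out here.

```python
import math
import struct

def from_bits(pattern):
    """Reinterpret a 32-bit integer pattern as an IEEE 754 float."""
    return struct.unpack(">f", pattern.to_bytes(4, "big"))[0]

print(from_bits(0x7F800000))               # inf
print(from_bits(0xFF800000))               # -inf
print(from_bits(0x80000000))               # -0.0
print(math.isnan(from_bits(0x7FC00000)))   # True
print(math.isnan(from_bits(0x7FFFFFFF)))   # True

inf = math.inf
print(1.0 / inf)                 # 0.0
print(inf * -inf)                # -inf
print(math.isnan(inf - inf))     # True
print(math.isnan(0.0 * inf))     # True
```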
