IEEE-754 Precision Format

arhzz · May 14, 2024

Hello! (Note: I set the prefix to Comp Sci, since at my uni this is a computer science class, but I am in EE, so it could go under either prefix. If anyone feels it's more appropriate as Engineering, feel free to change it.)

So here is my attempt at the solution

The formula that we are given is ## (-1)^s*M*2^E ##, where M is the mantissa, E is the exponent, and s is the sign.

Since it's single precision, the mantissa should be 23 bits. The smallest value the mantissa can have is 1,(followed by 23 0's). So the value that follows this is the smallest value that can be added to the mantissa; hence we can find our epsilon as follows:

## \epsilon = \frac{1}{2^{23}} = 2^{-23} = 1,19 * 10^{-7} ## (roughly)

I think this part should be correct;

Now for the second part I tried it like this.

b) The biggest value the mantissa can have is M = 1,(followed by 23 1's) and the biggest value the exponent can have is 127, hence ## 2^{127} ##.

So now I can plug into the formula: ## (-1)^s * M * 2^E = 1,89*10^{38} ##, where s is either 0 or 1 for positive or negative.

Now the answer should be ## 3,40*10^{38} ## and I really don't see how they get to that solution. The power of 38 is correct, which confuses me, because that would imply that the formula I am using is correct, no?

Thanks for the help!

BvU · May 14, 2024

And please use a decimal point, not a decimal comma ...

What do you think of this:

Wikipedia said:

an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2⁻²³) × 2¹²⁷≈ 3.4028235 × 10³⁸

[edit]
And I did
##\qquad##Y = 2 + ε
##\qquad##write (6,'(Z8.8)') Y-2.0
and it printed 00000000 !


arhzz · May 14, 2024

BvU said:

And please use a decimal point, not a decimal comma ...

What do you think of this:

[edit]
And I did
##\qquad##Y = 2 + ε
##\qquad##write (6,'(Z8.8)') Y-2.0
and it printed 00000000 !

Well, I think that my solution is wrong, but how do they calculate 3,40? What am I doing wrong in my calculations?

Also, I don't really understand what you did. What is that supposed to represent?

Also, noted regarding the decimal comma.

pbuk · May 14, 2024

BvU said:

And please use a decimal point, not a decimal comma ...

Why? If the OP is studying in Europe then answers with full stops as the fractional separator would be wrong. You do need to insert braces around the comma though to avoid inserting unwanted white space, and also use \times instead of * for multiplication, so instead of 1,19 * 10^{-7} rendered as ## 1,19 * 10^{-7} ## you should write 1{,}19 \times 10^{-7} rendered as ## 1{,}19 \times 10^{-7} ##.

arhzz said:

So now I can plug in the formula ## (-1)^s * M * 2^E = 1,89*10^{38} ## where S is either 0 or 1 for positive or negative

## (-1)^s \cdot M \cdot 2^E ## is the right equation, but you need to plug the right value of M in!

arhzz said:

Now the answer should be ##3,40*10^{38} ## and I really dont see how they get to that solution.

By using approximately the right value of ## M = 1{,}111\dots 111_2 ## where there are 23 1's in the fractional part as you have correctly stated (there is of course a simpler way to express this value, what is it? Can you see that it leads very quickly to a good approximation?).

BvU · May 15, 2024

arhzz said:

Also for what you did I dont really understand, what is that supposed to represent?

It means you have to reconsider your answer for a).
real*4 2.0 is hexadecimal 40000000
2^-23 = 1.1920929E-07 is hexadecimal 34000000
and 2+ 2^-23 is hexadecimal 40000000 also.
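
For anyone who wants to reproduce these bit patterns without a Fortran compiler, here is a minimal Python sketch. It assumes that Python's `struct` module rounds a double to the nearest single (ties to even) when packing with the `f` format; the helper name `f32_hex` is made up for this illustration. Since 2 + 2^-23 is exact in double precision, the single rounding done when packing should match what a real*4 addition does.

```python
import struct

def f32_hex(x: float) -> str:
    """Round x to the nearest IEEE-754 single and return its 8-digit hex bit pattern."""
    return struct.pack(">f", x).hex().upper()

two = 2.0
eps = 2.0 ** -23           # exactly representable in single precision

print(f32_hex(two))        # 40000000
print(f32_hex(eps))        # 34000000
print(f32_hex(two + eps))  # 40000000: the sum is a tie and rounds back to 2.0
```

The last line is the point: 2 + 2^-23 lies exactly halfway between two neighbouring singles, and the tie goes to the one with the even last bit, which is 2.0 itself.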

arhzz said:

how do they calculate 3,40

##\mathtt {(2 − 2^{−23}) × 2^{127}≈ 3.4028235 × 10^{38}}##

There are 23 bits for the fraction. The first bit (the implicit leading bit) is always a 1 for normalized numbers and is not stored, so there are effectively 24 bits,
so the biggest possible fraction is FFFFFF (as a hex fraction, ##\mathtt {1 − 2^{-24}}##) and ##\mathtt {2 × (1 − 2^{-24}) = 2 − 2^{-23}}##.
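
The same result can be written as a sum in the other convention, where the significand lies between 1 and 2. This is only a sketch of the argument; the labels ## M_{\max} ## and ## v_{\max} ## are introduced here for illustration:

$$ M_{\max} = 1 + \sum_{k=1}^{23} 2^{-k} = 1 + \left( 1 - 2^{-23} \right) = 2 - 2^{-23} $$

$$ v_{\max} = \left( 2 - 2^{-23} \right) \times 2^{127} = 2^{128} - 2^{104} \approx 3.4028235 \times 10^{38} $$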

wikipedia said:

The exponent field is an 8-bit unsigned integer from 0 to 255, in biased form: a value of 127 represents the actual exponent zero. Exponents range from −126 to +127 (thus 1 to 254 in the exponent field), because the biased exponent values 0 (all 0s) and 255 (all 1s) are reserved for special numbers (subnormal numbers, signed zeros, infinities, and NaNs).

so there is the ##\mathtt {2^ {127}}##

Fortran has an intrinsic function HUGE; for real*4 it returns this value, which prints as 7F7F FFFF in hex.
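
As a cross-check, the bit pattern 7F7FFFFF can be decoded by hand. The following Python sketch assumes the usual single-precision layout (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits); the function name `decode_f32` is hypothetical.

```python
import struct

def decode_f32(hex_bits: str):
    """Split a single-precision bit pattern into sign, biased exponent, and fraction fields."""
    n = int(hex_bits, 16)
    sign = n >> 31
    exp_field = (n >> 23) & 0xFF
    fraction = n & 0x7FFFFF
    return sign, exp_field, fraction

sign, exp_field, fraction = decode_f32("7F7FFFFF")
# Normalized value: (-1)^sign * (1 + fraction / 2^23) * 2^(exp_field - 127)
value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exp_field - 127)

print(exp_field - 127)                     # 127
print(value == (2 - 2**-23) * 2.0**127)    # True
print(value == struct.unpack(">f", bytes.fromhex("7F7FFFFF"))[0])  # True
print(value)                               # 3.4028234663852886e+38
```

The exponent field is 254, the largest value that is not reserved, and the fraction field is all ones, which is where the factor 2 − 2^-23 comes from.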


arhzz · May 15, 2024

BvU said:

It means you have to reconsider your answer for a).
real*4 2.0 is hexadecimal 40000000
2^-23 = 1.1920929E-07 is hexadecimal 34000000
and 2+ 2^-23 is hexadecimal 40000000 also.

##\mathtt {(2 − 2^{−23}) × 2^{127}≈ 3.4028235 × 10^{38}}##

There are 23 bits for the fraction. The first bit (the implicit leading bit) is always a 1 for normalized numbers and is not stored, so there are effectively 24 bits,
so the biggest possible fraction is FFFFFF (as a hex fraction, ##\mathtt {1 − 2^{-24}}##) and ##\mathtt {2 × (1 − 2^{-24}) = 2 − 2^{-23}}##.

so there is the ##\mathtt {2^ {127}}##

Fortran has a function HUGE that prints 7F7F FFFF hex


Okay, now I see it. I checked our slides again, and the part about the implicit leading bit not being stored is what caused the confusion on my part. I was able to reproduce your answer and get the same result (I also have it in binary form).

Thank you for the help on part b).

But for part a) I realized that I made a mistake in my reasoning. We did a very similar example in class with ##1 + \epsilon > 1 ##, and after looking at the solution it seemed to me that the "constant factors" (the 1, and in my example the 2) did not impact the solution. Obviously this is wrong, since it would mean that for every number the solution would be ##2^{-23} ##, and that just does not make sense.

So I am kind of stumped on how to do a). Any insights?

BvU · May 15, 2024

The key is in "real*4 2.0 is hexadecimal 40000000"

BvU · May 15, 2024

I thought. But it's more complicated. Trial and error, at least for me now.
I figured the ##\varepsilon## for your part a) would be 2**(-22) but I find that anything > 2**(-23) already sets the last bit in that 40000000

with eps = 1.1920929E-07 ( 2**(-23), in hex: 34000000)
I do Y = 2.0 + eps and look at Y and at Y-2

eps: 0.119209289551E-06 in hex: 34000000
: Y: 0.200000000000E+01 in hex: 40000000
Y-2: 0.000000000000E+00 in hex: 00000000

Same with eps = 0.2384186E-06 (2**(-22), in hex: 34800000)

~~eps: 0.119209303762E-06 in hex: 34000001~~
eps: 0.238418607523E-06 in hex: 34800001
: Y: 0.200000023842E+01 in hex: 40000001
Y-2: 0.238418579102E-06 in hex: 34800000

But this last result I also get when I input eps = 1.192093E-07

eps: 0.119209303762E-06 in hex: 34000001
: Y: 0.200000023842E+01 in hex: 40000001
Y-2: 0.238418579102E-06 in hex: 34800000

[edit] messed up struggling with font and bold.
Point is: anything > 2**(-23) qualifies as answer to part a), just NOT 2**(-23) itself
Disclaimer: I am using the Intel fortran compiler that appears to have around a hundred switches for all kinds of compatibilities in floating point arithmetic.


arhzz · May 15, 2024

BvU said:

I thought. But it's more complicated. Trial and error, at least for me now.
I figured the ##\varepsilon## for your part a) would be 2**(-22) but I find that anything > 2**(-23) already sets the last bit in that 40000000

with eps = 1.1920929E-07 ( 2**(-23), in hex: 34000000)
I do Y = 2.0 + eps and look at Y and at Y-2

eps: 0.119209289551E-06 in hex: 34000000
: Y: 0.200000000000E+01 in hex: 40000000
Y-2: 0.000000000000E+00 in hex: 00000000

Same with eps = 0.2384186E-06 (2**(-22), in hex: 34800000)

eps: 0.119209303762E-06 in hex: 34000001
: Y: 0.200000023842E+01 in hex: 40000001
Y-2: 0.238418579102E-06 in hex: 34800000

But this last result I also get when I input eps = 1.192093E-07

Huh, interesting. So if I understood correctly, you found your epsilon, then you added the epsilon to the 2. So for the first case our epsilon is ##2^{-23}##, and after adding it to 2 we get hexadecimal 40000000. Subtracting 2 from this value gets us to 00000000, which suggests that no change has occurred and that the inequality is not fulfilled? Did I understand this correctly?

The same reasoning applies when epsilon is ## 2^{-22} ##, which yields that epsilon should be ## 2^{-22} ##.

But you get the same result when inputting a different epsilon, if I got that right? How does that work? That should not be happening, right?

BvU · May 15, 2024

Oops, messed up. Back later.

BvU · May 15, 2024

arhzz said:

Huh, interesting. So if I understood correctly, you found your epsilon, then you added the epsilon to the 2. So for the first case our epsilon is ##2^{-23}##, and after adding it to 2 we get hexadecimal 40000000. Subtracting 2 from this value gets us to 00000000, which suggests that no change has occurred and that the inequality is not fulfilled? Did I understand this correctly?

The same reasoning applies when epsilon is ## 2^{-22} ##, which yields that epsilon should be ## 2^{-22} ##.

But you get the same result when inputting a different epsilon, if I got that right? How does that work? That should not be happening, right?

So far I have done some trials starting with real*4 Y = 2.0 which is stored as 40000000 (in hexadecimal).

Adding 2**(-23) (decimal 0.119209289551E-06) to this gets the sum stored as 40000000 again, so Y + 2**(-23) is NOT seen as greater than Y

The minimum change required to see Y+ ##\varepsilon## as greater than Y is when it goes from 40000000 to 40000001 in hex. The decimal value of the latter is 0.200000023842E+01 or 2 + 2**(-22) , which makes sense.

But it appears that the floating-point arithmetic rounds numbers only slightly greater than 2**(-23) up to 2**(-22):
e.g. 0.1192092967E-06 but NOT 0.1192092966E-06
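
One possible reading of these two values, offered as a sketch and not as a statement about the Intel compiler: the rounding may already happen when the decimal input is stored as a single. The midpoint between the singles 34000000 and 34000001 is 2^-23 + 2^-47 ≈ 1.19209296656E-07, which falls between the two inputs. The Python below assumes `struct` rounds to the nearest single (ties to even); `to_f32` and `f32_hex` are helper names made up for this illustration.

```python
import struct

def to_f32(x: float) -> float:
    """Round x to the nearest single-precision value (returned as a Python float)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def f32_hex(x: float) -> str:
    """Return the 8-digit hex bit pattern of x rounded to single precision."""
    return struct.pack(">f", x).hex().upper()

for literal in (0.1192092966e-06, 0.1192092967e-06):
    eps = to_f32(literal)    # step 1: the decimal input is stored as a single
    y = to_f32(2.0 + eps)    # step 2: the sum is rounded to a single
    print(f32_hex(eps), f32_hex(y))
# 34000000 40000000
# 34000001 40000001
```

Under this reading, the first input is stored as exactly 2^-23, so the sum is a tie and rounds back to 2.0. The second is stored as the next single above 2^-23, so the sum lies above the halfway point and rounds up to 40000001.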

I think this is going too deep, so probably the intended answer for a) is 2**(-23)

https://en.wikipedia.org/wiki/Machine_epsilon#Values_for_standard_hardware_arithmetics


arhzz · May 15, 2024

BvU said:

Oops, messed up. Back later.

BvU said:

So far I have done some trials starting with real*4 Y = 2.0 which is stored as 40000000 (in hexadecimal).

Adding 2**(-23) (decimal 0.119209289551E-06) to this gets the sum stored as 40000000 again, so Y + 2**(-23) is NOT seen as greater than Y

The minimum change required to see Y+ ##\varepsilon## as greater than Y is when it goes from 40000000 to 40000001 in hex. The decimal value of the latter is 0.200000023842E+01 or 2 + 2**(-22) , which makes sense.

But it appears that the floating-point arithmetic rounds numbers only slightly greater than 2**(-23) up to 2**(-22):
e.g. 0.1192092967E-06 but NOT 0.1192092966E-06

I think this is going too deep, so probably the intended answer for a) is 2**(-23)

https://en.wikipedia.org/wiki/Machine_epsilon#Values_for_standard_hardware_arithmetics

Hm, okay, interesting. So you would bet on 2^(-23) being the answer? Considering that ## 1+\epsilon > 1 ## yields the same answer, can we state that the minimum value is always ## 2^{-23} ##, regardless of what the constant factors are?

BvU · May 15, 2024

I have shown that it is not the right answer (for my compiler). But probably the intended answer for a) is 2**(-23)

Up to you!

arhzz · May 15, 2024

BvU said:

I have shown that it is not the right answer (for my compiler). But probably the intended answer for a) is 2**(-23)

Up to you !

I think I will try it like this:

The equation that relates the components of the normalized IEEE-754 single-precision representation to the value it represents is:

$$ v \; = \; \left( -1 \right)^S \left( 1 \; + \; \frac{M}{2^{23}} \right) 2^{\left( E-127\right)} $$

Now I define ## v_1 = 2 ## and ## v_2 = 2+ \epsilon ##.

Then we solve for ## \epsilon ##, and what comes out is my solution.

Does this make any sense to you?
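
A sketch of how that calculation could go, assuming E is the biased 8-bit exponent field, M is the 23-bit fraction field read as an integer, and ## \epsilon ## is meant as the gap to the next representable number above 2. For ## v_1 = 2 ## the fields are S = 0, E = 128, M = 0, and the next representable value has M = 1:

$$ v_1 = \left( 1 + \frac{0}{2^{23}} \right) 2^{\left( 128-127 \right)} = 2 $$

$$ v_2 = \left( 1 + \frac{1}{2^{23}} \right) 2^{\left( 128-127 \right)} = 2 + 2^{-22} $$

$$ \epsilon = v_2 - v_1 = 2^{-22} $$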

pbuk · May 16, 2024

@BvU you are confusing rounding epsilon with interval epsilon.

@arhzz given that 2 is twice the size of 1 it should be obvious that they could not both have the same value of epsilon.

vela · May 16, 2024

BvU said:

I have shown that it is not the right answer (for my compiler). But probably the intended answer for a) is 2**(-23)

How are you getting that? If you're going to fit the numbers into 24 bits, you have

2 = 10.00 0000 0000 0000 0000 0000
2+e = 10.00 0000 0000 0000 0000 0001

Since there are 22 digits to the right of the binary point, the difference is 2^-22.
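
The same spacing can be checked numerically by incrementing the stored bit pattern by one. This is a small Python sketch, not anyone's official method; `next_f32_up` is a hypothetical helper and is only meant for positive, finite singles below the largest one.

```python
import struct

def next_f32_up(x: float) -> float:
    """Return the next single-precision value above a positive, finite single x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

print(next_f32_up(1.0) - 1.0 == 2.0 ** -23)  # True
print(next_f32_up(2.0) - 2.0 == 2.0 ** -22)  # True
```

The gap doubles when going from 1 to 2 because the exponent goes up by one while the last fraction bit keeps the same position relative to the leading bit.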

BvU · May 17, 2024

pbuk said:

@BvU you are confusing rounding epsilon with interval epsilon.

The question is then: does the problem statement in #1 include this rounding (which is demonstrated in practice in the last block in #8, but may well be compiler-dependent)?

BvU · May 17, 2024

vela said:

How are you getting that? If you're going to fit the numbers into 24 bits, you have

2 = 10.00 0000 0000 0000 0000 0000
2+e = 10.00 0000 0000 0000 0000 0001

Since there are 22 digits to the right of the binary point, the difference is 2^-22.

Yes, see #8: the hex representation of the smallest number > 2 is 40000001

I have muddied the waters by a practical interpretation -- makes for a good learning opportunity
