HISTOGRAMDATA implementation bug

Last post 01-31-2010, 6:37 PM by admin. 8 replies.
  •  01-26-2010, 8:00 AM 92

    HISTOGRAMDATA implementation bug

    Hello all,

    I've been pulling my hair out with complex simulations that didn't work, until I narrowed down my headache to this:
    -------------------------------Code-----------------
    DATA (7.10 7.10 7.10 7.11 7.11 7.11 7.12 7.12) MyList
    HISTOGRAMDATA Binsize 0.01 MyList BinNumbersVec BinStartsVec BinCentersVec FrequenciesVec
    WRITE WINDOW ROWS BinCentersVec%0.2F
    WRITE WINDOW ROWS FrequenciesVec%0.2F
    ------------------------------------------------------
    produces this result:
    7.11    7.11   
    3    5

    Questions:
    - Why two bins with the same BinCenters?
    - Why don't data values that are one BinSize apart fall into 2 separate bins? This seems to be a fence-post / boundary condition issue. I believe the boundaries of a bin should be at BinCenter - 1/2 BinSize and BinCenter + 1/2 BinSize. Thus if 7.11 falls into a bin with center 7.11, there is no way 7.10 can fall in the same bin, no?
    - We should have a histogram of "length" = 7.12-7.10 = 2 x BinSize = 0.01+0.01 in this case, no?
    - Optional question: how does HISTOGRAMDATA deal with data that falls exactly on a bin boundary? Is it counted in the upper or lower bin?

    I would be happy to be corrected if I wrote "incorrect code" that produces "correctly" these errors, and otherwise I really would appreciate having this fixed, as I've spent weeks trying to figure out why it didn't add up properly. I'm using Rev 1.4.5 if it makes any difference.

    Thanks for any suggestions
    Gus
  •  01-26-2010, 1:20 PM 93 in reply to 92

    Re: HISTOGRAMDATA implementation bug

    Gus,

    Question 1: Why two bins with the same BinCenters?
    Answer: They really don't have the same centers. They appear to have the same centers because your code's format spec rounds them off to two digits. If you print them with more precision, like this,

       WRITE WINDOW ROWS BinCentersVec%0.3F
       WRITE WINDOW ROWS FrequenciesVec%0.3F


    you will see the following:

       7.105    7.115    
       3        5


    Moreover, if you print out the binStartsVec, you can see why the data are allocated to the bins as they are:

       PRINT table binnumbersvec binstartsvec BinCentersVec FrequenciesVec

    produces:

       BinNumbersVec   BinStartsVec    BinCentersVec   FrequenciesVec    
       0.0E00          7.1             7.105           3                 
       1               7.11            7.115           5         
            

    Re: "Optional Question":  how does HISTOGRAMDATA deal with data that falls exactly on a bin boundary? Is it counted in the upper or lower bin?

    Answer (from Help on "HISTOGRAM" command): "Note that the "bin start" value is included in the bin and the "bin end" value is excluded. That means that if the bin start is 1.0, then a value of 1.0 will be counted as being in that bin. If the bin end is 1.0, then a value of 1.0 will belong in the next bin."
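
    As a rough illustration of that rule (plain Python, not Statistics101 code; the bin starts and the helper name `bin_index` are made up for the example), a half-open bin [start, end) can be modeled by finding the last bin start that is less than or equal to the value:

```python
from bisect import bisect_right

def bin_index(value, bin_starts):
    """Return the index of the half-open bin [start, next_start) holding value.

    bin_starts must be sorted ascending. A value equal to a bin start
    belongs to that bin; a value equal to a bin end belongs to the next bin.
    Values below the first bin start give -1.
    """
    return bisect_right(bin_starts, value) - 1

# Hypothetical bin starts 0.0, 1.0, 2.0 (bin size 1.0).
starts = [0.0, 1.0, 2.0]
print(bin_index(0.999, starts))  # falls in the bin that starts at 0.0
print(bin_index(1.0, starts))    # 1.0 is a bin start, so it belongs to that bin
```

    This only sketches the boundary rule quoted from the help text; it says nothing about how HISTOGRAMDATA computes its bin starts.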

    Please let me know if you still have questions.

    Regards,

    John

  •  01-26-2010, 5:51 PM 94 in reply to 93

    Re: HISTOGRAMDATA implementation bug

    Gasp. LOL. Shame on me. I did print the BinStartsVec (lots of them!) - but I made the same mistake of rounding off before the interesting decimals. "too much data" syndrome...

    Thanks for the lesson. And many thanks for the extremely fast response. When I become rich, I promise I'll send you a big cheque ;-)

    Gus

  •  01-26-2010, 5:59 PM 95 in reply to 94

    Re: HISTOGRAMDATA implementation bug

    John,

    It might be worth adding something in the help, regarding the fact that under certain data conditions, BinCenters require one more decimal (to display) than the underlying bin data precision.

    I wasn't far off when I was talking about a fence-post boundary condition; it just never occurred to me to look for an extra decimal instead of looking for an extra / missing interval. It's obvious in hindsight, but my data really has only 2 decimals and no more, and this threw me off completely, as I was not doing any divisions anywhere.
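
    The extra decimal can be seen with exact decimal arithmetic. This is a small Python sketch, not Statistics101 code, and the helper name `bin_centers` is invented for the illustration; it assumes a bin center is simply the bin start plus half the bin size:

```python
from decimal import Decimal

def bin_centers(bin_starts, bin_size):
    """Return the midpoint of each bin, computed exactly in decimal."""
    half = Decimal(bin_size) / 2
    return [Decimal(start) + half for start in bin_starts]

# Bin starts and bin size as they appear earlier in this thread.
centers = bin_centers(["7.10", "7.11"], "0.01")
print([str(c) for c in centers])  # the midpoints need a third decimal
```

    Halving a bin size of 0.01 gives 0.005, so data with 2 decimals leads to centers with 3, even though the script itself contains no division.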

    Gus
  •  01-26-2010, 7:12 PM 96 in reply to 94

    Re: HISTOGRAMDATA implementation bug

    Gus,

    Have you used the debugger? If not, I suggest you try it out. It is very powerful and I think you would have found the problem much sooner and might have saved some of your hair. :-) With the debugger, you can single-step through your program and look at all the variables at each step. You can set a breakpoint that will cause the program to stop when it reaches the breakpoint. Then you can look at the variables' contents and continue stepping through the program or run it to the next breakpoint. You can even set breakpoints while your program is running. For details, see the appendix "Using the Debugger" in the help document.

    John
    ps - I hope you get very rich! ;-)
  •  01-28-2010, 9:35 AM 97 in reply to 96

    Re: HISTOGRAMDATA implementation bug

    Hello John,

    Yes, I have used the debugger - Stats101 is very well written, and instead of crashing, it launches the debugger before I have a chance to realize I made a mistake ;-)

    However, I would never have thought of using it to check my data values: there were no symptoms of a logical error. And I do have an excuse: my real simulation data is a vector with 1,500 values. It doesn't fit nicely in the debug window, and I've been forced to dump output to the screen or to files to trace what was going on. I do divisions in the actual simulation, and ended up with horrible-looking, irregularly spaced float arrays 1,500 values long. That's when I introduced the %0.2F format... and forgot to turn it off when I was narrowing my problem down to 4 lines of code and a 6-item array. I never contemplated the debugger at that late stage.

    I'm just a 44-year-old who hasn't done any programming for 25 years. The reflexes are...rusty. I'm grateful to have found Stats101 and a patient teacher!

    PS: I'm curious why you never developed any text manipulation functions, to beef up the I/O features and allow people to read/write "complex" files with headers and variable names, or even to allow dynamically generated variables and file names? I appreciate it might make Stats101 less lean and/or harder to learn, but then you built a full-blown debugger, so dynamically evaluating a text string doesn't seem that far-fetched. Am I the first to ask?

    Thanks for all the guidance
    Gus
  •  01-28-2010, 5:28 PM 98 in reply to 97

    Re: HISTOGRAMDATA implementation bug

    Gus,

    You're welcome!

    Re text manipulation functions: I originally designed the PRINT and WRITE commands to match the Resampling Stats design, which was "bare bones." I later added a couple of enhancements, but actually, there hasn't been any clamor for enhancements other than your earlier request that led to the ROWS keyword.

    There's always a tension between adding features and maintaining a program's original mission. Statistics101's main mission is to do probability and statistical simulations. It can display some graphs, but it's not a graphing program; it can print output or write to a file, but it's not a formatting program. There are other programs that do graphing and formatting very well, giving full control to the user. So my thought was to let users easily put their data in a file that they could then, if desired, format or graph in another program, such as Excel or Word. No matter how much effort I expended to improve graphing or formatting, I would never be able to match what is readily available in those programs, which are accessible to everyone. I felt that the debugger added to the mission and was not something that could be done by another program, so I added it. Besides, I wanted a good debugger to help me do debugging. :-)

    John


  •  01-31-2010, 9:01 AM 100 in reply to 98

    There really is a bug in HISTOGRAMDATA function

    John,

    I don't know what to say. I hope I'm making another beginner's mistake - but I can't find it. I've used the debugger extensively this time, and no matter which way I try, I always end up with the same conclusion. I have been observing my samples "jumping from one bin to another" when I plot the histograms on a graph (straight lines become broken)...but only in certain very rare cases. It is hard to reproduce, but I have hard-coded one sequence that produces the error often, and this time rounding is taken into account properly.

    Please run the code below a few times (output varies), and observe the last line of output. The contents of bin 7.59 should never increase after step 2, yet sometimes they do, especially if "15" is replaced by "50" at the initialization. The reason why it increases is clear: some samples from bins 7.59, 7.60 and 7.65 (step 2) are discarded when bin 7.58 is filled at step 3. What is left of bins 7.59 and 7.60 is then ADDED and counted as being in bin 7.59. This result is wrong: data at 7.59 and 7.60 cannot be counted in the same bin if BinSize=0.01... unless the exact width between fence-posts is more than 0.01, which seems to happen in this case.
    ----------------------
    DATA (7.60 7.59 7.58) Var-1-List
    DATA ( 6    7    8   ) Var-2-List
    DATA 15#7.65               Var-3-List    'replace 15 (50% error) by 50 (increase probability of error to 99%)
    DATA 0 Explain            'set to 0 to see formatted output, set 1 for raw output

    FOREACH Index 1,3
       SHUFFLE Var-3-List Var-3-List  
       TAKE Var-1-List Index Var-1
       TAKE Var-2-List Index Var-2

       PUT Var-2#Var-1 1,Var-2 Var-3-List    'Multi-variable replacement procedure, with a random factor due to previous SHUFFLE
      
       HISTOGRAMDATA Binsize 0.01 Var-3-List BinNumbersVec BinAStartsVec BinCentersVec BinFrequenciesVec
      
       PRINT Index
       PRINT Var-1
       PRINT Var-2
      
       IF Explain = 0
          WRITE WINDOW ROWS BinAStartsVec%0.3F
          WRITE WINDOW ROWS BinCentersVec%0.3F
          WRITE WINDOW ROWS BinFrequenciesVec%0.3F
       ELSE
          WRITE WINDOW ROWS BinAStartsVec
          WRITE WINDOW ROWS BinCentersVec
          WRITE WINDOW ROWS BinFrequenciesVec
       END
    END
    -----------------------------------

    Please...help?
    Gus
  •  01-31-2010, 6:37 PM 101 in reply to 100

    Re: There really is a bug in HISTOGRAMDATA function

    Gus,

    There is indeed a problem. Your statement that "....unless the exact width between fence-posts is more than 0.01, which seems to happen in this case" is correct. If you put this code ahead of the IF command in your program,

       SHIFT -1 0 binAstartsVec binAStartsVecShifted
       SUBTRACT binAStartsVecShifted binAStartsVec  binWidth
       PRINT binWidth


    it will print out the differences between the "fence-posts" as:

       binWidth: (0.010000228881835938 0.010000228881835938 0.010000228881835938 0.010000228881835938 0.010000228881835938 -7.650001049041748)

    Ignore the last number, -7.65... It appears because there is no following fence-post to subtract from the last bin start.

    This inaccuracy in the bin widths is the result of my code doing the calculation using Java's "float" type, then converting the results to Java's "double" type. The double has twice the significant digits of the float, so the lower half of the digits are just noise, but in this case the noise adds to the number, widening the bins.
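
    The effect can be reproduced outside Java. The following Python sketch is only an illustration and is not the Statistics101 source: Python has no 32-bit float type, so the made-up helper `to_float32` uses `struct` to round a value to IEEE 754 single precision (the same format as Java's float) and then widens it back to 64 bits. The two bin starts, 7.58 and 7.59, are taken from the data in this thread:

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

bin_size = to_float32(0.01)
start_a = to_float32(7.58)
start_b = to_float32(7.59)

# After widening back to 64 bits, the extra digits are no longer zero.
print(repr(bin_size))
print(repr(start_b - start_a))
print(start_b - start_a > 0.01)  # the gap between these two starts exceeds 0.01
```

    For this particular pair of starts the gap comes out slightly above 0.01, in line with the widths printed above. Other pairs can come out slightly below 0.01, so the sketch shows the kind of noise involved and not the exact computation inside HISTOGRAMDATA.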

    I thought that was the source of the problem, so I changed my histogram computations to use the double type everywhere. After that, computing the difference as above prints out this binwidth:

       binWidth: (0.009999999999999787 0.009999999999999787 0.009999999999999787 0.009999999999999787 0.009999999999999787 0.009999999999999787 -7.649999999999999)

    This change yielded different, but still incorrect results. Eventually I found that when computing which bin a sample belonged in I was doing a truncation to integer instead of rounding to the closest integer. In case you cared to know :-)
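
    To see why truncation and rounding can disagree, here is a minimal Python sketch. It assumes the bin index is derived as (value - first bin start) / bin size, which is a guess made for illustration, since the actual Statistics101 code is not shown in this thread; both function names are hypothetical:

```python
def bin_index_truncated(value, first_start, bin_size):
    """Bin index obtained by truncating the quotient toward zero."""
    return int((value - first_start) / bin_size)

def bin_index_rounded(value, first_start, bin_size):
    """Bin index obtained by rounding the quotient to the nearest integer."""
    return round((value - first_start) / bin_size)

# 7.60 sits two bin widths above 7.58, so index 2 is the intended answer.
print((7.60 - 7.58) / 0.01)                    # slightly less than 2 in 64-bit floats
print(bin_index_truncated(7.60, 7.58, 0.01))   # truncation drops to the lower bin
print(bin_index_rounded(7.60, 7.58, 0.01))     # rounding recovers the intended bin
```

    With 64-bit doubles the quotient lands just under 2, so truncation puts 7.60 in the same bin as 7.59, which matches the symptom Gus reported. Whether rounding is also the right treatment for values that lie in the middle of a bin depends on details of the real implementation that are not given here.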

    I fixed that and now, I think it's ok. Before I put the new version up on the website, I would like you to try it out and see if it solves your problem. Send me your email address (don't post it here) and I'll reply, sending you the installer for the corrected version. It's 3.2MBytes. If that's too big for your mailbox, let me know.

    John





    John Grosberg
    www.statistics101.net