χ2 Analysis of Daily Cosmic Ray Muon Cycle

This activity serves a few purposes.  First, it is intended to give you some experience in downloading, importing, and analyzing data from an internet source; in this case, the Adelaide Muon Monitor.  Secondly,  it is a good opportunity to learn how to apply the “χ2 test of goodness of fit” to a fairly simple, and potentially interesting, data sample.  The χ2 test is a powerful statistical analysis tool that is more universal than the regression approach of Excel’s Trendline feature.  Although it might be ambitious to have a class of high school students perform a χ2 test of data, it is quite possible that a more advanced student might try it for extra credit, for an authentic research project, or in the course of a science internship.

Directions:

2. Click on the file labeled 200308ab.txt to open it, and then save it to a folder.

3. Open the Excel program and open the Adelaide data file that you saved.    Be sure to click Text Files or All Files where it says Files of Type: at the bottom of the Open box.  Click Finish when the Text Import Wizard appears.  Click File, then Save As, and indicate Microsoft Excel Workbook at the bottom of the Save As box where it says Save as Type:.  I recommend that you save the file periodically during the course of the activity, in case your computer rebels and forces you to close the workbook suddenly.

4. At this point you will have a spreadsheet with 15 columns and 2976 rows of data (Who-Ha!).  We will only be using four columns for this activity.  First of all, you should delete columns G through O, and columns C and D.  This will leave four columns of data.  Next, insert 2 rows at the top of the spreadsheet.  To do this, highlight the first two rows, click Insert, and then Rows.  You may wish to write Adelaide: August 2003 in cell A1 for reference.  Then label columns A through D as day, sec (UTC), pres (mB), and rate (Hz), respectively.  The day column is actually days since 1904 (as in the year).  Your spreadsheet should look like this at this point:

5. This would be a good time to adjust the data for barometric pressure.  To do this, graph rate vs pressure, then add a linear trendline with its equation and R2 value:

6. You will notice that the fit is strong (large R2 value) and that the slope is negative, indicating a strong anticorrelation between muon rate and barometric pressure.  Make note of the slope value.  Return to the spreadsheet and label column E “pcor (Hz)”.  Pcor is the abbreviation that I use for “pressure-corrected muon rate”.  Click on cell E3 and click the Edit Formula (or =) button.  Adjust the muon rate for pressure in the same manner as in the previous analysis activity:

pcor = rate + (pres – 1013)*(slope)

Note that the slope is the positive value of the slope from your rate vs pressure graph, and 1013 is the barometric pressure at sea level in millibars (mB).  Drag this formula down the entire E column.
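If you'd like to check this correction outside the spreadsheet, here is a minimal Python sketch.  The slope of -0.04 Hz/mB and the sample readings are assumed values chosen for illustration, not the actual fit result from your graph:

```python
# Sketch of the pressure correction.  SLOPE is an assumed value; use the
# slope from your own rate-vs-pressure trendline.
SLOPE = -0.04          # Hz per mB, from the linear fit (assumed)
SEA_LEVEL_MB = 1013    # reference barometric pressure at sea level

def pressure_corrected_rate(rate, pres):
    """Shift each rate to what it would be at sea-level pressure.

    Adding (pres - 1013) * |slope| raises rates measured at high
    pressure and lowers rates measured at low pressure, undoing the
    anticorrelation seen in the trendline.
    """
    return rate + (pres - SEA_LEVEL_MB) * abs(SLOPE)

rates = [46.3, 46.7, 46.5]            # Hz, illustrative values
pressures = [1020.0, 1005.0, 1013.0]  # mB, illustrative values
pcor = [pressure_corrected_rate(r, p) for r, p in zip(rates, pressures)]
```

A reading taken exactly at 1013 mB is left unchanged, while one taken at 1020 mB is bumped up by 7 × 0.04 = 0.28 Hz.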

7. Label column F “hour of day”.  Write .25 in cell F3 and .5 in F4.  Highlight the two cells, click on the lower right corner, and drag this down to 24.  Copy the 96 cells that result and paste them below.  Keep doing this until you fill the entire column.  You may want to copy two days of values, then four days, to make it go faster.  Or, you might know some slick shortcut for this.  Check to see that your last data cell is 24.  If it isn’t, you may have made an error someplace and should re-check your work.

8. At this point we want to begin analysis of any potential daily cycle associated with muon rates.  To do this you’ll want to create a table to the right of the one that you’ve already created.  I suggest writing “hour of day” in cell H3 to label the first column of this table.  Fill this column with the quarter-hour values .25 through 24 by copying the values from the first day in column F, or by typing .25 and .5 in the first two cells and dragging down.

9. Label column I “avg rate (Hz)”, then click cell I3.  At this point you could sort the entire data table by hour and start calculating averages, but this would take a long time.  Excel provides a nifty logic-based shortcut for this operation, however, that will save you a lot of time.  Click on the Edit Formula (=) button while in cell I3 and type the formula:

=AVERAGE(IF(F$3:F$2978=H3,E$3:E$2978))

What this does is instruct the program to average all of the pcor values in column E whose corresponding hour of day values in column F equal the value in H3 (.25 for the first row).  Note that typing $ before a row number or column letter makes it an absolute reference that will not change as you drag the formula.  You want to include this for every reference except H3, which should change as you apply this formula to all hour of day values.

Note:        You must press Control + Shift + Enter (Apple + Shift + Enter for Macs) when entering this formula in order for this operation to execute properly.

Drag this formula down column I for all of the daily quarter hour values.
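The array-formula averaging can also be sanity-checked outside Excel.  Here is a short Python sketch of the same idea (the function name and the toy values are mine; the logic matches the formula above: for each distinct hour-of-day value, average every pcor whose hour of day matches it):

```python
from collections import defaultdict

def average_by_hour(hours, pcor):
    """Group pressure-corrected rates by quarter-hour-of-day and average,
    mirroring =AVERAGE(IF(F$3:F$2978=H3, E$3:E$2978))."""
    buckets = defaultdict(list)
    for h, r in zip(hours, pcor):
        buckets[h].append(r)
    return {h: sum(v) / len(v) for h, v in buckets.items()}

# Two toy "days" of data at two quarter-hour slots each
hours = [0.25, 0.5, 0.25, 0.5]
pcor = [46.4, 46.6, 46.6, 46.8]
avg = average_by_hour(hours, pcor)
```

With the toy numbers, the .25 slot averages to 46.5 Hz and the .5 slot to 46.7 Hz.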

10. Label column J “sigma (Hz)”.  Remember that the estimated uncertainty of a rate is given as:

sigma = (rate/time)^(1/2)

Note that for this application, the total time is 900 seconds times 31 (the number of days in August), and don’t forget that SQRT is the Excel square-root function.
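The sigma calculation works out as follows in a quick Python sketch (the 46.5 Hz sample rate is illustrative):

```python
import math

SECONDS_PER_BIN = 900   # seconds in each quarter-hour interval
DAYS = 31               # days of data in August

def rate_sigma(avg_rate):
    """Counting (Poisson) uncertainty on a rate: sigma = sqrt(rate/time),
    where time is the total seconds contributing to one quarter-hour bin
    (900 s per day times 31 days)."""
    total_time = SECONDS_PER_BIN * DAYS
    return math.sqrt(avg_rate / total_time)

sigma = rate_sigma(46.5)
```

For an average rate near 46.5 Hz this gives sigma of about 0.0408 Hz, which matches the values that appear in the tables later in this activity.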

11. If you now graph avg rate vs hour of day and add error bars using the sigma that you’ve calculated, you should get something that looks like this:

It seems clear from this graph that there is some statistically significant variation in the muon rate over the course of an average Adelaide day in August 2003.  At this point, it would be useful to fit this plot to some type of curve.  Although Excel’s Trendline feature is a powerful tool for certain types of fits (the linear rate vs pressure curve done earlier in this activity, for example), it has limitations.  First of all, it cannot properly combine data that has different uncertainties. This actually is not a big problem in our situation, because our sigma values are essentially the same for all values on our graph. More importantly, however, it is limited in the number of function forms available for fitting data.  In the avg rate vs hr of day graph above, for example, it seems that a trig function might work well, but Excel provides no trig functions for fitting.  A more suitable method in this case is the χ2 (chi2) test of goodness of fit.

12. To begin this analysis, it is first necessary to choose a function to fit the data to.  In this case, I’m going to choose a sine function of the form:

rate = avg rate + amplitude*sin[2π(t – t0)/24]

where t is the hour of day, and t0 is the time at which the curve begins at the average value.  You are going to want to be able to adjust the curve by varying the avg rate, amplitude, and t0, so assign cells at the bottom of your table (the smaller one at right) to hold these values, and label them.  Just by eyeballing the graph, you can pick rough starting values for these variables, as indicated below:

    hr of day   avg rate (Hz)   sigma (Hz)
    23.25       46.401078       0.040781
    23.5        46.5724582      0.040857
    23.75       46.4877843      0.040819
    24          46.5060385      0.040827

    average     46.5
    amplitude   0.1
    t0          -1
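For reference, the sine model is easy to transcribe into Python (the default parameter values below are just the eyeballed starting guesses; this is a sketch, not part of the spreadsheet itself):

```python
import math

def predicted_rate(t, avg=46.5, amplitude=0.1, t0=-1.0):
    """The sine model from the activity:
    rate = avg + amplitude * sin(2*pi*(t - t0)/24).
    Defaults are the rough eyeballed starting values."""
    return avg + amplitude * math.sin(2 * math.pi * (t - t0) / 24)
```

At t = t0 the curve passes through the average value, and six hours later it reaches the average plus the full amplitude.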

You will want to refer to these cells when you write your formula.  Don’t forget to use a $ in these cell references to indicate that they are absolute and should not vary by row when you drag the function.  At this point you should label column K “predicted” to indicate the value predicted by your function.  Click on cell K3, enter your function, and drag this down for all hour of day values.  If you click on your rate vs hour of day graph, click Chart, then Source Data and Series, you can add your predicted values column to your graph, which should then look like this:

This seems to fit the data reasonably well, but we want a more quantitative assessment of fit, plus, it would be helpful to be able to adjust the variables we chose to improve the fit.

13. To begin assessing how well the Adelaide data fit the sine function that you’ve chosen, you first need to compare each data point to its predicted value.  To do this, label column L “Nsd”, for the number of standard deviations by which each point differs from its predicted value.  Click on cell L3, and enter a formula that subtracts the predicted value from the actual avg rate value and divides this difference by sigma.  Drag this formula down the entire column.

14. In order to make use of the statistical differences that you calculated in column L, you first need to square these terms.  Squaring makes every contribution positive, so that positive and negative deviations do not cancel when they are summed.  Label column M “Chi2 term”, click on cell M3, enter a formula that will multiply L3 by itself, and drag this down the entire column.  The table that you’ve created on the right side of your spreadsheet should now look like this:

    hr of day   avg rate (Hz)   sigma (Hz)   predicted   Nsd        Chi2 term
    0.25        46.55015018     0.0408468    46.53213     0.441208  0.194664
    0.5         46.57493375     0.0408577    46.53825     0.897843  0.806122
    0.75        46.57224436     0.0408565    46.54421     0.686214  0.47089
    1           46.54548657     0.0408448    46.54998    -0.10994   0.012087
    1.25        46.54860286     0.0408462    46.55553    -0.16964   0.028779
    1.5         46.55007034     0.0408468    46.56085    -0.2639    0.069643
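The Nsd and Chi2-term columns can be reproduced outside Excel with a few lines of Python (a sketch with my own function name; the logic is exactly the column formulas: Nsd = (measured - predicted)/sigma, then Nsd squared):

```python
def chi2_terms(avg_rates, predicted, sigma):
    """Number of standard deviations for each point, then its square.
    sigma is a single value because all bins here share (nearly) the
    same uncertainty."""
    nsd = [(m - p) / sigma for m, p in zip(avg_rates, predicted)]
    return [n * n for n in nsd]

# Toy check: a point 1 sigma high contributes 1.0; a perfect match, 0.0
terms = chi2_terms([1.0, 2.0], [0.5, 2.0], 0.5)
```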

15. The χ2 (Chi2) value is the sum of the Chi2 terms in column M.  Label a cell below your table “Chi2”, the cell to the right of this “N d.o.f.”, and the cell to the right of this “Prob”, as shown below:

    hr of day   avg rate (Hz)   sigma (Hz)   predicted   Nsd        Chi2 term
    23.25       46.401078       0.0407814    46.50622    -2.57816   6.646933
    23.5        46.57245819     0.0408566    46.51273     1.461893  2.137132
    23.75       46.48778434     0.0408195    46.51919    -0.7693    0.591823
    24          46.50603852     0.0408275    46.52556    -0.47818   0.228653

    average     46.5            Chi2         N d.o.f.    Prob
    amplitude   0.1
    t0          -1

Click on the cell immediately below where you entered Chi2, click the Edit Formula (=) button, and sum the contents of column M.  In the cell under N d.o.f., enter the number of degrees of freedom.  The number of degrees of freedom is the number of comparisons made with the distribution (in this case 96, for the 96 quarter-hour intervals in one day) minus the number of variables in the formula that was used (in this case three: average, amplitude, and t0).  Under Prob, click the Edit Formula button, type in CHIDIST, and enter the Chi2 cell for X and the N d.o.f. cell for Deg_freedom.  The statistics part of your table should look something like this now:

    hr of day   avg rate (Hz)   sigma (Hz)   predicted   Nsd        Chi2 term
    23.5        46.57245819     0.0408566    46.51273     1.461893  2.137132
    23.75       46.48778434     0.0408195    46.51919    -0.7693    0.591823
    24          46.50603852     0.0408275    46.52556    -0.47818   0.228653

    average     46.5            Chi2         N d.o.f.    Prob
    amplitude   0.1             101.9189     93          0.247404
    t0          -1

In general, data “fit” a function if the χ2 value is roughly the same as the number of degrees of freedom.  The probability indicates the likelihood of obtaining a χ2 value at least this large if our data did, in fact, follow the function chosen.  The .247404 probability value above, therefore, indicates that there is a roughly 25% chance of a χ2 this large or larger if the daily cycle of muon rates really did follow the sine function chosen.
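CHIDIST’s upper-tail probability can be cross-checked outside Excel.  The sketch below numerically integrates the χ2 probability density from the measured χ2 value upward; it is a rough homemade routine, not a library call, and the function name and step counts are my own:

```python
import math

def chi2_sf(x, k, upper=400.0, steps=20000):
    """Upper-tail chi-square probability (what Excel's CHIDIST returns):
    the integral of the chi2 density with k degrees of freedom from x to
    infinity, approximated by a trapezoid rule out to `upper`.
    Uses lgamma to keep the normalization from overflowing."""
    log_norm = (k / 2) * math.log(2) + math.lgamma(k / 2)

    def pdf(t):
        return math.exp((k / 2 - 1) * math.log(t) - t / 2 - log_norm)

    h = (upper - x) / steps
    total = 0.5 * (pdf(x) + pdf(upper))
    for i in range(1, steps):
        total += pdf(x + i * h)
    return total * h

prob = chi2_sf(101.9189, 93)
```

With χ2 = 101.9189 and 93 degrees of freedom this reproduces the 0.247404 value in the table above to within the integration accuracy.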

16.  The 25% probability above might not seem too encouraging at first glance, but keep in mind that you can adjust the three variables in your function to better fit the data.  Try doing this one variable at a time on your own.  Keep in mind, the objective is to end up with a lower χ2 value.  Fortunately, Excel offers a tool that can find the combination of values that will produce the lowest χ2 value.  To use it, you first need to click Tools, then Add-ins, and then check Solver Add-in.  After this, click Tools, and then Solver.  Set the Chi2 cell as the target cell, click Equal to: Min, click By changing cells and include the three cells with the average, amplitude, and t0, click Solve, accept the Solver solution, and presto!  You should notice a decrease in the χ2 value, and an increase in the probability value.  If you check your graph, you should notice that your sine function fits the data a little better.  Note that the χ2 test doesn’t give a cut-and-dried answer.  Some judgment is required.
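For the curious, Solver’s minimization can be imitated with a simple coordinate-search sketch in Python.  This is a crude stand-in, not Excel’s actual algorithm, and all names here are my own:

```python
import math

def model(t, avg, amp, t0):
    # the sine function from the activity: avg + amp*sin(2*pi*(t - t0)/24)
    return avg + amp * math.sin(2 * math.pi * (t - t0) / 24)

def chi2(params, hours, rates, sigma):
    """Sum of squared standardized residuals against the sine model."""
    avg, amp, t0 = params
    return sum(((r - model(t, avg, amp, t0)) / sigma) ** 2
               for t, r in zip(hours, rates))

def solver_min(hours, rates, sigma, start=(46.5, 0.1, -1.0)):
    """Vary one parameter at a time by a step; keep any change that
    lowers chi2; halve the steps whenever nothing improves."""
    params = list(start)
    step = [0.1, 0.05, 1.0]        # initial steps for avg, amp, t0
    best = chi2(params, hours, rates, sigma)
    for _ in range(200):
        improved = False
        for i in range(3):
            for delta in (step[i], -step[i]):
                trial = list(params)
                trial[i] += delta
                c = chi2(trial, hours, rates, sigma)
                if c < best:
                    best, params, improved = c, trial, True
        if not improved:
            step = [s / 2 for s in step]
    return params, best
```

Feeding it synthetic data generated from known parameters, it should drive χ2 well below the value at the eyeballed starting guesses.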

Finally, there is one caveat regarding this test.  The manner in which it was applied in this activity is appropriate only for data samples in which all data points have the same, or roughly the same, uncertainties.  If a data sample includes measurements taken under different conditions (over different time intervals, for example), then it would become necessary to “weight” each measured data point according to its uncertainty.  This procedure is not covered here.  If the need for such analysis arises, be sure to contact a worthy statistical analysis authority (a.k.a. Kevin McFarland) to learn how to do this properly.