How Can You Better Determine the Confidence Level of Your Data?
Published 2021-10-10 17:39
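Anyone who analyzes quality data has run into the question of how to determine the confidence level of that data. Mathematical statistics covers the purely mathematical models in detail, but the explanation below stays close to practice, is easy to follow, and is very approachable, especially the closing example about the exam scores of the article author's son.
I have excerpted it here to share with you and hope it helps. The English original is kept as written rather than replaced with my own clumsy translation: reading the original avoids distortion, it doubles as English reading practice, it is not difficult, and it is closely tied to quality work.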
How do you assess how much you really know when so much is unknown? The answer is confidence intervals. This is illustrated by the trivia quiz in Table 1. You can do well on the quiz without knowing the actual answers to any of the questions. For each question, provide a low and high guess such that you are 90% sure the correct answer falls between the two. If you succeed, you should have nine correct answers and only one wrong answer. That is, your answers should be correct 90% of the time.
Table 1
Question | Low | High |
1. What was the world's population in 2020? | | |
2. How many tons of CO2 were emitted globally in 2018? | | |
3. What were the revenues of Apple Inc. during their fiscal year 2020? | | |
4. How many square feet of floor space does the Empire State Building in New York City have? | | |
5. What is the Guinness World Record for the most people to fit in a VW Beetle? | | |
6. What is the circumference of the earth in miles? | | |
7. How many cubic yards of concrete are in the Hoover Dam? | | |
8. How many muscles are in the trunk of an African elephant? | | |
9. How many packages does UPS deliver each year? | | |
The answers are provided in Table 2. How many did you get correct? Were you surprised? You’re probably thinking, “Why does it matter whether I know any of this random trivia? I shouldn’t care if my confidence intervals are wrong because the questions are outside of my expertise.”
Table 2
But you should care. These trivia questions measure how well you know what you don’t know. If you had a lot of domain knowledge about the subject of one of the questions, your 90% confidence interval would be narrow. If you knew less, it would be wider. Either way, your 90% confidence intervals, by definition, should have captured the true answers 90% of the time.
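A minimal sketch of how such a quiz could be scored. The function name `hit_rate` and all the numbers are invented for illustration; they are not the answers from Table 2.

```python
def hit_rate(intervals, truths):
    """Fraction of (low, high) guesses that contain the matching true value."""
    hits = sum(low <= truth <= high
               for (low, high), truth in zip(intervals, truths))
    return hits / len(truths)

# Made-up guesses and true values, for illustration only.
guesses = [(10, 50), (100, 300), (1, 5), (0, 20), (40, 60)]
answers = [42, 350, 3, 7, 55]

rate = hit_rate(guesses, answers)
print(f"Captured {rate:.0%} of the true values")
```

A well-calibrated guesser aiming for 90% intervals would see this rate settle near 0.9 over many questions; a much lower rate suggests the intervals are too narrow.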
Confidence intervals allow you to estimate population parameters (such as the mean, standard deviation and proportions) with a known degree of certainty or confidence. When unsure about a decision, confidence intervals are better than providing a single-point estimate. The confidence interval is bounded by a lower and upper limit, which are determined by the risk associated with making a wrong conclusion about the parameter of interest. This is known as alpha risk and is stated in terms of probability.
For example, a 95% confidence level equates to a 0.05 or 5% alpha risk (1 - confidence level). For a 95% confidence interval, the area in each tail is 0.025 (0.05 / 2) (see Figure 1). With a normal distribution, the empirical rule states that about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations and 99.73% within three standard deviations (see Figure 2).
FIGURE 1
EXAMPLE: 95% CONFIDENCE INTERVAL
FIGURE 2
NORMAL DISTRIBUTION
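The relationship between confidence level, alpha risk, tail area and the empirical rule can be checked with a short sketch. It uses Python's standard-library `statistics.NormalDist`; the variable names and the helper `coverage` are chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

confidence = 0.95
alpha = 1 - confidence              # total alpha risk, 0.05
tail = alpha / 2                    # area in each tail, 0.025
z = std_normal.inv_cdf(1 - tail)    # Z value that leaves `tail` in the upper tail

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(f"Z for {confidence:.0%} confidence: {z:.2f}")
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(k):.2%}")
```

The exact Z value for 95% confidence is about 1.96 standard deviations; "two standard deviations" in the empirical rule is the rounded version and covers slightly more than 95%.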
There are five steps to calculating confidence ranges:
1. Collect historical data (n).
2. Calculate the mean (X̄). The Excel formula is “=average(A1:A10).”
3. Calculate the standard deviation (s). The Excel formula is “=stdev(A1:A10).”
4. Select the appropriate Z value (number of standard deviations) based on your confidence level (see Table 3).
5. Calculate the confidence ranges using the formula:
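The formula itself did not survive in this copy. A standard form for a confidence interval around a mean, assumed here because it approximately reproduces the limits in the example further on, is:

```latex
\text{confidence range} = \bar{X} \pm Z \cdot \frac{s}{\sqrt{n}}
```

Here $\bar{X}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $Z$ the value chosen in step 4 (replaced by a $t$ value when $n < 30$).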
TABLE 3
EMPIRICAL RULE TABLE (WHEN N ≥ 30)
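The five steps can be sketched in Python as follows. This is an illustration under assumptions, not the author's worksheet: the function name `confidence_range` and the data are made up, the Z value is computed with `statistics.NormalDist` instead of being read from Table 3, and the margin uses the standard form Z·s/√n.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_range(data, confidence=0.95):
    """Return (lower, upper) limits for the mean, assuming n >= 30."""
    n = len(data)                        # step 1: historical data
    x_bar = mean(data)                   # step 2: mean, like Excel AVERAGE
    s = stdev(data)                      # step 3: sample standard deviation, like Excel STDEV
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # step 4: two-tailed Z value
    margin = z * s / sqrt(n)             # step 5: assumed formula Z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical measurements: the values 50..54, each repeated six times (n = 30).
data = [50 + (i % 5) for i in range(30)]
low, high = confidence_range(data)
print(f"95% confidence range: {low:.2f} to {high:.2f}")
```

Raising the confidence level widens the range, and collecting more data narrows it, because the margin shrinks with the square root of n.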
Example
During Thanksgiving break one year, I asked my son how he thought he would do on his calculus II final exam in college. He said, “I think I’ll get an A, but I’m not very sure.”
I rephrased the question: “What’s your confidence range estimate on the final exam, assuming a 95% confidence level?”
During the semester, he took four calculus exams (n = 4) with the following scores: 84, 85, 92 and 87. We calculated the mean and standard deviation as 87 and 3.56, respectively. Because the sample size was less than 30, we used the Student’s t table, a more conservative variation of the Z table. The t value for a 95% confidence level (two-tailed) based on three degrees of freedom (n - 1) is 3.1824 (see Figure 3).
FIGURE 3
STUDENT’S T DISTRIBUTION TABLE
“Degrees of freedom” is a statistical term that refers to the number of independent observations in a sample minus the number of population parameters, which, in this case, was one. Plugging the values into our formula produced a lower limit of 82 and an upper limit of 93.
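The arithmetic of this example can be replayed in a few lines. The scores and the t value 3.1824 are taken from the text; the formula X̄ ± t·s/√n is an assumption, since the original formula is not shown in this copy.

```python
from math import sqrt
from statistics import mean, stdev

scores = [84, 85, 92, 87]    # the four exam scores from the text
n = len(scores)
x_bar = mean(scores)         # 87
s = stdev(scores)            # sample standard deviation, about 3.56
t_value = 3.1824             # 95% two-tailed, 3 degrees of freedom (from the text)

margin = t_value * s / sqrt(n)   # assumed formula: t * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"{lower:.1f} to {upper:.1f}")
```

Under this assumed formula the limits come out at roughly 81.3 and 92.7, close to the 82 and 93 quoted in the text; the small gap may come from rounding or from a slightly different formula in the original.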
My son then said: “Dad, based on my previous scores, I estimate my final score on the calculus final will fall somewhere between 82 and 93. I am 95% confident.”
I responded: “Michael, you’re a fast learner. But let’s try for the upper range, please.”
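Disclaimer:
Information about the original author is unavailable. To protect the copyright of the original work: the layout is the editor's own, and the post is provisionally marked as original on that basis. If there is any copyright concern about the work, please contact this account promptly and the content will be removed to protect your rights.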