Biostatistics with R/Printable version


Biostatistics with R

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Biostatistics_with_R

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

Biostatistics with R authors

License

The text of this book is released under the terms of the Creative Commons Attribution-ShareAlike 3.0 and GNU Free Documentation License. The particular version of that license that is being used can be found at:

Wikibooks:Creative Commons Attribution-ShareAlike 3.0 Unported License
Wikibooks:GNU Free Documentation License

Images used in this document are available under various licenses. Clicking on the image will take you to a description page where the licensing information is displayed.



A Brief Introduction To R/The First Step in R

What is R?

How to install R

RStudio

Use R package

Data Entry to R

Some Special Values

Reference



Import

Why R for biostatistics?

R is superior to common statistical packages such as SPSS, SAS and MINITAB because it is

  • powerful
  • available for many platforms (Mac OS X, Windows, Linux etc.)
  • programmable
  • non-commercial
  • extensively documented

Obtaining R/Installation

For installation instructions, refer to the R FAQ.

Data Import

The data sets available on Wiley's website come in CSV, Excel, MINITAB, SAS and SPSS formats. Although R can import MINITAB, SAS and SPSS files through the foreign package, and Excel files through other add-on packages, you should download the data in CSV format because it is the easiest format to process in R.

For example, suppose you would like to import the "Large Data set" data file. Assuming the downloaded data file (LDS_C02_NCBIRTH800.csv) is stored in the directory "/Desktop", it can be imported into R as a data frame called "largedataset" using the following syntax:

> largedataset <- read.csv("/Desktop/LDS_C02_NCBIRTH800.csv", header=TRUE,na.strings="NA")

If you prefer to choose the data file the standard "point-and-click" GUI way, you may use the function file.choose(), i.e.

> largedataset <- read.csv(file.choose(), header=TRUE, na.strings="NA")

You should now have imported the data from the CSV file into a data frame called "largedataset". You may look inside the data frame by calling its name

> largedataset
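
If you want to try the import step without the textbook file, the following minimal sketch may help. It writes a small made-up data frame (the object demo and its columns are hypothetical, not part of the Wiley data) to a temporary CSV file and reads it back with the same read.csv() arguments used above.

```r
# Hypothetical miniature data set standing in for the downloaded CSV file.
demo <- data.frame(sex   = c("M", "F", "F", NA, "M"),
                   weeks = c(39, 40, NA, 38, 41))

# Write it to a temporary CSV file, then import it the same way as above.
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)
imported <- read.csv(csv_path, header = TRUE, na.strings = "NA")

str(imported)    # compact overview: one line per column
head(imported)   # first rows only, handy for large data frames
```

For a large data frame, str() and head() are usually easier to read than printing the whole object by name.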

You can access the variable (in computer lingo, the column) "sex" inside the largedataset data frame with

> largedataset$sex

For example, to count the frequency of each value of sex:

> table(largedataset$sex)

You can attach the data frame so that you can refer to its variables directly

> attach(largedataset)
> table(sex)
> detach(largedataset) # undo the attach
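
As an alternative to attach(), the with() function evaluates an expression inside a data frame without changing the search path. A minimal sketch, using a made-up data frame named births in place of largedataset:

```r
# Hypothetical stand-in for the imported data frame.
births <- data.frame(sex = c("F", "M", "F", "F", "M"))

# with() looks up "sex" inside the data frame,
# so no attach()/detach() pair is needed.
sex_counts <- with(births, table(sex))
sex_counts
```

Because nothing stays attached afterwards, there is no risk of a forgotten detach() masking other objects later in the session.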

Basic data management

R is designed to be an analysis system rather than an integrated environment such as SPSS. Unlike SPSS, R doesn't have a spreadsheet-like environment for data input. Usually data are entered using different software (e.g. a database or spreadsheet software such as OpenOffice.org Calc) and then imported into R as described above. For quick one-off calculations, you can do the data entry in R. For example, if you want to calculate the mean age of ten patients (30,31,32,34,35,36,37,30,40,45), you can enter the data into R using the c() function.

> pt_age <- c(30,31,32,34,35,36,37,30,40,45)

You may call the newly created object pt_age by its name...

> pt_age

...and then calculate the mean age of the ten patients.

> mean (pt_age)
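
To see what mean() computes, the same value can be obtained step by step from the sum and the number of observations. A short sketch using the ten ages above (the name manual_mean is only illustrative):

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)

# The arithmetic mean is the sum of the observations divided by their number.
manual_mean <- sum(pt_age) / length(pt_age)

manual_mean      # 350 / 10 = 35
mean(pt_age)     # the built-in function gives the same value
```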


Introduction to Biostatistics

REVIEW EXERCISES

1. Explain what is meant by descriptive statistics.

2. Explain what is meant by inferential statistics.

3. Define: (a) Statistics (b) Biostatistics (c) Variable (d) Quantitative variable (e) Qualitative variable (f) Random variable (g) Population (h) Finite population (i) Infinite population (j) Sample (k) Discrete variable (l) Continuous variable (m) Simple random sample (n) Sampling with replacement (o) Sampling without replacement

4. Define the word measurement.

5. List, describe, and compare the four measurement scales.

6. For each of the following variables, indicate whether it is quantitative or qualitative and specify the measurement scale that is employed when taking measurements on each: (a) Class standing of the members of this class relative to each other (b) Admitting diagnosis of patients admitted to a mental health clinic (c) Weights of babies born in a hospital during a year (d) Gender of babies born in a hospital during a year (e) Range of motion of elbow joint of students enrolled in a university health sciences curriculum (f) Under-arm temperature of day-old infants born in a hospital

7. For each of the following situations, answer questions a through e:
(a) What is the sample in the study?
(b) What is the population?
(c) What is the variable of interest?
(d) How many measurements were used in calculating the reported results?
(e) What measurement scale was used?
Situation A. A study of 300 households in a small southern town revealed that 20 percent had at least one school-age child present.
Situation B. A study of 250 patients admitted to a hospital during the past year revealed that, on the average, the patients lived 15 miles from the hospital.

8. Consider the two situations given in Exercise 7. For Situation A describe how you would use a stratified random sample to collect the data. For Situation B describe how you would use systematic sampling of patient records to collect the data.



Descriptive Statistics

Summary of Formulas with R

Formula number | Name
2.3.1 | Class interval width using Sturges's Rule
2.4.1 | Mean of a population
2.4.2 | Skewness
2.4.2 | Mean of a sample
2.5.1 | Range
2.5.2 | Sample variance
2.5.3 | Population variance
2.5.4 | Standard deviation
2.5.5 | Coefficient of variation
2.5.6 | Quartile location in ordered array
2.5.7 | Interquartile range
2.5.8 | Kurtosis
Symbol Key
  • C.V. = coefficient of variation
  • IQR = interquartile range
  • k = number of class intervals
  • μ = population mean
  • N = population size
  • n = sample size
  • (n − 1) = degrees of freedom
  • Q1 = first quartile
  • Q2 = second quartile = median
  • Q3 = third quartile
  • R = range
  • s = standard deviation
  • s² = sample variance
  • σ² = population variance
  • x_i = data observation
  • x_L = largest data point
  • x_S = smallest data point
  • x̄ = sample mean
  • w = class width
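
Most of the quantities named in this summary have direct counterparts in base R. The following is a minimal sketch using the ten patient ages entered earlier in this book. Two choices here are assumptions rather than the textbook's prescription: the number of classes from Sturges's rule is rounded up with ceiling(), and IQR() uses R's default quantile rule, which may differ slightly from the quartile location formula in the text.

```r
# Ages of the ten patients entered earlier in this book.
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n <- length(pt_age)

age_range <- max(pt_age) - min(pt_age)   # range: largest minus smallest value
k <- ceiling(1 + 3.322 * log10(n))       # Sturges's rule, rounded up (assumption)
w <- age_range / k                       # class interval width
s2 <- var(pt_age)                        # sample variance (divisor n - 1)
s <- sd(pt_age)                          # sample standard deviation
sigma2 <- s2 * (n - 1) / n               # variance with divisor n (population formula)
cv <- 100 * s / mean(pt_age)             # coefficient of variation, in percent
iqr <- IQR(pt_age)                       # Q3 - Q1 with R's default quantile rule
```

Note that var() and sd() always use the divisor n − 1, so the population variance has to be rescaled by hand as shown.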


The Ordered Array

The Frequency Distribution

Example 2.2.1 details the procedure for sorting an array. The array is a series of ages of subjects enrolled in two kinds of smoking cessation program. Suppose you have already imported the data set using the following command:

> SmokeCProg <- read.csv("/EXA_C01_S04_01.csv", header=TRUE, na.strings="NA")

It is better to use a descriptive name (SmokeCProg for Smoking Cessation Program) than a commonly used placeholder name such as x or y. We can obtain a sorted array of ages using the following command:

> sort(SmokeCProg$AGE)

The frequency distribution of ages shown in Table 2.3.1 can be obtained using:

> table(cut(SmokeCProg$AGE, breaks=c(0,39,49,59,69,79,89)))
(0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
    11      46      70      45      16       1 

The cut() command splits the AGE variable into intervals at the break points (0,39,49,59,69,79,89) provided, and table() counts the observations in each interval. Each interval is open on the left and closed on the right, so (39,49] holds the ages 40 through 49. Table 2.3.2 provides the frequency table of ages. As suggested by Venables et al. in the book "An Introduction to R", statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Compared to other statistical packages, R gives only minimal output. We will demonstrate this important characteristic in this example. In the previous example, we calculated the frequency distribution of ages using the table() and cut() commands. We can store the result in an object called "AgeFreqTable" using:

> AgeFreqTable <- table(cut(SmokeCProg$AGE, breaks=c(0,39,49,59,69,79,89)))

You will get no output until you call the object "AgeFreqTable"

> AgeFreqTable
(0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
    11      46      70      45      16       1

To obtain the cumulative frequency, we can process the object "AgeFreqTable" using the cumsum() command

> cumsum(AgeFreqTable)
(0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
    11      57     127     172     188     189

Before we move on to the calculation of relative frequency, we can obtain the total number of observations in a variable using the length() function

> length(SmokeCProg$AGE)
[1] 189

We can calculate the relative frequency by dividing each item in the object "AgeFreqTable" by the total number of observations using

> AgeFreqTable/length(SmokeCProg$AGE)
    (0,39]     (39,49]     (49,59]     (59,69]     (69,79]     (79,89] 
0.058201058 0.243386243 0.370370370 0.238095238 0.084656085 0.005291005

Similarly, the cumulative relative frequency can be calculated using

> cumsum(AgeFreqTable)/length(SmokeCProg$AGE)
    (0,39]    (39,49]    (49,59]    (59,69]    (69,79]    (79,89] 
0.05820106 0.30158730 0.67195767 0.91005291 0.99470899 1.00000000

If you would like to round the relative frequencies to 4 digits, you can use the round() function

> round(AgeFreqTable/length(SmokeCProg$AGE), digits=4)
 (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
 0.0582  0.2434  0.3704  0.2381  0.0847  0.0053 

Alternatively, you can store the results of relative frequency in a new object and then process that object with round() function

> AgeRelFreqTable <- AgeFreqTable/length(SmokeCProg$AGE)
> round (AgeRelFreqTable, digits=4)
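
The individual steps above can also be collected into one table. The following sketch does not need the data file: it re-enters the six class counts printed in the AgeFreqTable output above and builds a data frame whose column names (frequency, cum_frequency and so on) are only illustrative.

```r
# Class counts copied from the AgeFreqTable output shown above.
age_freq <- c("(0,39]" = 11, "(39,49]" = 46, "(49,59]" = 70,
              "(59,69]" = 45, "(69,79]" = 16, "(79,89]" = 1)
n_obs <- sum(age_freq)   # total number of observations

freq_table <- data.frame(
  frequency     = age_freq,                   # class frequencies
  cum_frequency = cumsum(age_freq),           # cumulative frequencies
  rel_frequency = age_freq / n_obs,           # relative frequencies
  cum_rel_freq  = cumsum(age_freq) / n_obs    # cumulative relative frequencies
)
freq_table
```

Storing everything in one data frame keeps the intermediate results together, in line with the step-by-step style of analysis described earlier.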

Exercise: Try to round the cumulative relative frequencies to 4 digits using an R command.

To plot a histogram, you can use the hist() function, e.g.

> hist(SmokeCProg$AGE)

You can customize the histogram by adding arguments (i.e. options); type ?hist to learn more about the arguments of the hist() function. For example, to plot a histogram with about five bars (similar to Figure 2.3.2), pass breaks=5. R treats a single number as a suggestion for the number of classes, so the plot may not have exactly five bars.

> hist(SmokeCProg$AGE, breaks=5)

You can add more arguments to the hist() function, e.g.

> hist(SmokeCProg$AGE, breaks=5, ylim=c(0,70), main="Histogram of Ages of 189 subjects", col="red", xlab="Age")
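
When the class limits matter, a vector of break points can be passed to breaks instead of a single number. The sketch below uses simulated ages (an assumption, so that it runs without the textbook file) and checks that hist() then counts the same classes as table() and cut().

```r
# Simulated ages, used only so the example runs without the textbook file.
set.seed(1)
ages <- round(runif(189, min = 30, max = 82))

# An explicit vector of break points forces exactly these class limits.
age_breaks <- c(0, 39, 49, 59, 69, 79, 89)
h <- hist(ages, breaks = age_breaks, plot = FALSE)   # plot = FALSE: compute only

h$counts                                  # one count per class interval
table(cut(ages, breaks = age_breaks))     # same counts from table() and cut()
```

Dropping plot = FALSE draws the histogram. Because the first class is wider than the others, hist() plots densities rather than counts by default in that case.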

Remember to consult the documentation (e.g. ?hist or help.search("histogram")) whenever you have a question. Most of the time, you can find the answer in the help pages. For example, suppose you don't know how to plot a stem-and-leaf display of your data, and you don't even know the name of the function. You can use help.search() to search for the keyword "stem", i.e.

> help.search("stem")

A function called stem() should be in the results. We then use this function to visualize our data

> stem(SmokeCProg$AGE)
The decimal point is 1 digit(s) to the right of the |
 3 | 04
 3 | 577888899
 4 | 00223333334444444
 4 | 55566666677777788888889999999
 5 | 0000000011112222223333333333333333344444444444
 5 | 555666666777777788999999
 6 | 000011111111111222222233444444
 6 | 556666667888999
 7 | 0111111123
 7 | 567888
 8 | 2

Unlike in MINITAB, the stem unit is adjusted through the scale argument. The plot above uses the default scale of 1, which here is equivalent to a stem unit of 5. To change the stem unit to 10, the value of the scale argument should be changed to 0.5

> stem(SmokeCProg$AGE, scale=0.5)
 The decimal point is 1 digit(s) to the right of the |
 3 | 04577888899
 4 | 0022333333444444455566666677777788888889999999
 5 | 00000000111122222233333333333333333444444444445556666667777777889999
 6 | 000011111111111222222233444444556666667888999
 7 | 0111111123567888
 8 | 2

Central Tendency



Some Basic Probability Concepts

Formulas with R

Formula number | Name
3.2.1 | Classical probability
3.2.2 | Relative frequency probability
3.3.1–3.3.3 | Properties of probability
3.4.1 | Multiplication rule
3.4.2 | Conditional probability
3.4.3 | Addition rule
3.4.4 | Independent events
3.4.5 | Complementary events
3.4.6 | Marginal probability
 | Sensitivity of a screening test
 | Specificity of a screening test
3.5.1 | Predictive value positive of a screening test
3.5.2 | Predictive value negative of a screening test
Symbol Key
  • D = disease
  • E = event
  • m = the number of times an event E_i occurs
  • n = sample size or the total number of times a process occurs
  • N = population size or the total number of mutually exclusive and equally likely events
  • P(Ā) = a complementary event; the probability of an event A not occurring
  • P(E_i) = probability of some event E_i occurring
  • P(A ∩ B) = an "intersection" or "and" statement; the probability of an event A and an event B occurring
  • P(A ∪ B) = a "union" or "or" statement; the probability of an event A or an event B or both occurring
  • P(A | B) = a conditional statement; the probability of an event A occurring given that an event B has already occurred
  • T = test results
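
The screening test measures listed above can be computed in a few lines of R. This is a minimal sketch: the four cell counts and the prevalence are invented for illustration, and the predictive values are obtained from Bayes's theorem using that assumed prevalence.

```r
# Hypothetical screening results (counts invented for illustration only).
#                  disease present   disease absent
# test positive          90                30
# test negative          10               170
tp <- 90; fp <- 30; fn <- 10; tn <- 170

sensitivity <- tp / (tp + fn)   # P(T | D): positive test given disease
specificity <- tn / (fp + tn)   # P(not T | not D): negative test given no disease
prevalence  <- 0.05             # assumed P(D) in the target population

# Predictive values via Bayes's theorem.
ppv <- sensitivity * prevalence /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv <- specificity * (1 - prevalence) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv), 4)
```

With a prevalence of 5 percent, even this fairly accurate hypothetical test has a predictive value positive of only 0.24, which illustrates why the prevalence has to be stated.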


Probability Distributions

Summary of Formulas with R

Formula number | Name
4.2.1 | Mean of a frequency distribution
4.2.2 | Variance of a frequency distribution
4.3.1 | Combination of objects
4.3.2 | Binomial distribution function
4.3.3–4.3.5 | Tabled binomial probability equalities
4.4.1 | Poisson distribution function
4.6.1 | Normal distribution function
4.6.2 | z-transformation
4.6.3 | Standard normal distribution function
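
Several of the formulas named above correspond to built-in R functions. The sketch below uses made-up numbers throughout (the probability distribution, the binomial and Poisson parameters and the normal example are all assumptions chosen for illustration).

```r
# Mean and variance of a (hypothetical) discrete probability distribution.
x <- 0:3
p <- c(0.1, 0.3, 0.4, 0.2)
mu <- sum(x * p)                       # mean
sigma2 <- sum((x - mu)^2 * p)          # variance, definitional form
sigma2_alt <- sum(x^2 * p) - mu^2      # variance, computational form

n_comb  <- choose(5, 2)                      # combinations of 5 objects taken 2 at a time
p_binom <- dbinom(3, size = 10, prob = 0.5)  # binomial P(X = 3)
p_cum   <- pbinom(3, size = 10, prob = 0.5)  # cumulative binomial P(X <= 3)
p_pois  <- dpois(2, lambda = 1.5)            # Poisson P(X = 2)

z <- (120 - 100) / 15                  # z-transformation of x = 120, mean 100, sd 15
p_norm <- pnorm(z)                     # standard normal P(Z <= z)
```

The d functions return probabilities of single values and the p functions return cumulative probabilities, which replaces the printed tables used in the textbook.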
Symbol Key


Some Important Sampling Distributions

Summary of Formulas with R

Formula number | Name
5.3.1 | z-transformation for sample mean
5.4.1 | z-transformation for difference between two means
5.5.1 | z-transformation for sample proportion
5.5.2 | Continuity correction when x < np
5.5.3 | Continuity correction when x > np
5.6.1 | z-transformation for difference between two proportions
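
The z-transformations for a sample mean and a sample proportion can be sketched in R as follows. All numbers are hypothetical, and the continuity correction is deliberately left out to keep the example short.

```r
# Hypothetical population and sample values, chosen only for illustration.
mu <- 100; sigma <- 15        # population mean and standard deviation
n <- 25; xbar <- 106          # sample size and observed sample mean

z_mean <- (xbar - mu) / (sigma / sqrt(n))     # z for a sample mean
p_mean <- pnorm(z_mean, lower.tail = FALSE)   # P(sample mean > 106)

p <- 0.4; n_prop <- 150; p_hat <- 0.45        # hypothetical proportion example
z_prop <- (p_hat - p) / sqrt(p * (1 - p) / n_prop)   # z for a sample proportion
p_prop <- pnorm(z_prop, lower.tail = FALSE)   # P(sample proportion > 0.45)
```

Setting lower.tail = FALSE returns the upper-tail area directly, which avoids computing 1 - pnorm(z) by hand.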
Symbol Key


Estimation

Summary of Formulas with R

Formula number | Name
6.2.1 | Expression of an interval estimate: estimator ± (reliability coefficient) × (standard error of the estimator)
6.2.2 | Interval estimate for μ when σ is known
6.3.1 | t-transformation
6.3.2 | Interval estimate for μ when σ is unknown
6.4.1 | Interval estimate for the difference between two population means when σ1 and σ2 are known
6.4.2 | Pooled variance estimate
6.4.3 | Standard error of estimate
6.4.4 | Interval estimate for the difference between two population means when σ is unknown
6.4.5 | Cochran's correction for reliability coefficient when variances are not equal
6.4.6 | Interval estimate using Cochran's correction for t
6.5.1 | Interval estimate for a population proportion
6.6.1 | Interval estimate for the difference between two population proportions
6.7.1–6.7.3 | Sample size determination when sampling with replacement
6.7.4–6.7.5 | Sample size determination when sampling without replacement
6.8.1 | Sample size determination for proportions when sampling with replacement
6.8.2 | Sample size determination for proportions when sampling without replacement
6.9.1 | Interval estimate for σ²
6.9.2 | Interval estimate for σ
6.10.1 | Interval estimate for the ratio of two variances
6.10.2 | Relationship among F ratios
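
As an illustration of the general form "estimator ± (reliability coefficient) × (standard error)", the sketch below computes a 95% interval estimate for a mean when the population standard deviation is unknown. It reuses the ten patient ages entered earlier in this book and compares the manual result with t.test(); the object names are only illustrative.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n <- length(pt_age)

estimator <- mean(pt_age)               # point estimate of the mean
std_error <- sd(pt_age) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)         # reliability coefficient for 95%

ci_manual <- estimator + c(-1, 1) * t_crit * std_error

# The same interval from the built-in one-sample t procedure.
ci_builtin <- t.test(pt_age, conf.level = 0.95)$conf.int
ci_manual
ci_builtin
```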
Symbol Key


Hypothesis Testing

Summary of Formulas with R

Formula number | Name
7.1.1, 7.1.2, 7.2.1 | z-transformation (using either μ or μ0)
7.2.2 | t-transformation
7.2.3 | Test statistic when sampling from a population that is not normally distributed
7.3.1 | Test statistic when sampling from normally distributed populations: population variances known
7.3.2 | Test statistic when sampling from normally distributed populations: population variances unknown and equal
7.3.3, 7.3.4 | Test statistic when sampling from normally distributed populations: population variances unknown and unequal
7.3.5 | Sampling from populations that are not normally distributed
7.4.1 | Test statistic for paired differences when the population variance is unknown
7.4.2 | Test statistic for paired differences when the population variance is known
7.5.1 | Test statistic for a single population proportion
7.6.1, 7.6.2 | Test statistic for the difference between two population proportions
7.7.1 | Test statistic for a single population variance
7.8.1 | Variance ratio
7.9.1, 7.9.2 | Upper and lower critical values for x̄
7.10.1, 7.10.2 | Critical value for determining sample size to control type II errors
7.10.3 | Sample size to control type II errors
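
A one-sample t test can be written out by hand and checked against t.test(). In this sketch the data are the ten patient ages entered earlier in this book, and the hypothesized mean mu0 = 33 is an arbitrary value chosen for illustration.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n <- length(pt_age)
mu0 <- 33                                   # hypothesized mean (assumption)

# t-transformation: distance of the sample mean from mu0 in standard errors.
t_stat <- (mean(pt_age) - mu0) / (sd(pt_age) / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1) # two-sided p value

# The built-in test reports the same statistic and p value.
res <- t.test(pt_age, mu = mu0)
res
```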
Symbol Key


Analysis of Variance

Summary of Formulas with R

Formula number | Name
8.2.1 | One-way ANOVA model
8.2.2 | Total sum-of-squares
8.2.3 | Within-group sum-of-squares
8.2.4 | Among-group sum-of-squares
8.2.5 | Within-group variance
8.2.6 | Among-group variance I
8.2.9 | Tukey's HSD (equal sample sizes)
8.2.10 | Tukey's HSD (unequal sample sizes)
8.3.1 | Two-way ANOVA model
8.3.2 | Sum-of-squares representation
8.3.3 | Sum-of-squares total
8.3.4 | Sum-of-squares block
8.3.5 | Sum-of-squares treatments
8.3.6 | Sum-of-squares error
8.4.1 | Fixed-effects, additive single-factor, repeated-measures ANOVA model
8.4.2 | Fixed-effects, additive two-factor, repeated-measures ANOVA model
8.5.1 | Two-factor completely randomized fixed-effects factorial model
8.5.2 | Probabilistic representation of a
8.5.3 | Sum-of-squares total I
8.5.4 | Sum-of-squares total II
8.5.5 | Sum-of-squares treatment partition
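
The partition of the total sum of squares in one-way ANOVA can be verified directly. The sketch below uses nine invented measurements in three groups and compares the hand-computed sums of squares with the aov() table; TukeyHSD() then gives the pairwise comparisons.

```r
# Hypothetical measurements in three treatment groups.
y <- c(5, 7, 6, 9, 11, 10, 14, 13, 15)
group <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)     # one mean per group
n_j         <- tapply(y, group, length)   # group sizes

sst <- sum((y - grand_mean)^2)                    # total sum of squares
ssw <- sum((y - ave(y, group))^2)                 # within-group sum of squares
ssa <- sum(n_j * (group_means - grand_mean)^2)    # among-group sum of squares

fit <- aov(y ~ group)     # the same analysis with the built-in function
summary(fit)
TukeyHSD(fit)             # pairwise comparisons of group means
```

Here ave(y, group) replaces each observation by its group mean, so the within-group deviations can be computed in one step.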
Symbol Key


Simple Linear Regression and Correlation

Summary of Formulas with R

Formula number | Name
9.2.1 | Assumption of linearity
9.2.2 | Simple linear regression model
9.2.3 | Error (residual) term
9.3.1 | Algebraic representation of a straight line
9.3.2 | Least square estimate of the slope of a regression line
9.3.3 | Least square estimate of the intercept of a regression line
9.4.1 | Deviation equation
9.4.2 | Sum-of-squares equation
9.4.3 | Estimated population coefficient of determination
9.4.4–9.4.7 | Means and variances of point estimators a and b
9.4.8 | z statistic for testing hypotheses about b
9.4.9 | t statistic for testing hypotheses about b
9.5.1 | Prediction interval for Y for a given X
9.5.2 | Confidence interval for the mean of Y for a given X
9.7.1–9.7.2 | Correlation coefficient
9.7.3 | t statistic for correlation coefficient
9.7.4 | z statistic for correlation coefficient
9.7.5 | Estimated standard deviation for z statistic
9.7.6 | Z statistic for correlation coefficient
9.7.7 | Z statistic for correlation coefficient when n < 25
9.7.8 | Standard deviation for z*
9.7.9 | Z* statistic for correlation coefficient
9.7.10 | Confidence interval for r
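
The least squares estimates of the slope and intercept can be computed from their formulas and compared with lm(). The five data points below are invented for illustration.

```r
# Hypothetical paired observations.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates from the textbook-style formulas.
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
a <- mean(y) - b * mean(x)                                       # intercept

fit <- lm(y ~ x)          # the same fit with the built-in function
coef(fit)

r <- cor(x, y)            # sample correlation coefficient
r^2                       # equals the coefficient of determination
summary(fit)$r.squared
```

For inference on the correlation coefficient, cor.test(x, y) reports the t statistic and a confidence interval.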
Symbol Key


Multiple Regression and Correlation

Summary of Formulas with R

Formula number | Name
10.2.1 | Representation of the multiple linear regression equation
10.2.2 | Representation of the multiple linear regression equation with two independent variables
10.2.3 | Random deviation of a point from a plane when there are two independent variables
10.3.1 | Sum-of-squared residuals
10.4.1 | Sum-of-squares equation
10.4.2 | Coefficient of multiple determination
10.4.3 | t statistic for testing hypotheses about b_i
10.5.1 | Estimation equation for multiple linear regression
10.5.2 | Confidence interval for the mean of Y for a given X
10.5.3 | Prediction interval for Y for a given X
10.6.1 | Multiple correlation model
10.6.2 | Multiple correlation coefficient
10.6.3 | F statistic for testing the multiple correlation coefficient
10.6.4–10.6.6 | Partial correlation between two variables (1 and 2) after controlling for a third (3)
10.6.7 | t statistic for testing hypotheses about partial correlation coefficients
Symbol Key


Regression Analysis: Some Additional Techniques

Summary of Formulas with R

Formula number | Name
11.4.1–11.4.3 | Representations of the simple linear regression model
11.4.4 | Simple logistic regression model
11.4.5 | Alternative representation of the simple logistic regression model
11.4.6 | Alternative representation of the multiple logistic regression model
11.4.7 | Alternative representation of the multiple logistic regression model
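
A simple logistic regression model can be fitted with glm(). In this sketch the data (dose and event) are invented, and the fitted probabilities are recomputed from the coefficients to show the link between the logit form and the probability form of the model.

```r
# Hypothetical data: a binary outcome observed at eight dose levels.
dose  <- c(1, 2, 3, 4, 5, 6, 7, 8)
event <- c(0, 0, 1, 0, 1, 1, 0, 1)

fit <- glm(event ~ dose, family = binomial)   # logit link is the default
b0 <- unname(coef(fit)[1])                    # intercept on the log-odds scale
b1 <- unname(coef(fit)[2])                    # change in log-odds per unit of dose

# Probability form of the model: p = 1 / (1 + exp(-(b0 + b1 * x))).
p_hat <- 1 / (1 + exp(-(b0 + b1 * dose)))

odds_ratio <- exp(b1)     # odds ratio for a one-unit increase in dose
```

The vector p_hat agrees with fitted(fit), which returns the same probabilities directly.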
Symbol Key


The Chi-Square Distribution and the Analysis of Frequencies

Summary of Formulas with R

Formula number | Name
12.2.1 | Standard normal random variable
12.2.2 | Chi-square distribution with n degrees of freedom
12.2.3 | Chi-square probability density function
12.2.4 | Chi-square test statistic
12.4.1 | Chi-square calculation formula for a 2 × 2 contingency table
12.4.2 | Yates's corrected chi-square calculation for a 2 × 2 contingency table
12.6.1–12.6.2 | Large-sample approximation to the chi-square
12.7.1 | Relative risk estimate
12.7.2 | Confidence interval for the relative risk estimate
12.7.3 | Odds ratio estimate
12.7.4 | Confidence interval for the odds ratio estimate
12.7.5 | Expected frequency in the Mantel–Haenszel statistic
12.7.6 | Stratum expected frequency in the Mantel–Haenszel statistic
12.7.7 | Mantel–Haenszel test statistic
12.7.8 | Mantel–Haenszel estimator of the common odds ratio
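
For a 2 × 2 contingency table, the chi-square shortcut formulas can be checked against chisq.test(). The counts below are invented for illustration, and the cell names n11 to n22 are only a labeling convention used in this sketch.

```r
# Hypothetical 2 x 2 table: rows are exposure groups, columns are outcomes.
tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome  = c("disease", "no disease")))
n11 <- tab[1, 1]; n12 <- tab[1, 2]; n21 <- tab[2, 1]; n22 <- tab[2, 2]
n <- sum(tab)
margins <- prod(rowSums(tab), colSums(tab))   # product of the four marginal totals

x2       <- n * (n11 * n22 - n12 * n21)^2 / margins               # uncorrected
x2_yates <- n * (abs(n11 * n22 - n12 * n21) - n / 2)^2 / margins  # Yates's correction

chisq.test(tab, correct = FALSE)   # matches x2
chisq.test(tab, correct = TRUE)    # matches x2_yates

relative_risk <- (n11 / (n11 + n12)) / (n21 / (n21 + n22))
odds_ratio    <- (n11 * n22) / (n12 * n21)
```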
Symbol Key


Nonparametric and Distribution-Free Statistics

Summary of Formulas with R

Formula number | Name
13.3.1 | Sign test statistic
13.3.2 | Large-sample approximation of the sign test
13.6.1 | Mann–Whitney test statistic
13.6.2 | Large-sample approximation of the Mann–Whitney test
13.6.3 | Equivalence of the Mann–Whitney and Wilcoxon two-sample statistics
13.7.1–13.7.2 | Kolmogorov–Smirnov test statistic
13.8.1 | Kruskal–Wallis test statistic
13.8.2 | Kruskal–Wallis test statistic adjustment for ties
13.9.2 | Friedman test statistic
13.10.1 | Spearman rank correlation test statistic
13.10.2 | Large-sample approximation of the Spearman rank correlation
13.10.3–13.10.4 | Correction for tied observations in the Spearman rank correlation
13.11.1 | Theil's estimator of b
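
The Mann–Whitney statistic can be computed from ranks and compared with wilcox.test(), which reports the same quantity as W. The two samples below are invented and contain no ties, and the names rank_sum and mw_stat are only illustrative.

```r
# Hypothetical samples without ties.
x <- c(1.1, 2.4, 3.8, 4.0)
y <- c(2.9, 5.2, 6.1, 7.3, 8.5)

ranks <- rank(c(x, y))                   # ranks in the combined sample
rank_sum <- sum(ranks[seq_along(x)])     # sum of the ranks belonging to x
n_x <- length(x)

# Mann-Whitney statistic: rank sum of x minus its smallest possible value.
mw_stat <- rank_sum - n_x * (n_x + 1) / 2

res <- wilcox.test(x, y)                 # two-sample Wilcoxon / Mann-Whitney test
res$statistic                            # W, equal to mw_stat
```

Other tests in this table have base R counterparts as well, for example kruskal.test(), friedman.test() and cor.test() with method = "spearman".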


Survival Analysis

Summary of Formulas with R

Formula number | Name
14.2.1 |
14.2.2 |
14.2.3 |
14.2.4 |
14.2.5 |


Vital Statistics

Summary of Formulas with R

Formula number | Name


Further reading

For Biostatistics

For R programming
