Statistics is a branch of mathematics that deals with numerical data analysis. Statistics is the study of the collection, analysis, organization, interpretation and presentation of data.
R language has been an excellent tool for statistical computation of data which includes statistical modeling, data oriented strategies and use of probability distribution and randomization in analysis. R provides various tools and functions to perform the statistical analysis of data with ease.
In this article we’ll look at various measures of central tendencies like mean
, median
, mode
and deviation
of various sample values from the central point in the distribution.
An average is defined as the number in statistics that measures the central tendency of a given a set of numbers. The various types of averages that will be used in statistical computations are
Let’s have a look at each of these measures of central tendency one by one.
Mean is defined as the sum of all the observations divided by total number of sample observations. The basic formula for the mean of all the observations y1, y2, y3,…yn is given by
X = y1+ y2 + y3 + ... + yn / n
Let us consider an example of dataset containing 7 datapoints 3, 5, 7, 9, 11, 13, 15
The Mean in this case would be X = (3 + 5 + 7 + 9 + 11 + 13 + 15 ) / 7 = 9
R provides an in-built function mean()
to compute the mean of all values in the dataset. The function takes numeric or integer vector as an argument and returns the result.
x
is a variable that takes the integer vectors using c()
functionmean(x)
is displayed using print
mean(x, na.rm)
Here, x
is the numeric/integer vector and na.rm
is a Boolean
value to remove NA
(undefined values) from the given dataset.
Mode is defined as a number in a set of numbers that occurs the most (maximum frequency) in the given dataset.
Let us consider an example of dataset containing 12 datapoints 2, 4, 6, 4, 9, 5, 4, 2, 4, 4, 2, 6
In the above example, 4 has the maximum frequency among the set of numbers. Hence, 4 is the mode of this dataset.
R does not have an in-built function to compute the mode for a given set of numbers. So, a user defined function is created.
mode(x)
is a user defined function here.unique(x)
function to filter out all the duplicate elements and in the next step which.max(tabulate(match(x,u)))
is contained in the body to determine the mode of the given dataset.mode(x)
function is then invoked and return value is stored in the variable result
Median is defined as the middle value among the set of numbers when all the numbers are sorted.
(n+1)/2
location.(n/2)
and (n/2)+1
locations.Let us consider an example of dataset containing 9 datapoints 3, 12, 4, 8, 15, 1, 23, 11, 7
1, 3, 4, 7, 8, 11, 12, 15, 23
.(9+1)/2
i.e. at 5th position. Hence, 8 is our median.R provides an in-built function median()
to compute the median or middle value of all values in the dataset. The function takes numeric or integer vector as an argument and returns the median value.
x
is a variable that takes the integer vectors using c()
functionmedian(x)
is displayed using print
In the next section we’ll look at Variance
Help us improve this content by editing this page on GitHub