Statistics For Data Science
Why Is Statistics So Important?
In everyday life we make many predictions. For example, we set an alarm for the morning even though we cannot know for certain that we will wake up to hear it. In doing so, we informally rely on the basics of statistics: past experience tells us how likely an outcome is, and we act on that likelihood.
Measures of Central Tendency
A measure of central tendency is a summary statistic that represents the center point or typical value of a data set. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution.
Arithmetic Mean
The arithmetic mean (usually just called the mean) is defined as the sum of all observations in a data set divided by the total number of observations.
In symbolic form, for observations $x_1, x_2, \dots, x_n$, the mean is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
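As a minimal sketch of this definition, the snippet below computes the mean of a small data set. The observations are hypothetical values chosen only for illustration; they are not taken from the original example.

```python
# Hypothetical observations, chosen only for illustration.
observations = [4, 8, 15, 16, 23, 42]

# Sum of all observations divided by the number of observations.
mean = sum(observations) / len(observations)
print(mean)  # 18.0
```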
Median
The median is the middlemost observation when the data are arranged in ascending order of magnitude. It splits the data so that 50% of the observations lie above the median and 50% lie below it.
The median is a very useful measure for ranked data, for example consumer preferences and ratings. It is not affected by extreme values, so it is more resistant to outliers than the mean.
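The resistance to outliers can be seen in a small sketch using Python's standard `statistics` module. The ratings below are hypothetical; the second list differs from the first only by one extreme value.

```python
import statistics

# Hypothetical ratings; the second list replaces the largest value with an outlier.
scores = [2, 3, 3, 4, 5]
scores_with_outlier = [2, 3, 3, 4, 50]

median_plain = statistics.median(scores)
median_outlier = statistics.median(scores_with_outlier)
mean_plain = statistics.mean(scores)
mean_outlier = statistics.mean(scores_with_outlier)

print(median_plain, median_outlier)  # 3 3
print(mean_plain, mean_outlier)      # 3.4 12.4
```

The median stays the same while the mean is pulled strongly toward the extreme value.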
Mode
The mode is the most frequently occurring value in a distribution, that is, the value with the largest frequency. It does not require measurement of all observations. It is not uniquely defined in multi-modal situations, and it is not affected by extreme values. It cannot be treated algebraically: the modes of several groups cannot be combined to obtain the mode of the pooled data.
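Two of these properties can be illustrated with a short sketch. It assumes Python 3.8 or later for `statistics.multimode`, and all values are hypothetical.

```python
import statistics

# A bimodal data set: the mode is not uniquely defined.
bimodal = [1, 1, 2, 2, 3]
all_modes = statistics.multimode(bimodal)
print(all_modes)  # [1, 2]

# Modes of groups cannot be combined: the pooled mode is neither group's mode.
group_a = [1, 1, 1, 2, 2]   # mode is 1
group_b = [2, 2, 3, 3, 3]   # mode is 3
pooled_mode = statistics.mode(group_a + group_b)
print(statistics.mode(group_a), statistics.mode(group_b), pooled_mode)  # 1 3 2
```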
Standard Deviation
Standard deviation is a measure of dispersion in statistics. Dispersion tells you how much your data is spread out; specifically, the standard deviation shows how much your data is spread out around the mean (average). For example, are all your scores close to the average? Or are lots of scores way above (or way below) the average score?
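One common form is the population standard deviation, $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$. The text does not say whether the population or the sample version is intended, so the sketch below assumes the population form; the two data sets are hypothetical and share the same mean of 50.

```python
import math
import statistics

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

def population_std(values):
    """Square root of the average squared deviation from the mean."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

print(population_std(tight))  # about 1.41
print(population_std(wide))   # about 28.28
```

The standard library's `statistics.pstdev` computes the same quantity, and `statistics.stdev` gives the sample version, which divides by n - 1 instead of n.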
Range
Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum and minimum values in the data set.
Range = Maximum - Minimum
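In code this is a one-line calculation. The values below are hypothetical.

```python
# Hypothetical observations.
data = [7, 3, 12, 9, 5]

# Range = Maximum - Minimum
data_range = max(data) - min(data)
print(data_range)  # 9
```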
Interquartile Range (IQR)
The IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first split the ordered data at the median, then find the median (middle value) of the lower half and of the upper half. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
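The median-of-halves procedure can be sketched as follows. The function name `iqr_by_halves` and the data are hypothetical, and the sketch assumes that for an odd number of observations the overall median is left out of both halves. Other quartile conventions exist and can give slightly different results.

```python
import statistics

def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd count, the middle value is the overall median; skip it.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1

data = [1, 3, 5, 7, 9, 11, 13, 15]  # hypothetical observations
q1, q3, iqr = iqr_by_halves(data)
print(q1, q3, iqr)  # 4.0 12.0 8.0
```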
Probability And Distributions
Probability: Meaning and Concepts
• Probability refers to the chance or likelihood of a particular event taking place.
• An event is an outcome of an experiment.
• An experiment is a process that is performed to understand and observe possible outcomes.
• The set of all outcomes of an experiment is called the sample space.
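These terms can be made concrete with a small sketch. The experiment (rolling two fair dice) is a hypothetical choice, and the calculation assumes every outcome in the sample space is equally likely, so that the probability of an event is the number of favourable outcomes divided by the total number of outcomes.

```python
from fractions import Fraction
from itertools import product

# Experiment: roll two fair six-sided dice (hypothetical example).
# Sample space: every ordered pair of faces.
sample_space = list(product(range(1, 7), repeat=2))

# Event: the two faces sum to 7.
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Equally likely outcomes: favourable outcomes / total outcomes.
probability = Fraction(len(event), len(sample_space))
print(len(sample_space), len(event), probability)  # 36 6 1/6
```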
Definition-
Bayes’ Theorem
Bayes’ theorem describes the probability of an event given that a related condition is known to have occurred; in other words, it is a statement about conditional probability. It is also known as the formula for the probability of “causes”, because it gives the probability of each possible cause once an effect has been observed.
Example-
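In symbols, $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. As a minimal sketch with hypothetical numbers (not taken from the source), suppose machine A makes 60% of all items with a 2% defect rate and machine B makes the remaining 40% with a 5% defect rate. Given that an item is defective, the probability that machine A was the "cause" is:

```python
# All figures below are hypothetical, chosen only to illustrate the formula.
p_a = 0.60               # P(A): item came from machine A
p_b = 0.40               # P(B): item came from machine B
p_defect_given_a = 0.02  # P(defect | A)
p_defect_given_b = 0.05  # P(defect | B)

# Total probability of a defect, summed over both possible causes.
p_defect = p_defect_given_a * p_a + p_defect_given_b * p_b

# Bayes' theorem: P(A | defect) = P(defect | A) * P(A) / P(defect)
p_a_given_defect = p_defect_given_a * p_a / p_defect
print(round(p_a_given_defect, 3))  # 0.375
```

Although machine A makes most of the items, it accounts for a minority of the defects because its defect rate is lower.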
Probability Distribution
• In precise terms, a probability distribution is a complete listing of the values a random variable can take, along with the corresponding probability of each value. A real-life example could be the pattern of machine breakdowns in a manufacturing unit.
• The random variable in this example is the number of machine breakdowns, and its values are the various counts that number could assume.
• The probability corresponding to each value is the relative frequency with which that number of breakdowns occurs.
• The probability distribution for this example is constructed from the actual breakdown pattern observed over a period of time. Statisticians use the term “observed distribution” of breakdowns.
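An observed distribution of this kind can be built by counting. The sketch below uses hypothetical daily breakdown counts for 20 days and turns them into relative frequencies.

```python
from collections import Counter

# Hypothetical number of breakdowns recorded on each of 20 days.
breakdowns_per_day = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0,
                      1, 2, 0, 0, 1, 0, 1, 0, 2, 0]

counts = Counter(breakdowns_per_day)
total_days = len(breakdowns_per_day)

# Relative frequency of each value = observed probability of that value.
observed_distribution = {
    value: counts[value] / total_days for value in sorted(counts)
}
print(observed_distribution)  # {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
```

The probabilities sum to 1, as they must for any probability distribution.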
Binomial Distribution
- The binomial distribution is a widely used probability distribution of a discrete random variable.
- It plays a major role in quality control and quality assurance. Manufacturing units use the binomial distribution for the analysis of defectives.
- Reducing the number of defectives using the proportion-defective control chart (p chart) is an accepted practice in manufacturing organizations.
- The binomial distribution is also used in service organizations such as banks and insurance corporations to get an idea of the proportion of customers who are satisfied with the service quality.
Conditions for Applying Binomial Distribution (Bernoulli Process)
- Trials are independent and random.
- There is a fixed number of trials (n trials).
- Each trial has only two outcomes, designated as success or failure.
- The probability of success is constant throughout the n trials.
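When these conditions hold, the probability of exactly k successes in n trials is $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$, where p is the probability of success on each trial. The sketch below assumes Python 3.8 or later for `math.comb`; the function name `binomial_pmf` and the figures (10 items inspected, 10% chance that each is defective) are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.1  # hypothetical: 10 items inspected, each defective with probability 0.1

p_no_defectives = binomial_pmf(0, n, p)
p_at_most_one = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)
print(round(p_no_defectives, 4))  # 0.3487
print(round(p_at_most_one, 4))    # 0.7361
```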
Poisson Distribution
The Poisson distribution is the discrete probability distribution of the number of events occurring in a given time period, given the average number of times the event occurs over that time period.
Examples include:
- A certain fast-food restaurant gets an average of 3 visitors to the drive-through per minute; the number of visitors arriving in any given minute can then be modeled with a Poisson distribution.
- The number of defects per item, the number of defects per transformer produced, the number of defects per 100 m² of cloth, etc.
Other real-life examples include:
- The number of cars arriving at a highway check post per hour.
- The number of customers visiting a bank per hour during the peak business period.
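For an average rate of $\lambda$ events per period, the Poisson probability of exactly k events is $P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$. The sketch below applies this to the drive-through example with $\lambda = 3$ visitors per minute; the function name `poisson_pmf` is a hypothetical helper, not part of any library.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3  # average visitors per minute, from the drive-through example

p_none = poisson_pmf(0, lam)
p_exactly_three = poisson_pmf(3, lam)
p_more_than_five = 1 - sum(poisson_pmf(k, lam) for k in range(6))

print(round(p_none, 4))           # 0.0498
print(round(p_exactly_three, 4))  # 0.224
print(round(p_more_than_five, 4))
```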
Normal Distribution (Bell Curve)
A normal distribution, sometimes called the bell curve, is a distribution that occurs naturally in many situations. For example, the bell curve is seen in scores on tests like the SAT and GRE. When letter grades follow a bell curve, the bulk of students score close to the average (a C), smaller numbers of students score a B or a D, and an even smaller percentage score an A or an F. This creates a distribution that resembles a bell (hence the nickname). The bell curve is symmetrical: half of the data falls to the left of the mean and half falls to the right.
The Empirical Rule tells you approximately what percentage of normally distributed data falls within a certain number of standard deviations of the mean:
• About 68% of the data falls within one standard deviation of the mean.
• About 95% of the data falls within two standard deviations of the mean.
• About 99.7% of the data falls within three standard deviations of the mean.
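These percentages can be checked against the standard normal distribution. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}
for k, share in within.items():
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```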
Example-
The mean weight of a morning breakfast cereal pack is 0.295 kg, with a standard deviation of 0.025 kg. The random variable, the weight of the pack, follows a normal distribution.
a) What is the probability that the pack weighs less than 0.280 kg?
b) What is the probability that the pack weighs more than 0.350 kg?
c) What is the probability that the pack weighs between 0.260 kg and 0.340 kg?
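One way to work out these three probabilities is with the normal cumulative distribution function. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative. The corresponding z-scores are -0.6 for part a, 2.2 for part b, and -1.4 and 1.8 for part c.

```python
from statistics import NormalDist

# Pack weight in kg: mean 0.295, standard deviation 0.025 (from the example).
weight = NormalDist(mu=0.295, sigma=0.025)

p_less_than_280 = weight.cdf(0.280)                   # a) P(X < 0.280)
p_more_than_350 = 1 - weight.cdf(0.350)               # b) P(X > 0.350)
p_between = weight.cdf(0.340) - weight.cdf(0.260)     # c) P(0.260 < X < 0.340)

print(round(p_less_than_280, 4))  # 0.2743
print(round(p_more_than_350, 4))  # 0.0139
print(round(p_between, 4))        # 0.8833
```

So roughly 27% of packs weigh less than 0.280 kg, about 1.4% weigh more than 0.350 kg, and about 88% fall between 0.260 kg and 0.340 kg.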