Statistics

~14 min read

In 30 seconds
  • What: Statistics in NDA Maths covers measures of central tendency (mean, median, mode), measures of dispersion (variance, standard deviation), and bivariate analysis (correlation and regression).
  • Why it matters: This topic has appeared in every NDA paper since 2010, typically contributing 5–9 questions per paper, spanning straightforward calculations to conceptual statement-type questions.
  • Key fact: Variance is independent of change of origin but NOT of change of scale — multiplying each observation by k multiplies the variance by k².

Statistics is one of the highest-yield topics in NDA Mathematics. Every paper carries questions on it — sometimes as many as 9 in a single sitting. The good news: the syllabus is focused. Master mean, median, mode, variance, standard deviation, and the basics of correlation and regression, and you cover almost every question type that appears.

This page walks you through every concept tested, shows you PYQ solutions step by step, and tells you exactly which question patterns to expect on exam day.

What This Topic Covers

Sub-topics in scope

  • Measures of central tendency — arithmetic mean (simple and weighted), geometric mean, harmonic mean, median, mode
  • Measures of dispersion — range, mean deviation, variance, standard deviation, coefficient of variation
  • Frequency distributions — grouped and ungrouped data, class intervals, cumulative frequency, ogive
  • Graphical representation — histogram, frequency polygon, ogive (less-than and more-than), pie chart
  • Bivariate analysis — correlation coefficient, lines of regression (y on x, and x on y)
  • Properties of measures — effect of change of origin and scale on mean, variance, and standard deviation

NDA questions fall into two broad types: calculation-based (find the mean of a dataset, compute the standard deviation) and concept/statement-based (decide which statements about variance or correlation are correct). Both types appear every year, so you need both computational speed and conceptual clarity.

⚡ NDA Alert

The two ogive curves (less-than and more-than) intersect at the median. The abscissa of that intersection point gives the median value. This fact has appeared in multiple papers including 2015-I, 2017-I, and 2011-I.

Exam Pattern & Weightage

The table below is built from PYQ data. It shows how many Statistics questions appeared in each paper and which subtopics were tested.

| Year | Paper | Questions | Key Subtopics Tested |
|------|-------|-----------|----------------------|
| 2010 | I & II | 7 | Standard deviation (shift by k), regression lines, median class, combined mean |
| 2011 | I & II | 8+ | Mean, median, mode, ogive, correlation coefficient, variance units, combined mean |
| 2012 | I & II | 7 | Mean shift, mode, variance scaling, regression lines, coefficient of variation |
| 2013 | I & II | 8 | Median (raw data), variance properties, regression coefficients, cumulative frequency |
| 2014 | I & II | 7 | Regression lines, combined SD, correlation coefficient, mean deviation, histogram |
| 2015 | I & II | 6 | Excluded observation, ogive intersection, geometric mean, regression coefficients |
| 2016 | I & II | 8 | Regression (y on x), variance formula, correlation coefficient from covariance |
| 2017 | I & II | 9 | Variance scaling, empirical relation, ogive median, regression equation, CV |
| 2018 | I | 6 | Correlation coefficient, median of raw data, regression lines intersection, pie chart |

Takeaway: Statistics consistently delivers 5–9 questions per paper. Correlation and regression, variance scaling, and median/ogive concepts are the most repeated subtopics.

⚡ NDA Alert

Statement-type questions (decide which of the given statements is/are correct) make up roughly 40% of Statistics questions. These test conceptual knowledge — you rarely need to calculate anything. Know your properties cold.

Core Concepts

Arithmetic Mean

The arithmetic mean of \(n\) observations \(x_1, x_2, \ldots, x_n\) is the sum divided by \(n\). For a frequency distribution, multiply each value by its frequency, sum the products, then divide by the total frequency.

Arithmetic Mean (raw data) $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Combined Mean (two groups) $$\bar{x}_{\text{combined}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

Effect of change of origin and scale: If every observation is increased by \(k\), the mean increases by \(k\). If every observation is multiplied by \(k\), the mean is multiplied by \(k\). The algebraic sum of deviations from the mean is always zero.
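
These origin-and-scale rules are easy to verify numerically. A minimal sketch in Python — the dataset is illustrative, not from a PYQ; the combined-mean check reuses the Example 4 values from 2010-I:

```python
from statistics import mean

data = [2, 4, 4, 4, 5, 5, 7, 9]         # illustrative dataset
k = 3

m = mean(data)                          # 5
print(mean([x + k for x in data]))      # change of origin: mean becomes m + k = 8
print(mean([x * k for x in data]))      # change of scale: mean becomes k*m = 15
print(sum(x - m for x in data))         # algebraic sum of deviations from mean: 0

# Combined mean of two groups (the 2010-I Example 4 values):
n1, m1, n2, m2 = 36, 4, 64, 3
print((n1 * m1 + n2 * m2) / (n1 + n2))  # 3.36
```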

Median

For raw data, arrange in ascending order and pick the middle value. For a frequency distribution, find the median class using cumulative frequencies, then apply the interpolation formula.

Median (grouped data) $$\text{Median} = L + \frac{\frac{n}{2} - cf}{f} \times h$$
where \(L\) = lower limit of median class, \(cf\) = cumulative frequency before median class, \(f\) = frequency of median class, \(h\) = class width

A cumulative frequency curve is called an ogive. The less-than ogive and the more-than ogive intersect at a point whose abscissa is the median.
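
The interpolation formula translates directly into code. A short sketch — the frequency table below is made up for illustration, not taken from a PYQ:

```python
# Median of grouped data via the interpolation formula.
# Classes [0,10), [10,20), [20,30), [30,40) with frequencies 5, 8, 12, 5:
classes = [((0, 10), 5), ((10, 20), 8), ((20, 30), 12), ((30, 40), 5)]

n = sum(f for _, f in classes)          # total frequency = 30
cum = 0
for (lower, upper), f in classes:
    if cum + f >= n / 2:                # first class reaching n/2 is the median class
        cf, h = cum, upper - lower      # cf = cumulative frequency before it
        median = lower + (n / 2 - cf) / f * h
        break
    cum += f

print(median)                           # 20 + (15 - 13)/12 * 10 ≈ 21.67
```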

Mode

The mode is the value with the highest frequency in a dataset. For a frequency distribution, the modal class has the maximum frequency. In moderately asymmetric distributions, the empirical relation connects mean, median, and mode:

Empirical Relation (moderately skewed data) $$\text{Mode} = 3 \times \text{Median} - 2 \times \text{Mean}$$

This 3-2-1 rule is a guaranteed one-mark NDA question: if any two of mean, median, and mode are given, the third can be obtained instantly.
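
As a sanity check, the relation is a one-line helper; the numbers below are illustrative:

```python
def mode_from(mean, median):
    """Empirical relation for moderately skewed data: Mode = 3*Median - 2*Mean."""
    return 3 * median - 2 * mean

print(mode_from(mean=30, median=28))    # 3*28 - 2*30 = 24
```

For a perfectly symmetric distribution the three measures coincide, so the relation returns the common value.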

⚡ Common Trap

The sum of absolute deviations \(\sum |x_i - M|\) is minimum when measured from the median, not the mean. Setters often swap these two in statement questions — the mean minimises the sum of squared deviations; the median minimises the sum of absolute deviations.

Variance and Standard Deviation

Variance measures how spread out the data is. Standard deviation is the positive square root of variance. If values are measured in cm, variance is in cm² — it has squared units.

Variance (population) $$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Also: $$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
Standard Deviation $$\sigma = \sqrt{\sigma^2}$$

Key properties of variance:

  • Adding a constant \(k\) to every observation leaves the variance unchanged (change of origin does not affect variance).
  • Multiplying every observation by \(k\) multiplies the variance by \(k^2\).
  • The standard deviation of identical observations is \(0\).

SD of First n Natural Numbers $$\sigma = \sqrt{\frac{n^2 - 1}{12}}$$

This direct formula appears as a one-liner — for \(n = 10\), \(\sigma = \sqrt{99/12} \approx 2.87\). No need to compute the mean and run the variance summation.
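
Both the scaling rules and the first-\(n\) formula can be checked with the standard library. A sketch with an illustrative dataset:

```python
import math
from statistics import pvariance, pstdev

data = [3, 7, 7, 19]                    # illustrative; population variance = 36
k = 3

v = pvariance(data)
print(pvariance([x + k for x in data])) # change of origin: variance still 36
print(pvariance([x * k for x in data])) # change of scale: k^2 * 36 = 324

# SD of the first n natural numbers, two ways:
n = 10
print(math.sqrt((n * n - 1) / 12))      # direct formula, ~2.87
print(pstdev(range(1, n + 1)))          # brute-force computation, same value
```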

⚡ Common Trap

"Adding a constant doesn't change the SD" is a goldmine. If the SD of a dataset is 5 and you add 10 to every observation, the SD is still 5. Setters plant distractor options like 5 + 10 = 15 to catch candidates who confuse change of origin with change of scale.

⚡ NDA Alert

This is the single most-tested property in Statistics: variance is independent of change of origin but NOT of change of scale. If variance is V and each observation is multiplied by 3, the new variance is 9V, not 3V. Confirmed in NDA papers 2013-I, 2014-I, 2016-I, 2017-I.

Coefficient of Variation

The coefficient of variation (CV) lets you compare variability across datasets with different means. A lower CV means less relative variability.

Coefficient of Variation $$\text{CV} = \frac{\sigma}{\bar{x}} \times 100\%$$

From a 2012-II question: if mean = 40 and SD = 8, then \(\text{CV} = \frac{8}{40} \times 100 = 20\%\).
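
Since CV is just a ratio, a tiny helper covers both calculation and comparison questions. The second group below is hypothetical, added only to illustrate the comparison:

```python
def cv(sd, mean):
    """Coefficient of variation, as a percentage."""
    return sd / mean * 100

print(cv(8, 40))            # the 2012-II values: 20.0
# Comparing groups: the one with lower CV is less variable relative to its mean
print(cv(5, 50))            # hypothetical second group: 10.0, so less variable
```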

Correlation Coefficient

The Pearson correlation coefficient \(r\) measures the strength and direction of the linear relationship between two variables \(x\) and \(y\).

Correlation Coefficient $$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \cdot \sigma_y}$$
Range: $$-1 \le r \le 1$$
  • \(r = +1\) or \(-1\): perfect linear relationship.
  • \(r = 0\): no linear relationship; in this case the two regression lines are perpendicular to each other.
  • \(r^2\) is the coefficient of determination — the proportion of the variation in one variable explained by its linear relationship with the other.
  • Both regression coefficients always have the same sign as \(r\).
  • If one regression coefficient is greater than 1, the other must be less than 1 — both cannot simultaneously exceed 1 in magnitude.
  • \(r\) is independent of change of origin and scale (provided the scale factors are positive).

Correlation from Regression Coefficients $$r^2 = b_{yx} \cdot b_{xy}$$
$$r = \pm\sqrt{b_{yx} \cdot b_{xy}}$$ — sign is the same as the sign of both regression coefficients
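
The sign rule is the part candidates fumble, so it is worth encoding explicitly. A sketch (the function name is my own):

```python
import math

def r_from(byx, bxy):
    """Correlation coefficient from the two regression coefficients.
    Both coefficients must share a sign, and r carries that same sign."""
    prod = byx * bxy
    if prod < 0:
        raise ValueError("regression coefficients must have the same sign")
    return math.copysign(math.sqrt(prod), byx)

print(r_from(-3 / 2, -1 / 6))   # the 2014-I values: -0.5
```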

Lines of Regression

Two regression lines exist for any bivariate dataset. They intersect at the point \((\bar{x}, \bar{y})\) — the means of \(x\) and \(y\). When \(r = 0\), the two lines are perpendicular. When \(r = \pm 1\), the two lines coincide.

Regression Line: y on x $$y - \bar{y} = b_{yx} (x - \bar{x})$$
where $$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}$$
Regression Line: x on y $$x - \bar{x} = b_{xy} (y - \bar{y})$$
where $$b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}$$

To find \(\bar{x}\) and \(\bar{y}\) from two regression equations, solve the pair of equations simultaneously — the intersection point is \((\bar{x}, \bar{y})\).
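
Because this reduces to a 2×2 linear system, a small Cramer's-rule helper (names are my own) handles every question of this type:

```python
def means_from_lines(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule.
    For two regression lines, the solution is (x_bar, y_bar)."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("lines are parallel or coincident")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# The 2018-I pair: 4x - 5y = -33 and 20x - 9y = 107
print(means_from_lines(4, -5, -33, 20, -9, 107))   # (13.0, 17.0)
```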

Worked Examples

Example 1 — Effect of Shift on Standard Deviation (2010-I)

Question: A set of \(n\) values has standard deviation \(\sigma\). What is the standard deviation of the \(n\) values obtained by adding \(k\) to each value?

  • Standard deviation measures spread from the mean. Adding a constant \(k\) shifts every value and the mean by the same amount.
  • The difference $$(x_i + k) - (\bar{x} + k) = x_i - \bar{x}$$ is unchanged for every observation.
  • Since the deviations from the mean are identical, the standard deviation remains \(\sigma\).
  • Answer: (a) \(\sigma\)

Example 2 — Variance Scaling (2017-I)

Question: The variance of 20 observations is 5. If each observation is multiplied by 3, what is the new variance?

  • Let the observations be \(x_1, x_2, \ldots, x_{20}\) with variance \(\sigma^2 = 5\).
  • New observations are \(3x_1, 3x_2, \ldots, 3x_{20}\). New mean = \(3\bar{x}\).
  • New variance: $$\frac{1}{n}\sum (3x_i - 3\bar{x})^2 = 9 \cdot \frac{1}{n}\sum (x_i - \bar{x})^2 = 9 \times 5 = 45$$
  • Answer: (d) 45

Example 3 — Correlation Coefficient from Regression Coefficients (2014-I)

Question: For two variables \(x\) and \(y\), \(b_{yx} = -3/2\) and \(b_{xy} = -1/6\). Find the correlation coefficient.

  • Use $$r^2 = b_{yx} \cdot b_{xy} = \left(-\frac{3}{2}\right) \cdot \left(-\frac{1}{6}\right) = \frac{3}{12} = \frac{1}{4}$$
  • So \(r = \pm 1/2\). Since both regression coefficients are negative, \(r\) is negative.
  • \(r = -1/2\).
  • Answer: (c) \(-1/2\)

Example 4 — Mean of Combined Distributions (2010-I)

Question: Distribution X has 36 observations with mean 4. Distribution Y has 64 observations with mean 3. What is the mean of the combined distribution X + Y?

  • Combined mean: $$\bar{x}_{\text{combined}} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
  • $$= \frac{36 \times 4 + 64 \times 3}{36 + 64} = \frac{144 + 192}{100} = \frac{336}{100} = 3.36$$
  • Answer: (c) 3.36

Example 5 — Finding \(\bar{x}\) and \(\bar{y}\) from Two Regression Lines (2018-I)

Question: Two lines of regression are \(4x - 5y + 33 = 0\) and \(20x - 9y = 107\). Find the values of \(\bar{x}\) and \(\bar{y}\).

  • The two regression lines intersect at \((\bar{x}, \bar{y})\). Solve the system simultaneously.
  • Equation 1: \(4x - 5y = -33\). Multiply by 5: \(20x - 25y = -165\).
  • Equation 2: \(20x - 9y = 107\). Subtract equation 2 from the scaled equation 1:
  • $$(20x - 25y) - (20x - 9y) = -165 - 107 \;\Rightarrow\; -16y = -272 \;\Rightarrow\; y = 17$$
  • Substitute into equation 1: $$4x - 5(17) = -33 \;\Rightarrow\; 4x = -33 + 85 = 52 \;\Rightarrow\; x = 13$$
  • Answer: (c) \(\bar{x} = 13,\ \bar{y} = 17\)

Example 6 — Correcting a Misread Observation

Question: The mean of 100 observations is 40. Later it was discovered that one observation was misread as 83 instead of the correct value 53. Find the corrected mean.

  • Use the correction shortcut: $$\text{Correct Mean} = \text{Old Mean} + \frac{\text{Correct} - \text{Incorrect}}{N}$$.
  • Substitute: $$40 + \frac{53 - 83}{100} = 40 + \frac{-30}{100} = 40 - 0.3$$.
  • No need to reconstruct the full sum — adjust only the affected term.
  • Answer: 39.7

Example 7 — SD After Scale Multiplication

Question: The standard deviation of a dataset of 20 observations is 5. If every observation is multiplied by 3 and then 7 is added to each, find the new standard deviation.

  • SD is independent of change of origin: adding 7 to every observation does nothing.
  • SD is dependent on change of scale: multiplying every observation by 3 multiplies SD by \(|3| = 3\).
  • New SD = \(3 \times 5 = 15\).
  • Answer: 15 (variance, by contrast, would become \(3^2 \times 25 = 225\)).

Exam Shortcuts (Pro-Tips)

Statistics rewards pattern recognition. The three shortcuts below collapse classic NDA setups into 15-second solves — every one has appeared in past papers. Memorise them; they regularly turn 3-minute calculations into one-liners.

Shortcut 1 — Incorrect Observation Correction

When a mean is reported and one observation is later found to be misread, do not recompute the whole sum. Apply the correction directly to the mean.

Corrected Mean Formula $$\text{Correct Mean} = \text{Old Mean} + \frac{\text{Correct Value} - \text{Incorrect Value}}{N}$$

Example: mean of 100 observations is 40; an observation 83 was actually 53. Correct mean \(= 40 + (53 - 83)/100 = 39.7\). For corrected variance, use $$\text{Correct } \sum x^2 = \text{Old } \sum x^2 - (\text{Incorrect})^2 + (\text{Correct})^2$$ and reapply the variance formula.
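
The shortcut is two lines of arithmetic; a sketch using the worked numbers (the \(\sum x^2\) value passed to the second helper is illustrative):

```python
def corrected_mean(old_mean, n, wrong, correct):
    """Apply the misread-observation correction without rebuilding the sum."""
    return old_mean + (correct - wrong) / n

def corrected_sum_sq(old_sum_sq, wrong, correct):
    """Adjust the sum of squares before recomputing a corrected variance."""
    return old_sum_sq - wrong ** 2 + correct ** 2

print(corrected_mean(40, 100, wrong=83, correct=53))   # 39.7
```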

Shortcut 2 — Combined Variance Formula

When two groups are merged, combined variance is not the simple weighted average of the individual variances — you must add a correction for how far each group's mean sits from the combined mean.

Combined Variance of Two Groups $$\sigma^2 = \frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}$$
where $$d_1 = \bar{x}_1 - \bar{x}_{12}$$ and $$d_2 = \bar{x}_2 - \bar{x}_{12}$$

Compute the combined mean first, then the deviations \(d_1, d_2\) of each group mean from it, and plug in. Forgetting the \(d^2\) terms is the single most common error in this pattern.
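
A quick numerical cross-check, with illustrative groups: apply the formula, then compare against a direct variance of the pooled data.

```python
from statistics import mean, pvariance

def combined_variance(n1, m1, v1, n2, m2, v2):
    """Combined variance of two merged groups, with the d^2 correction terms."""
    m = (n1 * m1 + n2 * m2) / (n1 + n2)       # combined mean first
    d1, d2 = m1 - m, m2 - m
    return (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)

g1, g2 = [1, 2, 3], [10, 14]                  # illustrative groups
v = combined_variance(len(g1), mean(g1), pvariance(g1),
                      len(g2), mean(g2), pvariance(g2))
print(v)                                      # ~26 (up to float rounding)
print(pvariance(g1 + g2))                     # direct computation agrees: 26
```

Dropping the \(d^2\) terms here would give \((3 \cdot \tfrac{2}{3} + 2 \cdot 4)/5 = 2\) instead of 26 — which is exactly the error the warning above is about.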

Shortcut 3 — Regression Line Quick Solve

If a question gives two regression line equations and asks for the means \(\bar{x}\) and \(\bar{y}\), ignore every statistics formula. The two lines always intersect at \((\bar{x}, \bar{y})\), so just solve them as ordinary simultaneous linear equations.

Means from Two Regression Lines Solve the two given linear equations simultaneously \(\to\) the solution \((x, y)\) is exactly $$(\bar{x}, \bar{y})$$

Example: given \(3x + 2y - 26 = 0\) and \(6x + y - 31 = 0\), solve to get \(x = 4, y = 7\). So \(\bar{x} = 4, \bar{y} = 7\) — done in under 30 seconds. This pattern appeared in NDA 2012-I and 2018-I.

Common Question Patterns

How NDA Tests Statistics

After analysing papers from 2010 to 2018, six recurring question patterns emerge. Every paper tests at least three of these.

Pattern 1 — Variance and SD After Scaling or Shifting

You are told the variance (or SD) of a dataset, then asked to find the new variance if each observation is multiplied by \(k\) or increased by \(k\). The rule is: adding \(k\) does nothing to variance; multiplying by \(k\) multiplies variance by \(k^2\). This pattern appeared in 2010-I, 2011-II, 2013-I, 2014-II, 2016-I, 2016-II, 2017-I.

Pattern 2 — Median from Frequency Distribution

You are given a grouped frequency table (sometimes with a missing frequency), told the median value, and asked to find the missing frequency or the median class. Apply the interpolation formula. Appeared in 2010-II (the TV tubes question with median life 17 months), 2011-I, 2012-I.

Pattern 3 — Correlation Coefficient from Regression Coefficients

Two regression coefficients \(b_{yx}\) and \(b_{xy}\) are given. Use \(r = \pm\sqrt{b_{yx} \cdot b_{xy}}\). The sign of \(r\) matches the sign of both coefficients. Appeared in 2014-I, 2015-II, 2017-II, 2018-I.

Pattern 4 — Finding Intersection of Two Regression Lines

Two regression line equations are given. Solve them simultaneously — the solution is \((\bar{x}, \bar{y})\). Appeared in 2012-I (regression lines \(x - y + 1 = 0\) and \(2x - y + 4 = 0\) giving intersection \((-3, -2)\)), and 2018-I.

Pattern 5 — Properties of Measures (Statement Type)

Two or three statements about mean, median, variance, regression, or correlation are given. You pick which are correct. These questions test knowledge of properties like "algebraic sum of deviations from mean is zero" (2013-II), "both regression coefficients have the same sign" (2013-II), "variance is independent of origin" (2018-I). Prepare a list of all key properties.

Pattern 6 — Combined Mean or Corrected Mean

You are told the mean of a group, then one observation is found to be wrong and corrected. Find the new mean. Or combine two groups with given means and sizes. Formula: new total = old total ± correction, then divide by n (or combined n). Appeared in 2012-I, 2013-I, 2017-I.

Preparation Strategy

Week 1 — Central Tendency and Dispersion

Start with mean, median, and mode for both raw and grouped data. Practice the combined mean formula with 3–4 PYQ examples. Then move to variance and standard deviation — compute them by hand for small datasets. Memorise the scaling rule (variance scales by \(k^2\)) and the shift rule (variance unchanged by adding \(k\)).

Week 2 — Correlation and Regression

Learn the correlation coefficient formula and its properties. Then learn how to find regression lines, how to extract \(\bar{x}\) and \(\bar{y}\) by solving two regression equations simultaneously, and how to compute \(r\) from \(b_{yx}\) and \(b_{xy}\). Practise with PYQ questions from 2010 to 2018.

Week 3 — Statement-Type Questions

List every key property from the PYQs you have solved. Convert them into flashcard-style statements. For each one, know whether it is true or false and why. Statement-type questions in Statistics are almost always about properties you have already seen — the wording changes, but the fact does not.

High-Value Properties to Memorise

  • Algebraic sum of deviations from the mean = 0.
  • Mean deviation is least when measured about the median.
  • Variance is independent of change of origin; coefficient of variation is independent of units.
  • Both regression coefficients have the same sign. If one exceeds 1, the other must be less than 1.
  • The two regression lines intersect at \((\bar{x}, \bar{y})\). When \(r = 0\), they are perpendicular. When \(|r| = 1\), they coincide.
  • The abscissa of the intersection of less-than and more-than ogives is the median.
  • Geometric mean is used in the construction of index numbers.

Time Allocation in the Exam

Statement-type questions: 30–45 seconds each (no calculation, just recall). Calculation questions like variance scaling or combined mean: 60–90 seconds. Regression line intersection (solve two simultaneous equations): 90–120 seconds. Grouped data median or mode: 2 minutes if the table is complex. Skip and return if stuck — Statistics has enough easy questions that you can score well without attempting the hardest ones.

Test Your Statistics Prep

Mock tests replicate the real NDA paper pattern — time pressure, mixed difficulty, and the exact same question formats. Check where you stand before exam day.

Start Free Mock Test

Frequently Asked Questions

How many questions from Statistics appear in NDA Maths?

Based on PYQ data from 2010 to 2018, Statistics consistently delivers 5–9 questions per paper. Papers in 2011, 2016, and 2017 had 8–9 questions. It is one of the most heavily weighted topics in the NDA Maths syllabus.

What happens to variance when each observation is multiplied by a constant k?

The variance is multiplied by \(k^2\). So if variance is 5 and each observation is multiplied by 3, the new variance is \(9 \times 5 = 45\). The standard deviation is multiplied by \(k\) (not \(k^2\)). Adding a constant to every observation has no effect on variance or standard deviation.

What is the difference between the two lines of regression?

The regression line of \(y\) on \(x\) minimises the sum of squared vertical deviations — use it to predict \(y\) from \(x\). The regression line of \(x\) on \(y\) minimises the sum of squared horizontal deviations — use it to predict \(x\) from \(y\). Both lines pass through the point \((\bar{x}, \bar{y})\). They coincide only when \(|r| = 1\), and they are perpendicular when \(r = 0\).

How do I find the mean and SD of y given the two regression lines?

From the 2010-II PYQ: the lines were \(8x - 10y = 66\) and \(40x - 18y = 214\), with variance of \(x = 9\) (so \(\sigma_x = 3\)). Assign one line as \(y\) on \(x\) and the other as \(x\) on \(y\) — the assignment is valid only if the resulting product \(b_{yx} \cdot b_{xy}\) does not exceed 1. Extract \(b_{yx}\) from the \(y\)-on-\(x\) line. Then use \(b_{yx} = r \cdot (\sigma_y/\sigma_x)\) along with \(b_{xy} = r \cdot (\sigma_x/\sigma_y)\). Multiply the two regression coefficients: \(b_{yx} \cdot b_{xy} = r^2\). Solve for \(r\) and then for \(\sigma_y\). The answer for that question was \(\sigma_y = 4\).
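
The chain of steps in that answer can be scripted for checking. Here the line-to-coefficient assignment is the one that keeps \(r^2 \le 1\):

```python
import math

# 2010-II setup: lines 8x - 10y = 66 and 40x - 18y = 214, variance of x = 9
byx = 8 / 10          # from the y-on-x line, rewritten as y = 0.8x - 6.6
bxy = 18 / 40         # from the x-on-y line, rewritten as x = 0.45y + 5.35
sigma_x = 3

assert 0 <= byx * bxy <= 1        # confirms this assignment is the valid one
r = math.sqrt(byx * bxy)          # both coefficients positive, so r > 0
sigma_y = byx * sigma_x / r       # rearranged from b_yx = r * sigma_y / sigma_x
print(r, sigma_y)                 # ≈ 0.6 and ≈ 4
```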

What is the coefficient of variation and when is it tested?

\(\text{CV} = (\sigma / \bar{x}) \times 100\%\). It is a unit-free relative measure of dispersion. NDA tests it in two ways: direct calculation (e.g., mean = 40, SD = 8 \(\to\) CV = 20%, from 2012-II) and as a conceptual statement (CV is independent of the unit of measurement, which is true). It is also used to compare variability between two groups — the group with a lower CV is less variable relative to its mean.

How does the median change when new observations are added?

From the 2012-II PYQ: the median of 27 observations was 18. Three more observations — 16, 18, and 50 — were added. The median of the 30 observations was still 18. The key insight: adding observations that are near or equal to the existing median often leaves the median unchanged, but you must recount positions carefully. Always re-sort and find the new middle position.

Which measure of central tendency is used in constructing index numbers?

Geometric mean. This was asked directly in 2015-I. The geometric mean is preferred for index numbers because it gives equal weight to equal ratios of change, making it suitable for combining price relatives across different commodities.