In this article we will use the hypergeometric distribution to solve a real-life problem. The objective is to motivate looking at statistics and probability from an application perspective. I shall use a textbook example to explain the distribution and then apply it to a practical setting.

*Note: A basic (high school level) understanding of permutations and combinations might be required to fully appreciate this article. One may brush up on the basics from* __here__*.*

*Theoretical Example - Textbook type*

Imagine there are N balls in a box. Let r of the N balls be red and the remaining N-r balls be black. After shuffling the box adequately, let us take a sample of n balls from the box one by one (without replacement). Now, can you estimate how many of these n balls will be red?

The hypergeometric distribution will help you do this.
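Before looking at the formula, it may help to see the experiment itself. The following is a minimal simulation sketch, not part of the original example: the box sizes (N = 50, r = 20, n = 10), the helper name `draw_red_count`, and the number of repetitions are all assumptions chosen only for illustration.

```python
import random

def draw_red_count(N: int, r: int, n: int, rng: random.Random) -> int:
    """Draw n balls without replacement from a box with r red (1) and
    N - r black (0) balls, and return how many of the drawn balls are red."""
    box = [1] * r + [0] * (N - r)
    return sum(rng.sample(box, n))

# Hypothetical numbers, chosen only for illustration.
N, r, n = 50, 20, 10
rng = random.Random(0)  # fixed seed so the run is repeatable

counts = [draw_red_count(N, r, n, rng) for _ in range(20000)]
mean_red = sum(counts) / len(counts)

# The average number of red balls should be close to n * r / N.
print(mean_red, n * r / N)
```

The number of red balls varies from draw to draw, but its average settles near n·r/N, the mean of the hypergeometric distribution.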

The probability of getting exactly x red balls in that sample of n balls is given by

$$P(X = x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$$

That is,

[(r choose x) * (N-r choose n-x)] / (N choose n) ------ Let us call this formula A

*Those who are not familiar with combinations or what (n choose r) means can find the details* __here__*.*
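Formula A translates almost directly into code. Here is a minimal Python sketch using the standard library's `math.comb`; the function name `hypergeom_pmf` and the small example (10 balls, 4 red, 3 drawn) are my own choices for illustration and do not come from the example above.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, r: int, n: int) -> float:
    """Formula A: probability of exactly x red balls in a sample of n drawn
    without replacement from N balls, of which r are red."""
    # Impossible outcomes get probability 0 (math.comb rejects negative inputs).
    if x < 0 or x > r or n - x < 0 or n - x > N - r:
        return 0.0
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Illustrative numbers: 10 balls, 4 red, draw 3.
probs = [hypergeom_pmf(x, N=10, r=4, n=3) for x in range(4)]
print(probs)       # probabilities of 0, 1, 2, 3 red balls
print(sum(probs))  # the probabilities over all possible x add up to 1
```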

__Application in a practical setting: Count the tigers example__

Suppose there is a huge forest which is known to be a natural habitat for a particular type of tiger population. As a forest official, you are required to estimate the number of tigers in that forest every year. However, this exercise should be done with minimal disturbance to the animals and their habitat. Physical counting by covering the entire forest might not be feasible due to the dense nature of the forest. In this situation, can we use probability and statistics to accomplish the task?

**Objective:** To get the best estimate of the total number of tigers N in the forest.

**Step 1:** Capture a fixed number of tigers and tag them so that they can be identified later *(in the textbook example, these tagged tigers are the red balls)*. Suppose we captured 40 tigers and released them back into the forest after tagging them (r = 40).

**Step 2:** After a few days, capture a bigger sample of tigers (n) and check how many of them are tagged. Suppose we take a sample of 100 tigers and find that 4 of them are tagged (n = 100, x = 4).

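So, we have n = 100, r = 40, x = 4, and N is unknown.

Now, using formula A, we know that the probability of getting 4 tagged tigers in the second step is [(40 choose 4) * (N-40 choose 96)] / (N choose 100). Viewed as a function of the unknown N, this probability is called the likelihood of N.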

Using MS Excel, it is easy to compute these probabilities for various values of N. The table below lists some values of N and the corresponding likelihood (probability):

| N | Likelihood |
|---|---|
| 990 | 0.210081 |
| 991 | 0.21009 |
| 992 | 0.210098 |
| 993 | 0.210105 |
| 994 | 0.21011 |
| 995 | 0.210115 |
| 996 | 0.210119 |
| 997 | 0.210122 |
| 998 | 0.210124 |
| 999 | 0.210125 |
| 1000 | 0.210125 |
| 1001 | 0.210124 |
| 1002 | 0.210122 |
| 1003 | 0.210119 |
| 1004 | 0.210116 |
| 1005 | 0.210111 |
| 1006 | 0.210105 |
| 1007 | 0.210098 |
| 1008 | 0.210091 |
| 1009 | 0.210082 |
| 1010 | 0.210073 |
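The same values can be computed outside a spreadsheet. Below is a minimal Python sketch that evaluates formula A with r = 40, n = 100, x = 4 for N from 990 to 1010; the function name `likelihood` and the variable `table` are my own, and this is one possible way to do it rather than the method used to produce the figures above.

```python
from math import comb

def likelihood(N: int, r: int = 40, n: int = 100, x: int = 4) -> float:
    """Formula A evaluated at the observed data, as a function of the unknown N."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)

# Likelihood of each candidate population size from 990 to 1010.
table = {N: likelihood(N) for N in range(990, 1011)}

for N, p in table.items():
    print(f"{N}: {p:.6f}")
```

Python's integers have arbitrary precision, so the very large binomial coefficients involved here are computed exactly before the final division.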

Observe that as N increases, the likelihood increases until N = 999, stays the same at N = 1000, and then starts decreasing. The values at 999 and 1000 are in fact exactly equal, so the likelihood is maximized at both; we take the maximum likelihood estimate of N to be 1000 = rn/x *(with maximum likelihood equal to 0.210125)*. Hence, according to the results we obtained from our experiment, N = 1000 is the population size under which observing 4 tagged tigers in a sample of 100 is most likely, and it is our best estimate of the total number of tigers in that forest.
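The location of the peak can also be checked without tabulating, by comparing consecutive likelihoods. Writing L(N) for formula A evaluated at the observed x (the symbol L is my notation, not part of the original example), the common factor (r choose x) cancels and the ratio simplifies to

$$\frac{L(N)}{L(N-1)} = \frac{\binom{N-r}{n-x}}{\binom{N-1-r}{n-x}} \cdot \frac{\binom{N-1}{n}}{\binom{N}{n}} = \frac{(N-r)(N-n)}{N\,(N-r-n+x)}$$

This ratio is at least 1 exactly when

$$(N-r)(N-n) \ge N(N-r-n+x) \iff rn \ge Nx \iff N \le \frac{rn}{x}$$

For our numbers, rn/x = (40 × 100)/4 = 1000. So the likelihood rises while N < 1000, the ratio equals exactly 1 at N = 1000 (so L(999) = L(1000)), and the likelihood falls for N > 1000.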

There are many other real-life situations where the hypergeometric distribution can be used. Understanding applied probability becomes fun when the theory is motivated by real-life examples.

I hope you enjoyed counting cats using statistics!
