Saturday, October 24, 2009

why data collection is important

% of iitians areawise.

every year, i am told by an iit professor, about 10 lakh (1,000,000) students appear for the JEE. there are a total of 5000 seats in all the IITs put together. what are the odds of a student getting in?

5000/1000000, or more simply, 1 in every 200. what is the population of three streets put together in madras? probably 1 kid from one half of kodambakkam can make it to the IITs. what about the coaching classes? if we consider their success rates and update the probability, the student who doesn't have the advantage of coaching, who is mediocre, and who relies completely on his own effort is almost out of contention (exceptions apart).
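a minimal sketch of the arithmetic above, plus the "update" for coaching. the seat and applicant counts are the ones quoted in the post; the coaching shares are made-up numbers purely for illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def admit_odds(seats, applicants):
    """Base rate: fraction of applicants who get a seat."""
    return Fraction(seats, applicants)


def uncoached_odds(seats, applicants, coached_share_of_applicants,
                   coached_share_of_seats):
    """Odds for a student without coaching.

    Both shares are ASSUMPTIONS supplied by the caller; the post gives
    no actual figures for coaching classes.
    """
    uncoached_seats = seats * (1 - coached_share_of_seats)
    uncoached_applicants = applicants * (1 - coached_share_of_applicants)
    return uncoached_seats / uncoached_applicants


# figures from the post: 5000 seats, 10 lakh applicants
print(admit_odds(5000, 1_000_000))  # 1/200

# hypothetical: 1 in 5 applicants is coached, and they take 4 in 5 seats
print(uncoached_odds(5000, 1_000_000, Fraction(1, 5), Fraction(4, 5)))  # 1/800
```

with those invented shares, the uncoached student's odds drop from 1 in 200 to 1 in 800. real shares would need real data, which is the point of the post.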

what is the number of people missing a train by a couple of seconds because of a crowd? let's say a train can accommodate about 500 people at any given time, at the rate of 60 passengers per car in an 8-car rake (very conservative). at every station, let's say there are 20 people boarding and 10 people getting down per car. what happens now? assume it takes 1 second for one passenger to detrain and 1 second for one to board, that it is a perfect world with perfect order, and that the train stops for 25 seconds. the 10 getting down use up 10 seconds, which leaves 15 seconds for 15 of the 20 to board, so there is a possibility that 5 people may not make it. multiply that by 8 cars: 40, out of a total of 160 people. 25%!
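the crude model above fits in a few lines. this is only a sketch under the post's assumptions, read as per-car numbers: one passenger moves through the door at a time, everyone getting down goes first, and each movement takes exactly 1 second. the function names are made up for illustration.

```python
def left_behind_per_car(boarding, alighting, dwell_s, alight_s=1, board_s=1):
    """Passengers who cannot board one car before the doors close.

    Assumes perfect order: alighting passengers go first, one at a time,
    then boarding passengers, one at a time.
    """
    time_left = max(0, dwell_s - alighting * alight_s)
    can_board = min(boarding, time_left // board_s)
    return boarding - can_board


def missed_for_train(cars, boarding, alighting, dwell_s):
    """Return (missed, wanted, fraction missed) for the whole train."""
    missed = cars * left_behind_per_car(boarding, alighting, dwell_s)
    wanted = cars * boarding
    return missed, wanted, missed / wanted


# numbers from the post: 8 cars, 20 boarding and 10 getting down per car,
# 25 second stop
print(missed_for_train(8, 20, 10, 25))  # (40, 160, 0.25)
```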

this is why we need to improve our models. why is this model crude? it does not take into account irrational behavior, nor does it account for lack of order. everything is assumed, even if we have the data for all the passengers getting in or getting out, which could follow any one of a million probability distributions (or be completely empirical, for that matter).
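one small step away from "everything is assumed" is to let the passenger counts vary from stop to stop. the sketch below is a plain monte carlo loop over the same 25-second door model; the uniform ranges in the usage lines are pure assumptions standing in for whatever distribution (or observed data) one actually has, and all names are hypothetical.

```python
import random


def left_behind(boarding, alighting, dwell_s=25, alight_s=1, board_s=1):
    """Same perfect-order door model: alight first, then board."""
    time_left = max(0, dwell_s - alighting * alight_s)
    return boarding - min(boarding, time_left // board_s)


def simulate_missed_fraction(n_stops, draw_boarding, draw_alighting, seed=0):
    """Average fraction of would-be boarders left behind over n_stops.

    draw_boarding and draw_alighting take a random.Random instance and
    return an integer count; swap in any distribution or observed data.
    """
    rng = random.Random(seed)
    missed = wanted = 0
    for _ in range(n_stops):
        boarding = draw_boarding(rng)
        alighting = draw_alighting(rng)
        missed += left_behind(boarding, alighting)
        wanted += boarding
    return missed / wanted if wanted else 0.0


# fixed counts reproduce the 25% from the crude model
print(simulate_missed_fraction(1000, lambda r: 20, lambda r: 10))  # 0.25

# ASSUMED spread: 10 to 30 boarding, 5 to 15 getting down, per stop
print(simulate_missed_fraction(1000,
                               lambda r: r.randint(10, 30),
                               lambda r: r.randint(5, 15)))
```

this still ignores pushing, queue jumping, and everything else irrational; it only shows where real data would plug in.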

some other interesting things that we can model: infiltration of terrorists on the borders, number of people missing from an area, correlation of garbage density and flu, suicide rates and living index...

an example of exaggeration:

how much can you save if you don't brush your teeth twice a day, and do it only once? a standard tube has 150 g of toothpaste, and every use takes around 1.5 to 2 g (let's say it is 1.5 g). so, for a day, it's 3 g. at this rate, the tube lasts for 50 days. but what happens when you brush only once? it lasts 100 days, or 3 months and 10 days. in terms of savings, that is half of what you would have to spend. so, over 100 years, the amount of money you would spend (at Rs 30 per tube) would have been reduced from about Rs 21,900 to about Rs 10,950. now, whoever told you that brushing twice was healthy was clearly not aware of this! (a stray idea just to emphasize data manipulation to absurdity.)

now that is nonsense! blowing it out of proportion to make the numbers seem significant is an extreme exercise that can only addle the weak.
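for anyone who wants to check the toothpaste numbers, here is a rough sketch. it assumes a 150 g tube (the 50-day figure only works out in grams), 1.5 g per use, Rs 30 per tube, and 365 days a year; the function names are made up.

```python
def days_per_tube(tube_g, grams_per_use, uses_per_day):
    """How long one tube lasts at a steady rate of use."""
    return tube_g / (grams_per_use * uses_per_day)


def cost_over_years(years, price_per_tube, tube_g, grams_per_use,
                    uses_per_day, days_per_year=365):
    """Total spend on toothpaste over the given number of years."""
    tubes = years * days_per_year / days_per_tube(tube_g, grams_per_use,
                                                  uses_per_day)
    return tubes * price_per_tube


twice = cost_over_years(100, 30, 150, 1.5, 2)
once = cost_over_years(100, 30, 150, 1.5, 1)
print(days_per_tube(150, 1.5, 2), days_per_tube(150, 1.5, 1))  # 50.0 100.0
print(twice, once, twice - once)  # 21900.0 10950.0 10950.0
```

the trick is entirely in the `years` argument: the same saving is Rs 109.50 per year, which is a lot less exciting than five figures over a century.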

what does this post really mean? that we have to be careful about data. and also, that i am jobless in the middle of the fall quarter, during which i managed to get my hands on Outliers.



  1. hey dude, what is your major? Looks like you're well versed in probability. Anyway, FYI, businesses need to create something called a recurring business model, i.e., in simple terms, encourage people to use the products again. Most businesses work on this model.

  2. Knowing what to measure and how to measure (Freakonomics) means everything. Anyone who uses shampoo must've read the directions "Rinse well and repeat for best results". Of course, repeating makes us run out of shampoo twice as fast and thereby doubles sales for the manufacturer.

    Gautham, I'm interested in knowing how you're taking this interest in numbers forward to something big in your chosen field/interest.

  3. @sat
    i am supposed to be a mechanical engineer. then i started working on reliability and event simulations. that's when i started going mad. :)
    also, i suck at probability, it's too tough.
    numbers are everything. quality control, reliability models, discrete event simulations, queuing...
    your point is interesting, i should probably write about the shampoo stuff... :)

  4. Bloody brilliant post! Really enjoyed it! Yes, of course, most models don't take into account irrational behaviour. I reckon that comes down mainly to scalability issues, and I also think irrationality is something that is difficult to model mathematically (we all like well behaved functions, don't we!) and simulate. I also saw something on the tele when I was very young (it may be pertinent to the topic, but I have no idea): it said that for a crowd passing through an open door, the frequency of those people on the edges of the door getting through per second is higher than those going through from the centre.

    Top post, and maybe I can expect more of the same! :P

  5. hey wow, that open door thing is cool, isn't it! and anyway, this post is mainly thanks to you, and your Outliers suggestion!!! :)

  6. Haha! Yeah, but sometimes, the way he treats numbers is a bit literal, isn't it? :D And yeah, the door thing is totally cool! If you can, try to get the "Chaos and cauliflowers" lecture from the "Christmas Lecture series" (1997, I think) by Prof. Ian Stewart. Loved that series when I was young!