“As a rule not knowing is a step towards new knowledge.” – Laila (Sophie’s World)

Sunday, January 3, 2016

Zipf Mystery

Take a body of natural-language text and count how often each word occurs. Then rank the words in order of decreasing frequency.

You will notice that the frequency of the second-ranked word is about half that of the first, the frequency of the third is about one-third that of the first, and so on. (The pattern tends to become clearer with a larger sample, i.e. a higher word count.) Why is this so?

This phenomenon involves something we call "Zipf's Law".

In nearly any language, any large corpus of literature and any set of natural language utterances, the frequency of occurrence of a word is approximately inversely proportional to its rank. For example, "the" is the most commonly used word in the English language. "Be", which ranks second, occurs about half as often as "the". "To", which ranks third, occurs about one-third as often as "the".
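To try this yourself, here is a minimal Python sketch. It counts the words in a piece of text, ranks them, and puts each observed count next to what Zipf's Law would predict (the top count divided by the rank). The function names and the tiny sample sentence are my own illustrations; a real test needs a much larger text.

```python
from collections import Counter
import re


def rank_words(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_table(text, top=5):
    """Pair each observed count with the Zipf prediction: top count / rank."""
    ranked = rank_words(text)
    if not ranked:
        return []
    top_count = ranked[0][1]
    rows = []
    for rank, (word, count) in enumerate(ranked[:top], start=1):
        rows.append((rank, word, count, top_count / rank))
    return rows


# Illustrative sample only; far too short to show the law convincingly.
sample = "the cat and the dog and the bird saw the fox"
for rank, word, count, predicted in zipf_table(sample):
    print(rank, word, count, round(predicted, 2))
```

Replace `sample` with the full text of a book and raise `top` to see how closely the observed column follows the predicted one.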

According to this video, and Wikipedia as well, this kind of distribution is also found in city populations, solar flare intensities, protein sequences, immune receptors, the amount of traffic websites get, earthquake magnitudes, the number of times academic papers are cited, last names, the firing patterns of neural networks, the ingredients used in cookbooks, the number of phone calls people receive, the diameters of Moon craters, the number of people who die in wars, the popularity of opening chess moves, even the rate at which we forget, the planets, and the elements in the periodic table.

This distribution is related to what the video calls the "Pareto Principle", under which roughly 80% of the effects come from 20% of the causes. The principle shows up in our society, where 20% of the population owns about 80% of the wealth! Another example from the video: Microsoft noted that by fixing the top 20% of the most-reported bugs, 80% of the related errors and crashes in a given system would be eliminated. In the business world, 20% of the customers are said to be responsible for 80% of the profits, and 80% of complaints come from 20% of customers.
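A quick way to see how the two ideas connect is to assume an idealized case in which the item at each rank gets a weight of exactly 1/rank, then ask how much of the total the top 20% of items hold. The sketch below does that arithmetic; the function name, the 1,000 items and the exact 1/rank weights are my own assumptions, not figures from the video.

```python
def top_share(n_items=1000, top_fraction=0.2):
    """Share of the total held by the top-ranked fraction of items,
    assuming the item at each rank has weight exactly 1 / rank."""
    weights = [1 / rank for rank in range(1, n_items + 1)]
    n_top = int(n_items * top_fraction)
    return sum(weights[:n_top]) / sum(weights)


print(f"{top_share():.1%}")
```

With these assumptions the top 20% end up with a little under 80% of the total, though the exact share shifts with the number of items.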

The video suggests an experiment: start with a pool of paper clips. Pick any two paper clips and link them together, then return them to the pool. Repeat the process. After some time you may notice that one chain has grown disproportionately long compared with the others. This is simply because a clip that is already part of a long chain is more likely to be picked, so the chain that gets ahead early has more chances of growing further.
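The paper clip experiment can also be simulated. The following is a rough sketch under my own assumptions: 1,000 clips, 900 picks, and two clips drawn uniformly at random each time (so a chain is picked in proportion to its length). None of these numbers come from the video.

```python
import random


def link_paper_clips(n_clips=1000, n_links=900, seed=0):
    """Simulate the paper clip game; return chain sizes, largest first."""
    rng = random.Random(seed)
    # chain_of[i] is the id of the chain that clip i currently belongs to.
    chain_of = list(range(n_clips))
    # members[c] lists the clips that make up chain c.
    members = {i: [i] for i in range(n_clips)}
    for _ in range(n_links):
        a, b = rng.sample(range(n_clips), 2)
        chain_a, chain_b = chain_of[a], chain_of[b]
        if chain_a == chain_b:
            continue  # both clips are already in the same chain
        # Relabel the shorter chain so merging stays cheap.
        if len(members[chain_a]) < len(members[chain_b]):
            chain_a, chain_b = chain_b, chain_a
        for clip in members[chain_b]:
            chain_of[clip] = chain_a
        members[chain_a].extend(members.pop(chain_b))
    return sorted((len(m) for m in members.values()), reverse=True)


sizes = link_paper_clips()
print(sizes[:5])  # the five longest chains
```

Try changing `n_links` and `seed` to see how quickly one chain pulls ahead of the rest.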

Isn't this similar to "The rich get richer, and the poor get poorer" and "The popular ones become more popular", something like a positive feedback mechanism? The video mentions the "Principle of Least Effort" and "Preferential Attachment" as possible mechanisms.

If so many natural processes follow Zipf's Law, is there any hope of escaping this pattern, especially when its effects are bad?

The Zipf Mystery
