If you are into Entity Analytics (like I am) then a cursory understanding of Zipf’s Law should be a tool in your bag. It’s a really cool mathematical relationship that governs many of the distributions you will encounter in our line of work: natural data sets that follow a consistent frequency distribution.

https://en.wikipedia.org/wiki/Zipf%27s_law

The representation of the law:

[Figure: the Zipf’s Law formula]
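In case the image doesn’t come through, this is the standard form given on the Wikipedia page above: the normalized frequency of the element at rank k, given N elements and exponent s.

```latex
f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}
```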

The basic premise is that we have a power-law distribution whose shape is governed by three key variables:

  1. The total number of entities,
  2. The maximum size of an entity within the population,
  3. And the “skew” or exponent value.

The total number of entities and maximum entity size are fairly straightforward. Skew is a bit more difficult to grok; the quick sketch below shows how it changes the drop-off, and then we’ll check out how cool Zipf’s Law is in practice.
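Here’s a minimal Python sketch of that skew behavior. The parameter names and the exact rank-based scaling (entity size falls off as rank to the negative skew, with rank 1 pinned to the maximum entity size) are my own choices for illustration, not a formula from any particular implementation:

```python
# Illustration only: how the skew exponent changes the drop-off of entity
# sizes by rank. The scaling max_entity_size / rank**skew is an assumption
# chosen to make the shape easy to see.

def entity_size(rank, max_entity_size, skew):
    """Expected record count for the entity at a given (1-based) rank."""
    return max(1, round(max_entity_size / rank ** skew))

num_entities = 10
max_entity_size = 1000

for skew in (0.5, 1.0, 1.5):
    sizes = [entity_size(r, max_entity_size, skew) for r in range(1, num_entities + 1)]
    print(f"skew={skew}: {sizes}")
```

A higher skew means the curve falls off faster: at skew 1.5 the tenth-ranked entity is already tiny relative to the first, while at skew 0.5 the sizes stay much flatter.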

Zipf’s Law and natural data sets

If you plot the number of people who live in a city (y) against the city’s rank by population (x), you see an interesting linear relationship on a log-log plot:

[Figure: city population vs. rank on a log-log plot]

If you take a similar view of the Brown Corpus, a frequency count of English word usage, you see the same kind of linear relationship on a log-log plot:

[Figure: word frequency vs. rank in the Brown Corpus on a log-log plot]
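If you want to check your own data for this behavior, a rough sketch like the following will do it: count frequencies, sort them descending, and plot frequency against rank on log-log axes. A roughly straight line is the Zipfian signature. (The corpus.txt path here is just a placeholder for whatever text or token dump you have handy.)

```python
# Sketch: rank-frequency plot on log-log axes for a whitespace-tokenized
# text file. "corpus.txt" is a placeholder path, not a real file in this post.

from collections import Counter

import matplotlib.pyplot as plt

with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)
frequencies = sorted(counts.values(), reverse=True)
ranks = range(1, len(frequencies) + 1)

plt.loglog(ranks, frequencies, marker=".", linestyle="none")
plt.xlabel("rank")
plt.ylabel("frequency")
plt.title("Rank vs. frequency (log-log)")
plt.show()
```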

These are just two examples of how Zipf’s law governs natural data sets. There are many more, and as I find time, I’ll add them to this post.

So how can we use Zipf’s Law in Entity Analytics?

It’s actually a pretty useful model for building test data sets. Customers often will not, or cannot, share their data with offsite facilities for testing purposes. We also have “what if” scenarios that can’t be tested against production data, such as “what if I doubled the volume of data in my repository in the next year?” So it’s useful to be able to create realistic entity distributions within an artificial data set to see how our linking approach scales.

We know from field implementations that entity distributions follow Zipf’s Law, so we can lean on it when creating entity distributions for test purposes. Normally, folks create records first and then try to inject duplication into the record set. Here we flip the process: we create the entity distribution first, fitting a Zipfian model, and then go back and populate that distribution with records.

First we have to build out an entity distribution:

[Figure: a generated Zipfian entity-size distribution]
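A minimal sketch of that step, using the three parameters from earlier and the same assumed rank-based scaling as the skew example above:

```python
# Sketch: build the entity distribution first. The function name, parameter
# names, and the max_entity_size / rank**skew scaling are my own assumptions
# for illustration; every entity gets at least one record.

def build_entity_distribution(num_entities, max_entity_size, skew):
    """Return a list of entity sizes (record counts), largest first."""
    return [max(1, round(max_entity_size / rank ** skew))
            for rank in range(1, num_entities + 1)]

entity_sizes = build_entity_distribution(num_entities=100_000,
                                         max_entity_size=500,
                                         skew=1.0)
print("entities:", len(entity_sizes))
print("records:", sum(entity_sizes))
print("largest entity:", entity_sizes[0], "smallest entity:", entity_sizes[-1])
```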

Now that we have an entity distribution, we can use it as the basis for generating a person-based data set: walk through each entity, and populate the records representing that person with the attributes of interest. Once record generation is complete, we have a data set that is representative of the types of records in our system, populated with realistic frequencies of tokens, errors, and omissions based on our real reference data, along with a known truth set that comes directly from the entity distribution we started with.
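And a sketch of that population step, picking up the entity_sizes list from the previous sketch. The attribute pools and the crude error/omission injection are stand-ins for sampling from your real reference data at its observed frequencies; the entity_id stamped on every record is what gives you the truth set for free.

```python
# Sketch: populate the entity distribution with person records. The name
# lists, error rates, and attribute set are placeholders, not real
# reference data.

import random

FIRST_NAMES = ["JOHN", "MARY", "ROBERT", "LINDA", "JAMES"]
LAST_NAMES = ["SMITH", "JONES", "GARCIA", "CHEN", "MILLER"]

def make_records(entity_sizes, seed=42):
    rng = random.Random(seed)
    records = []
    for entity_id, size in enumerate(entity_sizes):
        # Pick the "true" attributes for this person once per entity.
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        for _ in range(size):
            # Occasionally drop or perturb an attribute to mimic the
            # errors and omissions seen in real data.
            rec_first = "" if rng.random() < 0.05 else first
            rec_last = last[:-1] if rng.random() < 0.02 else last
            records.append({"entity_id": entity_id,
                            "first_name": rec_first,
                            "last_name": rec_last})
    return records

records = make_records(entity_sizes)  # entity_sizes from the sketch above
print(len(records), "records generated")
```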

Zipf’s Law…more cool mathy stuff to fiddle with.