As a performance data scientist, my day job is finding non-obvious data access patterns in workloads. These patterns can be used to tune a system or to learn more about the behavior of the users driving those workloads. We can often tell a lot from the metadata alone, without seeing the contents of the transactional data, which may be private or sensitive. This leads us to develop broad transactional profiling methodologies that feed back into applications so they can self-tune their configurations and reduce the cost of running a workload or, in some cases, give operators insight into their users and usage patterns.
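
To make the metadata-only idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record layout (user, hour of day, operation, size in bytes), the sample values, and the `profile` helper are hypothetical and not taken from any real workload or tool.

```python
from collections import Counter, defaultdict

# Hypothetical metadata-only access records: (user_id, hour_of_day, operation, bytes).
# No payload contents are present; every field and value is an illustrative assumption.
RECORDS = [
    ("u1", 2, "read", 4096),
    ("u1", 3, "read", 4096),
    ("u1", 2, "read", 8192),
    ("u2", 14, "write", 1048576),
    ("u2", 15, "write", 2097152),
    ("u2", 14, "read", 4096),
]


def profile(records):
    """Summarize each user's access pattern from metadata alone."""
    ops = defaultdict(Counter)      # user -> count per operation type
    hours = defaultdict(Counter)    # user -> count per hour of day
    total_bytes = defaultdict(int)  # user -> total bytes transferred

    for user, hour, op, size in records:
        ops[user][op] += 1
        hours[user][hour] += 1
        total_bytes[user] += size

    result = {}
    for user in ops:
        n = sum(ops[user].values())
        result[user] = {
            "read_ratio": ops[user]["read"] / n,              # share of reads
            "peak_hour": hours[user].most_common(1)[0][0],    # busiest hour
            "mean_bytes": total_bytes[user] / n,              # average request size
        }
    return result


PROFILES = profile(RECORDS)
for user, summary in PROFILES.items():
    print(user, summary)
```

In this toy data, `u1` looks like a small-block, read-only job that runs at night, while `u2` looks like a write-heavy daytime user, and both conclusions come from the metadata without touching a single payload.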

This effort can be taken to extreme levels when B2C data is involved. With the explosion of IoT and personal wearables, entirely new data sources are being compiled that provide rich opportunities for deep analytics and user profiling. This also blurs the boundaries of privacy and likely pushes beyond what users expect.

This concept was captured fairly succinctly in this PCAST report to the President from 2014 (if the link is not active you can find a local copy here):

Big data is big in two different senses. It is big in the quantity and variety of data that are available to be processed. And, it is big in the scale of analysis (termed “analytics”) that can be applied to those data, ultimately to make inferences and draw conclusions. By data mining and other kinds of analytics, non-obvious and sometimes private information can be derived from data that, at the time of their collection, seemed to raise no, or only manageable, privacy issues. Such new information, used appropriately, may often bring benefits to individuals and society – Chapter 2 of this report gives many such examples, and additional examples are scattered throughout the rest of the text. Even in principle, however, one can never know what information may later be extracted from any particular collection of big data, both because that information may result only from the combination of seemingly unrelated data sets, and because the algorithm for revealing the new information may not even have been invented at the time of collection.

Interesting articles on the subject:

Fitbit / wearables and sacrificing privacy

Trane thermostat vulnerability