The dark side of data

How data garbage hurts business and the environment

Everyone is talking about big data. There is indeed great potential for extracting economic and societal value from huge amounts of data. Feed algorithms with enough data, the promise goes, and machine learning could provide solutions to almost everything. So much for the bright side. In the shadows of the big data vision, however, lurks a less pleasant reality: huge piles of data garbage, gazillions of files lingering unused on servers around the world. This is dark data.

According to the “Databerg Report” published by information solution provider Veritas in 2015, the data held by organisations in Europe, the Middle East and Africa consists on average of 14% identifiable business-critical data, 32% ROT (redundant, obsolete and trivial) data, and 54% dark data.

Market research firm Gartner defines dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” These other, more productive purposes could be, for example, analytics, business relationships and direct monetisation.

How dark data is produced

The critical question is how dark data comes into existence in the first place. There are various causes. One of the underlying enablers is that data storage is seemingly cheap and abundant. As a result, any data that could possibly be useful is stored, whether or not it is ever used. And once data is stored, there is usually nobody responsible for reviewing it and reducing its volume.

On the production side, there are many contributors. Organisations often retain dark data for compliance purposes only. That is ironic, because in some cases storing data creates more compliance risk than benefit. Just think of personal data and the risk of violating data privacy regulations.

While dark data used to be produced mainly by humans, nowadays the biggest share is produced by machines, including information gathered by sensors and telematics. According to an estimate by IBM from 2015, roughly 90% of the data generated by sensors and analogue-to-digital conversions never gets used. It is doubtful whether this has improved in the last six years. I wouldn’t be surprised if it were even worse now.

Some organisations seem to believe that dark data could be useful to them in the future, once they have acquired better analytics and business intelligence technology to process the information. While this is theoretically possible, in practice I find it hard to believe that much value will be generated in ten years from analysing the dark data that humans and, above all, machines are producing today. Even if a small amount of today’s dark data turns out to be pure gold in ten years’ time, the question is whether it would be worth the problems dark data already creates today.

Why dark data is a problem

Given that cloud storage is cheap, why should dark data be a problem at all? The answer lies in its sheer scale. Once the amount of dark data exceeds a certain level, the total storage bill is no longer small, however low the price per gigabyte. The “Databerg Report” from 2015 predicted that dark data could cause 891 billion dollars of avoidable storage and management costs by 2020 if left unchecked. I have not seen any recent study on the amount and cost of dark data. However, I strongly suspect that the real cost is even higher today.

As storing huge amounts of data consumes a lot of energy and material for the data centre infrastructure, there is not just a financial cost but also an environmental cost in the form of carbon dioxide emissions.

One of the reasons why the problem persists and might actually grow over the coming years is that most companies probably have no idea about the volume and cost of dark data.

What can be done about dark data

In my view, an important part of the solution can be derived from a famous quote attributed to Lord Kelvin: “If you cannot measure it, you cannot improve it.” Applying these words of wisdom to dark data, you could say: if you can measure dark data, you can remove it. Even if removing dark data is not always the preferred solution, for example because of compliance needs or value expected in the future, it would be a good start to be aware of the scope of the problem and to know which data on an organisation’s servers is dark. Maybe the machines that increasingly generate dark data could also help to remedy the problem, with machine learning used to weed out useless data.
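
To make the idea of measuring a little more concrete, here is a minimal sketch in Python of what a first measurement could look like. It rests on assumptions of my own, not on any of the reports cited above: a file counts as “dark” if it has been neither read nor modified for a chosen number of days (one year by default), and the names `measure_dark_data` and `DarkDataReport` are purely illustrative. Access times are not recorded reliably on every file system, so the result is a rough indicator at best.

```python
import os
import time
from dataclasses import dataclass


@dataclass
class DarkDataReport:
    total_bytes: int   # size of all files scanned
    dark_bytes: int    # size of files idle for longer than the threshold
    dark_files: list   # paths of those idle files

    @property
    def dark_share(self) -> float:
        """Fraction of the scanned volume that is idle (0.0 if nothing was scanned)."""
        return self.dark_bytes / self.total_bytes if self.total_bytes else 0.0


def measure_dark_data(root, max_idle_days=365, now=None):
    """Walk `root` and report files not accessed or modified for `max_idle_days`.

    Assumption: "dark" simply means "idle for a long time". Real dark data
    would also need business context that file timestamps cannot provide.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day

    total = dark = 0
    dark_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += st.st_size
            # Take the later of access and modification time as "last used".
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                dark += st.st_size
                dark_files.append(path)
    return DarkDataReport(total, dark, dark_files)
```

Calling `measure_dark_data("/srv/shared")` (a hypothetical path) returns the total volume, the idle volume and the list of idle files, and `report.dark_share` gives the idle fraction. A number like that says nothing yet about whether the files can be deleted, but it turns a vague suspicion into something an organisation can track over time.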