A beginner's guide to data lakes

As data volumes grow, the terms we use to describe where and how data is stored have evolved: from databases to data warehouses and now data lakes. But size isn't the only thing that differentiates this new concept from its predecessors.

First published on November 20, 2018

What is a data lake?

James Dixon, CTO of the business intelligence software platform Pentaho, is believed to have coined the term "data lake" when he contrasted this form of storage with a data mart: "If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

In short, a data lake is a storage repository that can accommodate a steady stream of incoming data, from multiple sources, in its original format. It can sit on-premises, in the cloud with a provider such as Google, Microsoft, Oracle or Amazon, or in a hybrid of the two. Data lakes are typically built using Hadoop or other big data technologies that enable organizations to store significant volumes of data cost-effectively.

What does it do?

Fundamentally, a data lake holds data in its rawest form, without the need for it to have been processed or analyzed. The source of this data may be relational (from operational databases or line of business applications) or non-relational (from mobile apps, IoT devices and social media).
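To make "rawest form" concrete, here is a minimal sketch in Python of landing files in a lake exactly as they arrive. It assumes the lake is just a local folder; the land_raw function, the raw/<source>/<date>/ layout and the sample rows are all hypothetical, chosen for illustration rather than taken from any particular product.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes, day: date) -> Path:
    """Store payload unchanged under <lake_root>/raw/<source>/<YYYY-MM-DD>/<filename>.

    The folder layout is an assumption made for this sketch.
    """
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, cleansing or schema is applied
    return target


# A temporary folder stands in for the lake.
lake = Path(tempfile.mkdtemp())
day = date(2018, 11, 20)

# Relational source: a CSV export from an operational database (made-up rows).
orders = land_raw(lake, "orders_db", "orders.csv", b"id,total\n1,9.99\n2,24.50\n", day)

# Non-relational source: a JSON reading from an IoT device (made-up values).
event = json.dumps({"device": "sensor-7", "temp_c": 21.4}).encode("utf-8")
reading = land_raw(lake, "iot", "reading.json", event, day)
```

Both files sit side by side in their original formats; deciding how to interpret them is left to whoever reads them later.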

Once the data has been imported, people across your organization – such as data scientists, developers or business analysts – can crawl, catalog, index and analyze it without having to run it through a separate analytics system.
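As a rough illustration of what crawling, cataloging and indexing can mean in practice, the Python sketch below walks a folder standing in for the lake, records basic metadata for each file, and groups the files by format. The crawl and build_index functions and the sample files are hypothetical; real lakes usually rely on a dedicated catalog service rather than a script like this.

```python
import tempfile
from pathlib import Path


def crawl(lake_root: Path) -> list[dict]:
    """Walk the lake and record one catalog entry per file."""
    catalog = []
    for path in sorted(lake_root.rglob("*")):
        if path.is_file():
            catalog.append({
                "path": path.relative_to(lake_root).as_posix(),
                "format": path.suffix.lstrip(".") or "unknown",
                "bytes": path.stat().st_size,
            })
    return catalog


def build_index(catalog: list[dict]) -> dict:
    """Index catalog entries by file format so they can be looked up quickly."""
    index = {}
    for entry in catalog:
        index.setdefault(entry["format"], []).append(entry["path"])
    return index


# A temporary folder with two made-up files stands in for the lake.
lake = Path(tempfile.mkdtemp())
(lake / "orders").mkdir()
(lake / "orders" / "orders.csv").write_bytes(b"id,total\n1,9.99\n")
(lake / "iot").mkdir()
(lake / "iot" / "reading.json").write_bytes(b'{"temp_c": 21.4}')

catalog = crawl(lake)
index = build_index(catalog)
```

With the catalog in place, an analyst can ask for, say, every CSV file in the lake via index["csv"] without opening each folder by hand.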

How could it benefit my business?

Because the data is imported "as-is", it can be worked on by a wide range of applications, including big data processing, data visualization, machine learning tools and AI. This level of analytical agility can translate to substantial ROI: a survey by Aberdeen found that organizations with a data lake outperformed similar companies by 9% in organic revenue growth, while MarketsandMarkets estimates that the market for data lakes will be worth almost $9bn by 2021.

Do I need one?

This is a reasonable question to ask, given the doom-laden warnings about data lakes turning into "swamps" overflowing with petabytes of useless data. However, according to an article on Forbes.com by Shant Hovsepian, co-founder and CTO of Arcadia Data, most organizations that use data lakes have positive things to say, in particular about their ability to enable non-technical users to analyze data. Among the leading cheerleaders for the technology is Epic Games, which uses a data lake to store and analyze the colossal amount of data from game clients, servers and services generated by Fortnite, the world's most popular game. 

How do I secure information stored in this way?

The flexibility and agility of data lakes make them a potential security nightmare, especially from a regulatory compliance point of view: they allow you to dump data in its original format, can become a sandbox in which analysts and developers can play, and are often hosted in the cloud. Authentication, access controls and data encryption must be applied, all traffic to the lake secured and scrutinized, and the data backed up to guard against the risk of a ransomware attack.
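Two of those controls, access control and scrutiny of requests, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the roles, the "raw" and "curated" zones, the POLICY table and the Lake class are hypothetical, and a real deployment would lean on the access management and logging features of its storage platform, alongside authentication, encryption and backups.

```python
from dataclasses import dataclass, field

# Hypothetical policy: which roles may read which top-level zones of the lake.
POLICY = {
    "data_scientist": {"raw", "curated"},
    "business_analyst": {"curated"},
}


@dataclass
class Lake:
    # Every read request is recorded, whether it is allowed or denied.
    audit_log: list = field(default_factory=list)

    def can_read(self, role: str, path: str) -> bool:
        """Allow the read only if the role is cleared for the path's zone."""
        zone = path.split("/", 1)[0]
        allowed = zone in POLICY.get(role, set())  # unknown roles get nothing
        self.audit_log.append((role, path, "allow" if allowed else "deny"))
        return allowed


lake = Lake()
scientist_ok = lake.can_read("data_scientist", "raw/iot/reading.json")
analyst_ok = lake.can_read("business_analyst", "raw/iot/reading.json")
stranger_ok = lake.can_read("intern", "curated/sales.csv")
```

Denying by default and logging every decision means a sandbox stays open to the people who need it while leaving a trail that can be reviewed for compliance.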

