DATA LAKES

April 25, 2023 by filsha.va

A data lake is a storage repository that rapidly ingests large amounts of raw data in its native format. Business users and data scientists can then quickly access the data, or apply analytics to it, whenever they need insights. Unlike a data warehouse, a data lake can store unstructured big data such as tweets, images, voice and streaming data. It can hold data of any type, from any source, of any size, speed or structure. A traditional data warehouse stores data in hierarchical dimensions and tables, whereas a data lake uses a flat architecture, primarily files or object storage, which gives users more flexibility in how data is managed, stored and used.

Data lakes are often associated with Hadoop systems. Such deployments are based on the distributed processing framework: data is loaded into the Hadoop Distributed File System (HDFS) and resides on the different computer nodes of a Hadoop cluster. Increasingly, though, data lakes are being built on cloud object storage services instead of Hadoop. Some NoSQL databases are also used as data lake platforms.

Why do organizations use data lakes?

Data lakes generally store sets of big data that combine structured, unstructured and semistructured data. Such environments are not a good fit for the relational databases on which most data warehouses are built. Relational systems require a rigid schema for data, which typically limits them to storing structured data such as sales records. Data lakes support various schemas and do not require any to be defined upfront. This in turn enables them to handle different types of data in different formats.

This makes data lakes a key data architecture component in many organizations. Companies primarily use them as a platform for big data analytics and other data science applications that require large volumes of data and involve advanced analytics techniques such as data mining, predictive modeling and machine learning.

A data lake provides a central location where data scientists and analysts can find, prepare and analyze relevant data. Without one, that process is more complicated, and it is harder for organizations to take full advantage of their data assets to drive more informed business decisions and strategies.

Data lake architecture

Data lakes can be built on many technologies, which organizations can combine in different ways. As a result, the architecture of a data lake often varies from one organization to another.

Data lakes can store more than raw data. Data can also be filtered and processed for analysis as it is ingested. In that case, the data lake architecture must include sufficient storage capacity for the prepared data. Many data lakes also include analytics sandboxes, dedicated storage spaces that individual data scientists can use to work with data.
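
To make the idea of storing prepared data next to raw data more concrete, here is a minimal sketch in Python. It assumes a local temporary directory stands in for the lake, and the zone names raw and prepared, the ingest function and the filter rule are all hypothetical choices for illustration, not part of any particular product.

```python
import json
import tempfile
from pathlib import Path


def ingest(records, lake_root):
    """Write every record to raw/ and only the usable ones to prepared/."""
    raw_dir = lake_root / "raw"
    prepared_dir = lake_root / "prepared"
    raw_dir.mkdir(parents=True, exist_ok=True)
    prepared_dir.mkdir(parents=True, exist_ok=True)

    # Raw zone: keep everything exactly as it was received.
    raw_file = raw_dir / "events.jsonl"
    raw_file.write_text("\n".join(json.dumps(r) for r in records))

    # Prepared zone: filter at ingestion time. Hypothetical rule:
    # a record is usable only if it has a non-empty "user" field.
    prepared = [r for r in records if r.get("user")]
    prepared_file = prepared_dir / "events.jsonl"
    prepared_file.write_text("\n".join(json.dumps(r) for r in prepared))
    return raw_file, prepared_file


# Hypothetical sample events; two of them lack a usable "user" value.
events = [{"user": "ana", "clicks": 3}, {"user": "", "clicks": 1}, {"clicks": 7}]
lake_root = Path(tempfile.mkdtemp())
raw_file, prepared_file = ingest(events, lake_root)
raw_count = len(raw_file.read_text().splitlines())
prepared_count = len(prepared_file.read_text().splitlines())
```

The raw copy keeps all records while the prepared copy holds only the filtered subset, which is why the prepared data needs its own storage capacity.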

Three main architectural principles distinguish data lakes from conventional data repositories:

Any collected data can be loaded and retained in the data lake if desired.
Data is stored in an untransformed state, as it was received from the source system.
Data is transformed and fitted into a schema later, as needed for specific analytics requirements. This approach is known as schema-on-read.
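
The three principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the SALES_SCHEMA mapping and the read_with_schema function are hypothetical names made up for this example. The raw lines are kept as received, and a schema is applied only when an analysis reads them.

```python
import json

# Raw events kept exactly as received (hypothetical sample records).
# Note that the records do not share a common set of fields.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "ts": "2023-04-01"}',
    '{"user": "raj", "amount": "5", "channel": "mobile"}',
    '{"user": "li", "note": "no amount recorded"}',
]

# A schema chosen by one analysis, long after ingestion:
# field name -> (converter, default used when the field is absent).
SALES_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the schema."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else default
            for name, (convert, default) in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, SALES_SCHEMA))
total = sum(row["amount"] for row in rows)
```

A different analysis could read the same raw lines with a different schema, for example one that keeps the channel field, without anything being reloaded or restructured in storage.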

Whatever technology is used in a data lake deployment, some other elements should also be included to ensure that the data lake is functional and that the data it contains doesn't go to waste. These include the following:

A common folder structure with naming conventions.
A searchable data catalog to help users find and understand data.
A data classification taxonomy to identify sensitive data, with information such as data type, content, usage scenarios and groups of possible users.
Data profiling tools to provide insights for classifying data and identifying data quality issues.
A standardized data access process to help control and keep track of who is accessing data.
Data protections, such as data masking, data encryption and automated usage monitoring.
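
A few of these elements can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the path layout produced by lake_path, the CatalogEntry fields, the in-memory Catalog class and the use of a truncated SHA-256 digest as a masking step. Real deployments would normally rely on dedicated catalog and security tooling.

```python
import hashlib
from dataclasses import dataclass, field


def lake_path(zone, source, dataset, date):
    """Build an object key following one hypothetical naming convention."""
    return f"{zone}/{source}/{dataset}/date={date}/part-0000.json"


def mask(value):
    """Mask a sensitive value with a truncated one-way SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


@dataclass
class CatalogEntry:
    path: str
    description: str
    sensitive: bool = False  # simple stand-in for a classification taxonomy
    tags: list = field(default_factory=list)


class Catalog:
    """A tiny in-memory catalog that can be searched by keyword."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def search(self, term):
        # Case-insensitive match on the description or any tag.
        term = term.lower()
        return [
            e for e in self._entries
            if term in e.description.lower()
            or any(term in t.lower() for t in e.tags)
        ]


catalog = Catalog()
catalog.register(CatalogEntry(
    lake_path("raw", "crm", "customers", "2023-04-25"),
    "Customer contact records", sensitive=True, tags=["pii", "crm"]))
catalog.register(CatalogEntry(
    lake_path("raw", "web", "clicks", "2023-04-25"),
    "Website clickstream events", tags=["web"]))

hits = catalog.search("pii")
masked_email = mask("ana@example.com")
```

Searching for a tag returns the matching entries together with their paths, so users can find data without browsing folders. An unsalted hash is a weak form of masking for values that are easy to guess, so treat that part as a placeholder for a proper masking or encryption tool.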

Data awareness among the users of a data lake is a must, especially if they include business users acting as citizen data scientists. In addition to being trained on how to navigate the data lake, users should understand proper data management and data quality techniques, as well as the organization's data governance and usage policies.

Data lake versus data warehouse

A lake is liquid, shifting and amorphous, fed by rivers, streams and other unfiltered water sources. A warehouse, conversely, is a structure with shelves, aisles and designated places to store the items it contains, which are sourced purposefully for specific uses.

This difference manifests itself in several ways, including the following:

Technology platforms – A data warehouse typically runs on a relational database on a conventional server, whereas a data lake is typically deployed on a Hadoop cluster or, increasingly, on cloud object storage.
Data sources – Warehouse data is primarily extracted from internal transaction processing applications, whereas lakes store a combination of data from many sources, such as mobile apps.
Users – Warehouses are mainly used by BI teams and business analysts, whereas lakes are mainly used by data scientists.
Data quality – Warehouse data is generally trusted because it has been consolidated, whereas lake data is less reliable because it is often left in its raw state.
Agility and scalability – Lakes are highly agile platforms, whereas warehouses are less flexible because of their rigid schemas.
Security – Warehouses have more mature security protections, whereas security methods for lakes are still improving.

As a result, many organizations use a hybrid deployment that integrates the two platforms.

Some benefits of a data lake

Gives data scientists a central place to find, prepare and analyze data
Relatively inexpensive to implement
Supports a variety of analytical methods

Some challenges that data lakes pose

Data swamps
Technology overload
Unexpected costs
Data governance

Some data lake vendors

AWS
Cloudera
Databricks
Dremio
Google
Microsoft
Oracle
Snowflake