Best practices for implementing an ETL/ELT integration pattern, which adds up to a data warehouse implementation

  • Define a plan to test the consistency, accuracy and integrity of the data (a reconciliation sketch follows after this list)
  • The data warehouse must be well-integrated, well-defined and time-stamped
  • While designing a data warehouse, make sure you use the right tools, stick to the customer life cycle, resolve data conflicts and be ready to learn from your mistakes
  • Involve all stakeholders, including business personnel, in the data warehouse implementation process. Establish that data warehousing is a joint team project; you don't want to create a data warehouse that is not useful to the end users
  • Include the stakeholders who own the data at the point of generation and at the point of consumption, to validate that the customer reports and dashboards look meaningful
  • Prepare a training plan for the end users
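
A minimal sketch of the reconciliation check referenced in the first bullet is shown below: after each load it compares row counts and a simple checksum between the source system and the warehouse. The table names, the amount column, the tolerance and the use of sqlite3 connections are illustrative assumptions rather than a prescribed implementation.

```python
# Sketch of a post-load reconciliation check between source and warehouse.
# Connections, table names and the "amount" column are hypothetical.
import sqlite3  # stand-in for the real source/warehouse database drivers


def fetch_metrics(conn, table):
    """Return (row_count, sum_of_amount) for a table."""
    cur = conn.execute(f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}")
    return cur.fetchone()


def reconcile(source_conn, dwh_conn, source_table="sales", dwh_table="fact_sales"):
    src_count, src_sum = fetch_metrics(source_conn, source_table)
    dwh_count, dwh_sum = fetch_metrics(dwh_conn, dwh_table)
    errors = []
    if src_count != dwh_count:
        errors.append(f"row count mismatch: source={src_count}, warehouse={dwh_count}")
    if abs(src_sum - dwh_sum) > 0.01:  # small tolerance for floating-point rounding
        errors.append(f"checksum mismatch: source={src_sum}, warehouse={dwh_sum}")
    if errors:
        raise ValueError("; ".join(errors))  # fail the load visibly instead of silently
    return True
```

In practice a check like this would run as the final step of every load and feed the team's alerting channel.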

Best practices for implementing an EL pattern, which by itself constitutes a Data Lake

  • The architectural components, their interactions and the selected products should support native data types
  • The design of the Data Lake should be driven by what data is available rather than by what is required; schema and data requirements are not defined until the data is queried (illustrated in the sketch after this list)
  • Data discovery, ingestion, storage, administration, quality, transformation and visualization should be managed independently
  • Fast on-boarding of newly discovered data sources is important
  • Managing these components independently allows the customized management needed to extract maximum value from the data
  • The Data Lake should support existing enterprise data management techniques and methods
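
To illustrate the schema-on-read point from the list above, here is a minimal sketch in which files are landed in the lake untouched (extract and load only) and structure is applied only when the data is queried. The landing-zone path, the JSON-lines format and the field names are hypothetical.

```python
# Sketch of an EL flow: land raw files as-is, apply structure only at query time.
import json
import shutil
from pathlib import Path

RAW_ZONE = Path("datalake/raw/clickstream")  # hypothetical landing zone


def ingest(source_file: str) -> Path:
    """Extract & Load: copy the source file into the lake without transforming it."""
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    target = RAW_ZONE / Path(source_file).name
    shutil.copy2(source_file, target)
    return target


def query_user_events(user_id: str):
    """Schema on read: the JSON structure is interpreted here, not at load time."""
    for path in RAW_ZONE.glob("*.json"):
        with path.open() as fh:
            for line in fh:
                event = json.loads(line)
                if event.get("user_id") == user_id:
                    yield event
```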

Best practices for implementing an ELTL pattern, which corresponds to a Data Lakehouse

  • Curating data by establishing a layered (or multi-hop) architecture is a critical best practice for the Lakehouse, as it allows data teams to structure the data according to quality levels and to define roles and responsibilities per layer (a sketch follows below).
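
As a rough sketch of such a multi-hop architecture, the PySpark snippet below moves order data from a bronze (raw) layer through a silver (cleaned) layer to a gold (aggregated) layer. The paths, column names and the choice of Parquet as storage format are assumptions; many Lakehouse implementations would use an open table format such as Delta Lake or Iceberg instead.

```python
# Sketch of a bronze -> silver -> gold (medallion) pipeline; names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw events exactly as ingested.
bronze = spark.read.json("lake/bronze/orders")

# Silver: cleaned and conformed -- deduplicate, fix types, drop obviously bad records.
silver = (
    bronze.dropDuplicates(["order_id"])
          .withColumn("order_ts", F.to_timestamp("order_ts"))
          .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").parquet("lake/silver/orders")

# Gold: business-level aggregates consumed by BI dashboards and ML features.
gold = (
    silver.groupBy(F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("daily_revenue"))
)
gold.write.mode("overwrite").parquet("lake/gold/daily_revenue")
```

Each layer also marks a natural boundary for roles and responsibilities: ingestion typically owns bronze, data engineering owns silver, and the consuming analytics or business teams own gold.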


Adopting an Organization-wide Data Governance Strategy – this can be broken down into the following three dimensions:

  1. Data quality: The most important prerequisite for correct and meaningful reports, analysis results and models is high-quality data. Quality assurance (QA) needs to exist around every pipeline step. Examples of how to achieve this include establishing data contracts, meeting SLAs, and keeping schemata stable and evolving them in a controlled way (see the data-contract sketch after this list).
  2. Data catalog: Another important aspect is data discovery: users from all business areas, especially in a self-service model, need to be able to discover relevant data easily. A Lakehouse therefore needs a data catalog that covers all business-relevant data.
  3. Access control: As value creation from the data in the Lakehouse happens across all business areas, the Lakehouse needs to be built with security as a first-class citizen. Companies may have a more open data access policy or strictly follow the principle of least privilege; independent of that, data access controls need to be in place in every layer. Lakehouse governance should not only enforce strong access controls but should also track data lineage.
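
One lightweight way to act on the data-quality dimension above is to validate every incoming batch against an explicitly versioned data contract before it is published to consumers. The contract and its fields below are hypothetical; in production this role is often played by schema registries or testing frameworks such as Great Expectations or dbt tests.

```python
# Sketch: enforcing a simple data contract before a batch is published.
# The expected schema is hypothetical and would normally be versioned and reviewed.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "email": str,
    "signup_date": str,
}


def validate_batch(records: list[dict]) -> list[str]:
    """Return a list of contract violations for a batch of records."""
    violations = []
    for i, record in enumerate(records):
        missing = set(EXPECTED_SCHEMA) - set(record)
        if missing:
            violations.append(f"record {i}: missing fields {sorted(missing)}")
        for field, expected_type in EXPECTED_SCHEMA.items():
            if field in record and not isinstance(record[field], expected_type):
                violations.append(f"record {i}: {field} is not {expected_type.__name__}")
    return violations


batch = [{"customer_id": 42, "email": "a@example.com", "signup_date": "2024-01-31"}]
problems = validate_batch(batch)
if problems:
    raise ValueError("data contract violated: " + "; ".join(problems))
```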


  • Build to Scale and Optimize for Performance & Cost – Standard ETL processes, business reports and dashboards often have predictable resource needs in terms of memory and computation. However, new projects, seasonal tasks or modern approaches such as model training (churn, forecasting, maintenance) generate peaks in resource demand. To let the business run all of these workloads, a platform that can scale both memory and computation is necessary. A core optimization decision is how storage relates to compute: since there is no fixed relationship between the volume of the data and the workloads that use it (e.g. using only parts of the data, or doing intensive calculations on small data), it is good practice to settle on an infrastructure platform that decouples storage and compute resources (see the sketch below).
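
To make the decoupling of storage and compute concrete, the sketch below spins up an ephemeral DuckDB engine as the compute layer and queries Parquet files that stay in object storage, so each side can be scaled (and paid for) independently. The bucket, prefix, region and table layout are hypothetical, and DuckDB is only one example of an engine that reads data where it lies.

```python
# Sketch: ephemeral compute querying data that lives in object storage.
# Bucket, prefix and region are hypothetical; credentials setup is omitted.
import duckdb

con = duckdb.connect()            # in-process, throwaway compute -- no data stored here
con.execute("INSTALL httpfs")     # extension for reading from S3-compatible storage
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'eu-north-1'")  # access keys would be configured similarly

daily_revenue = con.execute(
    """
    SELECT CAST(order_ts AS DATE) AS order_date,
           SUM(amount)            AS revenue
    FROM read_parquet('s3://example-lake/gold/orders/*.parquet')
    GROUP BY 1
    ORDER BY 1
    """
).df()

print(daily_revenue.head())
```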


ETL vs. EL vs. ELTL: a quick comparison across key dimensions

Each point below compares ETL (data warehouse), EL (Data Lake) and ELTL (Lakehouse):

  • Purpose of data – ETL: optimal for data analytics and business intelligence (BI) use cases. EL: best suited for machine learning and data science workloads. ELTL: suitable for both data analytics and machine learning workloads.
  • Type of data – ETL: works well with structured data. EL: works well with semi-structured and unstructured data. ELTL: can handle structured, semi-structured and unstructured data.
  • Flexibility – ETL: less flexible, since it can only use structured data. EL: more flexible, as it can also handle streaming data and log analytics. ELTL: most flexible, as it deals with both structured and unstructured data.
  • Users – ETL: business professionals. EL: data scientists and engineers. ELTL: business professionals and data teams.
  • Data quality – ETL: highly curated, reliable data. EL: raw data; low quality and not reliable. ELTL: raw and curated data; high quality with built-in data governance.
  • ACID compliance – ETL: ACID-compliant, guaranteeing the highest levels of integrity. EL: not ACID-compliant; updates and deletes are complex operations. ELTL: ACID-compliant, ensuring consistency while many sources concurrently read and write data.
  • Cost – ETL: storage is costly and time-consuming. EL: storage is cost-effective, fast and flexible. ELTL: storage is cost-effective, fast and flexible.
  • Schema – ETL: schema on write. EL: schema on read. ELTL: schema enforcement.
  • Security (on-premise) – ETL: the most secure option when supported by a rigid data access policy. EL: most secure when combined with a rigid data access policy, as in a data warehouse. ELTL: gives organizations full control to implement the security protocols, compliance measures and audit logging they need.
  • Security (cloud) – ETL: strict user access control, so that warehouse users can only access the data they need to do their jobs. EL: since it sources data from multiple systems that may contain sensitive information, a data lake security setup should cover administration, authorization, authentication and protection of data. ELTL: users access raw data through a governance layer that enforces security and controls access to the data.
  • Scalability (on-premise) – ETL: manual scaling, with purchasing and configuring of hardware. EL: manual scaling, with purchasing and configuring of hardware. ELTL: manual scaling, with purchasing and configuring of hardware.
  • Scalability (cloud) – ETL: auto-scalable. EL: auto-scalable. ELTL: auto-scalable.
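
To illustrate the ACID-compliance and schema rows for the ELTL column, the sketch below performs an atomic upsert with Delta Lake, one common Lakehouse table format: the merge is a single ACID transaction, and writes whose columns do not match the table schema are rejected unless schema evolution is explicitly enabled. The paths, the order_id join key and the assumption of a Spark session already configured for Delta are illustrative.

```python
# Sketch: an ACID upsert into a Lakehouse table (Delta Lake used as one example format).
# Assumes a SparkSession already configured for Delta, e.g. via configure_spark_with_delta_pip.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()               # reuse the configured session
updates = spark.read.parquet("lake/staging/orders")   # hypothetical staging data

target = DeltaTable.forPath(spark, "lake/gold/orders")
(
    target.alias("t")
          .merge(updates.alias("s"), "t.order_id = s.order_id")
          .whenMatchedUpdateAll()       # update rows that already exist
          .whenNotMatchedInsertAll()    # insert rows that are new
          .execute()                    # committed as one atomic transaction
)
```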

References: Guiding Principles to Build an Effective Data Lakehouse (The Databricks Blog); Data Warehouse Tutorial (guru99.com)

Samir Kurry

Data Consultant at Redpill Linpro

Samir eats, sleeps and breathes data and has an intense curiosity, whale amounts of positivity, and limitless amounts of self-belief.
