This article originally appeared on DZone: https://dzone.com/articles/the-state-of-etl-traditional-to-cloud
The old way of doing ETL couldn't keep up, so companies began shifting to the cloud. Does your team use the cloud for ETL processes?
Every team in every department across every organization has a treasure trove of potentially high-value data. But as much as 73% of it goes unused because it has historically been difficult to access. Disparate sources, inconsistent formats, and other obstacles to aggregating the data and extracting value from it led organizations to devise Extract, Transform, Load (ETL) processes so they could gather data from a range of sources, standardize it, and centralize it in a single repository.
Yet the original ETL processes were built for the business needs of a decade ago. How times have changed. Today's businesses have exponentially more data sources to unite. Research shows that modern enterprises can have as many as 400 enterprise applications in their environment, along with social media platforms and mobile technologies that produce massive quantities of data. To incorporate it all, modern data management leaders need new ways to balance growing requests for longer data histories and more granular detail with the imperative of immediate access to that information for strategic business planning.
In the good old days, ETL processes for a select few data sources were reasonably manageable by a small team of data scientists. But as the volume and velocity of data increased, the systems and processes broke down. Traditional on-premises ETL tools came with a litany of shortcomings and challenges.
For starters, many ETL functions have historically been coded manually, a lengthy and often complex process that most companies chalked up to the cost of joining the Big Data revolution. But hand-coded data integration processes are inherently fragile: they make it difficult for one developer to learn another's code, leading many developers to simply rewrite the code from scratch, adding time and expense to the operation.
Worse, these homespun environments thrust the burden and cost of maintaining data onto the company’s engineering team so that any time data needs updating, a team member leaves, or code (or configuration) goes undocumented, the company runs a real risk of losing valuable institutional knowledge.
In terms of daily operations and the impact on business users, on-premises ETL systems have traditionally been slow to deliver the kinds of insights businesses need to make intelligent decisions. These systems are often based on batch processing, compelling teams to run nightly ETL and data consolidation jobs that use idle compute resources during off-hours. And adding capacity to handle increased demand ultimately means greater costs (power consumption, hardware, and staff overhead) and a higher risk of downtime or service interruptions.
Traditional ETL processes extract data in batches, transform it in a staging area, and then load it into a data warehouse or other destination.
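The three steps above can be sketched in a few lines of Python. This is an illustrative toy, not any particular product's API: the source names, field names, and record shapes are all invented for the example.

```python
# A minimal sketch of the batch ETL flow: extract raw records from
# (simulated) sources, standardize them in a staging area, then load the
# result into a central "warehouse". All names here are hypothetical;
# a real pipeline would read from databases, APIs, or files.

def extract(sources):
    """Pull raw records from every source into one staging batch."""
    staging = []
    for source_name, records in sources.items():
        for record in records:
            staging.append(dict(record, _source=source_name))
    return staging

def transform(staging):
    """Standardize inconsistent field names and formats in the staging area."""
    transformed = []
    for record in staging:
        amount = record.get("amount") or record.get("amt")
        transformed.append({
            "source": record["_source"],
            "customer": record.get("customer", record.get("cust", "")).strip().title(),
            "amount_usd": round(float(amount), 2),
        })
    return transformed

def load(warehouse, rows):
    """Append the standardized batch to the central repository."""
    warehouse.extend(rows)
    return warehouse

# Two sources with different schemas, as in the article's scenario.
sources = {
    "crm":     [{"customer": "  alice smith", "amount": "19.5"}],
    "billing": [{"cust": "BOB JONES", "amt": 42}],
}
warehouse = load([], transform(extract(sources)))
```

Note that the transform step runs in its own staging structure before anything reaches the warehouse; that separate staging hop is exactly what some modern cloud approaches eliminate.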
That model doesn’t align with modern business needs. In today’s business environment, data ingestion must work in real-time and give users the self-service capabilities to run queries and see the present picture at any time. And, as companies increasingly move more of their applications and workloads to the cloud (or from one cloud provider to another), they’ll face exponentially more data — in larger data sets, various formats, and from numerous sources and streams.
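The contrast between nightly batches and real-time ingestion can also be sketched briefly. This is a hedged illustration, not a real streaming framework: the event fields and function names are invented, and a production system would consume from a message queue or stream rather than a list.

```python
# Illustrative contrast with the nightly-batch model: each event is
# transformed and loaded the moment it arrives, so queries against the
# warehouse always see the present picture. All names are hypothetical.

warehouse = []

def ingest(event):
    """Transform and load one event immediately instead of queueing it
    for an overnight batch job."""
    row = {"user": event["user"].lower(), "value": float(event["value"])}
    warehouse.append(row)
    return row

# Events trickling in from a stream (simulated here with a list).
stream = [{"user": "Ann", "value": "3"}, {"user": "Bo", "value": "4.5"}]
for event in stream:
    ingest(event)

# A self-service query can run at any time and reflects all events so far.
total = sum(row["value"] for row in warehouse)
```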
Their ETL tools must handle this mountain of data effortlessly. Modern ETL tools should be able to work well on any cloud provider and should be able to migrate easily as companies change providers.
They must be fault-tolerant, secure, scalable, and accurate from end to end, especially when providing crucial information for new machine learning (ML) or artificial intelligence (AI) models. They should enable error message configuration, event rerouting, and programmatic data enrichment on demand. And they should leverage modern object-based storage like Amazon S3 for immediate retrieval or leading cloud data warehouses such as Amazon Redshift, Google BigQuery, and Snowflake to directly transform massive datasets without requiring a dedicated staging area.
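The idea of transforming data directly inside the warehouse, with no dedicated staging area, is often called ELT. A hedged sketch of that pattern, using Python's built-in sqlite3 as a stand-in for a cloud warehouse such as Redshift, BigQuery, or Snowflake (table and column names are invented for the example):

```python
# ELT sketch: land the raw records in the warehouse first, then let the
# warehouse's own SQL engine do the transformation. sqlite3 is only a
# local stand-in for a cloud data warehouse.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")

# Load: raw, untransformed records go straight into the warehouse.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", "10.00"), ("u1", "5.50"), ("u2", "7.25")],
)

# Transform: aggregation and type casting happen in-warehouse, in SQL,
# with no separate staging area or transformation server.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    GROUP BY user_id
""")

totals = dict(conn.execute("SELECT user_id, total FROM user_totals"))
```

The design point is that the transformation scales with the warehouse's compute rather than with a dedicated ETL server, which is what lets cloud warehouses transform massive datasets directly.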
| Traditional ETL | Modern Cloud ETL |
| --- | --- |
| Manual coding and SQL queries. | Cloud-based and fully managed to reduce maintenance and automate updates. |
| Batch processing only. | Supports both batch and real-time data ingestion. |
| On-premises systems needing maintenance and upgrades. | Rapid, API-enabled ingestion from virtually any data source. |
| Slow, inflexible data ingestion from limited sources. | Automated one-click and manual custom mapping. |
| Steep learning curve and long onboarding and training processes. | Easily transforms any type of data into any format for immediate use. |
| Difficulty transforming unstructured or semi-structured data. | Interconnects data for more visibility and greater insight. |
| Streams restart from scratch after an interruption. | Real-time data stream visualization and automated stream queueing after an interruption. |
| Resource-intensive, taking valuable compute from other systems or applications. | Lightweight and resource-efficient. |
To remain competitive, businesses need to adapt to an ever-changing landscape. In some cases, collecting and analyzing data in overnight or even weekly batches may suffice. But in the digital age, when consumer preferences change virtually overnight and companies scramble to be first to market with new products and services, businesses need deep, actionable insight now, not next week.
Written by Garret Alley for DZone.