Amazon Redshift vs Traditional Data Warehouses

Over the past 12 years, Amazon’s cloud ecosystem has experienced astounding growth. It’s estimated that by 2020, Amazon Web Services (AWS) will register revenues of $44 billion, twice the combined revenue of its two key cloud competitors: Google Cloud and Microsoft Azure. AWS Redshift is a major driver of that growth.

One of Amazon’s flagship products, AWS Redshift is a cloud-based data warehouse service that has proven to be an industry game changer. Thanks to its fast performance and near-unlimited scalability, the number of enterprise customers using Redshift is growing by the day. In this article, we explore the world of Redshift, its powerful features, and why so many companies are choosing it for data storage and analytics.

What is AWS Redshift?

AWS Redshift is a cloud-based, petabyte-scale data warehouse service offered as part of Amazon’s ecosystem of data solutions. Based on PostgreSQL, the platform integrates with most third-party applications through standard ODBC and JDBC drivers.

Redshift delivers incredibly fast performance using two key architectural elements: columnar data storage and a massively parallel processing design. In 2012, Amazon invested in the data warehouse vendor ParAccel (since acquired by Actian) and leveraged its parallel processing technology in Redshift. The solution has quickly become an integral part of the big data analytics landscape through its ability to perform SQL-based queries on large databases containing a mix of structured, semi-structured, and unstructured data.

Over the past five years, Redshift has emerged as one of the leading cloud solutions through its unparalleled ability to provide organizations with business intelligence.

How Redshift scores over traditional data warehouses

Traditionally, enterprises have encountered several challenges when setting up data warehouses. First, on-premises warehouses are expensive and take months to get running, which demands firm budgetary and strategic commitment from leadership. Second, after a few months or years, data volumes invariably grow, forcing companies to choose between investing in new hardware and tolerating slow performance.

Redshift’s cloud-based solution helps enterprises overcome these issues. It takes just minutes to create a cluster from the AWS console. Data is ingested into Redshift by issuing a simple COPY command that loads it from Amazon S3 (Simple Storage Service) or DynamoDB. Additionally, Redshift’s scalable architecture allows companies to dynamically request that infrastructure be scaled up or down as requirements change.
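For illustration, a minimal COPY statement loading CSV files from S3 might look like the following sketch (the table name, bucket path, and IAM role ARN are hypothetical placeholders):

COPY employees
FROM 's3://example-bucket/employees/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole' -- hypothetical role ARN
FORMAT AS CSV;

Because COPY reads multiple files from S3 in parallel, splitting a large dataset into several files speeds up ingestion considerably.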

As the server clusters are fully managed by AWS, Redshift eliminates the hassle of routine database administration tasks. Complex tasks such as data encryption are managed easily through Redshift’s built-in security features. The platform also performs continuous backups of data, eliminating both the risk of data loss and the need to plan for backup hardware.

Given that Redshift is a cost-effective, reliable, scalable, and fast performing solution, companies are naturally gravitating towards the option of data-warehouse-as-a-service (DWaaS).


AWS Redshift — under the hood

The key reason for Redshift’s emergence as one of the most popular cloud data warehousing solutions is its underlying architecture. Moreover, AWS focuses on continuous innovation of the platform by adding new features and offering product extensions. Let’s look at several Redshift design features that have changed the way we gain business insights.

Columnar Data Storage

Traditional relational databases use row-based storage. This is ideal for use cases that involve querying and updating specific rows, such as in CRM and ERP applications. Consider the sample employee data in Table 1:

Table 1: Sample employee data

Employee ID   Name    Gender   Age   Salary
201           John    M        35    500000
2503          Matt    M        26    98000
350           Sarah   F        34    350000

A row-based database would store the data in Table 1 as:

Row Id 001: 201,John,M,35,500000
Row Id 002: 2503,Matt,M,26,98000
Row Id 003: 350,Sarah,F,34,350000

In this approach, tables are normalized, and indexes are created to speed up querying large sets of records. However, since indexes take up bandwidth and resources of their own and may contribute to slowing down database performance, database architects need to carefully evaluate which columns are likely to be queried most often and create indexes accordingly.

For big data OLAP operations, this approach imposes constraints. It’s hard to predict which columns will be queried most, as the objective of analytics is to slice and dice the data freely and arrive at interesting insights. Hence the need for column-based storage.

A column-based database such as Redshift would store the data in Table 1 as:

201:001, 2503:002, 350:003;
John:001, Matt:002, Sarah:003;
M:001, M:002, F:003;
35:001, 26:002, 34:003;
500000:001, 98000:002, 350000:003;

This ends up grouping similar data types, allowing for better compression. Data compression reduces storage requirements and I/O activities. Since less memory is utilized for loading compressed data, there is free memory available for analysis, improving query performance.
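Redshift lets you choose a compression encoding per column, or it can recommend encodings for you. A minimal sketch using the employee table from Table 1 (the encodings shown are illustrative choices, not prescriptions):

CREATE TABLE employees (
    employee_id INTEGER,
    name        VARCHAR(50) ENCODE zstd,
    gender      CHAR(1)     ENCODE bytedict, -- low-cardinality values compress very well
    age         SMALLINT    ENCODE az64,
    salary      INTEGER     ENCODE az64
);

-- Ask Redshift to recommend encodings based on a sample of the table's data
ANALYZE COMPRESSION employees;

Note that ANALYZE COMPRESSION needs a table that already contains data; COPY can also apply suitable encodings automatically during an initial load into an empty table.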

Hence, when you want to run a query in Redshift, say, to find out the average salary of employees, analyze whether there are gender biases in employee salaries, or understand the correlation between salary and age, the platform can run through millions of records much faster than a row-oriented database.
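For example, the salary analysis described above reduces to a simple aggregate query, and with columnar storage Redshift scans only the gender, age, and salary columns rather than every full row:

SELECT gender,
       AVG(salary) AS avg_salary,
       AVG(age)    AS avg_age,
       COUNT(*)    AS employee_count
FROM employees
GROUP BY gender;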

Massively Parallel Processing (MPP)

MPP is a distributed design approach that enables faster processing of large datasets; Redshift’s implementation builds on the ParAccel technology mentioned earlier. The dataset is split into many parts and processed in parallel to return results more quickly.

Redshift uses the MPP design in its clusters. Although you can technically deploy a single-node instance, the advantage of Redshift lies in multi-node clusters that process data in parallel. A cluster consists of several compute nodes and a leader node that manages them.

The leader node is responsible for splitting up the work, assigning chunks of data to different compute nodes, and then consolidating the results. Each compute node is further divided into slices based on its number of cores, and each slice is allocated its own portion of the node’s memory and disk space to process its chunk of data.
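You can influence how rows are distributed across nodes and slices by declaring a distribution style and key when creating a table. A sketch reusing the hypothetical employees table (the right keys depend entirely on your query patterns):

CREATE TABLE employees (
    employee_id INTEGER,
    name        VARCHAR(50),
    gender      CHAR(1),
    age         SMALLINT,
    salary      INTEGER
)
DISTSTYLE KEY
DISTKEY (employee_id) -- rows are hashed on this column and spread across slices
SORTKEY (salary);     -- orders data on disk to speed up range-restricted scans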

This type of architecture is called ‘shared-nothing’ because the nodes share no memory or storage and operate completely independently of each other. Because each node contains only the data it is responsible for, an MPP system is much easier to maintain and deploy.

Compiled code

Redshift utilizes an interesting strategy to expedite query execution. After compiling a query once, the platform distributes the compiled code across the cluster. When the leader node distributes work across the compute nodes, it also sends the compiled code, removing additional processing overhead.

The compiled code is also cached and shared across sessions on a cluster, so when the same query is executed at a later point, it runs much faster.
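You can observe this caching through the SVL_COMPILE system view, where a compile value of 0 means a query segment reused previously compiled code rather than being recompiled (the query ID below is a hypothetical placeholder):

SELECT query, segment, compile
FROM svl_compile
WHERE query = 12345 -- hypothetical query ID, e.g. taken from STL_QUERY
ORDER BY segment;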

Concurrency scaling

AWS Redshift recently launched concurrency scaling, a feature built to tackle the challenge of uneven cluster use. Many organizations have use cases where data analysts run heavy workloads during a certain time window, but at other times the clusters sit underutilized.

To handle these spurts of activity and save costs during low-usage hours, Redshift provides the flexibility to scale your clusters up or down as needed. The elastic resize feature lets you add or remove nodes in a cluster within minutes, either directly from the AWS console or by calling an API.

The difference between elastic resize and the classic Redshift resize feature is that while classic resize creates a new cluster, elastic resize adds or removes nodes on an existing cluster with minimal disruption.

Security

Redshift offers most of the security features found in other AWS cloud services. Basic access privileges are controlled through an AWS account and granted for specific Redshift resources through Identity and Access Management (IAM) users and roles. In addition, cluster access is managed through the creation of dedicated cluster security groups and, if your use case permits a private cloud, through a Virtual Private Cloud (VPC) environment.

Data encryption can also be enabled at the time of cluster creation. However, this property is immutable: a cluster cannot be switched directly between encrypted and unencrypted states. For data in transit, SSL encryption is supported.

Integration with machine learning

By using Redshift, users can leverage the entire AWS cloud ecosystem. One example of this is Redshift’s capability to integrate with the AWS Machine Learning (ML) service. Although Redshift enables users to perform ETL operations at an incredible speed, data scientists still need to write their own algorithms to perform analysis.

AWS ML expands upon this offering by providing a service that can crunch data through internal algorithms and then arrive at applicable predictions. Ultimately, AWS ML’s integration with Redshift as a data source elevates the potential of this DWaaS.

Why choose Redshift?

The unprecedented explosion of data volumes in recent times has created an environment where companies have to formulate a thorough data strategy. This strategy and the resulting solution have to account for data volumes, costs, scalability, and simplicity of use.

Redshift ticks all these boxes.

Until recently, most companies only had to work with internal data sources. In some cases, data from external sources was fed in through files or middleware technologies like publish-subscribe. Today, it’s not as useful to look at isolated data inside an organization without correlating it with other cloud sources.

Because Redshift is an off-the-shelf solution, ingesting data from multiple sources into it is a quick way for analysts to gain a deeper understanding of their data. For example, many companies use Salesforce to store their CRM information. Integrating Salesforce data with Redshift helps derive customer behavior patterns, inform important decisions on lead generation, and run targeted marketing campaigns.
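As a sketch, once Salesforce objects have been ingested into Redshift tables (the table and column names below are hypothetical), those questions become straightforward SQL:

SELECT a.industry,
       COUNT(o.opportunity_id) AS won_opportunities,
       AVG(o.amount)           AS avg_deal_size
FROM sf_opportunities o
JOIN sf_accounts a ON o.account_id = a.account_id
WHERE o.stage = 'Closed Won'
GROUP BY a.industry
ORDER BY avg_deal_size DESC;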

Similarly, data from other internal business applications and even log files can be transformed and fed into Redshift tables, providing businesses with a deeper insight into available data.

Data integration with AWS Redshift

Most enterprises tend to use a combination of on-premise and cloud data management tools. This creates the challenge of consolidating and integrating disparate types of data into a unified platform such as AWS Redshift, in order to mine data for actionable insights.

Talend Cloud Integration Platform automates data preparation processes and simplifies ETL in order to reduce expenses and speed up time-to-insights. Talend Cloud Integration includes over 900 connectors to help you quickly and easily move data between virtually any application or source, including Redshift. Talend Cloud is compatible with on-premises, cloud, or hybrid data management solutions.

Start your free trial of Talend Cloud to see how Redshift and other data management tools can transform your approach to data and deliver powerful business intelligence.