
Author: Mia Hatton

Amazon Redshift is a cloud-based data warehousing solution that allows you to store, transform and query vast volumes of data at high speed.

Mia Hatton

Budding data scientist with an entrepreneurial and science communication background.


Amazon Redshift is an Amazon Web Services (AWS) product. Read more about AWS here.

Definition of Amazon Redshift

From Amazon Web Services:

Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against petabytes of structured data using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. Most results come back in seconds. With Redshift, you can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional solutions. Amazon Redshift also includes Amazon Redshift Spectrum, allowing you to directly run SQL queries against exabytes of unstructured data in Amazon S3 data lakes. No loading or transformation is required, and you can use open data formats, including Avro, CSV, Ion, JSON, ORC, Parquet, and more. Redshift Spectrum automatically scales query compute capacity based on the data being retrieved, so queries against Amazon S3 run fast, regardless of data set size.


Does your organisation need Amazon Redshift?

Amazon Redshift is a cloud-based data warehousing solution that allows you to store, transform and query vast volumes of data at high speed. The data warehouse is structured as a cluster of nodes. Each node can store and query data, and a single leader node manages communication between the nodes and with client applications. The cluster performs Massively Parallel Processing (MPP), wherein each node carries out a small part of each processing task in parallel with the others, which is what allows high-speed processing across large datasets. This makes Amazon Redshift well suited to business intelligence operations, especially when you want to obtain real-time insights from streaming data such as app performance and manufacturing information. Amazon Redshift allows you to scale your storage and compute power to meet your needs and budget.
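
The division of labour in MPP can be illustrated with a toy sketch in plain Python. This is not Redshift code: the names `node_partial_sum` and `leader_sum` are hypothetical, the data is made up, and threads merely stand in for compute nodes. Each "node" aggregates its own slice of the rows, and the "leader" combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def node_partial_sum(rows):
    """Work done by one hypothetical compute node: aggregate its own slice."""
    return sum(rows)


def leader_sum(rows, node_count=4):
    """Hypothetical leader: split the rows into slices, fan out, combine."""
    # Round-robin distribution: row i goes to node i % node_count.
    slices = [rows[i::node_count] for i in range(node_count)]
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_partial_sum, slices))
    # The leader merges the per-node partial results into the final answer.
    return sum(partials)


sales = list(range(1, 1001))  # made-up data: 1, 2, ..., 1000
total = leader_sum(sales)
print(total)
```

The same pattern (distribute, aggregate locally, merge centrally) is what lets an aggregate query finish faster as nodes are added, provided the data is spread evenly across them.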

You may need Amazon Redshift if:

  • You want to gain insight from large volumes of data that are currently stored in a number of separate locations.
  • The data you use and process varies in quantity over time and you need a flexible storage solution.
  • Your data insights come from disparate sources so that gaining insight is time-consuming.
  • Your data workloads are difficult to manage.
  • You want to scale your data infrastructure to support real-time streaming and analysis.

Benefits

Benefits of Amazon Redshift include:

  • It can automate scaling of your storage and computing power to suit your needs, with each cluster supporting up to 8PB of storage.
  • It allows you to easily query and write data to your data lake solution, giving you the flexibility to work with highly structured and unstructured data.
  • It integrates with a suite of AWS analytics solutions.
  • It is fast and flexible.
  • It scales resources in real-time to manage performance as you run queries.
  • AWS provides one hour of free Concurrency Scaling credits per day, allowing resources to scale whilst keeping pricing predictable.
  • Automated provisioning and back-ups allow you to focus on your analytics, rather than your data warehouse management.

Technical considerations

Prerequisites and Integrations

To get started with Amazon Redshift, you need an AWS account. Read more about AWS here.

You can set up Amazon Redshift in a matter of minutes by following Amazon’s comprehensive Getting Started Guide. You can also migrate to Amazon Redshift from Oracle with minimal downtime.

Setting up Amazon Redshift involves creating and configuring your cluster, and setting up security and permissions. Once you have a cluster you can load your data and start analysing it. You will need some familiarity with web technologies and SQL to complete these steps.

You can load data into Amazon Redshift from a range of data sources including Amazon S3, Amazon DynamoDB, Amazon EMR, AWS Glue, AWS Data Pipeline or any SSH-enabled host on Amazon EC2 or on-premises.
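
Loading from Amazon S3 is typically done with Redshift's SQL `COPY` command. The sketch below only builds the statement text and does not connect to anything. The table name, bucket path and IAM role ARN are hypothetical placeholders, and `build_copy_statement` is an illustrative helper, not part of any AWS library.

```python
def build_copy_statement(table, s3_path, iam_role_arn, data_format="PARQUET"):
    """Return the text of a Redshift COPY statement that loads from Amazon S3.

    All arguments are illustrative placeholders; in real code, validate them
    rather than interpolating untrusted input into SQL.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {data_format};"
    )


statement = build_copy_statement(
    table="sales",
    s3_path="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

You would run the resulting statement from any SQL client connected to the cluster.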

When you want to use your data warehouse for business intelligence you can access the data in Amazon Redshift using standard JDBC and ODBC drivers. There are a variety of business intelligence tools that offer connectors to Amazon Redshift, including Power BI, Tableau Server and Mode Analytics. You can see a list of Amazon Redshift Partners who offer BI technology that integrates with Amazon Redshift here.
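
When configuring a BI tool over JDBC, you will usually be asked for a connection URL. As a minimal sketch, the helper below assembles one in the `jdbc:redshift://endpoint:port/database` form. The endpoint and database name are hypothetical, `redshift_jdbc_url` is an illustrative helper, and 5439 is assumed as the port because it is Redshift's usual default. Your cluster may be configured differently.

```python
def redshift_jdbc_url(endpoint, database, port=5439):
    """Build a JDBC URL of the form jdbc:redshift://endpoint:port/database."""
    return f"jdbc:redshift://{endpoint}:{port}/{database}"


url = redshift_jdbc_url(
    endpoint="examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com",
    database="dev",
)
print(url)
```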

Security and Compliance

From Amazon:

Amazon Redshift encrypts and keeps your data secure in transit and at rest using industry-standard encryption techniques. To keep data secure in transit, Amazon Redshift supports SSL-enabled connections between your client application and your Redshift data warehouse cluster. To keep your data secure at rest, Amazon Redshift encrypts each block using hardware-accelerated AES-256 as it is written to disk. This takes place at a low level in the I/O subsystem, which encrypts everything written to disk, including intermediate query results. The blocks are backed up as is, which means that backups are encrypted as well. By default, Amazon Redshift takes care of key management but you can choose to manage your keys using your own hardware security modules (HSMs) or manage your keys through AWS Key Management Service.

There is no direct access to your compute nodes in Amazon Redshift except through the data warehouse cluster’s leader node, which means that your data is equally secure regardless of how much you choose to store.

You can read more about AWS security here.

Amazon Redshift has been assessed by third-party auditors to ensure its security and compliance against a range of international standards. AWS provides several resources and services to help you ensure that your configuration is compliant with industry standards.

Pricing

Amazon Redshift is a pay-as-you-go service so you only pay for what you use, and there are no up-front set-up fees. Read more about AWS and its pricing structure here.

With Pay-As-You-Go (PAYG) pricing for Amazon Redshift, your monthly bill is calculated from an hourly rate based on the type and number of nodes in your cluster, so you pay only for the capacity you provision. Amazon provides backup storage equal in size to your provisioned storage for free, and charges standard Amazon S3 rates for additional backup storage. There are no additional charges for data transfer to Amazon Redshift within the same AWS region, but additional charges are incurred for data transfer from other sources.
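
As a rough illustration of how node-based billing adds up, the sketch below multiplies node count, hourly rate and hours running. The function name is hypothetical, 730 is an assumed average number of hours in a month, and the $0.25 rate is simply the entry price quoted earlier in this article. Backup storage and data transfer charges are left out.

```python
def monthly_on_demand_cost(node_count, hourly_rate_per_node, hours=730):
    """Estimate a month's on-demand bill: nodes x hourly rate x hours running.

    730 is an assumed average number of hours in a month.
    """
    return node_count * hourly_rate_per_node * hours


# One node at the $0.25/hour entry price quoted earlier in this article.
single_node = monthly_on_demand_cost(node_count=1, hourly_rate_per_node=0.25)
# Four such nodes running for the same period.
four_nodes = monthly_on_demand_cost(node_count=4, hourly_rate_per_node=0.25)
print(f"1 node: ${single_node:.2f}, 4 nodes: ${four_nodes:.2f}")
```

Actual rates vary by node type and region, so treat the numbers as placeholders and use the AWS pricing calculator for a real estimate.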

If you use Amazon Redshift Spectrum to query your data, additional charges will be incurred based on the amount of Amazon S3 data scanned to execute your query. You can keep these costs to a minimum by compressing your data (using one of Redshift’s supported formats), and by storing your data in a columnar format such as Apache Parquet or Apache ORC. Amazon provides documentation for converting your data to one of these formats if necessary.
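
To see why scanned bytes matter, here is a small sketch of a per-query estimate. Everything numeric in it is an assumption for illustration: the `price_per_tb` default, the 4 TB dataset, and the tenfold reduction attributed to columnar storage and compression. The function name `spectrum_scan_cost` is hypothetical. Check current AWS pricing and measure your own data before relying on any figure.

```python
def spectrum_scan_cost(bytes_scanned, price_per_tb=5.00):
    """Estimate the Redshift Spectrum charge for one query.

    price_per_tb is an assumed illustrative rate; check current AWS pricing.
    """
    terabytes = bytes_scanned / 1024**4
    return terabytes * price_per_tb


raw_csv_bytes = 4 * 1024**4  # hypothetical 4 TB of uncompressed CSV
# Assumption for illustration: columnar storage plus compression means the
# query reads only a tenth of the bytes it would read from the raw CSV.
parquet_bytes = raw_csv_bytes // 10

print(f"CSV scan:     ${spectrum_scan_cost(raw_csv_bytes):.2f}")
print(f"Parquet scan: ${spectrum_scan_cost(parquet_bytes):.2f}")
```

Because the charge is proportional to the bytes read, anything that lets a query skip columns or read compressed blocks lowers the bill in the same proportion.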

Reserved pricing is available for Amazon Redshift, and further discounts are available if you pay for your reserved instances upfront. You can read more about the savings opportunities of reserved pricing here.

You can calculate the monthly cost of using Amazon Redshift here.

Alternatives to Amazon Redshift

Amazon Redshift performs best when it utilises Massively Parallel Processing (MPP) to load and analyse data, but MPP is only supported when your data is in Amazon S3, Amazon DynamoDB or Amazon EMR, unless you adopt an ETL (extract, transform, load) solution. Furthermore, Amazon Redshift’s high speed can run up high costs, so if speed is not your highest priority you might find that a different solution is friendlier to your budget.

Alternatives to Amazon Redshift include: