
This is the third article in the ‘Data Lake Querying in AWS’ blog series, in which we introduce different technologies to query data lakes in AWS, i.e. in S3. In the first article of the series, we discussed how to optimise data lakes by using proper file formats (Apache Parquet) and other optimisation mechanisms (partitioning). We also introduced the concept of the data lakehouse, and gave an example of how to convert raw data (most data landing in data lakes is in a raw format such as CSV) into partitioned Parquet files with Athena and Glue in AWS; a sketch of that kind of conversion is shown below.
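As a reminder of what such a conversion looks like, here is a minimal Athena CTAS sketch; the table names, the S3 location and the choice of l_shipdate as partition column are illustrative assumptions, not the exact statements from the first article:

```sql
-- Hypothetical Athena CTAS: rewrite a raw CSV table as partitioned Parquet.
-- tpch.lineitem_csv, the S3 path and the column list are placeholders.
CREATE TABLE tpch.lineitem_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/tpch/lineitem_parquet/',
    partitioned_by = ARRAY['l_shipdate']
) AS
SELECT
    l_orderkey,
    l_quantity,
    l_extendedprice,
    l_shipdate        -- partition columns must come last in the SELECT
FROM tpch.lineitem_csv;
```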

In that example, we used a dataset from the popular TPC-H benchmark, and generated three versions of the TPC-H dataset:

- Raw (CSV): 100 GB – the largest tables are lineitem with 76 GB and orders with 16 GB, split into 80 files.
- Parquets without partitions: 31.5 GB – the largest tables are lineitem with 21 GB and orders with 4.5 GB, also split into 80 files.
- Partitioned Parquets: 32.5 GB – the largest tables, which are partitioned, are lineitem with 21.5 GB and orders with 5 GB, with one partition per day; each partition has one file and there are around 2,000 partitions per table. The rest of the tables are left unpartitioned.

In our second article, we introduced Athena and its serverless querying capabilities. As we commented, Athena is great for relatively simple ad hoc queries in S3 data lakes, even when data is large, but there are situations (complex queries, heavy usage of reporting tools, concurrency) in which it is important to consider alternative approaches, such as data warehousing technologies.

In this article, we are going to learn about Amazon Redshift, an AWS data warehouse that, in some situations, might be better suited to your analytical workloads than Athena. We are not going to make a thorough comparison between Athena and Redshift, but if you are interested in the comparison of these two technologies and what situations are more suited to one or the other, you can find interesting articles online such as this one. Actually, the combination of S3, Athena and Redshift is what AWS proposes as a data lakehouse.

We will go through the basic concepts of Redshift and also discuss some technical aspects thanks to which the data stored in Redshift can be optimised for querying. We will also examine a feature in Redshift called Spectrum, which allows querying data in S3; then we will walk through a hands-on example to see how Redshift is used. In this example we will use the same dataset and queries as in our previous blogs, i.e. from TPC-H, and we will see how Spectrum performs compared to using data stored in Redshift. After the hands-on example, we will discuss how we can connect Tableau with Redshift, and finally, we will conclude with an overview of everything Redshift has to offer.
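To give an idea of what Spectrum looks like before the hands-on section, here is a minimal sketch, assuming a Glue Data Catalog database named tpch and a placeholder IAM role ARN (both are illustrative, not taken from the example that follows):

```sql
-- Hypothetical Spectrum setup: expose an S3-backed Glue database as an
-- external schema; the database name and the IAM role are placeholders.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'tpch'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

-- Query the S3 data directly, without loading it into the cluster.
SELECT l_returnflag, SUM(l_extendedprice) AS revenue
FROM spectrum.lineitem
GROUP BY l_returnflag;
```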
Amazon Redshift is a fully-managed data warehouse service in the AWS cloud which scales to petabytes of data. It is designed for On-Line Analytical Processing (OLAP) and BI; its performance drops when used for On-Line Transaction Processing (OLTP).

The infrastructure of Redshift is cluster-based, which means that there is an infrastructure (compute and storage) that you pay for (if you pause a Redshift cluster you won’t incur compute charges, but you will still incur storage charges). For optimal performance, your data needs to be loaded into a Redshift cluster and sorted/distributed in a certain way to lower the execution time of your queries.
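As a preview of what “sorted/distributed in a certain way” means in practice, here is a minimal sketch; the abridged lineitem columns, the key choices and the S3 path and IAM role are illustrative assumptions, not the exact DDL used later in the article:

```sql
-- Hypothetical Redshift DDL: distribute lineitem by its join key and
-- sort it by the column most queries filter on (column list abridged).
CREATE TABLE lineitem (
    l_orderkey      BIGINT,
    l_quantity      DECIMAL(12,2),
    l_extendedprice DECIMAL(12,2),
    l_returnflag    CHAR(1),
    l_shipdate      DATE
)
DISTSTYLE KEY
DISTKEY (l_orderkey)
SORTKEY (l_shipdate);

-- Load the CSV files from S3 into the cluster (path and role are placeholders).
COPY lineitem
FROM 's3://my-bucket/tpch/lineitem/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;
```

Co-locating rows that join on l_orderkey on the same node reduces data shuffling, while sorting on l_shipdate lets Redshift skip blocks when queries filter on date ranges, which is a common pattern in the TPC-H queries.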
