Amazon Redshift Step-by-Step Process
1. Download SQL Workbench.
2. Download the Redshift JDBC driver and register it in SQL Workbench.
3. Connect SQL Workbench to your Redshift cluster.
4. Prepare a CSV data file and upload it to Amazon S3.
5. Load the data from S3 into Redshift using the COPY command.
Architecture
COPY Command
copy <table_name> from 's3://xxxxxxx/xxxx/xxx.xxx'
credentials
'aws_access_key_id=xxxxxxxxxxxxxxxxxxxxxxxxx;
aws_secret_access_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx;
token=<temporary-session-token>';
----------------------------------------------------------------
copy part from 's3:///load/part-csv.tbl'
credentials
'aws_access_key_id=;
aws_secret_access_key='
csv;
--------------------------------
copy part from 's3:///load/part-csv.tbl'
credentials
'aws_access_key_id=;
aws_secret_access_key='
csv
null as '\000';
---------------------------------------Unload command------------------------
UNLOAD ('select * from customers')
TO 's3://flydata-test-unload/unload-folder/customer_' credentials
'aws_access_key_id=your_access_key;
aws_secret_access_key=your_secret_key';
unload ('select * from venue')
to 's3://mybucket/tickit/venue_'
access_key_id ''
secret_access_key ''
session_token '';
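The examples above use access keys; a common alternative is to authorize UNLOAD (and COPY) through an IAM role attached to the cluster. The following is only a sketch of that variant (the bucket path and role ARN are placeholders, not values from this guide) and also shows a few frequently used options:
unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'       -- write pipe-delimited output files
gzip                -- compress each output file
allowoverwrite;     -- overwrite existing files at the target prefix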
FAQs
Q:
What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.
Traditional
data warehouses require significant time and resources to administer, especially
for large datasets. In addition, the financial cost associated with building,
maintaining, and growing self-managed, on-premises data warehouses is very high.
Amazon Redshift not only significantly lowers the cost of a data warehouse, but
also makes it easy to analyze large amounts of data very quickly.
Amazon
Redshift gives you fast querying capabilities over structured data using
familiar SQL-based clients and business intelligence (BI) tools using standard
ODBC and JDBC connections. Queries are distributed and parallelized across
multiple physical resources. You can easily scale an Amazon Redshift data
warehouse up or down with a few clicks in the AWS Management Console or with a
single API call. Amazon Redshift automatically patches and backs up your data
warehouse, storing the backups for a user-defined retention period. Amazon
Redshift uses replication and continuous backups to enhance availability and
improve data durability and can automatically recover from component and node
failures. In addition, Amazon Redshift supports Amazon Virtual Private Cloud
(Amazon VPC), SSL, AES-256 encryption and Hardware Security Modules (HSMs) to
protect your data in transit and at rest.
As with
all Amazon Web Services, there are no up-front investments required, and you
pay only for the resources you use. Amazon Redshift lets you pay as you go. You
can even try Amazon Redshift for free.
Q:
What does Amazon Redshift manage on my behalf?
Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups and patching. Amazon Redshift automatically monitors your nodes and drives to help you recover from failures.
Q:
How does the performance of Amazon Redshift compare to most traditional
databases for data warehousing and analytics?
Amazon Redshift uses a variety of innovations to achieve up to ten times higher performance than traditional databases for data warehousing and analytics workloads:
·
Columnar Data Storage: Instead
of storing data as a series of rows, Amazon Redshift organizes the data by
column. Unlike row-based systems, which are ideal for transaction processing,
column-based systems are ideal for data warehousing and analytics, where
queries often involve aggregates performed over large data sets. Since only the
columns involved in the queries are processed and columnar data is stored
sequentially on the storage media, column-based systems require far fewer I/Os,
greatly improving query performance.
·
Advanced Compression: Columnar
data stores can be compressed much more than row-based data stores because
similar data is stored sequentially on disk. Amazon Redshift employs multiple
compression techniques and can often achieve significant compression relative
to traditional relational data stores. In addition, Amazon Redshift doesn't
require indexes or materialized views and so uses less space than traditional
relational database systems. When loading data into an empty table, Amazon
Redshift automatically samples your data and selects the most appropriate
compression scheme.
·
Massively Parallel Processing (MPP): Amazon
Redshift automatically distributes data and query load across all nodes. Amazon
Redshift makes it easy to add nodes to your data warehouse and enables you to
maintain fast query performance as your data warehouse grows.
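To show how these points surface in the SQL you write, the following is a minimal sketch (the table and column names are hypothetical, not part of this guide) of a table definition that declares a distribution key and a sort key, plus the built-in command that reports recommended column encodings after data has been loaded:
create table sales_demo(
eventid integer not null distkey,    -- rows are distributed across compute nodes by eventid
saletime timestamp not null sortkey, -- data is stored sorted by saletime, so range scans skip blocks
qtysold smallint encode az64,
pricepaid decimal(8,2) encode az64);
analyze compression sales_demo;      -- reports suggested encodings based on a sample of the loaded data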
Q:
How do I get started with Amazon Redshift?
You can sign up and get started within minutes from the Amazon Redshift detail page or via the AWS Management Console. If you don't already have an AWS account, you'll be prompted to create one. Visit our Getting Started Page to see how to try Amazon Redshift for free.
Q:
How do I create an Amazon Redshift data warehouse cluster?
You can easily create an Amazon Redshift data warehouse cluster by using the AWS Management Console or the Amazon Redshift APIs. You can start with a single node, 160GB data warehouse and scale all the way to a petabyte or more with a few clicks in the AWS Console or a single API call.
The
single node configuration enables you to get started with Amazon Redshift
quickly and cost-effectively and scale up to a multi-node configuration as your
needs grow. The multi-node configuration requires a leader node that manages
client connections and receives queries, and two compute nodes that store data
and perform queries and computations. The leader node is provisioned for you
automatically and you are not charged for it.
Simply
specify your preferred Availability Zone (optional), the number of nodes, node
types, a master name and password, security groups, your preferences for backup
retention, and other system settings. Once you've chosen your desired
configuration, Amazon Redshift will provision the required resources and set up
your data warehouse cluster.
Q:
What does a leader node do? What does a compute node do?
A leader node receives queries from client applications, parses the queries and develops execution plans, which are an ordered set of steps to process these queries. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes and finally returns the results back to the client applications.
Compute
nodes execute the steps specified in the execution plans and transmit data
among themselves to serve these queries. The intermediate results are sent back
to the leader node for aggregation before being sent back to the client
applications.
Q:
What is the maximum storage capacity per compute node? What is the
recommended amount of data per compute node for optimal performance?
You can create a cluster using either Dense Storage (DS) nodes or Dense Compute nodes (DC). Dense Storage nodes allow you to create very large data warehouses using hard disk drives (HDDs) for a very low price point. Dense Compute nodes allow you to create very high performance data warehouses using fast CPUs, large amounts of RAM and solid-state disks (SSDs).
Dense
Storage (DS) nodes are available in two sizes, Extra Large and Eight Extra
Large. The Extra Large (XL) has 3 HDDs with a total of 2TB of magnetic storage,
whereas Eight Extra Large (8XL) has 24 HDDs with a total of 16TB of magnetic
storage. DS2.8XL has 36 Intel Xeon E5-2676 v3 (Haswell) virtual cores and
244GiB of RAM, and DS2.XL has 4 Intel Xeon E5-2676 v3 (Haswell) virtual cores
and 31GiB of RAM. Please see our pricing page for more detail. You can get
started with a single Extra Large node, 2TB data warehouse for $0.85 per hour
and scale up to a petabyte or more. You can pay by the hour or use reserved
instance pricing to lower your price to under $1,000 per TB per year.
Dense
Compute (DC) nodes are also available in two sizes. The Large has 160GB of SSD
storage, 2 Intel Xeon E5-2670v2 (Ivy Bridge) virtual cores and 15GiB of RAM.
The Eight Extra Large is sixteen times bigger with 2.56TB of SSD storage, 32
Intel Xeon E5-2670v2 virtual cores and 244GiB of RAM. You can get started with
a single Large node for $0.25 per hour and scale all the way up to 128 8XL
nodes with 326TB of SSD storage, 3,200 virtual cores and 24TiB of RAM.
Amazon
Redshift's MPP architecture means you can increase your performance by
increasing the number of nodes in your data warehouse cluster. The optimal
amount of data per compute node depends on your application characteristics and
your query performance needs.
Q:
How many nodes can I specify per Amazon Redshift data warehouse cluster?
An Amazon Redshift data warehouse cluster can contain from 1-128 compute nodes, depending on the node type. For details please see our documentation.
Q:
How do I access my running data warehouse cluster?
Once your data warehouse cluster is available, you can retrieve its endpoint and JDBC and ODBC connection string from the AWS Management Console or by using the Redshift APIs. You can then use this connection string with your favorite database tool, programming language, or Business Intelligence (BI) tool. You will need to authorize network requests to your running data warehouse cluster. For a detailed explanation please refer to our Getting Started Guide.
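After connecting, a quick way to confirm that the session works is to run a simple query from your SQL client, for example:
select current_database(), current_user, version();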
Q:
When would I use Amazon Redshift vs. Amazon RDS?
Both Amazon Redshift and Amazon RDS enable you to run traditional relational databases in the cloud while offloading database administration. Customers use Amazon RDS databases both for online-transaction processing (OLTP) and for reporting and analysis. Amazon Redshift harnesses the scale and resources of multiple nodes and uses a variety of optimizations to provide order of magnitude improvements over traditional databases for analytic and reporting workloads against very large data sets. Amazon Redshift provides an excellent scale-out option as your data and query complexity grows or if you want to prevent your reporting and analytic processing from interfering with the performance of your OLTP workload.
Q:
When would I use Amazon Redshift vs. Amazon Elastic MapReduce (Amazon EMR)?
Amazon Redshift is ideal for large volumes of structured data that you want to persist and query using standard SQL and your existing BI tools. Amazon EMR is ideal for processing and transforming unstructured or semi-structured data to bring in to Amazon Redshift and is also a much better option for data sets that are relatively transitory, not stored for long-term use.
Q:
Why should I use Amazon Redshift instead of running my own MPP data warehouse
cluster on Amazon EC2?
Amazon Redshift automatically handles many of the time-consuming tasks associated with managing your own data warehouse including:
·
Setup: With Amazon Redshift, you
simply create a data warehouse cluster, define your schema, and begin loading
and querying your data. Provisioning, configuration and patching are all
managed for you.
·
Data Durability: Amazon
Redshift replicates your data within your data warehouse cluster and
continuously backs up your data to Amazon S3, which is designed for eleven
nines of durability. Amazon Redshift mirrors each drive's data to other nodes
within your cluster. If a drive fails, your queries will continue with a slight
latency increase while Redshift rebuilds your drive from replicas. In case of
node failure(s), Amazon Redshift automatically provisions new node(s) and
begins restoring data from other drives within the cluster or from Amazon S3.
It prioritizes restoring your most frequently queried data so your most
frequently executed queries will become performant quickly.
·
Scaling: You can
add or remove nodes from your Amazon Redshift data warehouse cluster with a
single API call or via a few clicks in the AWS Management Console as your
capacity and performance needs change.
·
Automatic Updates and Patching: Amazon
Redshift automatically applies upgrades and patches your data warehouse so you
can focus on your application and not on its administration.
Billing
Q:
How will I be charged and billed for my use of Amazon Redshift?
You pay only for what you use, and there are no minimum or setup fees. You are billed based on:
·
Compute node hours –
Compute
node hours are the total number of hours you run across all your compute nodes
for the billing period. You are billed for 1 unit per node per hour, so a
3-node data warehouse cluster running persistently for an entire month would
incur 2,160 instance hours. You will not be charged for leader node hours; only
compute nodes will incur charges.
·
Backup Storage – Backup
storage is the storage associated with your automated and manual snapshots for
your data warehouse. Increasing your backup retention period or taking
additional snapshots increases the backup storage consumed by your data
warehouse. There is no additional charge for backup storage up to 100% of your
provisioned storage for an active data warehouse cluster. For example, if you
have an active Single Node XL data warehouse cluster with 2TB of local instance
storage, we will provide up to 2TB-Month of backup storage at no additional
charge. Backup storage beyond the provisioned storage size and backups stored
after your cluster is terminated are billed at standard Amazon S3 rates.
·
Data transfer – There is
no Data Transfer charge for data transferred to or from Amazon Redshift outside
of Amazon VPC. Data Transfer to or from Redshift in Amazon VPC accrues standard AWS data transfer charges.
Q:
When does billing of my Amazon Redshift data warehouse clusters begin and end?
Billing commences for a data warehouse cluster as soon as the data warehouse cluster is available. Billing continues until the data warehouse cluster terminates, which would occur upon deletion or in the event of instance failure.
Q:
What defines billable Amazon Redshift instance hours?
Node usage hours are billed for each hour your data warehouse cluster is running in an available state. If you no longer wish to be charged for your data warehouse cluster, you must terminate it to avoid being billed for additional node hours. Partial node hours consumed are billed as full hours.
Q:
Do your prices include taxes?
Except as
otherwise noted, our prices are exclusive of applicable taxes and duties,
including VAT and applicable sales tax. For customers with billing address in
Japan, use of the Asia Pacific (Tokyo) Region is subject to Japanese
Consumption Tax.
Data Integration and Loading
Q:
How do I load data into my Amazon Redshift data warehouse?
You can load data into Amazon Redshift from a range of data sources including Amazon S3, Amazon DynamoDB, Amazon EMR, AWS Data Pipeline, or any SSH-enabled host on Amazon EC2 or on-premises. Amazon Redshift attempts to load your data in parallel into each compute node to maximize the rate at which you can ingest data into your data warehouse cluster. For more details on loading data into Amazon Redshift please view our Getting Started Guide.
Q:
Can I load data using SQL ‘INSERT' statements?
Yes, clients can connect to Amazon Redshift using ODBC or JDBC and issue 'insert' SQL commands to insert the data. Please note this is slower than using S3 or DynamoDB since those methods load data in parallel to each compute node while SQL insert statements load via the single leader node.
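For small or occasional loads over JDBC/ODBC, a multi-row INSERT at least reduces per-statement overhead compared with inserting one row at a time. The sketch below assumes a hypothetical staging table named category_stage; COPY from Amazon S3 remains the recommended path for bulk loads:
insert into category_stage (catid, catgroup, catname, catdesc)
values
(20, 'Concerts', 'Jazz', 'All jazz concerts'),
(21, 'Concerts', 'Folk', 'All folk concerts');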
Q:
How do I load data from my existing Amazon RDS, Amazon EMR, Amazon DynamoDB,
and Amazon EC2 data sources to Amazon Redshift?
You can use our COPY command to load data in parallel directly to Amazon Redshift from Amazon EMR, Amazon DynamoDB, or any SSH-enabled host. Moreover, many ETL companies have certified Amazon Redshift for use with their tools, and a number are offering free trials to help you get started loading your data. Finally, AWS Data Pipeline provides a high performance, reliable, fault tolerant solution to load data from a variety of AWS data sources. You can use AWS Data Pipeline to specify the data source, desired data transformations, and then execute a pre-written import script to load your data into Amazon Redshift.
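As an illustration of the SSH path mentioned above, COPY can read the output of a command run on a remote host when you point it at an SSH manifest stored in Amazon S3. This is only a sketch; the manifest location and role ARN below are placeholders:
copy sales
from 's3://mybucket/ssh_manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
ssh;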
Q:
I have a lot of data for initial loading into Amazon Redshift. Transferring via
the Internet would take a long time. How do I load this data?
You can use AWS Import/Export to transfer the data to Amazon S3 using portable storage devices. In addition, you can use AWS Direct Connect to establish a private network connection between your network or data center and AWS. You can choose 1Gbit/sec or 10Gbit/sec connection ports to transfer your data.
Q:
How does Amazon Redshift keep my data secure?
Amazon Redshift encrypts and keeps your data secure in transit and at rest using industry-standard encryption techniques. To keep data secure in transit, Amazon Redshift supports SSL-enabled connections between your client application and your Redshift data warehouse cluster. To keep your data secure at rest, Amazon Redshift encrypts each block using hardware-accelerated AES-256 as it is written to disk. This takes place at a low level in the I/O subsystem, which encrypts everything written to disk, including intermediate query results. The blocks are backed up as is, which means that backups are encrypted as well. By default, Amazon Redshift takes care of key management but you can choose to manage your keys using your own hardware security modules (HSMs) or manage your keys through AWS Key Management Service.
Q:
Can I use Amazon Redshift in Amazon Virtual Private Cloud (Amazon VPC)?
Yes, you can use Amazon Redshift as part of your VPC configuration. With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own datacenter. This gives you complete control over who can access your Amazon Redshift data warehouse cluster.
Q:
Can I access my Amazon Redshift compute nodes directly?
No. Your Amazon Redshift compute nodes are in a private network space and can only be accessed from your data warehouse cluster's leader node. This provides an additional layer of security for your data.
Q:
What happens to my data warehouse cluster availability and data durability if a
drive on one of my nodes fails?
Your Amazon Redshift data warehouse cluster will remain available in the event of a drive failure; however, you may see a slight decline in performance for certain queries. In the event of a drive failure, Amazon Redshift will transparently use a replica of the data on that drive which is stored on other drives within that node. In addition, Amazon Redshift will attempt to move your data to a healthy drive or will replace your node if it is unable to do so. Single node clusters do not support data replication. In the event of a drive failure you will need to restore the cluster from snapshot on S3. We recommend using at least two nodes for production.
Q:
What happens to my data warehouse cluster availability and data durability in
the event of individual node failure?
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. The data warehouse cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the DB. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from S3 first to allow you to resume querying your data as quickly as possible. Single node clusters do not support data replication. In the event of a drive failure you will need to restore the cluster from snapshot on S3. We recommend using at least two nodes for production.
Q:
What happens to my data warehouse cluster availability and data durability in
the event if my data warehouse cluster's Availability Zone (AZ) has an outage?
If your Amazon Redshift data warehouse cluster's Availability Zone becomes unavailable, you will not be able to use your cluster until power and network access to the AZ are restored. Your data warehouse cluster's data is preserved so you can start using your Amazon Redshift data warehouse as soon as the AZ becomes available again. In addition, you can also choose to restore any existing snapshots to a new AZ in the same Region. Amazon Redshift will restore your most frequently accessed data first so you can resume queries as quickly as possible.
Q:
Does Amazon Redshift support Multi-AZ Deployments?
Currently, Amazon Redshift only supports Single-AZ deployments. You can run data warehouse clusters in multiple AZs by loading data into two Amazon Redshift data warehouse clusters in separate AZs from the same set of Amazon S3 input files. In addition, you can also restore a data warehouse cluster to a different AZ from your data warehouse cluster snapshots.
Q:
How does Amazon Redshift back up my data?
Amazon Redshift replicates all your data within your data warehouse cluster when it is loaded and also continuously backs up your data to S3. Amazon Redshift always attempts to maintain at least three copies of your data (the original and replica on the compute nodes and a backup in Amazon S3). Redshift can also asynchronously replicate your snapshots to S3 in another region for disaster recovery.
Q:
How long does Amazon Redshift retain backups? Is it configurable?
By default, Amazon Redshift retains backups for 1 day. You can configure this to be as long as 35 days.
Q:
How do I restore my Amazon Redshift data warehouse cluster from a backup?
You have access to all the automated backups within your backup retention window. Once you choose a backup from which to restore, we will provision a new data warehouse cluster and restore your data to it.
Q:
Do I need to enable backups for my data warehouse cluster or is it done
automatically?
By default, Amazon Redshift enables automated backups of your data warehouse cluster with a 1-day retention period. Free backup storage is limited to the total size of storage on the nodes in the data warehouse cluster and only applies to active data warehouse clusters. For example, if you have total data warehouse storage of 8TB, we will provide at most 8TB of backup storage at no additional charge. If you would like to extend your backup retention period beyond one day, you can do so using the AWS Management Console or the Amazon Redshift APIs. For more information on automated snapshots, please refer to the Amazon Redshift Management Guide. Amazon Redshift only backs up data that has changed so most snapshots only use up a small amount of your free backup storage.
Q:
How do I manage the retention of my automated backups and snapshots?
You can use the AWS Management Console or Modify Cluster API to manage the period of time your automated backups are retained by modifying the Retention Period parameter. If you want to turn off automated backups altogether, you can do so by setting the retention period to 0 (not recommended).
Q:
What happens to my backups if I delete my data warehouse cluster?
When you delete a data warehouse cluster, you have the ability to specify whether a final snapshot is created upon deletion, which enables a restore of the deleted data warehouse cluster at a later date. All previously created manual snapshots of your data warehouse cluster will be retained and billed at standard Amazon S3 rates, unless you choose to delete them.
Scalability
Q:
How do I scale the size and performance of my Amazon Redshift data warehouse
cluster?
If you would like to increase query performance or respond to CPU, memory or I/O over-utilization, you can increase the number of nodes within your data warehouse cluster via the AWS Management Console or the Modify Cluster API. When you modify your data warehouse cluster, your requested changes will be applied immediately. Metrics for compute utilization, memory utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon CloudWatch's custom metric functionality.
Q:
Will my data warehouse cluster remain available during scaling?
The existing data warehouse cluster remains available for read operations while a new data warehouse cluster gets created during scaling operations. When the new data warehouse cluster is ready, your existing data warehouse cluster will be temporarily unavailable while the canonical name record of the existing data warehouse cluster is flipped to point to the new data warehouse cluster. This period of unavailability typically lasts only a few minutes, and will occur during the maintenance window for your data warehouse cluster, unless you specify that the modification should be applied immediately. Amazon Redshift moves data in parallel from the compute nodes in your existing data warehouse cluster to the compute nodes in your new cluster. This enables your operation to complete as quickly as possible.
Q: Is Amazon Redshift
compatible with my preferred business intelligence software package and ETL
tools?
Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers. You can download Amazon Redshift custom JDBC and ODBC drivers from the Connect Client tab of our Console. We have validated integrations with popular BI and ETL vendors, a number of which are offering free trials to help you get started loading and analyzing your data. You can also go to the AWS Marketplace to deploy and configure solutions designed to work with Amazon Redshift in minutes.
Q:
How do I monitor the performance of my Amazon Redshift data warehouse cluster?
Metrics for compute utilization,
storage utilization, and read/write traffic to your Amazon Redshift data
warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You
can also add additional, user-defined metrics via Amazon CloudWatch's custom metric functionality. In addition to
CloudWatch metrics, Amazon Redshift also provides information on query and
cluster performance via the AWS Management Console. This information enables
you to see which users and queries are consuming the most system resources and
diagnose performance issues. In addition, you can see the resource utilization
on each of your compute nodes to ensure that you have data and queries that are
well balanced across all nodes.
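Beyond the console and CloudWatch, you can also query Redshift's system views from any SQL client. As a rough sketch (assuming the SVL_QLOG system view, which keeps a recent history of queries), the following lists the five longest-running recent queries:
select query, starttime, elapsed, substring
from svl_qlog
order by elapsed desc
limit 5;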
Q:
What is a maintenance window? Will my data warehouse cluster be available
during software maintenance?
You can think of the Amazon Redshift maintenance window as an
opportunity to control when data warehouse cluster modifications (such as
scaling data warehouse cluster by adding more nodes) and software patching
occur, in the event either are requested or required. If a
"maintenance" event is scheduled for a given week, it will be
initiated and completed at some point during the thirty-minute maintenance
window you identify.
Required patching
is automatically scheduled only for patches that are security and durability
related. Such patching occurs infrequently (typically once every few months).
If you do not specify a preferred weekly maintenance window when creating your
data warehouse cluster, a default value will be assigned. If you wish to modify
when maintenance is performed on your behalf, you can do so by modifying your
data warehouse cluster in the AWS Management Console or by using the
ModifyCluster API. Each of your data warehouse clusters can have different
preferred maintenance windows.
COPY Examples
Load FAVORITEMOVIES from a DynamoDB Table
The AWS SDKs include a simple example of creating a DynamoDB table called Movies. (For this example, see Getting Started with DynamoDB.) The following example loads the Amazon Redshift FAVORITEMOVIES table with data from the DynamoDB table. The Amazon Redshift table must already exist in the database.
copy favoritemovies from 'dynamodb://Movies'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;
Load LISTING from an Amazon S3 Bucket
The following example loads LISTING from an Amazon S3 bucket. The COPY command loads all of the files in the /data/listing/ folder.
copy listing
from 's3://mybucket/data/listing/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
Load LISTING from an Amazon EMR Cluster
The following example loads the SALES table with tab-delimited data from lzop-compressed files in an Amazon EMR cluster. COPY will load every file in the myoutput/ folder that begins with part-.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
The following example loads the SALES table with JSON formatted data in an Amazon EMR cluster. COPY will load every file in the myoutput/json/ folder.
copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/json/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
JSON 's3://mybucket/jsonpaths.txt';
Using a Manifest to Specify Data Files
You can use a manifest to ensure that your COPY command loads all of the required files, and only the required files, from Amazon S3. You can also use a manifest when you need to load multiple files from different buckets or files that do not share the same prefix.
For example, suppose you need to load the following three files: custdata1.txt, custdata2.txt, and custdata3.txt. You could use the following command to load all of the files in mybucket that begin with custdata by specifying a prefix:
copy category
from 's3://mybucket/custdata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
If only two of the files exist because of an error, COPY will load only those two files and finish successfully, resulting in an incomplete data load. If the bucket also contains an unwanted file that happens to use the same prefix, such as a file named custdata.backup for example, COPY will load that file as well, resulting in unwanted data being loaded.
To ensure that all of the required files are loaded and to prevent unwanted files from being loaded, you can use a manifest file. The manifest is a JSON-formatted text file that lists the files to be processed by the COPY command. For example, the following manifest loads the three files in the previous example.
{
"entries": [
{"url":"s3://mybucket/custdata.1","mandatory":true},
{"url":"s3://mybucket/custdata.2","mandatory":true},
{"url":"s3://mybucket/custdata.3","mandatory":true}
]
}
The optional mandatory flag indicates whether COPY should terminate if the file does not exist. The default is false. Regardless of any mandatory settings, COPY will terminate if no files are found. In this example, COPY will return an error if any of the files is not found. Unwanted files that might have been picked up if you specified only a key prefix, such as custdata.backup, are ignored, because they are not on the manifest.
The following example uses the manifest in the previous example, which is named cust.manifest.
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
You can use a manifest to load files from different buckets or files that do not share the same prefix. The following example shows the JSON to load data with files whose names begin with a date stamp.
{
"entries": [
{"url":”s3://mybucket/2013-10-04-custdata.txt","mandatory":true},
{"url":”s3://mybucket/2013-10-05-custdata.txt”,"mandatory":true},
{"url":”s3://mybucket/2013-10-06-custdata.txt”,"mandatory":true},
{"url":”s3://mybucket/2013-10-07-custdata.txt”,"mandatory":true}
]
}
The manifest can list files that are in different buckets, as long as the buckets are in the same region as the cluster.
{
"entries": [
{"url":"s3://mybucket-alpha/custdata1.txt","mandatory":false},
{"url":"s3://mybucket-beta/custdata1.txt","mandatory":false},
{"url":"s3://mybucket-beta/custdata2.txt","mandatory":false}
]
}
Load LISTING from a Pipe-Delimited File (Default Delimiter)
The following example is a very simple case in which no options are specified and the input file contains the default delimiter, a pipe character ('|').
copy listing
from 's3://mybucket/data/listings_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
Load LISTING Using Temporary Credentials
The following example uses the SESSION_TOKEN parameter to specify temporary session credentials:
copy listing
from 's3://mybucket/data/listings_pipe.txt'
access_key_id ''
secret_access_key ''
session_token '';
Load EVENT with Options
The following example loads pipe-delimited data into the EVENT table and applies the following rules:
- If pairs of quotation marks are used to surround any character strings, they are removed.
- Both empty strings and strings that contain blanks are loaded as NULL values.
- The load will fail if more than 5 errors are returned.
- Timestamp values must comply with the specified format; for example, a valid timestamp is 2008-09-26 05:43:12.
copy event
from 's3://mybucket/data/allevents_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
removequotes
emptyasnull
blanksasnull
maxerror 5
delimiter '|'
timeformat 'YYYY-MM-DD HH:MI:SS';
Load VENUE from a Fixed-Width Data File
copy venue
from 's3://mybucket/data/venue_fw.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';
The preceding example assumes a data file formatted in the same way as the sample data shown. In the sample following, spaces act as placeholders so that all of the columns are the same width as noted in the specification:
1  Toyota Park              Bridgeview  IL0
2  Columbus Crew Stadium    Columbus    OH0
3  RFK Stadium              Washington  DC0
4  CommunityAmerica BallparkKansas City KS0
5  Gillette Stadium         Foxborough  MA68756
Load CATEGORY from a CSV File
Suppose you want to load the CATEGORY table with the values shown in the following table.
catid | catgroup | catname | catdesc |
---|---|---|---|
12 | Shows | Musicals | Musical theatre |
13 | Shows | Plays | All "non-musical" theatre |
14 | Shows | Opera | All opera, light, and "rock" opera |
15 | Concerts | Classical | All symphony, concerto, and choir concerts |
The following example shows the contents of a text file with the field values separated by commas.
12,Shows,Musicals,Musical theatre
13,Shows,Plays,All "non-musical" theatre
14,Shows,Opera,All opera, light, and "rock" opera
15,Concerts,Classical,All symphony, concerto, and choir concerts
If you load the file using the DELIMITER parameter to specify comma-delimited input, the COPY command will fail because some input fields contain commas. You can avoid that problem by using the CSV parameter and enclosing the fields that contain commas in quote characters. If the quote character appears within a quoted string, you need to escape it by doubling the quote character. The default quote character is a double quotation mark, so you will need to escape each double quotation mark with an additional double quotation mark. Your new input file will look something like this.
12,Shows,Musicals,Musical theatre
13,Shows,Plays,"All ""non-musical"" theatre"
14,Shows,Opera,"All opera, light, and ""rock"" opera"
15,Concerts,Classical,"All symphony, concerto, and choir concerts"
Assuming the file name is category_csv.txt, you can load the file by using the following COPY command:
copy category
from 's3://mybucket/data/category_csv.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
csv;
Alternatively, to avoid the need to escape the double quotation marks in your input, you can specify a different quote character by using the QUOTE AS parameter. For example, the following version of category_csv.txt uses '%' as the quote character:
12,Shows,Musicals,Musical theatre
13,Shows,Plays,%All "non-musical" theatre%
14,Shows,Opera,%All opera, light, and "rock" opera%
15,Concerts,Classical,%All symphony, concerto, and choir concerts%
The following COPY command uses QUOTE AS to load category_csv.txt:
copy category
from 's3://mybucket/data/category_csv.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
csv quote as '%';
Load VENUE with Explicit Values for an IDENTITY Column
The following example assumes that when the VENUE table was created, at least one column (such as the venueid column) was specified to be an IDENTITY column. This command overrides the default IDENTITY behavior of auto-generating values for an IDENTITY column and instead loads the explicit values from the venue.txt file.
copy venue
from 's3://mybucket/data/venue.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
explicit_ids;
Load TIME from a Pipe-Delimited GZIP File
The following example loads the TIME table from a pipe-delimited GZIP file:
copy time
from 's3://mybucket/data/timerows.gz'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
gzip
delimiter '|';
Load a Timestamp or Datestamp
The following example loads data with a formatted timestamp.
Note
The TIMEFORMAT of HH:MI:SS can also support fractional seconds beyond the SS to a microsecond level of detail. The file time.txt used in this example contains one row, 2009-01-12 14:15:57.119568.
copy timestamp1
from 's3://mybucket/data/time.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
timeformat 'YYYY-MM-DD HH:MI:SS';
The result of this copy is as follows:
select * from timestamp1;
c1
----------------------------
2009-01-12 14:15:57.119568
(1 row)
Load Data from a File with Default Values
The following example uses a variation of the VENUE table in the TICKIT database. Consider a VENUE_NEW table defined with the following statement:
create table venue_new(
venueid smallint not null,
venuename varchar(100) not null,
venuecity varchar(30),
venuestate char(2),
venueseats integer not null default '1000');
Consider a venue_noseats.txt data file that contains no values for the VENUESEATS column, as shown in the following example:
1|Toyota Park|Bridgeview|IL|
2|Columbus Crew Stadium|Columbus|OH|
3|RFK Stadium|Washington|DC|
4|CommunityAmerica Ballpark|Kansas City|KS|
5|Gillette Stadium|Foxborough|MA|
6|New York Giants Stadium|East Rutherford|NJ|
7|BMO Field|Toronto|ON|
8|The Home Depot Center|Carson|CA|
9|Dick's Sporting Goods Park|Commerce City|CO|
10|Pizza Hut Park|Frisco|TX|
The following COPY statement will successfully load the table from the file and apply the DEFAULT value ('1000') to the omitted column:
copy venue_new(venueid, venuename, venuecity, venuestate)
from 's3://mybucket/data/venue_noseats.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';
Now view the loaded table:
select * from venue_new order by venueid;
venueid | venuename | venuecity | venuestate | venueseats
---------+----------------------------+-----------------+------------+------------
1 | Toyota Park | Bridgeview | IL | 1000
2 | Columbus Crew Stadium | Columbus | OH | 1000
3 | RFK Stadium | Washington | DC | 1000
4 | CommunityAmerica Ballpark | Kansas City | KS | 1000
5 | Gillette Stadium | Foxborough | MA | 1000
6 | New York Giants Stadium | East Rutherford | NJ | 1000
7 | BMO Field | Toronto | ON | 1000
8 | The Home Depot Center | Carson | CA | 1000
9 | Dick's Sporting Goods Park | Commerce City | CO | 1000
10 | Pizza Hut Park | Frisco | TX | 1000
(10 rows)
For the following example, in addition to assuming that no VENUESEATS data is included in the file, also assume that no VENUENAME data is included:
1||Bridgeview|IL|
2||Columbus|OH|
3||Washington|DC|
4||Kansas City|KS|
5||Foxborough|MA|
6||East Rutherford|NJ|
7||Toronto|ON|
8||Carson|CA|
9||Commerce City|CO|
10||Frisco|TX|
Using the same table definition, the following COPY statement will fail because no DEFAULT value was specified for VENUENAME, and VENUENAME is a NOT NULL column:
copy venue(venueid, venuecity, venuestate)
from 's3://mybucket/data/venue_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';
Now consider a variation of the VENUE table that uses an IDENTITY column:
create table venue_identity(
venueid int identity(1,1),
venuename varchar(100) not null,
venuecity varchar(30),
venuestate char(2),
venueseats integer not null default '1000');
As with the previous example, assume that the VENUESEATS column has no corresponding values in the source file. The following COPY statement will successfully load the table, including the predefined IDENTITY data values instead of autogenerating those values:
copy venue(venueid, venuename, venuecity, venuestate)
from 's3://mybucket/data/venue_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' explicit_ids;
This statement fails because it does not include the IDENTITY column (VENUEID is missing from the column list) yet includes an EXPLICIT_IDS parameter:
copy venue(venuename, venuecity, venuestate)
from 's3://mybucket/data/venue_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' explicit_ids;
This statement fails because it does not include an EXPLICIT_IDS parameter:
copy venue(venueid, venuename, venuecity, venuestate)
from 's3://mybucket/data/venue_pipe.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';
COPY Data with the ESCAPE Option
The following example shows how to load characters that match the delimiter character (in this case, the pipe character). In the input file, make sure that all of the pipe characters (|) that you want to load are escaped with the backslash character (\). Then load the file with the ESCAPE parameter.
$ more redshiftinfo.txt
1|public\|event\|dwuser
2|public\|sales\|dwuser
create table redshiftinfo(infoid int,tableinfo varchar(50));
copy redshiftinfo from 's3://mybucket/data/redshiftinfo.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' escape;
select * from redshiftinfo order by 1;
infoid | tableinfo
-------+--------------------
1 | public|event|dwuser
2 | public|sales|dwuser
(2 rows)
Without the ESCAPE parameter, this COPY command fails with an Extra column(s) found error.
Important
If you load your data using a COPY with the ESCAPE parameter, you must also specify the ESCAPE parameter with your UNLOAD command to generate the reciprocal output file. Similarly, if you UNLOAD using the ESCAPE parameter, you will need to use ESCAPE when you COPY the same data.
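For illustration, the reciprocal UNLOAD for the redshiftinfo table loaded above would also specify ESCAPE along with the same delimiter. This is only a sketch; the bucket path and role ARN are placeholders:
unload ('select * from redshiftinfo')
to 's3://mybucket/unload/redshiftinfo_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|' escape;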
Copy from JSON Examples
In the following examples, you will load the CATEGORY table with the following data.
catid | catgroup | catname   | catdesc
------+----------+-----------+--------------------------------------------
1     | Sports   | MLB       | Major League Baseball
2     | Sports   | NHL       | National Hockey League
3     | Sports   | NFL       | National Football League
4     | Sports   | NBA       | National Basketball Association
5     | Concerts | Classical | All symphony, concerto, and choir concerts
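These examples assume that the CATEGORY table already exists. A minimal definition along the following lines would work; the column lengths shown here are assumptions rather than values given in this post.
-- Minimal CATEGORY table sketch used as the target of the JSON and Avro COPY examples.
create table category(
catid smallint,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50));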
Load from JSON Data Using the 'auto' Option
To load from JSON data using the 'auto' argument, the JSON data must consist of a set of objects. The key names must match the column names, but in this case, order does not matter. The following shows the contents of a file named category_object_auto.json.
{
"catdesc": "Major League Baseball",
"catid": 1,
"catgroup": "Sports",
"catname": "MLB"
}
{
"catgroup": "Sports",
"catid": 2,
"catname": "NHL",
"catdesc": "National Hockey League"
}
{
"catid": 3,
"catname": "NFL",
"catgroup": "Sports",
"catdesc": "National Football League"
}
{
"bogus": "Bogus Sports LLC",
"catid": 4,
"catgroup": "Sports",
"catname": "NBA",
"catdesc": "National Basketball Association"
}
{
"catid": 5,
"catgroup": "Shows",
"catname": "Musicals",
"catdesc": "All symphony, concerto, and choir concerts"
}
To load from the JSON data file in the previous example, execute the following COPY command.
copy category
from 's3://mybucket/category_object_auto.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 'auto';
Load from JSON Data Using a JSONPaths file
If the JSON data objects don't correspond directly to column names, you can use a JSONPaths file to map the JSON elements to columns. Again, the order does not matter in the JSON source data, but the order of the JSONPaths file expressions must match the column order. Suppose you have the following data file, named category_object_paths.json.
{
"one": 1,
"two": "Sports",
"three": "MLB",
"four": "Major League Baseball"
}
{
"three": "NHL",
"four": "National Hockey League",
"one": 2,
"two": "Sports"
}
{
"two": "Sports",
"three": "NFL",
"one": 3,
"four": "National Football League"
}
{
"one": 4,
"two": "Sports",
"three": "NBA",
"four": "National Basketball Association"
}
{
"one": 6,
"two": "Shows",
"three": "Musicals",
"four": "All symphony, concerto, and choir concerts"
}
The following JSONPaths file, named category_jsonpath.json, maps the source data to the table columns.
{
"jsonpaths": [
"$['one']",
"$['two']",
"$['three']",
"$['four']"
]
}
To load from the JSON data file in the previous example, execute the following COPY command.
copy category
from 's3://mybucket/category_object_paths.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 's3://mybucket/category_jsonpath.json';
Load from JSON Arrays Using a JSONPaths file
To load from JSON data that consists of a set of arrays, you must use a JSONPaths file to map the array elements to columns. Suppose you have the following data file, named category_array_data.json.
[1,"Sports","MLB","Major League Baseball"]
[2,"Sports","NHL","National Hockey League"]
[3,"Sports","NFL","National Football League"]
[4,"Sports","NBA","National Basketball Association"]
[5,"Concerts","Classical","All symphony, concerto, and choir concerts"]
The following JSONPaths file, named category_array_jsonpath.json, maps the source data to the table columns.
{
"jsonpaths": [
"$[0]",
"$[1]",
"$[2]",
"$[3]"
]
}
To load from the JSON data file in the previous example, execute the following COPY command.
copy category
from 's3://mybucket/category_array_data.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 's3://mybucket/category_array_jsonpath.json';
Copy from Avro Examples
In the following examples, you will load the CATEGORY table with the following data.
catid | catgroup | catname   | catdesc
------+----------+-----------+--------------------------------------------
1     | Sports   | MLB       | Major League Baseball
2     | Sports   | NHL       | National Hockey League
3     | Sports   | NFL       | National Football League
4     | Sports   | NBA       | National Basketball Association
5     | Concerts | Classical | All symphony, concerto, and choir concerts
Load from Avro Data Using the 'auto' Option
To load from Avro data using the 'auto' argument, field names in the Avro schema must match the column names. However, when using the 'auto' argument, order does not matter. The following shows the schema for a file named category_auto.avro.
{
    "name": "category",
    "type": "record",
    "fields": [
        {"name": "catid", "type": "int"},
        {"name": "catdesc", "type": "string"},
        {"name": "catname", "type": "string"},
        {"name": "catgroup", "type": "string"}
    ]
}
The data in an Avro file is in binary format, so it is not human-readable. The following shows a JSON representation of the data in the category_auto.avro file.
{ "catid": 1, "catdesc": "Major League Baseball", "catname": "MLB", "catgroup": "Sports" }
{ "catid": 2, "catdesc": "National Hockey League", "catname": "NHL", "catgroup": "Sports" }
{ "catid": 3, "catdesc": "National Basketball Association", "catname": "NBA", "catgroup": "Sports" }
{ "catid": 4, "catdesc": "All symphony, concerto, and choir concerts", "catname": "Classical", "catgroup": "Concerts" }
To load from the Avro data file in the previous example, execute the following COPY command.
copy category
from 's3://mybucket/category_auto.avro'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
format as avro 'auto';
Load from Avro Data Using a JSONPaths File
If the field names in the Avro schema don't correspond directly to column names, you can use a JSONPaths file to map the schema elements to columns. The order of the JSONPaths file expressions must match the column order.
Suppose you have a data file named category_paths.avro that contains the same data as in the previous example, but with the following schema.
{
    "name": "category",
    "type": "record",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "desc", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "group", "type": "string"},
        {"name": "region", "type": "string"}
    ]
}
The following JSONPaths file, named category_path.avropath, maps the source data to the table columns.
{
    "jsonpaths": [
        "$['id']",
        "$['group']",
        "$['name']",
        "$['desc']"
    ]
}
To load from the Avro data file in the previous example, execute the following COPY command.
copy category
from 's3://mybucket/category_paths.avro'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
format avro 's3://mybucket/category_path.avropath';
Preparing Files for COPY with the ESCAPE Option
The following example describes how you might prepare data to "escape" newline characters before importing the data into an Amazon Redshift table using the COPY command with the ESCAPE parameter. Without preparing the data to delimit the newline characters, Amazon Redshift will return load errors when you run the COPY command, because the newline character is normally used as a record separator.
For example, consider a file or a column in an external table that you want to copy into an Amazon Redshift table. If the file or column contains XML-formatted content or similar data, you will need to make sure that all of the newline characters (\n) that are part of the content are escaped with the backslash character (\).
A good thing about a file or table containing embedded newline characters is that it provides a relatively easy pattern to match. Each embedded newline character most likely always follows a > character, with potentially some white space characters (' ' or tab) in between, as you can see in the following example of a text file named nlTest1.txt.
$ cat nlTest1.txt
<xml start>
<newline characters provide>
<line breaks at the end of each>
<line in content>
</xml>|1000
<xml>
</xml>|2000
With the following example, you can run a text-processing utility to pre-process the source file and insert escape characters where needed. (The | character is intended to be used as a delimiter to separate column data when copied into an Amazon Redshift table.)
$ sed -e ':a;N;$!ba;s/>[[:space:]]*\n/>\\\n/g' nlTest1.txt > nlTest2.txt
You can also use Perl to perform the same operation:
cat nlTest1.txt | perl -p -e 's/>\s*\n/>\\\n/g' > nlTest2.txt
To accommodate loading the data from the nlTest2.txt file into Amazon Redshift, we created a two-column table. The first column, c1, is a character column that holds XML-formatted content from the nlTest2.txt file. The second column, c2, holds integer values loaded from the same file.
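The table definition itself is not shown in this post; a sketch consistent with that description is given below. The column names c1 and c2 come from the query output later in this section, while the VARCHAR length is an assumption.
-- Two-column target table: c1 holds the XML-formatted text, c2 the integer values.
create table t2(
c1 varchar(200),
c2 int);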
After running the sed command, you can correctly load data from the nlTest2.txt file into an Amazon Redshift table using the ESCAPE parameter.
Note
When you include the ESCAPE parameter with the COPY command, the backslash character is treated as an escape character, so that special characters such as the delimiter, the newline character, and the backslash itself can be loaded as part of the data.
copy t2 from 's3://mybucket/data/nlTest2.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
escape
delimiter as '|';
select * from t2 order by 2;
c1                               | c2
---------------------------------+------
<xml start>
<newline characters provide>
<line breaks at the end of each>
<line in content>
</xml>                           | 1000
<xml>
</xml>                           | 2000
(2 rows)
You can prepare data files exported from external databases in a similar way. For example, with an Oracle database, you can use the REPLACE function on each affected column in a table that you want to copy into Amazon Redshift.
SELECT c1, REPLACE(c2, '\n', '\\n') as c2 from my_table_with_xml
In addition, many database export and extract, transform, load (ETL) tools that routinely process large amounts of data provide options to specify escape and delimiter characters.