{"id":56884,"date":"2023-03-16T11:38:07","date_gmt":"2023-03-16T06:08:07","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=56884"},"modified":"2023-09-08T16:50:03","modified_gmt":"2023-09-08T11:20:03","slug":"best-practices-and-hacks-for-data-loading-in-snowflake-from-stages","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/best-practices-and-hacks-for-data-loading-in-snowflake-from-stages\/","title":{"rendered":"Best practices and hacks for Data Loading in Snowflake from Stages"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Continuing our Snowflake blog series, after learning about setting up a<\/span> <a href=\"https:\/\/www.tothenew.com\/blog\/snowflake-account-setup-using-system-defined-roles\/\"><span style=\"font-weight: 400;\">Snowflake account using System-defined Roles<\/span><\/a><span style=\"font-weight: 400;\">, we will explore the best practices for loading data from a file into Snowflake.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let\u2019s begin.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Snowflake supports file-based data ingestion through Internal and External stages.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, several factors influence how data should be ingested, including the frequency of data arrival, file sizes, and the loading technique used, such as the COPY command, External tables, or Snowpipe. The file format, such as CSV, Parquet, or JSON, also plays a critical role in choosing the right approach.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Making the wrong choice can result in increased costs and slower performance. 
<\/span><span style=\"font-weight: 400;\">This blog provides insights into these approaches to help you select the best one when loading data into Snowflake.<\/span><\/p>\n<h2><b>Faster Loading of Parquet files from S3<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">To load Parquet files from S3 into Snowflake faster, avoid the plain COPY command, which is better suited to CSV files placed in External stages.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Instead, it is recommended to create External tables on top of the Parquet files and enable the &#8220;vector scan performance&#8221; flag.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To turn on this property, you need to contact Snowflake support and have it enabled at the account level.<\/span><\/p>\n<h2><b>Parallel operations<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">To improve the efficiency of loading data from stages into Snowflake, it is recommended to create compressed files in the range of 100-250MB.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">By default, each node can handle 8 threads in parallel. Therefore, if you have 32 files, a Medium warehouse with 4 nodes can process all of them simultaneously, running 32 threads in parallel.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It&#8217;s important to remember that the performance of parallel operations in Snowflake is also affected by the complexity of the operations being performed. Additionally, Snowflake may not always achieve the defined degree of parallelism due to factors such as resource availability and load balancing. 
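<\/span><\/p>
<p><span style=\"font-weight: 400;\">As a rough sketch of the sizing math above (the warehouse, database, stage, and table names here are hypothetical), a Medium warehouse can drain 32 suitably sized files in a single COPY:<\/span><\/p>
<pre>-- Hypothetical example: 32 compressed files of ~100-250MB each sitting in the stage.
-- A Medium warehouse (4 nodes x 8 threads) can load all 32 files in parallel.
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'MEDIUM';

COPY INTO my_db.my_schema.orders
FROM @my_db.my_schema.my_s3_stage\/orders\/
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP' SKIP_HEADER = 1);<\/pre>
<p><span style=\"font-weight: 400;\">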
Therefore, it&#8217;s important to monitor and adjust the parallelism to ensure optimal performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since Snowflake bills a warehouse for a minimum of one minute each time it resumes, ensure that the warehouse used for data loading does at least a minute&#8217;s worth of work to fully utilize that billed time.<\/span><\/p>\n<h2><b>Purging files from Stage<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">It is recommended to remove files from the stage after successful loading by using the COPY command with the PURGE = TRUE option.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This ensures that staged files are deleted once loaded, which reduces storage costs and spares subsequent COPY commands from scanning the entire bucket path to check for files to load.<\/span><\/p>\n<h2><b>Loading large files in Snowflake<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Suppose you need to load a single multi-gigabyte file that may contain bad data in some of its rows. In such cases, it is advisable to load the data while ignoring the errors. Failing to do so can waste credits: by default, COPY aborts the whole statement on the first error, so if an error row is identified after an hour of processing, the entire operation is aborted and the credits consumed up to that point are wasted.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Setting &#8220;ON_ERROR&#8221; to &#8220;CONTINUE&#8221; will load the good rows and skip the bad ones. However, it is always recommended to split large files into smaller chunks so that the warehouse can load them in parallel. 
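<\/span><\/p>
<p><span style=\"font-weight: 400;\">Where skipping bad rows is acceptable, a minimal sketch of such an error-tolerant load looks like this (the database, stage, and file names are hypothetical):<\/span><\/p>
<pre>-- Load the good rows and skip the bad ones instead of aborting the whole statement.
COPY INTO my_db.my_schema.events
FROM @my_db.my_schema.my_stage\/big_file.csv.gz
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
ON_ERROR = 'CONTINUE';

-- Inspect the rows that were rejected by the last COPY into this table.
SELECT * FROM TABLE(VALIDATE(my_db.my_schema.events, JOB_ID => '_last'));<\/pre>
<p><span style=\"font-weight: 400;\">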
If splitting is not possible and you still want to load one large file, first confirm that it is acceptable to skip the bad rows and continue loading.<\/span><\/p>\n<h2><b>Best practice while loading JSON data<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">When loading JSON data into Snowflake, note that the output from some applications can be a single large array containing many rows. Due to the limitations of the VARIANT datatype (which has a maximum size of 16 MB), such a huge JSON output cannot be loaded in its entirety.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Moreover, it&#8217;s not appropriate to load a single huge array as a single row. Instead, the array should be split into multiple rows. To achieve this, use the STRIP_OUTER_ARRAY = TRUE option in the FILE_FORMAT parameter of the COPY command. This option strips the outer array and identifies the individual rows, allowing for efficient loading of JSON data into Snowflake.<\/span><\/p>\n<h2><b>Snowpipe File loading best strategies<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Snowpipe charges for the compute time taken to load each file placed on the stage, plus an overhead for maintaining the queue of files waiting to be loaded. If a large number of files of varying sizes arrives continuously at the stage location, loading slows down and expenses increase.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To ensure effective loading, keep file sizes within the 100-250MB range and avoid sudden surges of files in the stage location, which lengthen the queue.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can optimize the loading process by staging files at a frequency close to the time Snowpipe takes to load a single file. 
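<\/span><\/p>
<p><span style=\"font-weight: 400;\">For reference, a basic auto-ingesting pipe can be sketched as follows (the database, stage, table, and pipe names are hypothetical, and AUTO_INGEST assumes S3 event notifications are already configured):<\/span><\/p>
<pre>-- Hypothetical pipe that loads new files as soon as they land in the external stage.
CREATE OR REPLACE PIPE my_db.my_schema.orders_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO my_db.my_schema.orders
FROM @my_db.my_schema.my_s3_stage\/orders\/
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP');

-- Check the pipe's execution state and pending file count.
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.orders_pipe');<\/pre>
<p><span style=\"font-weight: 400;\">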
This will help ensure efficient utilization of resources and reduce costs.<\/span><\/p>\n<h2><b>Make use of the Internal stage wherever applicable<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">To optimize data loading in Snowflake, it is recommended to use the Internal stage instead of External stages such as S3, as it results in faster loading into the Snowflake table.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It is also important to compare the cost of storing data in Snowflake&#8217;s stage location with that of the cloud provider&#8217;s object storage. To reduce costs, purge the data from the stage when it is no longer needed.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Thank you for reading this article.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In the upcoming installment of this series, we will explore the best practices for optimizing the use of Snowflake warehouses and tables.<\/span><\/p>\n<div class=\"ap-custom-wrapper\"><\/div><!--ap-custom-wrapper-->","protected":false},"excerpt":{"rendered":"<p>Continuing our Snowflake blog series, after learning about setting up a Snowflake account using System-defined Roles, we will explore the best practices for loading data from a file into Snowflake. Let\u2019s begin. 
Snowflake supports file-based data ingestion through Internal and External stages.\u00a0 However, there are various factors to consider when performing data ingestion, including the [&hellip;]<\/p>\n","protected":false},"author":1552,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":283},"categories":[4685,1395,4308,4682,4831],"tags":[5137],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/56884"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1552"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=56884"}],"version-history":[{"count":3,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/56884\/revisions"}],"predecessor-version":[{"id":58434,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/56884\/revisions\/58434"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=56884"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=56884"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=56884"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}