SELECTSELECT

SELECT

60x faster database clones in Snowflake

By Niall WoodwardOct 22, 20223 min read

This page is also available in Deutsch, Español, Français, Italiano, 日本語, and Português.

Introduction

I had the pleasure of attending dbt’s Coalesce conference in London last week, and dropped into a really great talk by Felipe Leite and Stephen Pastan of Miro. They mentioned how they’d achieved a considerable speed improvement by switching database clones out for multiple table clones. I had to check it out.

Experiments

Results were collected using the following query which measures the duration of each process by passing in the earliest query start time and last query end time to the DATEDIFF function:

select
    count(*) as query_count,
    datediff(seconds, min(start_time), max(end_time)) as duration,
    sum(credits_used_cloud_services) as credits_used_cloud_services
from snowflake.account_usage.query_history where query_tag = X;

Optimize your Snowflake usage

SELECT automatically optimizes and helps you manage your Snowflake usage with ease.

Setup

Create a database with 10 schemas, 100 tables in each:

import snowflake.connector

con = snowflake.connector.connect(
    ...
)

for i in range(1, 11):
    con.cursor().execute(f"create schema test.schema_{i};")
    for j in range(1, 101):
	    con.cursor().execute(f"create table test.schema_{i}.table_{j} (i number) as (select 1);")

Control - Database clone

1create database test_1 clone test;

This operation took 22m 34s to execute.

Results:

Query Count 1
Duration 22m 34s
Cloud services credits 0.179

Experiment 1 - Schema level clones

import snowflake.connector
from snowflake.connector import DictCursor

def clone_database_by_schema(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show schemas in database {source_database};")
    for i in cursor.fetchall():
        if i["name"] not in ("INFORMATION_SCHEMA", "PUBLIC"):
            con.cursor().execute_async(f"create schema {target_database}.{i['name']} clone {source_database}.{i['name']};")

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 2',

Expand Code

Results:

Query Count 12
Duration 1m 47s
Cloud services credits 0.148

Using execute_async executes each SQL statement without waiting for each to complete, resulting in all 10 schemas being cloned concurrently. A whopping 10x faster from start to finish compared with the regular database clone.

Experiment 2 - Table level clones

import snowflake.connector
from snowflake.connector import DictCursor

def clone_database_by_table(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show tables in database {source_database};")
    results = cursor.fetchall()
    schemas_to_create = {r['schema_name'] for r in results}
    tables_to_clone = [f"{r['schema_name']}.{r['name']}" for r in results]

    for schema in schemas_to_create:
        con.cursor().execute(f"create schema {target_database}.{schema};")

    for table in tables_to_clone:

Expand Code

This took 1 minute 48s to complete, the limiting factor being the rate at which the queries could be dispatched by the client (likely due to network waiting times). To help mitigate that, I distributed the commands across 10 threads:

1import snowflake.connector

2from snowflake.connector import DictCursor

3import threading

4

5class ThreadedRunCommands():

6    """Helper class for running queries across a configurable number of threads"""

7    def __init__(self, con, threads):

8        self.threads = threads

9        self.register_command_thread = 0

10        self.thread_commands = [\
\
11            [] for _ in range(self.threads)\
\
12        ]

13        self.con = con

14

15    def register_command(self, command):

Expand Code

Results:

Query Count 1012
Duration 22s
Cloud services credits 0.165

Using 10 threads, the time between the create database command starting and the final create table ... clone command completing was only 22 seconds. This is 60x faster than the create database ... clone command. The bottleneck is still the rate at which queries can be dispatched.

In Summary

The complete results:

Clone Strategy Control - Database clone Experiment 1 - Schema level clones Experiment 2 - Table level clones
Query count 1 12 1012
Duration 22m 34s 1m 47s 22s
Cloud services credits 0.179 0.148 0.165

All the queries ran were cloud services only, and did not require a running warehouse or resume a suspended one.

I hope that Snowflake improves their schema and database clone functionality, but in the mean time, cloning tables seems to be the way to go.

Thanks again to Felipe Leite and Stephen Pastan of Miro for sharing this!

Niall Woodward·Co-founder & CTO of SELECT

Niall is the Co-Founder & CTO of SELECT, a SaaS Snowflake cost management and optimization platform. Prior to starting SELECT, Niall was a data engineer at Brooklyn Data Company and several startups. As an open-source enthusiast, he's also a maintainer of SQLFluff, and creator of three dbt packages: dbt_artifacts, dbt_snowflake_monitoring and dbt_query_tags.