Deploy with Modal
Introduction to Modal
Modal is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.
With Modal, you can perform tasks like running generative models, large-scale batch jobs, and job queues, all while easily scaling compute resources.
Modal features
- Serverless Compute: No infrastructure management; scales automatically from zero to thousands of CPUs/GPUs.
- Cloud Functions: Run Python code in the cloud instantly and scale horizontally (see the sketch after this list).
- GPU/CPU Scaling: Easily attach GPUs for heavy tasks like AI model training with a single line of code.
- Web Endpoints: Expose any function as an HTTPS API endpoint quickly.
- Scheduled Jobs: Convert Python functions into scheduled tasks effortlessly.
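As a quick illustration of the Cloud Functions and Scheduled Jobs features, here is a minimal sketch of a Modal app that runs a plain Python function in the cloud once a day (the app and function names are illustrative):

import modal

app = modal.App("feature-demo")

@app.function(schedule=modal.Period(days=1))
def say_hello() -> None:
    # Runs in a Modal container in the cloud, once per day
    print("Hello from Modal!")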
To know more, please refer to Modal's documentation.
How to run dlt on Modal
Here’s a dlt project setup to copy data from a public MySQL database into DuckDB as the destination:
Step 1: Initialize source
Run the dlt init CLI command to initialize the SQL database source and set up the sql_database_pipeline.py template.
dlt init sql_database duckdb
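The steps that follow assume the Modal client is installed and authenticated in your environment; if it is not, something along these lines (standard Modal setup, not part of the generated template) should get you there:

pip install modal
modal setup  # authenticate against your Modal workspace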
Step 2: Define Modal Image
Open the file and define the Modal Image you want to run dlt in:
import modal

# Define the Modal Image with dlt and the source/destination dependencies
image = modal.Image.debian_slim().pip_install(
    "dlt>=1.1.0",
    "dlt[duckdb]",  # destination
    "dlt[sql_database]",  # source (MySQL)
    "pymysql",  # database driver for MySQL source
)

app = modal.App("example-dlt", image=image)

# Modal Volume used to store the duckdb database file
vol = modal.Volume.from_name("duckdb-vol", create_if_missing=True)
Step 3: Define Modal Function
A Modal Function is a containerized environment that runs tasks. It can run on a schedule (e.g., daily or via a cron expression), request more CPU/memory, and scale across multiple containers.
Here’s how to include your SQL pipeline in the Modal Function:
@app.function(
    volumes={"/data/": vol},
    schedule=modal.Period(days=1),
    secrets=[modal.Secret.from_name("sql-secret")],
    serialized=True,
)
def load_tables() -> None:
    import os

    import dlt
    from dlt.sources.sql_database import sql_database

    # Define the source database credentials; in production, you would save this
    # as a Modal Secret which can be referenced here as an environment variable
    os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = (
        "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
    )

    # Load tables "family" and "genome"
    source = sql_database().with_resources("family", "genome")

    # Create dlt pipeline object
    pipeline = dlt.pipeline(
        pipeline_name="sql_to_duckdb_pipeline",
        destination=dlt.destinations.duckdb(
            "/data/rfam.duckdb"
        ),  # write the duckdb database file to this location, which is mounted to the Modal Volume
        dataset_name="sql_to_duckdb_pipeline_data",
        progress="log",  # output progress of the pipeline
    )

    # Run the pipeline
    load_info = pipeline.run(source)

    # Print run statistics
    print(load_info)
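After a run, the duckdb database file lives in the duckdb-vol Volume rather than on your machine. As a sketch, assuming the Modal CLI's volume commands and the paths used above, you can pull the file down:

modal volume get duckdb-vol rfam.duckdb .

and then inspect it locally with the duckdb Python client:

import duckdb

con = duckdb.connect("rfam.duckdb")
print(con.sql("SELECT COUNT(*) FROM sql_to_duckdb_pipeline_data.family"))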
Step 4: Set up credentials
You can securely store your credentials using Modal secrets. When you reference secrets within a Modal script, the defined secret is automatically set as an environment variable. dlt natively supports environment variables, enabling seamless integration of your credentials. For example, to declare a connection string, you can define it as follows:
SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam
In the script above, the credentials specified are automatically utilized by dlt. For more details, please refer to the documentation.
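For example, the sql-secret referenced in the function above can be created from the CLI (secrets can also be created in the Modal dashboard); the command below is a sketch using this example's public connection string:

modal secret create sql-secret \
    SOURCES__SQL_DATABASE__CREDENTIALS="mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"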
Step 5: Run pipeline
To run your pipeline a single time, use the following command:
modal run sql_database_pipeline.py
Step 6: Deploy
If you want to deploy your pipeline on Modal for continuous execution or scheduling, use this command:
modal deploy sql_database_pipeline.py
Advanced configuration
Modal Proxy
If your database is in a private VPN, you can use Modal Proxy as a bastion server (available for Enterprise customers). To connect to a production read replica, attach the proxy to the function definition and change the hostname to localhost:
@app.function(
    secrets=[
        modal.Secret.from_name("postgres-read-replica-prod"),
    ],
    schedule=modal.Cron("24 6 * * *"),
    proxy=modal.Proxy.from_name("prod-postgres-proxy", environment_name="main"),
    timeout=3000,
)
def task_pipeline(dev: bool = False) -> None:
    import os

    # Connect through the proxy: the hostname is localhost, the rest comes from the secret
    pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
Capturing deletes
To capture updates or deleted rows from your Postgres database, consider using dlt's Postgres CDC replication feature.
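If you want to try this, the replication source can be scaffolded the same way as the SQL source in Step 1; assuming the verified source is named pg_replication, the init command would look like:

dlt init pg_replication duckdb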
Sync Multiple Tables in Parallel
To sync multiple tables in parallel, map each table copy job to a separate container using Modal's starmap:
@app.function(timeout=3000, schedule=modal.Cron("29 11 * * *"))
def main(dev: bool = False):
    tables = [
        ("task", "enqueued_at", dev),
        ("worker", "launched_at", dev),
        ...
    ]
    list(load_table_from_database.starmap(tables))
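The snippet above assumes a load_table_from_database Modal Function that copies a single table; it is not shown here. A minimal sketch of what it could look like, reusing the image, volume, and secret from the DuckDB example above and treating the second tuple element as an incremental cursor column (all names and arguments are illustrative, not the article's actual implementation):

@app.function(
    timeout=3000,
    volumes={"/data/": vol},
    secrets=[modal.Secret.from_name("sql-secret")],
    serialized=True,
)
def load_table_from_database(table: str, cursor_column: str, dev: bool = False) -> None:
    import dlt
    from dlt.sources.sql_database import sql_database

    # Select a single table from the source and load it incrementally on the cursor column;
    # the dev flag could switch between dev/prod destinations and is unused in this sketch
    source = sql_database().with_resources(table)
    source.resources[table].apply_hints(
        incremental=dlt.sources.incremental(cursor_column)
    )

    pipeline = dlt.pipeline(
        pipeline_name=f"sql_to_duckdb_{table}",
        destination=dlt.destinations.duckdb("/data/rfam.duckdb"),
        dataset_name="sql_to_duckdb_pipeline_data",
        progress="log",
    )
    print(pipeline.run(source))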
More examples
For a practical, real-world example, check out the article "Building a Cost-Effective Analytics Stack with Modal, dlt, and dbt".
This article illustrates how to automate a workflow for loading data from Postgres into Snowflake using dlt, providing valuable insights into building an efficient analytics pipeline.