Caching

Manage file caching inside of Kestra.

Kestra provides file caching, which is especially useful when you work with sizeable package dependencies that don’t change often.

Cache files in a WorkingDirectory task

The file caching functionality on the WorkingDirectory task lets you cache a subset of files to speed up workflow execution.

Use cases for file caching

File caching is useful when you need to install pip or npm packages before running your script. You can cache the node_modules or Python venv folder to avoid reinstalling the dependencies on each run.

To do that, add a cache property to your WorkingDirectory task. Its patterns property accepts a list of glob patterns that match the files to cache, and its ttl property accepts a duration (for example, PT1H for one hour) after which the cache is automatically invalidated.

id: caching_files
namespace: company.team
tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    cache:
      patterns:
        - some_directory/**
      ttl: PT1H
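
The snippet above shows only the cache configuration. As a fuller sketch, the flow below adds a child task that populates the cached directories, so the effect of the cache can be observed across runs. The flow id, the populate task, the downloads/*.csv pattern, and the shell commands are illustrative assumptions, not part of the original example.

```yaml
id: caching_multiple_patterns   # hypothetical flow id
namespace: company.team

tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    cache:
      # Each entry is a glob pattern matching files in the working directory
      patterns:
        - some_directory/**
        - downloads/*.csv       # illustrative second pattern
      # Duration after which the cache is invalidated (30 minutes)
      ttl: PT30M
    tasks:
      - id: populate            # hypothetical child task
        type: io.kestra.plugin.scripts.shell.Commands
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        commands:
          - mkdir -p some_directory downloads
          # Only write the file when it was not restored from the cache
          - test -f some_directory/data.txt || date > some_directory/data.txt
          - cat some_directory/data.txt
```

If the cache behaves as described, repeated runs within the ttl should print the same timestamp, since data.txt is restored from the cache instead of being recreated.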

How it works under the hood

Kestra packages the files that need to be cached and stores them in the internal storage. When the task is executed again, the cached files are retrieved, initializing the working directory with their contents.

Node.js example

Below is an example of a flow that installs the colors package before running a Node.js script. The node_modules folder is cached for one hour.

id: node_cached_dependencies
namespace: company.team
tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    cache:
      patterns:
        - node_modules/**
      ttl: PT1H
    tasks:
      - id: node_script
        type: io.kestra.plugin.scripts.node.Script
        beforeCommands:
          - npm install colors
        script: |
          const colors = require("colors");
          console.log(colors.red("Hello"));

Python example

Below is an example of a flow that installs the pandas package before running a Python script. The venv folder is cached for one day.

id: python_cached_dependencies
namespace: company.team
tasks:
  - id: working_dir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: python_script
        type: io.kestra.plugin.scripts.python.Script
        taskRunner:
          type: io.kestra.plugin.core.runner.Process
        beforeCommands:
          - python -m venv venv
          - source venv/bin/activate
          - pip install pandas
        script: |
          import pandas as pd
          print(pd.__version__)
    cache:
      patterns:
        - venv/**
      ttl: PT24H

How to invalidate the cache

The cache is invalidated as follows:

  • After the first run, the files matching the cache patterns are cached.
  • The next time the task is executed:
    • If the ttl has not elapsed, the files are retrieved from the cache.
    • If the ttl has elapsed, the cache is invalidated and no files are retrieved from it. Because the cache is no longer present, a command such as npm install in the beforeCommands property takes longer to execute.
  • If you edit the task and change the ttl to:
    • a longer duration, e.g., PT5H: the files are cached for five hours using the new ttl duration.
    • a shorter duration, e.g., PT5M: the cache is invalidated after five minutes using the new ttl duration.

The ttl is evaluated at runtime. If more time than the most recently set ttl duration has passed since the last execution of the task, the cache is invalidated and the files are no longer retrieved from the cache.
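
As an illustration of changing the ttl, the fragment below is the cache block from the Node.js example with the duration shortened from PT1H to PT5M. It is a sketch of the edit only, not a complete flow, and the timings in the comments are a hypothetical scenario.

```yaml
# Fragment of the working_dir task from the Node.js example
cache:
  patterns:
    - node_modules/**
  # Previously PT1H. Because the ttl is evaluated at runtime, the new
  # value applies on the next run: if the last task run was, say, ten
  # minutes ago, the cache is treated as expired and npm install runs again.
  ttl: PT5M
```

Shortening the ttl in this way is a simple option for forcing a refresh of cached dependencies on the next run without changing the patterns.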