Tutorial

AWS credentials

Pallas uses boto3 internally, so it reads AWS credentials from the standard locations:

  • Shared credential file (~/.aws/credentials)

  • Environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)

  • Instance metadata service when run on an Amazon EC2 instance

The ~/.aws/credentials file can be generated using the AWS CLI.

aws configure

We recommend to use the AWS CLI to check the configuration. If the AWS CLI is able to authenticate then Pallas should work too.

aws sts get-caller-identity
aws athena list-databases --catalog-name AwsDataCatalog

Initialization

An Athena client can be obtained using the setup() function. All arguments are optional.

import pallas
athena = pallas.setup(
    # AWS region, read from ~/.aws/config if not specified.
    region=None,
    # Athena (AWS Glue) database.
    database=None,
    # Athena workgroup. Will use default workgroup if omitted.
    workgroup=None,
    # Athena output location, will use workgroup default location if omitted.
    output_location="s3://...",
    # Optional query execution cache.
    cache_remote="s3://...",
    # Optional query result cache.
    cache_local="~/Notebooks/.cache/",
    # Whether to return failed queries from cache. Defaults to False.
    cache_failed=False,
    # Normalize white whitespace for better caching. Enabled by default.
    normalize=True,
    # Kill queries on KeybordInterrupt. Enabled by default.
    kill_on_interrupt=True
)

To avoid hardcoded configuration values, the environ_setup() function can initialize Athena from environment variables, corresponding to arguments in the previous example:

export PALLAS_REGION=
export PALLAS_DATABASE=
export PALLAS_WORKGROUP=
export PALLAS_OUTPUT_LOCATION=
export PALLAS_NORMALIZE=true
export PALLAS_KILL_ON_INTERRUPT=true
export PALLAS_CACHE_REMOTE=$PALLAS_OUTPUT_LOCATION
export PALLAS_CACHE_LOCAL=~/Notebooks/.cache/
export PALLAS_CACHE_FAILED=false
athena = pallas.environ_setup()

Pallas uses Python standard logging. You can use configure_logging() instead of logging.basicConfig() to enable logging for Pallas only. At the DEBUG level, Pallas emits logs with query status including an estimated price:

pallas.configure_logging(level="DEBUG")

Executing queries

Use the Athena.execute() method to execute queries:

sql = "SELECT %s id, %s name, %s value"
results = athena.execute(sql, (1, "foo", 3.14))

Pallas also support non-blocking query execution:

query = athena.submit(sql)  # Submit a query and return
query.join()  # Wait for query completion.
results = query.get_results()  # Retrieve results. Joins the query internally.

The result objects provides a list-like interface and can be converted to a Pandas DataFrame:

df = results.to_df()

Caching

AWS Athena stores query results in S3 and does not delete them, so all past results are cached implicitly. To retrieve results of a past query, an ID of the query execution is needed.

Pallas can cache in two modes - remote and local:

  • In the remote mode, Pallas stores IDs of query executions. Using that, it can download previous results from S3 when they are available.

  • In the local mode, it copies query results. Thanks to that, locally cached queries can be executed without an internet connection.

Note

Pallas is designed to promote reproducible analyses and data pipelines:

  • Using the local caching, it is possible to regularly restart Jupyter notebooks without waiting for or paying for additional Athena queries.

  • Thanks to the remote caching, results can be reproduced at a different machine by a different person.

Reproducible queries should be deterministic. For example, if you query data that are ingested regularly, you should always filter on the date column.

Pallas assumes that your queries are deterministic and does not invalidate its cache.

Caching configuration can be passed to setup() or environ_setup(), as shown in the Initialization section.

After the initialization, caching can be customized later using the Athena.cache property:

athena.cache.enabled = True  # Default
athena.cache.read = True  # Can be set to False to write but not read the cache
athena.cache.write = True  # Can be set to False to read but not write the cache
athena.cache.local = "~/Notebooks/.cache/"
athena.cache.remote = "s3://..."
athena.cache.failed = True

Alternatively, the Athena.using() method can override a configuration for selected queries only:

athena.using(cache_enabled=False).execute(...)

Only SELECT queries are cached.