API

This page describes the public API of the Pallas library.

All public functions and classes are imported into the top-level pallas module. Importing from package internals is not recommended and may break in future releases.

Assembly

To construct an Athena client, use the setup() or environ_setup() function.

setup(*, region=None, database=None, workgroup=None, output_location=None, cache_local=None, cache_remote=None, cache_failed=False, normalize=True, kill_on_interrupt=True)[source]

Set up an Athena client.

All configuration options can be given to this method, but many of them can be overridden after the client is constructed.

Parameters
  • region (str | None) – an AWS region. By default, region from AWS config (~/.aws/config) is used.

  • database (str | None) – a name of Athena database. Can be overridden in SQL.

  • workgroup (str | None) – a name of Athena workgroup. Workgroup can set resource limits or override output location. Defaults to the Athena default workgroup.

  • output_location (str | None) – an output location at S3 for query results. Optional if an output location is specified for the workgroup.

  • cache_local (str | None) – a URI of a local cache. Both results and query execution IDs are stored in the local cache.

  • cache_remote (str | None) – a URI of a remote cache. Query execution IDs without results are stored in the remote cache.

  • cache_failed (bool) – whether to return failed queries found in cache.

  • normalize (bool) – whether to normalize queries before execution.

  • kill_on_interrupt (bool) – whether to kill queries on KeyboardInterrupt.

Returns

a new instance of Athena client

Return type

Athena
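As a hedged illustration of the options above, the following sketch builds keyword arguments for setup() and constructs a client. The bucket name, database name, and cache paths are hypothetical, and the helpers client_options() and make_client() are not part of Pallas. The import is deferred so that the option-building part can be tried without AWS access.

```python
import os.path


def client_options(bucket, database):
    """Build keyword arguments for pallas.setup() (hypothetical helper)."""
    return {
        "database": database,
        # Query results are written below this S3 prefix.
        "output_location": f"s3://{bucket}/results/",
        # The local cache stores both results and query execution IDs.
        "cache_local": os.path.expanduser("~/.cache/pallas/"),
        # The remote cache stores query execution IDs only.
        "cache_remote": f"s3://{bucket}/cache/",
    }


def make_client(bucket, database):
    """Construct an Athena client; requires Pallas and AWS credentials."""
    import pallas  # deferred import: only needed when a client is created

    return pallas.setup(**client_options(bucket, database))
```

Options that are left out (region, workgroup, and so on) keep the defaults described above.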

environ_setup(environ=None, *, prefix='PALLAS')[source]

Set up an Athena client from environment variables.

Reads the following environment variables:

export PALLAS_REGION=
export PALLAS_DATABASE=
export PALLAS_WORKGROUP=
export PALLAS_OUTPUT_LOCATION=
export PALLAS_NORMALIZE=true
export PALLAS_KILL_ON_INTERRUPT=true
export PALLAS_CACHE_REMOTE=$PALLAS_OUTPUT_LOCATION
export PALLAS_CACHE_LOCAL=~/Notebooks/.cache/

Configuration from the environment variables can be overridden after the client is constructed.

Parameters
  • environ (Mapping[str, str] | None) – A mapping object representing the string environment. Defaults to os.environ.

  • prefix (str) – A prefix of environment variables

Returns

a new instance of Athena client

Return type

Athena
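Because environ accepts any mapping, a client can presumably be configured from an explicit dictionary instead of the process environment, which may be convenient in tests. The sketch below assumes that variable names are formed as the prefix, an underscore, and the upper-cased option name, as in the PALLAS_* list above. The helpers prefixed_environ() and make_client_from() are hypothetical.

```python
def prefixed_environ(settings, prefix="PALLAS"):
    """Turn {"region": "eu-west-1"} into {"PALLAS_REGION": "eu-west-1"}."""
    return {f"{prefix}_{name.upper()}": value for name, value in settings.items()}


def make_client_from(settings, prefix="PALLAS"):
    """Construct a client from an explicit mapping instead of os.environ."""
    import pallas  # deferred import: requires Pallas and AWS credentials

    environ = prefixed_environ(settings, prefix)
    return pallas.environ_setup(environ, prefix=prefix)
```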

configure_logging(*, level=20, stream=sys.stdout, **kwargs)[source]

Do basic configuration for the logging system.

Calls logging.basicConfig() internally, but:

  • Sets level to the “pallas” logger only

  • Log level defaults to “INFO”

  • stream defaults to sys.stdout instead of sys.stderr

Can be safely called no matter whether logging was already configured:

  • If logging was already configured, this function just sets a level for the “pallas” logger.

  • If logging was not configured yet, it enables logging to stdout.

Parameters
  • level (int | str) – Set the “pallas” logger to the specified level.

  • stream (TextIO) – Use the specified stream to initialize the StreamHandler.

  • kwargs – passed to logging.basicConfig()

Return type

None
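The behaviour described above can be approximated with the standard library alone. The following is a rough sketch of that description, not the Pallas implementation; the name configure_logging_sketch() is hypothetical.

```python
import logging
import sys


def configure_logging_sketch(*, level=logging.INFO, stream=sys.stdout, **kwargs):
    """Approximate the documented behaviour using only the logging module."""
    # basicConfig() does nothing if the root logger already has handlers,
    # so an existing logging configuration is left untouched.
    logging.basicConfig(stream=stream, **kwargs)
    # The level is applied to the "pallas" logger only, not to the root logger.
    logging.getLogger("pallas").setLevel(level)
```

With the real function, a call such as configure_logging(level="DEBUG") would be the equivalent entry point.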

Client

The Athena class is a facade to all functionality offered by the library.

In the most common scenario, you may need only its execute() method. If you need to submit queries in a non-blocking fashion, you can use the submit() method, which returns a Query instance. The same class is also returned by the get_query() method, which can be useful if you want to get back to queries executed in the past.

class Athena(proxy)[source]

Athena client.

Provides methods to execute SQL queries in AWS Athena, with optional caching and other helpers.

Can be used as a blocking or a non-blocking client.

Use setup() or environ_setup() to construct this class without touching Pallas internals.

Parameters

proxy – an internal proxy to execute queries

static quote(value)

Quote a scalar value for an SQL expression.

Parametrized queries should be preferred to explicit quoting.

The following Python types can be quoted as SQL expressions:

  • None – SQL NULL

  • str

  • int, including subclasses of numbers.Integral

  • float, including subclasses of numbers.Real

  • Decimal – SQL DECIMAL

  • datetime.date – SQL DATE

  • datetime.datetime – SQL TIMESTAMP

  • bytes – SQL VARBINARY

Parameters

value (Union[None, str, float, numbers.Real, decimal.Decimal, bytes, datetime.date]) – Python value

Returns

an SQL expression

Return type

str
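One case where explicit quoting may still be needed is a list of values for an IN clause, since query parameters are documented as scalars. The sketch below is a hypothetical helper (in_list() is not part of Pallas) that relies only on the quote() method described above.

```python
def in_list(athena, values):
    """Render values as a parenthesized SQL list, quoting each item."""
    if not values:
        raise ValueError("an empty IN list is not valid SQL")
    return "(" + ", ".join(athena.quote(value) for value in values) + ")"
```

The result can be interpolated into a query, for example f"SELECT * FROM t WHERE id IN {in_list(athena, ids)}", where the table and column names are placeholders.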

database: str | None = None

Name of Athena database to be queried.

Can be overridden in SQL.

workgroup: str | None = None

Name of Athena workgroup.

Workgroup can set resource limits or override output location. When None, defaults to the Athena default workgroup.

output_location: str | None = None

URI of output location on S3.

Optional if an output location is specified for workgroup.

normalize: bool = True

Whether to normalize queries before execution.

kill_on_interrupt: bool = True

Whether to kill queries on KeyboardInterrupt.

property cache

Cache implementation.

It is possible to update properties of the cache attribute to reconfigure caching in place.

Alternatively, the using() method can apply a new configuration without affecting an existing instance.

Return type

AthenaCache

using(*, database=None, workgroup=None, output_location=None, normalize=None, kill_on_interrupt=None, cache_enabled=None, cache_read=None, cache_write=None, cache_failed=None)[source]

Create a new instance with an updated configuration.

This method can be useful if you need to override a configuration for one query, but you do not want to affect future queries.

Parameters
  • database (str | None) – name of Athena database to be queried.

  • workgroup (str | None) – name of Athena workgroup.

  • output_location (str | None) – URI of output location on S3.

  • normalize (bool | None) – whether to normalize queries before execution.

  • kill_on_interrupt (bool | None) – whether to kill queries on KeyboardInterrupt

  • cache_enabled (bool | None) – whether a cache should be used.

  • cache_read (bool | None) – whether a cache should be read.

  • cache_write (bool | None) – whether a cache should be written.

  • cache_failed (bool | None) – whether to return failed queries found in cache.

Returns

an updated copy of this client

Return type

Athena
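For example, a single query can bypass the cache without changing the client that other code shares. This is a minimal sketch that assumes only the using() and execute() methods documented on this page; execute_uncached() is a hypothetical helper.

```python
def execute_uncached(athena, operation, parameters=None):
    """Run one query with caching disabled, leaving `athena` unchanged."""
    # using() returns an updated copy, so the original client keeps its cache.
    fresh = athena.using(cache_enabled=False)
    return fresh.execute(operation, parameters)
```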

execute(operation, parameters=None)[source]

Execute a query and return results.

This is a blocking method that waits until the query finishes.

Cached results or results from an existing query can be returned if caching is configured. Only SELECT queries are cached.

Raises AthenaQueryError if the query fails.

Parameters
  • operation (str) – an SQL query to be executed. Can contain %s or %(key)s placeholders for substitution by parameters.

  • parameters (Union[None, Tuple[SQL_SCALAR, ...], Mapping[str, SQL_SCALAR]]) – parameters to substitute in operation. All substituted parameters are quoted appropriately. See the quote() method for supported parameter types.

Returns

query results

Return type

pallas.results.QueryResults
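A parametrized call might look like the sketch below. The table and the day column are hypothetical, and rows_for_day() is not part of Pallas. Only the values go through parameters; the table name is interpolated directly, so it must come from a trusted source.

```python
def rows_for_day(athena, table, day, limit=10):
    """Fetch rows for one day and return them as plain dictionaries."""
    # %(key)s placeholders are filled from the mapping and quoted by Pallas.
    sql = f"SELECT * FROM {table} WHERE day = %(day)s LIMIT %(limit)s"
    results = athena.execute(sql, {"day": day, "limit": limit})
    # Only the documented __len__ and __getitem__ are used here.
    return [dict(results[index]) for index in range(len(results))]
```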

submit(operation, parameters=None)[source]

Submit a query and return.

This is a non-blocking method that starts a query and returns. Returns a Query instance for monitoring query execution and downloading results later.

An existing query can be returned if caching is configured. Only SELECT queries are cached.

Parameters
  • operation (str) – an SQL query to be executed. Can contain %s or %(key)s placeholders for substitution by parameters.

  • parameters (Union[None, Tuple[SQL_SCALAR, ...], Mapping[str, SQL_SCALAR]]) – parameters to substitute in operation. All substituted parameters are quoted appropriately. See the quote() method for supported parameter types.

Returns

a query instance

Return type

pallas.client.Query
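Since submit() does not block, several queries can be started before any results are downloaded, so that they run in Athena concurrently. A minimal sketch, assuming only submit() and Query.get_results() as documented on this page (execute_all() is a hypothetical helper):

```python
def execute_all(athena, operations):
    """Start all queries first, then collect their results in order."""
    # Every query is submitted before the first blocking call.
    queries = [athena.submit(operation) for operation in operations]
    # get_results() waits for each query in turn and downloads its results.
    return [query.get_results() for query in queries]
```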

get_query(execution_id)[source]

Get a previously submitted query execution.

This method can be used to retrieve a query executed in the past. Because Athena stores results in S3 and does not delete them by default, it is possible to download results until they are manually deleted.

Parameters

execution_id (str) – an Athena query execution ID.

Returns

a query instance

Return type

pallas.client.Query
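One way to come back to a query in a later session is to store its execution ID and pass it to get_query() afterwards. The sketch below keeps the ID in a small JSON file; the file layout and both helper names are assumptions made for illustration.

```python
import json
from pathlib import Path


def submit_and_remember(athena, operation, path):
    """Submit a query and store its execution ID in a JSON file."""
    query = athena.submit(operation)
    Path(path).write_text(json.dumps({"execution_id": query.execution_id}))
    return query


def resume(athena, path):
    """Load a stored execution ID and return the matching Query instance."""
    execution_id = json.loads(Path(path).read_text())["execution_id"]
    return athena.get_query(execution_id)
```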

class Query(execution_id, *, proxy, cache)[source]

Athena query

Provides access to one query execution. It can be used to monitor the status of the query or to retrieve results when the execution finishes.

Instances of this class are returned by Athena.submit() and Athena.get_query() methods. You should not need to create this class directly.

Parameters
  • execution_id – Athena query execution ID.

  • proxy – an internal proxy to execute queries

  • cache – a cache instance

backoff: Iterable[int] = <pallas.utils.Fibonacci object>

Delays in seconds between checks of the query status.

kill_on_interrupt: bool = False

Whether to kill this query on KeyboardInterrupt.

Initially set to Athena.kill_on_interrupt.

property execution_id

Athena query execution ID.

This ID can be used to retrieve this query later using the Athena.get_query() method.

get_info()[source]

Retrieve information about this query execution.

Returns the status of this query along with other information.

Return type

pallas.info.QueryInfo

get_results()[source]

Download results of this query execution.

Cached results can be returned if caching is configured. Only SELECT queries are cached.

Waits until this query execution finishes and downloads results. Raises AthenaQueryError if the query failed.

Return type

pallas.results.QueryResults

kill()[source]

Kill this query execution.

This is a non-blocking operation. It does not wait until the query is killed.

Return type

None

join()[source]

Wait until this query execution finishes.

Raises AthenaQueryError if the query failed.

Return type

None

Query information

Information about query execution is returned as QueryInfo instances. If you call Query.get_info() multiple times, it can return different information as the query execution proceeds.

class QueryInfo(data)[source]

Information about query execution.

Instances are returned by the Query.get_info() method.

Parameters

data – data returned by Athena GetQueryExecution API method.

__str__()[source]

Return summary info about the query execution.

This is included in logs generated by the Athena client.

Return type

str

property execution_id

ID of the query execution.

property sql

SQL query executed.

property output_location

URI of output location on S3 for the query.

property database

Name of database.

property finished

Whether the query execution finished.

property succeeded

Whether the query execution finished successfully.

property state

State of the query execution.

property state_reason

Reason for the state of the query execution.

property scanned_bytes

Data scanned by Athena.

property execution_time

Time spent by Athena.

check()[source]

Raises AthenaQueryError (or its subclass) if the query failed.

Does not raise if the query is still running.

Return type

None
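Query.join() waits without a deadline. Where a time limit is wanted, the finished property, check(), and Query.kill() can be combined into a polling loop. The following is a sketch under that assumption; wait_or_kill() and its parameters are hypothetical and not part of Pallas.

```python
import time


def wait_or_kill(query, timeout, poll_interval=1.0):
    """Poll a query until it finishes; kill it once the timeout has passed.

    Returns the last QueryInfo seen. If the query finished unsuccessfully,
    the exception raised by check() propagates to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        info = query.get_info()
        if info.finished:
            info.check()  # raises if the query failed
            return info
        if time.monotonic() >= deadline:
            query.kill()  # non-blocking: does not wait for the kill to complete
            return info
        time.sleep(poll_interval)
```

The caller can inspect the finished property of the returned object to tell a completed query from one that was killed on timeout.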

Query results

Results of query executions are encapsulated by the QueryResults class.

class QueryResults(column_names, column_types, data)[source]

Collection of Athena query results.

Implements a list-like interface for accessing individual records. Alternatively, the results can be converted to a pandas.DataFrame using the to_df() method.

__getitem__(index)[source]

Return one result or slice of results.

Records are returned as mappings from column names to values.

Parameters

index (int | slice) –

Return type

QueryRecord | Sequence[QueryRecord]

__len__()[source]

Return the number of results.

Return type

int

classmethod load(stream)[source]

Deserialize results from a text stream.

Parameters

stream (TextIO) –

Return type

pallas.results.QueryResults

save(stream)[source]

Serialize results to a text stream.

Parameters

stream (TextIO) –

Return type

None

property column_names

List of column names.

property column_types

List of column types.

to_df(dtypes=None)[source]

Convert these results to a pandas.DataFrame.

Parameters

dtypes (Mapping[str, object] | None) –

Return type

pd.DataFrame
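The list-like interface can be used without pandas. The sketch below builds a small summary from the documented members (column_names, column_types, len(), and slicing); summarize() is a hypothetical helper.

```python
def summarize(results, preview=3):
    """Describe query results using only the list-like interface."""
    return {
        "columns": list(results.column_names),
        "types": list(results.column_types),
        "row_count": len(results),
        # A slice yields a sequence of records; each record is a mapping
        # from column names to values, so dict() makes a plain copy.
        "preview": [dict(record) for record in results[:preview]],
    }
```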

Caching

class AthenaCache[source]

Caches queries and their results.

Athena always stores results in S3, so it is possible to retrieve past results without manually copying data.

This class can hold a reference to two instances of cache storage:

  • local, which caches both query execution IDs and query results

  • remote, which caches query execution IDs only.

It is possible to configure one of the backends, both of them, or none of them.

Queries cached in the local storage can be executed without an internet connection. Queries cached in the remote storage are not executed twice, but results have to be downloaded from AWS.

In theory, it is possible to use a remote backend for the local cache (or vice versa), but we assume that the local cache is actually stored locally.

An instance of this class is returned by the Athena.cache property. It can be updated to reconfigure the caching.

enabled: bool = True

Can be set to False to disable caching completely.

Can be updated to enable or disable the caching.

read: bool = True

Can be set to False to disable reading the cache.

Can be updated to reconfigure the caching.

write: bool = True

Can be set to False to disable writing the cache.

Can be updated to reconfigure the caching.

failed: bool = False

Whether to return failed queries found in cache.

When this is false, failed queries found in cache are ignored.

property local

URI of storage for local cache.

Can be updated to reconfigure the caching.

property remote

URI of storage for remote cache.

Can be updated to reconfigure the caching.
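Because these attributes can be updated in place, a temporary change can be wrapped in a context manager that restores the previous value. This is a sketch assuming only the read attribute and the Athena.cache property described on this page; cache_read_disabled() is a hypothetical helper.

```python
from contextlib import contextmanager


@contextmanager
def cache_read_disabled(athena):
    """Temporarily stop reading from the cache; writes are unaffected."""
    cache = athena.cache
    previous = cache.read
    cache.read = False
    try:
        yield athena
    finally:
        # Restore the earlier setting even if the query raised.
        cache.read = previous
```

Queries executed inside the with block would presumably be run again in Athena and, with writing still enabled, refresh the cached entries.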

load_execution_id(database, sql)[source]

Retrieve cached query execution ID for the given SQL.

Looks into both the local and the remote storage.

Parameters
  • database (str | None) –

  • sql (str) –

Return type

str | None

save_execution_id(database, sql, execution_id)[source]

Store cached query execution ID for the given SQL.

Updates both the local and the remote storage.

Parameters
  • database (str | None) –

  • sql (str) –

  • execution_id (str) –

Return type

None

has_results(execution_id)[source]

Checks whether results are cached for the given execution ID.

Looks into the local storage only.

Parameters

execution_id (str) –

Return type

bool

load_results(execution_id)[source]

Retrieve cached results for the given execution ID.

Looks into the local storage only.

Parameters

execution_id (str) –

Return type

QueryResults | None

save_results(execution_id, results)[source]

Store cached results for the given execution ID.

Updates the local storage only.

Parameters
Return type

None

Exceptions

Pallas raises AthenaQueryError when a query fails. For transport errors (typically connectivity problems or authorization failures), boto3 exceptions propagate unmodified.

class AthenaQueryError(execution_id, state, state_reason)[source]

Athena query failed.

state: str

State of the query execution (FAILED or CANCELLED).

state_reason: str | None

Reason for the state of the query execution.

__str__()[source]

Report query state with its reason.

Return type

str

class DatabaseNotFoundError(execution_id, state, state_reason)[source]

Bases: pallas.exceptions.AthenaQueryError

Athena database does not exist.

Pallas maps string errors returned by Athena to exception classes.

class TableNotFoundError(execution_id, state, state_reason)[source]

Bases: pallas.exceptions.AthenaQueryError

Athena table does not exist.

Pallas maps string errors returned by Athena to exception classes.
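The subclasses make it possible to handle a missing database or table separately from other failures. In the sketch below, the exception classes are passed in by the caller so that the helper itself does not import Pallas; execute_or_none() is a hypothetical name.

```python
def execute_or_none(athena, operation, not_found_errors):
    """Return query results, or None if a not-found error was raised."""
    try:
        return athena.execute(operation)
    except not_found_errors as error:
        # state and state_reason are attributes of AthenaQueryError.
        print(f"Skipping query: {error.state}: {error.state_reason}")
        return None
```

With Pallas installed, not_found_errors could be the tuple (pallas.TableNotFoundError, pallas.DatabaseNotFoundError). Any other AthenaQueryError, as well as boto3 exceptions, would still propagate.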