API
This page describes the public API of the Pallas library.
All public functions and classes are imported into the top-level pallas module.
Importing from the internals of the package is not recommended and can break in future releases.
Assembly
To construct an Athena client, use the setup() or environ_setup() function.
-
setup(*, region=None, database=None, workgroup=None, output_location=None, cache_local=None, cache_remote=None, cache_failed=False, normalize=True, kill_on_interrupt=True)
Set up an Athena client.
All configuration options can be given to this function, but many of them can be overridden after the client is constructed.
- Parameters
region (str | None) – an AWS region. By default, the region from the AWS config (~/.aws/config) is used.
database (str | None) – a name of an Athena database. Can be overridden in SQL.
workgroup (str | None) – a name of an Athena workgroup. A workgroup can set resource limits or override the output location. Defaults to the Athena default workgroup.
output_location (str | None) – an output location on S3 for query results. Optional if an output location is specified for the workgroup.
cache_local (str | None) – a URI of a local cache. Both results and query execution IDs are stored in the local cache.
cache_remote (str | None) – a URI of a remote cache. Query execution IDs without results are stored in the remote cache.
cache_failed (bool) – whether to return failed queries found in the cache.
normalize (bool) – whether to normalize queries before execution.
kill_on_interrupt (bool) – whether to kill queries on KeyboardInterrupt.
- Returns
a new instance of the Athena client
- Return type
Athena
-
environ_setup(environ=None, *, prefix='PALLAS')
Set up an Athena client from environment variables.
Reads the following environment variables:
export PALLAS_REGION=
export PALLAS_DATABASE=
export PALLAS_WORKGROUP=
export PALLAS_OUTPUT_LOCATION=
export PALLAS_NORMALIZE=true
export PALLAS_KILL_ON_INTERRUPT=true
export PALLAS_CACHE_REMOTE=$PALLAS_OUTPUT_LOCATION
export PALLAS_CACHE_LOCAL=~/Notebooks/.cache/
Configuration from the environment variables can be overridden after the client is constructed.
- Parameters
environ (Mapping[str, str] | None) – a mapping object representing the string environment. Defaults to os.environ.
prefix (str) – a prefix of the environment variables.
- Returns
a new instance of the Athena client
- Return type
Athena
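To make the environment-variable convention concrete, the following is a minimal sketch of how prefixed variables could be collected from a mapping. It is not the Pallas implementation: read_prefixed_config is a hypothetical helper, and both the treatment of empty values as unset and the accepted spellings of boolean values are assumptions.

```python
import os


def read_prefixed_config(environ=None, *, prefix="PALLAS"):
    """Collect PREFIX_* variables into a dict (hypothetical helper)."""
    if environ is None:
        environ = os.environ

    def get_str(name):
        # Assumption: an unset or empty variable means "not configured".
        return environ.get(f"{prefix}_{name}") or None

    def get_bool(name, default):
        raw = environ.get(f"{prefix}_{name}", "").strip().lower()
        if not raw:
            return default
        # Assumption: these spellings count as true; Pallas may differ.
        return raw in {"1", "true", "yes", "on"}

    return {
        "region": get_str("REGION"),
        "database": get_str("DATABASE"),
        "workgroup": get_str("WORKGROUP"),
        "output_location": get_str("OUTPUT_LOCATION"),
        "cache_local": get_str("CACHE_LOCAL"),
        "cache_remote": get_str("CACHE_REMOTE"),
        "normalize": get_bool("NORMALIZE", True),
        "kill_on_interrupt": get_bool("KILL_ON_INTERRUPT", True),
    }
```

Passing a plain dict instead of os.environ, as the environ parameter allows, makes this kind of configuration easy to exercise in tests.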
-
configure_logging(*, level=20, stream=sys.stdout, **kwargs)
Do basic configuration for the logging system.
Calls logging.basicConfig() internally, but:
Sets the level for the “pallas” logger only.
The log level defaults to INFO.
stream defaults to sys.stdout instead of sys.stderr.
Can be safely called regardless of whether logging was already configured:
If logging was already configured, this function just sets the level for the “pallas” logger.
If logging was not configured yet, it also enables logging to stdout.
- Parameters
level (int | str) – set the “pallas” logger to the specified level.
stream (TextIO) – use the specified stream to initialize the StreamHandler.
kwargs – passed to logging.basicConfig()
- Return type
None
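The behaviour described above can be approximated with the standard library alone. The sketch below is an illustration under that assumption, not the Pallas source; configure_logging_sketch is a hypothetical name.

```python
import logging
import sys


def configure_logging_sketch(*, level=logging.INFO, stream=sys.stdout, **kwargs):
    """Approximate the documented behaviour using only the stdlib."""
    # basicConfig() is a no-op when the root logger already has handlers,
    # which is what makes repeated calls safe.
    logging.basicConfig(stream=stream, **kwargs)
    # Only the "pallas" logger receives the requested level; the level of
    # the root logger is left untouched.
    logging.getLogger("pallas").setLevel(level)
```

Because the level is set on the “pallas” logger rather than on the root logger, messages from other libraries keep following whatever configuration the application already has.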
Client
The Athena class is a facade to all functionality offered by the library.
In the most common scenario, you may need only its execute() method.
If you need to submit queries in a non-blocking fashion, you can use the submit() method, which returns a Query instance.
The same class is also returned by the get_query() method, which can be useful if you want to get back to queries executed in the past.
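As an illustration of the non-blocking style, the sketch below submits several statements before waiting for any of them. It relies only on submit() and Query.get_results() as documented on this page; run_concurrently is a hypothetical helper, and athena is assumed to be a client built by setup() or environ_setup().

```python
def run_concurrently(athena, statements):
    """Submit all statements first, then collect their results in order."""
    # submit() returns right after starting a query, so every query is
    # started before the first wait.
    queries = [athena.submit(sql) for sql in statements]
    # get_results() blocks until the corresponding query finishes.
    return [query.get_results() for query in queries]
```

Compared with calling execute() in a loop, the total waiting time is then closer to that of the slowest query than to the sum of all queries, provided Athena runs the submitted queries in parallel.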
-
class Athena(proxy)
Athena client.
Provides methods to execute SQL queries in AWS Athena, with optional caching and other helpers.
Can be used as a blocking or a non-blocking client.
Use setup() or environ_setup() to construct this class without touching Pallas internals.
- Parameters
proxy – an internal proxy to execute queries
-
static quote(value)
Quote a scalar value for an SQL expression.
Parametrized queries should be preferred to explicit quoting.
The following Python types can be quoted to SQL expressions:
None – SQL NULL
str
int, including subclasses of numbers.Integral
float, including subclasses of numbers.Real
Decimal – SQL DECIMAL
datetime.date – SQL DATE
datetime.datetime – SQL TIMESTAMP
bytes – SQL VARBINARY
- Parameters
value (Union[None, str, float, numbers.Real, decimal.Decimal, bytes, datetime.date]) – Python value
- Returns
an SQL expression
- Return type
str
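The type mapping listed for quote() can be pictured with a small stand-in function. This is a sketch only: quote_sketch is hypothetical, and the exact literal formats (doubled single quotes, millisecond timestamps, X'..' hexadecimal literals) are assumptions based on common Presto-style SQL rather than output confirmed for Pallas.

```python
import datetime
import numbers
from decimal import Decimal


def quote_sketch(value):
    """Hypothetical stand-in for Athena.quote(); formats are assumptions."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, numbers.Integral):
        return str(int(value))
    if isinstance(value, Decimal):
        return f"DECIMAL '{value}'"
    if isinstance(value, numbers.Real):
        return repr(float(value))
    # datetime.datetime is a subclass of datetime.date, so test it first.
    if isinstance(value, datetime.datetime):
        text = value.isoformat(sep=" ", timespec="milliseconds")
        return f"TIMESTAMP '{text}'"
    if isinstance(value, datetime.date):
        return f"DATE '{value.isoformat()}'"
    if isinstance(value, bytes):
        return f"X'{value.hex()}'"
    raise TypeError(f"Cannot quote {type(value).__name__}")
```

The order of the isinstance checks matters in two places: str must be handled before the numeric checks to keep the branches readable, and datetime.datetime must be handled before datetime.date because of the subclass relationship.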
-
database: str | None = None
Name of the Athena database to be queried.
Can be overridden in SQL.
-
workgroup: str | None = None
Name of the Athena workgroup.
A workgroup can set resource limits or override the output location. When None, defaults to the Athena default workgroup.
-
output_location: str | None = None
URI of the output location on S3.
Optional if an output location is specified for the workgroup.
-
normalize: bool = True
Whether to normalize queries before execution.
-
kill_on_interrupt: bool = True
Whether to kill queries on KeyboardInterrupt.
-
property cache
Cache implementation.
It is possible to update properties of the cache attribute to reconfigure caching in place.
Alternatively, the using() method can apply a new configuration without affecting an existing instance.
- Return type
AthenaCache
-
using(*, database=None, workgroup=None, output_location=None, normalize=None, kill_on_interrupt=None, cache_enabled=None, cache_read=None, cache_write=None, cache_failed=None)
Create a new instance with an updated configuration.
This method can be useful if you need to override a configuration for one query, but you do not want to affect future queries.
- Parameters
database (str | None) – name of the Athena database to be queried.
workgroup (str | None) – name of the Athena workgroup.
output_location (str | None) – URI of the output location on S3.
normalize (bool | None) – whether to normalize queries before execution.
kill_on_interrupt (bool | None) – whether to kill queries on KeyboardInterrupt.
cache_enabled (bool | None) – whether a cache should be used.
cache_read (bool | None) – whether a cache should be read.
cache_write (bool | None) – whether a cache should be written.
cache_failed (bool | None) – whether to return failed queries found in the cache.
- Returns
an updated copy of this client
- Return type
Athena
-
execute(operation, parameters=None)
Execute a query and return results.
This is a blocking method that waits until the query finishes.
Cached results or results from an existing query can be returned if caching is configured. Only SELECT queries are cached.
Raises AthenaQueryError if the query fails.
- Parameters
operation (str) – an SQL query to be executed. Can contain %s or %(key)s placeholders for substitution by parameters.
parameters (Union[None, Tuple[SQL_SCALAR, ...], Mapping[str, SQL_SCALAR]]) – parameters to substitute in operation. All substituted parameters are quoted appropriately. See the quote() method for the supported parameter types.
- Returns
query results
- Return type
QueryResults
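A parametrized call to execute() might look like the sketch below. The helper count_rows_for_day, the events table and its day column are hypothetical; athena is assumed to be a client built by setup() or environ_setup().

```python
def count_rows_for_day(athena, day):
    """Run one parametrized query and return a single value."""
    # A tuple fills %s placeholders positionally; a mapping would be used
    # with %(key)s placeholders instead.
    results = athena.execute(
        "SELECT count(*) AS n FROM events WHERE day = %s",
        (day,),
    )
    # Records are mappings from column names to values.
    return results[0]["n"]
```

Passing day as a parameter rather than formatting it into the SQL string leaves the quoting to the client.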
-
submit(operation, parameters=None)
Submit a query and return.
This is a non-blocking method that starts a query and returns. Returns a Query instance for monitoring query execution and downloading results later.
An existing query can be returned if caching is configured. Only SELECT queries are cached.
- Parameters
operation (str) – an SQL query to be executed. Can contain %s or %(key)s placeholders for substitution by parameters.
parameters (Union[None, Tuple[SQL_SCALAR, ...], Mapping[str, SQL_SCALAR]]) – parameters to substitute in operation. All substituted parameters are quoted appropriately. See the quote() method for the supported parameter types.
- Returns
a query instance
- Return type
Query
-
get_query(execution_id)
Get a previously submitted query execution.
This method can be used to retrieve a query executed in the past. Because Athena stores results in S3 and does not delete them by default, it is possible to download results until they are manually deleted.
- Parameters
execution_id (str) – an Athena query execution ID.
- Returns
a query instance
- Return type
Query
-
class Query(execution_id, *, proxy, cache)
Athena query.
Provides access to one query execution. It can be used to monitor the status of the query or to retrieve results when the execution finishes.
Instances of this class are returned by the Athena.submit() and Athena.get_query() methods. You should not need to create this class directly.
- Parameters
execution_id – Athena query execution ID.
proxy – an internal proxy to execute queries
cache – a cache instance
-
backoff: Iterable[int] = <pallas.utils.Fibonacci object>
Delays in seconds between checks of the query status.
-
kill_on_interrupt: bool = False
Whether to kill this query on KeyboardInterrupt.
Initially set to Athena.kill_on_interrupt.
-
property execution_id
Athena query execution ID.
This ID can be used to retrieve this query later using the Athena.get_query() method.
-
get_info()
Retrieve information about this query execution.
Returns the status of this query together with other information.
- Return type
QueryInfo
-
get_results()
Download results of this query execution.
Cached results can be returned if caching is configured. Only SELECT queries are cached.
Waits until this query execution finishes and downloads results. Raises AthenaQueryError if the query failed.
- Return type
QueryResults
-
kill()
Kill this query execution.
This is a non-blocking operation. It does not wait until the query is killed.
- Return type
None
-
join()
Wait until this query execution finishes.
Raises AthenaQueryError if the query failed.
- Return type
None
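The backoff attribute and the waiting done by join() suggest a polling loop with growing delays. The sketch below shows one way such a loop could work; it is not the Pallas implementation. FibonacciSketch is a hypothetical stand-in for pallas.utils.Fibonacci (its max_value parameter is an assumption), and wait_until_finished is a hypothetical helper that only needs a callable returning an object with a finished attribute, like QueryInfo.

```python
import time


class FibonacciSketch:
    """Yield 1, 1, 2, 3, 5, ... optionally capped at max_value."""

    def __init__(self, max_value=None):
        self.max_value = max_value

    def __iter__(self):
        a, b = 1, 1
        while True:
            yield a if self.max_value is None else min(a, self.max_value)
            a, b = b, a + b


def wait_until_finished(get_info, backoff, sleep=time.sleep):
    """Poll get_info() until the returned object reports finished."""
    for delay in backoff:
        info = get_info()
        if info.finished:
            return info
        # Sleep for the next delay before checking the status again.
        sleep(delay)
    # Only reachable when a finite backoff sequence is supplied.
    raise TimeoutError("backoff exhausted before the query finished")
```

Growing delays keep short queries responsive while reducing the number of status requests made for long-running ones.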
Query information
Information about a query execution is returned as a QueryInfo instance.
If you call Query.get_info() multiple times, it can return different information as the query execution proceeds.
-
class QueryInfo(data)
Information about query execution.
Instances are returned by the Query.get_info() method.
- Parameters
data – data returned by the Athena GetQueryExecution API method.
-
__str__()
Return summary info about the query execution.
This is included in logs generated by the Athena client.
- Return type
str
-
property execution_id
ID of the query execution.
-
property sql
SQL query executed.
-
property output_location
URI of the output location on S3 for the query.
-
property database
Name of the database.
-
property finished
Whether the query execution finished.
-
property succeeded
Whether the query execution finished successfully.
-
property state
State of the query execution.
-
property state_reason
Reason for the state of the query execution.
-
property scanned_bytes
Data scanned by Athena.
-
property execution_time
Time spent by Athena.
-
check()
Raises AthenaQueryError (or its subclass) if the query failed.
Does not raise if the query is still running.
- Return type
None
Query results
Results of query executions are encapsulated by the QueryResults class.
-
class QueryResults(column_names, column_types, data)
Collection of Athena query results.
Implements a list-like interface for accessing individual records. Alternatively, can be converted to a pandas.DataFrame using the to_df() method.
-
__getitem__(index)
Return one result or a slice of results.
Records are returned as mappings from column names to values.
- Parameters
index (int | slice) –
- Return type
QueryRecord | Sequence[QueryRecord]
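The list-like interface with records as mappings can be illustrated with a small stand-in class. ResultsSketch is hypothetical and ignores column types; it only shows how an integer index and a slice could each be served from column names plus row data.

```python
from collections.abc import Sequence


class ResultsSketch(Sequence):
    """Minimal list-like container; not the Pallas QueryResults class."""

    def __init__(self, column_names, data):
        self._column_names = list(column_names)
        self._data = [tuple(row) for row in data]

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice yields a list of records.
            return [dict(zip(self._column_names, row)) for row in self._data[index]]
        # An integer index yields one record mapping column names to values.
        return dict(zip(self._column_names, self._data[index]))
```

Deriving from collections.abc.Sequence provides iteration and membership tests on top of __len__ and __getitem__.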
-
classmethod load(stream)
Deserialize results from a text stream.
- Parameters
stream (TextIO) –
- Return type
QueryResults
-
save(stream)
Serialize results to a text stream.
- Parameters
stream (TextIO) –
- Return type
None
-
property column_names
List of column names.
-
property column_types
List of column types.
Caching
-
class AthenaCache
Caches queries and their results.
Athena always stores results in S3, so it is possible to retrieve past results without manually copying data.
This class can hold references to two instances of cache storage:
local, which caches both query execution IDs and query results
remote, which caches query execution IDs only
It is possible to configure one of the backends, both of them, or neither of them.
Queries cached in the local storage can be executed without an internet connection. Queries cached in the remote storage are not executed twice, but results have to be downloaded from AWS.
In theory, it is possible to use a remote backend for the local cache (or vice versa), but we assume that the local cache is actually stored locally.
An instance of this class is returned by the Athena.cache property. It can be updated to reconfigure the caching.
-
enabled: bool = True
Can be set to False to disable caching completely.
Can be updated to enable or disable the caching.
-
read: bool = True
Can be set to False to disable reading the cache.
Can be updated to reconfigure the caching.
-
write: bool = True
Can be set to False to disable writing the cache.
Can be updated to reconfigure the caching.
-
failed: bool = False
Whether to return failed queries found in the cache.
When this is False, failed queries found in the cache are ignored.
-
property local
URI of storage for the local cache.
Can be updated to reconfigure the caching.
-
property remote
URI of storage for the remote cache.
Can be updated to reconfigure the caching.
-
load_execution_id(database, sql)
Retrieve cached query execution ID for the given SQL.
Looks into both the local and the remote storage.
- Parameters
database (str | None) –
sql (str) –
- Return type
str | None
-
save_execution_id(database, sql, execution_id)
Store cached query execution ID for the given SQL.
Updates both the local and the remote storage.
- Parameters
database (str | None) –
sql (str) –
execution_id (str) –
- Return type
None
-
has_results(execution_id)
Checks whether results are cached for the given execution ID.
Looks into the local storage only.
- Parameters
execution_id (str) –
- Return type
bool
-
load_results(execution_id)
Retrieve cached results for the given execution ID.
Looks into the local storage only.
- Parameters
execution_id (str) –
- Return type
QueryResults | None
-
save_results(execution_id, results)
Store cached results for the given execution ID.
Updates the local storage only.
- Parameters
execution_id (str) –
results (pallas.results.QueryResults) –
- Return type
None
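The split between the two storages can be summarised with a dict-backed sketch. TwoTierCacheSketch is hypothetical and is not how Pallas stores data; in particular, keying entries by the (database, sql) pair and consulting the local storage before the remote one are assumptions.

```python
class TwoTierCacheSketch:
    """Dict-backed illustration of the local/remote split."""

    def __init__(self):
        self.local_ids = {}      # (database, sql) -> execution ID
        self.local_results = {}  # execution ID -> results
        self.remote_ids = {}     # (database, sql) -> execution ID

    def save_execution_id(self, database, sql, execution_id):
        # Execution IDs go to both storages.
        key = (database, sql)
        self.local_ids[key] = execution_id
        self.remote_ids[key] = execution_id

    def load_execution_id(self, database, sql):
        key = (database, sql)
        # Assumption: the local storage is consulted first.
        if key in self.local_ids:
            return self.local_ids[key]
        return self.remote_ids.get(key)

    def has_results(self, execution_id):
        return execution_id in self.local_results

    def save_results(self, execution_id, results):
        # Results are kept in the local storage only.
        self.local_results[execution_id] = results

    def load_results(self, execution_id):
        return self.local_results.get(execution_id)
```

With this layout, a machine that shares only the remote storage can find the execution ID of an earlier query but still has to download the results from S3.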
Exceptions
Pallas can raise AthenaQueryError when a query fails.
For transport errors (typically connectivity problems or authorization failures), boto3 exceptions propagate unmodified.
-
class AthenaQueryError(execution_id, state, state_reason)
Athena query failed.
-
state: str
State of the query execution (FAILED or CANCELLED).
-
state_reason: str | None
Reason for the state of the query execution.
-
class DatabaseNotFoundError(execution_id, state, state_reason)
Bases: pallas.exceptions.AthenaQueryError
Athena database does not exist.
Pallas maps string errors returned by Athena to exception classes.
-
class TableNotFoundError(execution_id, state, state_reason)
Bases: pallas.exceptions.AthenaQueryError
Athena table does not exist.
Pallas maps string errors returned by Athena to exception classes.
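Mapping error strings to exception classes could be done along the lines of the sketch below. All class and function names here are hypothetical stand-ins, and the regular expressions are assumptions about the wording of Athena error messages, not patterns taken from Pallas.

```python
import re


class QueryErrorSketch(Exception):
    """Stand-in for AthenaQueryError."""

    def __init__(self, execution_id, state, state_reason):
        super().__init__(f"Athena query {state.lower()}: {state_reason}")
        self.execution_id = execution_id
        self.state = state
        self.state_reason = state_reason


class DatabaseNotFoundSketch(QueryErrorSketch):
    """Stand-in for DatabaseNotFoundError."""


class TableNotFoundSketch(QueryErrorSketch):
    """Stand-in for TableNotFoundError."""


# Assumed message patterns; the first matching pattern selects the class.
_PATTERNS = [
    (re.compile(r"Schema \S+ does not exist"), DatabaseNotFoundSketch),
    (re.compile(r"Table \S+ does not exist"), TableNotFoundSketch),
]


def make_error_sketch(execution_id, state, state_reason):
    """Pick the most specific exception class for a failed query."""
    for pattern, cls in _PATTERNS:
        if state_reason and pattern.search(state_reason):
            return cls(execution_id, state, state_reason)
    return QueryErrorSketch(execution_id, state, state_reason)
```

Since the specific classes derive from the base error, code that catches the base class keeps working when more specific subclasses are introduced.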