Tutorial
========

AWS credentials
---------------

Pallas uses boto3_ internally, so it reads `AWS credentials`_
from the standard locations:

* Shared credential file (``~/.aws/credentials``)
* Environment variables (``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY``)
* Instance metadata service when run on an Amazon EC2 instance

The ``~/.aws/credentials`` file can be generated using the AWS CLI.

.. code-block:: shell

    aws configure

We recommend using the AWS CLI to check the configuration.
If the AWS CLI is able to authenticate, then Pallas should work too.

.. code-block:: shell

    aws sts get-caller-identity
    aws athena list-databases --catalog-name AwsDataCatalog

.. _AWS credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
.. _boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html


Initialization
--------------

An :class:`.Athena` client can be obtained using the :func:`.setup` function.
All arguments are optional.

.. code-block:: python

    import pallas

    athena = pallas.setup(
        # AWS region, read from ~/.aws/config if not specified.
        region=None,
        # Athena (AWS Glue) database.
        database=None,
        # Athena workgroup. Will use default workgroup if omitted.
        workgroup=None,
        # Athena output location, will use workgroup default location if omitted.
        output_location="s3://...",
        # Optional query execution cache.
        cache_remote="s3://...",
        # Optional query result cache.
        cache_local="~/Notebooks/.cache/",
        # Whether to return failed queries from cache. Defaults to False.
        cache_failed=False,
        # Normalize whitespace for better caching. Enabled by default.
        normalize=True,
        # Kill queries on KeyboardInterrupt. Enabled by default.
        kill_on_interrupt=True,
    )

To avoid hardcoded configuration values,
the :func:`.environ_setup` function can initialize :class:`.Athena`
from environment variables that correspond to the arguments in the previous example:

.. code-block:: shell

    export PALLAS_REGION=
    export PALLAS_DATABASE=
    export PALLAS_WORKGROUP=
    export PALLAS_OUTPUT_LOCATION=
    export PALLAS_NORMALIZE=true
    export PALLAS_KILL_ON_INTERRUPT=true
    export PALLAS_CACHE_REMOTE=$PALLAS_OUTPUT_LOCATION
    export PALLAS_CACHE_LOCAL=~/Notebooks/.cache/
    export PALLAS_CACHE_FAILED=false

.. code-block:: python

    athena = pallas.environ_setup()

Pallas uses Python standard logging.
You can use :func:`.configure_logging` instead of :func:`logging.basicConfig`
to enable logging for Pallas only.
At the DEBUG level, Pallas emits logs with query status, including an estimated price:

.. code-block:: python

    pallas.configure_logging(level="DEBUG")


Executing queries
-----------------

Use the :meth:`.Athena.execute` method to execute queries:

.. code-block:: python

    sql = "SELECT %s id, %s name, %s value"
    results = athena.execute(sql, (1, "foo", 3.14))

Pallas also supports non-blocking query execution:

.. code-block:: python

    query = athena.submit(sql)  # Submit a query and return without waiting.
    query.join()  # Wait for query completion.
    results = query.get_results()  # Retrieve results. Joins the query internally.

The result object provides a list-like interface
and can be converted to a Pandas DataFrame:

.. code-block:: python

    df = results.to_df()


Caching
-------

AWS Athena stores query results in S3 and does not delete them,
so all past results are cached implicitly.
To retrieve the results of a past query, the ID of its query execution is needed.

Pallas can cache in two modes, remote and local:

* In the remote mode, Pallas stores IDs of query executions.
  Using these IDs, it can download previous results from S3 when they are available.
* In the local mode, it copies query results.
  As a result, locally cached queries can be executed without an internet connection.

.. note::

    Pallas is designed to promote reproducible analyses and data pipelines:

    * With local caching, it is possible to regularly restart
      Jupyter notebooks without waiting for or paying for additional Athena queries.
    * With remote caching, results can be reproduced
      on a different machine by a different person.

    Reproducible queries should be deterministic.
    For example, if you query data that are ingested regularly,
    you should always filter on the date column.
    Pallas assumes that your queries are deterministic and does not invalidate its cache.

Caching configuration can be passed to :func:`.setup` or :func:`.environ_setup`,
as shown in the `Initialization`_ section.

After initialization, caching can be customized
using the :attr:`.Athena.cache` property:

.. code-block:: python

    athena.cache.enabled = True  # Default
    athena.cache.read = True  # Can be set to False to write but not read the cache
    athena.cache.write = True  # Can be set to False to read but not write the cache
    athena.cache.local = "~/Notebooks/.cache/"
    athena.cache.remote = "s3://..."
    athena.cache.failed = True

Alternatively, the :meth:`.Athena.using` method can override
the configuration for selected queries only:

.. code-block:: python

    athena.using(cache_enabled=False).execute(...)

Only SELECT queries are cached.
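
To make the remote mode more concrete, the following self-contained sketch
shows one way a cache of query execution IDs could be keyed by
whitespace-normalized SQL while skipping statements other than SELECT.
It is an illustration only, not Pallas's implementation:
the names ``normalize_sql``, ``cache_key``, ``is_cacheable`` and ``ExecutionIdCache``
are hypothetical, the in-memory dictionary stands in for S3,
and the naive normalization would also alter whitespace inside string literals.

.. code-block:: python

    import hashlib
    import re


    def normalize_sql(sql):
        """Collapse runs of whitespace so equivalent queries share one key."""
        return re.sub(r"\s+", " ", sql).strip()


    def cache_key(sql):
        """Hash the normalized SQL text (hypothetical key scheme)."""
        return hashlib.sha256(normalize_sql(sql).encode("utf-8")).hexdigest()


    def is_cacheable(sql):
        """Treat only statements that start with SELECT as cacheable."""
        return re.match(r"SELECT\b", normalize_sql(sql), re.IGNORECASE) is not None


    class ExecutionIdCache:
        """In-memory stand-in for a remote cache of query execution IDs."""

        def __init__(self):
            self._ids = {}

        def get(self, sql):
            """Return the stored execution ID, or None on a miss."""
            if not is_cacheable(sql):
                return None
            return self._ids.get(cache_key(sql))

        def set(self, sql, execution_id):
            """Remember the execution ID for a cacheable query."""
            if is_cacheable(sql):
                self._ids[cache_key(sql)] = execution_id


    cache = ExecutionIdCache()
    cache.set("SELECT  id FROM t", "execution-1")
    cache.set("DROP TABLE t", "execution-2")  # Ignored, not a SELECT.

    hit = cache.get("SELECT id\nFROM t")  # Same query, different whitespace.
    miss = cache.get("DROP TABLE t")

In this sketch, ``hit`` is ``"execution-1"`` because both spellings normalize
to the same text, and ``miss`` is ``None`` because the statement was never stored.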