kedro_tdda tutorial

This tutorial walks through the workflow and behavior of kedro_tdda. It starts from a Kedro example project, here named kedro-iris-tdda, which can be reproduced with

kedro new --name=kedro-iris-tdda --tools=data --example=yes

Once kedro_tdda is installed, the kedro tdda command group is available from the command line

kedro tdda -h
Usage: kedro tdda [OPTIONS] COMMAND [ARGS]...

  Use tdda-specific commands inside kedro project.

Options:
  -h, --help  Show this message and exit.

Commands:
  detect    Detect and write anomalies for data that deviates from the...
  discover  Discover constraints for pandas datasets in the catalog.
  verify    Verify data against constraints specifications Args:...

discover

discover infers constraints for every pandas dataset in the catalog and writes them, one file per dataset, to the conf/<env>/tdda/ folder.

kedro tdda discover
INFO - Loading data from companies (CSVDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/companies.yml
INFO - Loading data from reviews (CSVDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/reviews.yml
INFO - Loading data from shuttles (ExcelDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/shuttles.yml
INFO - Loading data from preprocessed_companies (ParquetDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/preprocessed_companies.yml
INFO - Loading data from preprocessed_shuttles (ParquetDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/preprocessed_shuttles.yml
INFO - Loading data from model_input_table (ParquetDataset)...
INFO - TDDA constraints are written to ./conf/base/tdda/model_input_table.yml
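
The log lines above suggest a one-file-per-dataset layout, conf/<env>/tdda/<dataset>.yml. The sketch below is a small helper, not part of kedro_tdda, that builds that path and lists datasets without a constraints file yet. The function names and the conf_root parameter are hypothetical and only serve the illustration.

```python
from pathlib import Path


def constraints_path(dataset, env="base", conf_root="conf"):
    """Build the location where a dataset's constraints file is expected.

    The layout is inferred from the `kedro tdda discover` log output.
    """
    return Path(conf_root) / env / "tdda" / f"{dataset}.yml"


def missing_constraints(datasets, env="base", conf_root="conf"):
    """Return the dataset names that have no constraints file yet."""
    return [
        name
        for name in datasets
        if not constraints_path(name, env, conf_root).is_file()
    ]


# Example: which of these catalog entries still need `kedro tdda discover`?
print(missing_constraints(["companies", "reviews", "shuttles"]))
```

Run from the project root, this prints the subset of names for which no file exists under conf/base/tdda/.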

The auto-generated constraints file for companies is shown below. For more advanced use cases, check the tdda docs and extend the generated constraints directly in the yaml file.

cat ./conf/base/tdda/companies.yml
companies:
    creation_metadata:
        creator: TDDA 2.2.05
        host: Sinans-MacBook-Pro.local
        local_time: '2024-12-22T11:12:27'
        n_records: 10019
        n_selected: 10019
        user: sinan
        utc_time: '2024-12-22T10:12:27'
    fields:
        company_location:
            max_length: 29
            min_length: 4
            type: string
        company_rating:
            max_length: 4
            min_length: 2
            type: string
        iata_approved:
            allowed_values:
            - f
            - t
            max_length: 1
            min_length: 1
            type: string
        id:
            max: 50098
            max_nulls: 0
            min: 1
            no_duplicates: true
            sign: positive
            type: int
        total_fleet_count:
            max: 1105.0
            min: 1.0
            sign: positive
            type: real
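
To make the meaning of the field constraints concrete, here is a minimal, dependency-free sketch of how a few of them could be evaluated for a single column. It is not the tdda implementation: check_field is a hypothetical helper, it ignores the type and sign constraints, and it assumes that nulls (None) are only counted by max_nulls and skipped by every other check.

```python
def check_field(values, spec):
    """Return {constraint_name: passed} for one column.

    `values` is a list of cell values (None marks a null); `spec` is the
    mapping found under `fields.<column>` in the constraints file.
    """
    present = [v for v in values if v is not None]
    results = {}
    if "min_length" in spec:
        results["min_length"] = all(len(v) >= spec["min_length"] for v in present)
    if "max_length" in spec:
        results["max_length"] = all(len(v) <= spec["max_length"] for v in present)
    if "allowed_values" in spec:
        results["allowed_values"] = set(present) <= set(spec["allowed_values"])
    if "min" in spec:
        results["min"] = all(v >= spec["min"] for v in present)
    if "max" in spec:
        results["max"] = all(v <= spec["max"] for v in present)
    if "max_nulls" in spec:
        results["max_nulls"] = (len(values) - len(present)) <= spec["max_nulls"]
    if spec.get("no_duplicates"):
        results["no_duplicates"] = len(set(present)) == len(present)
    return results


# Constraints copied from the `id` field of companies.yml above
id_spec = {"max": 50098, "max_nulls": 0, "min": 1, "no_duplicates": True}
print(check_field([1, 2, 50098], id_spec))  # every check passes
print(check_field([1, 1, None], id_spec))   # duplicate and null are reported
```

Each key in the returned mapping corresponds to one constraint, which is also the unit that the pass and failure counts of a verification refer to.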

Optional arguments for discover are

kedro tdda discover --help
Usage: kedro tdda discover [OPTIONS]

  Discover constraints for pandas datasets in the catalog.

Options:
  -d, --dataset TEXT  The name of the pandas catalog entry for which
                      constraints be inferred.
  -e, --env TEXT      Kedro configuration environment name. Defaults to
                      `base`.
  -o, --overwrite     Boolean indicator for overwriting an existing tdda
                      constrains yml specification
  -h, --help          Show this message and exit.

verify

verify checks a dataset against its constraints specification and reports the number of passed and failed constraints

kedro tdda verify --dataset companies
INFO - Loading data from companies (CSVDataset)...
INFO - Verification summary `companies`: 20 passses, 0 failures

Optional arguments for verify are

kedro tdda verify --help
Usage: kedro tdda verify [OPTIONS]

  Verify data against constraints specifications

Options:
  -d, --dataset TEXT  The name of the pandas catalog entry for which
                      verification will be executed.
  -e, --env TEXT      The kedro environment where the dataset to retrieve is
                      available. Default to 'base'
  -h, --help          Show this message and exit.

TddaHooks

TddaHooks validates a pandas dataset against its constraints whenever it is loaded, for example during a kedro pipeline run. To see this in action, let’s tighten a constraint for companies so that the existing data violates it, and then run the pipeline. The run stops with a TddaVerificationError.

import yaml

filepath = './conf/base/tdda/companies.yml'
with open(filepath, 'r') as f:
    constr = yaml.safe_load(f)

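# Tighten the constraint so that the existing data violates it
constr['companies']['fields']['company_location']['max_length'] = 20  # discovered value: 29
with open(filepath, 'w') as f:
    yaml.safe_dump(constr, f)

Running the pipeline now fails as soon as companies is loaded

kedro run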
INFO - Kedro project kedro-iris-tdda
INFO - Using synchronous mode for loading and saving data. Use the --async flag for potential performance gains. https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html#load-and-save-asynchronously
INFO - Loading data from companies (CSVDataset)...
WARNING - No nodes ran. Repeat the previous command to attempt a new run.
Traceback (most recent call last):
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/bin/kedro", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/framework/cli/cli.py", line 263, in main
    cli_collection()
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/framework/cli/cli.py", line 163, in main
    super().main(
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/framework/cli/project.py", line 228, in run
    return session.run(
           ^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/framework/session/session.py", line 399, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/runner/sequential_runner.py", line 85, in _run
    ).execute()
      ^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/runner/task.py", line 88, in execute
    node = self._run_node_sequential(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro/runner/task.py", line 153, in _run_node_sequential
    hook_manager.hook.after_dataset_loaded(
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 480, in traced_hookexec
    return outcome.get_result()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_result.py", line 62, in from_call
    result = func()
             ^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_manager.py", line 477, in <lambda>
    lambda: oldcall(hook_name, hook_impls, caller_kwargs, firstresult)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro_tdda/hooks.py", line 40, in after_dataset_loaded
    formatted_log_verification(entry=dataset_name, verification=verification)
  File "/Users/sinan/personal/kedro_plugin/kedro_tdda/.venv/lib/python3.11/site-packages/kedro_tdda/utils.py", line 155, in formatted_log_verification
    raise TddaVerificationError(msg)
kedro_tdda.utils.TddaVerificationError: Dataset `companies` deviates from constraint specification:
✗ company_location: max_length
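
The last frames of the traceback show that the hook turns failed checks into an exception. The sketch below imitates that step in isolation. It is not the plugin's code: VerificationError is a stand-in for TddaVerificationError, and both raise_on_failures and the shape of its results argument are assumptions made for illustration. Only the message format is taken from the output above.

```python
class VerificationError(Exception):
    """Hypothetical stand-in for kedro_tdda's TddaVerificationError."""


def raise_on_failures(dataset_name, results):
    """Raise if any constraint check failed.

    `results` maps field name -> {constraint name: passed}; this shape is
    an assumption chosen for the sketch.
    """
    failed = [
        f"✗ {field}: {constraint}"
        for field, checks in results.items()
        for constraint, passed in checks.items()
        if not passed
    ]
    if failed:
        header = f"Dataset `{dataset_name}` deviates from constraint specification:"
        raise VerificationError("\n".join([header, *failed]))


try:
    raise_on_failures(
        "companies",
        {"company_location": {"min_length": True, "max_length": False}},
    )
except VerificationError as err:
    print(err)
```

Raising rather than logging is what makes the pipeline stop before any node consumes data that violates its constraints.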

detect

detect writes CSV files containing the observations that violate one or more field constraints.

kedro tdda detect
INFO - Loading data from companies (CSVDataset)...
INFO - Detection for companies written to ./tdda_detect/companies.csv
WARNING - Dataset `companies` deviates from constraint specification:
✗ company_location: max_length
INFO - Loading data from reviews (CSVDataset)...
INFO - Verification summary `reviews`: 43 passses, 0 failures
INFO - Loading data from shuttles (ExcelDataset)...
INFO - Verification summary `shuttles`: 61 passses, 0 failures
INFO - Loading data from preprocessed_companies (ParquetDataset)...
INFO - Verification summary `preprocessed_companies`: 22 passses, 0 failures
INFO - Loading data from preprocessed_shuttles (ParquetDataset)...
INFO - Verification summary `preprocessed_shuttles`: 62 passses, 0 failures
INFO - Loading data from model_input_table (ParquetDataset)...
INFO - Verification summary `model_input_table`: 133 passses, 0 failures

Optional arguments for detect are

kedro tdda detect --help
Usage: kedro tdda detect [OPTIONS]

  Detect and write anomalies for data that deviates from the constraints

Options:
  -d, --dataset TEXT     The name of the pandas catalog entry for which
                         detection will be written.
  -e, --env TEXT         The kedro environment where the dataset to retrieve
                         is available. Default to 'base'
  -t, --target-dir TEXT  Target directory in which the tdda detection csv can
                         be saved. Should always be a directory
  -h, --help             Show this message and exit.

As an example, we continue with the companies dataset and join the raw data with the detected anomalies.

import pandas as pd

# read raw companies and anomalies
companies = pd.read_csv('data/01_raw/companies.csv').reset_index()[['index', 'company_location']]
companies_detected = pd.read_csv('./tdda_detect/companies.csv')

companies.merge(companies_detected, how='outer', left_on='index', right_on='Index').head(15)
    index        company_location  Index  company_location_max_length_ok  n_failures
 0      0             Isle of Man    NaN                             NaN         NaN
 1      1                     NaN    NaN                             NaN         NaN
 2      2             Isle of Man    NaN                             NaN         NaN
 3      3  Bosnia and Herzegovina    3.0                           False         1.0
 4      4                   Chile    NaN                             NaN         NaN
 5      5                Kiribati    NaN                             NaN         NaN
 6      6                 Bahrain    NaN                             NaN         NaN
 7      7               Nicaragua    NaN                             NaN         NaN
 8      8            Turkmenistan    NaN                             NaN         NaN
 9      9                  Rwanda    NaN                             NaN         NaN
10     10                     NaN    NaN                             NaN         NaN
11     11                    Niue    NaN                             NaN         NaN
12     12                     NaN    NaN                             NaN         NaN
13     13   Sao Tome and Principe   13.0                           False         1.0
14     14                  Denmark    NaN                             NaN         NaN
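
The flagged rows can be cross-checked by hand. The following sketch applies the modified max_length of 20 to the fifteen company_location values shown above, using plain Python. detect_max_length is a hypothetical helper, and skipping nulls is an assumption that matches the table, where rows with a missing location are not flagged.

```python
def detect_max_length(values, max_length):
    """Return the positions of strings longer than `max_length`.

    Nulls (None) are skipped, so a missing value never counts as a failure.
    """
    return [
        i
        for i, value in enumerate(values)
        if value is not None and len(value) > max_length
    ]


# First 15 company_location values from the merged table above
company_location = [
    "Isle of Man", None, "Isle of Man", "Bosnia and Herzegovina", "Chile",
    "Kiribati", "Bahrain", "Nicaragua", "Turkmenistan", "Rwanda",
    None, "Niue", None, "Sao Tome and Principe", "Denmark",
]

print(detect_max_length(company_location, 20))  # [3, 13]
```

Rows 3 and 13 hold strings of 22 and 21 characters, which is why they are the only ones with company_location_max_length_ok set to False.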