abraxos package

Submodules

abraxos.extract module

CSV reading utilities with bad line recovery.

class abraxos.extract.ReadCsvResult(bad_lines: list[list[str]], dataframe: pd.DataFrame)[source]

Bases: NamedTuple

A named tuple representing the result of reading a CSV file.

bad_lines

List of lines that could not be parsed correctly.

Type:

list of list of str

dataframe

Parsed portion of the CSV file.

Type:

pandas.DataFrame

bad_lines: list[list[str]]

Alias for field number 0

count(value, /)

Return number of occurrences of value.

dataframe: pandas.core.frame.DataFrame

Alias for field number 1

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

abraxos.extract.read_csv(path, *, chunksize=None, **kwargs)[source]

Reads a CSV file and optionally processes it in chunks, capturing malformed lines.

Parameters:
  • path (str) – Path to the CSV file.

  • chunksize (int, optional) – Number of rows per chunk. If specified, the file is read in chunks. If None (default), the entire file is read at once.

  • **kwargs (dict) – Additional arguments passed to pandas.read_csv.

Returns:

If chunksize is None, returns a single ReadCsvResult. Otherwise, returns a generator yielding ReadCsvResult for each chunk.

Return type:

ReadCsvResult or Generator of ReadCsvResult

Examples

>>> from abraxos.extract import read_csv
>>> result = read_csv('data.csv')
>>> print(result.bad_lines)
>>> print(result.dataframe)
>>> for result in read_csv('data.csv', chunksize=50):
...     print(result.bad_lines)
...     print(result.dataframe)

abraxos.extract.read_csv_chunks(path, chunksize, **kwargs)[source]

Reads a CSV file in chunks and captures malformed lines.

Parameters:
  • path (str) – Path to the CSV file.

  • chunksize (int) – Number of rows per chunk.

  • **kwargs (dict) – Additional arguments passed to pandas.read_csv.

Yields:

ReadCsvResult – A named tuple containing bad lines and the parsed DataFrame for the chunk.

Return type:

collections.abc.Generator[abraxos.extract.ReadCsvResult, None, None]

Examples

>>> from abraxos.extract import read_csv_chunks
>>> for result in read_csv_chunks('data.csv', chunksize=100):
...     print(result.bad_lines)
...     print(result.dataframe)
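To make the idea of per-chunk bad-line capture concrete, here is a minimal standard-library sketch. It is not the abraxos implementation: the function name read_rows_in_chunks is hypothetical, and it assumes a line counts as malformed whenever its field count differs from the header's, which may be stricter than what pandas does.

```python
import csv
import io
from itertools import islice


def read_rows_in_chunks(handle, chunksize):
    """Yield (bad_lines, good_rows) for each block of `chunksize` data lines.

    Hypothetical helper for illustration only; abraxos delegates parsing
    to pandas.read_csv instead.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    while True:
        block = list(islice(reader, chunksize))
        if not block:
            break
        # Assumption: a line is malformed when its field count differs
        # from the header's field count.
        bad_lines = [row for row in block if len(row) != len(header)]
        good_rows = [dict(zip(header, row)) for row in block if len(row) == len(header)]
        yield bad_lines, good_rows


text = "id,name\n1,Alice\n2,Bob,extra\n3,Carol\n4\n"
chunks = list(read_rows_in_chunks(io.StringIO(text), chunksize=2))
for bad_lines, good_rows in chunks:
    print(bad_lines, good_rows)
```

Each yielded pair plays the role of one ReadCsvResult: the malformed lines are kept as lists of strings rather than dropped, so they can be logged or repaired later.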

abraxos.load module

SQL loading utilities with error handling and retry logic.

class abraxos.load.SqlConnection(*args, **kwargs)[source]

Bases: Protocol

Protocol for a database connection that supports executing insert statements.

execute(insert, records)[source]

Execute an insert statement with given records.

Return type:

None

class abraxos.load.SqlEngine(*args, **kwargs)[source]

Bases: Protocol

Protocol for a database engine object that can provide connections.

connect()[source]

Obtain a SQL connection from the engine.

Return type:

abraxos.load.SqlConnection

class abraxos.load.SqlInsert(*args, **kwargs)[source]

Bases: Protocol

Protocol for a SQL insert statement object (e.g., sqlalchemy.Insert).

class abraxos.load.ToSqlResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)[source]

Bases: NamedTuple

Result of inserting a DataFrame into a database.

errors

Exceptions encountered during insertion.

Type:

list of Exception

errored_df

Rows that failed to be inserted.

Type:

pandas.DataFrame

success_df

Rows that were successfully inserted.

Type:

pandas.DataFrame

count(value, /)

Return number of occurrences of value.

errored_df: pandas.core.frame.DataFrame

Alias for field number 1

errors: list[Exception]

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

success_df: pandas.core.frame.DataFrame

Alias for field number 2

abraxos.load.to_sql(df, name, con, *, if_exists='append', index=False, chunks=2, **kwargs)[source]

Writes a DataFrame to a SQL database table with error handling.

Parameters:
  • df (pd.DataFrame) – The DataFrame to insert.

  • name (str) – Name of the target table.

  • con (SqlConnection or SqlEngine) – SQLAlchemy-like connection or engine object.

  • if_exists ({'fail', 'replace', 'append'}, optional) – SQL behavior if the table already exists (default is ‘append’).

  • index (bool, optional) – Whether to include the DataFrame index in the output (default is False).

  • chunks (int, optional) – Number of chunks to split the DataFrame into, recursively, when an insert fails (default is 2).

  • **kwargs (typing.Any) – Additional arguments passed to pandas.DataFrame.to_sql.

Returns:

A named tuple with the list of errors, a DataFrame of the rows that failed, and a DataFrame of the rows that were inserted successfully.

Return type:

ToSqlResult

abraxos.load.use_sql(df, connection, sql_query, chunks=2)[source]

Executes a user-provided SQL insert statement using insert_df, with error handling.

Parameters:
  • df (pd.DataFrame) – The DataFrame to insert.

  • connection (SqlConnection) – SQL connection object.

  • sql_query (SqlInsert) – SQL insert statement object.

  • chunks (int, optional) – Number of chunks to split on failure (default is 2).

Returns:

A result indicating which rows succeeded and which failed.

Return type:

ToSqlResult
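The split-on-failure behaviour controlled by chunks can be sketched with the standard-library sqlite3 module. This is an illustration under stated assumptions, not the abraxos code: insert_isolating_errors is a hypothetical name, rows are plain tuples rather than a DataFrame, and each batch is wrapped in its own transaction so that a failed batch is rolled back before its parts are retried.

```python
import sqlite3


def insert_isolating_errors(conn, sql, records, chunks=2):
    """Return (errors, failed_records, inserted_records).

    Hypothetical helper: tries the whole batch first, then splits it into
    `chunks` parts on failure until single failing rows are isolated.
    """
    if chunks < 2:
        raise ValueError("chunks must be at least 2")
    if not records:
        return [], [], []
    try:
        with conn:  # commits on success, rolls back the batch on error
            conn.executemany(sql, records)
        return [], [], list(records)
    except sqlite3.Error as exc:
        if len(records) == 1:
            return [exc], list(records), []
    errors, failed, inserted = [], [], []
    size = -(-len(records) // chunks)  # ceiling division
    for start in range(0, len(records), size):
        part = records[start:start + size]
        e, f, s = insert_isolating_errors(conn, sql, part, chunks)
        errors.extend(e)
        failed.extend(f)
        inserted.extend(s)
    return errors, failed, inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
rows = [(1, "Alice"), (2, None), (3, "Carol"), (4, "Dave")]
errors, failed, inserted = insert_isolating_errors(
    conn, "INSERT INTO people VALUES (?, ?)", rows
)
row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(len(errors), failed, inserted, row_count)
```

Only the row that violates the NOT NULL constraint ends up in the failed list; the other three are committed.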

abraxos.transform module

DataFrame transformation with error isolation.

class abraxos.transform.TransformResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)[source]

Bases: NamedTuple

Result of applying a transformation to a DataFrame.

errors

Exceptions raised during transformation.

Type:

list of Exception

errored_df

Rows that failed to transform.

Type:

pandas.DataFrame

success_df

Successfully transformed rows.

Type:

pandas.DataFrame

count(value, /)

Return number of occurrences of value.

errored_df: pandas.core.frame.DataFrame

Alias for field number 1

errors: list[Exception]

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

success_df: pandas.core.frame.DataFrame

Alias for field number 2

abraxos.transform.transform(df, transformer, chunks=2)[source]

Applies a transformation function to a DataFrame with error isolation.

If the transformation raises an exception on a chunk, the DataFrame is split into smaller chunks recursively to isolate errors. Ultimately, rows that fail even as single-row DataFrames are collected separately.

Parameters:
  • df (pd.DataFrame) – The input DataFrame to transform.

  • transformer (Callable[[pd.DataFrame], pd.DataFrame]) – A function that transforms a DataFrame and returns a new DataFrame.

  • chunks (int, optional) – Number of subchunks to divide the DataFrame into if transformation fails (default is 2).

Returns:

A named tuple with:

  • errors – A list of exceptions that occurred during transformation.

  • errored_df – A DataFrame of rows that could not be transformed.

  • success_df – A DataFrame of successfully transformed rows.

Return type:

TransformResult

Examples

>>> import pandas as pd
>>> from abraxos.transform import transform
>>> def double_values(df): return df.assign(value=df['value'] * 2)
>>> df = pd.DataFrame({'value': [1, 2, 3]})
>>> result = transform(df, double_values)
>>> result.success_df
   value
0      2
1      4
2      6
>>> result.errored_df.empty
True
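The example above only exercises the success path. The recursive isolation described for this function can be sketched on plain lists of dicts as follows. This is an assumption-laden illustration, not the library's code: transform_isolating_errors and reciprocal are hypothetical names, and the chunk sizing uses simple ceiling division.

```python
def transform_isolating_errors(rows, transformer, chunks=2):
    """Return (errors, errored_rows, success_rows) for a list of rows.

    Hypothetical helper: applies `transformer` to the whole list, and on
    failure retries on `chunks` smaller parts until single rows are isolated.
    """
    if chunks < 2:
        raise ValueError("chunks must be at least 2")
    if not rows:
        return [], [], []
    try:
        return [], [], transformer(rows)
    except Exception as exc:
        if len(rows) == 1:
            # A single row that still fails is reported, not retried.
            return [exc], list(rows), []
    errors, errored, success = [], [], []
    size = -(-len(rows) // chunks)  # ceiling division
    for start in range(0, len(rows), size):
        part = rows[start:start + size]
        e, f, s = transform_isolating_errors(part, transformer, chunks)
        errors.extend(e)
        errored.extend(f)
        success.extend(s)
    return errors, errored, success


def reciprocal(rows):
    # Raises ZeroDivisionError if any row has value == 0.
    return [{"value": 1 / row["value"]} for row in rows]


rows = [{"value": v} for v in (1, 0, 4, 5)]
errors, errored, success = transform_isolating_errors(rows, reciprocal)
print(errors, errored, success)
```

The single bad row is isolated after two levels of splitting, and the three remaining rows are transformed normally.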

abraxos.utils module

Utility functions for DataFrame operations.

abraxos.utils.clear(df)[source]

Returns an empty DataFrame with the same schema (columns and dtypes) as the input.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

An empty DataFrame with the same structure as df.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from abraxos.utils import clear
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> clear(df)
Empty DataFrame
Columns: [x]
Index: []

abraxos.utils.split(df, i=2)[source]

Splits a DataFrame into i approximately equal parts.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be split.

  • i (int, optional) – The number of parts to split the DataFrame into (default is 2).

Returns:

A tuple containing i DataFrames, each being a partition of the original DataFrame.

Return type:

tuple of pd.DataFrame
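One way to read "approximately equal" is that the first few parts receive one extra row each. The sketch below uses plain lists and hypothetical names (split_sizes, split_rows); it reproduces the 4, 3, 3 partition shown in this function's example, but whether abraxos computes sizes this way is an assumption.

```python
def split_sizes(n_rows, i=2):
    """Sizes of i near-equal parts; the first n_rows % i parts get one extra row."""
    base, extra = divmod(n_rows, i)
    return [base + 1 if k < extra else base for k in range(i)]


def split_rows(rows, i=2):
    """Partition a list into i contiguous parts using split_sizes."""
    parts, start = [], 0
    for size in split_sizes(len(rows), i):
        parts.append(rows[start:start + size])
        start += size
    return tuple(parts)


parts = split_rows(list(range(10)), 3)
print(parts)
```

The parts are contiguous and together cover every row exactly once, so concatenating them restores the original order.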

Examples

>>> import pandas as pd
>>> import abraxos
>>> df = pd.DataFrame({'A': range(10)})
>>> abraxos.split(df, 3)
(   A
0  0
1  1
2  2
3  3,
   A
4  4
5  5
6  6,
   A
7  7
8  8
9  9)

abraxos.utils.to_records(df)[source]

Converts a DataFrame to a list of record dictionaries, replacing NaN with None.

This is useful for inserting into databases that expect None for nulls.

Parameters:

df (pd.DataFrame) – The DataFrame to convert.

Returns:

A list of records (dicts), where each dict is a row in the DataFrame.

Return type:

list of dict

Examples

>>> import pandas as pd
>>> from abraxos.utils import to_records
>>> df = pd.DataFrame({'a': [1, None], 'b': ['x', 'y']})
>>> to_records(df)
[{'a': 1.0, 'b': 'x'}, {'a': None, 'b': 'y'}]
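The NaN-to-None replacement can be illustrated without pandas. The helper below, nan_to_none, is a hypothetical name and works on a list of dicts rather than a DataFrame; it is a sketch of the idea, not the library's implementation.

```python
import math


def nan_to_none(records):
    """Return a copy of records with float NaN values replaced by None."""

    def clean(value):
        # NaN never compares equal to itself, so test it explicitly.
        if isinstance(value, float) and math.isnan(value):
            return None
        return value

    return [{key: clean(value) for key, value in record.items()} for record in records]


records = nan_to_none([{"a": 1.0, "b": "x"}, {"a": float("nan"), "b": "y"}])
print(records)
```

Most database drivers bind None as SQL NULL, whereas a float NaN is either stored as a number or rejected, which is why the conversion matters before an insert.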

abraxos.validate module

Pydantic model validation for DataFrame rows.

class abraxos.validate.PydanticModel(*args, **kwargs)[source]

Bases: Protocol

Protocol representing a Pydantic-like model for validation and serialization.

model_dump()[source]

Serializes the model into a dictionary.

Return type:

dict

model_validate(record)[source]

Validates a dictionary record and returns a validated model instance.

Return type:

abraxos.validate.PydanticModel

class abraxos.validate.ValidateResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)[source]

Bases: NamedTuple

Result of validating a DataFrame using a Pydantic-like model.

errors

List of exceptions encountered during validation.

Type:

list of Exception

errored_df

DataFrame of rows that failed validation.

Type:

pd.DataFrame

success_df

DataFrame of successfully validated and serialized rows.

Type:

pd.DataFrame

count(value, /)

Return number of occurrences of value.

errored_df: pandas.core.frame.DataFrame

Alias for field number 1

errors: list[Exception]

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

success_df: pandas.core.frame.DataFrame

Alias for field number 2

abraxos.validate.validate(df, model)[source]

Validates each row in a DataFrame using a Pydantic-like model.

Each record is passed to the model’s model_validate method. Successfully validated models are converted back into rows using model_dump.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing records to be validated.

  • model (type[PydanticModel] or PydanticModel) – A Pydantic-style model class or instance with model_validate and model_dump methods.

Returns:

A named tuple with:

  • errors – List of exceptions raised during validation.

  • errored_df – DataFrame of rows that failed validation.

  • success_df – DataFrame of rows that were successfully validated.

Return type:

ValidateResult

Examples

>>> import pandas as pd
>>> from abraxos.validate import validate
>>> from pydantic import BaseModel
>>> class Person(BaseModel):
...     name: str
...     age: int
>>> df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 'invalid']})
>>> result = validate(df, Person)
>>> len(result.success_df)
1
>>> len(result.errored_df)
1
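The PydanticModel protocol only requires model_validate and model_dump, so any class offering both can stand in for a Pydantic model. The sketch below shows the row-by-row flow with a hand-written class; Person and validate_rows are hypothetical and work on plain dicts, so this illustrates the pattern rather than the abraxos implementation.

```python
class Person:
    """Minimal stand-in for a Pydantic model (illustrative only)."""

    def __init__(self, name, age):
        self.name = name
        self.age = age

    @classmethod
    def model_validate(cls, record):
        name, age = record["name"], record["age"]
        if not isinstance(name, str):
            raise ValueError("name must be a string")
        if not isinstance(age, int) or isinstance(age, bool):
            raise ValueError("age must be an integer")
        return cls(name, age)

    def model_dump(self):
        return {"name": self.name, "age": self.age}


def validate_rows(records, model):
    """Return (errors, errored_records, success_records)."""
    errors, errored, success = [], [], []
    for record in records:
        try:
            # Validate, then serialize back to a plain dict.
            success.append(model.model_validate(record).model_dump())
        except Exception as exc:
            errors.append(exc)
            errored.append(record)
    return errors, errored, success


records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": "invalid"}]
errors, errored, success = validate_rows(records, Person)
print(len(errors), errored, success)
```

Because each row is validated independently, one bad record never prevents the others from reaching the success set.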

Module contents

Abraxos: A lightweight Python toolkit for robust data processing with Pandas and Pydantic.

exception abraxos.AbraxosError[source]

Bases: Exception

Base exception for all abraxos errors.

add_note(object, /)

Exception.add_note(note) – add a note to the exception

args

with_traceback(object, /)

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception abraxos.LoadError[source]

Bases: AbraxosError

Exception raised when loading data to SQL fails.

add_note(object, /)

Exception.add_note(note) – add a note to the exception

args

with_traceback(object, /)

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class abraxos.ReadCsvResult(bad_lines: list[list[str]], dataframe: pd.DataFrame)[source]

Bases: NamedTuple

A named tuple representing the result of reading a CSV file.

bad_lines

List of lines that could not be parsed correctly.

Type:

list of list of str

dataframe

Parsed portion of the CSV file.

Type:

pandas.DataFrame

bad_lines: list[list[str]]

Alias for field number 0

count(value, /)

Return number of occurrences of value.

dataframe: pandas.core.frame.DataFrame

Alias for field number 1

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

class abraxos.ToSqlResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)[source]

Bases: NamedTuple

Result of inserting a DataFrame into a database.

errors

Exceptions encountered during insertion.

Type:

list of Exception

errored_df

Rows that failed to be inserted.

Type:

pandas.DataFrame

success_df

Rows that were successfully inserted.

Type:

pandas.DataFrame

count(value, /)

Return number of occurrences of value.

errored_df: pandas.core.frame.DataFrame

Alias for field number 1

errors: list[Exception]

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

success_df: pandas.core.frame.DataFrame

Alias for field number 2

exception abraxos.TransformError[source]

Bases: AbraxosError

Exception raised when DataFrame transformation fails.

add_note(object, /)

Exception.add_note(note) – add a note to the exception

args

with_traceback(object, /)

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class abraxos.TransformResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)[source]

Bases: NamedTuple

Result of applying a transformation to a DataFrame.

errors

Exceptions raised during transformation.

Type:

list of Exception

errored_df

Rows that failed to transform.

Type:

pandas.DataFrame

success_df

Successfully transformed rows.

Type:

pandas.DataFrame

count(value, /)

Return number of occurrences of value.

errored_df: pandas.core.frame.DataFrame

Alias for field number 1

errors: list[Exception]

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

success_df: pandas.core.frame.DataFrame

Alias for field number 2

class abraxos.ValidateResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)[source]

Bases: NamedTuple

Result of validating a DataFrame using a Pydantic-like model.

errors

List of exceptions encountered during validation.

Type:

list of Exception

errored_df

DataFrame of rows that failed validation.

Type:

pd.DataFrame

success_df

DataFrame of successfully validated and serialized rows.

Type:

pd.DataFrame

count(value, /)

Return number of occurrences of value.

errored_df: pandas.core.frame.DataFrame

Alias for field number 1

errors: list[Exception]

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

success_df: pandas.core.frame.DataFrame

Alias for field number 2

exception abraxos.ValidationError[source]

Bases: AbraxosError

Exception raised when row validation fails.

add_note(object, /)

Exception.add_note(note) – add a note to the exception

args

with_traceback(object, /)

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

abraxos.clear(df)[source]

Returns an empty DataFrame with the same schema (columns and dtypes) as the input.

Parameters:

df (pd.DataFrame) – The input DataFrame.

Returns:

An empty DataFrame with the same structure as df.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from abraxos import clear
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> clear(df)
Empty DataFrame
Columns: [x]
Index: []

abraxos.read_csv(path, *, chunksize=None, **kwargs)[source]

Reads a CSV file and optionally processes it in chunks, capturing malformed lines.

Parameters:
  • path (str) – Path to the CSV file.

  • chunksize (int, optional) – Number of rows per chunk. If specified, the file is read in chunks. If None (default), the entire file is read at once.

  • **kwargs (dict) – Additional arguments passed to pandas.read_csv.

Returns:

If chunksize is None, returns a single ReadCsvResult. Otherwise, returns a generator yielding ReadCsvResult for each chunk.

Return type:

ReadCsvResult or Generator of ReadCsvResult

Examples

>>> from abraxos import read_csv
>>> result = read_csv('data.csv')
>>> print(result.bad_lines)
>>> print(result.dataframe)
>>> for result in read_csv('data.csv', chunksize=50):
...     print(result.bad_lines)
...     print(result.dataframe)

abraxos.read_csv_chunks(path, chunksize, **kwargs)[source]

Reads a CSV file in chunks and captures malformed lines.

Parameters:
  • path (str) – Path to the CSV file.

  • chunksize (int) – Number of rows per chunk.

  • **kwargs (dict) – Additional arguments passed to pandas.read_csv.

Yields:

ReadCsvResult – A named tuple containing bad lines and the parsed DataFrame for the chunk.

Return type:

collections.abc.Generator[abraxos.extract.ReadCsvResult, None, None]

Examples

>>> from abraxos import read_csv_chunks
>>> for result in read_csv_chunks('data.csv', chunksize=100):
...     print(result.bad_lines)
...     print(result.dataframe)

abraxos.split(df, i=2)[source]

Splits a DataFrame into i approximately equal parts.

Parameters:
  • df (pd.DataFrame) – The DataFrame to be split.

  • i (int, optional) – The number of parts to split the DataFrame into (default is 2).

Returns:

A tuple containing i DataFrames, each being a partition of the original DataFrame.

Return type:

tuple of pd.DataFrame

Examples

>>> import pandas as pd
>>> import abraxos
>>> df = pd.DataFrame({'A': range(10)})
>>> abraxos.split(df, 3)
(   A
0  0
1  1
2  2
3  3,
   A
4  4
5  5
6  6,
   A
7  7
8  8
9  9)

abraxos.to_records(df)[source]

Converts a DataFrame to a list of record dictionaries, replacing NaN with None.

This is useful for inserting into databases that expect None for nulls.

Parameters:

df (pd.DataFrame) – The DataFrame to convert.

Returns:

A list of records (dicts), where each dict is a row in the DataFrame.

Return type:

list of dict

Examples

>>> import pandas as pd
>>> from abraxos import to_records
>>> df = pd.DataFrame({'a': [1, None], 'b': ['x', 'y']})
>>> to_records(df)
[{'a': 1.0, 'b': 'x'}, {'a': None, 'b': 'y'}]

abraxos.to_sql(df, name, con, *, if_exists='append', index=False, chunks=2, **kwargs)[source]

Writes a DataFrame to a SQL database table with error handling.

Parameters:
  • df (pd.DataFrame) – The DataFrame to insert.

  • name (str) – Name of the target table.

  • con (SqlConnection or SqlEngine) – SQLAlchemy-like connection or engine object.

  • if_exists ({'fail', 'replace', 'append'}, optional) – SQL behavior if the table already exists (default is ‘append’).

  • index (bool, optional) – Whether to include the DataFrame index in the output (default is False).

  • chunks (int, optional) – Number of chunks to split the DataFrame into, recursively, when an insert fails (default is 2).

  • **kwargs (typing.Any) – Additional arguments passed to pandas.DataFrame.to_sql.

Returns:

A named tuple with the list of errors, a DataFrame of the rows that failed, and a DataFrame of the rows that were inserted successfully.

Return type:

ToSqlResult

abraxos.transform(df, transformer, chunks=2)[source]

Applies a transformation function to a DataFrame with error isolation.

If the transformation raises an exception on a chunk, the DataFrame is split into smaller chunks recursively to isolate errors. Ultimately, rows that fail even as single-row DataFrames are collected separately.

Parameters:
  • df (pd.DataFrame) – The input DataFrame to transform.

  • transformer (Callable[[pd.DataFrame], pd.DataFrame]) – A function that transforms a DataFrame and returns a new DataFrame.

  • chunks (int, optional) – Number of subchunks to divide the DataFrame into if transformation fails (default is 2).

Returns:

A named tuple with:

  • errors – A list of exceptions that occurred during transformation.

  • errored_df – A DataFrame of rows that could not be transformed.

  • success_df – A DataFrame of successfully transformed rows.

Return type:

TransformResult

Examples

>>> import pandas as pd
>>> from abraxos import transform
>>> def double_values(df): return df.assign(value=df['value'] * 2)
>>> df = pd.DataFrame({'value': [1, 2, 3]})
>>> result = transform(df, double_values)
>>> result.success_df
   value
0      2
1      4
2      6
>>> result.errored_df.empty
True

abraxos.use_sql(df, connection, sql_query, chunks=2)[source]

Executes a user-provided SQL insert statement using insert_df, with error handling.

Parameters:
  • df (pd.DataFrame) – The DataFrame to insert.

  • connection (SqlConnection) – SQL connection object.

  • sql_query (SqlInsert) – SQL insert statement object.

  • chunks (int, optional) – Number of chunks to split on failure (default is 2).

Returns:

A result indicating which rows succeeded and which failed.

Return type:

ToSqlResult

abraxos.validate(df, model)[source]

Validates each row in a DataFrame using a Pydantic-like model.

Each record is passed to the model’s model_validate method. Successfully validated models are converted back into rows using model_dump.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing records to be validated.

  • model (type[PydanticModel] or PydanticModel) – A Pydantic-style model class or instance with model_validate and model_dump methods.

Returns:

A named tuple with:

  • errors – List of exceptions raised during validation.

  • errored_df – DataFrame of rows that failed validation.

  • success_df – DataFrame of rows that were successfully validated.

Return type:

ValidateResult

Examples

>>> import pandas as pd
>>> from abraxos import validate
>>> from pydantic import BaseModel
>>> class Person(BaseModel):
...     name: str
...     age: int
>>> df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 'invalid']})
>>> result = validate(df, Person)
>>> len(result.success_df)
1
>>> len(result.errored_df)
1