abraxos package
Submodules
abraxos.extract module
CSV reading utilities with bad line recovery.
- class abraxos.extract.ReadCsvResult(bad_lines: list[list[str]], dataframe: pd.DataFrame)
Bases: NamedTuple
A named tuple representing the result of reading a CSV file.
- bad_lines
List of lines that could not be parsed correctly.
- Type:
list of list of str
- dataframe
Parsed portion of the CSV file.
- Type:
pandas.DataFrame
- bad_lines: list[list[str]]
Alias for field number 0
- count(value, /)
Return number of occurrences of value.
- dataframe: pandas.core.frame.DataFrame
Alias for field number 1
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- abraxos.extract.read_csv(path, *, chunksize=None, **kwargs)
Reads a CSV file and optionally processes it in chunks, capturing malformed lines.
- Parameters:
path (str) – Path to the CSV file.
chunksize (int, optional) – Number of rows per chunk. If specified, the file is read in chunks. If None (default), the entire file is read at once.
**kwargs (dict) – Additional arguments passed to pandas.read_csv.
- Returns:
If chunksize is None, returns a single ReadCsvResult. Otherwise, returns a generator yielding ReadCsvResult for each chunk.
- Return type:
ReadCsvResult or Generator of ReadCsvResult
Examples
>>> from abraxos import read_csv
>>> result = read_csv('data.csv')
>>> print(result.bad_lines)
>>> print(result.dataframe)

>>> for result in read_csv('data.csv', chunksize=50):
...     print(result.bad_lines)
...     print(result.dataframe)
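The idea of returning the parsed rows together with the lines that could not be parsed can be sketched with the standard library alone. This is an illustration under stated assumptions, not the abraxos implementation: `RowsResult` and `read_rows` are hypothetical names, and a line is treated as bad when its field count differs from the header's.

```python
import csv
import io
from typing import NamedTuple


class RowsResult(NamedTuple):
    """Hypothetical stand-in for ReadCsvResult that holds plain lists."""
    bad_lines: list[list[str]]
    rows: list[dict[str, str]]


def read_rows(text: str) -> RowsResult:
    """Parse CSV text, setting aside rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    bad_lines, rows = [], []
    for fields in reader:
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
        else:
            # Assumption: a wrong field count is what makes a line "bad".
            bad_lines.append(fields)
    return RowsResult(bad_lines, rows)


sample = "id,name\n1,Ada\n2,Grace,extra\n3,Linus\n"
result = read_rows(sample)
print(result.bad_lines)
print(result.rows)
```

As with ReadCsvResult, the bad lines come back as lists of raw string fields, so they can be logged or repaired separately from the clean rows.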
- abraxos.extract.read_csv_chunks(path, chunksize, **kwargs)
Reads a CSV file in chunks and captures malformed lines.
- Parameters:
path (str) – Path to the CSV file.
chunksize (int) – Number of rows per chunk.
**kwargs (dict) – Additional arguments passed to pandas.read_csv.
- Yields:
ReadCsvResult – A named tuple containing bad lines and the parsed DataFrame for the chunk.
- Return type:
collections.abc.Generator[abraxos.extract.ReadCsvResult,None,None]
Examples
>>> from abraxos import read_csv_chunks
>>> for result in read_csv_chunks('data.csv', chunksize=100):
...     print(result.bad_lines)
...     print(result.dataframe)
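Chunked reading with per-chunk bad lines can be pictured as a generator. The sketch below uses only the standard library; `read_row_chunks` is a hypothetical helper, and it assumes that bad lines count toward `chunksize`, which may not match how abraxos counts rows.

```python
import csv
import io
from itertools import islice
from typing import Iterator


def read_row_chunks(
    text: str, chunksize: int
) -> Iterator[tuple[list[list[str]], list[list[str]]]]:
    """Yield (bad_lines, good_rows) for each chunk of at most chunksize lines."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        good = [row for row in chunk if len(row) == len(header)]
        bad = [row for row in chunk if len(row) != len(header)]
        yield bad, good


sample = "a,b\n1,2\n3\n4,5\n6,7\n"
chunks = list(read_row_chunks(sample, chunksize=2))
for bad, good in chunks:
    print(bad, good)
```

Because the function is a generator, only one chunk is held in memory at a time, which is the usual reason to read a large file in chunks.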
abraxos.load module
SQL loading utilities with error handling and retry logic.
- class abraxos.load.SqlConnection(*args, **kwargs)
Bases: Protocol
Protocol for a database connection that supports executing insert statements.
- execute(insert, records)
Execute an insert statement with given records.
- Return type:
None
- class abraxos.load.SqlEngine(*args, **kwargs)
Bases: Protocol
Protocol for a database engine object that can provide connections.
- connect()
Obtain a SQL connection from the engine.
- Return type:
abraxos.load.SqlConnection
- class abraxos.load.SqlInsert(*args, **kwargs)
Bases: Protocol
Protocol for a SQL insert statement object (e.g., sqlalchemy.Insert).
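Because these classes are protocols, an object qualifies by having the right methods rather than by inheriting from them. The sketch below illustrates that structural typing with `typing.Protocol`; `Connection`, `RecordingConnection`, and `load` are hypothetical names that only mirror the documented `execute(insert, records)` shape.

```python
from typing import Any, Protocol


class Connection(Protocol):
    """Hypothetical protocol mirroring the documented execute(insert, records)."""

    def execute(self, insert: Any, records: list[dict[str, Any]]) -> None: ...


class RecordingConnection:
    """Does not inherit from Connection; it only matches its shape."""

    def __init__(self) -> None:
        self.calls: list[tuple[Any, list[dict[str, Any]]]] = []

    def execute(self, insert: Any, records: list[dict[str, Any]]) -> None:
        # Record the call instead of talking to a real database.
        self.calls.append((insert, records))


def load(con: Connection, insert: Any, records: list[dict[str, Any]]) -> None:
    """Accept any object that structurally satisfies Connection."""
    con.execute(insert, records)


con = RecordingConnection()
load(con, "INSERT INTO t VALUES (:a)", [{"a": 1}])
print(con.calls)
```

A stand-in like this can be handy in tests, since a static type checker accepts it wherever the protocol is expected.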
- class abraxos.load.ToSqlResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)
Bases: NamedTuple
Result of inserting a DataFrame into a database.
- errors
Exceptions encountered during insertion.
- Type:
list of Exception
- errored_df
Rows that failed to be inserted.
- Type:
pandas.DataFrame
- success_df
Rows that were successfully inserted.
- Type:
pandas.DataFrame
- count(value, /)
Return number of occurrences of value.
- errored_df: pandas.core.frame.DataFrame
Alias for field number 1
- errors: list[Exception]
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- success_df: pandas.core.frame.DataFrame
Alias for field number 2
- abraxos.load.to_sql(df, name, con, *, if_exists='append', index=False, chunks=2, **kwargs)
Writes a DataFrame to a SQL database table with error handling.
- Parameters:
df (pd.DataFrame) – The DataFrame to insert.
name (str) – Name of the target table.
con (SqlConnection or SqlEngine) – SQLAlchemy-like connection or engine object.
if_exists ({'fail', 'replace', 'append'}, optional) – SQL behavior if the table already exists (default is ‘append’).
index (bool, optional) – Whether to include the DataFrame index in the output (default is False).
chunks (int, optional) – Number of chunks to recursively split on failure (default is 2).
**kwargs (typing.Any) – Additional arguments passed to pandas.DataFrame.to_sql.
- Returns:
A named tuple containing the list of errors, a DataFrame of the rows that failed to insert, and a DataFrame of the rows that were inserted successfully.
- Return type:
ToSqlResult
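The split-on-failure behaviour controlled by `chunks` can be sketched against an in-memory SQLite database. This is not the abraxos implementation: `insert_rows`, the `users` table, and the use of plain tuples instead of DataFrames are assumptions made for illustration, and the sketch expects `chunks` to be at least 2.

```python
import sqlite3


def insert_rows(con, rows, chunks=2):
    """Insert rows, splitting failed batches to isolate the bad rows.

    Returns (errors, errored_rows, success_rows).
    """
    if not rows:
        return [], [], []
    try:
        with con:  # commits on success, rolls back the whole batch on error
            con.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
        return [], [], list(rows)
    except sqlite3.Error as exc:
        if len(rows) == 1:
            # A single row that still fails is reported as errored.
            return [exc], list(rows), []
    # The batch failed and has more than one row: split it and retry each part.
    errors, errored, succeeded = [], [], []
    size = -(-len(rows) // chunks)  # ceiling division
    for start in range(0, len(rows), size):
        e, bad, good = insert_rows(con, rows[start:start + size], chunks)
        errors += e
        errored += bad
        succeeded += good
    return errors, errored, succeeded


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
rows = [(1, "Ada"), (2, None), (3, "Grace"), (1, "duplicate id")]
errors, errored, succeeded = insert_rows(con, rows)
print(len(errors), errored, succeeded)
```

Rolling back each failed batch before retrying its parts keeps a row from being inserted twice, at the cost of extra round trips when many rows are bad.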
- abraxos.load.use_sql(df, connection, sql_query, chunks=2)
Executes a user-provided SQL insert statement using insert_df, with error handling.
- Parameters:
df (pd.DataFrame) – The DataFrame to insert.
connection (SqlConnection) – SQL connection object.
sql_query (SqlInsert) – SQL insert statement object.
chunks (int, optional) – Number of chunks to split on failure (default is 2).
- Returns:
A result indicating which rows succeeded and which failed.
- Return type:
ToSqlResult
abraxos.transform module
DataFrame transformation with error isolation.
- class abraxos.transform.TransformResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)
Bases: NamedTuple
Result of applying a transformation to a DataFrame.
- errors
Exceptions raised during transformation.
- Type:
list of Exception
- errored_df
Rows that failed to transform.
- Type:
pandas.DataFrame
- success_df
Successfully transformed rows.
- Type:
pandas.DataFrame
- count(value, /)
Return number of occurrences of value.
- errored_df: pandas.core.frame.DataFrame
Alias for field number 1
- errors: list[Exception]
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- success_df: pandas.core.frame.DataFrame
Alias for field number 2
- abraxos.transform.transform(df, transformer, chunks=2)
Applies a transformation function to a DataFrame with error isolation.
If the transformation raises an exception on a chunk, the DataFrame is split into smaller chunks recursively to isolate errors. Ultimately, rows that fail even as single-row DataFrames are collected separately.
- Parameters:
df (pd.DataFrame) – The input DataFrame to transform.
transformer (Callable[[pd.DataFrame], pd.DataFrame]) – A function that transforms a DataFrame and returns a new DataFrame.
chunks (int, optional) – Number of subchunks to divide the DataFrame into if transformation fails (default is 2).
- Returns:
A named tuple with:
- errors: A list of exceptions that occurred during transformation.
- errored_df: A DataFrame of rows that could not be transformed.
- success_df: A DataFrame of successfully transformed rows.
- Return type:
TransformResult
Examples
>>> import pandas as pd
>>> from abraxos import transform
>>> def double_values(df):
...     return df.assign(value=df['value'] * 2)
>>> df = pd.DataFrame({'value': [1, 2, 3]})
>>> result = transform(df, double_values)
>>> result.success_df
   value
0      2
1      4
2      6
>>> result.errored_df.empty
True
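The recursive isolation described for `transform` can be sketched on plain lists of dicts. The helper name `transform_rows` is hypothetical, the sketch assumes `chunks` is at least 2, and abraxos itself operates on DataFrames, so treat this as a picture of the idea rather than the library's code.

```python
def transform_rows(rows, transformer, chunks=2):
    """Apply transformer to a batch; on failure, split and retry the parts.

    Returns (errors, errored_rows, success_rows).
    """
    if not rows:
        return [], [], []
    try:
        return [], [], transformer(rows)
    except Exception as exc:
        if len(rows) == 1:
            # The failure is pinned down to this single row.
            return [exc], list(rows), []
    errors, errored, succeeded = [], [], []
    size = -(-len(rows) // chunks)  # ceiling division
    for start in range(0, len(rows), size):
        e, bad, good = transform_rows(rows[start:start + size], transformer, chunks)
        errors.extend(e)
        errored.extend(bad)
        succeeded.extend(good)
    return errors, errored, succeeded


def double_values(rows):
    # Raises TypeError when a value is None.
    return [{**row, "value": row["value"] * 2} for row in rows]


rows = [{"value": 1}, {"value": None}, {"value": 3}]
errors, errored, succeeded = transform_rows(rows, double_values)
print(errors, errored, succeeded)
```

When the whole batch is clean the transformer runs once; the extra calls are only paid on batches that contain a failing row.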
abraxos.utils module
Utility functions for DataFrame operations.
- abraxos.utils.clear(df)
Returns an empty DataFrame with the same schema (columns and dtypes) as the input.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
- Returns:
An empty DataFrame with the same structure as df.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> from abraxos import clear
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> clear(df)
Empty DataFrame
Columns: [x]
Index: []
- abraxos.utils.split(df, i=2)
Splits a DataFrame into i approximately equal parts.
- Parameters:
df (pd.DataFrame) – The DataFrame to be split.
i (int, optional) – The number of parts to split the DataFrame into (default is 2).
- Returns:
A tuple containing i DataFrames, each being a partition of the original DataFrame.
- Return type:
tuple of pd.DataFrame
Examples
>>> import pandas as pd
>>> import abraxos
>>> df = pd.DataFrame({'A': range(10)})
>>> abraxos.split(df, 3)
(   A
0  0
1  1
2  2
3  3,    A
4  4
5  5
6  6,    A
7  7
8  8
9  9)
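The part sizes in the example above (4, 3, 3 for ten rows) are what results when the remainder is spread over the leading parts. The sketch below reproduces that sizing on a plain list; `split_rows` is a hypothetical helper, and whether abraxos computes the boundaries this way is an assumption.

```python
def split_rows(rows, i=2):
    """Split a list into i contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(rows), i)
    parts, start = [], 0
    for k in range(i):
        # The first `extra` parts each take one additional row.
        stop = start + base + (1 if k < extra else 0)
        parts.append(rows[start:stop])
        start = stop
    return tuple(parts)


parts = split_rows(list(range(10)), 3)
print(parts)
```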
- abraxos.utils.to_records(df)
Converts a DataFrame to a list of record dictionaries, replacing NaN with None.
This is useful for inserting into databases that expect None for nulls.
- Parameters:
df (pd.DataFrame) – The DataFrame to convert.
- Returns:
A list of records (dicts), where each dict is a row in the DataFrame.
- Return type:
list of dict
Examples
>>> import pandas as pd
>>> from abraxos import to_records
>>> df = pd.DataFrame({'a': [1, None], 'b': ['x', 'y']})
>>> to_records(df)
[{'a': 1.0, 'b': 'x'}, {'a': None, 'b': 'y'}]
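The NaN-to-None replacement matters because database drivers typically map None to NULL, while a float NaN may be stored as a number or rejected. A minimal sketch of the conversion on records that are already plain dicts follows; `nan_to_none` is a hypothetical helper, not part of abraxos.

```python
import math


def nan_to_none(records):
    """Return copies of the records with float NaN values replaced by None."""
    return [
        {
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in record.items()
        }
        for record in records
    ]


records = [{"a": 1.0, "b": "x"}, {"a": float("nan"), "b": "y"}]
cleaned = nan_to_none(records)
print(cleaned)
```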
abraxos.validate module
Pydantic model validation for DataFrame rows.
- class abraxos.validate.PydanticModel(*args, **kwargs)
Bases: Protocol
Protocol representing a Pydantic-like model for validation and serialization.
- model_dump()
Serializes the model into a dictionary.
- Return type:
dict
- model_validate(record)
Validates a dictionary record and returns a validated model instance.
- Return type:
abraxos.validate.PydanticModel
- class abraxos.validate.ValidateResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)
Bases: NamedTuple
Result of validating a DataFrame using a Pydantic-like model.
- errors
List of exceptions encountered during validation.
- Type:
list of Exception
- errored_df
DataFrame of rows that failed validation.
- Type:
pd.DataFrame
- success_df
DataFrame of successfully validated and serialized rows.
- Type:
pd.DataFrame
- count(value, /)
Return number of occurrences of value.
- errored_df: pandas.core.frame.DataFrame
Alias for field number 1
- errors: list[Exception]
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- success_df: pandas.core.frame.DataFrame
Alias for field number 2
- abraxos.validate.validate(df, model)
Validates each row in a DataFrame using a Pydantic-like model.
Each record is passed to the model’s model_validate method. Successfully validated models are converted back into rows using model_dump.
- Parameters:
df (pd.DataFrame) – The DataFrame containing records to be validated.
model (type[PydanticModel] or PydanticModel) – A Pydantic-style model class or instance with model_validate and model_dump methods.
- Returns:
A named tuple with:
- errors: List of exceptions raised during validation.
- errored_df: DataFrame of rows that failed validation.
- success_df: DataFrame of rows that were successfully validated.
- Return type:
ValidateResult
Examples
>>> import pandas as pd
>>> from pydantic import BaseModel
>>> from abraxos import validate
>>> class Person(BaseModel):
...     name: str
...     age: int
>>> df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 'invalid']})
>>> result = validate(df, Person)
>>> len(result.success_df)
1
>>> len(result.errored_df)
1
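The row-by-row flow (validate each record with `model_validate`, serialize the survivors with `model_dump`, collect the failures) can be sketched without Pydantic or pandas. Here `Person` is a hypothetical dataclass that merely exposes the two documented method names, and `validate_records` is an illustrative helper, not the abraxos function.

```python
from dataclasses import asdict, dataclass


@dataclass
class Person:
    """Hypothetical Pydantic-like model built on a dataclass."""
    name: str
    age: int

    @classmethod
    def model_validate(cls, record: dict) -> "Person":
        if not isinstance(record.get("name"), str):
            raise ValueError("name must be a string")
        if not isinstance(record.get("age"), int):
            raise ValueError("age must be an integer")
        return cls(name=record["name"], age=record["age"])

    def model_dump(self) -> dict:
        return asdict(self)


def validate_records(records, model):
    """Return (errors, errored_records, success_records)."""
    errors, errored, succeeded = [], [], []
    for record in records:
        try:
            succeeded.append(model.model_validate(record).model_dump())
        except Exception as exc:
            errors.append(exc)
            errored.append(record)
    return errors, errored, succeeded


records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": "invalid"}]
errors, errored, succeeded = validate_records(records, Person)
print(errors, errored, succeeded)
```

Keeping the errors and the errored records in parallel lists means the exception at position n explains the failed record at position n.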
Module contents
Abraxos: A lightweight Python toolkit for robust data processing with Pandas and Pydantic.
- exception abraxos.AbraxosError
Bases: Exception
Base exception for all abraxos errors.
- add_note(object, /)
Exception.add_note(note) – add a note to the exception
- args
- with_traceback(object, /)
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception abraxos.LoadError
Bases: AbraxosError
Exception raised when loading data to SQL fails.
- add_note(object, /)
Exception.add_note(note) – add a note to the exception
- args
- with_traceback(object, /)
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class abraxos.ReadCsvResult(bad_lines: list[list[str]], dataframe: pd.DataFrame)
Bases: NamedTuple
A named tuple representing the result of reading a CSV file.
- bad_lines
List of lines that could not be parsed correctly.
- Type:
list of list of str
- dataframe
Parsed portion of the CSV file.
- Type:
pandas.DataFrame
- bad_lines: list[list[str]]
Alias for field number 0
- count(value, /)
Return number of occurrences of value.
- dataframe: pandas.core.frame.DataFrame
Alias for field number 1
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- class abraxos.ToSqlResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)
Bases: NamedTuple
Result of inserting a DataFrame into a database.
- errors
Exceptions encountered during insertion.
- Type:
list of Exception
- errored_df
Rows that failed to be inserted.
- Type:
pandas.DataFrame
- success_df
Rows that were successfully inserted.
- Type:
pandas.DataFrame
- count(value, /)
Return number of occurrences of value.
- errored_df: pandas.core.frame.DataFrame
Alias for field number 1
- errors: list[Exception]
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- success_df: pandas.core.frame.DataFrame
Alias for field number 2
- exception abraxos.TransformError
Bases: AbraxosError
Exception raised when DataFrame transformation fails.
- add_note(object, /)
Exception.add_note(note) – add a note to the exception
- args
- with_traceback(object, /)
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class abraxos.TransformResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)
Bases: NamedTuple
Result of applying a transformation to a DataFrame.
- errors
Exceptions raised during transformation.
- Type:
list of Exception
- errored_df
Rows that failed to transform.
- Type:
pandas.DataFrame
- success_df
Successfully transformed rows.
- Type:
pandas.DataFrame
- count(value, /)
Return number of occurrences of value.
- errored_df: pandas.core.frame.DataFrame
Alias for field number 1
- errors: list[Exception]
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- success_df: pandas.core.frame.DataFrame
Alias for field number 2
- class abraxos.ValidateResult(errors: list[Exception], errored_df: pd.DataFrame, success_df: pd.DataFrame)
Bases: NamedTuple
Result of validating a DataFrame using a Pydantic-like model.
- errors
List of exceptions encountered during validation.
- Type:
list of Exception
- errored_df
DataFrame of rows that failed validation.
- Type:
pd.DataFrame
- success_df
DataFrame of successfully validated and serialized rows.
- Type:
pd.DataFrame
- count(value, /)
Return number of occurrences of value.
- errored_df: pandas.core.frame.DataFrame
Alias for field number 1
- errors: list[Exception]
Alias for field number 0
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- success_df: pandas.core.frame.DataFrame
Alias for field number 2
- exception abraxos.ValidationError
Bases: AbraxosError
Exception raised when row validation fails.
- add_note(object, /)
Exception.add_note(note) – add a note to the exception
- args
- with_traceback(object, /)
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- abraxos.clear(df)
Returns an empty DataFrame with the same schema (columns and dtypes) as the input.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
- Returns:
An empty DataFrame with the same structure as df.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> from abraxos import clear
>>> df = pd.DataFrame({'x': [1, 2, 3]})
>>> clear(df)
Empty DataFrame
Columns: [x]
Index: []
- abraxos.read_csv(path, *, chunksize=None, **kwargs)
Reads a CSV file and optionally processes it in chunks, capturing malformed lines.
- Parameters:
path (str) – Path to the CSV file.
chunksize (int, optional) – Number of rows per chunk. If specified, the file is read in chunks. If None (default), the entire file is read at once.
**kwargs (dict) – Additional arguments passed to pandas.read_csv.
- Returns:
If chunksize is None, returns a single ReadCsvResult. Otherwise, returns a generator yielding ReadCsvResult for each chunk.
- Return type:
ReadCsvResult or Generator of ReadCsvResult
Examples
>>> from abraxos import read_csv
>>> result = read_csv('data.csv')
>>> print(result.bad_lines)
>>> print(result.dataframe)

>>> for result in read_csv('data.csv', chunksize=50):
...     print(result.bad_lines)
...     print(result.dataframe)
- abraxos.read_csv_chunks(path, chunksize, **kwargs)
Reads a CSV file in chunks and captures malformed lines.
- Parameters:
path (str) – Path to the CSV file.
chunksize (int) – Number of rows per chunk.
**kwargs (dict) – Additional arguments passed to pandas.read_csv.
- Yields:
ReadCsvResult – A named tuple containing bad lines and the parsed DataFrame for the chunk.
- Return type:
collections.abc.Generator[abraxos.extract.ReadCsvResult,None,None]
Examples
>>> from abraxos import read_csv_chunks
>>> for result in read_csv_chunks('data.csv', chunksize=100):
...     print(result.bad_lines)
...     print(result.dataframe)
- abraxos.split(df, i=2)
Splits a DataFrame into i approximately equal parts.
- Parameters:
df (pd.DataFrame) – The DataFrame to be split.
i (int, optional) – The number of parts to split the DataFrame into (default is 2).
- Returns:
A tuple containing i DataFrames, each being a partition of the original DataFrame.
- Return type:
tuple of pd.DataFrame
Examples
>>> import pandas as pd
>>> import abraxos
>>> df = pd.DataFrame({'A': range(10)})
>>> abraxos.split(df, 3)
(   A
0  0
1  1
2  2
3  3,    A
4  4
5  5
6  6,    A
7  7
8  8
9  9)
- abraxos.to_records(df)
Converts a DataFrame to a list of record dictionaries, replacing NaN with None.
This is useful for inserting into databases that expect None for nulls.
- Parameters:
df (pd.DataFrame) – The DataFrame to convert.
- Returns:
A list of records (dicts), where each dict is a row in the DataFrame.
- Return type:
list of dict
Examples
>>> import pandas as pd
>>> from abraxos import to_records
>>> df = pd.DataFrame({'a': [1, None], 'b': ['x', 'y']})
>>> to_records(df)
[{'a': 1.0, 'b': 'x'}, {'a': None, 'b': 'y'}]
- abraxos.to_sql(df, name, con, *, if_exists='append', index=False, chunks=2, **kwargs)
Writes a DataFrame to a SQL database table with error handling.
- Parameters:
df (pd.DataFrame) – The DataFrame to insert.
name (str) – Name of the target table.
con (SqlConnection or SqlEngine) – SQLAlchemy-like connection or engine object.
if_exists ({'fail', 'replace', 'append'}, optional) – SQL behavior if the table already exists (default is ‘append’).
index (bool, optional) – Whether to include the DataFrame index in the output (default is False).
chunks (int, optional) – Number of chunks to recursively split on failure (default is 2).
**kwargs (typing.Any) – Additional arguments passed to pandas.DataFrame.to_sql.
- Returns:
A named tuple containing the list of errors, a DataFrame of the rows that failed to insert, and a DataFrame of the rows that were inserted successfully.
- Return type:
ToSqlResult
- abraxos.transform(df, transformer, chunks=2)
Applies a transformation function to a DataFrame with error isolation.
If the transformation raises an exception on a chunk, the DataFrame is split into smaller chunks recursively to isolate errors. Ultimately, rows that fail even as single-row DataFrames are collected separately.
- Parameters:
df (pd.DataFrame) – The input DataFrame to transform.
transformer (Callable[[pd.DataFrame], pd.DataFrame]) – A function that transforms a DataFrame and returns a new DataFrame.
chunks (int, optional) – Number of subchunks to divide the DataFrame into if transformation fails (default is 2).
- Returns:
A named tuple with:
- errors: A list of exceptions that occurred during transformation.
- errored_df: A DataFrame of rows that could not be transformed.
- success_df: A DataFrame of successfully transformed rows.
- Return type:
TransformResult
Examples
>>> import pandas as pd
>>> from abraxos import transform
>>> def double_values(df):
...     return df.assign(value=df['value'] * 2)
>>> df = pd.DataFrame({'value': [1, 2, 3]})
>>> result = transform(df, double_values)
>>> result.success_df
   value
0      2
1      4
2      6
>>> result.errored_df.empty
True
- abraxos.use_sql(df, connection, sql_query, chunks=2)
Executes a user-provided SQL insert statement using insert_df, with error handling.
- Parameters:
df (pd.DataFrame) – The DataFrame to insert.
connection (SqlConnection) – SQL connection object.
sql_query (SqlInsert) – SQL insert statement object.
chunks (int, optional) – Number of chunks to split on failure (default is 2).
- Returns:
A result indicating which rows succeeded and which failed.
- Return type:
ToSqlResult
- abraxos.validate(df, model)
Validates each row in a DataFrame using a Pydantic-like model.
Each record is passed to the model’s model_validate method. Successfully validated models are converted back into rows using model_dump.
- Parameters:
df (pd.DataFrame) – The DataFrame containing records to be validated.
model (type[PydanticModel] or PydanticModel) – A Pydantic-style model class or instance with model_validate and model_dump methods.
- Returns:
A named tuple with:
- errors: List of exceptions raised during validation.
- errored_df: DataFrame of rows that failed validation.
- success_df: DataFrame of rows that were successfully validated.
- Return type:
ValidateResult
Examples
>>> import pandas as pd
>>> from pydantic import BaseModel
>>> from abraxos import validate
>>> class Person(BaseModel):
...     name: str
...     age: int
>>> df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 'invalid']})
>>> result = validate(df, Person)
>>> len(result.success_df)
1
>>> len(result.errored_df)
1