tesliper.extraction.soxhlet

A tool for batch parsing of files from a specified directory.

Classes

Soxhlet([path, purpose, wanted_files, ...])

A tool for data extraction from files in a specific directory.

class tesliper.extraction.soxhlet.Soxhlet(path: Optional[Union[str, pathlib.Path]] = None, purpose: str = 'gaussian', wanted_files: Optional[Iterable[Union[str, pathlib.Path]]] = None, extension: Optional[str] = None, recursive: bool = False)

A tool for data extraction from files in a specific directory. Typical use:

>>> s = Soxhlet('absolute/path_to/working/directory')
>>> data = s.extract()
Parameters
  • path (str or pathlib.Path) – Absolute path to the directory containing the files from which data will be extracted.

  • purpose (str) – Determines which of the registered parsers should be used for extraction. Purposes supported out of the box are “gaussian”, “spectra”, and “parameters”.

  • wanted_files (list of str or pathlib.Path objects, optional) – List of files that should be loaded for extraction. If omitted, all output files present in the directory will be processed.

  • extension (str, optional) – File extension of the output files that should be parsed. If omitted, Soxhlet will try to resolve it based on the contents of the directory given in the path parameter.

  • recursive (bool) – If True, the given path is searched recursively and data is also extracted from files in subdirectories; otherwise subdirectories are ignored and only files placed directly in path are parsed.

Raises
  • FileNotFoundError – If the path passed to the constructor doesn’t exist or is not a directory.

  • ValueError – If no parser is registered for the given purpose.

property all_files

List of all files present in the directory bound to the Soxhlet instance. If its recursive attribute is True, files from subdirectories are also included.

property files

List of all wanted files available in the given directory. If wanted_files is not specified, evaluates to all files in that directory. If the Soxhlet object’s recursive attribute is True, files from subdirectories are also included.
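
The effect of the recursive attribute on a file listing can be sketched with plain pathlib. The helper list_files below is hypothetical and is not tesliper's implementation; it only illustrates the difference between a flat listing and a recursive one, under the assumption that only regular files are of interest.

```python
import tempfile
from pathlib import Path
from typing import List


def list_files(path: Path, recursive: bool = False) -> List[Path]:
    """Hypothetical helper: list regular files in `path`, optionally descending."""
    # rglob("*") walks subdirectories; iterdir() stays at the top level.
    candidates = path.rglob("*") if recursive else path.iterdir()
    return sorted(p for p in candidates if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.out").write_text("")
    (root / "sub" / "b.out").write_text("")
    flat = [p.name for p in list_files(root)]  # top-level files only
    deep = [p.name for p in list_files(root, recursive=True)]  # includes sub/
```

Here flat contains only the file placed directly in the directory, while deep also picks up the file from the subdirectory.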

property wanted_files: Optional[Set[str]]

Set of files that are desired for data extraction, stored as filenames without an extension. Any iterable of strings or Path objects is transformed to this form.

>>> s = Soxhlet()
>>> s.wanted_files = [Path("./dir/file_one.out"), Path("./dir/file_two.out")]
>>> s.wanted_files
{"file_one", "file_two"}

May also be set to None or another “falsy” value, in which case it is ignored.
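
The transformation described above can be approximated in a few lines of standard-library code. This is a sketch only: normalize_wanted is a hypothetical name, and it assumes that "without an extension" means dropping the last suffix, which is what Path.stem does.

```python
from pathlib import Path
from typing import Iterable, Optional, Set, Union


def normalize_wanted(
    files: Optional[Iterable[Union[str, Path]]]
) -> Optional[Set[str]]:
    """Hypothetical helper: reduce paths to bare file names without extension."""
    if not files:
        # None and other falsy values (e.g. an empty list) mean "no restriction".
        return None
    # Path.stem drops the directory part and the last suffix only.
    return {Path(f).stem for f in files}


wanted = normalize_wanted([Path("./dir/file_one.out"), "dir/file_two.out"])
unrestricted = normalize_wanted([])
```

Strings and Path objects can be mixed freely, since both are passed through Path before the stem is taken.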

property output_files: List[pathlib.Path]

List of Gaussian output files, sorted by file name, taken from the files list associated with the Soxhlet instance.

filter_files(ext: Optional[str] = None) → List[pathlib.Path]

Filters files from filenames list.

Filters the file names in the list associated with the Soxhlet instance. Returns the files whose names end with the given ext string (the file extension) and, if wanted_files were provided, start with any of the names stored there.

Parameters

ext (str) – String representing the file extension.

Returns

List of filtered files.

Return type

list

Raises

ValueError – If parameter ext is not given and attribute extension is None.
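
The filtering rule described for filter_files can be sketched as follows. The function filter_by_extension is hypothetical and takes the file list, extension, and wanted names as explicit arguments, whereas the real method reads them from the instance; the matching by name prefix follows the description above and is an assumption about the details.

```python
from pathlib import Path
from typing import Iterable, List, Optional, Set


def filter_by_extension(
    files: Iterable[Path],
    ext: Optional[str],
    wanted: Optional[Set[str]] = None,
) -> List[Path]:
    """Hypothetical helper: keep files ending with `ext` and matching `wanted`."""
    if ext is None:
        raise ValueError("No extension given to filter by.")
    selected = [f for f in files if f.name.endswith(ext)]
    if wanted:
        # Keep only files whose name starts with one of the wanted names.
        selected = [
            f for f in selected if any(f.name.startswith(w) for w in wanted)
        ]
    return selected


files = [Path("dir/file_one.out"), Path("dir/file_one.gjf"), Path("dir/file_two.out")]
only_out = filter_by_extension(files, ".out")
only_one = filter_by_extension(files, ".out", wanted={"file_one"})
```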

guess_extension() → str

Tries to figure out which extension should be assumed.

Looks for files whose names end with one of the extensions defined by the currently used parser. Returns the extension if exactly one of them matches. Raises an exception if the extension cannot be determined unambiguously.

Returns

The extension of the files present in the filenames list that the current parser can parse.

Return type

str

Raises
  • ValueError – If files with more than one of the extensions declared as compatible by the current parser are present in the list of filenames.

  • FileNotFoundError – If no files with an extension declared as compatible by the current parser are present in the list of filenames.

  • TypeError – If the current parser does not declare any compatible file extensions.
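
The three error conditions and the single success case can be laid out as a small decision procedure. The sketch below is an illustration under stated assumptions, not the library's code: it is a free function rather than a method, and it assumes the parser's extensions are given as a set of strings without a leading dot.

```python
from pathlib import Path
from typing import Iterable, Set


def guess_extension(filenames: Iterable[str], supported: Set[str]) -> str:
    """Hypothetical sketch of the documented decision rules."""
    if not supported:
        raise TypeError("Parser declares no compatible file extensions.")
    # Collect the extensions actually present, without the leading dot.
    present = {Path(name).suffix.lstrip(".") for name in filenames}
    matching = present & supported
    if len(matching) > 1:
        raise ValueError(f"Ambiguous extensions: {sorted(matching)}")
    if not matching:
        raise FileNotFoundError("No compatible files found.")
    return matching.pop()


guessed = guess_extension(["a.out", "b.out", "notes.txt"], {"out", "log"})
```

Files with unrelated extensions, such as notes.txt here, do not make the guess ambiguous; only extensions the parser declares are counted.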

extract_iter() → Generator[Tuple[str, dict], None, None]

Extracts data from the files associated with the Soxhlet instance (via its path and wanted_files attributes), using the current parser (determined by the purpose provided on Soxhlet’s instantiation). Implemented as a generator. If the Soxhlet instance’s recursive attribute is True, files from subdirectories are also parsed.

Yields

tuple – Two-item tuple with the name of the parsed file as the first item and the extracted data as the second, for each file associated with the Soxhlet instance.

extract() → dict

Extracts data from the files associated with the Soxhlet instance (via its path and wanted_files attributes), using the current parser (determined by the purpose provided on Soxhlet’s instantiation). If the Soxhlet.recursive attribute is True, files from subdirectories are also parsed.

Returns

Dictionary of extracted data, with the name of the parsed file as key and the data as value, for each file associated with the Soxhlet instance.

Return type

dict of dicts
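
The relationship between the generator form and the dictionary form can be illustrated with a self-contained sketch. The functions below are stand-ins that take the file names and a parsing callable as arguments; fake_parse is a hypothetical parser returning made-up data, and whether the library builds its dictionary exactly this way is an assumption.

```python
from typing import Callable, Dict, Generator, Iterable, Tuple


def extract_iter(
    files: Iterable[str], parse: Callable[[str], dict]
) -> Generator[Tuple[str, dict], None, None]:
    """Lazily yield (file name, extracted data) pairs, one file at a time."""
    for name in files:
        yield name, parse(name)


def extract(files: Iterable[str], parse: Callable[[str], dict]) -> Dict[str, dict]:
    """Collect all pairs from the generator into a dict keyed by file name."""
    return dict(extract_iter(files, parse))


def fake_parse(name: str) -> dict:
    # Stand-in for a real parser; returns made-up data for illustration.
    return {"length": len(name)}


result = extract(["a.out", "bb.out"], fake_parse)
```

The generator form is useful when files should be processed one at a time, for example to report progress, while the dictionary form is convenient when all results are needed at once.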

parse_one(source: Union[str, pathlib.Path]) → Any

Parses one file using the current parser (determined by the purpose provided on Soxhlet’s instantiation) and returns the extracted data.

Parameters

source (str or Path) – Path or path-like object pointing to a file. May be given as an absolute path or as a path relative to Soxhlet.path.

Returns

Data in the format that the current parser provides.

Return type

any

Raises

FileNotFoundError – If no source file is found.
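
Accepting either an absolute path or one relative to a base directory is straightforward with pathlib, because joining a base with an absolute path yields the absolute path unchanged. The helper resolve_source below is hypothetical and only sketches that resolution step; how parse_one resolves paths internally is not stated here.

```python
import tempfile
from pathlib import Path
from typing import Union


def resolve_source(base: Path, source: Union[str, Path]) -> Path:
    """Hypothetical helper: accept an absolute path or one relative to `base`."""
    # Joining with an absolute right-hand side keeps it unchanged.
    candidate = base / Path(source)
    if not candidate.is_file():
        raise FileNotFoundError(f"No such file: {candidate}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "mol.out").write_text("")
    from_relative = resolve_source(base, "mol.out")
    from_absolute = resolve_source(base, base / "mol.out")
    same = from_relative == from_absolute
```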