persidict documentation#
persidict is a lightweight persistent key-value store for Python designed for distributed environments where multiple processes on different machines concurrently work with the same store.
Overview#
persidict provides a familiar dict-like API for persistent storage, supporting both local
filesystem and AWS S3 backends. Each key-value pair is stored as a separate file or S3 object,
enabling efficient concurrent access in distributed computing scenarios.
from persidict import FileDirDict, S3Dict
# Local storage
local_store = FileDirDict(base_dir="my_data")
local_store["key"] = "value"
# Cloud storage
cloud_store = S3Dict(bucket_name="my-bucket")
cloud_store["api_key"] = "ABC-123"
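To make the one-file-per-key idea concrete, here is a minimal standard-library sketch. It is not persidict's implementation: the class name `TinyFileDict` and the `.json` file layout are assumptions chosen purely for illustration.

```python
import json
import tempfile
from pathlib import Path


class TinyFileDict:
    """Hypothetical illustration: each key lives in its own JSON file."""

    def __init__(self, base_dir: str) -> None:
        self._dir = Path(base_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self._dir / f"{key}.json"

    def __setitem__(self, key: str, value) -> None:
        # Writing one key touches one file only, so writers of
        # different keys never contend for the same file.
        self._path(key).write_text(json.dumps(value))

    def __getitem__(self, key: str):
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        return json.loads(path.read_text())

    def __contains__(self, key: str) -> bool:
        return self._path(key).exists()


base = tempfile.mkdtemp()
store = TinyFileDict(base)
store["key"] = "value"

# A second instance pointed at the same directory sees the same data.
same_store = TinyFileDict(base)
print(same_store["key"])  # value
```

Because no shared index file exists in this layout, two processes writing different keys do not need to coordinate with each other.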
Key Features#
Persistent Storage: Store dictionaries on the local filesystem (FileDirDict) or AWS S3 (S3Dict)
Standard Dictionary API: Use like regular Python dicts with [], keys(), items(), etc.
Distributed-Ready: Optimistic concurrency model designed for multi-process, multi-machine access
Flexible Serialization: Support for pickle, JSON, or plain text storage formats
Type Safety: Optional enforcement that all values are instances of a specific class
Generic Type Parameters: Static type checking with FileDirDict[MyClass] syntax
Advanced Features: Write-once dictionaries, timestamps, and tools for handling filesystem-safe keys
ETag-Based Conditional Operations: Optimistic concurrency helpers for conditional reads, writes, deletes, and transforms
Hierarchical Keys: Keys can be sequences of strings, creating directory-like structures
Core Concepts#
PersiDict Base Class#
PersiDict is the abstract base class defining the unified interface for all persistent
dictionaries. It extends Python’s MutableMapping with persistence-specific operations.
Key Types#
Keys in persidict must be URL/filename-safe:
SafeStrTuple: An immutable tuple of non-empty, filesystem-safe strings
NonEmptySafeStrTuple: A non-empty SafeStrTuple (the standard key type)
Keys can be provided as strings or sequences of strings and are automatically converted to the key types above.
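The conversion from a plain string or a sequence of strings to a tuple key can be pictured with the sketch below. The function name `to_key_tuple`, the allowed character set, and the choice of `ValueError` are all assumptions made for this example; persidict's actual validation rules may differ.

```python
import string
from typing import Sequence, Tuple, Union

# Assumed "safe" alphabet for this sketch only.
SAFE_CHARS = set(string.ascii_letters + string.digits + "()_-~.=")


def to_key_tuple(key: Union[str, Sequence[str]]) -> Tuple[str, ...]:
    """Normalize a key to a non-empty tuple of non-empty safe strings."""
    parts = (key,) if isinstance(key, str) else tuple(key)
    if not parts:
        raise ValueError("a key must contain at least one string")
    for part in parts:
        if not isinstance(part, str) or not part:
            raise ValueError(f"key parts must be non-empty strings: {part!r}")
        if not set(part) <= SAFE_CHARS:
            raise ValueError(f"unsafe character in key part: {part!r}")
    return parts


print(to_key_tuple("users"))             # ('users',)
print(to_key_tuple(["users", "alice"]))  # ('users', 'alice')
```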
Storage Implementations#
- FileDirDict
Local filesystem storage. Each key-value pair is a separate file in a directory hierarchy.
from persidict import FileDirDict

store = FileDirDict(
    base_dir="./data",
    serialization_format="json",  # or "pkl" or "txt"
    append_only=False,
    digest_len=8  # hash suffix for case-insensitive filesystems
)
- S3Dict
AWS S3 cloud storage. Each key-value pair is an S3 object.
from persidict import S3Dict

store = S3Dict(
    bucket_name="my-bucket",
    region="us-east-1",
    serialization_format="pkl"
)
- LocalDict
In-memory storage for testing and ephemeral data.
from persidict import LocalDict

store = LocalDict()
- EmptyDict
Null device equivalent - accepts all writes but discards them. Always appears empty. Useful for testing and debugging.
from persidict import EmptyDict

store = EmptyDict()
# All operations work but nothing is stored
Advanced Wrappers#
- WriteOnceDict
Enforces write-once behavior with optional probabilistic consistency checking.
from persidict import WriteOnceDict, FileDirDict

store = WriteOnceDict(
    wrapped_dict=FileDirDict(append_only=True),
    p_consistency_checks=0.1  # 10% random validation
)
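The idea behind write-once storage with probabilistic consistency checks can be sketched in a few lines. This is an illustration only, not WriteOnceDict's code: the class `WriteOnceSketch`, its in-memory backing dict, and the `ValueError` raised on a mismatch are assumptions.

```python
import random


class WriteOnceSketch:
    """Hypothetical illustration: first write wins, later writes are spot-checked."""

    def __init__(self, p_consistency_checks: float = 0.0) -> None:
        self._data = {}
        self._p = p_consistency_checks
        self.checks_performed = 0

    def __setitem__(self, key, value) -> None:
        if key not in self._data:
            self._data[key] = value  # first write is stored
            return
        # Repeated write: never overwrite. With probability p, verify
        # that the new value matches what is already stored.
        if random.random() < self._p:
            self.checks_performed += 1
            if self._data[key] != value:
                raise ValueError(f"inconsistent rewrite of key {key!r}")

    def __getitem__(self, key):
        return self._data[key]


checked = WriteOnceSketch(p_consistency_checks=1.0)  # always validate
checked["result"] = 42
checked["result"] = 42  # same value: accepted silently
print(checked["result"], checked.checks_performed)  # 42 1
```

A low probability keeps repeated writes cheap while still catching, over time, producers that compute different values for the same key.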
- MutableDictCached
Adds intelligent caching with ETag validation for mutable dictionaries.
from persidict import MutableDictCached, FileDirDict, LocalDict

store = MutableDictCached(
    main_dict=FileDirDict(base_dir="./data"),
    data_cache=LocalDict(),
    etag_cache=LocalDict()
)
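One way to picture ETag-validated caching is the read path below: compare a cheap ETag against the cached one, and fetch the full value only when they differ. The classes `SlowStore` and `CachedReader` are hypothetical stand-ins written for this sketch, and the hash-based ETag is an assumption; MutableDictCached may work differently internally.

```python
import hashlib


class SlowStore:
    """Stand-in for the main storage; counts expensive value reads."""

    def __init__(self) -> None:
        self._data = {}
        self.value_reads = 0

    def __setitem__(self, key, value) -> None:
        self._data[key] = value

    def etag(self, key) -> str:
        # Assumed to be much cheaper than fetching the value itself.
        return hashlib.sha256(repr(self._data[key]).encode()).hexdigest()

    def __getitem__(self, key):
        self.value_reads += 1
        return self._data[key]


class CachedReader:
    """Serves values from a cache while the stored ETag is unchanged."""

    def __init__(self, main: SlowStore) -> None:
        self.main = main
        self.data_cache = {}
        self.etag_cache = {}

    def __getitem__(self, key):
        current = self.main.etag(key)
        if self.etag_cache.get(key) == current:
            return self.data_cache[key]  # cache hit, no value read
        value = self.main[key]           # cache miss or stale entry
        self.data_cache[key] = value
        self.etag_cache[key] = current
        return value


main = SlowStore()
main["key"] = "v1"
reader = CachedReader(main)
reader["key"]
reader["key"]
print(main.value_reads)  # 1
main["key"] = "v2"       # another writer changes the value
print(reader["key"], main.value_reads)  # v2 2
```

The sketch ignores the window between the ETag check and the value read, which a real implementation has to handle.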
- AppendOnlyDictCached
High-performance caching for immutable (append-only) dictionaries.
from persidict import AppendOnlyDictCached, FileDirDict, LocalDict

store = AppendOnlyDictCached(
    main_dict=FileDirDict(base_dir="./data", append_only=True),
    data_cache=LocalDict()
)
- OverlappingMultiDict
Container for multiple PersiDict instances with different serialization formats sharing the same storage location.
from persidict import OverlappingMultiDict, FileDirDict

multi = OverlappingMultiDict(
    dict_type=FileDirDict,
    shared_subdicts_params={"base_dir": "./data"},
    json={},  # Creates multi.json with serialization_format="json"
    pkl={},   # Creates multi.pkl with serialization_format="pkl"
    txt={}    # Creates multi.txt with serialization_format="txt"
)
multi.json["config"] = {"setting": "value"}
multi.pkl["model"] = trained_model
multi.txt["log"] = "Plain text log entry"
Configuration Parameters#
Common Parameters#
All PersiDict implementations support these parameters:
serialization_format: str, default="pkl"
Storage format: "pkl" (pickle), "json" (JSON), or any other value for plain text
base_class_for_values: type | None, default=None
Optional type constraint - all values must be instances of this class
append_only: bool, default=False
If True, items cannot be modified or deleted after creation
FileDirDict Specific#
base_dir: str
Directory path for storing files
digest_len: int, default=4
Length of hash suffix added to prevent collisions on case-insensitive filesystems
S3Dict Specific#
bucket_name: str
Name of the S3 bucket
region: str | None, default=None
AWS region for the bucket (uses default region if not specified)
base_dir: str
Local directory for caching downloaded files
Special Values (Jokers)#
persidict provides special command-like values for conditional operations:
KEEP_CURRENT
When assigned to a key, preserves the existing value unchanged
from persidict import FileDirDict, KEEP_CURRENT

store = FileDirDict(base_dir="./data")
store["key"] = "original"
store["key"] = KEEP_CURRENT  # Value remains "original"
DELETE_CURRENT
When assigned to a key, deletes that key-value pair
from persidict import FileDirDict, DELETE_CURRENT

store = FileDirDict(base_dir="./data")
store["key"] = "value"
store["key"] = DELETE_CURRENT  # Key is now deleted
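The joker pattern itself is easy to reproduce with sentinel objects, which may help when reasoning about it. The sketch below is a stand-alone illustration: `KEEP`, `DELETE`, and `JokerAwareDict` are hypothetical names, not persidict objects, and treating `DELETE` on a missing key as a no-op is an assumption of this example.

```python
class _Joker:
    """Unique sentinel compared by identity."""

    def __init__(self, name: str) -> None:
        self._name = name

    def __repr__(self) -> str:
        return self._name


KEEP = _Joker("KEEP")      # hypothetical stand-in for KEEP_CURRENT
DELETE = _Joker("DELETE")  # hypothetical stand-in for DELETE_CURRENT


class JokerAwareDict(dict):
    def __setitem__(self, key, value) -> None:
        if value is KEEP:
            return               # leave whatever is stored untouched
        if value is DELETE:
            self.pop(key, None)  # assumption: missing key is a no-op
            return
        super().__setitem__(key, value)


def next_state(old: str):
    """A caller can return a command instead of a value."""
    if old == "expired":
        return DELETE
    if old == "final":
        return KEEP
    return old + "!"


d = JokerAwareDict(a="draft", b="final", c="expired")
for k in list(d):
    d[k] = next_state(d[k])
print(d)  # {'a': 'draft!', 'b': 'final'}
```

The benefit is that one assignment statement can express update, keep, or delete, so the decision can live inside the function that computes the new value.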
Extended API Methods#
Beyond standard dict operations, PersiDict provides additional methods:
Timestamp Operations#
timestamp(key) -> float
Returns POSIX timestamp of key's last modification
oldest_keys(max_n=None) -> list[SafeStrTuple]
Returns keys sorted from oldest to newest
newest_keys(max_n=None) -> list[SafeStrTuple]
Returns keys sorted from newest to oldest
oldest_values(max_n=None) -> list[Any]
Returns values corresponding to oldest keys
newest_values(max_n=None) -> list[Any]
Returns values corresponding to newest keys
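As a rough mental model, the ordering helpers amount to sorting keys by modification time and optionally truncating the result. The functions below operate on a plain dict of timestamps and are an illustration under that assumption, not the library's code.

```python
from typing import Dict, List, Optional


def oldest_keys(timestamps: Dict[str, float], max_n: Optional[int] = None) -> List[str]:
    """Keys ordered from oldest to newest modification time."""
    ordered = sorted(timestamps, key=timestamps.__getitem__)
    return ordered if max_n is None else ordered[:max_n]


def newest_keys(timestamps: Dict[str, float], max_n: Optional[int] = None) -> List[str]:
    """Keys ordered from newest to oldest modification time."""
    ordered = sorted(timestamps, key=timestamps.__getitem__, reverse=True)
    return ordered if max_n is None else ordered[:max_n]


# Hypothetical POSIX timestamps for three keys.
modified_at = {"event_2": 100.1, "event_1": 100.0, "event_3": 100.2}
print(oldest_keys(modified_at))           # ['event_1', 'event_2', 'event_3']
print(newest_keys(modified_at, max_n=2))  # ['event_3', 'event_2']
```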
Hierarchical Operations#
get_subdict(prefix_key) -> PersiDict
Returns a view into keys sharing a common prefix
subdicts() -> dict[str, PersiDict]
Returns mapping of first-level key prefixes to sub-dictionaries
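Prefix-based sub-dictionaries can be modeled on plain tuple keys as follows. Note the differences from the real methods: this sketch returns copies rather than views, the helper names simply mirror the method names for readability, and restricting `subdicts` to keys longer than one element is an assumption.

```python
from typing import Any, Dict, Sequence, Tuple, Union

Key = Tuple[str, ...]


def get_subdict(data: Dict[Key, Any], prefix: Union[str, Sequence[str]]) -> Dict[Key, Any]:
    """Entries whose key starts with prefix, with the prefix stripped."""
    pfx = (prefix,) if isinstance(prefix, str) else tuple(prefix)
    n = len(pfx)
    return {k[n:]: v for k, v in data.items() if len(k) > n and k[:n] == pfx}


def subdicts(data: Dict[Key, Any]) -> Dict[str, Dict[Key, Any]]:
    """Map each first-level prefix to its sub-dictionary."""
    firsts = sorted({k[0] for k in data if len(k) > 1})
    return {first: get_subdict(data, first) for first in firsts}


data = {
    ("users", "alice", "profile"): {"age": 30},
    ("users", "bob", "profile"): {"age": 25},
    ("configs", "database"): {"host": "localhost"},
}
users = get_subdict(data, "users")
print(len(users))                                          # 2
print(get_subdict(data, ("users", "alice"))[("profile",)])  # {'age': 30}
print(list(subdicts(data)))                                # ['configs', 'users']
```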
Utility Methods#
random_key() -> SafeStrTuple | None
Returns a uniformly random key (None if empty)
discard(key) -> bool
Deletes key if it exists, returns True; otherwise returns False
get_params() -> dict
Returns configuration parameters (mixinforge API)
Conditional Operations (ETag-based)#
Conditional operations provide optimistic concurrency. Each key has an ETag;
missing keys use ITEM_NOT_AVAILABLE. Conditions are ANY_ETAG
(unconditional), ETAG_IS_THE_SAME (expected == actual), and
ETAG_HAS_CHANGED (expected != actual). Methods return structured results
with whether the condition was satisfied, the actual ETag, the resulting ETag,
and the resulting value (or VALUE_NOT_RETRIEVED when value retrieval is
skipped).
from persidict import FileDirDict, ANY_ETAG, ETAG_IS_THE_SAME, ITEM_NOT_AVAILABLE
d = FileDirDict(base_dir="my_app_data")
d["counter"] = 1
# Compare-and-swap loop.
while True:
    r = d.get_item_if("counter", condition=ANY_ETAG, expected_etag=ITEM_NOT_AVAILABLE)
    new_value = 1 if r.new_value is ITEM_NOT_AVAILABLE else r.new_value + 1
    r2 = d.set_item_if("counter", value=new_value, condition=ETAG_IS_THE_SAME, expected_etag=r.actual_etag)
    if r2.condition_was_satisfied:
        break
ANY_ETAG / ETAG_IS_THE_SAME / ETAG_HAS_CHANGED
Condition selectors for ETag-based operations
ITEM_NOT_AVAILABLE / VALUE_NOT_RETRIEVED
Sentinels for missing keys and skipped retrievals
ETagValue
Type for ETag strings (NewType over str)
get_item_if(key, *, condition, expected_etag, retrieve_value=IF_ETAG_CHANGED) -> ConditionalOperationResult
Conditional read
set_item_if(key, *, value, condition, expected_etag, retrieve_value=IF_ETAG_CHANGED) -> ConditionalOperationResult
Conditional write (supports KEEP_CURRENT and DELETE_CURRENT)
setdefault_if(key, *, default_value, condition, expected_etag, retrieve_value=IF_ETAG_CHANGED) -> ConditionalOperationResult
Insert-if-absent
discard_if(key, *, condition, expected_etag) -> ConditionalOperationResult
Conditional delete
transform_item(key, *, transformer, n_retries=6) -> OperationResult
Retry loop for read-modify-write
etag(key) -> ETagValue
Returns ETag for the key
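The read-modify-write retry loop behind a method like transform_item can be illustrated with an in-memory store that hands out a fresh ETag on every write. Everything in this sketch (`VersionedStore`, `write_if_etag_matches`, `transform`, and the use of `None` as the ETag of a missing key) is hypothetical and only mirrors the general compare-and-swap technique; the meaning of `n_retries` here (number of attempts) is also an assumption.

```python
import itertools
import threading


class VersionedStore:
    """Hypothetical store: every successful write produces a new ETag."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items = {}  # key -> (etag, value)
        self._versions = itertools.count(1)

    def read(self, key):
        with self._lock:
            return self._items.get(key, (None, None))

    def write_if_etag_matches(self, key, value, expected_etag) -> bool:
        with self._lock:
            actual_etag, _ = self._items.get(key, (None, None))
            if actual_etag != expected_etag:
                return False  # someone else wrote in between
            self._items[key] = (f"v{next(self._versions)}", value)
            return True


def transform(store, key, transformer, n_retries: int = 6) -> bool:
    """Try up to n_retries times to apply transformer atomically."""
    for _ in range(n_retries):
        etag, value = store.read(key)
        if store.write_if_etag_matches(key, transformer(value), etag):
            return True
    return False


def increment(value):
    return 1 if value is None else value + 1


store = VersionedStore()


def worker() -> None:
    for _ in range(50):
        # Keep retrying until this increment has been applied.
        while not transform(store, "counter", increment):
            pass


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.read("counter")[1])  # 200
```

No increment is lost even though the workers never hold a lock across the read and the write; a conflicting write simply forces a retry with fresh data.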
Enhanced Iterators#
keys_and_timestamps() -> Iterator[tuple[SafeStrTuple, float]]
Iterates over (key, timestamp) pairs
values_and_timestamps() -> Iterator[tuple[Any, float]]
Iterates over (value, timestamp) pairs
items_and_timestamps() -> Iterator[tuple[SafeStrTuple, Any, float]]
Iterates over (key, value, timestamp) triples
Design Principles#
Familiar dict-like API: Mirrors Python’s built-in dict interface
Optimistic concurrency: Assumes conflicts are rare; last-write-wins for mutations
Conditional operations: Explicit ETag-based reads and writes with structured results
Pluggable backends: Unified API across filesystem, S3, and in-memory storage
Hierarchical keys: Sequences of safe strings form natural directory structures
Flexible serialization: Choose pickle, JSON, or plain text per use case
Layered architecture: Compose capabilities (storage + caching + write-once)
Intelligent caching: Tune performance for mutable vs append-only access patterns
Trade-offs#
Eventual consistency: May briefly see stale data in distributed scenarios
No multi-key transactions: Only single-key operations are atomic
Memory vs speed: Caching trades memory for performance
Network dependency: Cloud backends require reliable connectivity
Use Cases#
persidict excels at:
Caching: Store expensive computation results accessible across machines
Configuration Management: Distribute application settings in multi-node deployments
Data Pipelines: Share data between pipeline stages
Distributed Task Queues: Store task definitions and results
Memoization: Cache function results persistently and share them across machines
Model Registries: Store and version machine learning models
Experiment Tracking: Log and retrieve experiment parameters and results
Choosing a Configuration#
- Development/Testing
LocalDict or FileDirDict for simplicity
from persidict import FileDirDict

store = FileDirDict(base_dir="./dev_data")
- Production, Single Machine
MutableDictCached with FileDirDict for performance
from persidict import MutableDictCached, FileDirDict, LocalDict

store = MutableDictCached(
    main_dict=FileDirDict(base_dir="/var/app/data"),
    data_cache=LocalDict(),
    etag_cache=LocalDict()
)
- Production, Distributed
MutableDictCached with S3Dict for scalability
from persidict import MutableDictCached, S3Dict, LocalDict

store = MutableDictCached(
    main_dict=S3Dict(bucket_name="prod-data"),
    data_cache=LocalDict(),
    etag_cache=LocalDict()
)
- Append-Only Workloads
AppendOnlyDictCached for maximum performance
from persidict import AppendOnlyDictCached, FileDirDict, LocalDict

store = AppendOnlyDictCached(
    main_dict=FileDirDict(base_dir="./data", append_only=True),
    data_cache=LocalDict()
)
- Content-Addressed Storage
WriteOnceDict to avoid redundant writes
from persidict import WriteOnceDict, FileDirDict

store = WriteOnceDict(
    wrapped_dict=FileDirDict(append_only=True),
    p_consistency_checks=0.05
)
Installation#
Install from PyPI:
pip install persidict
With AWS S3 support:
pip install persidict[aws]
For development:
pip install persidict[dev]
Dependencies#
Core dependencies:
mixinforge
jsonpickle
joblib
lz4
deepdiff
AWS S3 support:
boto3
Development/testing:
pandas
numpy
pytest
moto
polars
scipy
pyarrow
Pillow
networkx
sympy
shapely
astropy
torch
Quick Start Examples#
Basic Usage#
from persidict import FileDirDict
# Create a persistent dictionary
cache = FileDirDict(base_dir="./cache")
# Use it like a regular dict
cache["user_123"] = {"name": "Alice", "score": 95}
cache["user_456"] = {"name": "Bob", "score": 87}
print(cache["user_123"]) # {'name': 'Alice', 'score': 95}
print(len(cache)) # 2
print("user_123" in cache) # True
# Data persists across sessions
cache2 = FileDirDict(base_dir="./cache")
print(cache2["user_123"]) # Still there!
Hierarchical Keys#
from persidict import FileDirDict
store = FileDirDict(base_dir="./data")
# Keys can be sequences of strings
store[("users", "alice", "profile")] = {"age": 30}
store[("users", "bob", "profile")] = {"age": 25}
store[("configs", "database")] = {"host": "localhost"}
# Get subdictionaries
users = store.get_subdict("users")
print(len(users)) # 2 (alice and bob)
# Access nested data
alice = store.get_subdict(("users", "alice"))
print(alice["profile"]) # {'age': 30}
Timestamps and Sorting#
from persidict import FileDirDict
import time
log = FileDirDict(base_dir="./logs")
log["event_1"] = "Started"
time.sleep(0.1)
log["event_2"] = "Processing"
time.sleep(0.1)
log["event_3"] = "Completed"
# Get events in chronological order
for key in log.oldest_keys():
    timestamp = log.timestamp(key)
    print(f"{key}: {log[key]} at {timestamp}")
# Get most recent events
recent = log.newest_keys(max_n=2)
print(recent) # [('event_3',), ('event_2',)]
Caching for Performance#
from persidict import MutableDictCached, FileDirDict, LocalDict
# Slow remote storage (e.g., network drive)
remote = FileDirDict(base_dir="/mnt/network/data")
# Fast local caches
data_cache = LocalDict()
etag_cache = LocalDict()
# Cached access
fast_store = MutableDictCached(
    main_dict=remote,
    data_cache=data_cache,
    etag_cache=etag_cache
)
# First access: slow (reads from remote)
value = fast_store["key"]
# Subsequent accesses: fast (reads from cache)
value = fast_store["key"] # Much faster!
Type Safety#
from persidict import FileDirDict
import pandas as pd
# Only allow pandas DataFrames
df_store = FileDirDict(
    base_dir="./dataframes",
    base_class_for_values=pd.DataFrame,
    serialization_format="json"
)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df_store["data"] = df # OK
# df_store["text"] = "hello" # Raises TypeError!
Generic Type Parameters#
All PersiDict classes support generic type parameters for static type checking:
from persidict import FileDirDict, LocalDict
# Typed dictionaries
scores: FileDirDict[int] = FileDirDict(base_dir="./scores")
scores["player1"] = 100
val: int = scores["player1"] # Type checker knows this is int
# Works with all implementations
cache: LocalDict[str] = LocalDict()
Why two mechanisms? Generic parameters are for static type checking only
(mypy, pyright). For runtime type enforcement, use base_class_for_values.
These are kept separate because many type hints—such as Callable, Literal,
TypedDict, and NewType—cannot be checked at runtime via isinstance().
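The limitation is easy to observe with the standard library alone. In the sketch below, `supports_isinstance` is a helper written for this example; it simply reports whether `isinstance()` accepts a given type hint.

```python
from typing import Callable, Literal, NewType, TypedDict

UserId = NewType("UserId", int)


class Point(TypedDict):
    x: int
    y: int


def supports_isinstance(hint) -> bool:
    """Return True if isinstance() accepts hint as its second argument."""
    try:
        isinstance(None, hint)
    except TypeError:
        return False
    return True


hints = {
    "int": int,
    "Callable[[int], int]": Callable[[int], int],
    'Literal["a", "b"]': Literal["a", "b"],
    "UserId (NewType)": UserId,
    "Point (TypedDict)": Point,
}
for name, hint in hints.items():
    print(f"{name}: {supports_isinstance(hint)}")
```

Only the plain class `int` passes; the other hints raise `TypeError` inside `isinstance()`, which is why a runtime constraint has to be expressed as an ordinary class.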
Multiple Serialization Formats#
from persidict import OverlappingMultiDict, FileDirDict
multi = OverlappingMultiDict(
    dict_type=FileDirDict,
    shared_subdicts_params={"base_dir": "./storage"},
    json={},
    pkl={},
    txt={}
)
# Store same logical data in different formats
config = {"host": "localhost", "port": 8080}
multi.json["config"] = config # Human-readable JSON
multi.pkl["config"] = config # Efficient pickle
# Store plain text
multi.txt["readme"] = "This is a plain text file."
API Reference#
API Documentation:
Project Statistics#
| Metric | Main code | Unit Tests | Total |
|---|---|---|---|
| Lines Of Code (LOC) | 7471 | 20500 | 27971 |
| Source Lines Of Code (SLOC) | 3303 | 13380 | 16683 |
| Classes | 37 | 40 | 77 |
| Functions / Methods | 296 | 1191 | 1487 |
| Files | 17 | 136 | 153 |
Contributing#
Contributions are welcome! Please see the contributing guide for details on:
Setting up the development environment
Running tests
Code style guidelines
Commit message conventions
Submitting pull requests
License#
persidict is licensed under the MIT License. See the LICENSE file for details.
Resources#
GitHub: pythagoras-dev/persidict
Documentation: https://persidict.readthedocs.io/
Design Principles: design_principles.md
Contact#
Maintainer: Vlad (Volodymyr) Pavlov
Email: vlpavlov@ieee.org