=====================
Parallelization Guide
=====================

ZenoWrapper provides two independent levels of parallelization that can be
combined for optimal performance on multi-core systems.

Two-Level Parallelism
=====================

1. **Frame-Level Parallelism** (MDAnalysis)

   Distributes trajectory frames across multiple Python processes using
   MDAnalysis's parallel analysis framework.

2. **Within-Frame Parallelism** (ZENO C++)

   Parallelizes Monte Carlo walks within each frame using ZENO's native C++
   threading.

Architecture
============

.. code-block:: text

    ┌─────────────────────────────────────────────────────────────────────┐
    │               MDAnalysis Multiprocessing (Frame Level)              │
    │              Distributes FRAMES across Python processes             │
    ├────────────────┬─────────────────┬─────────────────┬────────────────┤
    │   Process 1    │    Process 2    │    Process 3    │    Process 4   │
    │  Frames 0-24   │  Frames 25-49   │  Frames 50-74   │  Frames 75-99  │
    └───────┬────────┴────────┬────────┴────────┬────────┴────────┬───────┘
            │                 │                 │                 │
            ▼                 ▼                 ▼                 ▼
    ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
    │   ZENO C++    │ │   ZENO C++    │ │   ZENO C++    │ │   ZENO C++    │
    │   Threading   │ │   Threading   │ │   Threading   │ │   Threading   │
    │ (Within Frame)│ │ (Within Frame)│ │ (Within Frame)│ │ (Within Frame)│
    ├───────────────┤ ├───────────────┤ ├───────────────┤ ├───────────────┤
    │   Thread 1    │ │   Thread 1    │ │   Thread 1    │ │   Thread 1    │
    │   Thread 2    │ │   Thread 2    │ │   Thread 2    │ │   Thread 2    │
    │   Thread 3    │ │   Thread 3    │ │   Thread 3    │ │   Thread 3    │
    │   Thread 4    │ │   Thread 4    │ │   Thread 4    │ │   Thread 4    │
    └───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘

    Total Parallelism: 4 processes × 4 threads = 16 parallel computations

Choosing a Parallelization Strategy
===================================

The optimal strategy depends on your workload characteristics:

Many Frames, Fast Computation
-----------------------------

**Use frame-level parallelism only**

- **Configuration**: ``backend='multiprocessing'``, ``n_workers=N_CORES``,
  ``num_threads=1``
- **Best for**: trajectories with >100 frames and ``n_walks`` < 100,000
- **Memory**: N_CORES × base memory (each worker loads the full trajectory)
- **Scaling**: near-linear up to the number of physical cores

.. code-block:: python

    import MDAnalysis as mda
    from zenowrapper import ZenoWrapper

    u = mda.Universe('topology.pdb', 'trajectory.dcd')  # 1000 frames

    zeno = ZenoWrapper(
        u.atoms,
        type_radii={'C': 1.7, 'N': 1.55, 'O': 1.52},
        n_walks=50000,           # Moderate computation per frame
        n_interior_samples=5000,
        num_threads=1            # Single-threaded per frame
    )

    # Distribute frames across 16 workers
    zeno.run(backend='multiprocessing', n_workers=16)

Few Frames, Expensive Computation
---------------------------------

**Use within-frame parallelism only**

- **Configuration**: ``backend='serial'``, ``num_threads=N_CORES``
- **Best for**: <20 frames and ``n_walks`` > 1,000,000
- **Memory**: 1× base memory (shared across threads)
- **Scaling**: 90-95% efficiency (ZENO's C++ threading is very efficient)

.. code-block:: python

    u = mda.Universe('protein.pdb', 'single_frame.pdb')  # Single frame

    zeno = ZenoWrapper(
        u.atoms,
        type_radii={'C': 1.7, 'N': 1.55, 'O': 1.52},
        n_walks=10000000,            # Very expensive: 10M walks!
        n_interior_samples=1000000,
        num_threads=16               # Multi-threaded ZENO computation
    )

    # Process serially but with multi-threaded frames
    zeno.run(backend='serial')

Balanced Workload (Hybrid)
--------------------------

**Use both levels of parallelism**

- **Configuration**: ``backend='multiprocessing'``, ``n_workers=K``,
  ``num_threads=M``, where K × M ≤ N_CORES
- **Best for**: medium trajectories (20-200 frames) with moderate computation
- **Memory**: K × base memory
- **Scaling**: 60-75% efficiency (overhead from both levels)

.. code-block:: python

    u = mda.Universe('topology.pdb', 'trajectory.dcd')  # 100 frames

    zeno = ZenoWrapper(
        u.atoms,
        type_radii={'C': 1.7, 'N': 1.55, 'O': 1.52},
        n_walks=500000,           # Moderate computation
        n_interior_samples=50000,
        num_threads=4             # 4 threads per frame
    )

    # 4 workers × 4 threads = 16 cores total
    zeno.run(backend='multiprocessing', n_workers=4)

Performance Comparison
======================

Example: 100 frames, 1,000,000 walks per frame, on a 16-core machine

+----------------------+-----------+-------------+------------+---------------+
| Configuration        | n_workers | num_threads | Total Time | Memory Usage  |
+======================+===========+=============+============+===============+
| Serial               | 1         | 1           | ~1000s     | 1× (baseline) |
+----------------------+-----------+-------------+------------+---------------+
| Frame-parallel only  | 16        | 1           | ~65s       | 16×           |
+----------------------+-----------+-------------+------------+---------------+
| Thread-parallel only | 1         | 16          | ~100s      | 1×            |
+----------------------+-----------+-------------+------------+---------------+
| Hybrid               | 4         | 4           | ~90s       | 4×            |
+----------------------+-----------+-------------+------------+---------------+

For this workload, frame-level parallelism is the clear winner; the hybrid
configuration becomes attractive when memory (see Memory Considerations below)
caps the number of workers you can afford.

.. note::

   Performance numbers are approximate and depend on system architecture,
   memory bandwidth, and workload specifics.

Backend Selection
=================

Serial Backend
--------------

.. code-block:: python

    zeno.run(backend='serial')

- Single-process execution
- Always available
- Use with a high ``num_threads`` for within-frame parallelism
- Best for: debugging, single frames, small systems

Multiprocessing Backend
-----------------------

.. code-block:: python

    zeno.run(backend='multiprocessing', n_workers=8)

- Standard Python multiprocessing
- No additional dependencies
- Good for local multi-core machines
- Each worker runs in an independent Python process
- **Limitation**: cannot be used with streaming readers (e.g., IMDReader)

Dask Backend
------------

.. code-block:: python

    zeno.run(backend='dask', n_workers=8)

- Requires the ``dask`` and ``dask.distributed`` packages
- Supports distributed computing across multiple machines
- More sophisticated scheduling
- Better for very large workloads or clusters

.. code-block:: bash

    # Install dask support
    pip install "dask[distributed]"
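
To scale beyond a single machine, connect a ``dask.distributed`` client to an
existing cluster before calling ``run()``. The sketch below is illustrative
only: the scheduler address is a placeholder, and whether the active client is
picked up automatically depends on the MDAnalysis dask backend in your
installation, so verify this against its documentation.

.. code-block:: python

    # Minimal sketch: dask backend against a distributed cluster.
    from dask.distributed import Client

    import MDAnalysis as mda
    from zenowrapper import ZenoWrapper

    # Connect to an existing scheduler (placeholder address) ...
    client = Client('tcp://scheduler.example.org:8786')
    # ... or spin up a local test cluster instead:
    # client = Client(n_workers=8, threads_per_worker=1)

    u = mda.Universe('topology.pdb', 'trajectory.dcd')

    zeno = ZenoWrapper(
        u.atoms,
        type_radii={'C': 1.7, 'N': 1.55, 'O': 1.52},
        n_walks=50000,
        num_threads=1   # let dask scale out across frame tasks
    )
    zeno.run(backend='dask', n_workers=8)

    client.close()
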
Limitations
===========

Trajectory Reader Compatibility
-------------------------------

Frame-level parallelization requires trajectory readers that support:

1. **Random access**: the ability to seek to arbitrary frames
2. **Pickling**: serialization for inter-process communication
3. **Independent copies**: each worker creates its own reader instance

**Compatible readers** (most file-based formats):

- DCD, XTC, TRR, NetCDF, HDF5, PDB, etc.

**Incompatible readers**:

- :class:`~MDAnalysis.coordinates.IMD.IMDReader` (streaming, no random access)
- Any custom reader without pickle support

For incompatible readers, use the serial backend with within-frame threading:

.. code-block:: python

    # IMDReader example (streaming data)
    u = mda.Universe('topology.tpr', 'imd://localhost:8889')

    zeno = ZenoWrapper(
        u.atoms,
        type_radii=type_radii,
        num_threads=8   # Use threading only
    )

    # Must use the serial backend
    zeno.run(backend='serial')

Memory Considerations
---------------------

Each worker in frame-level parallelization loads a complete copy of the
trajectory:

.. code-block:: python

    # Memory usage ≈ n_workers × trajectory_size
    memory_needed = n_workers * trajectory_memory_footprint

For large trajectories, consider:

- Using fewer workers with more threads per worker
- Processing the trajectory in chunks (see the second sketch under Best
  Practices)
- Using memory-efficient trajectory formats (e.g., XTC instead of DCD)

Best Practices
==============

1. **Start with profiling**: run a few frames serially to estimate the
   per-frame cost (see the first sketch below)
2. **Match strategy to workload**: use the guidelines above based on frame
   count and per-frame computation cost
3. **Monitor memory**: ensure ``n_workers × trajectory_size`` fits in RAM
4. **Test scaling**: verify the speedup with small tests before full
   production runs
5. **Use fixed seeds**: set the ``seed`` parameter for reproducible parallel
   results
6. **Check results**: compare serial and parallel runs on a small dataset to
   verify correctness
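
The sketch below combines practices 1, 5, and 6: it times a few frames
serially, projects the cost of the full trajectory, and then re-runs the same
frames in parallel for comparison. It assumes ``ZenoWrapper`` accepts the
usual :class:`~MDAnalysis.analysis.base.AnalysisBase` frame-slicing argument
``stop`` in ``run()``; the result attribute to compare is deliberately left
as a placeholder.

.. code-block:: python

    import time

    import MDAnalysis as mda
    from zenowrapper import ZenoWrapper

    u = mda.Universe('topology.pdb', 'trajectory.dcd')
    type_radii = {'C': 1.7, 'N': 1.55, 'O': 1.52}

    def make_analysis():
        # Fixed seed so the serial and parallel runs are comparable.
        return ZenoWrapper(u.atoms, type_radii=type_radii,
                           n_walks=100000, seed=42, num_threads=1)

    # 1. Profile: time five frames serially to estimate per-frame cost.
    #    (Assumes AnalysisBase-style frame slicing via run(stop=...).)
    probe = make_analysis()
    t0 = time.perf_counter()
    probe.run(backend='serial', stop=5)
    per_frame = (time.perf_counter() - t0) / 5
    est_parallel = per_frame * len(u.trajectory) / 16
    print(f"~{per_frame:.2f} s/frame; "
          f"~{est_parallel:.0f} s projected on 16 workers (ideal scaling)")

    # 2. Verify: repeat the same frames in parallel and compare results.
    check = make_analysis()
    check.run(backend='multiprocessing', n_workers=4, stop=5)

    # The attribute below is a placeholder -- compare whichever quantity
    # ZenoWrapper actually reports in its results:
    # assert numpy.allclose(probe.results.<quantity>,
    #                       check.results.<quantity>)
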
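The chunked-processing suggestion from Memory Considerations can be sketched
with the same slicing arguments: chunking bounds how many per-frame results
accumulate in memory at once. ``start``/``stop`` are again assumed to follow
the ``AnalysisBase`` convention, and the aggregation step is left schematic.

.. code-block:: python

    chunk = 100
    n_frames = len(u.trajectory)
    chunk_results = []

    for begin in range(0, n_frames, chunk):
        zeno = ZenoWrapper(u.atoms, type_radii=type_radii,
                           n_walks=100000, num_threads=1)
        zeno.run(backend='multiprocessing', n_workers=8,
                 start=begin, stop=min(begin + chunk, n_frames))
        # Persist or reduce this chunk's output before moving on;
        # ``results`` follows the AnalysisBase convention, but its
        # contents are ZenoWrapper-specific.
        chunk_results.append(zeno.results)
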
Example: Adaptive Strategy
--------------------------

.. code-block:: python

    import multiprocessing

    import MDAnalysis as mda
    from zenowrapper import ZenoWrapper

    u = mda.Universe('topology.pdb', 'trajectory.dcd')
    n_cores = multiprocessing.cpu_count()
    n_frames = len(u.trajectory)

    type_radii = {'C': 1.7, 'N': 1.55, 'O': 1.52}
    n_walks = 1000000

    # Adaptive strategy based on workload
    if n_frames > 100 and n_walks < 100000:
        # Many frames, fast computation: maximize frame parallelism
        config = {
            'backend': 'multiprocessing',
            'n_workers': n_cores,
            'num_threads': 1,
        }
    elif n_frames < 20 and n_walks > 1000000:
        # Few frames, expensive: maximize thread parallelism
        config = {
            'backend': 'serial',
            'n_workers': None,
            'num_threads': n_cores,
        }
    else:
        # Balanced: hybrid approach
        n_workers = max(1, n_cores // 4)
        threads_per_worker = n_cores // n_workers
        config = {
            'backend': 'multiprocessing',
            'n_workers': n_workers,
            'num_threads': threads_per_worker,
        }

    print(f"Using strategy: {config}")

    zeno = ZenoWrapper(
        u.atoms,
        type_radii=type_radii,
        n_walks=n_walks,
        num_threads=config['num_threads']
    )

    if config['backend'] == 'serial':
        zeno.run(backend='serial')
    else:
        zeno.run(backend=config['backend'], n_workers=config['n_workers'])

See Also
========

- :ref:`parallel-analysis` : MDAnalysis parallel analysis framework
- :class:`~MDAnalysis.analysis.base.AnalysisBase` : Base class documentation
- `ZENO Documentation `_ : Algorithm and implementation details