Charm++ is a mature, highly scalable parallel programming system. Although it is implemented in C++, it interoperates with code written in Fortran, C, and C++. Code written in MPI can call Charm++, and Charm++ can call MPI, OpenMP, CUDA, and more.
Charm++ offers increased performance and the ability for code to continue increasing in performance even as computer processors stop getting faster.
Relative to Shared-Memory Parallel Code...
Charm++ offers increased performance on equal hardware (often 2x), and the ability for code to scale to larger problem sizes and increased performance by running across multiple computer nodes in a ‘distributed-memory’ fashion, with minor code modifications.
Relative to Distributed-Memory Parallel Code...
Charm++ offers improved ability to express sophisticated parallel programs, seamless automatic fault tolerance, and expert-quality facilities to address concerns of communication locality, load imbalance, message aggregation, and more, that would otherwise be left unhandled or receive an ad-hoc per-application treatment.
The problem is broken down into logical units, which are automatically mapped to processors.
Load imbalance arises in many HPC applications, and also occurs on mixed-hardware clusters. Rather than make every program solve this on its own, Charm++ provides automatic load balancing for all applications.
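As a rough illustration of what an automatic load balancer does, the sketch below applies a simple greedy heuristic: objects are sorted by measured load, and each one is placed on the currently least-loaded processor. This is a generic textbook strategy under assumed inputs, not the Charm++ implementation; the function name `greedy_balance`, the object ids, and the sample loads are all hypothetical.

```python
import heapq

def greedy_balance(object_loads, num_procs):
    """Assign each object to the least-loaded processor, heaviest objects first.

    object_loads: dict mapping a (hypothetical) object id to its measured load.
    Returns (assignment, proc_loads); assignment maps object id -> processor index.
    """
    # Min-heap of (current load, processor index).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    # Heaviest objects first; ties are broken by id so the result is deterministic.
    for obj, load in sorted(object_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        proc_load, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    # Recompute the per-processor totals from the final assignment.
    proc_loads = [0.0] * num_procs
    for obj, proc in assignment.items():
        proc_loads[proc] += object_loads[obj]
    return assignment, proc_loads

# Made-up loads for five objects spread over two processors.
loads = {"a": 8.0, "b": 7.0, "c": 6.0, "d": 5.0, "e": 4.0}
assignment, proc_loads = greedy_balance(loads, 2)
print(assignment, proc_loads)
```

Because the problem is split into many more objects than processors, even a heuristic this simple has room to even out the load; a production balancer would also weigh communication and migration cost.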
Automatic Communication & Computation Overlap
Charm++ exploits logical decomposition to enable dynamic overlap of communication and computation as the application executes.
Checkpointing & Resilience
Applications written in Charm++ can automatically checkpoint and restart with no extra code or special OS support. They can also run through node failures!
Charm++ can adapt execution to limit power consumption, reduce hotspots to improve reliability, and conserve energy to reduce cluster TCO.
Projections: an extensive suite for understanding the performance of your applications.
LiveViz: get live visualization data from any application.
CharmDebug: debug your Charm++ code interactively.
Collision Detection: a highly scalable collision detection library written in Charm++.
Sorting: a highly efficient Charm++ library that can be used to sort billions of keys.
In partnership with the University of Illinois, Charmworks is the exclusive commercial licensor for the Charm++ parallel programming system and its associated tools. Licenses are offered for a wide range of needs.
Developer licenses cover the full compilation toolchain for Charm++ and mixed Charm++/MPI applications, including tools for debugging, performance analysis, and visualization, as well as runtime licenses for correctness and scaling validation.
Runtime licenses enable usage of Charm++ codes on production-scale clusters and supercomputers.
Embedded library licenses cover particular components built in Charm++ called from non-Charm++ applications.
Application distribution licenses are useful for ISVs that wish to incorporate Charm++ into their products.
Contact us today for more information on pricing and licensing options!
Our staff is available to teach courses in parallel computing, focusing on scalable algorithm design and efficient implementation using the Charm++ system. These courses can range in length from short introductions of a few hours to a
week-long hands-on tutorial. We tailor the coverage and presentation of each course to your group’s knowledge and experience.
We can provide the following consulting services:
Problem analysis and solution method development
Application Development: architecture, design, testing, debugging, verification, and validation
Performance Engineering: analysis, tuning, restructuring, and optimization
Integration with existing applications and work-flows
Cluster hardware purchasing assistance
Cloud and utility computing deployment
We offer many remote support options to make sure there are no problems:
Installation matched to your environment
Integration with your scheduler & resource manager
Rapid solutions for any bugs encountered
Contact us today for more information about our services.
Explore how Charm++ has been used in other applications.
Domain: Classical MD
Converted From: PVM
Scale: 500k CPU cores
NAMD, recipient of a 2002 Gordon Bell Award, is a parallel molecular dynamics application designed for high-performance
simulation of large biomolecular systems. NAMD is a result of many years of collaboration between Prof. Kale, Prof. Robert D. Skeel, and Prof. Klaus J. Schulten at the Theoretical and Computational Biophysics Group (TCBG) of Beckman
Charm++, developed by Prof. Kale and co-workers, simplifies parallel programming and provides automatic load balancing, which was crucial to the performance of NAMD. NAMD is used by tens of thousands of biophysical researchers with production versions installed
on most supercomputing platforms. NAMD scales to hundreds of cores for small simulations and beyond 300,000 cores for the largest simulations.
The dynamic components of NAMD are implemented in the Charm++ parallel language. A Charm++ program is composed of collections of C++ objects, which communicate by remotely invoking methods on other objects. This supports the multi-partition decompositions
in NAMD. In addition, data-driven execution adaptively overlaps communication and computation. Finally, NAMD benefits from Charm++'s load balancing framework to achieve unsurpassed parallel performance. See the PPL NAMD research page for more details.
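The idea of objects that communicate by asynchronously invoking methods on one another can be sketched with a toy, single-process scheduler. This is only a conceptual model in Python, not Charm++ code: the `Scheduler` and `Cell` classes and their method names are hypothetical, and a real runtime would distribute the objects across processors.

```python
from collections import deque

class Scheduler:
    """Toy message-driven scheduler: a queue of pending method invocations."""

    def __init__(self):
        self.queue = deque()
        self.objects = {}

    def register(self, name, obj):
        self.objects[name] = obj
        obj.sched = self

    def send(self, target, method, *args):
        # Asynchronous invocation: enqueue the call and return immediately.
        self.queue.append((target, method, args))

    def run(self):
        # Deliver messages one at a time until none remain.
        delivered = 0
        while self.queue:
            target, method, args = self.queue.popleft()
            getattr(self.objects[target], method)(*args)
            delivered += 1
        return delivered

class Cell:
    """Hypothetical object that collects values sent by its neighbors."""

    def __init__(self, value, neighbors):
        self.value = value
        self.neighbors = neighbors
        self.received = []

    def start(self):
        # Send this cell's value to each neighbor without waiting for a reply.
        for n in self.neighbors:
            self.sched.send(n, "contribute", self.value)

    def contribute(self, v):
        self.received.append(v)

sched = Scheduler()
sched.register("c0", Cell(1, ["c1"]))
sched.register("c1", Cell(2, ["c0", "c2"]))
sched.register("c2", Cell(3, ["c1"]))
for name in ("c0", "c1", "c2"):
    sched.send(name, "start")
delivered = sched.run()
print(delivered, sched.objects["c1"].received)
```

Each object runs only when a message for it is available, which is what lets a runtime interleave work from many objects while other messages are still in flight.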
Domain: N-body gravity & SPH
Converted From: MPI
Scale: 500k CPU cores
ChaNGa (Charm++ N-Body Gravity Simulator) is a cosmological simulator used to study the formation of galaxies
and other large-scale structures in the Universe. It is the result of an interdisciplinary collaboration between Prof. Kale, Prof. Thomas Quinn of the University of Washington, and Prof. Orion Lawlor of the University of Alaska Fairbanks. ChaNGa
is a production code with the features required for accurate simulation, including canonical, comoving coordinates with a symplectic integrator to efficiently handle cosmological dynamics, individual and adaptive time steps, periodic
boundary conditions using Ewald summation, and Smoothed Particle Hydrodynamics (SPH) for adiabatic gas.
ChaNGa implements the well-known Barnes-Hut algorithm, which has N log N computational complexity, organizing the particles involved in the simulation into a tree based on Oct, Orthogonal Recursive Bisection (ORB), or space-filling curve (SFC) decompositions.
In order to compute the pair interaction between particles and collections of particles on different processors, parts of the tree needed for the computation are imported from the remote processors which own them. ChaNGa uses the
Charm++ Salsa parallel visualization and analysis tool. Visualization in the context of cosmology involves a large amount of data, possibly spread over multiple processors.
ChaNGa has been scaled to 32K cores and has been ported to GPU clusters. ChaNGa is actively developed and improved, with an eye toward efficient utilization and scaling on current and future supercomputing systems.
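To make the SFC decomposition mentioned above concrete, here is a minimal sketch that orders particles along a Morton (Z-order) curve and cuts the sorted list into equal-sized contiguous pieces. It is a generic illustration, not ChaNGa's code: the function names, the 10-bit grid resolution, and the sample positions are assumptions chosen for the example.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three integer grid coordinates into one Morton key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def sfc_decompose(positions, num_pieces, bits=10):
    """Split particles into contiguous chunks along the Morton curve.

    positions: list of (x, y, z) tuples with each coordinate in [0, 1).
    Returns a list of num_pieces lists of particle indices.
    """
    cells = 1 << bits  # grid cells per dimension
    keyed = []
    for i, (x, y, z) in enumerate(positions):
        key = morton_key(int(x * cells), int(y * cells), int(z * cells), bits)
        keyed.append((key, i))
    keyed.sort()  # particles close on the curve tend to be close in space
    n = len(keyed)
    pieces = []
    for p in range(num_pieces):
        lo = p * n // num_pieces
        hi = (p + 1) * n // num_pieces
        pieces.append([i for _, i in keyed[lo:hi]])
    return pieces

# Two particles near one corner of the unit cube, two near the opposite corner.
positions = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.15, 0.1, 0.1), (0.85, 0.9, 0.9)]
pieces = sfc_decompose(positions, 2)
print(pieces)
```

Cutting the curve into contiguous ranges keeps spatially nearby particles in the same piece, which tends to reduce how much remote tree data each piece has to import.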
Domain: Agent-based epidemiology
Converted From: MPI
Scale: 500k CPU cores
The study of contagion effects in extremely large social networks, such as the spread of disease pathogens through a population, is critical to many areas of our
world.
Scaling Agent-based Simulation of Contagion Diffusion over Dynamic Networks on Petascale Machines
Applications that model dynamical systems involve large-scale, irregular graph processing. These applications are difficult
to scale due to the evolutionary nature of their workload, irregular communication, and load imbalance. The Charm++ version of EpiSimdemics is a collaborative project between PPL and Virginia Tech.
EpiSimdemics implements a graph based system that captures dynamics among co-evolving entities, while simulating contagious diffusion in extremely large and realistic social contact networks. EpiSimdemics relies on individual-based
models, thus allowing studies in great detail. The implementation of EpiSimdemics in Charm++ enables future research by social, biological and computational scientists at unprecedented data and system scales. We have presented
new methods for application-specific decomposition of graph data and predictive dynamic load migration, and demonstrated the effectiveness of these methods on Cray XE6/XK7 and IBM Blue Gene/Q.
Discrete event simulations (DES) are central to exploration of "what-if" scenarios in many domains including networks, storage devices, and chip design. Accurate simulation of dynamically varying behavior of large components in these
domains requires the DES engines to be scalable and adaptive in order to complete simulations in a reasonable time. This paper takes a step towards development of such a simulation engine by redesigning ROSS, a parallel DES engine
in MPI, in Charm++, a parallel programming framework based on the concept of message-driven migratable objects managed by an adaptive runtime system.
In the paper, we first show that the programming model of Charm++ is highly suitable for implementing a PDES engine such as ROSS. Next, the design and implementation of the Charm++ version of ROSS is described and its benefits are
discussed. Finally, we demonstrate the performance benefits of the Charm++ version of ROSS over its MPI counterpart on IBM's Blue Gene/Q supercomputers. We obtain up to 40% higher event rate for the PHOLD benchmark on two million
processes, and improve the strong scaling of the dragonfly network model to 524,288 processes with up to 5x speedup at lower process counts.
Domain: Electronic Structure
Converted From: MPI
Scale: 128k CPU cores
Many important problems in material science, chemistry, solid-state physics, and biophysics require a modeling approach based on fundamental quantum mechanical principles.
A particular approach that has proven to be relatively efficient and useful is Car-Parrinello ab initio molecular dynamics (CPAIMD). Parallelization of this approach beyond a few hundred processors is challenging, due to the complex
dependencies among various subcomputations, which lead to complex communication optimization and load balancing problems. We are parallelizing CPAIMD using Charm++. The computation is modeled using a large number of virtual processors,
which are mapped flexibly to available processors with assistance from the Charm++ runtime system.
This project began as an NSF-funded collaboration involving us (PPL: Laxmikant Kale) and Drs. Roberto Car, Michael Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom and Josep Torrellas. It then shifted to a collaborative development effort
to scale both OpenAtom and NAMD under the LCF ORNL grant "Scalable Atomistic Modeling Tools with Chemical Reactivity for Life Sciences", as a continuing collaboration with PPL: Kale on computer
science, Martyna and Tuckerman on the QM side, Klaus Schulten on the MD side, and Jack Dongarra on performance optimization for ORNL LCF. Currently, the OpenAtom project is a collaboration of Kale with Glenn Martyna and Sohrab
Domain: Relativistic MHD
Scale: 100k CPU cores
SpECTRE is a Charm++ application used for research on relativistic astrophysics. The design of SpECTRE relies on arrays of Charm++ objects to represent its Discontinuous Galerkin elements and distribute them over processors and nodes.
By taking advantage of asynchronous execution and adaptive overlap, SpECTRE has been scaled to run on the full 22,000 nodes of the Blue Waters supercomputer with excellent efficiency.
We introduce a new relativistic astrophysics code, SpECTRE, that combines a discontinuous Galerkin method with a task-based parallelism model. SpECTRE's goal is to achieve more accurate solutions for challenging relativistic astrophysics problems such as core-collapse supernovae and binary neutron star mergers. The robustness of the discontinuous Galerkin method allows for the use of high-resolution shock capturing methods in regions where (relativistic) shocks are found, while exploiting high-order accuracy in smooth regions. A task-based parallelism model allows efficient use of the largest supercomputers for problems with a heterogeneous workload over disparate spatial and temporal scales. We argue that the locality and algorithmic structure of discontinuous Galerkin methods will exhibit good scalability within a task-based parallelism framework. We demonstrate the code on a wide variety of challenging benchmark problems in (non)-relativistic (magneto)-hydrodynamics. We demonstrate the code's scalability including its strong scaling on the NCSA Blue Waters supercomputer up to the machine's full capacity of 22,380 nodes using 671,400 threads.
Converted From: MPI
Scale: 64k CPU cores
Cello is a Charm++ framework for multi-physics adaptive mesh refinement (AMR) simulations. The next generation of the Enzo cosmological hydrodynamics code, Enzo-P, is built on Cello.
The design of Enzo-P and Cello relies on fully-distributed Charm++ object array data structures. Some of the largest AMR simulations in the world have been run with Cello, and Cello has achieved almost perfect scaling on 64,000
cores of the NCSA Blue Waters supercomputer.
We present a hybrid OpenMP/Charm++ framework for solving the O(N) Self-Consistent-Field eigenvalue problem with parallelism in the strong scaling regime, P ≫ N. This result is achieved with a nested approach to Spectral Projection
and the Sparse Approximate Matrix Multiply [Bock and Challacombe, SIAM J. Sci. Comput. 35 C72, 2013], which involves an N-Body approach to occlusion and culling of negligible products in the case of matrices with decay. Employing
classic technologies associated with the N-Body programming model, including over-decomposition, recursive task parallelism, orderings that preserve locality and persistence-based load balancing, we obtain scaling better than P
∼ 500 N for small water clusters ([H2O]N, N = 30, 90, 150) and find support for an increasingly strong scalability with increasing system size, N.
Domain: Systems Hydrology
Scale: 1000 CPU cores
ADHydro is a large-scale, high-resolution, multi-physics watershed simulation created by the CI-WATER watershed modeling team. ADHydro was specifically developed for high performance
computing environments rather than single computers, allowing larger-scale and higher-resolution simulation domains. ADHydro was parallelized in October of 2014 using the Charm++ run-time environment, and runs have been completed
on the UWyo Advanced Research Computing Cluster using 512 or more cores.
This paper presents a scalable implementation of the Asynchronous Contact Mechanics (ACM) algorithm, a reliable method to simulate flexible material subject to complex collisions and contact geometries. As an example, we apply ACM
to cloth simulation for animation. The parallelization of ACM is challenging due to its highly irregular communication pattern, its need for dynamic load balancing, and its extremely fine-grained computations.
We utilize CHARM++, an adaptive parallel runtime system, to address these challenges and show good strong scaling of ACM to 384 cores for problems with fewer than 100k vertices. By comparison, the previously published shared memory
implementation only scales well to about 30 cores for the same examples. We demonstrate the scalability of our implementation through a number of examples which, to the best of our knowledge, are only feasible with the ACM algorithm.
In particular, for a simulation of 3 seconds of a cylindrical rod twisting within a cloth sheet, the simulation time is reduced by 12× from 9 hours on 30 cores to 46 minutes using our implementation on 384 cores of a Cray XC30.
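The figures quoted above can be checked with a little arithmetic. The sketch below computes the speedup between the two runs and a relative efficiency, defined here as speedup divided by the increase in core count. That definition and the helper name `speedup_and_efficiency` are assumptions made for illustration; the source reports only the run times and core counts.

```python
def speedup_and_efficiency(t_base_min, cores_base, t_new_min, cores_new):
    """Return (speedup, efficiency relative to the baseline core count)."""
    speedup = t_base_min / t_new_min
    core_ratio = cores_new / cores_base
    return speedup, speedup / core_ratio

# 9 hours on 30 cores versus 46 minutes on 384 cores, both in minutes.
speedup, efficiency = speedup_and_efficiency(9 * 60, 30, 46, 384)
print(f"speedup = {speedup:.2f}x, relative efficiency = {efficiency:.1%}")
```

Note that the baseline here is the shared-memory implementation on 30 cores, so the ratio compares two different implementations rather than measuring scaling of a single code.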
Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated
3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. This problem may be solved by using parallel processing.
In this paper, a parallel-processing framework has been developed based on frame decomposition and is programmed using the asynchronous object-oriented Charm++ paradigm. This framework can be a key step in achieving a scalable Lagrangian
measurement system for particle-tracking velocimetry and may lead to real-time measurement capabilities.
The parallel tracking algorithm was evaluated with three data sets including the particle image velocimetry standard 3D images data set #352, a uniform data set for optimal parallel performance and a computational-fluid-dynamics-generated
non-uniform data set to test trajectory reconstruction accuracy, consistency with the sequential version, and scalability to more than 500 processors. The algorithm showed strong scaling up to 512 processors, and no inherent limits
of scalability were seen. Ultimately, up to a 200-fold speedup was observed compared to the serial algorithm when 256 processors were used. The parallel algorithm is adaptable and could be easily modified to use any sequential tracking
algorithm, which inputs frames of 3D particle location data and outputs particle trajectories.