###
Parallel Family Trees for Transfer Matrices in the Potts Model

####
Cristobal Navarro, Fabrizio Canfora, Nancy Hitschfeld, and Gonzalo
Navarro

The computational cost of transfer matrix methods for the Potts model is
related to the problem of *into how many ways can two layers of a lattice
be connected*. Answering this question leads to the generation of a
combinatorial set of lattice configurations. This set defines the
*configuration space* of the problem, and the smaller it is, the
faster the transfer matrix method can be computed. The configuration space
of generic transfer matrix methods for strip lattices in the Potts model is
in the order of the Catalan numbers, leading to an asymptotic cost of
*O(4^m)* with *m* being the width of the strip. Transfer matrix
methods with a smaller configuration space indeed exist but they make
assumptions on the temperature, number of spin states, or restrict the
structure of the lattice in order to work. In this paper we propose a parallel
algorithm that uses a sub-Catalan configuration space of *O(3^m)* to
build the generic *(q,v)* transfer matrix. The improvement is achieved by
grouping the original set of Catalan configurations into a forest of family
trees, in such a way that the solution to the problem is now computed by just
solving the root node of each family. As a result, the algorithm becomes
exponentially faster than the Catalan approach while still highly parallel.
An additional advantage of the method is that the final matrix can be saved in
a compressed form using *O(3^m x 4^4)* of space, making numerical
evaluation of *(q,v)* faster than in the Catalan transfer matrix, which
is *O(4^m x 4^m)* in size. Experimental results for different sizes
of strip lattices show that the *Parallel family trees (PFT)* strategy
indeed runs exponentially faster than the *Catalan Parallel Method (CPM)*,
especially when dealing with dense transfer matrices. In terms of parallel
performance, we report strong-scaling speedups of up to 5.7X when running on
an 8-core shared memory machine and 28X for a 32-core cluster. The best
balance of speedup and efficiency for the multi-core machine was achieved when
using *p* = 4 processors, while for the cluster scenario it was in the
range *p in* [8, 10]. Because of the parallel capabilities of the
algorithm, a large-scale execution of PFT in a supercomputer could allow
the study of wider strip lattices and give new insights on the physical
properties of strips.