Issue 24 | Spring 1999 |
Social Research Update is published quarterly by the Department of Sociology, University of Surrey, Guildford GU2 7XH, England. Subscriptions for the hardcopy version are free to researchers with addresses in the UK. Apply by email to [email protected].
Optimal Matching Analysis
Tak Wing Chan is a lecturer in Sociology at the University of Surrey. His research interests include social stratification and the life course. His most recent publication is 'Revolving Doors Reexamined: Occupational Sex Segregation Over the Life Course', American Sociological Review, 1999, vol. 64, pp. 86-96.
Email: [email protected]
- Optimal Matching Analysis is a technique for the analysis of sequence data. It takes into account the full complexity of sequences.
- It uses an iterative minimisation procedure to find the distance between every pair of sequences in a sample.
- Cluster analysis can then be applied to the distance matrix to ascertain if the sequences belong to distinct types drawn from a typology.
- The typology, if identified, can be used as an independent or dependent variable in further analysis.
Many interesting sociological questions are about process: how do observed social regularities come about? For example, mobility researchers have established that there is a cross-nationally common and over-time persistent mobility pattern, in which people from advantaged class backgrounds enjoy better relative mobility chances (Erikson and Goldthorpe 1992). This is an important finding, but we may further ask: how does this pattern come about? Bearing in mind that social mobility is for many people a circuitous process, it is important to ask: how does the multifaceted career process - job-changing, promotion, going into and out of the labour market, etc. - allocate people to different positions in the class structure, in a way which ultimately produces the observed mobility regime?
Clearly, to address questions about social process, we need both longitudinal data relevant to the process of interest and technical tools to analyse those data. Longitudinal data are available (the British Household Panel Study is a prime source for the UK), and we do have a range of analytical tools at our disposal (Dale and Davis 1994).
Of the various tools for longitudinal data analysis, event history models are probably the most useful (Blossfeld and Rohwer 1995). They allow researchers to test how the transition rate of moving from one state (e.g. unemployed) to another (employed) is affected by other variables. Researchers can include previous (employment) history as a covariate. They can also include predictors which measure what is happening in other life domains (e.g. education, family), thus helping us understand the interweaving of multiple processes (see e.g. Blossfeld 1995).
However, event history models invariably consider one transition at a time. If the process in question is repeatable - as in the case when someone moves from being unemployed to employed, back to unemployment, and then finds a job again - the researcher will have to model each transition separately, with the covariates suitably updated each time. This is reasonable for many purposes. But in some cases, we may want to consider the whole sequence of multiple transitions simultaneously, because we want to have a sense of how the individual steps fit together, or because we think that the sequence taken as a whole tells us something important. If this is the case, optimal matching analysis (also known as optimal alignment, hereafter as OMA) is a potentially useful tool.
Background
Interest in event sequences is not new. Hogan (1978), for example, studied the temporal order of three events - finishing school, getting a first job, and getting married - in the transition to adulthood in the United States. He constructed a typology of typical (and normatively sanctioned) and atypical sequences. He then investigated the prevalence of the various sequence-types, the determinants of following one sequence-type as opposed to another, and so on. The analysis is informative. However, because appropriate tools were not available, Hogan could only consider a limited number of events, and the timing of events was not taken into account. Person A, who finished school, got a job immediately, and got married within a year, would be considered to belong to the same sequence-type as person B, who took a year to find his first job after finishing school and stayed unmarried for the next ten years.

OMA is a family of procedures that takes into account the full complexity of sequence data. The objective is to identify a typology of sequences empirically. It was developed by, among others, molecular biologists who study protein or DNA sequences. Biologists often wish to find out how different a particular DNA sequence is from other sequences; knowing this helps them reconstruct the evolutionary tree. There are of course other applications of OMA in biology, other sciences and engineering (Waterman 1995, Sankoff and Kruskal 1983). The technique was introduced into sociology by Andrew Abbott, who has applied OMA to a range of sociological issues, such as the development of the welfare state and musicians' careers (Abbott 1995, Abbott and Hrycak 1990, Stovel, Savage and Bearman 1996).
The core of OMA is a two-step procedure. First, given a set of sequences, find the distance between every pair of sequences through an iterative minimisation procedure. This gives the researcher a distance matrix for all the sequences. Secondly, apply clustering or cognate procedures to the distance matrix to ascertain if the sequences fall into distinct types. The typology, if found, can be used as an independent variable (Han and Moen 1998) or dependent variable (Chan 1995) in further research.
The OMA algorithm
The second (clustering) step should be familiar to readers, though in the context of OMA it may run into practical problems. This section explains the iterative minimisation procedure. The objective of this first step is to find, for each pair of sequences in the sample, the lowest cost needed to turn one sequence into the other, using three elementary operations: insertion, deletion, and substitution. A simple example will illustrate this idea. Consider a hypothetical organisation which hires people as clerks (c) or supervisors (s). At the end of each year, the performance of each clerk is reviewed, and he or she may be promoted to become a supervisor. Here are two sequences, representing the career histories of two persons.

A:ccccs
B:csss
B is the high-flyer: he was promoted to become a supervisor after one year, while it took A four years to do so. There are several ways to turn sequence A into sequence B. For example, we can delete the first three c's at the beginning of sequence A and then insert two s's at the end. This can be represented as follows:
A:ccccsff
B:fffcsss
Note that f stands for an empty state - think of it as a place-holder. When it appears on the bottom line, it means that the corresponding element in sequence A is deleted. When it appears on the top line, it means that the corresponding element in sequence B is inserted into sequence A. So this matching involves three deletions and two insertions. Depending on how much a deletion/insertion costs (see below), we can put a numerical value on this particular matching.
Alternatively, we can substitute s for c for the second, third and fourth elements of sequence A, and then delete the last s.
A:cssss
B:csssf
This matching involves three substitutions and one deletion. Again, given the costs we assign to substitution and insertion/deletion, we can attach a numerical value to this matching. What OMA does is to consider all possible ways of matching the two sequences and find the cheapest one.
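Scoring a given matching is simple arithmetic, and can be sketched in a few lines of code. This is an illustrative sketch only, not the code of any OMA program: the function name and the unit costs (insertion/deletion = 1, substitution = 1) are assumptions.

```python
def alignment_cost(top, bottom, indel=1, sub=1):
    """Cost of one particular matching, with 'f' marking the empty
    (place-holder) state.  The indel and sub costs are illustrative
    assumptions; OMA itself searches over all possible matchings."""
    cost = 0
    for a, b in zip(top, bottom):
        if a == "f" or b == "f":      # an insertion or a deletion
            cost += indel
        elif a != b:                  # a substitution
            cost += sub
    return cost

# The two matchings of A = ccccs and B = csss discussed above:
print(alignment_cost("ccccsff", "fffcsss"))  # 3 deletions + 2 insertions = 5
print(alignment_cost("ccccs", "csssf"))      # 3 substitutions + 1 deletion = 4
```

Under these assumed costs the second matching is cheaper, but neither is guaranteed to be the cheapest possible; establishing that is the job of the minimisation procedure.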
Two issues follow immediately. First, the assignment of substitution and insertion/deletion costs is a critical issue, as they determine the matching outcome. How do we set these costs? Secondly, as sequences become longer, and as the number of states we wish to distinguish increases (say, we want to distinguish not just clerks and supervisors, but also branch managers, and senior managers), eyeballing, as a way of finding alignment, will soon fail. A systematic procedure is needed.
Cost assignment is a key point where theoretical consideration comes in. Substitution costs are often set to reflect the relative distance among the states distinguished. For example, in terms of pay and authority, clerks are closer to supervisors than they are to managers. Thus, it seems reasonable to set a lower substitution cost for supervisor and clerk than for manager and clerk. Researchers must rely on theoretical reasoning and external evidence to justify their cost assignment.
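Such theory-driven costs can be written down explicitly, for instance as a lookup table. The three states and the particular numbers below are assumptions, chosen only to encode the ordering argued above (clerk closer to supervisor than to manager).

```python
# Assumed substitution costs for three states: clerk (c),
# supervisor (s), manager (m).  The numbers are illustrative only;
# what matters is that c-s substitutions cost less than c-m ones.
SUBSTITUTION_COST = {
    ("c", "c"): 0, ("s", "s"): 0, ("m", "m"): 0,
    ("c", "s"): 1, ("s", "c"): 1,   # adjacent grades: cheap to substitute
    ("s", "m"): 1, ("m", "s"): 1,
    ("c", "m"): 2, ("m", "c"): 2,   # two grades apart: dearer
}

def sub_cost(a, b):
    """Look up the cost of substituting state b for state a."""
    return SUBSTITUTION_COST[(a, b)]
```

A table like this would be passed to the minimisation routine in place of a flat substitution cost, so that the resulting distances reflect the researcher's theoretical reasoning.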
The second issue concerns the iterative minimisation procedure itself. The discussion here has to be schematic; readers can find detailed and formal treatments of the algorithm in Waterman (1995), Sankoff and Kruskal (1983), or the documentation of the programs cited below. In general, given a finite state space S and two sequences a = (a1 ... am) and b = (b1 ... bn), whose lengths m and n need not be equal and whose elements are drawn from S, the algorithm works by first creating an (m+1)×(n+1) matrix. To turn sequence a into sequence b from left to right is to move from the top left-hand cell of the matrix to the bottom right-hand cell (Figure 1). There are many paths one can take, but each path is built from a combination of moving down a row (equivalent to deleting the corresponding element of a), moving right across a column (equivalent to inserting the corresponding element of b), and moving diagonally (equivalent to making a substitution).
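The matrix computation just described can be sketched as a short dynamic-programming routine. This is a minimal illustration with assumed unit costs (indel = 1, substitution = 1), not the code of the programs cited below.

```python
def oma_distance(a, b, indel=1, sub=1):
    """Cheapest cost of turning sequence a into sequence b, computed
    over the (m+1) x (n+1) matrix described in the text.  The indel
    and substitution costs are illustrative assumptions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel            # delete everything: move straight down
    for j in range(1, n + 1):
        d[0][j] = j * indel            # insert everything: move straight right
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,   # down a row: delete a[i-1]
                d[i][j - 1] + indel,   # right a column: insert b[j-1]
                d[i - 1][j - 1]        # diagonal: substitute (free if equal)
                + (0 if a[i - 1] == b[j - 1] else sub),
            )
    return d[m][n]

print(oma_distance("ccccs", "csss"))  # 3: one deletion plus two substitutions
```

Note that under these unit costs the cheapest matching of the two careers above (one deletion and two substitutions, cost 3) is cheaper than either of the two matchings worked out by hand, which is exactly why a systematic search is needed.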
Potential problems
Many popular programs support cluster analysis, so this second step of OMA should be straightforward. However, readers who have experience with cluster analysis will be aware that (a) it is a relatively simple procedure, not supported by a large body of statistical reasoning, and (b) different clustering methods may produce different solutions for the same data (Aldenderfer and Blashfield 1984). My experience is that with a large number of sequences, problems of cluster identification are likely. In such a situation, the researcher may be tempted to impose some relatively arbitrary rules and consider, say, the top four or five clusters. It seems to me that this comes perilously close to using OMA as a sausage machine.

More importantly, it can be argued that the cluster identification problem is not an incidental issue traceable to cluster analysis. What OMA tries to do is to summarise a potentially very complex sequence pattern with just one categorical variable - the different sequence-types are the values of this variable. This should be reasonable in many situations, but perhaps not in all. In other words, it may be that in certain situations the sequences simply do not fall neatly into a clean typology. This is a cautionary note for potential OMA users.
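The two steps can nonetheless be strung together in a small end-to-end sketch. Everything here is an assumption for illustration: the five hypothetical careers, the unit costs, the naive single-linkage rule, and the decision to stop at two clusters. A real analysis would hand the distance matrix to a standard cluster-analysis routine and face the identification problems discussed above.

```python
def edit_distance(a, b, indel=1, sub=1):
    """Step 1: minimum matching cost, with assumed unit costs."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + indel, cur[j - 1] + indel,
                           prev[j - 1] + (0 if x == y else sub)))
        prev = cur
    return prev[-1]

# Five hypothetical clerk/supervisor careers (assumed data).
careers = ["ccccs", "csss", "cssss", "ccccc", "cccc"]
n = len(careers)
dist = [[edit_distance(a, b) for b in careers] for a in careers]

# Step 2: naive single-linkage agglomeration on the distance matrix,
# merging the two closest clusters until two remain (assumed rule).
clusters = [{i} for i in range(n)]
while len(clusters) > 2:
    p, q = min(
        ((p, q) for p in range(len(clusters))
                for q in range(p + 1, len(clusters))),
        key=lambda pq: min(dist[i][j]
                           for i in clusters[pq[0]] for j in clusters[pq[1]]),
    )
    clusters[p] |= clusters[q]
    del clusters[q]

print([sorted(c) for c in clusters])  # -> [[0, 3, 4], [1, 2]]
```

With these assumed data the sketch separates the slow-promotion careers (indices 0, 3 and 4) from the fast-promotion ones (1 and 2), which is the kind of empirically derived typology OMA aims at.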
Summary
OMA is a valuable addition to the set of analytical tools for longitudinal data analysis. It may help researchers develop a typology of sequences, fully recognising the complexity of sequence data - event order, timing, and so on. Specialised programs are becoming available (see below). Of course, much work still needs to be done. One important task is to relate OMA to the large body of conventional statistical methods and thinking (Halpin and Chan 1998).

Two OMA Programs
- Optimize is freeware developed by Andrew Abbott, see http://www.spc.uchicago.edu/users/abbot/optimize.htm
- TDA 6.2b is also freeware, distributed under the terms of the GNU General Public Licence. TDA was originally developed for event history models, see http://steinhaus.stat.ruhr-uni-bochum.de/
References
Abbott, A. (1995) Sequence Analysis: New Methods for Old Ideas, Annual Review of Sociology, 21:93-113.
Abbott, A. and A. Hrycak (1990) Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians' Careers, American Journal of Sociology, 96:144-185.
Aldenderfer, M.S. and R.K. Blashfield (1984) Cluster Analysis, Newbury Park, Ca.: Sage.
Blossfeld, H-P., ed. (1995) The New Role of Women: Family Formation in Modern Societies, Boulder, CO.: Westview Press.
Blossfeld, H-P. and G. Rohwer (1995) Techniques of Event History Modeling, Mahwah, NJ: Lawrence Erlbaum.
Chan, T.W. (1995) Optimal Matching Analysis: A Methodological Note on Studying Career Mobility, Work and Occupations, 22:467-490.
Dale, A. and R.B. Davis, eds. (1994) Analyzing Social and Political Change: A Casebook of Methods, London: Sage.
Erikson, R. and J.H. Goldthorpe (1992) The Constant Flux: A Study of Class Mobility in Industrial Societies, Oxford: Clarendon Press.
Han, S-K. and P. Moen (1998) Clocking Out: Multiplex Time in Retirement, BLCC Working Paper #98-3, Cornell University.
Halpin, B. and T.W. Chan (1998) Class Careers as Sequences: An Optimal Matching Analysis of Work-Life Histories, European Sociological Review, 14:111-130.
Hogan, D.P. (1978) The Variable Order of Events in the Life Course, American Sociological Review, 43:573-586.
Sankoff, D. and J.B. Kruskal, eds. (1983) Time Warps, String Edits, and Macro-Molecules: The Theory and Practice of Sequence Comparison, Reading: Addison-Wesley.
Stovel, K., M. Savage and P. Bearman (1996) Ascription into Achievement: Models of Career Systems at Lloyds Bank, 1890-1970, American Journal of Sociology, 102:358-399.
Waterman, M.S. (1995) Introduction to Computational Biology, London: Chapman and Hall.
Social Research Update is published by:
Department of Sociology
University of Surrey
Guildford GU2 7XH
United Kingdom.
Telephone: +44 (0) 1 483 300800
Fax: +44 (0) 1 483 689551
Edited by Nigel Gilbert.
Spring 1999 © University of Surrey
Permission is granted to reproduce this issue of Social Research Update provided that no charge is made other than for the cost of reproduction and this panel acknowledging copyright is included with all copies.