NADD Bulletin Volume VII Number 5 Article 1

Using Data to Continuously Improve Treatment Outcomes: A Conceptual Framework and Practical Guidelines.

Al Pfadt, Ph.D. & Donald J. Wheeler, Ph.D.

One of the goals of the Fifth International NADD Conference held in Boston, MA in March 2004 was to encourage more active collaboration between scientific researchers and clinical practitioners. This article considers some of the obstacles to conducting clinical research so that it is useful to the practicing clinician. It also presents practical guidelines for using a valued outcome scaling methodology to enhance the ability of treatment teams to specify the objectives they are trying to accomplish and to establish objective criteria for determining if those objectives have been met within a specific treatment interval. In this way, evidence about which treatments are effective for particular problems can be gathered in a manner that informs clinical decision-making. Such a data-based approach to treatment planning and outcome monitoring also promotes greater accountability to consumers, regulatory agencies, and funding sources.

Extensive Research Designs

Chassan (1979) described and contrasted two research designs used in clinical psychology and psychiatry. The first approach, termed the "extensive design," uses measurements of variability between groups of individuals on the phenomenon being studied as the basis for drawing conclusions about the effectiveness of treatment. To ensure that any differences between groups are due to the planned intervention, the so-called "gold standard" exemplar of this extensive research tradition involves random assignment to treatment conditions and double-blind conditions of observation: ratings/measurements are made by investigators who are unaware of participants' treatment status, and placebos are used to keep participants unaware of their treatment status as well. There are formidable logistical barriers and challenges involved in implementing double-blind, placebo-controlled studies with random assignment to treatment or control groups in naturalistic, clinical settings. Additionally, there are practical limitations to the utility of these extensive designs for the practicing clinician. Because the information derived from extensive designs applies only to group differences between average scores obtained by comparing results across treatment conditions, these outcomes do not identify which participants within each group responded or failed to respond to the treatment in question.

For illustrative purposes, consider the following two distributions of scores which might have resulted from a well-controlled study of the effects of Medication A, relative to placebo (given to Control Group B), on measurements of a clinically relevant (valid) outcome obtained using a psychometrically sound (reliable) instrument.

A comparison of the mean outcome measurements obtained for Group A (the treatment group) indicates that it is the same as the control group's mean, suggesting that the medication had no effect on the phenomenon being studied. However, closer inspection of the outcome measures obtained for individuals within the treatment group suggests that this conclusion is misleading. While the mean for the treatment group is identical to that obtained for the control group, some subjects in the treatment group clearly seem to have improved while taking Medication A (assuming that higher scores represent improvement). Three individuals (S2A, S5A and S9A) obtained outcome scores which were nearly twice as large as the average score for the control group. However, the improvement in this sub-group of subjects treated with Medication A was offset by the apparent deterioration of another sub-group (S3A, S6A and S8A), whose outcome scores were only half as large as the average obtained for the control group. This anomaly illustrates a fundamental flaw of extensive group designs, clearly articulated by Chassan:
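This pattern can be made concrete with a small numerical sketch. The scores below are invented solely to reproduce the situation described in the text (they are not data from any actual study): the two group means are identical, yet subjects S2A, S5A, and S9A score nearly twice the control average while S3A, S6A, and S8A score only half of it.

```python
# Hypothetical illustration only: equal group means masking opposite
# individual responses. Values are invented to match the pattern in the text.
control = [18, 20, 22, 19, 21, 20, 20, 19, 21, 20]    # S1B..S10B
treatment = [14, 38, 10, 14, 38, 10, 14, 10, 38, 14]  # S1A..S10A

def mean(xs):
    return sum(xs) / len(xs)

# The group comparison suggests "no effect": both means are 20.0.
print(mean(control), mean(treatment))

# But individual inspection tells a different story: three responders
# (S2A, S5A, S9A) and three decliners (S3A, S6A, S8A) cancel out.
responders = [treatment[i] for i in (1, 4, 8)]  # scores near twice the mean
decliners = [treatment[i] for i in (2, 5, 7)]   # scores near half the mean
print(responders, decliners)
```

The averaging step discards exactly the information a clinician needs: which individuals responded.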

"As a consequence of the conclusion that a statistically significant difference between groups can occur as a result of effectiveness in only a very few patients in a study and because the method is not capable of specifying the particular patients involved, results from an extensive design yield relatively little information for clinical practice concerning specific characteristics as a basis for the selection of a treatment for a given patient" (pp. 219-220).

Advocates for the extensive research tradition would counter the claim made above by pointing to the development of more sophisticated research designs that are capable of partialling out (isolating) within-group subject characteristics to address the question of who is more likely to respond to treatment. Subjects within each comparison group can be blocked (grouped together) according to age, sex, pre-treatment score, etc. to allow for a more precise determination of which subject variables are associated with particular treatment outcomes. However, each level of complexity in the research design requires the recruitment of more subjects and/or the use of more outcome measures, calling attention to another limitation of extensive research designs: they require resources that are typically not available to the practicing clinician. Consequently, they take place in artificial settings (e.g., funded research laboratories), which lack "ecological validity" (due to the type of subjects recruited and/or reactivity due to the continuous presence of trained raters/observers during treatment). This not only limits the extent to which findings can be generalized from the "research laboratory" back to the "real world," it also creates an almost unbridgeable gap between the scientific investigator and the practicing clinician, since the "messy conditions" of clinical practice do not lend themselves to implementation of extensive research designs "in the trenches." One reason the results of well-controlled clinical trials do not get implemented may be that they are seen as out of touch with the reality of practicing clinicians.

Intensive Research Designs: Focusing on graphic representation of single subject data

Intensive research designs provide a more user-friendly alternative to the extensive design approach described above, and they can be employed by the practicing clinician as part of routine patient treatment. Chassan observes that "a shift from the extensive to the intensive model of the design of clinical research depends upon the acceptance of a general underlying variability within the subject, or patient, with respect to his day to day, or week to week evidences of behavior, affect, signs and symptoms" (p. 227). Whereas with an extensive design the relevant comparisons are made across groups of patients, intensive designs test hypotheses about treatment outcomes by comparing measurements of the same person's behavior obtained under different conditions.

In an extensive design, inter-individual variability is the primary source of error, and intra-individual variability is not analyzed; in an intensive design, by contrast, intra-individual variability is the primary source of information. Typically, an intensive design begins with measurements of the individual's behavior under pre-treatment (baseline) conditions for a sufficiently long interval to establish a benchmark against which to evaluate treatment outcomes. The effectiveness of treatment is determined by the extent to which it can be shown that the results of the intervention are significantly different from those which would have been expected if the intervention had not taken place. Both types of designs require estimates of probability. The extensive design calls for an estimate of the likelihood that differences between the group means for the treatment and control groups could have been obtained by chance. The intensive design calls for an estimate of the likelihood that the individual's post-treatment performance would have occurred if the baseline conditions had remained in effect. One reason intensive research designs do not have more scientific credibility is that there is no clear consensus among practitioners about the rules of evidence for determining whether a significant change occurred as a result of treatment. One group of practitioners who use intensive research designs advocates the use of probabilistic, statistical models to quantify the level of significance of treatment outcomes (that is, to estimate precisely the likelihood that the observed differences between a subject's baseline and post-treatment functioning could be due to chance, or random, variation).

Chassan himself exemplifies this tradition, and several volumes illustrating how probabilistic, statistical models can be used in single subject research designs have appeared since 1979 (see Franklin, Allison, & Gorman, 1997; Krishef, 1991). However, there has been contentious debate even among practitioners committed to this research tradition about whether the assumptions required to utilize these models (e.g., lack of significant autocorrelation among scores) can be met.

Another school of thought within the intensive research design tradition is represented by most applied behavior analysts, who de-emphasize the importance of obtaining "statistically significant results" in favor of visually analyzing graphic displays of a subject's baseline and post-intervention data to determine whether they display "clinical significance." This, however, introduces an indeterminate amount of subjectivity into the analysis of outcome measures that were collected as objectively as possible precisely to guard against such subjectivity.

Fortunately, there is a way to combine within an intensive design the advantages of a fine-grained visual analysis of outcome data with the use of objective operationally defined criteria for determining if a "significant change" has taken place during treatment. The remainder of this article will illustrate the application of statistically derived criteria to operationally define the types of outcomes obtained during treatment of an individual client.

Guidelines for Constructing and Using Valued Outcome Scales

Continuous quality improvement in human services has been described by Sperry et al. (1996) as "the systematic analysis of service quality indicators for the purpose of optimizing service delivery programs and procedures" (p. 115). Readers interested in learning more about how the data analytic and problem solving tools developed for quality improvement have been applied in health care settings should review the detailed descriptions provided by Pfadt and Wheeler (1995). Hawkins, Matthews, and Hamdan (1999) listed six ways that continuous, quantitative outcome measures promote optimal clinical effectiveness: 1) they force the change agent to specify more clearly exactly which changes are necessary to improve functioning; 2) if these descriptions of anticipated behavioral changes are written down, they improve communication among the change agent, the consumer of services, and other interested parties; 3) analysis of quantitative data helps the change agent identify variables which contribute to the consumer's problems; 4) quantitative data can be visually displayed, which guides clinical decision-making by informing the change agent about the "trajectory of change" (the direction and pace of treatment outcomes); 5) the process of precisely specifying treatment outcomes fosters analytic thinking and creative problem solving; and 6) the change agent is rewarded by the visual display of successful outcomes and is motivated to modify interventions producing unsuccessful outcomes.

The Valued Outcome Scaling procedures described below are a modification of those promoted by Goal Attainment Scaling (Kiresuk, Smith, & Cardillo, 1994). They have all of the benefits of continuous, quantitative data analysis described above, if precise, objective outcome measures are used to construct the scales, as illustrated below. The principal difference between this approach and conventional Goal Attainment Scaling is that the consumer's baseline level of functioning is used to anchor the outcome rating of "No Change," whereas a 0 rating in Goal Attainment Scaling is assigned to the most likely positive outcome. Using a 0 rating to indicate that the consumer continues to function at the baseline level at the end of the scaling interval is more consistent with routine clinical practice, particularly within the behavior analytic tradition (see Pfadt, 1999). This approach also eliminates a major source of confusion the author encountered when trying to teach others to use Goal Attainment Scaling.

Valued Outcome Scales are constructed to provide a structured framework for answering the three questions posed by Wheeler (2003) which drive the Continuous Quality Improvement process: 1) What outcomes are you trying to accomplish? 2) By what method? 3) How will you know if you are successful? Continuous quality improvement is accomplished by involving the consumer as an active participant in the process of choosing valued outcomes and in spelling out the criteria for determining to what extent these outcomes have been accomplished, as well as in determining what methods will be used to achieve these outcomes. Constructing a Valued Outcome Scale helps the treatment team focus on the outcomes that are clinically relevant for a specific consumer. Since the same scaling format is used to scale all outcomes, this scaling methodology provides a common language for comparing the results obtained for the same person across different domains (e.g., symptom reduction as well as skill acquisition). Results can also be meaningfully compared across different individuals who are treated with the same or even different methods. This point is discussed by Kiresuk, Smith, and Cardillo (1994) in their handbook on Goal Attainment Scaling.

Carefully choose behavioral indicators that accurately reflect the person's current level of functioning when the valued outcome scale is constructed and carefully word the criteria to be used at the review date for determining what type of outcome was accomplished. This makes it possible to provide valuable feedback for developing a more effective intervention strategy (changing the method) or changing the expectation of the team and the focus person (choosing different criteria for scaling the valued outcomes). The procedures for constructing and completing the valued outcome scales are described below.

1. Meet with the consumer, his/her service coordinator, and other representatives of the person's treatment team to select a valued outcome from the individual's service plan (ISP) that will be the focus for outcome scaling. This valued outcome should reflect something that the person is strongly motivated to accomplish, but it can also be an outcome that the team feels is important for the individual. The latter usually is directed towards reducing the frequency or intensity of challenging behaviors. As a rule of thumb, it is advisable to teach at least one replacement skill for each challenging behavior or psychiatric symptom you are trying to reduce or eliminate. Be realistic in choosing an outcome that can be accomplished within the time frame specified by the anticipated review date.

2. Choose a descriptive title for the valued outcome that reflects what the individual or the team wants to accomplish.

3. Identify behavioral indicators for the valued outcome that clearly reflect the intended accomplishment and also reflect progress towards attaining the criteria for improvement or loss of functioning, which are specified in the right-hand column of the form. These behavioral indicators should be worded so that everyone can agree about whether or not they are present when the scale is reviewed.

4. Describe the person's current level of functioning as objectively as possible, using the behavioral indicators as anchor points. For example, a valued outcome might be "controlling self-injurious behaviors." In that case, behavioral indicators could be the frequency (number of incidents per day or per week), the intensity (how long a bout of self-injury lasts or how much tissue damage occurs), or the amount of staff assistance required to redirect the person to another activity (verbal prompts, or the use of physical interventions when verbal redirection is not successful).

5. Within the time frame specified for working on this valued outcome, spell out the criteria for determining what level of improvement occurred at the end of that interval. If quantitative data are available, measurements of the behavioral indicators can be used to establish the person's baseline level of functioning and to provide fairly precise operational definitions of the otherwise vague terms "Slight," "Moderate," and "Significant" levels of improvement used in the Valued Outcome Scale. For example, suppose that a reduction in the duration of agitated outbursts involving self-injurious behaviors, aggression towards others, and/or property destruction was chosen as the valued outcome. Measurements of the duration (in minutes) of incidents in which these behaviors occurred with sufficient intensity to warrant verbal redirection or the use of specified physical interventions on the part of staff to prevent injury or property destruction could be readily obtained for a baseline period of 2-4 weeks by having staff record the time of onset and cessation of each incident. For illustrative purposes, suppose that 10 incidents meeting this operational definition occurred during a 4-week baseline interval, lasting 40, 35, 40, 30, 35, 25, 35, 30, 40, and 35 minutes respectively. The baseline level of functioning for this valued outcome could be characterized by the mean (34.5 minutes) and the patterns of variation determined for this time series and displayed on the process behavior chart shown in Figure 1. The decision rules described by Pfadt (1999), Pfadt and Wheeler (1995), and Wheeler (2003) could then be used to provide precise anchor points and clear operational definitions for specifying the type of change which took place during the treatment interval (the length of time between the date the scale was constructed and the actual review date).
A "Slight Improvement" would be indicated by a run of 7 consecutive values at the end of the treatment interval below the mean for the baseline time series. A "Moderate Improvement" would be indicated by 3 of 4 consecutive values in the lower 25% region between the process limits shown in Figure 1 (values between 15.3 and 24.9 minutes). The criteria for a "Significant Improvement" would be met if at least one value fell below the lower process limit (a duration of 15 minutes or less). Again, for the sake of an illustrative example, assume that the last 10 values obtained during the treatment phase were durations of 15, 20, 30, 30, 25, 30, 20, 20, 25, and 10 minutes respectively. Inspection of Figure 1 shows that this represents a "Significant Improvement," since at least one of the values (the final 10-minute duration) is below the lower process limit (15.3 minutes). Furthermore, in this case the criteria for a "Moderate Improvement" are also present (3 of 4 consecutive values below the 25% boundary of 24.9 minutes), and the criteria for a "Slight Improvement" (at least 7 consecutive values below the baseline mean of 34.5 minutes) are also met. A "Slight Loss of Functioning" would have been indicated if 7 consecutive values were above the baseline mean. "Moderate/Significant Loss of Functioning" would have been checked if 3 of 4 consecutive values were above the 75% boundary of 44.1 minutes, or if at least 1 value had been greater than the upper process limit of 53.7 minutes. "No Change" relative to baseline would be checked if all points for the treatment phase were within the upper and lower process limits, with no other signs of variation due to special, assignable causes (see Pfadt, 1999, for examples of how control charts can be used to determine if baselines are stable).
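The arithmetic behind these limits can be sketched as follows, using the baseline and treatment durations from the example above. This is a minimal illustration of the standard XmR (individuals) chart calculation, assuming the conventional 2.66 scaling factor applied to the average moving range; it is not the authors' software.

```python
# XmR (process behavior) chart sketch for the worked example in the text.
baseline = [40, 35, 40, 30, 35, 25, 35, 30, 40, 35]   # minutes per incident
treatment = [15, 20, 30, 30, 25, 30, 20, 20, 25, 10]

mean = sum(baseline) / len(baseline)                             # 34.5
moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
avg_mr = sum(moving_ranges) / len(moving_ranges)                 # ~7.22

# Natural process limits: mean +/- 2.66 * average moving range.
lcl = mean - 2.66 * avg_mr   # lower process limit, ~15.3 minutes
ucl = mean + 2.66 * avg_mr   # upper process limit, ~53.7 minutes
zone_25 = lcl + 0.25 * (ucl - lcl)   # ~24.9, lower-quarter boundary
zone_75 = lcl + 0.75 * (ucl - lcl)   # ~44.1, upper-quarter boundary

# Decision rules from the text, applied to the treatment-phase values
# (lower durations represent improvement):
significant = any(x < lcl for x in treatment)       # point beyond lower limit
moderate = any(sum(x < zone_25 for x in treatment[i:i + 4]) >= 3
               for i in range(len(treatment) - 3))  # 3 of 4 in lower quarter
slight = any(all(x < mean for x in treatment[i:i + 7])
             for i in range(len(treatment) - 6))    # run of 7 below the mean

print(round(lcl, 1), round(ucl, 1), significant, moderate, slight)
```

Running this reproduces the limits quoted in the text (15.3 and 53.7 minutes, with zone boundaries at 24.9 and 44.1) and confirms that all three improvement criteria are met by the treatment-phase data.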

If precise quantitative data are not available, it will be necessary to use more arbitrary, subjective criteria to create operational definitions for characterizing the type of change that occurred during treatment. Nevertheless, by spelling out these criteria for quantifying level of improvement or deterioration at the beginning of treatment, it is possible to reduce some of the bias present when results are subjectively evaluated in the traditional, retrospective manner.

6. At the time of the actual review date, meet with the team and consumer to determine what type of outcome was accomplished, by comparing the behavioral indicators describing the person's level of functioning at the time the valued outcome scale was constructed with those describing the person's level of functioning when the scale is completed. A score of +3 indicates that the criteria for a "Significant Improvement" were met. Lesser degrees of improvement result in lower scale values (+2 is assigned when the criteria for Moderate Improvement are met; +1 indicates that there was Slight, but noticeable, Improvement). A score of -1 indicates that there was a Slight Loss of Functioning, while a score of -2 indicates that there was a Moderate/Significant Loss of Functioning. A score of 0 indicates that the consumer continues to display behavioral indicators of the valued outcome which are within the range of random variation observed during baseline; that is, there are no signs that special causes of variation were introduced by the treatment methods employed during the scaling interval.
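The scoring scheme in this step can be sketched as a small function. The function name and its boolean parameters are hypothetical, introduced only to make the +3 to -2 mapping explicit; the inputs correspond to the decision-rule criteria described for the process behavior chart.

```python
# Hypothetical helper (not from the article) mapping the outcome criteria
# onto the Valued Outcome score of +3 .. -2.
def valued_outcome_score(significant_gain, moderate_gain, slight_gain,
                         slight_loss, moderate_or_significant_loss):
    """Return the scale value; 0 means functioning stayed within the
    range of routine (baseline) variation."""
    if significant_gain:
        return 3
    if moderate_gain:
        return 2
    if slight_gain:
        return 1
    if moderate_or_significant_loss:
        return -2
    if slight_loss:
        return -1
    return 0  # all points within the process limits: no change

# In the worked example all three improvement criteria were met, so the
# highest applicable rating, +3, is recorded at the review date.
print(valued_outcome_score(True, True, True, False, False))
```

Ordering the checks from strongest to weakest mirrors the convention in the text: when several criteria are met at once, the review records the highest level of change attained.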

7. Depending on the outcome accomplished during this scaling interval, the team and the focus person might want to change the method/intervention (if sub-optimal outcomes were accomplished), continue the same methods and construct a new scale (if the treatment was effective but did not yet accomplish the valued outcome), or focus on a different valued outcome (if the basic objectives were accomplished).

Use of intensive research designs helps the practicing clinician contribute to the scientific knowledge base required to identify effective treatment strategies for particular types of presenting problems. The data-based approach to decision making fostered by methods such as Valued Outcome Scaling provides clinicians and treatment teams with the timely information necessary to improve the quality of treatment for each client. It also helps to promote greater accountability to the individuals we serve, as well as to regulatory and funding agencies.



References

Chassan, J. B. (1979). Research design in clinical psychology and psychiatry (2nd ed.). New York: Irvington Publishers.

Franklin, R. D., Allison, D. B., & Gorman, B. S. (1997). Design and analysis of single-case research. Mahwah, NJ: Lawrence Erlbaum.

Hawkins, R. P., Matthews, J. R., & Hamdan, L. (1999). Measuring behavioral health outcomes. New York: Kluwer Academic/Plenum Publishers.

Kiresuk, T. J., Smith, A., & Cardillo, J. E. (1994). Goal attainment scaling. Hillsdale, NJ: Lawrence Erlbaum.

Krishef, C. H. (1991). Fundamental approaches to single subject design and analysis. Malabar, FL: Krieger Publishing Co.

Pfadt, A. (1999). Using control charts to analyze baseline stability. Journal of Organizational Behavior Management, 18, 53-60.

Pfadt, A., & Wheeler, D. J. (1995). Using statistical process control to make data-based clinical decisions. Journal of Applied Behavior Analysis, 28, 349-370.

Sperry, L., Brill, P. L., Howard, K. I., & Grissom, G. R. (1996). Treatment outcomes in psychotherapy and psychiatric interventions. NY: Brunner/Mazel.

Wheeler, D. J. (2003). Making sense of data. Knoxville, TN: SPC Press.
