The bird’s-eye view: A data-driven approach to understanding patient journeys from claims data

Abstract

Objective
In preference-sensitive conditions such as back pain, there can be high levels of variability in the trajectory of patient care. We sought to develop a methodology that extracts a realistic and comprehensive understanding of the patient journey using medical and pharmaceutical insurance claims data.

Materials and Methods
We processed a sample of 10 000 patient episodes (comprised of 113 215 back pain–related claims) into strings of characters, where each letter corresponds to a distinct encounter with the healthcare system. We customized the Levenshtein edit distance algorithm to evaluate the level of similarity between each pair of episodes based on both their content (types of events) and ordering (sequence of events). We then used clustering to extract the main variations of the patient journey.

Results
The algorithm resulted in 12 comprehensive and clinically distinct patterns (clusters) of patient journeys that represent the main ways patients are diagnosed and treated for back pain. We further characterized demographic and utilization metrics for each cluster and observed clear differentiation between the clusters in terms of both clinical content and patient characteristics.

Discussion
Despite being a complex and often noisy data source, administrative claims provide a unique longitudinal overview of patient care across multiple service providers and locations. This methodology leverages claims to capture a data-driven understanding of how patients traverse the healthcare system.

Conclusions
When tailored to various conditions and patient settings, this methodology can provide accurate overviews of patient journeys and facilitate a shift toward high-quality practice patterns.
Keywords: claims data, patient journey, clustering, edit distance, sequence alignment

INTRODUCTION

Medical researchers have long pointed to the importance of understanding the realistic picture of the patient journey: the chronological sequence of how a patient seeks and receives care from the healthcare system.1,2 Capturing an accurate overview of the patient journey can help identify sources of variability, evaluate why patients respond differently to the same overarching treatment plan, and compare how actual realizations of the treatment plan differ from standard clinical guidelines. However, in a fragmented healthcare system, it can be difficult to derive a comprehensive understanding of patient journeys based on real utilization patterns.

Understanding the patient journey is especially important for highly variable, preference-sensitive conditions such as back pain.3–5 Because back pain has numerous clinically acceptable therapeutic options, the trajectory of patient care can be highly variable and influenced by the severity of the condition, access to healthcare services, provider preferences, and the patient’s medical history.5–7 Adding to this complexity, treatment for back pain often occurs across service locations (eg, primary care, emergency services, physical therapy).

While significant effort has been placed on extracting and analyzing patient journeys from electronic medical records and clinical workflows, these data sources tend to be centered around a single healthcare provider.8–10 To obtain a more comprehensive overview of the patient journey across various providers and locations, we propose a data-driven methodology based on medical and pharmaceutical claims data. In the U.S.
healthcare system, administrative claims data from insurance providers offer a uniquely detailed retrospective account of how individual patients receive medical treatment.11–13 Claims data contain date, diagnostic, procedural, and provider information, which, when strung together, create an overview of services provided by a collection of clinicians. Compared with electronic health records, insurance claims are a useful platform to study longitudinal utilization and conditions that are treated across multiple locations. However, claims data are often inherently noisy, have duplicated information, and may not accurately identify a complete list of services provided to the patient.14,15 With these challenges, to our knowledge, automatic detection of representative patient journey patterns from claims data has not been successfully completed at scale. The proposed data-driven methodology uniquely combines and builds on tools leveraged elsewhere in healthcare informatics to develop an algorithmic approach to extract and understand patient journeys from claims data.16–25 We represent the back pain–related events of the patient’s journey as a string of letters, in which each letter corresponds to a distinct encounter with the healthcare system. We then evaluate the similarities between the strings based on both their content and their ordering (with a dynamic sequence alignment algorithm), and finally cluster the patient journeys together (using ensemble clustering) to identify representative patterns. Applied together, using careful data modeling, these analytic elements create a data-driven understanding of the patient journey. The proposed methodology to extract patient journey patterns from claims data combines and customizes techniques from sequence alignment and clustering. 
Applications of sequence alignment (such as the Levenshtein edit distance) have been successfully implemented within informatics to map laboratory text into a standardized medical vocabulary, identify duplications in electronic medical records, and normalize terms in clinical text.16–19 Clustering has been shown to be effective at compressing large clinical datasets; techniques including k-means clustering are commonly applied to image processing in the context of radiology scans and skin tissue samples.20–23 There is a large prior literature that focuses on understanding or extracting patient journey patterns from event logs, such as electronic medical record systems.8,10,26,27 When patient journey data are organized into event logs or time stamps, process mining can discover a single process map (or set of maps) that shows how entities transfer from the beginning to the end of the system.24,28 Even though noise reduction techniques have been developed to address challenges such as missing data and repeated events, the frequency of such occurrences in claims data makes it difficult to apply process discovery within the claims setting.9,29,30 Furthermore, in conditions like back pain, in which it is appropriate to revisit or repeat events such as physical therapy, it may not be appropriate to conceptualize the patient journey as an end-to-end process. 
When studying the patient journey using administrative claims, studies typically limit their scope to specific elements of the patient journey, for instance, categorizing the first-line treatment after condition onset, or looking at the first 3 events of the treatment pathway.31,32 Claims have also been used to measure outcomes of pathway effectiveness, without being leveraged to create an understanding of the patient journey itself.33,34 Other work identified common pathways by frequency, but this inherently biases the outputs to display the shortest and simplest patient pathways.35 In contrast, our proposed methodology uses a data-driven approach to identify similarities between patient journeys and understand the main patterns across the patient’s full set of interactions with the healthcare system.

MATERIALS AND METHODS

There are 3 analytical steps to the proposed data-driven approach to extract the patient journeys: claims processing, sequence alignment, and journey clustering. Details of all clinical assumptions, including codes used to process the data, are provided in the Supplementary Appendix.

Claims processing

This research utilized a nationwide U.S. dataset that included medical and pharmaceutical claims from 29 different provider networks across 23 states and the District of Columbia. While not a nationally representative dataset, the patients were insured through commercial, Medicare Advantage, and Medicaid plans and represent a variety of patient demographics and comorbidities. The research was approved by the Ethics Committee at the University of Cambridge Judge Business School as a nonhuman subject study.
We analyzed all back pain–related claims between September 2012 and March 2019, in which back pain was broadly defined to encompass patients expected to follow conservative back pain guidelines, such as those released by the American College of Physicians.36,37 Following the related back pain literature, patients were excluded if they had a history of cancer, congenital abnormalities, or certain autoimmune conditions, or if they were being treated in end-of-life care, as the care for these patients is often medically justified to deviate from the general guidelines.36–39 We identified a random sample of 10 000 back pain episodes (corresponding to 9981 unique patients) in which the patient had an initial back pain–related claim after a minimum 6-month clean period without back pain–related claims.38,40 Patients were required to be fully eligible in the dataset for at least 12 months before the start of the episode and for 6 months after the index back pain claim. We then extracted the first 6 months of back pain claims for each episode, totaling 113 215 claims.

The claims processing stage uses clinical assumptions to group the medical and pharmaceutical claims into 14 event types (see Table 1). For each event type, we identified the set of diagnosis, procedure, revenue, service location, and clinician specialty codes that could be used to classify the claims. For some event types such as back pain surgery (coded as event letter “S”), the event is clearly defined and the codes used to identify claims are directly drawn from the medical literature.4,38

Table 1. Back pain episode event categories and event type descriptions

| Event category | Event type description | Event letter |
|---|---|---|
| General medical interactions | Surgery (eg, fusion or laminectomy) | S |
| General medical interactions | Invasive pain management (eg, epidural or facet injection) | M |
| General medical interactions | Inpatient admission (without surgery) | A |
| General medical interactions | Unplanned care (eg, emergency or urgent care visit) | E |
| General medical interactions | Surgeon appointment | G |
| General medical interactions | Pain medicine specialist appointment (without procedure) | D |
| General medical interactions | Nonprocedural specialist appointment | N |
| General medical interactions | Primary care physician or advanced care practitioner appointment | P |
| General medical interactions | Physical or occupational therapy appointment | T |
| General medical interactions | Alternative medicine (chiropractor or acupuncture) | C |
| Diagnostic imaging | Advanced imaging (magnetic resonance imaging or computed tomography scan) | I |
| Diagnostic imaging | X-ray | X |
| Prescriptions | Opioid prescription | O |
| Prescriptions | Nonopioid prescription | R |
| Time | Wait period between events (of 4+ weeks) or end of episode | W |

Within the back pain setting, the order of the table from top to bottom reflects a decreasing clinical importance of the events.

Other event types require a more knowledge-driven approach to identify the combination of characteristics that classify claims into events.
For instance, the unplanned care event (coded as event letter “E”) looks for claims related to an emergency department visit (revenue code starting with 045, service location code 23, or Current Procedural Terminology code between 99281 and 99285) or urgent care visit (revenue code starting with 0516 or 0526, service location code 20, or Current Procedural Terminology code of S9088 or S9083). Further description of how we arrived at the code classification can be found in the Supplementary Appendix.

Once the claims are assigned to an event type, the claims within each event category are aggregated based on overlapping dates into distinct interactions with the health system. For example, if a physical therapy appointment generated more than 1 medical claim, these claims would be grouped together into a single “physical therapy” (T) event. Likewise, all medical claims associated with a multiday inpatient hospital stay would be grouped together into a single “inpatient admission” (A) event. If an event contained claims that could be classified into different event types, the event is labeled according to the claim with the highest relative importance. For example, if a patient saw a surgeon (G) while admitted to the hospital (A), the event would be labeled as an inpatient admission (A). The order of importance of various events, also known as a clinical hierarchy, is represented from top to bottom in Table 1, with higher importance events listed first.41

Because combining claims into a single event only occurs within the partitions of an event category (eg, diagnostic imaging), it is possible that events from different categories occur on the same day. We also apply the clinical hierarchy from Table 1 to order these same-day events, such that the general medical interactions are ordered ahead of diagnostic imaging or prescriptions.
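The aggregation and ordering rules above can be illustrated with a minimal sketch. It assumes each claim has already been labeled with a Table 1 event letter and a start and end date. The tuple layout, the function names `merge_claims` and `journey_string`, and the reading of a "wait period" as 28 or more days between the end of one event and the start of the next are illustrative assumptions, not the authors' implementation.

```python
from datetime import date

# Event letters in Table 1 order (most to least clinically important).
HIERARCHY = "SMAEGDNPTCIXOR"
RANK = {letter: rank for rank, letter in enumerate(HIERARCHY)}
CATEGORY = {letter: "medical" for letter in "SMAEGDNPTC"}
CATEGORY.update({letter: "imaging" for letter in "IX"})
CATEGORY.update({letter: "prescription" for letter in "OR"})


def merge_claims(claims):
    """Merge (letter, start, end) claims with overlapping dates, per event category."""
    by_category = {}
    for claim in claims:
        by_category.setdefault(CATEGORY[claim[0]], []).append(claim)
    events = []
    for group in by_category.values():
        group.sort(key=lambda c: c[1])
        cur_letter, cur_start, cur_end = group[0]
        for letter, start, end in group[1:]:
            if start <= cur_end:
                # Overlapping dates: same interaction, keep the more important letter.
                cur_end = max(cur_end, end)
                cur_letter = min(cur_letter, letter, key=RANK.get)
            else:
                events.append((cur_letter, cur_start, cur_end))
                cur_letter, cur_start, cur_end = letter, start, end
        events.append((cur_letter, cur_start, cur_end))
    return events


def journey_string(claims, wait_days=28):
    """Order events by date (hierarchy breaks same-day ties) and add W markers."""
    events = sorted(merge_claims(claims), key=lambda e: (e[1], RANK[e[0]]))
    letters = []
    prev_end = None
    for letter, start, end in events:
        if prev_end is not None and (start - prev_end).days >= wait_days:
            letters.append("W")  # assumed: gap of 4+ weeks between events
        letters.append(letter)
        prev_end = end if prev_end is None else max(prev_end, end)
    letters.append("W")  # end of episode
    return "".join(letters)


# Hypothetical claims for one episode (dates are invented for illustration).
example_claims = [
    ("P", date(2019, 1, 2), date(2019, 1, 2)),
    ("T", date(2019, 1, 9), date(2019, 1, 9)),
    ("T", date(2019, 1, 9), date(2019, 1, 9)),   # second claim, same appointment
    ("T", date(2019, 1, 16), date(2019, 1, 16)),
    ("O", date(2019, 1, 20), date(2019, 1, 20)),
    ("I", date(2019, 1, 20), date(2019, 1, 20)),  # same day as the prescription
]
print(journey_string(example_claims))  # PTTIOW
```

In this toy episode the two same-day therapy claims collapse into one T event, and the same-day imaging and opioid events are ordered by the hierarchy (I before O).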
This logic assumes that the clinician associated with the general medical interaction likely ordered the diagnostic imaging or prescription drugs.

Events are each assigned a letter and then strung together in consecutive order to form a longitudinal view of the patient journey for back pain across distinct specialty appointments, prescriptions, facility visits, and diagnostic tests. As an example, the string P-T-T-I-O-W is a potential patient journey. It represents a patient who first went to their primary care physician for back pain (P), had 2 physical therapy appointments (T-T), was given diagnostic imaging in the form of magnetic resonance imaging or a computed tomography scan (I), and then was prescribed an opioid (O). A time-spacing event (W) indicates that significant time has elapsed between events or marks the end of an episode.

Depending on the specific study context, the preprocessing stage can make a significant impact in reducing the dimensionality of the dataset. In our illustration, 113 215 back pain claims were reduced to 53 820 events (a 52.5% reduction in distinct data points), representing 2863 unique variations of the 6-month back pain patient journey. Figure 1 contains a visual representation of the first 4 back pain–related events across the patient sample.

Figure 1. Variation in the first 4 events of the patient back pain journey. Because back pain is a preference-sensitive condition, high variation exists in the first 6 months of the patient journey. The letters in this Sankey chart correspond to the event types displayed in Table 1. Of the patient back pain episodes, 74% contain 4 or fewer events; 89% are completed within 6 months.

Sequence alignment

For preference-sensitive conditions such as back pain, the treatment decisions (eg, whether the patient was prescribed opioids) as well as the order of treatment decisions (eg, whether the patient was sent for advanced imaging before or after attempting physical therapy) can substantially impact patient outcomes.4,37,40,42,43 The next stage of our proposed algorithm assesses the similarity between pairs of patient journey sequences based on both content and order, without requiring researchers to explicitly define clinical rules.

Levenshtein’s edit distance algorithm aligns 2 sequences using a combination of edits: matches, insertions (or, equivalently, deletions), and substitutions.44 For example, the sequences G-T-T-T-P and P-T-P could be aligned by substituting the G for P at the front of the string, and inserting 2 Ts into the middle of the second sequence (see Figure 2). In the standard Levenshtein algorithm,45 each match between the 2 sequences is awarded a value of 1 and each insertion or substitution is penalized with a value of −1. As such, aligning G-T-T-T-P and P-T-P as described previously (substitute + match + insert + insert + match) with the Levenshtein edit costs would result in an alignment score of −1.

Figure 2. Example of sequence alignment. Our adaptation of Levenshtein’s edit distance maximizes the total score of aligning 2 sequences using matches, substitutions, insertions, and transpositions. Because multiple possible alignments exist for any 2 strings, dynamic optimization is applied to maximize the sequence alignment score based on the given edit values.

Our algorithm relies on 2 expansions of the Levenshtein algorithm. First, we allow for transpositions, such that O-P and P-O could be aligned by swapping the last 2 characters instead of applying the insert-match-insert sequence.46 This is important in the back pain context because small changes in the order of patient actions (eg, filling a prescription and getting an x-ray) are often due to scheduling constraints and are of little consequence to the patient’s overall pattern of care. Second, unlike in the Levenshtein algorithm, in which every edit carries the same penalty, our algorithm customizes the edit values based on both the type of editing action and the event being edited.19 For instance, transposing 2 letters may be awarded a smaller edit value compared with matching on the same 2 letters.

Assigning edit values can be data-driven, informed by medical experts, or a combination of both.47 For back pain patients, some rarer treatment options, such as surgery, can be a defining aspect of the patient journey. Therefore, instead of weighting a match on surgery equal to a match on a primary care visit, we assign the value of matching events in proportion to the rareness of the event (which we refer to as rareness weighting). With match values scaled between 1 and 10 in our dataset, A (inpatient admission), which makes up 0.1% of events, has a match edit value of 10.0, while P (primary care visits), which makes up 11.7% of events, has a match edit value of 2.2. See the Supplementary Appendix for a complete list of edit values and the corresponding sensitivity analyses.
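To make the idea of rareness weighting concrete, the sketch below derives match values on a 1-to-10 scale from event frequencies. The exact formula used in the study is given in its Supplementary Appendix and is not reproduced here; the log-inverse-frequency scaling, the function name `rareness_weights`, and the toy journeys are assumptions chosen only to show the rarest event receiving the highest match value.

```python
import math
from collections import Counter


def rareness_weights(journeys, low=1.0, high=10.0):
    """Assumed scaling: -log(frequency), min-max rescaled to [low, high]."""
    counts = Counter("".join(journeys))
    total = sum(counts.values())
    raw = {event: -math.log(count / total) for event, count in counts.items()}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # every event equally common: no basis for differentiation
        return {event: low for event in raw}
    return {
        event: low + (high - low) * (value - lo) / (hi - lo)
        for event, value in raw.items()
    }


# Hypothetical journeys; S (surgery) is rare, W is the most common letter.
weights = rareness_weights(["PTTW", "PXW", "SW"])
for event in sorted(weights, key=weights.get, reverse=True):
    print(event, round(weights[event], 2))
```

Any monotone decreasing function of frequency would serve the same purpose; the choice mainly controls how strongly rare events such as surgery dominate the alignment score.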
As each pair of sequences may be aligned with multiple sets of edit actions, we utilize dynamic optimization to efficiently calculate the highest possible alignment score. The dynamic program is based on the principle that the maximum alignment score of strings $i$ and $j$ must be some combination of an action (eg, substitution) on the last letter(s) of 1 or both of the sequences and the optimal score before that action. Specifically, we define $s^*_{i,j}(y,z)$ to represent the maximum score of aligning the first $y$ elements of patient journey $i$ (where $1 \le y \le i\_len$, the number of elements in sequence $i$) with the first $z$ elements of patient journey $j$ (where $1 \le z \le j\_len$, the number of elements in sequence $j$). The value of the $y$th element of $i$ is designated as $i[y]$ and the $z$th element of $j$ as $j[z]$. The values ($v$) associated with each potential edit operation are $v_{mtc}$ (match), $v_{sub}$ (substitution), $v_{ins}$ (insertion), and $v_{tns}$ (transposition). The dynamic optimization problem to maximize similarity score $s^*_{i,j}(y,z)$ can be expressed through the following formulation:

$$
s^*_{i,j}(y,z) = \max_{(a,b)} \left\{ s^*_{i,j}(y-a,\, z-b) + e_{ij}(y,z) \right\}
$$

Subject to:

$$
s^*_{i,j}(0,0) = 0
$$

$$
1 \le y \le i\_len; \quad 1 \le z \le j\_len
$$

$$
(a,b) \in \{(0,1),\ (1,0),\ (1,1),\ (2,2)\}; \quad a \le y; \quad b \le z
$$

$$
e_{ij}(y,z) =
\begin{cases}
v_{ins}(j[z]) & \text{[1] when } a=0;\ b=1 \\
v_{ins}(i[y]) & \text{[2] when } a=1;\ b=0 \\
v_{mtc}(i[y]) & \text{[3] when } a=b=1;\ i[y]=j[z] \\
\dfrac{v_{sub}(i[y]) + v_{sub}(j[z])}{2} & \text{[4] when } a=b=1;\ i[y]\ne j[z] \\
\max\{v_{mtc}(i[y]),\, v_{mtc}(j[z])\} + \min\{v_{tns}(i[y]),\, v_{tns}(j[z])\} & \text{[5] when } a=b=2;\ i[y-1]=j[z];\ i[y]=j[z-1] \\
-\infty & \text{[6] when } a=b=2;\ i[y-1]\ne j[z] \text{ or } i[y]\ne j[z-1]
\end{cases}
$$

Where [1] inserts letter $j[z]$ into string $i$, [2] inserts letter $i[y]$ into string $j$, [3] matches letter $i[y]=j[z]$, [4] substitutes letter $i[y]$ for $j[z]$, [5] transposes letters $i[y-1{:}y]$ with $j[z-1{:}z]$, and [6] indicates that a transposition between $i[y-1{:}y]$ and $j[z-1{:}z]$ is not valid.

After obtaining the optimal similarity score, we calculate the minimum ($scale\_min_{ij}$) and maximum ($scale\_max_{ij}$) scores that could have been generated for the given pair of strings $i$, $j$ (see Supplementary Appendix for calculation).
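The recurrence above can be sketched as a small dynamic program. This is an illustrative reading of cases [1] to [6], not the authors' code: the function name `align_score`, the helper `uniform`, and the edit values used in the example (match 1, substitution and insertion −1, transposition −0.5) are assumptions. The first example reproduces the standard-Levenshtein score of −1 for G-T-T-T-P and P-T-P quoted in the text.

```python
NEG_INF = float("-inf")


def align_score(i, j, v_mtc, v_sub, v_ins, v_tns):
    """Maximum alignment score of strings i and j under per-letter edit values."""
    n, m = len(i), len(j)
    s = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0.0
    for y in range(n + 1):
        for z in range(m + 1):
            if y == 0 and z == 0:
                continue
            best = NEG_INF
            if z >= 1:  # [1] insert letter j[z] into string i
                best = max(best, s[y][z - 1] + v_ins[j[z - 1]])
            if y >= 1:  # [2] insert letter i[y] into string j
                best = max(best, s[y - 1][z] + v_ins[i[y - 1]])
            if y >= 1 and z >= 1:
                a, b = i[y - 1], j[z - 1]
                if a == b:  # [3] match
                    edit = v_mtc[a]
                else:       # [4] substitution, averaged over both letters
                    edit = (v_sub[a] + v_sub[b]) / 2
                best = max(best, s[y - 1][z - 1] + edit)
            if (y >= 2 and z >= 2
                    and i[y - 2] == j[z - 1] and i[y - 1] == j[z - 2]):
                # [5] transposition of the last two letters
                a, b = i[y - 1], j[z - 1]
                edit = max(v_mtc[a], v_mtc[b]) + min(v_tns[a], v_tns[b])
                best = max(best, s[y - 2][z - 2] + edit)
            s[y][z] = best
    return s[n][m]


def uniform(value, alphabet="SMAEGDNPTCIXORW"):
    """Hypothetical helper: the same edit value for every event letter."""
    return {letter: value for letter in alphabet}


mtc, sub, ins, tns = uniform(1.0), uniform(-1.0), uniform(-1.0), uniform(-0.5)
print(align_score("GTTTP", "PTP", mtc, sub, ins, tns))  # -1.0
print(align_score("OP", "PO", mtc, sub, ins, tns))      # 0.5
```

With these assumed values, O-P and P-O score 0.5 through a single transposition, compared with −1 for the insert-match-insert alternative. Replacing the uniform dictionaries with per-letter values is where rareness weighting would enter.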
We then transform the optimal value of aligning the 2 complete strings, $s^*_{i,j}(i\_len, j\_len)$, into a normalized similarity score $s_{i,j}$, where 0 represents no similarity between strings and 1 implies the strings are identical:

$$
s_{i,j} = \frac{s^*_{i,j}(i\_len,\, j\_len) - scale\_min_{ij}}{scale\_max_{ij} - scale\_min_{ij}}
$$

The algorithm thus assigns high similarity scores to similar patient journeys (eg, P-X-O-W and P-X-O-O-W have a similarity score of 0.81) and lower scores to less similar journeys (eg, P-X-O-W and E-R-P-W have a similarity score of 0.21). The similarity scores $s_{i,j}$ for each pair of journeys are compiled into a similarity matrix (see Figure 3).

Figure 3. Sample of the n-by-n similarity matrix. The matrix is populated using the normalized similarity scores. The index [i, j] in the similarity matrix s corresponds to the similarity score between patient journey i and patient journey j. Note that diagonal entries all have a normalized similarity score of 1 (as a given patient journey is identical to itself), and the lower diagonal is a reflection of the upper diagonal scores (because $s_{i,j} = s_{j,i}$).

Journey clustering

The goal of the clustering is to summarize the main patterns of the patient journeys.
As it is important for the methodology to scale to large patient samples, we leverage k-means clustering, an effective approach when classifying objects within large datasets.21 The basic k-means algorithm (1) chooses k objects to be cluster centers, (2) assigns all other objects to their nearest cluster center, and (3) re-evaluates the center of the cluster. Steps 2 and 3 are repeated until the algorithm converges and no reassignments are made. To choose the cluster centers, we leverage the “k-means++” seeding technique, an approach that encourages starting seeds to be widely spread across the sample.48 After the first center K is randomly chosen, the next center is chosen by assigning a probability based on the squared distance between K and the other objects.

Then, because k-means clustering can be sensitive to its initialization, we aggregate the results from multiple iterations of k-means using ensemble clustering. Ensemble clustering forms more stable clusters, with improved robustness and less distortion.21,22,49 After the k-means algorithm is run with different values of k and starting seeds, we calculate the percentage of times that patient journey i has been clustered together with patient journey j. These percentages are populated into what is called a co-association matrix. Researchers can then choose the single-link threshold t, which represents the minimum percentage that a patient journey i must have been clustered together with 1 (or more) of the patient journeys j in the final data partition Cn for patient journey i to be added into Cn. In the example illustrated in Figure 4, higher thresholds (eg, 90%) yield smaller and more homogeneous clusters, whereas lower thresholds (eg, 50%) yield larger and more diverse clusters.

Figure 4. Aggregating k-means results using ensemble clustering. A single-link method partitions the outputs from multiple iterations of k-means into the final patient journey clusters Cn. When the minimum threshold t is set to 90%, 2 clusters form: POWPO-POWPRO and GWGIGW-GXIGW; the other 4 patient journeys drop out as “noise.” When t = 70%, patient journeys are categorized into 1 of 3 clusters: GWGIGW-GXIGW-GXIW, POWPO-POWPRO-PPRW, or EOXW-EPOPW. When t = 50%, 2 clusters merge, resulting in 2 more heterogeneous clusters: POWPO-POWPRO-PPRW-EOXW-EPOPW and GWGIGW-GXIGW-GXIW.

The chosen threshold t should balance the specificity of the clusters (to focus on specific sets of patients) with the cluster size (to gain enough “power” for any subsequent interpretations, regressions or analyses). To gain an overview of the main patient journey patterns in this study, clinicians selected a threshold of 50% to extract 12 main patient journey patterns from the data. As detailed in the Supplementary Appendix, this threshold is appropriate for this study context in gathering a comprehensive overview of the first 6 months of back pain treatment; setting t to higher thresholds resulted in more, smaller partitions appropriate for studying more detailed clinical questions.

RESULTS

Using the proposed data-driven methodology, the 10 000 patient journeys were reduced into 12 primary patient journey clusters.
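As a rough illustration of the ensemble step described under Journey clustering, the sketch below builds a co-association matrix from several cluster labelings and then applies a single-link threshold t. The labelings are hypothetical stand-ins for the outputs of repeated k-means runs, and the function names `co_association` and `single_link_partition` are illustrative; this is a sketch under those assumptions rather than the study's implementation.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of runs in which each pair of journeys shares a cluster label."""
    n = len(labelings[0])
    runs = len(labelings)
    co = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for a in range(n):
            for b in range(n):
                if labels[a] == labels[b]:
                    co[a][b] += 1 / runs
    return co


def single_link_partition(co, t):
    """Link journeys whose co-association is at least t; singletons are noise."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(range(n), 2):
        if co[a][b] >= t:
            parent[find(a)] = find(b)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    clusters = [members for members in groups.values() if len(members) > 1]
    noise = [members[0] for members in groups.values() if len(members) == 1]
    return clusters, noise


# Hypothetical labels for 5 journeys across 4 k-means runs.
runs = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
]
co = co_association(runs)
for t in (0.9, 0.7, 0.5):
    print(t, single_link_partition(co, t))
```

In this toy input, t = 0.9 keeps only journeys 0 and 1 together and leaves three journeys as noise, t = 0.7 adds a second cluster of journeys 3 and 4, and t = 0.5 absorbs journey 2 into the first cluster, mirroring the threshold behavior described for Figure 4.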
The resulting clusters displayed in Table 2 show the distribution of patient episodes between diagnosis and treatment pathways, along with example patient journey sequences that make up each cluster.

Table 2. Back pain patient journey clusters

| Cluster | Journeys | Example sequences | Clinical description |
|---|---|---|---|
| 1 | 17.0% | PXWPW, PXPTW, PXPW, PW, PPXWPW, PXPPW, PPXTW, PWPXWTW, PPW | Primary care + x-ray or wait |
| 2 | 16.0% | EWEXW, EXEW, EEXW, EXERW, EXWEW, EEWXW, EWXW, EEXRW, EXEWTTW | Emergency + x-ray |
| 3 | 9.7% | TTWTTW, TWTTTW, TTWTTTW, TTTWTW, TTWTWT, TWTTWTW | Short therapy series |
| 4 | 9.3% | CCCCWCCW, CCCCWCCCW, CCCWCCCCW, CCCCCCCW | Alternative medicine |
| 5 | 9.1% | PRWPRW, PRPRW, PRPRTW, PRTWPRW, PRPRPW, PRRPW, PXRPRW, PRWPRRW | Primary care + nonopioid prescriptions |
| 6 | 8.1% | TTTTTTTTWTTTTW, TTTTTTTTTTTTW, TTTTTTWTTTTTW, TTTTTWTTTTTTW | Long therapy series |
| 7 | 5.7% | POWPOW, PORPOW, POPOW, PORWPOW, PWPOPOW, POROW, PXOWPOW | Primary care + opioids |
| 8 | 5.7% | MWMWMW, MWMMW, MMWMW, MWIMWMW, TTMMWMW, MIMWMMW | Pain medicine procedure |
| 9 | 5.5% | GXIWXW, GXIGW, GXWIGW, PGXIGW, GXIGWTTW, GXGW, GXWGIGW | Surgeon + imaging |
| 10 | 5.3% | EXIW, EXWIW, EWIW, EIWIW, EXWIEW, EITW, EIW, EIWTTW, EIWEIW | Emergency + advanced imaging |
| 11 | 5.0% | DDDWDW, DDDWDDW, DWDDWDW, DDDDWDW, DWDDDW, DDDDW | Pain medicine specialist |
| 12 | 2.0% | NIWNW, NINW, NWNW, NRINW, NITTNTTW, NRNW, NNW, NINRTTW | Nonprocedural specialist |
| Noise | 1.7% | n/a | Other |

The highest proportion of patients (17.0% in cluster 1) visit a primary care practitioner and are directed to a low-acuity next step that may include waiting at least 4 weeks, getting an x-ray diagnosis, or a physical or occupational therapy appointment. Patients in cluster 1 appear to closely follow clinical guidelines that promote noninvasive, nonopioid care after initial onset of back pain.36,43 Patients in clusters 5 and 7 also begin their back pain episode in the primary care setting; however, most patients fill prescriptions (either opioid or nonopioid) as their first-line treatment. The second most common cluster comprises patients who make an unplanned visit to an emergency or urgent care center and receive an x-ray (16.0% in cluster 2).
In 9.7% of episodes (cluster 3), we observe a self-referral to physical or occupational therapy, in which the patient proceeds to have approximately 3-5 additional therapy appointments. There also exist small, well-defined clusters such as cluster 8 (5.7% of episodes, in which patients are primarily treated with facet or epidural injections) and cluster 11 (5.0% of episodes, in which pain medicine specialists are consulted but do not administer epidural or facet injections). As described in the Materials and Methods, patient journeys were clustered solely on the sequence of the patient’s back pain events without considering the patient’s comorbidities or demographics. However, as seen in Table 3, there is a high level of variability between clusters in terms of patient characteristics.

Table 3. Cluster summary statistics (column groups in the original: patient characteristics; within 6 weeks of episode start; within 6 months of episode start)

| Cluster | Average age (y) | Average annual income (a) | Advanced imaging | Opioids filled | Unplanned emergency | Surgery | Average total cost (b) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 47.2 | $73,900 | 2.2% | 0.0% | 1.0% | 0.1% | 1.0× |
| 2 | 49.1 | $62,700 | 0.0% | 1.9% | 100.0% | 0.0% | 2.2× |
| 3 | 50.3 | $80,000 | 0.8% | 0.0% | 0.0% | 0.0% | 1.8× |
| 4 | 44.0 | $77,400 | 0.8% | 0.5% | 1.0% | 0.0% | 2.2× |
| 5 | 48.2 | $69,400 | 5.5% | 3.1% | 5.0% | 0.2% | 2.0× |
| 6 | 53.4 | $84,000 | 2.1% | 1.0% | 3.5% | 0.1% | 7.7× |
| 7 | 53.2 | $68,800 | 11.8% | 91.2% | 23.5% | 2.3% | 4.2× |
| 8 | 65.5 | $71,900 | 12.4% | 3.7% | 6.4% | 1.4% | 10.1× |
| 9 | 49.2 | $85,500 | 27.8% | 9.4% | 8.2% | 9.1% | 10.5× |
| 10 | 59.7 | $67,200 | 89.8% | 7.9% | 97.4% | 2.5% | 7.5× |
| 11 | 51.1 | $81,700 | 10.8% | 5.8% | 3.0% | 0.8% | 5.3× |
| 12 | 51.8 | $84,400 | 17.3% | 5.6% | 3.0% | 3.0% | 5.3× |
| Noise | 50.4 | $70,600 | 36.9% | 1.2% | 14.3% | 1.8% | 7.7× |

(a) As estimated using the census tract data provided by the U.S. Census Bureau; salary estimates are based on the patient’s census tract area.
(b) Compared with least expensive cluster 1.

For example, patients in cluster 6 (who receive 10 or more physical or occupational therapy sessions) or clusters 9, 11, and 12 (who obtain care from specialists) are more likely to live in areas with higher average salaries compared with patients who follow different patient journeys. Meanwhile, the lowest average salaries within the sample are associated with clusters 2 and 10 (seeking care from the emergency room) or cluster 7 (being prescribed opioids by a primary care physician). The alternative medicine cluster (cluster 4) is associated with the youngest average age of the sample, whereas cluster 8 (invasive pain management procedures) is associated with the highest average age. Driven largely by our use of rareness-weighted edit values, there is a high level of diversity between clusters in terms of the key back pain outcomes such as early advanced imaging and surgical rates.37 For example, high rates of back pain surgery are concentrated among the patients whose initial starting encounter is a surgeon (cluster 9 at 9.1%), compared with patients who enter the system through other clinical entry points. Episodes in cluster 8 (invasive pain management procedures) average 10.1 times higher medical costs than episodes in cluster 1.
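The paragraph above credits rareness-weighted edit values for the separation between clusters. This section does not give the exact weighting rule, so the following is only a minimal Python sketch under an assumed rule: the cost of editing an event grows with the log of its inverse frequency. The function names `rareness_weights` and `weighted_edit_distance`, the formula, and the toy sample are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def rareness_weights(journeys):
    """Assumed rule (not from the paper): rarer events get larger edit costs."""
    counts = Counter(event for journey in journeys for event in journey)
    total = sum(counts.values())
    # Hypothetical formula: 1 + ln(total events / occurrences of this event).
    return {event: 1.0 + math.log(total / n) for event, n in counts.items()}


def weighted_edit_distance(a, b, weight):
    """Levenshtein-style distance with per-event costs.

    Inserting or deleting an event costs that event's weight; substituting
    one event for another costs the larger of the two weights.
    """
    # prev[j] holds the cost of turning the empty prefix of a into b[:j].
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + weight[cb])
    for ca in a:
        curr = [prev[0] + weight[ca]]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else max(weight[ca], weight[cb]))
            delete = prev[j] + weight[ca]
            insert = curr[j - 1] + weight[cb]
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# Toy sample built from example sequences listed in Table 2.
journeys = ["PXWPW", "PXPW", "PW", "PPW", "PXPPW", "GXIGW"]
weights = rareness_weights(journeys)

print(weighted_edit_distance("PXW", "PPW", weights))  # swap X for a common event
print(weighted_edit_distance("PXW", "PIW", weights))  # swap X for a rare event
```

In this toy sample the letter I occurs only once, so replacing X with I costs more than replacing X with the common letter P. A unit-cost edit distance would score both substitutions as 1.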
In cluster 7, in which the patient’s first point of contact is typically the primary care physician, 91.2% of patients are prescribed and fill opioids within the first 6 weeks of the start of their episode. The opioid fill rate in this primary care cluster exceeds even that of clusters 2 and 10, in which patients seek care in emergency or urgent care settings. Although early advanced imaging within 6 weeks of onset of pain is considered a major contributor to overtreatment and inappropriate medical spending, 89.8% of the patients in cluster 10, who seek care in an emergency or urgent care setting, receive magnetic resonance imaging or a computed tomography scan within 6 weeks of their index back pain claim.4 Despite the high cost associated with this cluster, the episodes appear short-lived, with 93.4% of patients ceasing treatment for back pain within the first 6 weeks compared with the overall rate of 88.9% across the sample. An additional 27.8% of patients who seek care from a surgeon (cluster 9) receive advanced imaging within the first 6 weeks, as do 17.3% of patients whose care is managed by nonprocedural specialists (cluster 12). Although they jointly comprise only 23.4% of episodes in the sample, clusters 8-12, which rely heavily on specialists, procedures, and imaging, make up 43.2% of overall back pain spending.

DISCUSSION

While claims data have been touted as having the potential to provide a bird’s-eye view of a patient’s healthcare records and of healthcare utilization at the population level, studies have often fallen short of that goal. Claims data are notoriously noisy (owing to, for example, variation in medical coding) and are not generated for research purposes.11–15 The proposed methodology effectively used a combination of data processing, sequence alignment, and ensemble clustering to identify primary patient journey patterns in the highly variable, preference-sensitive condition of back pain.
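As a rough illustration of the sequence alignment step named above, the sketch below computes a pairwise distance matrix over a handful of example sequences from Table 2. It uses the plain unit-cost Levenshtein distance rather than the customized edit values used in the study, and the helper names `levenshtein` and `distance_matrix` are hypothetical.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two journey strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(journeys):
    """Symmetric matrix of pairwise distances; each pair is computed once."""
    n = len(journeys)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(journeys[i], journeys[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Two example sequences each from clusters 1, 2, and 3 in Table 2.
journeys = ["PXWPW", "PXPW", "EXEW", "EEXW", "TTWTTW", "TWTTTW"]
matrix = distance_matrix(journeys)

for journey, row in zip(journeys, matrix):
    print(f"{journey:>7}", row)
```

A clustering step would then take this matrix as input. The number of comparisons grows as n(n-1)/2, so a sample of 10 000 episodes requires 49,995,000 pairwise alignments.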
When a group of primary care providers in a large multispecialty clinic were presented with the preliminary cluster outputs using the group’s own data, they initially voiced concerns about the validity of the treatment pathways. However, within a short time of reviewing the outputs, the clinicians moved beyond their own recollections of individual patient cases to a more objective discussion of treatment options within the context of real-life complexities. In addition to engaging with an overview of care plans, clinicians leveraged the outputs to better understand variability between outcomes and inform future medical research into how patients and providers interact to create high-quality care.

This study is the result of a close collaboration between healthcare informatics researchers and clinicians. Medical expertise on back pain was instrumental in designing an objective with clinical relevance and informing the patient sample criteria. Clinicians also helped define event types, set edit value assumptions, and interpret cluster results. While not used directly in this research, standardized logics, such as SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms), can aid informatics researchers when translating clinical assumptions into corresponding diagnosis codes.50

There are multiple opportunities to expand this methodology in future work. First, the methodology could be adapted to study longitudinal healthcare events such as chronic conditions or the patient experience at the end of life. Doing so may require researchers to adjust the preprocessing steps such that the patient journey is represented in blocks of time instead of as a string of distinct events. Second, researchers could consider different methods of assigning edit values.
For instance, the rareness-weighting method of assigning edit values was developed when clinicians identified that the Levenshtein edit values caused insufficient differentiation of journeys with rare, high-cost events. In other research contexts that have well-established guidelines, a weighting strategy could assign edit values based on which events are recommended as first-line or second-line therapies. Third, the methodology could be adjusted based on the size of the dataset. The sequence alignment step can be computationally intensive as it compares each pair of patient journeys to obtain the full matrix of similarity scores. In comparison, conformance checking in process mining uses sequence alignment to compare event-log data to a single, predetermined understanding of the process map.19,24 While the conformance checking approach anchors the analysis to prior, potentially biased knowledge of the system, there likely exists a balance between its limited comparisons and our methodology that lets patterns emerge fully from the data. Researchers could also compare the efficiency of the presented clustering method to techniques such as spectral clustering. Finally, we have not yet explored techniques used in other applications of data science and sequence alignment, including the trace-back method that identifies sources of deviations along pairs of sequences.51 In the context of preference-sensitive conditions, the trace-back method could allow researchers to isolate key discrepancies between journeys that may have led to variation in outcome measures. The resulting clusters can also be combined with prediction algorithms to identify patients who should be targeted for early intervention. For example, certain patterns in the beginning of the journey may signal that a patient is at elevated risk for aggressive opioid prescribing or for a low-quality procedure. Despite the clinically relevant results, we acknowledge that this study has several limitations. 
We used a very broad definition of back pain that captured most patients presenting new general back pain symptoms. The purpose of this definition was to identify patient episodes typically expected to follow a conservative care route, as outlined by the American College of Physicians, and to understand population-level deviation from clinical recommendations.36,37 It is not known how well our selected population reflects all patients who present with back pain, as claims data were not supplemented by other data sources such as hospital notes or psychologic evaluations. Additionally, while patients are geographically dispersed throughout the United States, the sample is not nationally representative; thus, the breakdown of patient episodes into clusters may not represent national trends.

CONCLUSION

Compared to clinical guidelines that represent a top-down picture of patient behavior, the outputs from this methodology reveal a data-driven understanding of how patients traverse the healthcare system. Using a limited set of assumptions, the methodology is particularly effective in analyzing conditions with high levels of variability in patient care and those treated across service locations. In the preference-sensitive condition of back pain, we observed that treatment choices are associated with patient characteristics and procedure rates, thereby highlighting the potential public health impact of related future studies based on this methodology. When tailored to various care settings, this methodology can provide the medical community with an accurate overview of the current state of patient care and facilitate a shift toward high-quality practice patterns.

FUNDING

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sector. It was supported internally by Evolent Health.

AUTHOR CONTRIBUTIONS

All authors were involved in revising the work for intellectual content and approved the manuscript.
KB, AC, and LH made substantial contributions to the original study design. KB developed the algorithm, contributed to data analysis, interpretation, and writing the manuscript. CL contributed to data analysis, interpretation, and writing the manuscript. AC contributed to data acquisition and revising the manuscript. MB refined the algorithm and contributed to data interpretation, writing, and revising the manuscript. LH served as the primary contact with provider groups during algorithm development and contributed to clinical assumptions, data interpretation, and revising the manuscript.

SUPPLEMENTARY APPENDIX

Supplementary Appendix is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the following individuals for their contributions: Rich King, Malcolm Charles, and Michael Freeman who advised on algorithm development; Madina Bram who provided administrative support on the project; Feryal Erhun, Stefan Scholtes, Nico Lewine, Matthias Weidlich, and Jenny Wang who provided valuable feedback and suggestions to the development and framing of this research. The authors also thank the JAMIA review team for their constructive and encouraging comments, as well as the clinicians who participated in discussions on the back pain assumptions and outputs.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1. Trebble TM, Hansi N, Hydes T, et al. Process mapping the patient journey: an introduction. BMJ 2010; 341(1): c4078.
2. Flint LA, David DJ, Smith AK, et al. Rehabbed to death: breaking the cycle. J Am Geriatr Soc 2019; 67(11): 2398–9.
3. Wennberg JE. Unwarranted variations in healthcare delivery: implications for academic medical centres. Br Med J 2002; 325(7370): 961–4.
4. Deyo R, Mirza S, Turner J, et al. Overtreating chronic back pain: time to back off? J Am Board Fam Med 2009; 22(1): 62–8.
5. Heyward J, Jones CM, Compton WM, et al. Coverage of nonpharmacologic treatments for low back pain among US public and private insurers. JAMA Netw Open 2018; 1(6): e183044.
6. The Dartmouth Atlas of Healthcare: Understanding Efficiency and Effectiveness of the Health Care System. 2018. http://www.dartmouthatlas.org/ Accessed November 19, 2018.
7. Wood A, Matula SR, Huan L, et al. Improving the value of medical care for patients with back pain. Pain Med 2019; 20(4): 664–7.
8. Huang Z, Lu X, Duan H. On mining clinical pathway patterns from medical behaviors. Artif Intell Med 2012; 56(1): 35–50.
9. Maeng D, Boscarino J, Stewart W, et al. A comparison of electronic medical records vs. claims data for rheumatoid arthritis patients in a large healthcare system: an exploratory analysis. Clin Med Res 2014; 12: 108.
10. Huang Z, Ge Z, Dong W, et al. Probabilistic modeling personalized treatment pathways using electronic health records. J Biomed Inform 2018; 86: 33–48.
11. Bertsimas D, Bjarnadóttir MV, Kane MA, et al. Algorithmic prediction of health-care costs. Oper Res 2008; 56(6): 1382–92.
12. Bjarnadottir MV, Czerwinski D, Guan Y. The history and modern application of insurance claims data in healthcare research: from data to knowledge to healthcare improvement. In: Yang H, Lee E, eds. Healthcare Analytics: From Data to Knowledge to Healthcare Improvement. Hoboken, NJ: Wiley; 2016: 561–91.
13. Simon GE, Shortreed SM, Johnson E, et al. What health records data are required for accurate prediction of suicidal behavior? J Am Med Inform Assoc 2019; 26(12): 1458–65.
14. Heintzman J, Bailey SR, Hoopes MJ, et al. Agreement of Medicaid claims and electronic health records for assessing preventive care quality among adults. J Am Med Inform Assoc 2014; 21(4): 720–4.
15. Zhang Y, Koru G. Understanding and detecting defects in healthcare administration data: Toward higher data quality to better support healthcare operations and decisions. J Am Med Inform Assoc 2020; 27(3): 386–95.
16. Sun J, Sun Y. A system for automated lexical mapping. J Am Med Inform Assoc 2006; 13(3): 334–43.
17. Wrenn JO, Stein DM, Bakken S, et al. Quantifying clinical narrative redundancy in an electronic health record. J Am Med Inform Assoc 2010; 17(1): 49–53.
18. Kate RJ. Normalizing clinical terms using learned edit distance patterns. J Am Med Inform Assoc 2016; 23(2): 380–6.
19. Van der Aalst W, Adriansyah A, Van Dongen B. Replaying history on process models for conformance checking and performance analysis. WIREs Data Mining Knowl Discov 2012; 2(2): 182–92.
20. Li P, Jiang X, Wang S, et al. HUGO: Hierarchical multi-reference genome compression for aligned reads. J Am Med Inform Assoc 2014; 21(2): 363–73.
21. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett 2010; 31(8): 651–66.
22. Fodeh SJ, Brandt C, Luong TB, et al. Complementary ensemble clustering of biomedical data. J Biomed Inform 2013; 46(3): 436–43.
23. Zhang Y, Padman R, Levin JE. Paving the COWpath: data-driven design of pediatric order sets. J Am Med Inform Assoc 2014; 21(e2): e304–11.
24. Rojas E, Munoz-Gama J, Sepúlveda M, et al. Process mining in healthcare: a literature review. J Biomed Inform 2016; 61: 224–36.
25. Wellner B, Huyck M, Mardis S, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc 2007; 14(5): 564–73.
26. Zhang Y, Padman R. Innovations in chronic care delivery using data-driven clinical pathways. Am J Manag Care 2015; 21(12): e661–668.
27. Chen JH, Podchiyska T, Altman RB. OrderRex: clinical order decision support and outcome predictions by data-mining electronic medical records. J Am Med Inform Assoc 2016; 23(2): 339–48.
28. Ghasemi M, Amyot D. Process mining in healthcare: a systematised literature review. Int J Electron Healthc 2016; 9(1): 60–88.
29. Delias P, Doumpos M, Grigoroudis E, et al. Supporting healthcare management decisions via robust clustering of event logs. Knowl-Based Syst 2015; 84: 203–13.
30. Song M, Günther CW, Van Der Aalst W. Trace clustering in process mining. In: BPM 2008: Business Process Management Workshops; 2008: 109–20.
31. Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proc Natl Acad Sci U S A 2016; 113(27): 7329–36.
32. Kuwornu JP, Lix LM, Quail JM, et al. Identifying distinct healthcare pathways during episodes of chronic obstructive pulmonary disease exacerbations. Medicine (Baltimore) 2016; 95: e2888.
33. Hoverman JR, Cartwright TH, Patt DA, et al. Pathways, outcomes, and costs in colon cancer: retrospective evaluations in 2 distinct databases. Am J Manag Care 2011; 7: 52–9.
34. Tessier JE, Rupp G, Gera JT, et al. Physicians with defined clear care pathways have better discharge disposition and lower cost. J Arthroplasty 2016; 31(9): 54–8.
35. Tóth K, Kósa I, Vathy-Fogarassy Á. Frequent treatment sequence mining from medical databases. Stud Health Technol Inform 2017; 236: 211–8.
36. Qaseem A, Wilt TJ, McLean RM, et al. Noninvasive treatments for acute, subacute, and chronic low back pain: a clinical practice guideline from the American College of Physicians. Ann Intern Med 2017; 166(7): 514–30.
37. Bernstein IA, Malik Q, Carville S, et al. Low back pain and sciatica: summary of NICE guidance. BMJ 2017; 356: 10–3.
38. Sinnott PL, Siroka AM, Shane AC, et al. Identifying neck and back pain in administrative data: defining the right cohort. Spine (Phila Pa 1976) 2012; 37(10): 860–74.
39. Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol 1992; 45(6): 613–9.
40. Jarvik JG, Gold LS, Comstock BA, et al. Association of early imaging for back pain with clinical outcomes in older adults. JAMA 2015; 313(11): 1143–53.
41. Shah A, Hayes C, Martin B. Factors influencing long-term opioid use among opioid naive patients: an examination of initial prescription characteristics and pain etiologies. J Pain 2017; 18(11): 1374–83.
42. Gellhorn AC, Chan L, Martin B, et al. Management Patterns in Acute Low Back Pain: The Role of Physical Therapy. Spine (Phila Pa 1976) 2012; 37(9): 775–82.
43. Deyo RA, Von Korff M, Duhrkoop D. Opioids for low back pain. BMJ 2015; 350: g6380–13.
44. Levenshtein V. Binary codes capable of correcting deletions, insertions, and reversals. Cybern Control Theory 1966; 10: 707–10.
45. Pentland BT. Conceptualizing and measuring variety in the execution of organizational work processes. Manage Sci 2003; 49(7): 857–70.
46. Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM 1964; 7(3): 171–6.
47. Bose R, van der Aalst WMP. Context aware trace clustering: towards improving process mining results. In: Proceedings of the 2009 SIAM International Conference on Data Mining; 2009: 401–12.
48. Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms; 2007: 1027–35.
49. Fred ALN, Jain AK. Robust data clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2003.
50. Lee D, Cornet R, Lau F, et al. A survey of SNOMED CT implementations. J Biomed Inform 2013; 46(1): 87–96.
51. Haque W, Aravind A, Reddy B. Pairwise sequence alignment algorithm. In: ISTA ’09: Proceedings of the 2009 Conference on Information Science, Technology and Applications; 2009: 96–103.

© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal of the American Medical Informatics Association, Oxford University Press

Publisher
Oxford University Press
Copyright
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com
ISSN
1067-5027
eISSN
1527-974X
DOI
10.1093/jamia/ocaa052

Keywords: claims data, patient journey, clustering, edit distance, sequence alignment

INTRODUCTION

Medical researchers have long pointed to the importance of understanding the realistic picture of the patient journey: the chronological sequence of how a patient seeks and receives care from the healthcare system.1,2 Capturing an accurate overview of the patient journey can help identify sources of variability, evaluate why patients respond differently to the same overarching treatment plan, and compare how actual realizations of the treatment plan differ from standard clinical guidelines. However, in a fragmented healthcare system, it can be difficult to derive a comprehensive understanding of patient journeys based on real utilization patterns.

Understanding the patient journey is especially important for highly variable, preference-sensitive conditions such as back pain.3–5 Because back pain has numerous clinically acceptable therapeutic options, the trajectory of patient care can be highly variable and influenced by the severity of the condition, access to healthcare services, provider preferences, and the patient’s medical history.5–7 Adding to this complexity, treatment for back pain often occurs across service locations (eg, primary care, emergency services, physical therapy). While significant effort has been placed on extracting and analyzing patient journeys from electronic medical records and clinical workflows, these data sources tend to be centered around a single healthcare provider.8–10 To obtain a more comprehensive overview of the patient journey across various providers and locations, we propose a data-driven methodology based on medical and pharmaceutical claims data. In the U.S.
healthcare system, administrative claims data from insurance providers offer a uniquely detailed retrospective account of how individual patients receive medical treatment.11–13 Claims data contain date, diagnostic, procedural, and provider information, which, when strung together, create an overview of services provided by a collection of clinicians. Compared with electronic health records, insurance claims are a useful platform to study longitudinal utilization and conditions that are treated across multiple locations. However, claims data are often inherently noisy, have duplicated information, and may not accurately identify a complete list of services provided to the patient.14,15 With these challenges, to our knowledge, automatic detection of representative patient journey patterns from claims data has not been successfully completed at scale. The proposed data-driven methodology uniquely combines and builds on tools leveraged elsewhere in healthcare informatics to develop an algorithmic approach to extract and understand patient journeys from claims data.16–25 We represent the back pain–related events of the patient’s journey as a string of letters, in which each letter corresponds to a distinct encounter with the healthcare system. We then evaluate the similarities between the strings based on both their content and their ordering (with a dynamic sequence alignment algorithm), and finally cluster the patient journeys together (using ensemble clustering) to identify representative patterns. Applied together, using careful data modeling, these analytic elements create a data-driven understanding of the patient journey. The proposed methodology to extract patient journey patterns from claims data combines and customizes techniques from sequence alignment and clustering. 
Applications of sequence alignment (such as the Levenshtein edit distance) have been successfully implemented within informatics to map laboratory text into a standardized medical vocabulary, identify duplications in electronic medical records, and normalize terms in clinical text.16–19 Clustering has been shown to be effective at compressing large clinical datasets; techniques including k-means clustering are commonly applied to image processing in the context of radiology scans and skin tissue samples.20–23 There is a large prior literature that focuses on understanding or extracting patient journey patterns from event logs, such as electronic medical record systems.8,10,26,27 When patient journey data are organized into event logs or time stamps, process mining can discover a single process map (or set of maps) that shows how entities transfer from the beginning to the end of the system.24,28 Even though noise reduction techniques have been developed to address challenges such as missing data and repeated events, the frequency of such occurrences in claims data makes it difficult to apply process discovery within the claims setting.9,29,30 Furthermore, in conditions like back pain, in which it is appropriate to revisit or repeat events such as physical therapy, it may not be appropriate to conceptualize the patient journey as an end-to-end process. 
When studying the patient journey using administrative claims, researchers typically restrict the analysis to specific elements of the journey, for instance, categorizing the first-line treatment after condition onset, or looking at the first 3 events of the treatment pathway.31,32 Claims have also been used to measure outcomes of pathway effectiveness, without being leveraged to create an understanding of the patient journey itself.33,34 Other work identified common pathways by frequency, but this inherently biases the outputs to display the shortest and simplest patient pathways.35 In contrast, our proposed methodology uses a data-driven approach to identify similarities between patient journeys and understand the main patterns across the patient’s full set of interactions with the healthcare system.

MATERIALS AND METHODS

There are 3 analytical steps to the proposed data-driven approach to extract the patient journeys: claims processing, sequence alignment, and journey clustering. Details of all clinical assumptions, including codes used to process the data, are provided in the Supplementary Appendix.

Claims processing

This research utilized a nationwide U.S. dataset that included medical and pharmaceutical claims from 29 different provider networks across 23 states and the District of Columbia. While not a nationally representative dataset, the patients were insured through commercial, Medicare Advantage, and Medicaid plans and represent a variety of patient demographics and comorbidities. The research was approved by the Ethics Committee at the University of Cambridge Judge Business School as a nonhuman subject study.
We analyzed all back pain–related claims between September 2012 and March 2019, in which back pain was broadly defined to encompass patients expected to follow conservative back pain guidelines, such as those released by the American College of Physicians.36,37 Following the related back pain literature, patients were excluded if they had a history of cancer, congenital abnormalities, or certain autoimmune conditions, or if they were being treated in end-of-life care, as care for these patients is often medically justified in deviating from the general guidelines.36–39 We identified a random sample of 10 000 back pain episodes (corresponding to 9981 unique patients) in which the patient had an initial back pain–related claim after a minimum 6-month clean period without back pain–related claims.38,40 Patients were required to be fully eligible in the dataset for at least 12 months before the start of the episode and for 6 months after the index back pain claim. We then extracted the first 6 months of back pain claims for each episode, totaling 113 215 claims.

The claims processing stage uses clinical assumptions to group the medical and pharmaceutical claims into 14 event types (see Table 1). For each event type, we identified the set of diagnosis, procedure, revenue, service location, and clinician specialty codes that could be used to classify the claims. For some event types such as back pain surgery (coded as event letter “S”), the event is clearly defined and the codes used to identify claims are directly drawn from the medical literature.4,38

Table 1. Back pain episode event categories and event type descriptions

| Event category | Event type description | Event letter |
|---|---|---|
| General medical interactions | Surgery (eg, fusion or laminectomy) | S |
| General medical interactions | Invasive pain management (eg, epidural or facet injection) | M |
| General medical interactions | Inpatient admission (without surgery) | A |
| General medical interactions | Unplanned care (eg, emergency or urgent care visit) | E |
| General medical interactions | Surgeon appointment | G |
| General medical interactions | Pain medicine specialist appointment (without procedure) | D |
| General medical interactions | Nonprocedural specialist appointment | N |
| General medical interactions | Primary care physician or advanced care practitioner appointment | P |
| General medical interactions | Physical or occupational therapy appointment | T |
| General medical interactions | Alternative medicine (chiropractor or acupuncture) | C |
| Diagnostic imaging | Advanced imaging (magnetic resonance imaging or computed tomography scan) | I |
| Diagnostic imaging | X-ray | X |
| Prescriptions | Opioid prescription | O |
| Prescriptions | Nonopioid prescription | R |
| Time | Wait period between events (of 4+ weeks) or end of episode | W |

Within the back pain setting, the order of the table from top to bottom reflects a decreasing clinical importance of the events.

Other event types require a more knowledge-driven approach to identify the combination of characteristics that classify claims into events.
For instance, the unplanned care event (coded as event letter “E”) looks for claims related to an emergency department visit (revenue code starting with 045, service location code 23, or Current Procedural Terminology code between 99281 and 99285) or urgent care visit (revenue code starting with 0516 or 0526, service location code 20, or procedure code S9088 or S9083). Further description of how we arrived at the code classification can be found in the Supplementary Appendix.

Once the claims are assigned to an event type, the claims within each event category are aggregated based on overlapping dates into distinct interactions with the health system. For example, if a physical therapy appointment generated more than 1 medical claim, these claims would be grouped together into a single “physical therapy” (T) event. Likewise, all medical claims associated with a multiday inpatient hospital stay would be grouped together into a single “inpatient admission” (A) event. If an event contained claims that could be classified into different event types, the event is labeled according to the claim with the highest relative importance. For example, if a patient saw a surgeon (G) while admitted to the hospital (A), the event would be labeled as an inpatient admission (A). The order of importance of various events, also known as a clinical hierarchy, is represented from top to bottom in Table 1, with higher importance events listed first.41

Because combining claims into a single event only occurs within the partitions of an event category (eg, diagnostic imaging), it is possible that events from different categories occur on the same day. We also apply the clinical hierarchy from Table 1 to order these same-day events, such that the general medical interactions are ordered ahead of diagnostic imaging or prescriptions.
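The labeling and ordering rules can be illustrated with a minimal sketch. It assumes each claim has already been mapped to an event letter from Table 1, and it simplifies "overlapping dates" to "same service date"; the function name `build_journey`, the 28-day reading of "4+ weeks", and the sample dates are illustrative assumptions, not the authors' implementation. The sketch also strings the resulting letters together, using the W marker defined in Table 1.

```python
from datetime import date

# Event letters from Table 1, highest to lowest clinical importance.
HIERARCHY = "SMAEGDNPTCIXOR"
CATEGORY = {**{c: "medical" for c in "SMAEGDNPTC"},
            **{c: "imaging" for c in "IX"},
            **{c: "prescription" for c in "OR"}}
WAIT_DAYS = 28  # assumed reading of "4+ weeks"


def build_journey(claims):
    """claims: iterable of (service_date, event_letter) for one episode."""
    # Within a category, collapse same-day claims to the most important letter.
    best = {}
    for day, letter in claims:
        key = (day, CATEGORY[letter])
        if key not in best or HIERARCHY.index(letter) < HIERARCHY.index(best[key]):
            best[key] = letter
    # Order by date, then by clinical hierarchy for same-day events.
    events = sorted(((day, letter) for (day, _), letter in best.items()),
                    key=lambda e: (e[0], HIERARCHY.index(e[1])))
    journey, previous = [], None
    for day, letter in events:
        if previous is not None and (day - previous).days >= WAIT_DAYS:
            journey.append("W")  # long gap between consecutive events
        journey.append(letter)
        previous = day
    journey.append("W")  # end of episode
    return "-".join(journey)


claims = [
    (date(2018, 1, 2), "G"), (date(2018, 1, 2), "A"), (date(2018, 1, 2), "X"),
    (date(2018, 1, 9), "T"), (date(2018, 1, 9), "T"),
    (date(2018, 3, 1), "O"),
]
print(build_journey(claims))  # A-X-T-W-O-W
```

In this toy episode the surgeon visit is absorbed into the same-day inpatient admission, the x-ray is ordered after it, the two therapy claims collapse into one event, and the 51-day gap before the opioid fill produces a W.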
This logic assumes that the clinician associated with the general medical interaction likely ordered the diagnostic imaging or prescription drugs.

Events are each assigned a letter and then strung together in consecutive order to form a longitudinal view of the patient journey for back pain across distinct specialty appointments, prescriptions, facility visits, and diagnostic tests. As an example, the string P-T-T-I-O-W is a potential patient journey. It represents a patient who first went to their primary care physician for back pain (P), had 2 physical therapy appointments (T-T), was given diagnostic imaging in the form of magnetic resonance imaging or a computed tomography scan (I), and then was prescribed an opioid (O). A time-spacing event (W) indicates that significant time has elapsed between events or, as here, marks the end of an episode.

Depending on the specific study context, the preprocessing stage can make a significant impact in reducing the dimensionality of the dataset. In our illustration, 113 215 back pain claims were reduced to 53 820 events (a 52.5% reduction in distinct data points), representing 2863 unique variations of the 6-month back pain patient journey. Figure 1 contains a visual representation of the first 4 back pain–related events across the patient sample.

Figure 1. Variation in the first 4 events of the patient back pain journey. Because back pain is a preference-sensitive condition, high variation exists in the first 6 months of the patient journey. The letters in this Sankey chart correspond to the event types displayed in Table 1. Of the patient back pain episodes, 74% contain 4 or fewer events; 89% are completed within 6 months.

Sequence alignment

For preference-sensitive conditions such as back pain, the treatment decisions (eg, whether the patient was prescribed opioids) as well as the order of treatment decisions (eg, whether the patient was sent for advanced imaging before or after attempting physical therapy) can substantially impact patient outcomes.4,37,40,42,43 The next stage of our proposed algorithm assesses the similarity between pairs of patient journey sequences based on both content and order, without requiring researchers to explicitly define clinical rules.

Levenshtein’s edit distance algorithm aligns 2 sequences using a combination of edits: matches, insertions (or, equivalently, deletions), and substitutions.44 For example, the sequences G-T-T-T-P and P-T-P could be aligned by substituting the G for P at the front of the string, and inserting 2 Ts into the middle of the second sequence (see Figure 2). In the standard Levenshtein algorithm, each match between the 2 sequences is awarded a value of 1 and each insertion or substitution is penalized with a value of −1.45 As such, aligning G-T-T-T-P and P-T-P as described previously (substitute + match + insert + insert + match) with the Levenshtein edit costs would result in an alignment score of −1.

Figure 2. Example of sequence alignment. Our adaptation of Levenshtein’s edit distance maximizes the total score of aligning 2 sequences using matches, substitutions, insertions, and transpositions. Because multiple possible alignments exist for any 2 strings, dynamic optimization is applied to maximize the sequence alignment score based on the given edit values.

Our algorithm relies on 2 expansions of the Levenshtein algorithm. First, we allow for transpositions, such that O-P and P-O could be aligned by swapping the last 2 characters instead of applying the insert-match-insert sequence.46 This is important in the back pain context because small changes in the order of patient actions (eg, filling a prescription and getting an x-ray) are often due to scheduling constraints and are of little consequence to the patient’s overall pattern of care. Second, unlike in the Levenshtein algorithm, in which every insertion or substitution carries the same penalty of −1, our algorithm customizes the edit values based on both the type of editing action and the event being edited.19 For instance, transposing 2 letters may be awarded a smaller edit value compared with matching on the same 2 letters. Assigning edit values can be data-driven, rely on the input of medical experts, or combine both.47

For back pain patients, some rarer treatment options, such as surgery, can be a defining aspect of the patient journey. Therefore, instead of weighting a match on surgery equal to a match on a primary care visit, we assign the value of matching events in proportion to the rareness of the event (which we refer to as rareness weighting). With match values scaled between 1 and 10 in our dataset, A (inpatient admission), which makes up 0.1% of events, has a match edit value of 10.0, while P (primary care visits), which makes up 11.7% of events, has a match edit value of 2.2. See the Supplementary Appendix for a complete list of edit values and the corresponding sensitivity analyses.
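As a point of reference for the 2 expansions, the baseline scoring (match worth 1, insertion or substitution worth −1) can be sketched as a small dynamic program. This is an illustrative sketch of the standard scheme described above, not the customized algorithm; the function name `levenshtein_score` is an assumption.

```python
def levenshtein_score(a, b, match=1, penalty=-1):
    """Maximum alignment score: +1 per match, -1 per insertion or substitution."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty string requires only insertions.
    for y in range(1, rows):
        score[y][0] = y * penalty
    for z in range(1, cols):
        score[0][z] = z * penalty
    for y in range(1, rows):
        for z in range(1, cols):
            diagonal = match if a[y - 1] == b[z - 1] else penalty
            score[y][z] = max(score[y - 1][z - 1] + diagonal,  # match or substitute
                              score[y - 1][z] + penalty,       # insert a[y-1]
                              score[y][z - 1] + penalty)       # insert b[z-1]
    return score[-1][-1]


print(levenshtein_score("GTTTP", "PTP"))  # -1
print(levenshtein_score("OP", "PO"))      # -1 (insert-match-insert)
```

The first call reproduces the score of −1 for G-T-T-T-P and P-T-P; the second shows why a transposition edit is useful, since without it O-P and P-O score no better than −1.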
As each pair of sequences may be aligned with multiple sets of edit actions, we utilize dynamic optimization to efficiently calculate the highest possible alignment score. The dynamic program is based on the principle that the maximum alignment score of strings $i$ and $j$ must be some combination of an action (eg, substitution) on the last letter(s) of 1 or both the sequences and the optimal score before that action. Specifically, we define $s^*_{i,j}(y,z)$ to represent the maximum score of aligning the first $y$ elements of patient journey $i$ (where $1 \le y \le i\_len$, the number of elements in sequence $i$) with the first $z$ elements of patient journey $j$ (where $1 \le z \le j\_len$, the number of elements in sequence $j$). The value of the $y$th element of $i$ is designated as $i[y]$ and the $z$th element of $j$ as $j[z]$. The values ($v$) associated with each potential edit operation are $v_{mtc}$ (match), $v_{sub}$ (substitution), $v_{ins}$ (insertion), and $v_{tns}$ (transposition). The dynamic optimization problem to maximize similarity score $s^*_{i,j}(y,z)$ can be expressed through the following formulation:

$$s^*_{i,j}(y,z) = \max \left\{ s^*_{i,j}(y-a,\, z-b) + e_{ij}(y,z) \right\}$$

Subject to:

$$s^*_{i,j}(0,0) = 0$$

$$1 \le y \le i\_len; \quad 1 \le z \le j\_len$$

$$(a,b) \in \{(0,1),\ (1,0),\ (1,1),\ (2,2)\}; \quad a \le y; \quad b \le z$$

$$e_{ij}(y,z) = \begin{cases}
v_{ins}(j[z]) & \text{when } a=0;\ b=1 & [1] \\
v_{ins}(i[y]) & \text{when } a=1;\ b=0 & [2] \\
v_{mtc}(i[y]) & \text{when } a=b=1;\ i[y]=j[z] & [3] \\
\dfrac{v_{sub}(i[y]) + v_{sub}(j[z])}{2} & \text{when } a=b=1;\ i[y] \ne j[z] & [4] \\
\max\{v_{mtc}(i[y]),\, v_{mtc}(j[z])\} + \min\{v_{tns}(i[y]),\, v_{tns}(j[z])\} & \text{when } a=b=2;\ i[y-1]=j[z];\ i[y]=j[z-1] & [5] \\
-\infty & \text{when } a=b=2;\ i[y-1] \ne j[z] \ \text{or}\ i[y] \ne j[z-1] & [6]
\end{cases}$$

Where [1] inserts letter $j[z]$ into string $i$, [2] inserts letter $i[y]$ into string $j$, [3] matches letter $i[y]=j[z]$, [4] substitutes letter $i[y]$ for $j[z]$, [5] transposes letters $i[y-1:y]$ with $j[z-1:z]$, and [6] indicates that a transposition between $i[y-1:y]$ and $j[z-1:z]$ is not valid.

After obtaining the optimal similarity score, we calculate the minimum ($scale\_min_{ij}$) and maximum ($scale\_max_{ij}$) scores that could have been generated for the given pair of strings $i$, $j$ (see Supplementary Appendix for calculation).
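A minimal sketch of this recurrence follows, assuming each edit value is supplied as a dictionary keyed by event letter. The function name `alignment_score`, the helper `uniform`, and the uniform edit values in the usage lines are hypothetical placeholders; the study's actual values are listed in its Supplementary Appendix. Python strings are 0-indexed, so `i[y - 1]` in the code corresponds to $i[y]$ in the formulation.

```python
import math


def alignment_score(i, j, v_mtc, v_sub, v_ins, v_tns):
    """Maximum alignment score of journeys i and j under the recurrence above."""
    s = [[-math.inf] * (len(j) + 1) for _ in range(len(i) + 1)]
    s[0][0] = 0.0
    for y in range(len(i) + 1):
        for z in range(len(j) + 1):
            if y == 0 and z == 0:
                continue
            options = []
            if z >= 1:  # [1] insert j[z] into i
                options.append(s[y][z - 1] + v_ins[j[z - 1]])
            if y >= 1:  # [2] insert i[y] into j
                options.append(s[y - 1][z] + v_ins[i[y - 1]])
            if y >= 1 and z >= 1:
                if i[y - 1] == j[z - 1]:  # [3] match
                    options.append(s[y - 1][z - 1] + v_mtc[i[y - 1]])
                else:  # [4] substitution, averaging the two letters' values
                    options.append(s[y - 1][z - 1]
                                   + (v_sub[i[y - 1]] + v_sub[j[z - 1]]) / 2)
            if (y >= 2 and z >= 2
                    and i[y - 2] == j[z - 1] and i[y - 1] == j[z - 2]):
                # [5] transposition of the last two letters
                options.append(s[y - 2][z - 2]
                               + max(v_mtc[i[y - 1]], v_mtc[j[z - 1]])
                               + min(v_tns[i[y - 1]], v_tns[j[z - 1]]))
            s[y][z] = max(options)
    return s[len(i)][len(j)]


def uniform(value, letters="SMAEGDNPTCIXORW"):
    """Hypothetical helper: the same edit value for every event letter."""
    return {c: value for c in letters}


values = dict(v_mtc=uniform(1), v_sub=uniform(-1),
              v_ins=uniform(-1), v_tns=uniform(-0.5))
print(alignment_score("OP", "PO", **values))      # 0.5
print(alignment_score("GTTTP", "PTP", **values))  # -1.0
```

With these placeholder values, the transposition lifts O-P versus P-O from −1 to 0.5, while G-T-T-T-P versus P-T-P keeps its baseline score because no transposition improves that alignment.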
We then transform the optimal value of aligning the 2 complete strings $s^*_{i,j}(i\_len, j\_len)$ into a normalized similarity score $s_{i,j}$, where 0 represents no similarity between strings and 1 implies the strings are identical:

$$s_{i,j} = \frac{s^*_{i,j}(i\_len,\, j\_len) - scale\_min_{ij}}{scale\_max_{ij} - scale\_min_{ij}}$$

The algorithm thus assigns high similarity scores to similar patient journeys (eg, P-X-O-W and P-X-O-O-W have a similarity score of 0.81) and lower scores to less similar journeys (eg, P-X-O-W and E-R-P-W have a similarity score of 0.21). The similarity scores $s_{i,j}$ for each pair of journeys are compiled into a similarity matrix (see Figure 3).

Figure 3. Sample of the n-by-n similarity matrix. The matrix is populated using the normalized similarity scores. The index [i, j] in the similarity matrix s corresponds to the similarity score between patient journey i and patient journey j. Note that diagonal entries all have a normalized similarity score of 1 (as a given patient journey is identical to itself), and the lower diagonal is a reflection of the upper diagonal scores (because $s_{i,j} = s_{j,i}$).

Journey clustering

The goal of the clustering is to summarize the main patterns of the patient journeys.
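Clustering takes the normalized similarity matrix as its input. As a rough sketch of how such a matrix might be assembled, the code below applies the min-max normalization to any pairwise scoring function and fills a symmetric matrix with a unit diagonal. The names `normalize`, `similarity_matrix`, `toy_score`, and `toy_bounds` are hypothetical, and the toy scoring function is only a stand-in: it does not reproduce the study's alignment scores or its scale bounds, which are defined in the Supplementary Appendix.

```python
def normalize(score, scale_min, scale_max):
    """Min-max scale a raw alignment score onto [0, 1]."""
    return (score - scale_min) / (scale_max - scale_min)


def similarity_matrix(journeys, score_fn, bounds_fn):
    """Symmetric n-by-n matrix of normalized scores with 1.0 on the diagonal."""
    n = len(journeys)
    s = [[1.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            lo, hi = bounds_fn(journeys[a], journeys[b])
            raw = score_fn(journeys[a], journeys[b])
            s[a][b] = s[b][a] = normalize(raw, lo, hi)
    return s


# Stand-in scoring (assumption): count of position-wise identical letters.
def toy_score(a, b):
    return sum(x == y for x, y in zip(a, b))


def toy_bounds(a, b):
    return 0, max(len(a), len(b))


matrix = similarity_matrix(["PXOW", "PXOOW", "ERPW"], toy_score, toy_bounds)
for row in matrix:
    print(row)
```

Only the upper triangle is computed; the lower triangle is mirrored because the score is symmetric.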
As it is important for the methodology to scale to large patient samples, we leverage k-means clustering, an effective approach when classifying objects within large datasets.21 The basic k-means algorithm (1) chooses k objects to be cluster centers, (2) assigns all other objects to their nearest cluster center, and (3) re-evaluates the center of the cluster. Steps 2 and 3 are repeated until the algorithm converges and no reassignments are made. To choose the cluster centers, we leverage the “k-means++” seeding technique, an approach that encourages starting seeds to be widely spread across the sample.48 After the first center K is randomly chosen, the next center is sampled with probability proportional to the squared distance between K and each of the other objects.

Then, because k-means clustering can be sensitive to its initialization, we aggregate the results from multiple iterations of k-means using ensemble clustering. Ensemble clustering forms more stable clusters, with improved robustness and less distortion.21,22,49 After the k-means algorithm is run with different values of k and starting seeds, we calculate the percentage of times that patient journey i has been clustered together with patient journey j. These percentages are populated into what is called a co-association matrix. Researchers can then choose the single-link threshold t, which represents the minimum percentage that a patient journey i must have been clustered together with 1 (or more) of the patient journeys j in the final data partition Cn for patient journey i to be added into Cn. In the example illustrated in Figure 4, higher thresholds (eg, 90%) yield smaller and more homogeneous clusters, whereas lower thresholds (eg, 50%) yield larger and more diverse clusters.

Figure 4. Aggregating k-means results using ensemble clustering. A single-link method partitions the outputs from multiple iterations of k-means into the final patient journey clusters Cn. When the minimum threshold t is set to 90%, 2 clusters form: POWPO-POWPRO and GWGIGW-GXIGW; the other 4 patient journeys drop out as “noise.” When t = 70%, patient journeys are categorized into 1 of 3 clusters: GWGIGW-GXIGW-GXIW, POWPO-POWPRO-PPRW, or EOXW-EPOPW. When t = 50%, 2 clusters merge, resulting in 2 more heterogeneous clusters: POWPO-POWPRO-PPRW-EOXW-EPOPW and GWGIGW-GXIGW-GXIW.

The chosen threshold t should balance the specificity of the clusters (to focus on specific sets of patients) with the cluster size (to gain enough “power” for any subsequent interpretations, regressions or analyses). To gain an overview of the main patient journey patterns in this study, clinicians selected a threshold of 50% to extract 12 main patient journey patterns from the data. As detailed in the Supplementary Appendix, this threshold is appropriate for this study context in gathering a comprehensive overview of the first 6 months of back pain treatment; setting t to higher thresholds resulted in more, smaller partitions appropriate for studying more detailed clinical questions.

RESULTS

Using the proposed data-driven methodology, the 10 000 patient journeys were reduced into 12 primary patient journey clusters.
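To make the ensemble step behind these clusters concrete, the following sketch builds a co-association matrix from several k-means label assignments and then applies a single-link threshold t. It assumes the k-means runs have already produced one label per journey; the function names and the toy label runs are illustrative assumptions, and journeys left unlinked at the chosen threshold are treated as noise, as in Figure 4.

```python
from itertools import combinations


def co_association(label_runs):
    """Fraction of runs in which each pair of journeys shares a cluster label."""
    n = len(label_runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_runs:
        for a in range(n):
            for b in range(n):
                if labels[a] == labels[b]:
                    counts[a][b] += 1
    return [[c / len(label_runs) for c in row] for row in counts]


def single_link_clusters(co, t):
    """Link journeys whose co-association is at least t; singletons are noise."""
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(range(n), 2):
        if co[a][b] >= t:
            parent[find(a)] = find(b)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    clusters = [g for g in groups.values() if len(g) > 1]
    noise = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, noise


# Toy labels from 4 hypothetical k-means runs over 5 journeys.
runs = [[0, 0, 0, 1, 1],
        [0, 0, 1, 1, 1],
        [0, 0, 1, 2, 2],
        [0, 0, 0, 1, 2]]
co = co_association(runs)
for t in (0.9, 0.7, 0.5):
    print(t, single_link_clusters(co, t))
```

In this toy input, lowering t from 0.9 to 0.5 first admits a second cluster and then absorbs the remaining noise journey, mirroring the trade-off between cluster specificity and cluster size.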
The resulting clusters displayed in Table 2 show the distribution of patient episodes between diagnosis and treatment pathways, along with example patient journey sequences that make up each cluster.

Table 2. Back pain patient journey clusters

| Cluster | Journeys | Example sequences | Clinical description |
|---|---|---|---|
| 1 | 17.0% | PXWPW, PXPTW, PXPW, PW, PPXWPW, PXPPW, PPXTW, PWPXWTW, PPW | Primary care + x-ray or wait |
| 2 | 16.0% | EWEXW, EXEW, EEXW, EXERW, EXWEW, EEWXW, EWXW, EEXRW, EXEWTTW | Emergency + x-ray |
| 3 | 9.7% | TTWTTW, TWTTTW, TTWTTTW, TTTWTW, TTWTWT, TWTTWTW | Short therapy series |
| 4 | 9.3% | CCCCWCCW, CCCCWCCCW, CCCWCCCCW, CCCCCCCW | Alternative medicine |
| 5 | 9.1% | PRWPRW, PRPRW, PRPRTW, PRTWPRW, PRPRPW, PRRPW, PXRPRW, PRWPRRW | Primary care + nonopioid prescriptions |
| 6 | 8.1% | TTTTTTTTWTTTTW, TTTTTTTTTTTTW, TTTTTTWTTTTTW, TTTTTWTTTTTTW | Long therapy series |
| 7 | 5.7% | POWPOW, PORPOW, POPOW, PORWPOW, PWPOPOW, POROW, PXOWPOW | Primary care + opioids |
| 8 | 5.7% | MWMWMW, MWMMW, MMWMW, MWIMWMW, TTMMWMW, MIMWMMW | Pain medicine procedure |
| 9 | 5.5% | GXIWXW, GXIGW, GXWIGW, PGXIGW, GXIGWTTW, GXGW, GXWGIGW | Surgeon + imaging |
| 10 | 5.3% | EXIW, EXWIW, EWIW, EIWIW, EXWIEW, EITW, EIW, EIWTTW, EIWEIW | Emergency + advanced imaging |
| 11 | 5.0% | DDDWDW, DDDWDDW, DWDDWDW, DDDDWDW, DWDDDW, DDDDW | Pain medicine specialist |
| 12 | 2.0% | NIWNW, NINW, NWNW, NRINW, NITTNTTW, NRNW, NNW, NINRTTW | Nonprocedural specialist |
| Noise | 1.7% | - | Other |

The highest proportion of patients (17.0% in cluster 1) visit a primary care practitioner and are directed to a low-acuity next step that may include waiting at least 4 weeks, getting an x-ray diagnosis, or a physical or occupational therapy appointment. Patients in cluster 1 appear to closely follow clinical guidelines that promote noninvasive, nonopioid care after initial onset of back pain.36,43 Patients in clusters 5 and 7 also begin their back pain episode in the primary care setting; however, most patients fill prescriptions (either opioid or nonopioid) as their first-line treatment. The second most common cluster is comprised of patients who make an unplanned visit to an emergency or urgent care center and receive an x-ray (16.0% in cluster 2).
In 9.7% of episodes (cluster 3), we observe a self-referral to physical or occupational therapy, in which the patient proceeds to have approximately 3-5 additional therapy appointments. There also exist small, well-defined clusters such as cluster 8 (5.7% of episodes in which patients are primarily treated with facet or epidural injections) and cluster 11 (5.0% of episodes in which pain medicine specialists are consulted but do not administer epidural or facet injections).

As described in the Materials and Methods, patient journeys were clustered solely on the sequence of the patient’s back pain events without considering the patient’s comorbidities or demographics. However, as seen in Table 3, there is a high level of variability between clusters in terms of patient characteristics.

Table 3. Cluster summary statistics

Column groups, left to right: Patient characteristics; Within 6 weeks of episode start; Within 6 months of episode start.

| Cluster | Average age (y) | Average annual income (a) | Advanced imaging | Opioids filled | Unplanned emergency | Surgery | Average total cost (b) |
|---|---|---|---|---|---|---|---|
| 1 | 47.2 | $73,900 | 2.2% | 0.0% | 1.0% | 0.1% | 1.0× |
| 2 | 49.1 | $62,700 | 0.0% | 1.9% | 100.0% | 0.0% | 2.2× |
| 3 | 50.3 | $80,000 | 0.8% | 0.0% | 0.0% | 0.0% | 1.8× |
| 4 | 44.0 | $77,400 | 0.8% | 0.5% | 1.0% | 0.0% | 2.2× |
| 5 | 48.2 | $69,400 | 5.5% | 3.1% | 5.0% | 0.2% | 2.0× |
| 6 | 53.4 | $84,000 | 2.1% | 1.0% | 3.5% | 0.1% | 7.7× |
| 7 | 53.2 | $68,800 | 11.8% | 91.2% | 23.5% | 2.3% | 4.2× |
| 8 | 65.5 | $71,900 | 12.4% | 3.7% | 6.4% | 1.4% | 10.1× |
| 9 | 49.2 | $85,500 | 27.8% | 9.4% | 8.2% | 9.1% | 10.5× |
| 10 | 59.7 | $67,200 | 89.8% | 7.9% | 97.4% | 2.5% | 7.5× |
| 11 | 51.1 | $81,700 | 10.8% | 5.8% | 3.0% | 0.8% | 5.3× |
| 12 | 51.8 | $84,400 | 17.3% | 5.6% | 3.0% | 3.0% | 5.3× |
| Noise | 50.4 | $70,600 | 36.9% | 1.2% | 14.3% | 1.8% | 7.7× |

(a) As estimated using the census tract data provided by the U.S. Census Bureau; salary estimates are based on the patient’s census tract area.
(b) Compared with least expensive cluster 1.

For example, patients in cluster 6 (who receive 10 or more physical or occupational therapy sessions) or clusters 9, 11, and 12 (who obtain care from specialists) are more likely to live in areas with higher average salaries compared with patients who follow different patient journeys. Meanwhile, the lowest average salaries within the sample are associated with clusters 2 and 10 (seeking care from the emergency room) or cluster 7 (being prescribed opioids by a primary care physician). The alternative medicine cluster (cluster 4) is associated with the youngest average age of the sample, whereas cluster 8 (invasive pain management procedures) is associated with the highest average age.

Driven largely by our use of rareness-weighted edit values, there is a high level of diversity between clusters in terms of the key back pain outcomes such as early advanced imaging and surgical rates.37 For example, high rates of back pain surgery are concentrated among the patients whose initial starting encounter is a surgeon (cluster 9 at 9.1%), compared with patients who enter the system through other clinical entry points. Episodes in cluster 8 (invasive pain management procedures) average 10.1 times higher medical costs than episodes in cluster 1.
In cluster 7, in which the patient’s first point of contact is typically the primary care physician, 91.2% of patients are prescribed and fill opioids within the first 6 weeks of their episode. The opioid fill rate in this primary care cluster even exceeds that of clusters 2 and 10, in which patients seek care in emergency or urgent care settings. Although early advanced imaging within 6 weeks of onset of pain is considered a major contributor to overtreatment and inappropriate medical spending, 89.8% of the patients in cluster 10, who seek care in an emergency or urgent care setting, receive magnetic resonance imaging or a computed tomography scan within 6 weeks of their index back pain claim.4 Despite the high cost associated with this cluster, the episodes appear short-lived, with 93.4% of patients ceasing treatment for back pain within the first 6 weeks compared with the overall rate of 88.9% across the sample. An additional 27.8% of patients who seek care from a surgeon (cluster 9) receive advanced imaging within the first 6 weeks, as do 17.3% of patients whose care is managed by nonprocedural specialists (cluster 12). Although they jointly comprise only 23.4% of episodes in the sample, clusters 8-12, which rely heavily on specialists, procedures, and imaging, make up 43.2% of overall back pain spending.

DISCUSSION

While claims data have been touted as having the potential to provide a bird’s-eye view of a patient’s healthcare records and of healthcare utilization at the population level, studies have often fallen short of that goal. Claims data are notoriously noisy (owing to, for example, variation in medical coding) and are not generated for research purposes.11–15 The proposed methodology effectively used a combination of data processing, sequence alignment, and ensemble clustering to identify primary patient journey patterns in the highly variable, preference-sensitive condition of back pain.
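
To make the sequence alignment step concrete, the following is a minimal Python sketch of an edit distance with per-event edit values, computed for every pair of episode strings. It is an illustration under stated assumptions, not the implementation used in this study: the event letters, the example episodes, the function names, and the logarithmic rareness weight are all hypothetical, chosen only so that rarer events cost more to insert, delete, or substitute.

```python
import math
from collections import Counter
from itertools import combinations


def rareness_weights(episodes):
    """Assumed weighting (not the study's): rarer event letters get larger edit values."""
    counts = Counter(ch for ep in episodes for ch in ep)
    total = sum(counts.values())
    # An event that accounts for every encounter would get weight 1.0;
    # the weight grows logarithmically as the event becomes rarer.
    return {ch: 1.0 + math.log(total / n) for ch, n in counts.items()}


def weighted_edit_distance(a, b, w):
    """Levenshtein-style dynamic program with per-letter insert/delete/substitute costs."""
    # First row: cost of building each prefix of b from an empty string.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + w[cb])
    for ca in a:
        # First column: cost of deleting the prefix of a seen so far.
        curr = [prev[0] + w[ca]]
        for j, cb in enumerate(b, start=1):
            substitute = 0.0 if ca == cb else max(w[ca], w[cb])
            curr.append(min(
                prev[j] + w[ca],          # delete ca
                curr[j - 1] + w[cb],      # insert cb
                prev[j - 1] + substitute  # match or substitute
            ))
        prev = curr
    return prev[-1]


def pairwise_distances(episodes):
    """Full symmetric matrix of distances between all pairs of episodes."""
    w = rareness_weights(episodes)
    n = len(episodes)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = weighted_edit_distance(episodes[i], episodes[j], w)
    return dist


# Hypothetical encoding: P = primary care visit, T = therapy session,
# E = emergency visit, I = advanced imaging, S = surgery, O = opioid fill.
episodes = ["PTT", "PTTT", "EIS", "PO"]
dist = pairwise_distances(episodes)
```

In this toy sample, the two therapy-heavy journeys ("PTT" and "PTTT") differ by a single common event and end up close together, while the journey made of rare events ("EIS") is far from both. The matrix requires n(n-1)/2 comparisons, which for a sample of 10 000 episodes is 49 995 000 pairs.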
When a group of primary care providers in a large multispecialty clinic was presented with the preliminary cluster outputs using the group’s own data, they initially voiced concerns about the validity of the treatment pathways. However, within a short time of reviewing the outputs, the clinicians moved beyond their own recollections of individual patient cases to a more objective discussion of treatment options within the context of real-life complexities. In addition to engaging with an overview of care plans, clinicians leveraged the outputs to better understand variability between outcomes and inform future medical research into how patients and providers interact to create high-quality care.

This study is the result of a close collaboration between healthcare informatics researchers and clinicians. Medical expertise on back pain was instrumental in designing an objective with clinical relevance and informing the patient sample criteria. Clinicians also helped define event types, set edit value assumptions, and interpret cluster results. While not used directly in this research, standardized logics, such as SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms), can aid informatics researchers when translating clinical assumptions into corresponding diagnosis codes.50

There are multiple opportunities to expand this methodology in future work. First, the methodology could be adapted to study longitudinal healthcare events such as chronic conditions or the patient experience at the end of life. Doing so may require researchers to adjust the preprocessing steps such that the patient journey is represented in blocks of time instead of as a string of distinct events. Second, researchers could consider different methods of assigning edit values.
For instance, the rareness-weighting method of assigning edit values was developed when clinicians identified that the Levenshtein edit values caused insufficient differentiation of journeys with rare, high-cost events. In other research contexts that have well-established guidelines, a weighting strategy could assign edit values based on which events are recommended as first-line or second-line therapies. Third, the methodology could be adjusted based on the size of the dataset. The sequence alignment step can be computationally intensive, as it compares each pair of patient journeys to obtain the full matrix of similarity scores. In comparison, conformance checking in process mining uses sequence alignment to compare event-log data to a single, predetermined understanding of the process map.19,24 While the conformance checking approach anchors the analysis to prior, potentially biased knowledge of the system, there likely exists a balance between its limited comparisons and our methodology, which lets patterns emerge fully from the data. Researchers could also compare the efficiency of the presented clustering method to techniques such as spectral clustering. Finally, we have not yet explored techniques used in other applications of data science and sequence alignment, including the trace-back method that identifies sources of deviations along pairs of sequences.51 In the context of preference-sensitive conditions, the trace-back method could allow researchers to isolate key discrepancies between journeys that may have led to variation in outcome measures. The resulting clusters can also be combined with prediction algorithms to identify patients who should be targeted for early intervention. For example, certain patterns at the beginning of the journey may signal that a patient is at elevated risk for aggressive opioid prescribing or for a low-quality procedure.

Despite the clinically relevant results, we acknowledge that this study has several limitations.
We used a very broad definition of back pain that captured most patients presenting new general back pain symptoms. The purpose of this definition was to identify patient episodes typically expected to follow a conservative care route, as outlined by the American College of Physicians, and to understand population-level deviation from clinical recommendations.36,37 It is not known how well our selected population reflects all patients who present with back pain, as claims data were not supplemented by other data sources such as hospital notes or psychological evaluations. Additionally, while patients are geographically dispersed throughout the United States, the sample is not nationally representative; thus, the breakdown of patient episodes into clusters may not represent national trends.

CONCLUSION

Compared with clinical guidelines that represent a top-down picture of patient behavior, the outputs from this methodology reveal a data-driven understanding of how patients traverse the healthcare system. Using a limited set of assumptions, the methodology is particularly effective in analyzing conditions with high levels of variability in patient care and those treated across service locations. In the preference-sensitive condition of back pain, we observed that treatment choices are associated with patient characteristics and procedure rates, thereby highlighting the potential public health impact of related future studies based on this methodology. When tailored to various care settings, this methodology can provide the medical community with an accurate overview of the current state of patient care and facilitate a shift toward high-quality practice patterns.

FUNDING

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sector. It was supported internally by Evolent Health.

AUTHOR CONTRIBUTIONS

All authors were involved in revising the work for intellectual content and approved the manuscript.
KB, AC, and LH made substantial contributions to the original study design. KB developed the algorithm and contributed to data analysis, interpretation, and writing of the manuscript. CL contributed to data analysis, interpretation, and writing of the manuscript. AC contributed to data acquisition and revising the manuscript. MB refined the algorithm and contributed to data interpretation, writing, and revising the manuscript. LH served as the primary contact with provider groups during algorithm development and contributed to clinical assumptions, data interpretation, and revising the manuscript.

SUPPLEMENTARY APPENDIX

Supplementary Appendix is available at Journal of the American Medical Informatics Association online.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the following individuals for their contributions: Rich King, Malcolm Charles, and Michael Freeman, who advised on algorithm development; Madina Bram, who provided administrative support on the project; and Feryal Erhun, Stefan Scholtes, Nico Lewine, Matthias Weidlich, and Jenny Wang, who provided valuable feedback and suggestions on the development and framing of this research. The authors also thank the JAMIA review team for their constructive and encouraging comments, as well as the clinicians who participated in discussions on the back pain assumptions and outputs.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

1 Trebble TM, Hansi N, Hydes T, et al. Process mapping the patient journey: an introduction. BMJ 2010; 341 (1): c4078.
2 Flint LA, David DJ, Smith AK, et al. Rehabbed to death: breaking the cycle. J Am Geriatr Soc 2019; 67 (11): 2398–9.
3 Wennberg JE. Unwarranted variations in healthcare delivery: implications for academic medical centres. Br Med J 2002; 325 (7370): 961–4.
4 Deyo R, Mirza S, Turner J, et al.
Overtreating chronic back pain: time to back off? J Am Board Fam Med 2009; 22 (1): 62–8.
5 Heyward J, Jones CM, Compton WM, et al. Coverage of nonpharmacologic treatments for low back pain among US public and private insurers. JAMA Netw Open 2018; 1 (6): e183044.
6 The Dartmouth Atlas of Healthcare: Understanding Efficiency and Effectiveness of the Health Care System. 2018. http://www.dartmouthatlas.org/ Accessed November 19, 2018.
7 Wood A, Matula SR, Huan L, et al. Improving the value of medical care for patients with back pain. Pain Med 2019; 20 (4): 664–7.
8 Huang Z, Lu X, Duan H. On mining clinical pathway patterns from medical behaviors. Artif Intell Med 2012; 56 (1): 35–50.
9 Maeng D, Boscarino J, Stewart W, et al. A comparison of electronic medical records vs. claims data for rheumatoid arthritis patients in a large healthcare system: an exploratory analysis. Clin Med Res 2014; 12: 108.
10 Huang Z, Ge Z, Dong W, et al. Probabilistic modeling personalized treatment pathways using electronic health records. J Biomed Inform 2018; 86: 33–48.
11 Bertsimas D, Bjarnadóttir MV, Kane MA, et al. Algorithmic prediction of health-care costs. Oper Res 2008; 56 (6): 1382–92.
12 Bjarnadottir MV, Czerwinski D, Guan Y. The history and modern application of insurance claims data in healthcare research: from data to knowledge to healthcare improvement. In: Yang H, Lee E, eds. Healthcare Analytics: From Data to Knowledge to Healthcare Improvement. Hoboken, NJ: Wiley; 2016: 561–91.
13 Simon GE, Shortreed SM, Johnson E, et al.
What health records data are required for accurate prediction of suicidal behavior? J Am Med Inform Assoc 2019; 26 (12): 1458–65.
14 Heintzman J, Bailey SR, Hoopes MJ, et al. Agreement of Medicaid claims and electronic health records for assessing preventive care quality among adults. J Am Med Inform Assoc 2014; 21 (4): 720–4.
15 Zhang Y, Koru G. Understanding and detecting defects in healthcare administration data: Toward higher data quality to better support healthcare operations and decisions. J Am Med Inform Assoc 2020; 27 (3): 386–95.
16 Sun J, Sun Y. A system for automated lexical mapping. J Am Med Inform Assoc 2006; 13 (3): 334–43.
17 Wrenn JO, Stein DM, Bakken S, et al. Quantifying clinical narrative redundancy in an electronic health record. J Am Med Inform Assoc 2010; 17 (1): 49–53.
18 Kate RJ. Normalizing clinical terms using learned edit distance patterns. J Am Med Inform Assoc 2016; 23 (2): 380–6.
19 Van der Aalst W, Adriansyah A, Van Dongen B. Replaying history on process models for conformance checking and performance analysis. WIREs Data Mining Knowl Discov 2012; 2 (2): 182–92.
20 Li P, Jiang X, Wang S, et al. HUGO: Hierarchical multi-reference genome compression for aligned reads. J Am Med Inform Assoc 2014; 21 (2): 363–73.
21 Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett 2010; 31 (8): 651–66.
22 Fodeh SJ, Brandt C, Luong TB, et al. Complementary ensemble clustering of biomedical data.
J Biomed Inform 2013; 46 (3): 436–43.
23 Zhang Y, Padman R, Levin JE. Paving the COWpath: data-driven design of pediatric order sets. J Am Med Inform Assoc 2014; 21 (e2): e304–11.
24 Rojas E, Munoz-Gama J, Sepúlveda M, et al. Process mining in healthcare: a literature review. J Biomed Inform 2016; 61: 224–36.
25 Wellner B, Huyck M, Mardis S, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc 2007; 14 (5): 564–73.
26 Zhang Y, Padman R. Innovations in chronic care delivery using data-driven clinical pathways. Am J Manag Care 2015; 21 (12): e661–668.
27 Chen JH, Podchiyska T, Altman RB. OrderRex: clinical order decision support and outcome predictions by data-mining electronic medical records. J Am Med Inform Assoc 2016; 23 (2): 339–48.
28 Ghasemi M, Amyot D. Process mining in healthcare: a systematised literature review. Int J Electron Healthc 2016; 9 (1): 60–88.
29 Delias P, Doumpos M, Grigoroudis E, et al. Supporting healthcare management decisions via robust clustering of event logs. Knowl-Based Syst 2015; 84: 203–13.
30 Song M, Günther CW, Van Der Aalst W. Trace clustering in process mining. In: BPM 2008: Business Process Management Workshops; 2008: 109–20.
31 Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proc Natl Acad Sci U S A 2016; 113 (27): 7329–36.
32 Kuwornu JP, Lix LM, Quail JM, et al.
Identifying distinct healthcare pathways during episodes of chronic obstructive pulmonary disease exacerbations. Medicine (Baltimore) 2016; 95: e2888.
33 Hoverman JR, Cartwright TH, Patt DA, et al. Pathways, outcomes, and costs in colon cancer: retrospective evaluations in 2 distinct databases. Am J Manag Care 2011; 7: 52–9.
34 Tessier JE, Rupp G, Gera JT, et al. Physicians with defined clear care pathways have better discharge disposition and lower cost. J Arthroplasty 2016; 31 (9): 54–8.
35 Tóth K, Kósa I, Vathy-Fogarassy Á. Frequent treatment sequence mining from medical databases. Stud Health Technol Inform 2017; 236: 211–8.
36 Qaseem A, Wilt TJ, McLean RM, et al. Noninvasive treatments for acute, subacute, and chronic low back pain: a clinical practice guideline from the American College of Physicians. Ann Intern Med 2017; 166 (7): 514–30.
37 Bernstein IA, Malik Q, Carville S, et al. Low back pain and sciatica: summary of NICE guidance. BMJ 2017; 356: 10–3.
38 Sinnott PL, Siroka AM, Shane AC, et al. Identifying neck and back pain in administrative data: defining the right cohort. Spine (Phila Pa 1976) 2012; 37 (10): 860–74.
39 Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol 1992; 45 (6): 613–9.
40 Jarvik JG, Gold LS, Comstock BA, et al. Association of early imaging for back pain with clinical outcomes in older adults. JAMA 2015; 313 (11): 1143–53.
41 Shah A, Hayes C, Martin B.
Factors influencing long-term opioid use among opioid naive patients: an examination of initial prescription characteristics and pain etiologies. J Pain 2017; 18 (11): 1374–83.
42 Gellhorn AC, Chan L, Martin B, et al. Management Patterns in Acute Low Back Pain: The Role of Physical Therapy. Spine (Phila Pa 1976) 2012; 37 (9): 775–82.
43 Deyo RA, Von Korff M, Duhrkoop D. Opioids for low back pain. BMJ 2015; 350: g6380–13.
44 Levenshtein V. Binary codes capable of correcting deletions, insertions, and reversals. Cybern Control Theory 1966; 10: 707–10.
45 Pentland BT. Conceptualizing and measuring variety in the execution of organizational work processes. Manage Sci 2003; 49 (7): 857–70.
46 Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM 1964; 7 (3): 171–6.
47 Bose R, van der Aalst WMP. Context aware trace clustering: towards improving process mining results. In: Proceedings of the 2009 SIAM International Conference on Data Mining; 2009: 401–12.
48 Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms; 2007: 1027–35.
49 Fred ALN, Jain AK. Robust data clustering. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2003.
50 Lee D, Cornet R, Lau F, et al. A survey of SNOMED CT implementations. J Biomed Inform 2013; 46 (1): 87–96.
51 Haque W, Aravind A, Reddy B. Pairwise sequence alignment algorithm.
In: ISTA ’09: Proceedings of the 2009 Conference on Information Science, Technology and Applications; 2009: 96–103.

© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Journal

Journal of the American Medical Informatics Association, Oxford University Press

Published: Jul 1, 2020