Respondent Robotics: Simulating Responses to Likert-Scale Survey Items
Abstract

The semantic theory of survey responses (STSR) proposes that the prime source of statistical covariance in survey data is the degree of semantic similarity (overlap of meaning) among the items of the survey. Because semantic structures can be estimated using digital text algorithms, it is possible to predict the response structures of Likert-type scales a priori. The present study applies STSR experimentally by computing real survey responses using such semantic information. A sample of 153 randomly chosen respondents to the Multifactor Leadership Questionnaire (MLQ) was used as target. We developed an algorithm based on unfolding theory, where data from digital text analysis of the survey items served as input. Upon deleting progressive numbers (from 20% to 95%) of the real responses, we let the algorithm replace these with simulated ones, and then compared the simulated datasets with the real ones. The simulated scores displayed sum score levels, alphas, and factor structures highly resembling their real origins even if up to 86% were simulated. In contrast, this was not the case when the same algorithm operated without access to semantic information. The procedure was briefly repeated on a different measurement instrument and a different sample. This not only yielded similar results but also pointed to the need for further theoretical and practical developments. Our study opens for experimental research on the effect of semantics on survey responses using computational procedures.

Keywords: semantics, simulation, surveys, semantic theory of survey response, leadership

BI Norwegian Business School, Oslo, Norway
University of Colorado Boulder, USA

Corresponding Author: Jan Ketil Arnulf, BI Norwegian Business School, Nydalen, N-0442 Oslo, Norway. Email: jan.k.arnulf@bi.no

This article is distributed in SAGE Open under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/).

Introduction

Is it possible to simulate and predict real survey responses before they happen? And what would that tell us? The present article describes and tests a method to create artificial responses according to the semantic properties of the survey items, based on the semantic theory of survey responses (STSR; Arnulf, Larsen, Martinsen, & Bong, 2014). According to STSR, the semantic relationships will shape the baseline of correlations among items. Such relationships are now accessible a priori through the use of digital semantic algorithms.

Theoretically, survey responses should be predictable to the extent that their semantic relationships are fixed. The present study seeks to develop such a method and apply it to a well-known leadership questionnaire, the Multifactor Leadership Questionnaire (MLQ; Avolio, Bass, & Jung, 1995). Thereafter, we briefly show how it performs on a different measurement scale.

The STSR has argued and empirically documented that up to 86% of the variation in correlations among items in organizational behavior (OB) research can be explained through their semantic properties (Arnulf & Larsen, 2015; Arnulf et al., 2014). Such strong predictors of response patterns imply that it is possible to reverse the equations and use semantics to create realistic survey responses. This offers an empirical tool to explore why semantics can explain as much as 65% to 86% in some surveys such as the MLQ, but as little as 5% in responses to personality inventories. There is a need for more detailed exploration of the phenomena involved to better understand how and why STSR applies.

Artificial responses calculated from the semantics of the items could also enhance the scientific value of surveys. Ever since Likert devised his measurement scales (Likert, 1932), recurring criticism has raised doubts about the predictive validity of the statistical models building on such scales (Firmin, 2010; LaPiere, 1934), as they are vulnerable to inflated values through common method variance (Podsakoff, MacKenzie, & Podsakoff, 2012). The prevalent use of covariance and correlation matrices in factor analysis and structural equation modeling (Abdi, 2003; Jöreskog, 1993) is problematic if we cannot discriminate semantic variance components more clearly from attitude strength. Establishing a semantic "baseline" of the factor structure in surveys would allow us to study how and why people choose to depart from what is semantically given.

Finally, a technology for simulating survey responses may have its own value. Present-day techniques for replacing missing values are basically mere extrapolations of what is already in the matrix, and they only work if the missing values make up minute fractions of the data (Rubin, 1987). In the current study, we present a technique to calculate the likely responses when up to 95% of responses are missing. This kind of simulated data helps improve the theoretical foundations of psychometrics, which has hitherto left semantics out of its standard inventory of procedures (Borsboom, 2008, 2009). Data simulation based on item semantics could also be a valuable accessory to otherwise complicated methods for testing methodological artifacts (Bagozzi, 2011; Ortiz de Guinea, Titah, & Léger, 2013).

The contributions of this article are threefold: primarily developing the rationale of STSR, secondarily testing a tool for establishing a baseline of response patterns from which more psychological inferences can be made, and finally offering a possible alternative approach to imputing missing data.

We first present how semantics can be stepwise turned into artificial responses. These responses are then compared with a sample of real responses and with artificial responses created without semantic information. The procedure is then applied to a second scale and dataset to test its applicability across instruments. Finally, we discuss how the relevant findings may help develop STSR from an abstract theory to practical applications.

Theory

Semantics and Correlations

Rensis Likert assumed that his scales delivered measures of attitude strength (Likert, 1932). Statistical modeling of such data in classic psychometrics viewed survey responses as basically composed of a true score and an error component. The error component of the score would reflect random influences on the response, and these could be minimized by averaging scores of semantically related questions for each variable (Nunnally & Bernstein, 2010). The error variance is assumed to converge around 0, making average scale scores a better expression of the true attitude strength of the respondents. The relationships among other surveyed variables should, however, not be determined by the semantics of the items, but should instead only covary to the extent that they are empirically related. This relative independence has frequently been demonstrated by applying factor analytical techniques (Abdi, 2003; Hu & Bentler, 1999). In short, the prevalent psychometric practices have until now treated the systematic variation among items as an expression of attitude strength toward topics in the survey.

The STSR proposes a contrasting view. Here, the relationships among items and among survey variables are first and foremost semantic (Arnulf et al., 2014), a view corroborated by independent researchers (Nimon, Shuck, & Zigarmi, 2016). Every respondent may begin the survey by expressing attitude strength toward the surveyed topic in the form of a score on the Likert-type scale. However, in the succeeding responses, the scores on the coming items may be predominantly determined by the degree to which these items are semantically similar. This was earlier argued and documented by Feldman and Lynch (1988), and a slightly different version of this hypothesis was also formulated by Schwarz (1999). However, both these precursors to STSR speculated that calculation of responses might be exceptional, confined to situations where people hold no real attitudes, or become unduly influenced in their response patterns by recent responses to other items. The formulation of STSR was the first claim that semantic calculation may actually be the fundamental mechanism explaining systematic variance among items.

Another antecedent to STSR is "unfolding theory" as described by Coombs (Coombs, 1964; Coombs & Kao, 1960) and later by Michell (1994). We will deal with unfolding theory in some detail, as it has direct consequences for creating algorithms to mimic real responses. A practical example may be a job satisfaction item, such as "I like working here." When respondents choose to answer this on a scale from 1 to 5, it may be hard to explain what the number means. To quantify an attitude, one could split the statement into discrete answering categories, such as the extremely positive attitude "I would prefer working here to any other job or even leisure activity." A neutral attitude could be a statement such as "I do not care if I work here or not," and a negative one "I would take any other job to get away from this one." The central point in unfolding theory is that any respondent's preferred response would be the point at which the item response scale "folds." Folding implies that the response alternatives need to be sorted by their mutual distance from the preferred option. If someone picks the option 4 on a scale from 1 to 5, it would mean that the options 3 and 5 are about equally distant from 4, but that 2 and certainly 1 would be further away from the preferred statement. In this way, the scale is said to be "folding" around the preferred value 4, which determines the distance of all other responses from the folding point.

Michell (1994) showed mathematically and experimentally that the quantitative properties of surveys stem from these semantic distinctions. Just as Coombs claimed, all respondents need to understand the common semantic properties (the meaning) of any survey item to attach numerical values to the questions in the survey. For two respondents to rate an item such as "I like to work here" with 1 or 5, they need to agree on the meaning of this response: the one respondent likes his job, the other does not, but both need to understand the meaning of the other response alternatives for one's own response to be quantitatively comparable. Michell showed how any survey scale needs to fold along a "dominant path," the mutual meaning of items and response options used in a scale. This "dominant path" will affect the responses to other items if they are semantically related.
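The folding idea described above can be sketched in a few lines of Python. This is our own illustration of Coombs's notion, not code from the study: the preferred option has distance 0, and all other options are ranked by their distance from it.

```python
# Illustrative sketch of "folding" a 5-point Likert scale around a
# respondent's preferred value: options are ranked by their distance
# from the preferred option, as unfolding theory describes.

def fold(scale, preferred):
    """Rank response options by their absolute distance from the preferred value."""
    return sorted(scale, key=lambda option: abs(option - preferred))

# A respondent whose preferred response is 4 ranks the options so that
# 3 and 5 are about equally close, while 2 and 1 are progressively further away.
print(fold([1, 2, 3, 4, 5], preferred=4))  # [4, 3, 5, 2, 1]
```

Ties (here 3 and 5) keep the scale's own order; the point is only that the distance from the folding point, not the raw scale position, orders the alternatives.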
Take the following simple example measuring job satisfaction and turnover intention, two commonly measured variables in OB research. One item measuring job satisfaction is "I like working here," and one item measuring turnover intention is "I will probably look for a new job in the next weeks." A person who answers 5 to "I like working here" is by semantic implication less likely to look for a new job in the next weeks than someone who scores 1, and vice versa. Less obvious is the effect of what Michell called the "dominant path": If someone has a slightly positive attitude toward the job without giving it full score, this person will be slightly inclined, but maybe not determined, to turn down offers for a new job. The dominant path of such items will make the respondents rank the mutual answering alternatives in an "unfolding" way. Not only are the extreme points of the Likert-type scales semantically linked, but people also appear to rank the response options of all items in mutual order. A third item measuring organizational citizenship behavior (OCB), for example, is "I frequently attend to problems that really are not part of my job." The semantic specification of responses to this scale may be negative items such as "I only do as little as possible so I don't get fired" or positive items such as "I feel capable and responsible for correcting any problem that may arise."

According to unfolding theory, people will respond such that their response pattern is semantically coherent, that is, consistent with an unfolding of the semantic properties of items. The dominant path will prevent most people from choosing answer alternatives that are not semantically coherent.

Any survey will need a semantically invariant structure to attain reliably different but consistent responses from different people. Coombs and Kao showed experimentally that there is a necessary structure in all surveys emanating from how respondents commonly understand the survey (Coombs & Kao, 1960; Habing, Finch, & Roberts, 2005; Roysamb & Strype, 2002).

In STSR, correlations among survey items are primarily explained by the likelihood that they evoke similar meanings. As we will show below, the semantic relationships among survey items contain information isomorphic to the correlations among the same items in a survey. This implies that individual responses are shaped, and thereby principally computable, because the semantics of items are given and possible to estimate prior to administering the survey.

To the extent that this is possible, current-day analytical techniques risk treating attitude strength as error variance. This is contrary to what is commonly believed, as the tradition of "construct validation" in survey research rests on the assumption that attitude strength across samples of respondents is the source of measures informing the empirical research (Bagozzi, 2011; Lamiell, 2013; MacKenzie, Podsakoff, & Podsakoff, 2011; Michell, 2013; Slaney, 2017; Slaney & Racine, 2013a, 2013b).

Other researchers have reported that the survey structure itself may create distinct factors for items that were originally devised as "reversed" or negatively phrased items (Roysamb & Strype, 2002; van Schuur & Kiers, 1994). One reason for this is the uncertain relationship between the actual measurements obtained from the survey and the assumed quantifiable nature of the latent construct in question. Kathleen Slaney's (2017) recent review of construct validation procedures shows how "measurement" of attitudes may come about by imposing numbers on an unknown structure. As shown by Andrew Maul (2017), acceptable psychometric properties of scales are obtainable even if keywords in the items are replaced by nonsensical words. The psychometric properties were largely retained even when the item texts were replaced by totally meaningless sentences, or even by entirely empty items carrying nothing but response alternatives. The survey structure seems to be a powerful source of method effects, imposing structure on response statistics.

The purpose here is to reconstruct survey responses using semantic information and other a priori known information about the survey structure. Semantic information about the content of items is precisely void of knowledge about attitude strength. If this type of information can be used to create artificial responses with meaningful characteristics akin to the original ones, it will substantiate the claims of STSR. In particular, it will deliver empirical evidence that common psychometric practices may risk treating attitude strength as error variance, leaving mostly semantic relationships in the statistics. This attempt is exploratory in nature, and we will therefore not derive hypotheses but instead seek to explore the research question from various angles. The following exploration is undertaken as two independent studies: Study 1 is an in-depth study of the MLQ, containing the main procedures to investigate and explore. Study 2 is a brief application of the same procedure to a different, shorter scale and another sample of respondents.

Study 1

Sample

Real survey responses were used to train the algorithms and serve as validation criteria. These consisted of 153 randomly selected responses from an original sample of more than 1,200 respondents in a Norwegian financial institution. The responses were collected anonymously through an online survey instrument. Participation was voluntary with informed consent, complying with the ethical regulations of the Norwegian Centre for Research Data (http://www.nsd.uib.no/nsd/english/index.html).

Estimating Item Semantics

A number of algorithms exist that allow computing the similarity of the survey items. Here, we have chosen one termed "MI" (Mihalcea, Corley, & Strapparava, 2006; Mohler & Mihalcea, 2009). MI is chosen because it has been previously published, is well understood, and allows easy replication. The Arnulf et al. (2014) study also showed that MI values are probably closer to everyday language than some LSA-generated values that may carry specialized domain knowledge.

The MI algorithm derives its knowledge about words from a lexical database called WordNet, containing information about 147,278 unique words that were encoded by a team of linguists between 1990 and 2007 (Leacock, Miller, & Chodorow, 1998; Miller, 1995; Poli, Healy, & Kameas, 2010). Building on knowledge about each single word in WordNet as its point of departure, MI computes a similarity measure for two candidate sentences, S1 and S2. It begins with tokenization and part-of-speech (POS) tagging of all the words in the survey item with their respective word classes (noun, verb, adverb, adjective, and cardinal, which play a very important role in text understanding). It then calculates word similarity by measuring each word in one sentence against all the words in the other sentence. This identifies the highest semantic similarity (maxSim) from six word-similarity metrics originally created to measure concept likeness (instead of word likeness). The metrics are adapted here to compute word similarity via the shortest distance between the given words' synsets in the WordNet hierarchy. The word–word similarity measure is directional: each word in S1 is computed against each word in S2, and then vice versa. The algorithm finally obtains sentence similarity by normalizing the highest semantic similarity (maxSim) for each word, applying "inverse document frequency" (IDF) weights from the British National Corpus to weight rare and common terms. The normalized scores are then summed up for a sentence similarity score, SimMI, as follows:

Sim_MI(S1, S2) = 1/2 × [ Σ_{w∈S1} maxSim(w, S2) · IDF(w) / Σ_{w∈S1} IDF(w)
                       + Σ_{w∈S2} maxSim(w, S1) · IDF(w) / Σ_{w∈S2} IDF(w) ],

where maxSim(w, S2) is the score of the most similar word in S2 to w, and IDF(w) is the inverse document frequency of word w.

The final output of MI is a numeric value between 0 and 1, where 0 indicates no semantic overlap and numbers approaching 1 indicate identical meaning of the two sentences. These numbers serve as the input to our simulating algorithm for constructing artificial responses. Note that the information in the MI values is entirely lexical and syntactic. It contains no knowledge about surveys, leadership, or respondent behavior.

The MLQ has 45 items. This yields (45 × (45 − 1))/2 = 990 unique item pairs, for which we obtain MI values.

One special problem concerns the direction of signs. In the MLQ, 264 of the 990 pairs of items are negatively correlated. Theory suggests that two scales, Laissez-faire and Passive Management by Exception, are likely to relate negatively to effective leadership. The problem has been treated extensively elsewhere (Arnulf et al., 2014), so we will only offer a brief explanation here. MI does not take negative values, and it does not differentiate well between positive and negative statements about the same content. For two items describing (a) how a manager is unapproachable when called for and (b) how the same person uses appropriate methods of leadership, the surveyed responses correlate at −.42 in the present sample, while the MI value is .38. The chosen solution is to allow MI values to be negative for all pairs of items from Laissez-faire and Passive Management by Exception (correctly identifying 255 of the 264 negative correlations, p < .001).

Semantics and Survey Correlations

STSR argues that there is an isomorphic relationship between the preadministration semantic properties (the MI values) and the postadministration survey correlations. This means that the two sets of numbers contain the same information, representing the same facts albeit in different ways: Correlations represent different degrees of systematic covariation, whereas semantics represent different degrees of overlap in meaning.

Correlations express the likelihood that the variation in Item B depends on the variation in Item A. A high correlation between the two implies that if someone scores high on Item A, this person is more likely to score high on Item B also. A correlation approaching 0 means that we cannot know from the response to Item A how the respondent will score Item B. In other words, the uncertainty in predicting the value of B increases with decreasing correlations until 0, after which certainty increases again for predictions in the opposite direction.

The semantic values can be read in a similar way: If the MI score of Items A and B is high, they are likely to overlap in meaning. A person who agrees with Item A is likely to agree with Item B as well. However, as the MI values are reduced, we can no longer make precise guesses about how the respondent will perceive Item B.

In both cases, low values translate into increasing uncertainty. In Likert-type scale data, the response values are restricted to integers in a fixed range, for example, between 1 and 5. Low correlations and low MI values indicate that the response to Item B can be any of the five values in the scale. Higher correlations and MI values reduce uncertainty and restrict the likely variation of responses to B.
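The sentence-similarity computation described under Estimating Item Semantics can be sketched as follows. This is a simplified illustration of the SimMI formula, not the published implementation: `word_sim` and `idf` are hypothetical stand-ins for the WordNet-based word-similarity metrics and the British National Corpus IDF weights.

```python
# Sketch of the MI sentence similarity (Mihalcea et al., 2006) as described
# above: for each word, take its best match in the other sentence, weight by
# IDF, normalize per sentence, and average the two directions.

def max_sim(word, sentence, word_sim):
    """Highest word-to-word similarity between `word` and any word in `sentence`."""
    return max(word_sim(word, w) for w in sentence)

def sim_mi(s1, s2, word_sim, idf):
    """Symmetric, IDF-weighted sentence similarity in the range 0..1."""
    def directional(a, b):
        num = sum(max_sim(w, b, word_sim) * idf(w) for w in a)
        den = sum(idf(w) for w in a)
        return num / den
    return 0.5 * (directional(s1, s2) + directional(s2, s1))

# Toy check: identical sentences score 1.0 under any IDF weighting when the
# word-similarity function scores exact matches as 1.
s = ["i", "like", "working", "here"]
print(sim_mi(s, s, word_sim=lambda a, b: 1.0 if a == b else 0.0, idf=lambda w: 1.0))
```

With real WordNet-based metrics, near-synonymous sentences score high without sharing surface words; that is what makes the measure usable on differently worded survey items.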
As these values increase, the expected uncertainty is reduced to a point where the score on Item B is likely to be identical to the score on Item A.

If this is true, then both the MI scores and the real response correlations should be negatively related to two aspects of the surveyed data: the average distance between Item A and Item B, and the variance in this distance. A low correlation or a low MI value should indicate that the range of expected values of Item B increases. We explore this in Table 1, which supports the proposition. MI values and empirically surveyed correlations are strongly, negatively, and about equally related to the score differences. In other words, correlations and MI values express the same information about the uncertainty of scores between two survey items. The signs are opposite, because higher MI scores indicate lower differences between the scores of two items.

Table 1. Correlations Between Average Score Differences, Standard Deviations of Score Differences, Magnitude of Surveyed Correlations, and MI Scores.

                                           Survey correlations,   Average score difference,   SD of score
                                           magnitude              Item A − Item B             differences
Average score difference, Item A − Item B  −.94**
SD of score differences                    −.08*                  .10**
MI scores                                  .88**                  −.79**                      −.07*

Note. N for the surveyed sample was 153; N for the sample of differences and correlations was 990.
*Correlation is significant at the .05 level (two-tailed). **Correlation is significant at the .01 level (two-tailed).

Table 2. Hierarchical Regression Where MI Values (Step 1) and Survey Correlations (Step 2) Were Regressed on the Average Score Differences (N = 990).

                        Step 1      Step 2
MI values               −.79**      −.14**
Survey correlations     —           −1.07**
R²                      .63         .89
F                       1,683.58    3,981.42

**p < .01.
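A standardized beta like the one in Step 1 of Table 2 can be obtained as sketched below. The MI values and average distances here are invented toy numbers, and with a single predictor the standardized beta simply equals the Pearson correlation.

```python
# Sketch of the training step: regress average score differences on MI
# values and keep the standardized beta (about -0.79 in the paper).
# All numbers below are made up for illustration.

import statistics

def standardized_beta(x, y):
    """Standardized coefficient for regressing y on a single predictor x."""
    zx = [(v - statistics.mean(x)) / statistics.pstdev(x) for v in x]
    zy = [(v - statistics.mean(y)) / statistics.pstdev(y) for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

mi_values = [0.9, 0.7, 0.5, 0.3, 0.1]       # hypothetical item-pair MI scores
avg_distances = [0.3, 0.4, 1.5, 1.2, 2.0]   # hypothetical mean |A - B| differences
beta = standardized_beta(mi_values, avg_distances)
print(round(beta, 2))  # strongly negative, as higher MI means smaller distances
```

The captured beta is then reused as a fixed coefficient when simulating responses, which is the sense in which the real data "train" the simulation.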
This provides a key to how MI values can allow us to estimate the value of a response to B if we know the response to A. MI scores can be translated into score distances because they are systematically related to the differences. By regressing the MI values on the score differences, the resulting standardized beta can be used to estimate the distance from A to B, given that we know A. Table 2 shows this regression. It displays a hierarchical model that enters the preadministration MI values in the first step. By also entering the postadministration correlations in the second step, we supply additional support for the claim that these two sets of scores indeed contain the same information.

After entering the original surveyed correlations in Step 2, the beta for the MI values is substantially reduced, indicating that the information contained in the MI values is indeed isomorphic to the information in the survey correlations. The same table also shows that the information in the MI values is slightly inferior to that of the correlations. This is to be expected, as the correlations and the standard deviations stem from the same source, while the MI algorithm is only one imperfect algorithm among several available choices. It has been shown elsewhere that it will usually take the output of several present-day algorithms to approximate the semantic parsing of natural human speakers (Arnulf et al., 2014), but improved algorithms may alleviate these problems in the future. Most importantly, we can use the beta of the first step to estimate a specific item response from knowledge about the MI value. In other words, we are training our respondent simulation algorithm using the regression equation above, capturing the beta as the key to further computations.

Simulating Responses

Based on the considerations above, it is possible to hypothesize that a given respondent's responses are not free to vary. Once the respondent has chosen a response to the initial items, the subsequent responses should be determined by the semantic relationships of the items (Arnulf et al., 2014; Nimon et al., 2016) and the structure of the survey, most notably the response categories (Maul, 2017; Slaney, 2017) and the unfolding patterns following from expected negative correlations (Michell, 1994; Roysamb & Strype, 2002; van Schuur & Kiers, 1994).

Ideally, it should be possible to predict any given response based on knowledge of the semantic matrix and a minimum of initial responses. In our simulations, we can see that any response in the MLQ is predictable by using other known responses and knowledge about the distances between items. The R²s of these predictions are in the range of .86 to .94. As the semantic MI values correlate at −.79 with the distances and predict them significantly (R² = .63), it should theoretically be possible to substitute the distances with the semantic values, and thus predict later responses from a minimum of initial responses.

The perfect formula is yet to be found, but we have created a preliminary algorithm that can possibly mimic real responses to the MLQ. The present approach explicitly aims at reproducing existing responses, as this gives us the best opportunity to compare simulated with real responses. The rationale for the algorithm combines semantics and unfolding theory as follows:
The difference between any the 5-point Likert-type scale, such that two items, A and B, within this Likert-type scale is here referred to as the “distance” between A and B; ValueItem B = 6 - Value Item A () () for example, if A is 5 and B is 4, the distance between + MI for Item A and Item B x - 0.79. () them is 1 (5 − 4). 2. In the case of high MI values, Item B is likely to be 7. In this way, it is possible to start with Item 1, and use very close to its preceding item, A. Lower MI values the MI values to calculate the relationship of Item 1 indicate higher and less determinate distances. to Items 2, 3, and so on until Item 45. This process is 3. The most probable absolute distance between Item A repeated for Item 2 to all Items 3 to 45 and so on, and Item B is calculated as the MI value for A and B until all values have been calculated for all 990 multiplied by the standardized beta in the regression unique pairs of items. equation of Table 2 (–0.79). To predict a given dis- 8. To simulate missing responses, we can now delete tance from this type of regression equation, the for- the original responses and replace them with those mula should be as follows: Value (Item B) = Constant computed in Step 7 above. + (MI for Item A and Item B) x – 0.79. However, the 9. One final requirement is theoretically and practically distances were computed as absolute measures; that important. As mentioned, the MI values and correla- is, the absolute distance from 3 to 5 = 2, but so is 5 to tions are not really distance measures, but a measure 3. In practice, though, the algorithm may need to pre- of uncertainty, which in cases of low MI values dict a high number from a low number or vice versa. should be indeterminate. The formula used here The constant will therefore not “anchor” the distance instead applies the beta from the regression equation at the right point in the scale. as a measure of distance. However, uncertain values 4. 
We therefore need to tie the estimated point to the are in turn restricted by having closer relationships to value of Item A. We have tested several approaches other items. The whole matrix of 990 unique pairs of to this, and the formula that seems to work best for items is comparable with a huge Sudoku puzzle calculating any response B is to simply replace the where each item score is defined by its relationship to constant with the value for Item A, thus Value(Item 44 other items. We can use this to smooth out the B) = Value(Item A) + (MI for Item A and Item B) x simulated values for each item by averaging all the − 0.79. 44 estimated values resulting from each of its 44 5. This formula does impose the structure of semantic relationships. values on the subsequent numbers. It also seems counterintuitive because if MI increases (indicating In this way, our algorithm is based on the complete pattern of higher similarity), the term will grow in absolute semantic distances for every item with all other items, as numbers. However, the beta is negative, and the well as a hypothesis on the direction of scale unfolding based resulting number will be smaller. The impact on the on the initial three responses. It is admittedly explorative and ensuing calculations now comes from the unfolding based on an incomplete understanding of the issues involved, operations, depending on whether Response B is and our intention is to invite criticism and improvements higher or lower than A. To comply with predictions from others. One questionable feature of this algorithm is the from unfolding theory, the formula above keeps its tendency for positive evaluations to escalate positively and positive form if the respondent’s first three responses vice versa, probably due to a deficiency of the formula in indicate a positive evaluation (biasing the item dis- Step 4. In the course of all 990 iterations however, tances in a positive direction) but should be negative Arnulf et al. 
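A minimal sketch of the rules in Steps 4, 6, and 9 is given below, assuming a dictionary of (signed) MI values for item pairs. All names and numbers are ours, invented for illustration, not the study's implementation.

```python
# Sketch of the response-simulation rule: estimate a target item from each
# known response (Steps 4 and 6), then average the estimates (Step 9,
# the "Sudoku" smoothing across an item's relationships).

BETA = -0.79  # standardized beta from Step 1 of the Table 2 regression

def predict_from(known_value, mi_value, positive_unfolding=True):
    """Estimate Item B from one known response to Item A."""
    if positive_unfolding:
        estimate = known_value + mi_value * BETA          # Step 4
    else:
        estimate = 6 - known_value + mi_value * BETA      # Step 6: reversed
    return min(5.0, max(1.0, estimate))                   # stay on the 1-5 scale

def simulate(target_item, known, mi, positive_unfolding=True):
    """Average the estimates from all known responses (Step 9)."""
    estimates = [predict_from(value, mi[(item, target_item)], positive_unfolding)
                 for item, value in known.items()]
    return sum(estimates) / len(estimates)

# Toy run: two known responses and made-up MI values for pairs (1,3) and (2,3).
mi = {(1, 3): 0.8, (2, 3): 0.4}
simulated = simulate(3, {1: 4, 2: 5}, mi)
print(round(simulated, 2))
```

In a full run, the 44 relationships of every MLQ item would feed this averaging, which is what damps the escalation tendency noted above.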
7 these tendencies seem to balance each other out, and fix the 1. Scale reliability: The simulated scores should have averaged responses as dictated by the mutual pattern of acceptable reliability scores (Cronbach’s alpha), semantic distances. We have also checked that this formula preferably similar to the real scores. performs better than simply using averages of the known val- 2. Accumulated scores: A simulated survey response ues instead of semantics, thus substantiating the use of should yield summated scale values similar to the semantics in the formula. A further contrasting procedure ones of the surveyed population. Ideally, the average will be described below. scores on simulated leadership scales should be non- The MLQ has 45 items. Of these, 36 measure different significantly different from the average summated types of leadership behaviors, and the nine last items mea- scores of real survey scores. The average, summated sure how well the rated person’s work group does, com- simulated scores should also be significantly differ- monly treated as “outcome” variables. The Arnulf et al. ent from the other scales (differential reliability). (2014) study found the “outcome” variables to be deter- 3. Pattern similarity: The simulated survey scores mined by the responses to the preceding items. We will should not only show similar magnitude, but the pat- therefore start by trying to predict the individual cases of tern of simulated scores should also correlate signifi- these by deleting them from real response sets. By deleting cantly with the real individual score profiles. In progressive numbers of items, we will then explore how particular, there should be few or no negative correla- well the semantics will perform to predict the missing tions between real and simulated score profiles in a responses. sample of simulated protocols. Therefore, our first simulated step will be concerned with 4. 
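As a concrete illustration, the anchoring, unfolding, and smoothing steps described above can be sketched in code. This is only a minimal sketch of the logic as we describe it; the function and variable names are our own, and the clamping of estimates to the 1-to-5 response range is an assumption not spelled out in the text.

```python
def predict_response(value_a, mi_ab, positive_unfolding=True, beta=-0.79):
    """Estimate Item B from a known Item A (Steps 4 and 6).

    The anchor is Item A's value, or its mirror image around the midpoint
    of the 5-point scale (6 - value) when the unfolding is negative; the
    MI value, weighted by the regression beta, then shifts the estimate.
    """
    anchor = value_a if positive_unfolding else 6 - value_a
    estimate = anchor + mi_ab * beta
    return min(5.0, max(1.0, estimate))  # assumed: keep estimates on the 1-5 scale

def smooth(estimates):
    """Step 9: average the up-to-44 estimates a target item receives
    from its relationships to every other item (the 'Sudoku' constraint)."""
    return sum(estimates) / len(estimates)

def simulate_item(known_responses, mi_with_target, positive_unfolding=True):
    """Simulate one deleted item from all still-known responses.

    known_responses: {item_index: observed score on the 1-5 scale}
    mi_with_target: {item_index: MI between that item and the target item}
    """
    estimates = [predict_response(v, mi_with_target[i], positive_unfolding)
                 for i, v in known_responses.items()]
    return smooth(estimates)
```

For example, with two known items scored 4 and 2 and MI values of .5 and .0 with the target, the target is estimated as the average of 4 + .5 × −0.79 = 3.605 and 2, roughly 2.8.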
Therefore, our first simulated step will be concerned with predicting outcomes, training the algorithm on the first 36 items. In the next steps, we simply subtract the remaining half of the survey until all real responses are deleted, offering the algorithm diminishing amounts of training information. In this way, we can evaluate the degree to which the computed values still bear resemblance to the original values.

Contrast validation procedure. Algorithms like this may create artificial structures that are not due to the semantic MI values but simply artifacts created by the algorithm procedures themselves. To control for this, we have created similar sets of responses with the same numbers of missing values, where the MI values in the algorithm are replaced by randomly generated values in the same range as the MI values (from −1 to +1). If similarities between artificial and real responses are created by biases in the algorithmic procedure and not by semantics, the output of randomly generated numbers should also be able to reproduce numbers resembling the original scores. The difference between the output of random and semantically created numbers expresses the value of (present-day) semantics in predicting real responses.

Simulation Criteria

There are no previously tested criteria for assessing the quality of simulated survey responses compared with real ones. Survey data are generally used either as summated scores to indicate the respondents' attitude toward the survey topic (score level or attitude strength) or as input to statistical modeling techniques such as structural equation modeling (SEM). In addition, survey data are often scrutinized by statistical methods to check their properties prior to such modeling (Nunnally & Bernstein, 2010). Therefore, we propose the following common parameters to evaluate the resemblance of the artificial responses to the real ones:

1. Scale reliability: The simulated scores should have acceptable reliability scores (Cronbach's alpha), preferably similar to the real scores.
2. Accumulated scores: A simulated survey response should yield summated scale values similar to the ones of the surveyed population. Ideally, the average scores on simulated leadership scales should be nonsignificantly different from the average summated scores of real survey scores. The average, summated simulated scores should also be significantly different from the other scales (differential reliability).
3. Pattern similarity: The simulated survey scores should not only show similar magnitude, but the pattern of simulated scores should also correlate significantly with the real individual score profiles. In particular, there should be few or no negative correlations between real and simulated score profiles in a sample of simulated protocols.
4. Sample correlation matrix: The simulated scores should yield a correlation matrix similar to the one obtained from real survey scores.
5. Factor structure: The factor structure of simulated responses should bear resemblance to the factor structure emerging from the real sample.
6. Unfolding structure: Seen from the perspective of unfolding theory, extreme score responses are easier to understand than midlevel responses. In an extreme score, a positive respondent will have a general tendency to reject negative statements and endorse high positive scores, and a negative respondent will rank items in the opposite direction. Midlevel items across a complex scale would require more complex evaluations of how to "fold" each single item so as to stay with the dominant unfolding path (Michell, 1994). This is a tougher task for both respondents and the simulating algorithm. We therefore want to check if our algorithm is more appropriate for high and low than for medium scores.

SAGE Open

Results

Table 3 shows the alpha values for all MLQ scales. Values for the real responses are in the first column. Computations are made for increasing numbers of missing values to the right. It can be seen that the alphas for simulated responses are generally better than those for the real responses (the alphas for simulated responses are lower for the simulated values in only six of 40 cases). The alphas generated from random semantic responses are inadequate and keep deteriorating as items are replaced by simulated responses.
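The random-semantics control condition replaces each MI value with a uniform draw from the same −1 to +1 range while leaving every other step of the algorithm unchanged. A minimal sketch of such a control matrix (function name and the fixed seed are ours, for illustration):

```python
import random

def random_mi_matrix(n_items, seed=42):
    """Symmetric matrix of fake 'MI' values drawn uniformly from [-1, 1],
    one draw per unique item pair, zero on the diagonal."""
    rng = random.Random(seed)
    matrix = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(i + 1, n_items):
            matrix[i][j] = matrix[j][i] = rng.uniform(-1.0, 1.0)  # one draw per pair
    return matrix
```

For the 45-item MLQ this fills 45 × 44 / 2 = 990 unique pairs, mirroring the real semantic matrix in size and range but not in meaning.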
Table 3. Cronbach's Alpha for All MLQ Scales, Real and Simulated Responses.

Scale | Real | Outcome items missing | 21 (46%) missing | 33 (73%) missing | 33 random semantics | 39 (86%) missing | 39 random semantics | 42 (95%) missing | 42 random semantics | 100% synthetic
Idealized influence attr. | .74 | .77 | .82 | .88 | −.10 | .99 | .13 | 1.00 | −.15 | .99
Idealized influence beh. | .72 | .72 | .72 | .90 | −.07 | .92 | −.04 | .99 | −.06 | .99
Inspiring motivation | .80 | .80 | .82 | .91 | .09 | .99 | −.12 | 1.00 | −.05 | .99
Intellectual stimulation | .83 | .82 | .84 | .85 | .45 | .91 | −.20 | .93 | −.11 | .76
Individualized consider. | .78 | .78 | .82 | .99 | −.22 | 1.00 | .16 | 1.00 | −.06 | .99
Conditional reward | .73 | .73 | .79 | .90 | .42 | .99 | .10 | 1.00 | −.20 | .99
Mgmnt by exception act. | .51 | .52 | .43 | .72 | .00 | .77 | .13 | .97 | −.27 | .95
Mgmnt by exception pas. | .47 | .47 | .47 | .76 | .38 | .82 | −.09 | .83 | −.06 | .83
Laissez-faire | .77 | .77 | .75 | .78 | .33 | .84 | −.03 | .99 | −.07 | .97
Outcome measures | .92 | 1.00 | 1.00 | 1.00 | .18 | 1.00 | −.02 | 1.00 | .07 | 1.00

Note. MLQ = Multifactor Leadership Questionnaire.

Table 4 shows the mean summated scores for each of the MLQ subscales in the sample. When the nine outcome measures are missing (replaced by simulated scores), their simulated scale is nonsignificantly different from the original. When 21 item scores are missing (46% missing), there are only two instances of significant scale differences.

Table 4. Means for Subscales by Simulated Populations.

Subscale | Real | Outcome items missing | 21 (46%) missing | 33 (73%) missing | 33 random semantics | 39 (86%) missing | 39 random semantics | 42 (95%) missing | 42 random semantics
IdealizedAttrib | 3.43 | 3.42 | 3.39 | 3.58 | 3.03 | 3.79 | 3.02 | 3.87 | 3.00
IdealizedBehv | 3.94 | 3.95 | 3.84 | 3.78 | 3.23 | 3.86 | 3.22 | 3.83 | 2.98
InspMotive | 3.83 | 3.84 | 3.78 | 3.77 | 3.23 | 3.78 | 3.00 | 3.86 | 2.99
IntellStim | 3.28 | 3.28 | 3.44 | 3.55 | 3.14 | 3.63 | 3.06 | 3.69 | 3.06
IndConsid | 3.59 | 3.59 | 3.59 | 3.73 | 3.01 | 3.84 | 3.00 | 3.90 | 3.02
CondReward | 3.79 | 3.79 | 3.71 | 3.80 | 3.44 | 3.84 | 3.27 | 3.90 | 3.23
MBEact | 3.06 | 3.08 | 3.11 | 3.63 | 3.06 | 3.70 | 3.06 | 3.78 | 2.97
MBEpass | 2.63 | 2.62 | 2.62 | 2.38 | 2.73 | 2.39 | 2.98 | 2.33 | 2.98
LaissFaire | 2.37 | 2.37 | 2.43 | 2.32 | 2.71 | 2.28 | 2.85 | 2.22 | 3.01
Outcome | 3.53 | 3.59 | 3.69 | 3.85 | 3.00 | 3.91 | 3.00 | 3.94 | 2.99
Average difference from real | — | .01 | .07 | .20 | .38 | .25 | .47 | .30 | .52

Note. Bold types: Not significantly different from their real human counterparts, p < .05.
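The reliability criterion behind Table 3 is Cronbach's alpha; a minimal, dependency-free implementation (naming is ours) for checking simulated against real scales:

```python
def cronbach_alpha(rows):
    """Cronbach's alpha for one scale.

    rows: one list of item scores per respondent.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Two perfectly parallel items yield alpha = 1.0, which is why heavily smoothed simulated scales drift toward the near-1.00 alphas seen in Table 3.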
When 33 or 39 items are missing, the number of significant differences increases, but the average differences from the real scores are very small: 0.08 Likert-type scale points even for the 33 missing items, and 0.18 points in difference where 39 items (86% of the responses) are missing and replaced by simulated scores. Most of the scales are also still significantly different from each other, such that no scale measuring transformational leadership overlaps with Laissez-Faire, Passive Management by Exception, or outcome variable scores. There is a tendency for some of the differences between the scales within the transformational leadership construct to overlap with increasing numbers of simulated items. When all these scores are summed up in their purported higher level constructs—transformational, transactional, laissez-faire leadership and outcomes—this pattern of average scores is maintained. Scores computed with random semantics depart quicker and more dramatically from their real counterparts, see Table 5.

Every individual's simulated responses were correlated with their real counterparts to compare the pattern of real versus simulated responses. Table 6 shows how these correlations were distributed in the various simulated groups. As could be expected, there is a decline in the resemblance between the simulated scores and their real duals as the number of simulated scores increases. However, this decline happens much faster for the scores generated by random patterns, and when 42 items are replaced with simulated scores, there are still only eight cases (5%) that correlate negatively with the real respondents, see Figure 1.
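The pattern-similarity check just described—one correlation per respondent between the real and the simulated item profile—can be sketched as follows; the function names are ours, and plain Pearson correlation is assumed.

```python
def pearson(xs, ys):
    """Pearson correlation between two equally long score profiles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

def profile_correlations(real_rows, simulated_rows):
    """One correlation per respondent between real and simulated
    item responses -- the quantity summarized in Table 6."""
    return [pearson(r, s) for r, s in zip(real_rows, simulated_rows)]
```

Counting how many of these per-respondent correlations fall below zero gives the "negative correlations" column of Table 6.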
Table 5. Means for Main Constructs by Simulated Populations.

Main constructs | Real | Outcome items missing | 21 (46%) missing | 33 (73%) missing | 33 random semantics | 39 (86%) missing | 39 random semantics | 42 (95%) missing | 42 random semantics
Transformational | 3.62 | 3.62 | 3.61 | 3.68 | 3.13 | 3.78 | 3.06 | 3.83 | 3.01
Transactional | 3.16 | 3.16 | 3.15 | 3.27 | 3.07 | 3.31 | 3.10 | 3.34 | 3.06
Laissez-faire | 2.37 | 2.37 | 2.43 | 2.32 | 2.71 | 2.28 | 2.85 | 2.22 | 3.01
Outcomes | 3.53 | 3.59 | 3.69 | 3.85 | 3.00 | 3.91 | 3.00 | 3.94 | 2.99
Average difference from real | — | .02 | .06 | .14 | .36 | .20 | .41 | .24 | .47

Note. Bold types: Not significantly different from their real human counterparts, p < .05.

Table 6. Characteristics of the Average Correlations Between Real and Simulated Respondents by Number of Simulated Item Responses.

Condition | No. of negative correlations | Minimum correlation | Maximum correlation | Mean correlation | SD
Outcome items (nine) missing | 0 | .79 | 1.00 | .94 | .05
21 items missing | 0 | .35 | 1.00 | .83 | .10
33 items missing | 0 | .06 | .91 | .61 | .18
33 items random semantics | 0 | .11 | .81 | .50 | .11
39 items missing | 2 | −.24 | .87 | .34 | .31
39 items random semantics | 2 | −.08 | .57 | .31 | .11
42 items missing | 8 | −.62 | .88 | .44 | .29
42 items random semantics | 22 | −.26 | .42 | .14 | .13

We explored how the relationships among the subscales of the MLQ changed with increasing numbers of missing items. An interesting difference appeared between the values replaced by the semantically informed algorithm and the algorithm with random semantic values: With increasing numbers of simulated values, the correlations among the subscales tended to increase for the semantically informed simulations. Where the semantic predictions were replaced by random numbers (leaving only the pattern of the algorithm itself, void of semantics), the correlations among the subscales decreased, approaching 0 where 39 of 45 responses were simulated, see Figure 2.
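Summated scale scores such as those in Tables 4 and 5 are simple item averages within each subscale; a minimal sketch (the subscale-to-item mapping shown is invented for illustration and is not the real MLQ scoring key):

```python
def subscale_means(responses, subscales):
    """Average one respondent's item scores within each named subscale.

    responses: {item_index: score on the 1-5 scale}
    subscales: {subscale_name: [item indices belonging to it]}
    """
    return {name: sum(responses[i] for i in items) / len(items)
            for name, items in subscales.items()}

# Hypothetical mapping for illustration only.
example_scales = {"Transformational": [0, 1, 2], "Laissez-faire": [3, 4]}
scores = subscale_means({0: 4, 1: 5, 2: 3, 3: 2, 4: 1}, example_scales)
```

Correlating such scale scores across respondents, separately for real, semantically simulated, and random-control samples, gives the interscale correlations plotted in Figure 2.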
We then performed a principal components analysis (PCA) on these samples to compare their ensuing patterns. The MLQ has been criticized for its messy factor structure over the years, as some people find support for it and others do not (Avolio et al., 1995; Schriesheim, Wu, & Scandura, 2009; Tejeda, Scandura, & Pillai, 2001). In our sample here (N = 153), there emerged eight or nine factors, but the rotated factors were not clearly delineated and did not fully support the theorized structure of the survey. However, we are here not concerned with the structure of the MLQ itself but with the similarity of the real and simulated measures. Table 7 shows that as an increasing number of items are replaced by semantically simulated ones, there is a gradual reduction in the number of factors identified. This is completely opposite from what happens where scores are computed with random input to the algorithm. In these cases, there is a proliferation of eigenvalues increasing with the numbers of simulated variables. The numbers of factors indicated by scree plots are displayed in brackets as these may be just as interesting as the factors identified by eigenvalues (see Figure 3). The MI values seem to impose a simplified structure on the data in PCA reminiscent of factor structures, and rotational procedures did not change the emerging patterns. The two factors emerging from the purely synthetic condition seem to be an artifact of the algorithm because it needs two (randomly chosen) initial values to get started.

We finally checked whether the score levels could affect the similarity between simulated and real responses. As we were expecting, higher scores of both transformational leadership and laissez-faire (and, by implication, the outcome values) were all related to higher correlations between the real response and its simulated duplicate. This tendency was increasing for a higher number of simulated scores but absent in responses computed in the random control condition, see Table 8.
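The eigenvalue-greater-than-one count used for Table 7 can be reproduced with a few lines of linear algebra; a sketch assuming NumPy is available (the function name is ours):

```python
import numpy as np

def kaiser_factor_count(data):
    """Number of principal components with eigenvalue > 1, taken from the
    correlation matrix of an (n_respondents x n_items) score array."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)
    return int((eigenvalues > 1.0).sum())
```

For instance, four items forming two perfectly correlated pairs produce two components with eigenvalue 2, so the count is 2—the kind of simplified structure the simulated samples drift toward.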
Figure 1. The frequency distribution of correlations between real and simulated responses for the simulated populations, replacing 42 of 45 item responses with simulated scores.

Figure 2. Absolute interscale correlations by simulated sample.

Discussion of Study 1

Summing up our findings, the following descriptions seem supported:

Outcome measures: When the outcome measures were substituted with simulated measures, these were virtually nondistinguishable from the real measures. This implies that the purported outcome variables are not independent and empirical but determined directly by the semantic relationships to the previous survey items. The simulated outcome levels were nondistinguishable from the real ones even when 39 of 45 items were replaced by simulated items.

Reliability: The reliability levels of scales in the simulated responses were comparable with and in most cases better than the real responses. With increasing numbers of items substituted by simulated items, the alpha values increased. Responses computed with random semantic figures presented deteriorating alphas. This supports our claim that the psychometric structures are caused by the semantic patterns and are not an artifact of the algorithm.

Table 7. Number of Factors With Eigenvalue >1 Extracted in Principal Components Analysis, Real and Simulated Samples (Factors Indicated by Scree Plots in Brackets).

Sample | Real | Outcome items missing | 21 items missing | 33 items missing | 33 random semantics | 39 items missing | 39 random semantics | 42 items missing | 42 random semantics | Synthetic
Computed on all 45 items | 9 (4) | 8 (4) | 6 | 4 | 19 | 3 (5) | 18 | 2 (3) | 30 | 2 (3)
Computed without outcome items | 8 (4) | 8 (4) | 6 | 4 | 16 | 3 (6) | 15 | 2 (3) | 16 | 2 (3)

Figure 3. Principal components scree plots, one real and three simulated samples (39 items missing, 39 items replaced with random semantics, one completely synthetic sample).
applied here, six real item responses (of 45 scale items) When the computed composite scores started deviating in are enough to predict the level of transformational leader- a statistically significant way from the real score levels, ship and laissez-faire scale scores precisely. Twelve items the differences were still quite small, and with the excep- allow a fairly precise calculation of the summated level of tion of the scale Passive Management By Exception, they 12 SAGE Open Table 8. The Relationship Between Magnitude of Correlation Between Subscale Score Levels, and the Relationships Between Real and Simulated Response by Number of Simulated Items. Outcome 33 missing 39 missing, 42 missing, (nine) items 21 items 33 items random 39 items random 42 item random MLQ subscale missing missing missing semantics missing semantics missing semantics Transform. .46** .50** .50** −.05 .27** −.02 .59** .02 Transact. .15 .12 .23** −.13 .14 −.13 .54** .04 Laissez-faire −.36** −.53** −.51** −.23** −.28** −.19* −.60** .02 Outcomes .45** .43** .41** .00 .26** −.06 .57** .12 Note. MLQ = Multifactor Leadership Questionnaire. *p < .05 level (two-tailed). **p < .01 level (two-tailed). were always closer to the real ones than to the randomly as clearer than the real sample did. Random responses generated scores. developed in the opposite direction and quickly began Pattern similarity: The simulated survey responses were generating extra factors proliferating upward to 15 to 30 correlating highly with their real origins, and there were factors. almost no cases where these correlations took negative Unfolding structure: As we expected, the simulator was values. That is interesting, given Michell’s (1994) find- most accurate in recreating response patterns at the ings that only a few percentage of survey respondents will extreme score level; that is, respondents who were very respond in a way that violates the semantic structure of negative or very positive toward their managers. 
Correlation matrices: For the sake of brevity, we compared only the correlation matrices of the accumulated subscales, substituting real scores for samples with increasing numbers of simulated responses. This comparison is probably the one where the simulated scores did not perform so well. The correlations among the scales were increasing with increasing numbers of simulated responses. This finding is, however, mixed in terms of STSR relevance: While our algorithm seems to be less sensitive to differential information with more simulated items, the correlations will tend to increase in magnitude. This means that, all else being equal, semantic information is a powerful source of correlations in survey data. This was evident in comparison with the correlation matrices generated from random values, which were approaching 0 as more responses were replaced by simulated ones.

Factor structure: As with the correlation matrices (and related to this matter), the factor structures of the data samples were increasingly simple with more semantics based on simulated scores, ending with a two-factor model when all but three items were computed (95% of the items replaced). The MLQ may not be a good testing ground for factor structures, as it was itself quite messy in the small random sample we used here. Still, the sample using simulated outcome scores identified the outcomes as clearer than the real sample did. Random responses developed in the opposite direction and quickly began generating extra factors, proliferating upward to 15 to 30 factors.

Unfolding structure: As we expected, the simulator was most accurate in recreating response patterns at the extreme score levels; that is, respondents who were very negative or very positive toward their managers. Intermediate levels were harder to simulate exactly, and the scale "Active management by exception" seems in all explorations to offer the least precisely estimated scores by our algorithm. This difficulty handling the "lukewarm" scores is expected from unfolding theory (Andrich, 1996; Coombs, 1964; Coombs & Kao, 1960; Michell, 1994; Roberts, 2008) because such intermediate response patterns give rise to more complex folding of scales.

Study 2

Measures

The scale subjected to simulation of scores here is a composite of three scales frequently used in OB research: two published scales measuring perceptions of economic and social exchange, comprising eight and seven items, respectively (Shore, Tetrick, Lynch, & Barksdale, 2006), and one scale measuring intrinsic motivation comprising five items (Kuvaas, 2006). These scales were chosen because they originate from different researchers and have not been part of a coherent instrument. They are also shorter and offer fewer complexities than the MLQ. These scales displayed semantic predictability in the previous study on STSR (Arnulf et al., 2014).

Sample

A randomly chosen sample of 100 employees from a Norwegian governmental research organization was used to train and validate the algorithm. About 72% of the respondents were male, and the majority of respondents were holding university degrees at bachelor level or higher.

Analytical Procedures

We used the MI algorithm to compute semantic similarities between all 20 items. This yields a matrix of 20 × 19 / 2 = 190 unique item pairs. The problem of negatives was solved as described in the case of the MLQ, as the scale measuring economic exchange can be shown a priori to be negatively correlated with the other two (see Arnulf et al., 2014). Also, one item measuring social exchange is originally reversed, and kept that way to conform with the theoretical handling of negatives.

The semantic indices from the MI algorithm predicted the sample correlation matrix significantly with an adjusted R of .52. As in the study above, this relationship was even stronger with the interitem distances (the average distance in scores between Item A and Item B), reaching an adjusted R of .81. To train the predicting algorithms, we kept the constant (1.342) and unstandardized beta (−.907) from the latter regression analysis.

Individual response patterns were predicted by applying the algorithm developed in Study 1. We replaced the sample constant and unstandardized beta with the values from this sample, and tested this version first:

For predicted positive correlations,
Value(Item B) = Value(Item A) + (MI for Item A and Item B) × −.907.

For predicted negative correlations,
Value(Item B) = 6 − Value(Item A) + (MI for Item A and Item B) × −.907.

The resulting numbers were promising but did not seem totally satisfactory, possibly due to unfolding problems. Whereas the MLQ is composed of highly heterogeneous subscales distributed in a mixed sequence, the Study 2 scales are very homogeneous and distributed one by one. It is hard to find an a priori rule for the unfolding of the combined scale. However, the unstandardized beta is −.907, which is almost −1, and so plays a small role when multiplied with other values except changing the sign. We first removed the sign to check the effect on unfolding, but results were equally promising but unsatisfying. We then decided to remove the beta and replace it with the constant for the item differences instead (1.342) plus the semantic MI value. This provided a better approximation of the scores:

For predicted positive correlations,
Value(Item B) = Value(Item A) + (MI for Item A and Item B) + 1.342.

For predicted negative correlations,
Value(Item B) = 6 − Value(Item A) + (MI for Item A and Item B) + 1.342.

We then proceeded to explore if the responses simulated from semantic values predict their "real" counterparts better than random values in the same range (control condition).

Results

The results will be reported summarily along the same lines as in Study 1:

Summated scale levels: Figure 4 shows the average accumulated scores for three test samples. The patterns of the semantically simulated scores are similar to the real sample, but the average score on intrinsic motivation is somewhat low (albeit significantly higher than the score for social exchange). Adjusting the unfolding pattern in the algorithm could possibly alleviate this. Importantly, the pattern seemed driven by the semantic values, as the random values tend to wipe out the pattern and the average scores become similar.

Pattern similarity: The semantically simulated test responses correlated on average .56 with the originals. The highest correlation was .89 and the lowest was −.37, but only two of the 100 simulated responses actually correlated negatively with their real counterparts. The simulations using random semantics yielded an average correlation of .10, with 30% negative correlations.

Reliability: The simulated responses yielded an α of 1.00, α for the random semantics was .99, and α for the real sample was .79.

Factor structure: The 20 items were subjected to PCA with varimax rotation. The real responses yielded five factors explaining 65.5% of the variation. The responses simulated with semantic values yielded two factors explaining 98%, and the random semantics also produced two factors explaining 99%. A more interesting picture emerges when presenting two-dimensional plots of the factor structures, as displayed in Figure 5.

The two-dimensional plots reveal that the random semantics cannot distinguish between the three scales. The real sample produces three distinct clusters even if it does not present a satisfactory solution. The simulated sample presents a clear three-factor plot of the items. The reversed item in the social exchange scale is plotted on the same axis but orthogonally to the nonreversed, as theoretically expected. Still, social exchange items were erroneously grouped with intrinsic motivation.
Figure 4. Average scale scores for the three scales for semantically simulated, real respondents and random semantics. Note. CI = confidence interval.

Figure 5. Factor structures of random, semantic, and real samples.

Unfolding structure: As in Study 1, there was a clear relationship between the semantic predictability of the individual response patterns and their score levels. The simulated response patterns correlated at .67 with the dispersion of scores (standard deviation of scores within the individual) and .57 with the score level on intrinsic motivation (p < .01). Elevated scores increase the score dispersion, allowing the responses to be more predictable.

Discussion of Study 2

As in Study 1, the semantically simulated responses were similar but not completely identical to the original responses that they were meant to predict.

Simulated scales were similar in the sense that (a) the aggregated means of the main variables were of similar magnitude and exhibited similar mutual patterns, (b) the reliabilities were as high as or higher than the originals, (c) the majority of the simulated response patterns correlated highly with the original patterns, with only 2% in a negative direction, and (d) the factor structure in PCA indicated a three-factor solution, but only in a two-dimensional plot.

The simulated responses failed to produce a level of intrinsic motivation as high as the original (higher than the two other scales but significantly lower), and the factor structure failed to reproduce three clear-cut factors.
On the contrary, the simulated scores created with random semantics failed to replicate the originals on all accounts except for the alphas. This indicates that key characteristics of survey data—score levels, factor structures, and variable relationships—were reproducible by means of semantic indices in these scales.

Also, the three scales did not emerge clearly from the real responses. The present dataset may not have been ideal for training simulation algorithms. For the sake of brevity, we do not report the systematic effect of deleting real responses in Study 2.

Final Discussion and Suggestions for Future Research

The main purpose of this article was to develop and apply a simple algorithm for creating artificial responses, and compare these with a sample of real responses, explaining the rationale behind STSR and opening a field of exploring survey responses through computation. Across two different scales and samples, we were able to check the psychometric properties of simulated scores compared with the real human responses. The semantic indices always performed much better in predicting real scores than random numbers in the same range.

This is a new field with no established quality criteria, and so our aim was simply to conduct a test applying what we know. We also want to be transparent about what we do, omitting overly complicated steps that could have improved the performance.

The results could partly be artifacts of the algorithm itself. As we have pointed out, research on the effects of unfolding and measurements in construct validation has repeatedly shown that the survey structure itself is a major source of systematic variation, and hence needs to be considered in predicting responses (Maul, 2017; Michell, 1994; Slaney, 2017; van Schuur & Kiers, 1994).

However, we do think that improvements in predicting real scores are foreseeable already, addressing the following series of issues:

A theoretically more precise formula: It should ideally be possible to formulate a mathematically rigorous way to translate the semantic matrix into the distance matrix, and from the distance matrix to a prediction of Item B if Item A is known. This is the main theoretical goal of STSR, and we are not yet there.

More precise semantic estimates: This study applied semantics from the MI algorithm only. It is shown elsewhere that a combination of semantic algorithms will have incremental explanatory power (Arnulf & Larsen, 2015; Arnulf et al., 2014). Also, other computational methods have been shown to produce similar results and could possibly be combined with what we do here (Gefen & Larsen, 2017; Nimon et al., 2016). More advanced combinations of semantic values in the model may allow more precise replications of real responses.

A more precise weighting measure: In Study 1, we consequently used the beta from a model where the semantic values are regressed on the observed score differences. This was used as a benchmark to translate from MI values into probable score distances because it could be justified fairly simply. Study 2 showed that using the constant yielded better results. A more systematic mathematical rationale could create scores that are less uniform in the way they impose structure on the data, and could possibly keep the factor structure intact as produced by humans. One possibility is to replace the distance approximation with a probability function that could add some random error to the formula.

A better model for unfolding of the items: The unfolding pattern we created in Study 1 was also just a quick rule of thumb, and in Study 2, we did not take the unfolding into account at all, except for the negative correlations. More differentiated unfolding patterns could be modeled. One way would be to include more knowledge from the initial training data. This could increase the variation in data and reduce the tendency toward simplification of structures, as well as improve the performance of the algorithm in responses with medium-range responses. An important question to address is the case of multidimensional scales as in our second dataset. In such cases, it may be necessary to fix the response level for each dimension, which points to the entry of nonsemantic information about attitude strength in the data.

More advanced smoothing function: The fact that all items are locked in a grid of differing relationships to 44 other items is intriguing. A mathematical procedure that could capture this complex network of values would be a much more direct approach to calculations, possibly akin to multidimensional scaling (Borg & Groenen, 2005). This could let us test the degree to which people create response patterns deviating from what is semantically given. Not only would it inform STSR and unfolding theory but also allow us to distinguish better between empirical questions (pertaining to how people actually respond) and logical questions (setting up conditions for how people ideally should respond; Semin, 1989; Smedslund, 1988).

The results seem to support our main theoretical proposition to some degree. To the extent that survey responses are semantically determined, they are predictable a priori. The semantic values generally produced high alphas, high correlations, and orderly patterns in the data, which the randomly generated semantic values failed to produce even if the other steps of the algorithm were identical in both sets of simulated responses. An alarming finding in our data is that the semantic structure seems to produce better alphas and factor structures, possibly leading researchers to lean toward semantics in scale constructions to comply with current guidelines for fit indices (Hu & Bentler, 1999).
In STSR, survey responses may be seen more as an expression of coherent beliefs than a series of quantitative responses. The initial responses signal the endorsement of opinions. These could have been semantically explicit specifications of the response alternatives as in, for example, Guttman scales (Michell, 1994). "Response strength" may be seen as a signal carrier for the semantic anchor of the respondent's interpretation of the items.

In this regard, it is important to distinguish between survey responses as an individual expression and survey responses as input to aggregated sample statistics. STSR cannot predict the initial response level of a given respondent a priori, the "theta" in item response theory (Singh, 2004). What the theory predicts is that once the individual's level is set, the patterns (or values) of the remaining items are influenced or even determined by their semantic structure. Their values are not free to vary because they share overlapping meaning, and therefore share the same subjective evaluation. Thus, it will be the semantically determined patterns that carry over into the sample statistics, not so much the attitude strength (Arnulf, Larsen, Martinsen, & Egeland, 2018).

Sample statistics—the bulk of the correlations in the MLQ—may therefore be determined by semantic relationships that are void of attitude strength. This allows a precise prediction of the "outcome" scales by semantics, as demonstrated above and theoretically predicted by others (Van Knippenberg & Sitkin, 2013).

Taken together, our preliminary outline of a simulation procedure indicates how simulating semantically expected scores is possible. Subsequently, this may allow us to explore how to depart from what is semantically expected instead of rediscovering semantically predetermined relationships.

STSR does not propose that all survey data come about as a result of semantics. Neither does the theory claim that this model holds across all constructs. STSR simply proposes that whatever the sources of variation in survey data, the semantics implied is the first source to evaluate, often more powerful and systematic than hitherto assumed. By offering a rationale and an outline for experimental research on STSR, we hope future developments can address more detailed questions of the nature and interaction of survey responses.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We thank the U.S. National Science Foundation for research support under Grant NSF 0965338 and the National Institutes of Health through Colorado Clinical & Translational Sciences Institute for research support under NIH/CTSI 5 UL1 RR025780.

ORCID iD
Jan Ketil Arnulf https://orcid.org/0000-0002-3798-1477

References
Abdi, H. (2003). Factor rotations in factor analysis. In M. Lewis-Beck, A. Bryman, & T. Futing (Eds.), Encyclopedia of social sciences research methods (pp. 792-795). Thousand Oaks, CA: Sage.
Andrich, D. (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology, 49, 347-365.
Arnulf, J. K., & Larsen, K. R. (2015). Overlapping semantics of leadership and heroism: Expectations of omnipotence, identification with ideal leaders and disappointment in real managers. Scandinavian Psychologist, 2, e3. doi:10.15714/scandpsychol.2.e3
Arnulf, J. K., Larsen, K. R., Martinsen, Ø. L., & Bong, C. H. (2014). Predicting survey responses: How and why semantics shape survey statistics in organizational behavior. PLoS ONE, 9(9), e106361. doi:10.1371/journal.pone.0106361
Arnulf, J. K., Larsen, K. R., Martinsen, Ø. L., & Egeland, T. (2018). The failing measurement of attitudes: How semantic determinants of individual survey responses replace measures of attitude strength. Behavior Research Methods, 1-21. doi:10.3758/s13428-017-0999-y
Avolio, B. J., Bass, B. M., & Jung, D. I. (1995). Multifactor Leadership Questionnaire technical report. Redwood City, CA: Mind Garden.
Bagozzi, R. P. (2011). Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations. MIS Quarterly, 35, 261-292.
Borg, I., & Groenen, P. (2005). Modern multidimensional scaling: Theory and applications (2nd ed.). New York, NY: Springer.
Borsboom, D. (2008). Latent variable theory. Measurement, 6, 25-53.
Borsboom, D. (2009). Educational measurement: Book review. Structural Equation Modeling, 16, 702-711. doi:10.1080/10705510903206097
Coombs, C. H. (1964). A theory of data. New York, NY: Wiley.
Coombs, C. H., & Kao, R. C. (1960). On a connection between factor analysis and multidimensional unfolding. Psychometrika, 25, 219-231.
Feldman, J. M., & Lynch, J. G. J. (1988). Self-generated validity and other effects of measurement on belief, attitude, intention, and behavior. Journal of Applied Psychology, 73, 421-435.
Firmin, M. W. (2010). Commentary: The seminal contribution of Richard LaPiere's attitudes vs actions (1934) research study. International Journal of Epidemiology, 39, 18-20. doi:10.1093/ije/dyp401
Gefen, D., & Larsen, K. R. (2017). Controlling for lexical closeness in survey research: A demonstration on the technology acceptance model. Journal of the Association for Information Systems, 18, 727-757.
Habing, B., Finch, H., & Roberts, J. S. (2005). A Q3 statistic for unfolding item response theory models: Assessment of unidimensionality with two factors and simple structure. Applied Psychological Measurement, 29, 457-471. doi:10.1177/0146621604279550
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 294-316). Newbury Park, CA: Sage.
Kuvaas, B. (2006). Work performance, affective commitment, and work motivation: The roles of pay administration and pay level. Journal of Organizational Behavior, 27, 365-385.
Lamiell, J. T. (2013). Statisticism in personality psychologists' use of trait constructs: What is it? How was it contracted? Is there a cure? New Ideas in Psychology, 31, 65-71. doi:10.1016/j.newideapsych.2011.02.009
LaPiere, R. T. (1934). Attitudes vs. actions. Social Forces, 13, 230-237.
Leacock, C., Miller, G. A., & Chodorow, M. (1998). Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24, 147-165.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35, 293-334.
Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15, 51-69. doi:10.1080/15366367.2017.1348108
Michell, J. (1994). Measuring dimensions of belief by unidimensional unfolding. Journal of Mathematical Psychology, 38, 244-273.
Michell, J. (2013). Constructs, inferences, and mental measurement. New Ideas in Psychology, 31, 13-21. doi:10.1016/j.newideapsych.2011.02.004
Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. American Association for Artificial Intelligence, 6, 775-780.
Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41. doi:10.1145/219717.219748
Mohler, M., & Mihalcea, R. (2009, March 30-April 3). Text-to-text semantic similarity for automatic short answer grading. Paper presented at the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), Athens, Greece.
Nimon, K., Shuck, B., & Zigarmi, D. (2016). Construct overlap between employee engagement and job satisfaction: A function of semantic equivalence? Journal of Happiness Studies, 17, 1149-1171. doi:10.1007/s10902-015-9636-6
Nunnally, J. C., & Bernstein, I. H. (2010). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
Ortiz de Guinea, A., Titah, R., & Léger, P.-M. (2013). Measure for measure: A two study multi-trait multi-method investigation of construct validity in IS research. Computers in Human Behavior, 29, 833-844.
Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social science research and recommendations on how to control it. In S. T. Fiske, D. L. Schacter, & S. E. Taylor (Eds.), Annual review of psychology (Vol. 63, pp. 539-569). Palo Alto, CA: Annual Reviews.
Poli, R., Healy, M., & Kameas, A. (2010). WordNet. In C. Fellbaum (Ed.), Theory and applications of ontology: Computer applications (pp. 231-243). New York, NY: Springer.
Roberts, J. S. (2008). Modified likelihood-based item fit statistics for the generalized graded unfolding model. Applied Psychological Measurement, 32, 407-423. doi:10.1177/0146621607301278
Roysamb, E., & Strype, J. (2002). Optimism and pessimism: Underlying structure and dimensionality. Journal of Social & Clinical Psychology, 21, 1-19.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
Schriesheim, C. A., Wu, J. B., & Scandura, T. A. (2009). A meso measure? Examination of the levels of analysis of the Multifactor Leadership Questionnaire (MLQ). The Leadership Quarterly, 20, 604-616. doi:10.1016/j.leaqua.2009.04.005
Schwarz, N. (1999). Self-reports: How the questions shape the answers. American Psychologist, 54, 93-105.
Semin, G. (1989). The contribution of linguistic factors to attribute inferences and semantic similarity judgements. European Journal of Social Psychology, 19, 85-100.
Shore, L. M., Tetrick, L. E., Lynch, P., & Barksdale, K. (2006). Social and economic exchange: Construct development and validation. Journal of Applied Social Psychology, 36, 837-867.
Singh, J. (2004). Tackling measurement problems with item response theory: Principles, characteristics, and assessment, with an illustrative example. Journal of Business Research, 57, 184-208. doi:10.1016/s0148-2963(01)00302-2
Slaney, K. L. (2017). Validating psychological constructs: Historical, philosophical, and practical dimensions. London, England: Palgrave Macmillan.
Slaney, K. L., & Racine, T. P. (2013a). Constructing an understanding of constructs. New Ideas in Psychology, 31, 1-3. doi:10.1016/j.newideapsych.2011.02.010
Slaney, K. L., & Racine, T. P. (2013b). What's in a name? Psychology's ever evasive construct. New Ideas in Psychology, 31, 4-12. doi:10.1016/j.newideapsych.2011.02.003
Smedslund, J. (1988). What is measured by a psychological measure? Scandinavian Journal of Psychology, 29, 148-151.
Tejeda, M. J., Scandura, T. A., & Pillai, R. (2001). The MLQ revisited: Psychometric properties and recommendations. The Leadership Quarterly, 12, 31-52. doi:10.1016/S1048-9843(01)00063-7
Van Knippenberg, D., & Sitkin, S. B. (2013). A critical assessment of charismatic-transformational leadership research: Back to the drawing board? The Academy of Management Annals, 7, 1-60. doi:10.1080/19416520.2013.759433
van Schuur, W. H., & Kiers, H. A. L. (1994). Why factor analysis often is the incorrect model for analyzing bipolar concepts, and what models to use instead. Applied Psychological Measurement, 18, 97-110.

Author Biographies
Jan Ketil Arnulf, PhD, is an associate professor at BI Norwegian Business School, teaching and researching leadership and leadership development. He has served as an associate dean to the BI-Fudan MBA program in Shanghai, China.

Kai R. Larsen, PhD, is an associate professor of management and entrepreneurship at Leeds Business School, University of Colorado at Boulder. He serves as the director of the federally supported Human Behavior Project, researching a transdisciplinary "backbone" for theoretical research. He teaches business intelligence and privacy in the age of Facebook.

Øyvind L. Martinsen, PhD, is a full professor at BI Norwegian Business School in Oslo, Norway. He conducts research in leadership, personality, and creativity, and also teaches these issues as well as psychometrics.

Respondent Robotics: Simulating Responses to Likert-Scale Survey Items:

SAGE Open, Volume 8 (1): 1 – Mar 14, 2018



Publisher: SAGE
Copyright: © 2022 by SAGE Publications Inc, unless otherwise noted. Manuscript content on this site is licensed under Creative Commons Licenses.
ISSN: 2158-2440
eISSN: 2158-2440
DOI: 10.1177/2158244018764803

The pres- semantic properties (Arnulf & Larsen, 2015; Arnulf et al., ent article describes and tests a method to create artificial 2014). Such strong predictors of response patterns imply that responses according to the semantic properties of the survey it is possible to reverse the equations and use semantics to items based on the semantic theory of survey responses create realistic survey responses. This offers an empirical (STSR; Arnulf, Larsen, Martinsen, & Bong, 2014). According tool to explore why semantics can explain as much as 65% to to STSR, the semantic relationships will shape the baseline 86% in some surveys such as the MLQ, but as low as 5% in of correlations among items. Such relationships are now responses to the personality inventory. There is a need for accessible a priori through the use of digital semantic more detailed exploration of the phenomena involved to bet- algorithms. ter understand how and why STSR applies. Theoretically, survey responses should be predictable to Artificial responses calculated from the semantics of the the extent that their semantic relationships are fixed. The items could also enhance the scientific value of surveys. present study seeks to develop such a method and apply it to Ever since Likert devised his measurement scales (Likert, a well-known leadership questionnaire, the Multifactor 1932), recurring criticism has raised doubts about the predic- Leadership Questionnaire (MLQ; Avolio, Bass, & Jung, tive validity of the statistical models building on such scales 1995). Thereafter, we briefly show how it performs using a different measurement scale. 
BI Norwegian Business School, Oslo, Norway The contributions of this are threefold—primarily devel- University of Colorado Boulder, USA oping the rationale of STSR, secondarily testing a tool for Corresponding Author: establishing a baseline of response patterns from which more Jan Ketil Arnulf, BI Norwegian Business School, Nydalen, N-0442 Oslo, psychological inferences can be made, and also possibly Norway. offering an alternative approach to imputing missing data. Email: jan.k.arnulf@bi.no Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage). 2 SAGE Open (Firmin, 2010; LaPiere, 1934), as they are vulnerable to been treating the systematic variation among items as expres- inflated values through common method variance (Podsakoff, sion of attitude strength toward topics in the survey. MacKenzie, & Podsakoff, 2012). The STSR proposes a contrasting view. Here, the relation- The prevalent use of covariance and correlation matri- ships among items and among survey variables are first and ces in factor analysis and structural equations (Abdi, foremost semantic (Arnulf et al., 2014), a view corroborated 2003; Jöreskog, 1993) is problematic if we cannot dis- by independent researchers (Nimon, Shuck, & Zigarmi, criminate semantic variance components more clearly 2016). Every respondent may begin the survey by expressing from attitude strength. Establishing a semantic “baseline” attitude strength toward the surveyed topic in the form of a of the factor structure in surveys would allow us to study score on the Likert-type scale. 
However, in the succeeding how and why people chose to depart from what is seman- responses, the scores on the coming items may be predomi- tically given. nantly determined by the degree to which these items are Finally, a technology for simulating survey responses may semantically similar. This was earlier argued and documented have its own value. Present-day techniques of replacing miss- by Feldman and Lynch (1988). A slightly different version of ing values are basically mere extrapolations of what is already this hypothesis was also formulated by Schwarz (1999). in the matrix, and only work if the missing values make up However, both these precursors to STSR were speculating minute fractions of data (Rubin, 1987). In the current study, that calculation of responses may be exceptional to situations we present a technique to calculate the likely responses when where people hold no real attitudes, or become unduly influ- up to 95% of responses are missing. This kind of simulated enced in their response patterns by recent responses to other data help improve the theoretical foundations of psychomet- items. The formulation of STSR was the first claim that rics that hitherto has left semantics out of its standard inven- semantic calculation may actually be the fundamental mecha- tory of procedures (Borsboom, 2008, 2009). nism explaining systematic variance among items. Finally, data simulation based on item semantics could be Another antecedent to STSR is “unfolding theory” as a valuable accessory to otherwise complicated methods for described by Coombs (Coombs, 1964; Coombs & Kao, 1960) testing methodological artifacts (Bagozzi, 2011; Ortiz de and later by Michell (1994). We will deal with unfolding the- Guinea, Titah, & Léger, 2013). ory in some detail as it has direct consequences for creating We first present how semantics can be stepwise turned algorithms to mimic the real responses. A practical example into artificial responses. 
These responses are then compared may be a job satisfaction item, such as “I like working here.” with a sample of real responses and artificial responses with When respondents choose to answer this on a scale from 1 to no semantic information. The procedure is then applied to a 5, it may be hard to explain what the number means. To quan- second scale and dataset to test its applicability across instru- tify an attitude, one could split the statement in discreet ments. Finally, we discuss how the relevant findings may answering categories such as the extremely positive attitude: help develop STSR from an abstract theory to practical “I would prefer working here to any other job or even leisure applications. activity.” A neutral attitude could be a statement such as “I do not care if I work here or not,” or the negative statement “I would take any other job to get away from this one.” The Theory central point in unfolding theory is that any respondent’s pre- ferred response would be the point at which item response Semantics and Correlations scale “folds.” Folding implies that the response alternatives Rensis Likert assumed that his scales delivered measures of need to be sorted in their mutual distance from the preferred attitude strength (Likert, 1932). Statistic modeling of such option. If someone picks the option 4 on a scale from 1 to 5, data in classic psychometrics viewed survey responses as it would mean that the options 3 and 5 are about equally dis- basically composed of a true score and an error component. tant from 4, but that 2 and certainly 1 would be further away The error component of the score would reflect random from the preferred statement. In this way, the scale is said to influences on the response, and these could be minimized by be “folding” around the preferred value 4, which determines averaging scores of semantically related questions for each the distance of all other responses from the folding point. variable (Nunnally & Bernstein, 2010). 
The error variance is Michell (1994) showed mathematically and experimen- assumed to converge around 0, making average scale scores tally that the quantitative properties of surveys stem from a better expression of the true attitude strength of the respon- these semantic distinctions. Just as Coombs claimed, all dents. The relationships among other surveyed variables respondents need to understand the common semantic prop- should however not be determined by the semantics of the erties—the meaning—of any survey item to attach numerical items, but instead only covary to the extent that they are values to the questions in the survey. For two respondents to empirically related. A frequent way of demonstrating this rate an item such as “I like to work here” with 1 or 5, they relative independence has been done by applying factor ana- need to agree on the meaning of this response—the one lytical techniques (Abdi, 2003; Hu & Bentler, 1999). In respondent likes his job, the other does not, but both need to short, the prevalent psychometric practices have until now understand the meaning of the other response alternatives for Arnulf et al. 3 one’s own response to be quantitatively comparable. Michell of respondents is the source of measures informing the showed how any survey scale needs to fold along a “domi- empirical research (Bagozzi, 2011; Lamiell, 2013; nant path” —the mutual meaning of items and response MacKenzie, Podsakoff, & Podsakoff, 2011; Michell, 2013; options used in a scale. This “dominant path” will affect the Slaney, 2017; Slaney & Racine, 2013a, 2013b). responses to other items if they are semantically related. 
Other researchers have reported that the survey structure Take the following simple example measuring job satis- itself may create distinct factors for items that were origi- faction and turnover intention, two commonly measured nally devised as “reversed” or negatively phrased items variables in OB research: One item measuring job satisfac- (Roysamb & Strype, 2002; van Schuur & Kiers, 1994). One tion is the item “I like working here,” and one item measur- reason for this is the uncertain relationship between the ing turnover intention is “I will probably look for a new job actual measurements obtained from the survey and the in the next weeks.” A person who answers 5 to “I like work- assumed quantifiable nature of the latent construct in ques- ing here” is by semantic implication less likely to look for a tion. Kathleen Slaney’s (2017) recent review of construct new job in the next week than someone who scores 1, and validation procedures shows how “measurement” of atti- vice versa. Less obvious is the effect of what Michell called tudes may come about by imposing numbers on an unknown the “dominant path”: If someone has a slightly positive atti- structure. As shown by Andrew Maul (2017), acceptable tude toward the job without giving it full score, this person psychometric properties of scales are obtainable even if key- will be slightly inclined, but maybe not determined, to turn words in the items are replaced by nonsensical words. The down offers for a new job. The dominant path of such items psychometric properties were largely retained even if the will make the respondents rank the mutual answering alter- item texts were replaced by totally meaningless sentences or natives in an “unfolding way.” Not only are the extreme even by entirely empty items carrying nothing but response points of the Likert-type scales semantically linked but peo- alternatives. 
The survey structure seems to be a powerful ple also appear to rank the response option of all items in source of methods effects, imposing structure on response mutual order. A third item measuring organizational citizen- statistics. ship behavior (OCB), for example, is “I frequently attend to The purpose here is to reconstruct survey responses using problems that really are not part of my job.” The semantic semantic information and other a priori known information specification of responses to this scale may be negative items about the survey structure. Semantic information about the such as “I only do as little as possible so I don’t get fired” or semantic content of items is precisely void of knowledge positive items such as “I feel capable and responsible for cor- about attitude strength. If this type of information can be recting any problem that may arise.” used to create artificial responses with meaningful character- According to unfolding theory, people will respond such istics akin to the original ones, it will substantiate the claims that their response pattern is semantically coherent, that is, of STSR. In particular, it will deliver empirical evidence that consistent with an unfolding of the semantic properties of common psychometric practices may risk treating attitude items. The dominant path will prevent most people from strength as error variance, leaving mostly semantic relation- choosing answer alternatives that are not semantically ships in the statistics. This attempt is exploratory in nature, coherent. and we will therefore not derive hypotheses but instead seek Any survey will need a semantically invariant structure to to explore the research question from various angles. The attain reliably different but consistent responses from differ- following exploration is undertaken as two independent ent people. 
Coombs and Kao showed experimentally that studies: Study 1 is an in-depth study of the MLQ, containing there is a necessary structure in all surveys emanating from the main procedures to investigate and explore. Study 2 is a how respondents commonly understand the survey (Coombs brief application of the same procedure to a different, shorter & Kao, 1960; Habing, Finch, & Roberts, 2005; Roysamb & scale, and another sample of respondents. Strype, 2002). In STSR, correlations among survey items are primarily Study 1 explained by the likelihood that they evoke similar mean- ings. As we will show below, the semantic relationships Sample among survey items contain information isomorphic to the correlations among the same items in a survey. This implies Real survey responses were used to train the algorithms and that individual responses are shaped—and thereby princi- serve as validation criteria. These consisted of 153 randomly pally computable—because the semantics of items are given selected responses from an original sample of more than and possible to estimate a priori to administering the survey. 1,200 respondents in a Norwegian financial institution. The To the extent that this is possible, current-day analytical responses were collected anonymously through an online techniques risk treating attitude strength as error variance. survey instrument. Participation was voluntary with informed This is contrary to what is commonly believed, as the tradi- consent, complying with the ethical regulations of the tion of “construct validation” in survey research rests on the Norwegian Centre for Research Data (http://www.nsd.uib. assumption that attitude strength across samples no/nsd/english/index.html). 4 SAGE Open It contains no knowledge about surveys, leadership, or Estimating Item Semantics respondent behavior. The MLQ has 45 items. 
A number of algorithms exist that allow computing the similarity of the survey items. Here, we have chosen one termed "MI" (Mihalcea, Corley, & Strapparava, 2006; Mohler & Mihalcea, 2009). MI is chosen because it has been previously published, is well understood, and allows easy replication. The Arnulf et al. study from 2014 also showed that MI values are probably closer to everyday language than some LSA-generated values that may carry specialized domain knowledge.

The MI algorithm derives its knowledge about words from a lexical database called WordNet, containing information about 147,278 unique words that were encoded by a team of linguists between 1990 and 2007 (Leacock, Miller, & Chodorow, 1998; Miller, 1995; Poli, Healy, & Kameas, 2010). Building on knowledge about each single word in WordNet as its point of departure, MI computes a similarity measure for two candidate sentences, S1 and S2. It identifies part of speech (POS), beginning with tokenization and POS tagging of all the words in the survey item with their respective word classes (noun, verb, adverb, adjective, and cardinal, which play a very important role in text understanding). It then calculates word similarity by measuring each word in the sentence against all the words from the other sentence. This identifies the highest semantic similarity (maxSim) from six word-similarity metrics originally created to measure concept likeness (instead of word likeness). The metrics are adapted here to compute word similarity by computing the shortest distance between given words' synsets in the WordNet hierarchy. The word-word similarity measure is directional: it begins with each word in S1 being computed against each word in S2, and then vice versa. The algorithm finally considers sentence similarity by normalizing the highest semantic similarity (maxSim) for each word in the sentences, applying "inverse document frequency" (IDF) weights from the British National Corpus to weight rare and common terms. The normalized scores are then summed up for a sentence similarity score, SimMI, as follows:

$$
\mathrm{Sim}_{MI}(S_1, S_2) = \frac{1}{2}\left(
\frac{\sum_{w \in S_1} \mathrm{maxSim}(w, S_2) \cdot \mathrm{idf}(w)}{\sum_{w \in S_1} \mathrm{idf}(w)}
+
\frac{\sum_{w \in S_2} \mathrm{maxSim}(w, S_1) \cdot \mathrm{idf}(w)}{\sum_{w \in S_2} \mathrm{idf}(w)}
\right)
$$

where maxSim(w, S2) is the score of the most similar word in S2 to w, and idf(w) is the inverse document frequency of word w.

The final output of MI is a numeric value between 0 and 1, where 0 indicates no semantic overlap and numbers approaching 1 indicate identical meaning of the two sentences. These numbers serve as the input to our simulating algorithm for constructing artificial responses. Note that the information in the MI values is entirely lexical and syntactic.

Applied to all pairs of the 45 MLQ items, this yields (45 × (45 − 1)) / 2 = 990 unique item pairs, for which we obtain MI values.

One special problem concerns the direction of signs. In the MLQ, 264 of the 990 pairs of items are negatively correlated. Theory suggests that two scales, Laissez-faire and Passive Management by Exception, are likely to relate negatively to effective leadership. The problem has been treated extensively elsewhere (Arnulf et al., 2014), so we will only offer a brief explanation here. MI does not take negative values, and it does not differentiate well between positive and negative statements about the same content. For two items describing (a) how a manager is unapproachable when called for and (b) how the same person uses appropriate methods of leadership, the surveyed responses correlate at −.42 in the present sample, while the MI value is .38. The chosen solution is to allow MI values to be negative for all pairs of items from Laissez-faire and Passive Management by Exception (correctly identifying 255 of the 264 negative correlations, p < .001).

Semantics and Survey Correlations

STSR argues that there is an isomorphic relationship between the preadministration semantic properties (the MI values) and the postadministration survey correlations. This means that the two sets of numbers contain the same information, representing the same facts albeit in different ways: Correlations represent different degrees of systematic covariation, whereas semantics represent different degrees of overlap in meaning.

Correlations express the likelihood that the variation in Item B depends on the variation in Item A. A high correlation between the two implies that if someone scores high on Item A, this person is more likely to score high on Item B also. A correlation approaching 0 means that we cannot know from the response to Item A how the respondent will score Item B. In other words, the uncertainty in predicting the value of B increases with decreasing correlations until 0, after which certainty increases again for predictions in the opposite direction.

The semantic values can be read in a similar way: If the MI score of Items A and B is high, they are likely to overlap in meaning. A person who agrees with Item A is likely to agree with Item B as well. However, as the MI values are reduced, we can no longer make precise guesses about how the respondent will perceive Item B.

In both cases, low values translate into increasing uncertainty. In Likert-type scale data, the response values are restricted to integers in a fixed range, for example, between 1 and 5. Low correlations and low MI values indicate that the response to Item B can be any of the five values in the scale. Higher correlations and MI values reduce uncertainty and restrict the likely variation of responses to B. As these values increase, the expected uncertainty is reduced to a point where

Table 1. Correlations Between Average Score Differences, Standard Deviations of Score Differences, Magnitude of Surveyed Correlations, and MI Scores.
                                          Survey correlations,  Average score difference,  SD of score
                                          magnitude             Item A − Item B            differences
Average score difference Item A − Item B  −.94**
SD of score differences                   −.08*                 .10**
MI scores                                 .88**                 −.79**                     −.07*

Note. N for the surveyed sample was 153; N for the sample of differences and correlations was 990.
*Correlation is significant at the .05 level (two-tailed). **Correlation is significant at the .01 level (two-tailed).

Table 2. Hierarchical Regression Where MI Values (Step 1) and Survey Correlations (Step 2) Were Regressed on the Average Score Differences (N = 990).

                       Step 1      Step 2
MI values              −.79**      −.14**
Survey correlations    —           −1.07**
R                      .63         .89
F                      1,683.58    3,981.42

**p < .01.

the score on Item B is likely to be identical to the score on Item A.

If this is true, then both the MI scores and the real response correlations should be negatively related to two aspects of the surveyed data: the average distance between Item A and Item B, and the variance in this distance. A low correlation or a low MI value should indicate that the range of expected values of Item B increases. We explore this in Table 1, which supports this proposition. MI values and empirically surveyed correlations are strongly, negatively, and about equally related to the standard deviations of score differences. In other words, correlations and MI values express the same information about the uncertainty of scores between two survey items. The signs are opposite because higher MI scores indicate lower differences between scores of two items.

This provides a key to how MI values can allow us to estimate the value of a response to B if we know the response to A. MI scores can be translated into score distances because they are systematically related to the differences. By regressing the MI values on the score differences, the resulting standardized beta can be used to estimate the distance from A to B, given that we know A. Table 2 shows this regression. It displays a hierarchical model that enters the preadministration MI values in the first step. By also entering the postadministration correlations in the second step, we supply additional support for the claim that these two sets of scores indeed contain the same information.

After entering the original surveyed correlations in Step 2, the beta for the MI values is substantially reduced, indicating that the information contained in the MI values is indeed isomorphic to the information in the survey correlations. The same table also shows how the information in the MI values is slightly inferior to that of the correlations. This is to be expected, as the correlations and the standard deviations stem from the same source, while the MI algorithm is only one, imperfect algorithm out of several available choices. It has been shown elsewhere that it will usually take the output of several present-day algorithms to approximate the semantic parsing of natural human speakers (Arnulf et al., 2014), but improved algorithms may alleviate the problems in the future. Most importantly, we can use the beta of the first step to estimate a specific item response from knowledge about the MI value. In other words, we are training our respondent simulation algorithm using the regression equation above, capturing the beta as the key to further computations.

Simulating Responses

Based on the considerations above, it is possible to hypothesize that a given respondent's responses are not free to vary. Once the respondent has chosen a response to the initial items, the subsequent responses should be determined by the semantic relationships of the items (Arnulf et al., 2014; Nimon et al., 2016) and the structure of the survey, most notably the response categories (Maul, 2017; Slaney, 2017) and the unfolding patterns following from expected negative correlations (Michell, 1994; Roysamb & Strype, 2002; van Schuur & Kiers, 1994).

Ideally, it should be possible to predict any given response based on knowledge of the semantic matrix and a minimum of initial responses. In our simulations, we can see that any response in the MLQ is predictable by using other known responses and knowledge about the distances between items. The R values of these predictions are in the range of .86 to .94.

As the semantic MI values correlate at −.79 with the score differences and predict the distances significantly (R = .63), it should theoretically be possible to substitute the distances with the semantic values, and thus predict later responses with a minimum of initial responses.

The perfect formula is yet to be found, but we have created a preliminary algorithm that can possibly mimic real responses to the MLQ. The present approach explicitly aims at reproducing existing responses, as this gives us the best opportunity to compare simulated with real responses. The rationale for the algorithm combines semantics and unfolding theory as follows:

1. Responses are restricted to the values 1 to 5 of the same Likert-type scale. The difference between any two items, A and B, within this Likert-type scale is here referred to as the "distance" between A and B; for example, if A is 5 and B is 4, the distance between them is 1 (5 − 4).

2. In the case of high MI values, Item B is likely to be very close to its preceding item, A. Lower MI values indicate higher and less determinate distances.

3. The most probable absolute distance between Item A and Item B is calculated as the MI value for A and B multiplied by the standardized beta in the regression equation of Table 2 (−0.79). To predict a given distance from this type of regression equation, the formula should be as follows: Value(Item B) = Constant + (MI for Item A and Item B) × (−0.79). However, the distances were computed as absolute measures; that is, the absolute distance from 3 to 5 is 2, but so is the distance from 5 to 3. In practice, though, the algorithm may need to predict a high number from a low number or vice versa. The constant will therefore not "anchor" the distance at the right point in the scale.

4. We therefore need to tie the estimated point to the value of Item A. We have tested several approaches to this, and the formula that seems to work best for calculating any response B is to simply replace the constant with the value for Item A, thus Value(Item B) = Value(Item A) + (MI for Item A and Item B) × (−0.79).

5. This formula does impose the structure of the semantic values on the subsequent numbers. It also seems counterintuitive, because if MI increases (indicating higher similarity), the term will grow in absolute numbers. However, the beta is negative, and the resulting number will be smaller. The impact on the ensuing calculations now comes from the unfolding operations, depending on whether Response B is higher or lower than A. To comply with predictions from unfolding theory, the formula above keeps its positive form if the respondent's first three responses indicate a positive evaluation (biasing the item distances in a positive direction) but should be negative if the unfolding pattern appears to be negative. This information is picked up by comparing the responses to Items 1, 2, and 3: While Items 1 and 2 are descriptions of positive leadership, Item 3 contains a negative appreciation.

6. In the case that Items A and B are assumed to be negatively related (as discussed in the explanation of MI values above), the same relationship between MI and distances holds. However, the estimated value should logically be at the other end of the Likert-type scale (in a perfect negative correlation, a score of 5 on A indicates that the score for B is 1). So in the case of expected negative correlations, the direction of the algorithm formula is reversed within the 5-point Likert-type scale, such that Value(Item B) = 6 − Value(Item A) + (MI for Item A and Item B) × (−0.79).

7. In this way, it is possible to start with Item 1 and use the MI values to calculate the relationship of Item 1 to Items 2, 3, and so on until Item 45. This process is repeated from Item 2 to all of Items 3 to 45, and so on, until values have been calculated for all 990 unique pairs of items.

8. To simulate missing responses, we can now delete the original responses and replace them with those computed in Step 7 above.

9. One final requirement is theoretically and practically important. As mentioned, the MI values and correlations are not really distance measures but measures of uncertainty, which in cases of low MI values should be indeterminate. The formula used here instead applies the beta from the regression equation as a measure of distance. However, uncertain values are in turn restricted by having closer relationships to other items. The whole matrix of 990 unique pairs of items is comparable with a huge Sudoku puzzle where each item score is defined by its relationship to the 44 other items. We can use this to smooth out the simulated values for each item by averaging all the 44 estimated values resulting from each of its 44 relationships.

In this way, our algorithm is based on the complete pattern of semantic distances for every item with all other items, as well as a hypothesis on the direction of scale unfolding based on the initial three responses. It is admittedly explorative and based on an incomplete understanding of the issues involved, and our intention is to invite criticism and improvements from others. One questionable feature of this algorithm is the tendency for positive evaluations to escalate positively and vice versa, probably due to a deficiency of the formula in Step 4. In the course of all 990 iterations, however,
these tendencies seem to balance each other out and fix the averaged responses as dictated by the mutual pattern of semantic distances. We have also checked that this formula performs better than simply using averages of the known values instead of semantics, thus substantiating the use of semantics in the formula. A further contrasting procedure will be described below.

The MLQ has 45 items. Of these, 36 measure different types of leadership behaviors, and the last nine items measure how well the rated person's work group does; these are commonly treated as "outcome" variables. The Arnulf et al. (2014) study found the "outcome" variables to be determined by the responses to the preceding items. We will therefore start by trying to predict the individual cases of these by deleting them from real response sets. By deleting progressive numbers of items, we will then explore how well the semantics perform in predicting the missing responses.

Therefore, our first simulation step will be concerned with predicting outcomes, training the algorithm on the first 36 items. In the next steps, we simply subtract the remaining half of the survey until all real responses are deleted, offering the algorithm diminishing amounts of training information. In this way, we can evaluate the degree to which the computed values still bear resemblance to the original values.

Contrast validation procedure. Algorithms like this may create artificial structures that are not due to the semantic MI values but are simply artifacts created by the algorithm procedures themselves. To control for this, we have created similar sets of responses with the same numbers of missing values, where the MI values in the algorithm are replaced by randomly generated values in the same range as the MI values (from −1 to +1). If similarities between artificial and real responses are created by biases in the algorithmic procedure and not by semantics, the output of randomly generated numbers should also be able to reproduce numbers resembling the original scores. The difference between the output of random and semantically created numbers expresses the value of (present-day) semantics in predicting real responses.

Simulation Criteria

There are no previously tested criteria for assessing the quality of simulated survey responses compared with real ones. Survey data are generally used either as summated scores to indicate the respondents' attitude toward the survey topic (score level or attitude strength) or as input to statistical modeling techniques such as structural equation modeling (SEM). In addition, survey data are often scrutinized by statistical methods to check their properties prior to such modeling (Nunnally & Bernstein, 2010). Therefore, we propose the following common parameters to evaluate the resemblance of the artificial responses to the real ones:

1. Scale reliability: The simulated scores should have acceptable reliability scores (Cronbach's alpha), preferably similar to the real scores.

2. Accumulated scores: A simulated survey response should yield summated scale values similar to the ones of the surveyed population. Ideally, the average scores on simulated leadership scales should be nonsignificantly different from the average summated scores of real survey scores. The average summated simulated scores should also be significantly different from the other scales (differential reliability).

3. Pattern similarity: The simulated survey scores should not only show similar magnitude, but the pattern of simulated scores should also correlate significantly with the real individual score profiles. In particular, there should be few or no negative correlations between real and simulated score profiles in a sample of simulated protocols.

4. Sample correlation matrix: The simulated scores should yield a correlation matrix similar to the one obtained from real survey scores.

5. Factor structure: The factor structure of simulated responses should bear resemblance to the factor structure emerging from the real sample.

6. Unfolding structure: Seen from the perspective of unfolding theory, extreme score responses are easier to understand than midlevel responses. In an extreme score, a positive respondent will have a general tendency to reject negative statements and endorse high positive scores, and a negative respondent will rank items in the opposite direction. Midlevel items across a complex scale would require more complex evaluations of how to "fold" each single item so as to stay with the dominant unfolding path (Michell, 1994). This is a tougher task for both respondents and the simulating algorithm. We therefore want to check if our algorithm is more appropriate for high and low than for medium scores.

Results

Table 3 shows the alpha values for all MLQ scales. Values for the real responses are in the first column. Computations are made for increasing numbers of missing values toward the right. It can be seen that the alphas for simulated responses are generally better than those for the real responses (the alphas for simulated responses are lower than for the real values in only six of 40 cases). The alphas generated from random semantic responses are inadequate and keep deteriorating as items are replaced by simulated responses.

Table 4 shows the mean summated scores for each of the MLQ subscales in the sample. When the nine outcome measures are missing (replaced by simulated scores), their simulated scale is nonsignificantly different from the original. When 21 item scores are missing (46% missing), there are only two instances of significant scale differences.

Table 3. Cronbach's Alpha for All MLQ Scales, Real and Simulated Responses.

Scale                      Real  Outcome  21(46%)  33(73%)  33 rand.  39(86%)  39 rand.  42(95%)  42 rand.  100%
                                 missing  missing  missing  semant.   missing  semant.   missing  semant.   synthetic
Idealized influence attr.  .74   .77      .82      .88      −.10      .99      .13       1.00     −.15      .99
Idealized influence beh.   .72   .72      .72      .90      −.07      .92      −.04      .99      −.06      .99
Inspiring motivation       .80   .80      .82      .91      .09       .99      −.12      1.00     −.05      .99
Intellectual stimulation   .83   .82      .84      .85      .45       .91      −.20      .93      −.11      .76
Individualized consider.   .78   .78      .82      .99      −.22      1.00     .16       1.00     −.06      .99
Conditional reward         .73   .73      .79      .90      .42       .99      .10       1.00     −.20      .99
Mgmnt by exception act.    .51   .52      .43      .72      .00       .77      .13       .97      −.27      .95
Mgmnt by exception pas.    .47   .47      .47      .76      .38       .82      −.09      .83      −.06      .83
Laissez-faire              .77   .77      .75      .78      .33       .84      −.03      .99      −.07      .97
Outcome measures           .92   1.00     1.00     1.00     .18       1.00     −.02      1.00     .07       1.00

Note. MLQ = Multifactor Leadership Questionnaire.

Table 4. Means for Subscales by Simulated Populations.

Subscale         Real  Outcome  21(46%)  33(73%)  33 rand.  39(86%)  39 rand.  42(95%)  42 rand.
                       missing  missing  missing  semant.   missing  semant.   missing  semant.
IdealizedAttrib  3.43  3.42     3.39     3.58     3.03      3.79     3.02      3.87     3.00
IdealizedBehv    3.94  3.95     3.84     3.78     3.23      3.86     3.22      3.83     2.98
InspMotive       3.83  3.84     3.78     3.77     3.23      3.78     3.00      3.86     2.99
IntellStim       3.28  3.28     3.44     3.55     3.14      3.63     3.06      3.69     3.06
IndConsid        3.59  3.59     3.59     3.73     3.01      3.84     3.00      3.90     3.02
CondReward       3.79  3.79     3.71     3.80     3.44      3.84     3.27      3.90     3.23
MBEact           3.06  3.08     3.11     3.63     3.06      3.70     3.06      3.78     2.97
MBEpass          2.63  2.62     2.62     2.38     2.73      2.39     2.98      2.33     2.98
LaissFaire       2.37  2.37     2.43     2.32     2.71      2.28     2.85      2.22     3.01
Outcome          3.53  3.59     3.69     3.85     3.00      3.91     3.00      3.94     2.99
Average difference
from real              .01      .07      .20      .38       .25      .47       .30      .52

Note. Bold types: Not significantly different from their real human counterparts, p < .05.
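The first simulation criterion compares scale reliabilities via Cronbach's alpha, as reported for all real and simulated MLQ scales. For reference, a minimal NumPy sketch of that computation follows; the toy data are invented for illustration and are not the MLQ sample.

```python
# Minimal sketch of the Cronbach's alpha computation used to compare real
# and simulated scale reliabilities. The toy data below are invented.
import numpy as np

def cronbach_alpha(items):
    """items: a respondents x items array of Likert-type scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Three perfectly parallel items yield the maximum alpha of 1.0.
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
perfect = np.column_stack([base, base, base])
print(round(cronbach_alpha(perfect), 2))  # 1.0
```

Alphas close to 1.00, as in the heavily simulated columns of Table 3, are what this formula yields when items become nearly parallel, which is what averaging semantically estimated values tends to produce.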
When 33 or 39 items are missing, the number of significant differences increases, but the average differences from the real scores are very small: 0.08 Likert-type scale points even for the 33 missing items, and 0.18 points in difference where 39 items (86% of the responses) are missing and replaced by simulated scores. Most of the scales are also still significantly different from each other, such that no scale measuring transformational leadership overlaps with Laissez-faire, Passive Management by Exception, or the outcome variable scores. There is a tendency for some of the differences between the scales within the transformational leadership construct to overlap with increasing numbers of simulated items.

When all these scores are summed up in their purported higher level constructs (transformational, transactional, and laissez-faire leadership, and outcomes), this pattern of average scores is maintained. Scores computed with random semantics depart more quickly and more dramatically from their real counterparts; see Table 5.

Table 5. Means for Main Constructs by Simulated Populations.

Main construct    Real  Outcome  21(46%)  33(73%)  33 rand.  39(86%)  39 rand.  42(95%)  42 rand.
                        missing  missing  missing  semant.   missing  semant.   missing  semant.
Transformational  3.62  3.62     3.61     3.68     3.13      3.78     3.06      3.83     3.01
Transactional     3.16  3.16     3.15     3.27     3.07      3.31     3.10      3.34     3.06
Laissez-faire     2.37  2.37     2.43     2.32     2.71      2.28     2.85      2.22     3.01
Outcomes          3.53  3.59     3.69     3.85     3.00      3.91     3.00      3.94     2.99
Average difference
from real               .02      .06      .14      .36       .20      .41       .24      .47

Note. Bold types: Not significantly different from their real human counterparts, p < .05.

Every individual's simulated responses were correlated with their real counterparts to compare the patterns of real versus simulated responses. Table 6 shows how these correlations were distributed in the various simulated groups. As could be expected, there is a decline in the resemblance between the simulated scores and their real duals as the number of simulated scores increases. However, this decline happens much faster for the scores generated by random patterns, and when 42 items are replaced with simulated scores, there are still only eight cases (5%) that correlate negatively with the real respondents; see Figure 1.

Table 6. Characteristics of the Average Correlations Between Real and Simulated Respondents by Number of Simulated Item Responses.

Simulated sample              No. of negative  Minimum      Maximum      Mean         SD
                              correlations     correlation  correlation  correlation
Outcome items (nine) missing  0                .79          1.00         .94          .05
21 items missing              0                .35          1.00         .83          .10
33 items missing              0                .06          .91          .61          .18
33 items random semantics     0                .11          .81          .50          .11
39 items missing              2                −.24         .87          .34          .31
39 items random semantics     2                −.08         .57          .31          .11
42 items missing              8                −.62         .88          .44          .29
42 items random semantics     22               −.26         .42          .14          .13

Figure 1. The frequency distribution of correlations between real and simulated responses for the simulated populations, replacing 42 of 45 item responses with simulated scores.

We explored how the relationships among the subscales of the MLQ changed with increasing numbers of missing items. An interesting difference appeared between the values replaced by the semantically informed algorithm and the algorithm with random semantic values: With increasing numbers of simulated values, the correlations among the subscales tended to increase for the semantically informed simulations. Where the semantic predictions were replaced by random numbers (leaving only the pattern of the algorithm itself, void of semantics), the correlations among the subscales decreased, approaching 0 where 39 of the 45 responses were simulated; see Figure 2.

Figure 2. Absolute interscale correlations by simulated sample.

We then performed a principal components analysis (PCA) on these samples to compare their ensuing patterns. The MLQ has been criticized for its messy factor structure over the years, as some researchers find support for it and others do not (Avolio et al., 1995; Schriesheim, Wu, & Scandura, 2009; Tejeda, Scandura, & Pillai, 2001). In our sample (N = 153), eight or nine factors emerged, but the rotated factors were not clearly delineated and did not fully support the theorized structure of the survey. However, we are here not concerned with the structure of the MLQ itself but with the similarity of the real and simulated measures. Table 7 shows that as an increasing number of items are replaced by semantically simulated ones, there is a gradual reduction in the number of factors identified. This is completely opposite to what happens where scores are computed with random input to the algorithm. In these cases, there is a proliferation of eigenvalues, increasing with the number of simulated variables. The numbers of factors indicated by scree plots are displayed in brackets, as these may be just as interesting as the factors identified by eigenvalues (see Figure 3). The MI values seem to impose a simplified structure on the data in PCA reminiscent of factor structures, and rotational procedures did not change the emerging patterns. The two factors emerging from the purely synthetic condition seem to be an artifact of the algorithm, because it needs two (randomly chosen) initial values to get started.

Table 7. Number of Factors With Eigenvalue > 1 Extracted in Principal Components Analysis, Real and Simulated Samples (Factors Indicated by Scree Plots in Brackets).

                  Real   Outcome  21 items  33 items  33 rand.  39 items  39 rand.  42 items  42 rand.  Synthetic
                         missing  missing   missing   semant.   missing   semant.   missing   semant.
Computed on all
45 items          9 (4)  8 (4)    6         4         19        3 (5)     18        2 (3)     30        2 (3)
Computed without
outcome items     8 (4)  8 (4)    6         4         16        3 (6)     15        2 (3)     16        2 (3)

Figure 3. Principal components scree plots: one real and three simulated samples (39 items missing, 39 items replaced with random semantics, and one completely synthetic sample).

We finally checked whether the score levels could affect the similarity between simulated and real responses. As we were expecting, higher scores on both transformational leadership and laissez-faire (and, by implication, the outcome values) were all related to higher correlations between the real response and its simulated duplicate. This tendency increased with a higher number of simulated scores but was absent in responses computed in the random control condition; see Table 8.

Discussion of Study 1

Summing up our findings, the following descriptions seem supported:

Outcome measures: When the outcome measures were substituted with simulated measures, these were virtually nondistinguishable from the real measures. This implies that the purported outcome variables are not independent and empirical but determined directly by their semantic relationships to the previous survey items. The simulated outcome levels were nondistinguishable from the real ones even when 39 of 45 items were replaced by simulated items.

Reliability: The reliability levels of scales in the simulated responses were comparable with, and in most cases better than, the real responses. With increasing numbers of items substituted by simulated items, the alpha values increased. Responses computed with random semantic figures presented deteriorating alphas. This supports our claim that the psychometric structures are caused by the semantic patterns and are not an artifact of the algorithm.

Summated scale levels: Even with the simple algorithm, the respondents' levels of endorsing or criticizing their managers' leadership behaviors were reliably captured by a small subset of items across each of the 10 subscales.
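The estimation rule behind these simulated scale levels can be sketched as follows. The MI values and responses below are invented for illustration; only the standardized beta (−0.79), the 6 − Value(A) reversal for negative pairs, the 1 to 5 range, and the averaging over known items come from the algorithm described above.

```python
# Sketch of the simulation rule (Steps 4 and 6) with the averaging of
# Step 9. MI values and responses below are invented; the beta (-0.79)
# and the 1-5 Likert range follow the text.
import numpy as np

BETA = -0.79  # standardized beta from the Table 2 regression

def estimate(value_a, mi_ab, negative_pair=False):
    """Estimate item B from item A: Step 4 (positive) or Step 6 (negative)."""
    base = (6 - value_a) if negative_pair else value_a
    return min(5.0, max(1.0, base + mi_ab * BETA))

def simulate_item(target, known, mi):
    """Step 9: average the estimates from all known items ("Sudoku" smoothing)."""
    return float(np.mean([estimate(v, mi[(a, target)]) for a, v in known.items()]))

# Three known responses predict one missing item of a hypothetical scale.
mi = {(1, 4): 0.6, (2, 4): 0.4, (3, 4): 0.5}
known = {1: 4, 2: 5, 3: 4}
print(round(simulate_item(4, known, mi), 2))  # 3.94
```

Calling estimate with negative_pair=True unfolds the estimate to the opposite end of the scale, as Step 6 prescribes for the Laissez-faire and Passive Management by Exception pairs.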
Applied here, six real item responses (of 45 scale items) are enough to predict the level of transformational leadership and laissez-faire scale scores precisely. Twelve items allow a fairly precise calculation of the summated scale levels. When the computed composite scores started deviating in a statistically significant way from the real score levels, the differences were still quite small, and with the exception of the scale Passive Management by Exception, they were always closer to the real ones than to the randomly generated scores.

Pattern similarity: The simulated survey responses correlated highly with their real origins, and there were almost no cases where these correlations took negative values. That is interesting, given Michell's (1994) findings that only a few percent of survey respondents will respond in a way that violates the semantic structure of the survey and its unfolding pattern. Even the sample computing 42 simulated scores from three given responses was highly and significantly correlated with its real counterparts. It seems warranted to say that the pattern of scores created by our simulation algorithm largely replicated the pattern of real responses. The randomly generated patterns performed clearly inferior to the true semantic values.

Correlation matrices: For the sake of brevity, we compared only the correlation matrices of the accumulated subscales, substituting real scores for samples with increasing numbers of simulated responses. This comparison is probably the one where simulated scores did not perform so well. The correlations among the scales increased with increasing numbers of simulated responses. This finding is, however, mixed in terms of STSR relevance: While our algorithm seems to be less sensitive to differential information with more simulated items, the correlations will tend to increase in magnitude. This means that, all else being equal, semantic information is a powerful source of correlations in survey data. This was evident in comparison with the correlation matrices generated from random values, which approached 0 as more responses were replaced by simulated ones.

Factor structure: As with the correlation matrices (and related to this matter), the factor structures of the data samples were increasingly simple with more semantics based on simulated scores, ending with a two-factor model when all but three items were computed (95% of the items replaced). The MLQ may not be a good testing ground for factor structures, as it was itself quite messy in the small random sample we used here. Still, the sample using simulated outcome scores identified the outcomes as clearer than the real sample did. Random responses developed in the opposite direction and quickly began generating extra factors, proliferating upward to 15 to 30 factors.

Unfolding structure: As we expected, the simulator was most accurate in recreating response patterns at the extreme score levels; that is, for respondents who were very negative or very positive toward their managers. Intermediate levels were harder to simulate exactly, and the scale "Active management by exception" seems in all explorations to offer the least precisely estimated scores by our algorithm. This difficulty handling the "lukewarm" scores is expected from unfolding theory (Andrich, 1996; Coombs, 1964; Coombs & Kao, 1960; Michell, 1994; Roberts, 2008), because such intermediate response patterns give rise to more complex folding of scales.

Table 8. The Relationship Between the Magnitude of Subscale Score Levels and the Correlations Between Real and Simulated Responses, by Number of Simulated Items.

MLQ subscale    Outcome (nine)  21 items  33 items  33 missing,     39 items  39 missing,     42 items  42 missing,
                items missing   missing   missing   random semant.  missing   random semant.  missing   random semant.
Transform.      .46**           .50**     .50**     −.05            .27**     −.02            .59**     .02
Transact.       .15             .12       .23**     −.13            .14       −.13            .54**     .04
Laissez-faire   −.36**          −.53**    −.51**    −.23**          −.28**    −.19*           −.60**    .02
Outcomes        .45**           .43**     .41**     .00             .26**     −.06            .57**     .12

Note. MLQ = Multifactor Leadership Questionnaire.
*p < .05 level (two-tailed). **p < .01 level (two-tailed).

Study 2

Measures

The scale subjected to simulation of scores here is a composite of three scales frequently used in OB research: two published scales measuring perceptions of economic and social exchange, comprising eight and seven items, respectively (Shore, Tetrick, Lynch, & Barksdale, 2006), and one scale measuring intrinsic motivation, comprising five items (Kuvaas, 2006). These scales were chosen because they originate from different researchers and have not been part of a coherent instrument. They are also shorter and offer fewer complexities than the MLQ. These scales displayed semantic predictability in the previous study on STSR (Arnulf et al., 2014).

Sample

A randomly chosen sample of 100 employees from a Norwegian governmental research organization was used to train and validate the algorithm. About 72% of the respondents were male, and the majority of respondents held university degrees at the bachelor level or higher.

Analytical Procedures

We used the MI algorithm to compute semantic similarities between all 20 items. This yields a matrix of 20 × 19 / 2 = 190 unique item pairs. The problem of negatives was solved as described in the case of the MLQ, as the scale measuring economic exchange can be shown a priori to be negatively correlated with the other two (see Arnulf et al., 2014). Also, one item measuring social exchange is originally reversed, and it was kept that way to conform with the theoretical handling of negatives.

The semantic indices from the MI algorithm predicted the sample correlation matrix significantly, with an adjusted R of .52. As in the study above, this relationship was even stronger with the interitem distances (the average distance in scores between Item A and Item B . . . ), reaching an adjusted R of .81. To train the predicting algorithms, we kept the constant (1.342) and the unstandardized beta (−.907) from the latter regression analysis.

Individual response patterns were predicted by applying the algorithm developed in Study 1. We replaced the sample constant and unstandardized beta with the values from this sample and tested this version first:

For predicted positive correlations,

Value(Item B) = Value(Item A) + (MI for Item A and Item B) × (−.907).

For predicted negative correlations,

Value(Item B) = 6 − Value(Item A) + (MI for Item A and Item B) × (−.907).

The resulting numbers were promising but did not seem totally satisfactory, possibly due to unfolding problems. Whereas the MLQ is composed of highly heterogeneous subscales distributed in a mixed sequence, the Study 2 scales are very homogeneous and distributed one by one. It is hard to find an a priori rule for the unfolding of the combined scale. However, the unstandardized beta is −.907, which is almost −1, and so plays a small role when multiplied with other values except for changing the sign. We first removed the sign to check the effect on unfolding, but the results were equally promising yet unsatisfying. We then decided to remove the beta and replace it with the constant for the item differences (1.342) plus the semantic MI value. This provided a better approximation of the scores:

For predicted negative correlations,

Value(Item B) = 6 − Value(Item A) + (MI for Item A and Item B + 1.342).

We then proceeded to explore if the responses simulated from semantic values predict their "real" counterparts better than random values in the same range (control condition).

Results

The results will be reported summarily along the same lines as in Study 1:

Summated scale levels: Figure 4 shows the average accumulated scores for three test samples. The patterns of the semantically simulated scores are similar to the real sample, but the average score on intrinsic motivation is somewhat low (albeit significantly higher than the score for social exchange). Adjusting the unfolding pattern in the algorithm could possibly alleviate this. Importantly, the pattern seemed driven by the semantic values, as the random values tend to wipe out the pattern and the average scores become similar.

Pattern similarity: The semantically simulated test responses correlated on average .56 with the originals. The highest correlation was .89 and the lowest was −.37, but only two of the 100 simulated responses actually correlated negatively with their real counterparts. The simulations using random semantics yielded an average correlation of .10, with 30% negative correlations.

Reliability: The simulated responses yielded an α of 1.00, α for the random semantics was .99, and α for the real sample was .79.

Factor structure: The 20 items were subjected to PCA with varimax rotation. The real responses yielded five factors explaining 65.5% of the variation. The responses simulated with semantic values yielded two factors explaining 98%, and the random semantics also produced two factors explaining 99%. A more interesting picture emerges when presenting two-dimensional plots of the factor structures, as displayed in Figure 5.

The two-dimensional plots reveal that the random semantics cannot distinguish between the three scales. The real sample produces three distinct clusters even if it does not present a satisfactory solution. The simulated sample presents a clear three-factor plot of the items.
The reversed item For predicted positive correlations, in the social exchange scale is plotted on the same axis but orthogonally to the nonreversed, as theoretically expected. ValueItem B = Value Item A + () () Still, social exchange items were erroneously grouped with MI for Item A and Item B + 1.342. intrinsic motivation. () 14 SAGE Open Figure 4. Average scale scores for the three scales for semantically simulated, real respondents and random semantics. Note. CI = confidence interval. Figure 5. Factor structures of random, semantic, and real samples. Unfolding structure: As in Study 1, there was a clear rela- Simulated scales were similar in the sense that (a) the tionship between the semantic predictability of the indi- aggregated means of the main variables were of similar mag- vidual response patterns and their score levels. The nitude and exhibited similar mutual patterns, (b) the reliabili- simulated response patterns correlated at .67 with the dis- ties were high or higher than the originals, (c) the majority of persion of scores (standard deviation of scores within the the simulated response patterns correlated highly with the individual) and .57 with the score level on intrinsic moti- original patterns with only 2% in a negative direction, and vation (p < .01). Elevated scores increase the score disper- (d) the factor structure in PCA indicated a three-factor solu- sion, allowing the responses to be more predictable. tion but only in a two-dimensional plot. The simulated responses failed to produce a level of intrinsic motivation as high as the original (higher than the Discussion of Study 2 two other scales but significantly lower), and the factor As in Study 1, the semantically simulated responses were structure failed to reproduce three clear-cut factors. similar but not completely identical to the original responses On the contrary, the simulated scores created with random that they were meant to predict. 
semantics failed to replicate the originals on all accounts Arnulf et al. 15 except for the alphas. This indicates that key characteristics A more precise weighting measure: In Study 1, we conse- of survey data—score levels, factor structures, and variable quently used the beta from a model where the semantic relationships—were reproducible by means of semantic indi- values are regressed on the observed score differences. ces in these scales. This was used as a benchmark to translate from MI values Also, the three scales did not emerge clearly from the real into probable score distances because it could be justified responses. The present dataset may not have been ideal for fairly simply. Study 2 showed that using the constant training simulation algorithms. For the sake of brevity, we do yielded better results. A more systematic mathematical not report the systematic effect of deleting real responses in rationale could create scores that are less uniform in the Study 2. way they impose structure on the data, and could possibly keep the factor structure intact as produced by humans. One possibility is to replace the distance approximation with a probability function that could add some random Final Discussion and Suggestions for error to the formula. Future Research A better model for unfolding of the items: The unfolding The main purpose of this article was to develop and apply a pattern we created in Study 1 was also just a quick rule of simple algorithm for creating artificial responses, and compare thumb, and in Study 2, we did not take the unfolding into these with a sample of real responses, explaining the rationale account at all, except for the negative correlations. More behind STSR and opening a field of exploring survey responses differentiated unfolding patterns could be modeled. One through computation. 
Across two different scales and samples, way would be to include more knowledge from the initial we were able to check the psychometric properties of simu- training data. This could increase the variation in data and lated scores compared with the real human responses. The reduce the tendency toward simplification of structures, semantic indices always performed much better in predicting as well as improving the performance of the algorithm in real scores than random numbers in the same range. responses with medium-range responses. An important This is a new field with no established quality criteria, and question to address is the case of multidimensional scales so our aim was simply to conduct a test applying what we as in our second dataset. In such cases, it may be neces- know. We also want to be transparent about what we do, sary to fix the response level for each dimension, which omitting overly complicated steps that could have improved points to the entry of nonsemantic information about atti- the performance. tude strength in the data. The results could partly be artifacts of the algorithm itself. More advanced smoothing function: The fact that all items As we have pointed out, research on the effects of unfolding are locked in a grid of differing relationships to 44 other and measurements in construct validation has repeatedly items is intriguing. A mathematical procedure that could shown that the survey structure itself is a major source of capture this complex network of values would be a much systematic variation, and hence needs to be considered in more direct approach to calculations, possibly akin to predicting responses (Maul, 2017; Michell, 1994; Slaney, multidimensional scaling (Borg & Groenen, 2005). This 2017; van Schuur & Kiers, 1994). could let us test the degree to which people create response However, we do think that improvements in predicting patterns deviating from what is semantically given. 
Not real scores are foreseeable already, addressing the following only would it inform STSR and unfolding theory but also series of issues: allow us to differ better between empirical questions (per- taining to how people actually respond) and logical ques- A theoretically more precise formula: It should ideally be tions (setting up conditions for how people ideally should possible to formulate a mathematically rigorous way to respond; Semin, 1989; Smedslund, 1988). translate the semantic matrix into the distance matrix, and from the distance matrix to a prediction of Item B if Item The results seem to support our main theoretical proposi- A is known. This is the main theoretical goal of STSR, tion to some degree. To the extent that survey responses are and we are not yet there. semantically determined, they are predictable a priori. More precise semantic estimates: This study applied The semantic values generally produced high alphas, semantics from the MI algorithm only. It is shown else- high correlations, and orderly patterns in the data, which where that a combination of semantic algorithms will the randomly generated semantic values failed to produce have incremental explanatory power (Arnulf & Larsen, even if the other steps of the algorithm were identical in 2015; Arnulf et al., 2014). Also, other computational both sets of simulated responses. An alarming finding in methods have been shown to produce similar results and our data is that the semantic structure seems to produce bet- could possibly be combined with what we do here (Gefen ter alphas and factor structures, possibly leading research- & Larsen, 2017; Nimon et al., 2016). More advanced ers to lean toward semantics in scale constructions to combinations of semantic values in the model may allow comply with current guidelines for fit indices (Hu & more precise replications of real responses. Bentler, 1999). 
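For readers who want to experiment with the approach, the final Study 2 scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the assumed 1-5 response range, and the clipping step are our additions, while the constant (1.342) and the 6 − x reversal for negatively related item pairs are taken from the formulas reported above.

```python
def predict_item_b(value_a: float, mi: float,
                   negatively_related: bool = False,
                   constant: float = 1.342) -> float:
    """Predict the response to Item B from a known response to Item A.

    Mirrors the Study 2 rule:
      positive pair:  Value(B) = Value(A) + (MI(A, B) + constant)
      negative pair:  Value(B) = 6 - (Value(A) + (MI(A, B) + constant))
    The result is clipped to an assumed 1-5 Likert range.
    """
    predicted = value_a + (mi + constant)
    if negatively_related:
        predicted = 6.0 - predicted
    # Clipping is our assumption; raw predictions can leave the scale range.
    return max(1.0, min(5.0, predicted))
```

For example, a response of 3 on Item A with a semantic similarity of .5 yields a prediction of about 4.84 for a positively related Item B, and about 1.16 when the pair is negatively related.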
In STSR, survey responses may be seen more as an expression of coherent beliefs than a series of quantitative responses. The initial responses signal the endorsement of opinions. These could have been semantically explicit specifications of the response alternatives as in, for example, Guttman scales (Michell, 1994). "Response strength" may be seen as a signal carrier for the semantic anchor of the respondent's interpretation of the items.

In this regard, it is important to distinguish between survey responses as an individual expression and survey responses as input to aggregated sample statistics. STSR cannot predict the initial response level of a given respondent a priori, the "theta" in item response theory (Singh, 2004). What the theory predicts is that once the individual's level is set, the patterns (or values) of the remaining items are influenced or even determined by their semantic structure. Their values are not free to vary because they share overlapping meaning, and therefore share the same subjective evaluation. Thus, it will be the semantically determined patterns that carry over into the sample statistics, not so much the attitude strength (Arnulf, Larsen, Martinsen, & Egeland, 2018).

Sample statistics—the bulk of the correlations in the MLQ—may therefore be determined by semantic relationships that are void of attitude strength. This allows a precise prediction of the "outcome" scales by semantics, as demonstrated above and theoretically predicted by others (Van Knippenberg & Sitkin, 2013).

Taken together, our preliminary outline of a simulation procedure indicates how simulating semantically expected scores is possible. Subsequently, this may allow us to explore how to depart from what is semantically expected instead of rediscovering semantically predetermined relationships.

STSR does not propose that all survey data come about as a result of semantics. Neither does the theory claim that this model holds across all constructs. STSR simply proposes that whatever the sources of variation in survey data, the semantics implied is the first source to evaluate, often more powerful and systematic than hitherto assumed. By offering a rationale and an outline for experimental research on STSR, we hope future developments can address more detailed questions of the nature and interaction of survey response.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We thank the U.S. National Science Foundation for research support under Grant NSF 0965338 and the National Institutes of Health through Colorado Clinical & Translational Sciences Institute for research support under NIH/CTSI 5 UL1 RR025780.

ORCID iD

Jan Ketil Arnulf https://orcid.org/0000-0002-3798-1477

References

Abdi, H. (2003). Factor rotations in factor analysis. In M. Lewis-Beck, A. Bryman, & T. Futing (Eds.), Encyclopedia of social sciences research methods (pp. 792-795). Thousand Oaks, CA: Sage.
Andrich, D. (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology, 49, 347-365.
Arnulf, J. K., & Larsen, K. R. (2015). Overlapping semantics of leadership and heroism: Expectations of omnipotence, identification with ideal leaders and disappointment in real managers. Scandinavian Psychologist, 2, e3. doi:10.15714/scandpsychol.2.e3
Arnulf, J. K., Larsen, K. R., Martinsen, Ø. L., & Egeland, T. (2018). The failing measurement of attitudes: How semantic determinants of individual survey responses replace measures of attitude strength. Behavior Research Methods, 1-21. doi:10.3758/s13428-017-0999-y
Arnulf, J. K., Larsen, K. R., Martinsen, Ø. L., & Bong, C. H. (2014). Predicting survey responses: How and why semantics shape survey statistics in organizational behavior. PLoS ONE, 9(9), e106361. doi:10.1371/journal.pone.0106361
Avolio, B. J., Bass, B. M., & Jung, D. I. (1995). Multifactor Leadership Questionnaire technical report. Redwood City, CA: Mind Garden.
Bagozzi, R. P. (2011). Measurement and meaning in information systems and organizational research: Methodological and philosophical foundations. MIS Quarterly, 35, 261-292.
Borg, I., & Groenen, P. (2005). Modern multidimensional scaling: Theory and applications (2nd ed.). New York, NY: Springer.
Borsboom, D. (2008). Latent variable theory. Measurement, 6, 25-53.
Borsboom, D. (2009). Educational measurement: Book review. Structural Equation Modeling, 16, 702-711. doi:10.1080/10705510903206097
Coombs, C. H. (1964). A theory of data. New York, NY: Wiley.
Coombs, C. H., & Kao, R. C. (1960). On a connection between factor analysis and multidimensional unfolding. Psychometrika, 25, 219-231.
Feldman, J. M., & Lynch, J. G. J. (1988). Self-generated validity and other effects of measurement on belief, attitude, intention, and behavior. Journal of Applied Psychology, 73, 421-435.
Firmin, M. W. (2010). Commentary: The seminal contribution of Richard LaPiere's attitudes vs actions (1934) research study. International Journal of Epidemiology, 39, 18-20. doi:10.1093/ije/dyp401
Gefen, D., & Larsen, K. R. (2017). Controlling for lexical closeness in survey research: A demonstration on the technology acceptance model. Journal of the Association for Information Systems, 18, 727-757.
Habing, B., Finch, H., & Roberts, J. S. (2005). A Q3 statistic for unfolding item response theory models: Assessment of unidimensionality with two factors and simple structure. Applied Psychological Measurement, 29, 457-471. doi:10.1177/0146621604279550
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 294-316). Newbury Park, CA: Sage.
Kuvaas, B. (2006). Work performance, affective commitment, and work motivation: The roles of pay administration and pay level. Journal of Organizational Behavior, 27, 365-385.
Lamiell, J. T. (2013). Statisticism in personality psychologists' use of trait constructs: What is it? How was it contracted? Is there a cure? New Ideas in Psychology, 31, 65-71. doi:10.1016/j.newideapsych.2011.02.009
LaPiere, R. T. (1934). Attitudes vs. actions. Social Forces, 13, 230-237.
Leacock, C., Miller, G. A., & Chodorow, M. (1998). Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24, 147-165.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
MacKenzie, S. B., Podsakoff, P. M., & Podsakoff, N. P. (2011). Construct measurement and validation procedures in MIS and behavioral research: Integrating new and existing techniques. MIS Quarterly, 35, 293-334.
Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15, 51-69. doi:10.1080/15366367.2017.1348108
Michell, J. (1994). Measuring dimensions of belief by unidimensional unfolding. Journal of Mathematical Psychology, 38, 244-273.
Michell, J. (2013). Constructs, inferences, and mental measurement. New Ideas in Psychology, 31, 13-21. doi:10.1016/j.newideapsych.2011.02.004
Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. American Association for Artificial Intelligence, 6, 775-780.
Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41. doi:10.1145/219717.219748
Mohler, M., & Mihalcea, R. (2009, March 30-April 3). Text-to-text semantic similarity for automatic short answer grading. Paper presented at the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), Athens, Greece.
Nimon, K., Shuck, B., & Zigarmi, D. (2016). Construct overlap between employee engagement and job satisfaction: A function of semantic equivalence? Journal of Happiness Studies, 17, 1149-1171. doi:10.1007/s10902-015-9636-6
Nunnally, J. C., & Bernstein, I. H. (2010). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
Ortiz de Guinea, A., Titah, R., & Léger, P.-M. (2013). Measure for measure: A two study multi-trait multi-method investigation of construct validity in IS research. Computers in Human Behavior, 29, 833-844.
Podsakoff, P. M., MacKenzie, S. B., & Podsakoff, N. P. (2012). Sources of method bias in social science research and recommendations on how to control it. In S. T. Fiske, D. L. Schacter, & S. E. Taylor (Eds.), Annual review of psychology (Vol. 63, pp. 539-569). Palo Alto, CA: Annual Reviews.
Poli, R., Healy, M., & Kameas, A. (2010). WordNet. In C. Fellbaum (Ed.), Theory and applications of ontology: Computer applications (pp. 231-243). New York, NY: Springer.
Roberts, J. S. (2008). Modified likelihood-based item fit statistics for the generalized graded unfolding model. Applied Psychological Measurement, 32, 407-423. doi:10.1177/0146621607301278
Roysamb, E., & Strype, J. (2002). Optimism and pessimism: Underlying structure and dimensionality. Journal of Social & Clinical Psychology, 21, 1-19.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
Schriesheim, C. A., Wu, J. B., & Scandura, T. A. (2009). A meso measure? Examination of the levels of analysis of the Multifactor Leadership Questionnaire (MLQ). The Leadership Quarterly, 20, 604-616. doi:10.1016/j.leaqua.2009.04.005
Schwarz, N. (1999). Self-reports: How the questions shape the answers. American Psychologist, 54, 93-105.
Semin, G. (1989). The contribution of linguistic factors to attribute inferences and semantic similarity judgements. European Journal of Social Psychology, 19, 85-100.
Shore, L. M., Tetrick, L. E., Lynch, P., & Barksdale, K. (2006). Social and economic exchange: Construct development and validation. Journal of Applied Social Psychology, 36, 837-867.
Singh, J. (2004). Tackling measurement problems with item response theory: Principles, characteristics, and assessment, with an illustrative example. Journal of Business Research, 57, 184-208. doi:10.1016/s0148-2963(01)00302-2
Slaney, K. L. (2017). Validating psychological constructs: Historical, philosophical, and practical dimensions. London, England: Palgrave Macmillan.
Slaney, K. L., & Racine, T. P. (2013a). Constructing an understanding of constructs. New Ideas in Psychology, 31, 1-3. doi:10.1016/j.newideapsych.2011.02.010
Slaney, K. L., & Racine, T. P. (2013b). What's in a name? Psychology's ever evasive construct. New Ideas in Psychology, 31, 4-12. doi:10.1016/j.newideapsych.2011.02.003
Smedslund, J. (1988). What is measured by a psychological measure? Scandinavian Journal of Psychology, 29, 148-151.
Tejeda, M. J., Scandura, T. A., & Pillai, R. (2001). The MLQ revisited: Psychometric properties and recommendations. The Leadership Quarterly, 12, 31-52. doi:10.1016/S1048-9843(01)00063-7
Van Knippenberg, D., & Sitkin, S. B. (2013). A critical assessment of charismatic-transformational leadership research: Back to the drawing board? The Academy of Management Annals, 7, 1-60. doi:10.1080/19416520.2013.759433
van Schuur, W. H., & Kiers, H. A. L. (1994). Why factor analysis often is the incorrect model for analyzing bipolar concepts, and what models to use instead. Applied Psychological Measurement, 18, 97-110.

Author Biographies

Jan Ketil Arnulf, PhD, is an associate professor at BI Norwegian Business School, teaching and researching leadership and leadership development. He has served as an associate dean to the BI-Fudan MBA program in Shanghai, China.

Kai R. Larsen, PhD, is an associate professor of management and entrepreneurship at Leeds Business School, University of Colorado at Boulder. He serves as the director of the federally supported Human Behavior Project, researching a transdisciplinary "backbone" for theoretical research. He teaches business intelligence and privacy in the age of Facebook.

Øyvind L. Martinsen, PhD, is a full professor at BI Norwegian Business School in Oslo, Norway. He conducts research in leadership, personality, and creativity, and also teaches these issues as well as psychometrics.
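As a practical footnote, the comparison statistics reported in the studies above (Cronbach's α for reliability and Pearson correlations between simulated and real response patterns) can be reproduced with a short, dependency-free sketch. The formulas are the standard textbook ones; the function names and example data below are our own illustrations, not material from the article.

```python
import math

def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of equally long item-score columns (one list per item).
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(items)
    n = len(items[0])

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_var_sum = sum(var(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - item_var_sum / var(totals))

def pattern_correlation(real, simulated):
    """Pearson correlation between a real and a simulated response pattern."""
    n = len(real)
    mr, ms = sum(real) / n, sum(simulated) / n
    cov = sum((x - mr) * (y - ms) for x, y in zip(real, simulated))
    sr = math.sqrt(sum((x - mr) ** 2 for x in real))
    ss = math.sqrt(sum((y - ms) ** 2 for y in simulated))
    return cov / (sr * ss)
```

Averaging `pattern_correlation` over all respondents gives the kind of summary figure quoted above (e.g., the mean of .56 in Study 2), while `cronbach_alpha` applied to real, semantically simulated, and random datasets reproduces the α comparison.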


SAGE Open

Published: Mar 14, 2018

