

Lexical Processing Strongly Affects Reading Times But Not Skipping During Natural Reading

REPORT

Micha Heilbron (1,2,3), Jorie van Haren (1), Peter Hagoort (1,2), and Floris P. de Lange (1)

(1) Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
(2) Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
(3) University of Amsterdam, Amsterdam, The Netherlands

Keywords: reading, eye movements, prediction, preview, Bayesian reader, neural networks

Citation: Heilbron, M., van Haren, J., Hagoort, P., & de Lange, F. P. (2023). Lexical Processing Strongly Affects Reading Times But Not Skipping During Natural Reading. Open Mind: Discoveries in Cognitive Science, 7, 757–783. https://doi.org/10.1162/opmi_a_00099
DOI: https://doi.org/10.1162/opmi_a_00099
Supplemental Materials: https://doi.org/10.1162/opmi_a_00099
Received: 18 April 2023; Accepted: 27 July 2023
Competing Interests: The authors declare no conflict of interests.
Corresponding Author: Micha Heilbron (m.heilbron@uva.nl)
Copyright: © 2023 Massachusetts Institute of Technology. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

ABSTRACT

In a typical text, readers look much longer at some words than at others, even skipping many altogether. Historically, researchers explained this variation via low-level visual or oculomotor factors, but today it is primarily explained via factors determining a word's lexical processing ease, such as how well word identity can be predicted from context or discerned from parafoveal preview. While the existence of these effects is well established in controlled experiments, the relative importance of prediction, preview and low-level factors in natural reading remains unclear. Here, we address this question in three large naturalistic reading corpora (n = 104, 1.5 million words), using deep neural networks and Bayesian ideal observers to model linguistic prediction and parafoveal preview from moment to moment in natural reading. Strikingly, neither prediction nor preview was important for explaining word skipping—the vast majority of explained variation was explained by a simple oculomotor model, using just fixation position and word length. For reading times, by contrast, we found strong but independent contributions of prediction and preview, with effect sizes matching those from controlled experiments. Together, these results challenge dominant models of eye movements in reading, and instead support alternative models that describe skipping (but not reading times) as largely autonomous from word identification, and mostly determined by low-level oculomotor information.

INTRODUCTION

When reading a text, readers move their eyes across the page to bring new information to the centre of the visual field, where perceptual sensitivity is highest. While it may subjectively feel as if the eyes smoothly slide along the text, they in fact traverse the words with rapid jerky movements called saccades, followed by brief stationary periods called fixations. Across a text, saccades and fixations are highly variable and seemingly erratic: some fixations last less than 100 ms, others more than 400 ms; and while some words are fixated multiple times, many other words are skipped altogether (Dearborn, 1906; Rayner & Pollatsek, 1987). What explains this striking variation?
Historically, researchers have pointed to low-level non-linguistic factors like word length, oculomotor noise, or the relative position where the eyes happen to land (Bouma & de Voogd, 1974; Buswell, 1920; Dearborn, 1906; O'Regan, 1980). Such explanations were motivated by the idea that oculomotor control was largely autonomous. In this view, readers can adjust saccade lengths and fixation durations to global characteristics like text difficulty or reading strategy, but not to subtle word-by-word differences in language processing (Bouma & de Voogd, 1974; Buswell, 1920; Dearborn, 1906; Morton, 1964).

As reading was studied in more detail, however, it became clear that the link between eye movements and cognition was more direct. For instance, it was found that fixation durations were shorter for words with higher frequency (Inhoff, 1984; Rayner, 1977). Eye movements were even shown to depend on how well a word's identity could be inferred before fixation. Specifically, researchers found that words are read faster and skipped more often if they are predictable from linguistic context (Balota et al., 1985; Ehrlich & Rayner, 1981) or if they are identifiable from a parafoveal preview (McConkie & Rayner, 1975; Rayner, 1975; Schotter et al., 2012). These demonstrations of a direct link between eye movements and language processing overturned the autonomous view, replacing it by cognitive accounts describing eye movements during reading as largely, if not entirely, controlled by linguistic processing (Clifton et al., 2016; Reichle et al., 2003).

Today, many studies still build on the powerful techniques, like gaze-contingent displays, that helped overturn the autonomous view, but now ask much more detailed questions: whether word identification is a distributed or sequential process (Kliegl et al., 2006, 2007); how many words can be processed in the parafovea (Rayner et al., 2007); at which level they are analysed (Hohenstein & Kliegl, 2014; Pan et al., 2021); and how this may differ between writing systems or orthographies (Tiffin-Richards & Schroeder, 2015; Yan et al., 2010). Here, we ask a different, perhaps more elemental question: how much of the variation in eye movements do linguistic prediction, parafoveal preview, and non-linguistic factors each explain? That is, how important are these factors for determining how the eyes move during reading?

Dominant, cognitive models explain eye movement variation primarily as a function of lexical processing. Skipping, for instance, is modelled as the probability that a word is identified before fixation (Engbert & Kliegl, 2003; Engbert et al., 2005; Reichle et al., 2003). Some, however, have questioned this purely cognitive view, suggesting that low-level features like word eccentricity or length might be more important (Brysbaert et al., 2005; Reilly & O'Regan, 1998; Vitu et al., 1995). One particularly relevant analysis comes from Brysbaert et al. (2005). Presenting a meta-analysis of aggregate effect sizes on word skipping, they argue that the effect of length and distance is so large that skipping may not just be driven by ongoing word identification, but also—and indeed perhaps primarily—by low-level heuristics that are part of a simple scanning strategy (Brysbaert et al., 2005).
Similarly, one may ask what drives next-word identification: is identifying the next word mostly driven by linguistic predictions (Goodman, 1967) or by parafoveal perception? Remarkably, while it is well established that both linguistic and oculomotor, and both predictive and parafoveal, processing affect eye movements (Brysbaert et al., 2005; Kliegl et al., 2004; Schotter et al., 2012; Staub, 2015), a comprehensive picture of their relative explanatory power is currently missing, perhaps because they are seldom studied all at the same time.

To arrive at such a comprehensive picture we focus on natural reading, analysing three large datasets of participants reading passages, long articles, and even an entire novel—together encompassing 1.5 million (un)fixated words, across 108 individuals (Cop et al., 2017; Kennedy, 2003; Luke & Christianson, 2018). We use a model-based approach: instead of manipulating word predictability or perturbing parafoveal perceptibility, we combine deep neural language modelling (Radford et al., 2019) and Bayesian ideal observer analysis (Duan & Bicknell, 2020) to quantify how much information about next-word identity is conveyed by both prediction and preview, on a moment-by-moment basis. Our model-based analysis is quite different from the experimental approach, especially in the case of parafoveal preview, which is generally studied with a boundary paradigm. However, the underlying logic is the same: in the boundary paradigm, eye movements are compared between conditions in which the preview is informative (valid) and conditions in which it conveys no (or incorrect) information about word identity. We—following Bicknell and Levy (2010) and Duan and Bicknell (2020)—simply replace this categorical contrast with a more continuous analysis, quantifying the subtle word-by-word variation in the amount of information conveyed by the prior preview. In this sense, our approach can be seen as an extension and refinement of the seminal analyses by Brysbaert and colleagues, allowing us, for instance, to quantify not just the effect of word length on skipping, but also, simultaneously, to estimate and control for the effect word length has on a word's prior parafoveal identifiability.

In this way, our word-by-word, information-theoretic analysis brings us closer to the underlying mechanisms than analysing effect sizes in the aggregate. However, we want to stress that we use these models as normative models, to estimate how much information is in principle available from prediction and preview at each moment; we do not take them as processing models of human cognition (see Methods and Discussion for a more extensive comparison of our model-based approach and traditional methods). Such a broad-coverage model-based approach has been applied to predictability effects on reading before (Frank et al., 2013; Goodkind & Bicknell, 2018; Kliegl et al., 2004; Luke & Christianson, 2016; Shain et al., 2022; Smith & Levy, 2013), but either without considering preview or only through coarse heuristics such as using word frequency as a proxy for parafoveal identifiability (Kennedy et al., 2013; Kliegl et al., 2006; Pynte & Kennedy, 2006; but see Duan & Bicknell, 2020). By contrast, we explicitly model both, in addition to low-level explanations like autonomous oculomotor control.
To assess explanatory power, we use set theory to derive the unique and shared variation in eye movements explained by each model. To preview the results, this revealed a striking dissociation between skipping and reading times. For word skipping, the overwhelming majority of explained variation could be explained—mostly uniquely explained—by a non-linguistic oculomotor model, which explained word skipping just as a function of a word's distance to the prior fixation position and its length. These two low-level variables explained much more skipping variation than the degree to which a word was identifiable or predictable prior to fixation. For reading times, by contrast, we did find that factors determining a word's lexical processing explained most variance. In line with dominant models, we found strong effects of both prediction and preview, matching effect sizes from controlled designs. Interestingly, prediction and parafoveal preview seem to operate independently: we found strong evidence against Bayes-optimal integration of the two. Together, these results support and extend the earlier conclusions of Brysbaert and colleagues while challenging dominant cognitive models of reading, showing that skipping (i.e., the decision of where to fixate) and reading times (i.e., how long to fixate) are governed by different principles, and that for word skipping, the link between eye movements and cognition is less direct than commonly thought.

RESULTS

We analysed eye movements from three large datasets of participants reading texts ranging from isolated paragraphs to an entire novel. Specifically, we considered three datasets: Dundee (Kennedy, 2003) (N = 10, 51,502 words per participant), Geco (Cop et al., 2017) (N = 14, 54,364 words per participant) and Provo (Luke & Christianson, 2018) (N = 84, 2,689 words per participant). In each corpus, we analysed both skipping and reading times (indexed by gaze duration), as they are thought to reflect separate processes: the decision of where vs. how long to fixate, respectively (Brysbaert et al., 2005; Reichle et al., 2003). For more descriptive details about the data across participants and datasets, see Methods and Figures A.5–A.7.

To estimate the effect of linguistic prediction and parafoveal preview, we quantified the amount of information conveyed by both factors for each word in the corpus (for preview, this was tailored to each individual participant, since each word was previewed at a different eccentricity by each participant). To this end, we formalised both processes as a probabilistic belief about the identity of the next word, given either the preceding words (prediction) or a noisy parafoveal percept (preview; see Figure 1A). As such, we could describe these disparate cognitive processes using a common information-theoretic currency. To compute the probability distributions, we used GPT-2 for prediction (Radford et al., 2019) and a Bayesian ideal observer for preview (Duan & Bicknell, 2020) (see Figure 1B and Methods). Note that we use both computational models as normative models: tools to estimate how much information is in principle available from linguistic context (prediction) or parafoveal perception (preview) on a moment-by-moment basis. In other words, we use these models much in the same way as we rely on the counting algorithms used to aggregate lexical frequency statistics: in both cases we are interested in the computed statistic (e.g., lexical surprisal or entropy, or lexical frequency), but we do not want to make any cognitive claim about the underlying algorithm that we happened to use to compute this statistic (e.g., GPT-2 for lexical surprisal, or a counting algorithm for lexical frequency). For more details on this choice, and the relation to alternative metrics (e.g., GPT-3 or cloze probabilities), see Methods and Discussion.
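For concreteness, the following minimal sketch illustrates how per-token surprisal and next-word entropy could be computed with GPT-2 via the Hugging Face Transformers package. It is an illustration, not the pipeline used in the paper: it treats each BPE token as a 'word' (the multi-token case is discussed in the Methods) and, for brevity, loads the small GPT-2 model rather than GPT-2 (XL).

```python
# Minimal sketch: per-token surprisal (-log2 p) and predictive entropy from GPT-2.
# Illustrative only; the paper's analysis used GPT-2 (XL) and aggregated token
# probabilities into word probabilities (see Methods).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def prediction_statistics(text):
    """Return (token, surprisal in bits, entropy in bits) for each token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]              # (sequence_length, vocabulary_size)
    probs = torch.softmax(logits, dim=-1)
    stats = []
    for pos in range(1, ids.shape[1]):
        p = probs[pos - 1]                         # predictive distribution for position `pos`
        surprisal = -math.log2(p[ids[0, pos]].item())
        entropy = (torch.special.entr(p).sum() / math.log(2)).item()
        stats.append((tokenizer.decode(ids[0, pos:pos + 1]), surprisal, entropy))
    return stats

for token, s, h in prediction_statistics("The eyes move across the page in rapid saccades."):
    print(f"{token!r:>12}  surprisal = {s:5.2f} bits   entropy = {h:5.2f} bits")
```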
Figure 1. Quantifying two types of context during natural reading. (A) Readers can infer the identity of the next word before fixation either by predicting it from context or by discerning it from the parafovea. Both can be cast as a probabilistic inference about the next word, either given the preceding words (prediction, blue) or given a parafoveal percept (preview, orange). (B) To model prediction, we use GPT-2, one of the most powerful publicly available language models (Radford et al., 2019). For preview, we use an ideal observer (Duan & Bicknell, 2020) based on well-established 'Bayesian Reader' models (Bicknell & Levy, 2010; Norris, 2006, 2009). Importantly, we do not use either model as a cognitive model per se, but rather as a tool to quantify how much information is in principle available from prediction or preview on a moment-by-moment basis.

Prediction and Preview Increase Skipping Rates and Reduce Reading Times

We first asked whether our formalisations allowed us to observe the expected effects of prediction and preview, while statistically controlling for other explanatory variables. This was done by performing a multiple regression analysis and statistically testing whether the coefficients were in the expected direction. Word skipping was modelled with a logistic regression; reading times (gaze durations) were predicted using ordinary least squares regression. Because the decisions of whether to skip and how long to fixate a word are made at different moments, when different types of information are available, we modelled each separately with a different set of explanatory variables. But in both cases, for inference on the coefficients, we considered the full model (variables motivated and detailed below; see Tables A.1 and A.2 for a tabular overview of all variables).

As expected, we found in all datasets that words were more likely to be skipped if there was more information available from the linguistic prediction (bootstrap: Dundee, p = 0.023; Geco, p = 0.034; Provo, p < 10⁻⁵) and/or the parafoveal preview (bootstrap: Dundee, p = 4 × 10⁻⁵; Geco, p < 10⁻⁵; Provo, p < 10⁻⁵). Similarly, reading times were reduced for words that were more predictable (all p's < 3.2 × 10⁻⁴) or more identifiable from the parafovea (all p's < 4 × 10⁻⁵). Together this confirms that our model-based approach can capture the expected effects of both prediction (Clifton et al., 2016) and preview (Schotter et al., 2012) in natural reading, while statistically controlling for other variables. For concreteness, a minimal sketch of this regression setup is given below.
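The sketch uses statsmodels; the feature table and its column names are hypothetical placeholders standing in for the word-by-word variables listed in Tables A.1 and A.2, not the authors' actual pipeline.

```python
# Minimal sketch of the two regression analyses: logistic regression for skipping
# and ordinary least squares for gaze durations. The feature table and its column
# names are hypothetical placeholders, not the authors' actual variable set.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("word_by_word_features.csv")   # hypothetical precomputed feature table

# Skipping: was the upcoming word fixated or not? (decision of *where* to fixate)
skip_fit = smf.logit(
    "skipped ~ prediction_information + preview_information + word_length + eccentricity",
    data=df,
).fit()

# Reading times: gaze duration on fixated words (decision of *how long* to fixate)
rt_fit = smf.ols(
    "gaze_duration ~ surprisal + parafoveal_entropy + log_frequency + word_length"
    " + landing_position",
    data=df[df["skipped"] == 0],
).fit()

# Coefficient signs should match the effects reported above: more prior information
# about a word -> more skipping and shorter gaze durations.
print(skip_fit.params)
print(rt_fit.params)
```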
Word Skipping is Largely Independent of Online Lexical Processing

After confirming that prediction and preview had a statistically significant influence on word skipping and reading times, we went on to assess their relative explanatory power. That is, we asked how important these factors were, by examining how much variance was explained by each. To this end, we grouped the variables from the full regression model into different types of explanations, and assessed how well each type accounted for the data, in terms of the unique and overlapping amount of variation explained by each explanation. This in turn was measured by the cross-validated R² for reading times and R²_McF (McFadden's pseudo-R²) for skipping, which both quantify the proportion of variation explained (see Methods).

For skipping, we considered three explanations. First, a word might be skipped purely because it could be predicted from context—that is, purely as a function of the amount of information about word identity conveyed by the prediction. Secondly, a word might be skipped because its identity could be gleaned from a parafoveal preview—that is, purely as a function of the amount of information about word identity conveyed by the preview. Finally, a word might be skipped simply because it is so short or so close to the prior fixation location that an autonomously generated saccade will likely overshoot it, irrespective of its linguistic properties—in other words, purely as a function of length and eccentricity. Note that we did not include often-used lexical attributes like frequency to predict skipping, because using attributes of word n+1 already presupposes parafoveal identification. Moreover, to the extent that a lexical attribute like frequency might influence a word's parafoveal identifiability, this should already be captured by the parafoveal entropy (see Figure A.3 and Methods for more details).

For each word, we thus modelled the probability of skipping either as a function of prediction, preview, or oculomotor information (i.e., eccentricity and length), or by any combination of the three. Then we partitioned the unique and shared cross-validated variation explained by each account. Strikingly, this revealed that the overwhelming majority of explained skipping variation (94%) could be accounted for by the oculomotor baseline that consisted just of eccentricity and length (Figure 2). Moreover, the majority of the variation was only explained by the baseline, which explained 10 times more unique variation than prediction and preview combined. There was a large degree of overlap between preview and the oculomotor baseline, which is unsurprising since a word's identifiability decreases as a function of its eccentricity and length.

Figure 2. Variation in skipping explained by predictive, parafoveal and autonomous oculomotor processing. (A) Proportions of cross-validated variation explained by prediction (blue), preview (orange), the oculomotor baseline (grey) and their overlap; averaged across datasets (each dataset weighted equally). (B) Variation partitions for each individual dataset, including statistical significance of variation uniquely explained by predictive, parafoveal or oculomotor processing. Stars indicate significance levels of the cross-validated unique variation explained (bootstrap t-test against zero): p < 0.05 (*), p < 0.01 (**), p < 0.001 (***). For results of individual participants, and their consistency, see Figure A.9.
Interestingly, there was even more overlap between the prediction and baseline models: almost all skipping variation that could be explained by contextual constraint could be equally well explained by the oculomotor baseline factors. Importantly, while the contribution of prediction and preview was small, it was significant both for prediction (Dundee: 0.015%, bootstrap 95CI: 0.003–0.029%, bootstrap t-test compared to zero, p = 0.014; Geco: 0.039%, 95CI: 0.018–0.065%, p = 0.0001; Provo: 0.20%, 95CI: 0.14–0.28%, p < 10⁻⁵) and preview (Dundee: 2.14%, 95CI: 1.66–2.60%, p < 10⁻⁵; Geco: 1.71%, 95CI: 1.20–2.29%, p < 10⁻⁵; Provo: 0.56%, 95CI: 0.36–0.79%, p < 10⁻⁵), confirming that both factors do affect skipping. Crucially however, the vast majority of skipping that could be explained by either prediction or preview was equally well explained by the more low-level and computationally frugal oculomotor model—which also explained much more of the skipping data overall. This challenges the idea that word identification is the main driver behind skipping, instead pointing to a more low-level, computationally simpler strategy.

What might this simpler strategy be? One possibility is a 'blind' random walk: generating saccades of some average length, plus oculomotor noise. However, we find that saccades are tailored to word length and exhibit a well-known preferred landing position, slightly left of a word's centre (see Figure A.8; compare McConkie et al., 1988; Rayner, 1979). This suggests the decision of where to look next is not 'blind' but is based on a coarse low-level visual analysis of the parafovea, for instance conveying just the location of the next word 'blob' within a preferred range (i.e., skipping words that are too close or too short; cf. Brysbaert et al., 2005; Deubel et al., 2000; Reilly & O'Regan, 1998). Presumably, such a simple strategy would on average sample visual input conveniently, yielding saccades large enough to read efficiently but small enough for comprehension to keep track. However, if such an 'autopilot' is indeed largely independent of online comprehension, one would expect it to occasionally go out of step, such that a skipped word cannot be recognised or guessed, derailing comprehension. In line with this suggestion, we find evidence for a compensation strategy: the probability that an initially skipped word is subsequently (regressively) fixated is significantly and inversely related to its parafoveal identifiability before skipping (see Figure A.10; logistic regression on prior parafoveal entropy: all β's > 0.15; bootstrap test on coefficients: all p's < 10⁻⁵). Together, this suggests that initial skipping decisions are primarily driven by a low-level oculomotor 'autopilot', which is kept in line with online comprehension by correcting saccades that outrun word recognition (much in line with the suggestions by Brysbaert et al., 2005).

Reading Times are Strongly Modulated by Lexical Processing Difficulty

For reading times (defined as gaze durations, so considering foveal reading time only), we similarly considered three broad explanations. First, a word might be read faster because it was predictable from the preceding context, which we formalised via lexical surprisal. Second, a word might be read faster if it could already be partly identified from the parafoveal preview (before fixation).
This informativeness of the preview was again formalised via the parafoveal preview entropy. Finally, a word might be read faster due to non-contextual attributes of the fixated word itself, such as frequency, word class, or viewing position. This last explanatory factor functioned as a baseline that captured key non-contextual attributes, both linguistic and non-linguistic (see Methods).

In all datasets, we again found that all explanations accounted for some unique variation: prediction (Dundee: 0.80%, bootstrap 95CI: 0.55–1.09%, bootstrap t-test compared to zero: p < 6 × 10⁻⁵; Geco: 0.68%, 95CI: 0.55–0.83%, p = 0.0001; Provo: 0.35%, 95CI: 0.20–0.43%, p < 10⁻⁵), preview (Dundee: 1.91%, 95CI: 1.00–3.14%, p = 0.00012; Geco: 1.59%, 95CI: 0.96–2.30%, p = 5 × 10⁻⁵; Provo: 0.93%, 95CI: 0.70–1.98%, p < 10⁻⁵) and the non-contextual word attributes (Dundee: 8.06%, 95CI: 5.84–10.32%, p = 5 × 10⁻⁵; Geco: 1.99%, 95CI: 1.32–2.81%, p < 10⁻⁵; Provo: 5.38%, 95CI: 4.48–6.83%, p < 10⁻⁵).

The non-contextual baseline explained the most variance, which shows—unsurprisingly—that properties of the fixated word itself are more important than contextual factors in determining how long a word is fixated. Critically however, compared to skipping, the unique contribution of prediction and preview was more than three times higher (see Figure 3). Specifically, while prediction and preview could only uniquely account for 6% of explained word skipping variation, they uniquely accounted for more than 18% of explained variation in reading times. This suggests that while for skipping most explained variation can be accounted for by purely oculomotor variables, this is not the case for reading times.

Figure 3. Variation in reading times explained by predictive, parafoveal and non-contextual information. (A) Grand average of partitions of cross-validated variance in reading times (indexed by gaze durations) across datasets (each dataset weighted equally), explained by non-contextual factors (grey), parafoveal preview (orange), and linguistic prediction (blue). (B) Variance partitions for each individual dataset, including statistical significance of the cross-validated variance explained uniquely by the predictive, parafoveal or non-contextual explanatory variables. Stars indicate significance levels of the cross-validated unique variance explained (bootstrap t-test against zero): p < 0.01 (**), p < 0.001 (***). For results of individual participants, see Figure A.11. Note that the baseline model here contains both lexical attributes (e.g., frequency) and oculomotor factors (relative viewing/landing position). For a direct contrast between lexical processing-based explanations and purely oculomotor explanations, see Figure 4.

However, this comparison (between oculomotor and lexical processing-based accounts) is difficult to make based on Figures 2 and 3 alone. This is because in the reading times analysis, the baseline model contained both oculomotor (i.e., viewing position) and lexical factors (notably lexical frequency). Therefore, we performed an additional analysis, grouping the explanatory variables differently to contrast purely oculomotor explanatory variables versus variables affecting lexical processing ease (such as predictability, parafoveal identifiability, and lexical frequency; see Tables A.3 and A.4). The partitioning logic underlying these comparisons is sketched below.
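The sketch below illustrates the set-theoretic partitioning for the two-group case of Figure 4 (oculomotor vs. lexical processing): each group of predictors is fit alone and both are fit together, and the unique and shared portions of cross-validated explained variation are derived from the differences. The group contents and the scoring metric are simplified assumptions; the paper applies the same logic with three groups, and uses McFadden's pseudo-R² for skipping.

```python
# Minimal sketch of variance partitioning for two groups of predictors. Assumes a
# continuous outcome scored with cross-validated R^2; the skipping analysis would
# instead use a logistic model and McFadden's pseudo-R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_r2(X, y):
    """Mean cross-validated R^2 of a linear model."""
    return cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

def partition(X_oculomotor, X_lexical, y):
    """Unique and shared cross-validated variation explained by each predictor group."""
    r_ocu = cv_r2(X_oculomotor, y)
    r_lex = cv_r2(X_lexical, y)
    r_both = cv_r2(np.hstack([X_oculomotor, X_lexical]), y)
    return {
        "unique_oculomotor": r_both - r_lex,   # explained only by oculomotor variables
        "unique_lexical": r_both - r_ocu,      # explained only by lexical variables
        "shared": r_ocu + r_lex - r_both,      # explained equally well by either group
    }
```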
This shows that for skipping, purely oculomotor explanations can account for much more than a lexical processing-based explanation—but for reading times, it is exactly the other way around (Figure 4). Note that in Figure 4, the oculomotor model for reading times only contains variables quantifying viewing/landing position, because this is the primary oculomotor explanation for reading time differences (O'Regan, 1980, 1992; Vitu et al., 1990). If we also include word length in the oculomotor model for reading times, there is much more overlapping variance explained by the lexical and oculomotor models, presumably due to the correlation between word length and (log-)frequency, which may inflate the importance of the oculomotor account (see Figure A.13). However, even with this potentially inflated estimate, the overall dissociation persists: if we compare the ratios of unique variation explained by oculomotor vs. lexical processing-based models, there is still more than a 30-fold difference between the skipping and reading times analyses (Figure A.13A). Together, this supports that for skipping, most explained variation is captured by purely oculomotor rather than lexical processing-based explanations, whereas for reading times, it is the other way around.

Figure 4. Comparing oculomotor and lexical processing-based explanations for skipping and reading times. Analysis with the same explanatory variables as Figures 2 and 3, grouped differently to directly contrast purely oculomotor explanatory variables and those that affect lexical processing ease (such as predictability, parafoveal identifiability, and lexical frequency; see Methods). Venn diagrams represent the proportions of unique and overlapping explained variation (in R² for reading times, and R²_McF for skipping) for each explanation (grand average across datasets). For partitions for individual datasets with statistics, see Figure A.12; for an alternative partitioning that includes word length in the oculomotor model of reading times, see Figure A.13.

Model-Based Estimates of Naturalistic Prediction and Preview Benefits Match Experimental Effect Sizes

The reading times results confirm that reading times are highly sensitive to factors influencing a word's lexical processing ease, including contextual factors like linguistic and parafoveal context. This is in line with the scientific consensus and decades of experimental research on eye movements in reading (Rayner, 2009). But how well do our model-based, correlational results compare exactly to findings from the experimental literature?

To directly address this question, we quantitatively derived, for each participant, the effect size of two well-established effects that would be expected to be obtained if we were to conduct a well-controlled factorial experiment. While we did not actually perform such a factorial experiment, we can derive this from the regression model, because we quantitatively estimated how much additional information from either prediction or preview (in bits) reduced reading times (in milliseconds). Therefore, the regression analysis allows us to estimate the expected difference in reading times for words that are expected vs. unexpected (predictability benefit; Rayner & Well, 1996; Staub, 2015) or that have a valid vs. invalid preview (i.e., preview benefit; Schotter et al., 2012).
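Because the regression expresses reading time in milliseconds per bit of information, this derivation amounts to multiplying a fitted coefficient by the difference in information between two hypothetical conditions. A minimal sketch with illustrative (made-up) numbers:

```python
# Minimal sketch of deriving a factorial-style effect size from a fitted regression.
# All numbers are illustrative placeholders, not values estimated from the data.
beta_surprisal_ms_per_bit = 3.0    # hypothetical slope: gaze duration increase per bit of surprisal
mean_surprisal_low_cloze = 12.0    # hypothetical mean surprisal of 'low probability' words (bits)
mean_surprisal_high_cloze = 4.0    # hypothetical mean surprisal of 'high probability' words (bits)

predictability_benefit = beta_surprisal_ms_per_bit * (
    mean_surprisal_low_cloze - mean_surprisal_high_cloze
)
print(f"expected predictability benefit: {predictability_benefit:.0f} ms")   # 24 ms in this example

# The preview benefit is derived analogously: the parafoveal-entropy coefficient times the
# difference in preview information between an average preview and no preview at all.
```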
Interestingly, the model-derived effect sizes are very well in line with those observed in experimental studies (see Figure 5). This suggests that our analysis does not strongly underfit or otherwise underestimate the effect of prediction or preview. Moreover, it shows that these effect sizes, which are well established in controlled designs, generalise to natural reading. This last point is especially interesting for the preview benefit, because it implies that the effect can be largely explained in terms of parafoveal lexical identifiability (Pan et al., 2021; Rayner, 2009), and that other factors, such as low-level visual 'preprocessing' or interference between the (invalid) parafoveal percept and the foveal percept, may only play a minor role (cf. Reichle et al., 2003; Schotter et al., 2012).

Figure 5. Model-derived effect sizes match experimentally observed effect sizes. Preview (left) and predictability benefits (right) inferred from our analysis of each dataset, and observed in a sample of studies (see Table A.5). In this analysis, preview benefit was derived from the regression model as the expected difference in gaze duration after a preview of average informativeness versus after no preview at all. Predictability benefit was defined as the difference in gaze duration for high versus low probability words; 'high' and 'low' were defined by subdividing the cloze probabilities from Provo into equal thirds of 'low', 'medium' and 'high' probability (see Methods). In each plot, small dots with dark edges represent either individual subjects within one dataset or individual studies in the sample of the literature; larger dots with error bars represent the mean effect across individuals or studies, plus the bootstrapped 99%CI.

No Integration of Prediction and Preview

So far, we have treated prediction and preview as being independent. However, it might be that these processes, while using different information, are integrated—such that a word is parafoveally more identifiable when it is also more predictable in context. Bayesian probability theory proposes an elegant and mathematically optimal way to integrate these sources of information: the prediction of the next word could be incorporated as a prior in perceptual inference. Such a contextual prior fits into hierarchical Bayesian models of vision (Lee & Mumford, 2003), and has been observed in speech perception, where a contextual prior guides the recognition of words from a partial sequence of phonemes (Brodbeck et al., 2022; Heilbron et al., 2022). Does such a prior also guide word recognition in reading, based on a partial parafoveal percept?

To test this, we recomputed the parafoveal identifiability of each word for each participant, but now with an ideal observer using the prediction from GPT-2 as a prior. As expected, Bayesian integration enhanced perceptual inference: on average, the observer using the linguistic prediction as a prior extracted more information from the preview (±6.25 bits) than the observer not taking the prediction into account (±4.30 bits; T(1.39 × 10⁶) = 1.35 × 10³, p ≈ 0). Interestingly however, it provided a worse fit to the human reading data. This was established by comparing two versions of the full regression model: one with parafoveal entropy from the (theoretically superior) contextual ideal observer and one from the non-contextual ideal observer.
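The prior swap at the heart of this comparison can be sketched in a few lines: the parafoveal evidence (a likelihood over candidate words) is combined with either a frequency-based prior or the GPT-2 next-word distribution, and the informativeness of the preview follows from the entropy of the resulting posterior. The vocabulary and probabilities below are toy placeholders, not model outputs.

```python
# Minimal sketch of combining a prior over word identities with parafoveal evidence
# (Bayes rule in log space) and measuring how much uncertainty remains afterwards.
import numpy as np

def posterior_entropy_bits(log_prior, log_likelihood):
    log_post = log_prior + log_likelihood          # unnormalised log posterior
    log_post -= np.logaddexp.reduce(log_post)      # normalise
    p = np.exp(log_post)
    return float(-(p * np.log2(p)).sum())

vocab = ["cat", "car", "can"]
log_prior_frequency = np.log([0.5, 0.3, 0.2])      # non-contextual prior (word frequency)
log_prior_gpt2 = np.log([0.05, 0.90, 0.05])        # contextual prior (context predicts "car")
log_likelihood = np.log([0.4, 0.4, 0.2])           # parafovea resolves "ca_" but not the last letter

print(posterior_entropy_bits(log_prior_frequency, log_likelihood))   # more residual uncertainty
print(posterior_entropy_bits(log_prior_gpt2, log_likelihood))        # context sharpens the percept
```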
In all datasets, both skipping and reading times were better explained by the model including parafoveal identifiability from the non-contextual observer (skipping: all p's < 10⁻⁵; reading times: all p's < 10⁻⁵; see Figure 6). This replicates Duan and Bicknell (2020), who performed a similar analysis comparing a contextual (5-gram) and a non-contextual prior in natural reading. Our findings extend theirs: Duan and Bicknell (2020) only analysed skipping in the Dundee corpus, whereas our analysis not only investigates additional datasets but also finds exactly the same result for reading times (for which the importance of both prediction and preview is decidedly larger). Together, this suggests that while both linguistic prediction and parafoveal preview influence online reading behaviour, the two sources of information are not integrated, but instead operate independently—highlighting a remarkable sub-optimality in reading.

Figure 6. Evidence against Bayesian integration of linguistic prediction and parafoveal preview. Cross-validated prediction performance of the full reading times (top) and skipping (bottom) models (including all variables), equipped with parafoveal preview information either from the contextual observer or from the non-contextual observer. Dots with connecting lines indicate participants; stars indicate significance: p < 0.001 (***).

DISCUSSION

Eye movements during reading are highly variable. Across three large datasets, we assessed the relative importance of different explanations for this variability. In particular, we quantified the importance of two major contextual determinants of a word's lexical processing difficulty—linguistic prediction and parafoveal preview—and compared such lexical processing-based explanations to alternative (non-linguistic) explanations. This revealed a stark dissociation: for word skipping, a simple low-level oculomotor model (using just word length and distance to the prior fixation location) could account for much more (unique) variation than lexical processing-based explanations, whereas for reading times, it was exactly the other way around. Interestingly, preview effects were best captured by a non-contextual observer, suggesting that while readers use both linguistic prediction and preview, these do not appear to be integrated on-line. Together, the results underscore the dissociation between skipping and reading times, and show that for word skipping, the link between eye movements and cognition is less direct than commonly thought.

Our results on skipping strongly support the earlier findings and theoretical perspective of Brysbaert et al. (2005). They analysed effect sizes from studies on skipping and found a disproportionately large effect of length, compared to proxies of processing difficulty like frequency and predictability. We significantly extend their findings by modelling skipping itself (rather than effect sizes from studies) and by making a direct link to processing mechanisms. For instance, based on their analysis it was unclear how much of the length effect could be attributed to the lower visibility of longer words—that is, how much of the length effect may be an identifiability effect (Brysbaert et al., 2005, p. 19).
We show that length and eccentricity alone explained three times as much variation as parafoveal identifiability—and that most of the variation explained by identifiability was equally well explained by length and eccentricity. This demonstrates that length and eccentricity themselves—not just to the extent that they reduce identifiability—are key drivers of skipping.

This conclusion challenges dominant, cognitive models of eye movements, which describe lexical identification as the primary driver behind skipping (Engbert & Kliegl, 2003; Engbert et al., 2005; Reichle et al., 2003). Importantly, our results do not challenge predictive or parafoveal word identification itself. Rather, they challenge the notion that moment-to-moment decisions of whether to skip individual words are primarily driven by the recognition of those words. Instead, our results suggest a simpler strategy in which a coarse (e.g., dorsal stream) visual representation is used to reflexively select the next saccade target, following the simple heuristic to move forward to the next word 'blob' within a certain range (see also Brysbaert et al., 2005; Deubel et al., 2000; Reilly & O'Regan, 1998).

Given that readers use both prediction and preview, why would they strongly affect reading times but hardly word skipping? We suggest this is because these different decisions—of where versus how long to fixate—are largely independent and are made at different moments (Findlay & Walker, 1999; Hanes & Schall, 1995; Schall & Cohen, 2011). Specifically, the decision of where to fixate—and hence whether to skip the next word—is made early in saccade programming, which can take 100–150 ms (Becker & Jürgens, 1979; Brysbaert et al., 2005; Hanes & Schall, 1995). Although the exact sequence of operations leading to a saccade remains debated, given that readers on average only look at a word for some 250 ms, it is clear that skipping decisions are made under strong time constraints, especially given the lower processing rate of parafoveal information. We suggest that the brain meets this constraint by resorting to a computationally frugal 'move forward' policy. How long to fixate, by contrast, depends on saccade initiation. This process is separate from target selection, as indicated by physiological evidence that variation in target selection time only weakly explains variation in initiation times, which are affected by more factors and can be adjusted later (Findlay & Walker, 1999; Schall & Cohen, 2011). This can allow initiation to be informed by foveal information, which is processed more rapidly and may thus more directly influence the decision to either keep dwelling or execute the saccade.

One simplifying assumption we made is that during natural reading, the relative importance of identification-related processes (like prediction and preview) and oculomotor processing is relatively stable within a single reader. However, it might be that underneath the average, aggregate relative importance we estimated, there is variability between specific moments in a text, or even within a sentence, during which the relative importance might be quite different. One such moment could be sentence transitions, where due to end-of-sentence 'wrap up' effects (Andrews & Veldre, 2021; Just & Carpenter, 1980), the relative importance of, for instance, preview may be reduced.
Here we did not treat sentence transitions (or other such moments) as special, but looking into the possibility of moment-to-moment variability in relative importance is an interesting avenue for future research.

A distinctive feature of our analysis is that we focus on a few sets of computationally explicit variables, each forming a coherent explanation, and quantify the shared and unique variation accounted for by each explanation. The advantage of this approach is interpretability. However, a limitation of the partitioning analysis is that it is not always possible to add all potentially statistically significant variables (or interactions) to the regression, because partitioning requires that each variable can be assigned to a single explanation. When this is not possible (e.g., when a variable is (indirectly) associated with multiple explanations), this requires making a decision: either the variable is omitted, and the regression may not capture all explainable variance; or the variable is assigned to just one explanation, which may distort the results by inflating the importance of that explanation.

A primary example of a variable requiring such a decision is (log-)frequency in the context of skipping. Frequency is sometimes used to predict skipping, as a proxy for a word's parafoveal identifiability. However, this relationship is indirect, and (log-)frequency is also—and much more strongly—correlated with length, and is hence also associated with the oculomotor explanation. Therefore, if one uses frequency as a proxy for parafoveal identifiability, one may find apparent preview effects which are in fact length effects, and strongly overestimate preview importance (Brysbaert & Drieghe, 2003). To avoid such overestimates, especially because the effect of frequency on identifiability should already be captured by the ideal observer (see Figure A.3 and Methods), we did not include frequency in our skipping analysis, nor did we include any other attribute that is sometimes used as a 'proxy' for either prediction/constraint or preview. A conceptually related problem is posed by interactions between variables from different explanations, such as between prediction/preview entropy and oculomotor predictors. These are impossible to assign to a single explanation, and were hence excluded from the regression.

As a result, the regression model did not include some variables or interactions that were used by prior regression analyses of skipping or reading times. This means that our regression may leave some explainable variation unexplained, and that our importance estimates are specific to the variables we consider, and to our modelling thereof. However, this is a limitation that we believe trades off favourably against the advantages afforded by the analysis.
In particular, for both skipping and reading times: (1) we included all the factors deemed most important by prior regression-based studies (e.g., Duan & Bicknell, 2020; Hahn & Keller, 2023; Kliegl et al., 2006); (2) the amount of overall (cross-validated) explained variation is in line with prior regression-based analyses (e.g., Duan & Bicknell, 2020; Kliegl et al., 2006); and (3) our model-based effect sizes of prediction and preview effects are well in line with those from the experimental literature, suggesting that our modelling of prediction and preview does not fail to capture major aspects of either (Figure 5). In sum, we therefore do not believe that our selective and computationally explicit regression analysis significantly underestimates major factors of importance, and we are optimistic that our analysis yielded the comprehensive, interpretable picture that we aimed for.

To quantify predictability (surprisal) and constraint (lexical entropy) we used a neural language model (GPT-2), instead of the more standard cloze procedure. The reason for this is that we are interested in natural texts, where many words have relatively low predictability values (e.g., below p = 0.01), which are inherently difficult to estimate in a cloze task. Since the effect of word predictability is logarithmic (Shain et al., 2022; Smith & Levy, 2013), the differences between small probabilities (e.g., between p = 0.001 and p = 0.0001) can have non-negligible effects, which is why for natural texts language models are superior to cloze metrics for capturing predictability effects. Since the Provo corpus includes cloze probabilities for every word, we could confirm this empirically, finding that model-derived surprisal indeed predicts reading times much better (Figure A.1).

We used this specific language model (GPT-2) simply because it was among the best publicly available ones, and prior work demonstrates that better language models (measured in perplexity) also predict human reading behaviour better (Goodkind & Bicknell, 2018; Wilcox et al., 2020). This raises the question whether an even better model (e.g., GPT-3, GPT-4, GPT-5, etc.) could predict human behaviour even better, and whether this might change the results. However, we do not believe this is likely. First, compared to the increases in quality (decreases in perplexity) from n-grams to GPT, further model improvements will be very subtle when quantified in the aggregate, and since reading behaviour is itself a noisy metric it is not obvious whether such improvements would have a measurable impact. Second, one recent study even suggested that models larger than GPT-2 (GPT-J and GPT-3) predicted reading slightly worse, perhaps due to super-human memorisation capacities (Shain et al., 2022). In short, we used GPT-2 simply because it is a strong measure of lexical predictability (in English). Our analyses do not depend on GPT-2 specifically, in the same way that we do not believe the results would change had we used different (but similar quality) lexical frequency statistics.

One apparent complication is that the skipping and reading times analyses use different metrics for explained variation (R² and R²_McF). This is due to the difference between continuous and discrete dependent variables. As a result, directly numerically comparing the two (e.g., interpreting 4% R²_McF as 'less' than 5% R²) is difficult.
However, our comparisons between skipping and reading times are not based on such absolute, numerical comparisons. Instead, the conclusions only rely on comparing the relative importance of different explanations; in other words, on comparing the relative size and overlap of the Venn diagrams in Figures 2, 3 and 4 (and hence only directly comparing quantities of the same metric).

If one does look at absolute numerical values across Figures 2, 3 and 6, the R² values of the reading times regression may seem rather small. This could indicate a poor fit, which would potentially undermine our claim that reading times are to a large degree explained by cognitive factors. However, we do not believe this is the case, since our R²'s for gaze durations are not lower than the R²'s reported by other regression analyses of natural reading (e.g., Kliegl et al., 2006), and because we find effect sizes in line with the experimental literature (Figure 5). Therefore, we do not believe we overfit or underfit gaze durations. Instead, what the relatively low R² values indicate, we suggest, is that gaze durations are inherently noisy: only a limited amount of the variation is systematic variation. While this noisiness might be interesting in itself (e.g., reflecting an autonomous timer; Engbert et al., 2005), it is not of interest in this study, which focusses on systematic variation, and hence only on the relative importance of different explanations, not on absolute R² values.

A final notable finding is that preview was best explained by a non-contextual observer. This replicates the only other study that compared contextual and non-contextual models of preview (Duan & Bicknell, 2020). That study focussed on skipping; the fact that we obtain the same result for reading times and in different datasets strengthens the conclusion that context does not inform preview. This is also in line with a number of studies on skipping suggesting no or only a weak effect of contextual fit (Angele et al., 2014; Angele & Rayner, 2013; Hahn & Keller, 2023). However, it contradicts a possibly larger range of experimental studies on preview more broadly, which do find interactions between contextual constraint/prediction and preview (e.g., Balota et al., 1985; McClelland & O'Regan, 1981; Schotter et al., 2015; Veldre & Andrews, 2018). One explanation for this discrepancy stems from how the effect is measured. Experimental studies looked at the effect of context on the difference in reading time after valid versus invalid preview (Schotter et al., 2015; Veldre & Andrews, 2018). This may reveal a context effect not on recognition, but at a later stage (e.g., priming between context, preview and the foveal word). Arguably, these yield different predictions: if context affects recognition, it may allow identification of otherwise unidentifiable words; but if the interaction occurs later, it may only amplify processing of recognisable words. Constructing a model that formally reconciles this discrepancy is an interesting challenge for future work.

Given that readers use both prediction and preview, why doesn't contextual prediction inform preview? One explanation stems from time constraints imposed by eye movements.
Given that readers on average only look at a word for some 250 ms, in which they have to recognise the foveal word and process the parafoveal percept, this perhaps leaves too little time to fully let the foveal word and context inform parafoveal preview. On the other hand, word recognition based on partial input also occurs in speech perception under significant time constraints. But despite those constraints, sentence context does influence auditory word recognition (McClelland & Elman, 1986; Zwitserlood, 1989), a process best modelled by a contextual prior (i.e., the opposite of what we find here; Brodbeck et al., 2022; Heilbron et al., 2022). Therefore, rather than being related to time constraints per se, it might also be related to the underlying circuitry. More precisely, it may reflect the fact that, contrary to auditory word recognition, visual word recognition is a laboriously acquired skill that occurs in areas of the visual system that are repurposed (not evolved) for reading (Dehaene, 2009; Yeatman & White, 2021). Therefore, global sentence context might be able to dynamically influence the recognition of speech sounds in temporal cortex, but not that of words in visual cortex; there, context effects might be confined to simpler, more local context, like lexical context effects on letter perception (Heilbron et al., 2020; Reicher, 1969; Wheeler, 1970; Woolnough et al., 2021).

In conclusion, we have found that two important contextual sources of information about next-word identity in reading, linguistic prediction and parafoveal preview, strongly drive variation in reading times, but hardly affect word skipping, which is largely based on low-level factors. Our results show that as readers, we do not always use all information available to us; and that we are, in a sense, of two minds: consulting complex inferences to decide how long to look at a word, while employing semi-mindless scanning routines to decide where to look next. It is striking that these disparate strategies operate mostly in harmony. Only occasionally do they go out of step—then we notice that our eyes have moved too far and we have to look back, back to where our eyes left cognition behind.

METHODS

We analysed eye-tracking data from three big naturalistic reading corpora, in which native English speakers read texts while eye movement data was recorded (Cop et al., 2017; Kennedy, 2003; Luke & Christianson, 2016).

Stimulus Materials

We considered the English-native portions of the Dundee, Geco and Provo corpora. The Dundee corpus comprises eye movements from 10 native speakers from the UK (Kennedy, 2003), who read a total of 56,212 words across 20 long articles from The Independent newspaper. Secondly, the English portion of the Ghent Eye-tracking Corpus (Geco; Cop et al., 2017) is a collection of eye movement data from 14 UK English speakers who each read Agatha Christie's The Mysterious Affair at Styles in full (54,364 words per participant). Lastly, the Provo corpus (Luke & Christianson, 2018) is a collection of eye movement data from 84 US English speakers, who each read a total of 55 paragraphs (extracted from diverse sources), for a total of 2,689 words.

Eye Tracking Apparatus and Procedure

In all datasets, eye movements were recorded monocularly, by recording the right eye.
In Geco and Provo, recordings were made using an EyeLink 1000 (SR Research, Canada) with a spatial resolution of 0.01° and a temporal resolution of 1000 Hz. For Dundee, a Dr. Bouis oculometer (Dr. Bouis, Karlsruhe, Germany), with a spatial resolution of <0.1° and a temporal resolution of 1000 Hz, was used. To minimize head movement, participants' heads were stabilised with a chinrest (Geco, Provo) or a bite bar (Dundee). In each experiment, texts were presented in 'screens' with either five lines (Dundee) or one paragraph per screen (Geco and Provo), presented using a font size of 0.33° per character. Each screen began with a fixation mark (gaze trigger) that was replaced by the initial word when stable fixation was achieved. In all datasets, a 9-point calibration was performed prior to the recording. In the longer experiments, a recalibration was performed every three screens (Dundee), or either every 10 minutes or whenever the drift correction exceeded 0.5° (Geco). For Dundee and Provo, the order of the different texts was randomized across participants. In Geco, the entire novel was read from start to finish, with breaks between each chapter during which participants answered comprehension questions.

For each corpus, the x, y-values per fixation position were converted into a word-by-word format. In Dundee, raw x, y-values were smoothed by rounding to single-character precision. In Geco and Provo, raw x, y-values for each within-word or within-letter fixation were preserved and available for each word. Across the three datasets we redefined the bounding boxes around each word, such that they subtended the area between the first and the last character of the word, with the boundary set halfway to the neighbouring character (e.g., halfway into the space before and after the word). Punctuation before or after the word was left out, and words for which the bounding box was inconsistently defined were ignored. For distributions of saccade and fixation data, see Figures A.5–A.7.

Language Model

Contextual predictions were formalised using a language model—a model computing the probability of each word given the preceding words. Here, we used GPT-2 (XL), currently among the best publicly released English language models. GPT-2 is a transformer-based model that, in a single pass, turns a sequence of tokens U = (u_1, …, u_k) into a sequence of conditional probabilities, (p(u_1), p(u_2 | u_1), …, p(u_k | u_1, …, u_{k−1})).

Roughly, this happens in three steps: first, an embedding encodes the sequence of symbolic tokens as a sequence of vectors, which form the first hidden state h_0. Then, a stack of n transformer blocks each applies a series of operations, resulting in a new set of hidden states h_l for each block l. Finally, a (log-)softmax layer is applied to compute (log-)probabilities over target tokens. In other words, the model can be summarised as follows:

h_0 = U W_e + W_p    (1)
h_l = transformer_block(h_{l−1})   ∀ l ∈ [1, n]    (2)
P(u) = softmax(h_n W_e^T)    (3)

where W_e is the token embedding and W_p is the position embedding.

The key component of the transformer block is masked multi-headed self-attention. This transforms a sequence of input vectors (x_1, x_2, …, x_k) into a sequence of output vectors (y_1, y_2, …, y_k). Fundamentally, each output vector y_i is simply a weighted average of the input vectors: y_i = Σ_{j=1}^{k} w_{ij} x_j.
One complication of deriving word probabilities from GPT-2 is that it doesn't operate on words but on tokens. Tokens can be whole words (as with most common words) or sub-words. To derive word probabilities, we take the token probability for a single-token word, and the joint probability over tokens for words spanning multiple tokens, as is standard practice in psycholinguistics (Pimentel et al., 2022; Shain et al., 2022; Wilcox et al., 2020). However, because GPT-2 marks word boundaries (i.e., spaces) at the beginning of a token, the 'end of word' decision is technically made at the next token. Defining word probabilities via (joint) constituent token probabilities does not take this into account, and will therefore on average slightly over-estimate word probabilities (i.e., underestimate surprisal). This slight bias is unlikely to affect any of our conclusions, since our regression analyses primarily depend on differences between the relative predictabilities of different words, and since GPT-2, used in this way, has been shown to yield probabilities that predict reading behaviour very well compared to other language models that do use whole-word tokenisation (Wilcox et al., 2020), and even compared to larger models like GPT-3 (Shain et al., 2022).

We chose GPT-2 because it is a high-quality language model (measured in perplexity on English texts), and better language models generally predict reading behaviour better (Goodkind & Bicknell, 2018; Wilcox et al., 2020). Our analysis does not depend on GPT-2 specifically: it could be swapped for any similarly high-quality language model, much in the same way as we do not believe our results to be specific to the exact lexical frequency estimates we used (see Discussion).
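To illustrate how word-level surprisal can be derived from a sub-word model, the sketch below sums token log-probabilities over a word's constituent tokens using the Transformers package. It is a simplified stand-in for the actual pipeline (which used GPT-2 XL and full passage context); the function name and example strings are our own.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")      # "gpt2-xl" in the actual analyses
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisal(context: str, word: str) -> float:
    """Surprisal (in bits) of `word` given `context`, summing log-probs over its sub-word tokens."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word).input_ids              # leading space marks the word boundary
    log_p = 0.0
    for tok in word_ids:
        with torch.no_grad():
            logits = model(ids).logits[0, -1]                # distribution over the next token
        log_p += torch.log_softmax(logits, dim=-1)[tok].item()
        ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)  # condition on the token just scored
    return -log_p / torch.log(torch.tensor(2.0)).item()      # convert nats to bits

print(word_surprisal("The mysterious affair at", "Styles"))
```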
Ideal Observer

To compute parafoveal identifiability, we implemented an ideal observer based on the formalism of Duan and Bicknell (2020). This model formalises parafoveal word identification using Bayesian inference and builds on previous well-established 'Bayesian Reader' models (Bicknell & Levy, 2010; Norris, 2006, 2009). It computes the probability of the next word given a noisy percept by combining a prior over possible words with a likelihood of the noisy percept given a word identity:

$$p(w \mid I) \propto p(w)\, p(I \mid w), \qquad (5)$$

where I represents the noisy visual input and w represents a word identity. We considered two priors (see Figure 6): a non-contextual prior (the overall probability of words in English, based on their frequency in Subtlex; Brysbaert & New, 2009), and a contextual prior based on GPT-2 (see below). Below we describe how visual information is represented and how perceptual inference is performed. For a graphical schematic of the model, see Figure A.2; for illustrative simulations showing how the model captures key effects of linguistic and visual characteristics on word recognition, see Figure A.3.

Sampling Visual Information. As in other Bayesian Readers (Bicknell & Levy, 2010; Norris, 2006, 2009), noisy visual input is accumulated by sampling from a multivariate Gaussian centred on a one-hot 'true' letter vector (here represented in an uncased 26-dimensional encoding), with a diagonal covariance matrix $\Sigma(\varepsilon) = \lambda(\varepsilon)^{-1/2} I$. The shape of $\Sigma$ is thus scaled by the sensory quality $\lambda(\varepsilon)$ for a letter at eccentricity $\varepsilon$. Sensory quality is computed as a function of the perceptual span: it follows the perceptual span (processing rate) function from the SWIFT model (Engbert et al., 2005). Specifically, for a letter at eccentricity $\varepsilon$, $\lambda$ is given by the Gaussian integral within the bounding box of the letter:

$$\lambda(\varepsilon) = \int_{\varepsilon - 0.5}^{\varepsilon + 0.5} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right) dx, \qquad (6)$$

which, following Bicknell and Levy (2010) and Duan and Bicknell (2020), is scaled by a scaling factor $\Lambda$. Unlike SWIFT, the Gaussian in Equation 6 is symmetric, since we only perform inference on information about the next word. By using a one-hot encoding and a diagonal covariance matrix, the ideal observer ignores the similarity structure between letters. This is clearly a simplification, but one with significant computational benefits; moreover, it is a simplification shared by all Bayesian Reader-like models (Bicknell & Levy, 2010; Duan & Bicknell, 2020; Norris, 2006), which can nonetheless capture many important aspects of visual word recognition and reading. To determine the parameters $\Lambda$ and $\sigma$, we performed a grid search on a subset of Dundee and Geco (see Figure A.4), resulting in $\Lambda$ = 1 and $\sigma$ = 3. Note that this $\sigma$ value is close to the average $\sigma$ value of SWIFT (3.075) and corresponds well to the prior literature on the size of the perceptual span (±15 characters; Bicknell & Levy, 2010; Engbert et al., 2005; Schotter et al., 2012).
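As an illustration of how sensory quality and a noisy letter sample could be computed under Equation 6, consider the following sketch (our own simplified code, assuming the grid-search values Λ = 1 and σ = 3; not the published implementation).

```python
import numpy as np
from scipy.stats import norm

SIGMA, LAMBDA_SCALE = 3.0, 1.0        # sigma and the scaling factor from the grid search

def sensory_quality(eccentricity: float) -> float:
    """lambda(eps): Gaussian integral over the 1-character-wide slot of a letter (Equation 6)."""
    return LAMBDA_SCALE * (norm.cdf(eccentricity + 0.5, scale=SIGMA)
                           - norm.cdf(eccentricity - 0.5, scale=SIGMA))

def sample_letter_percept(letter: str, eccentricity: float, rng: np.random.Generator):
    """One noisy sample: a 26-dim one-hot letter vector plus Gaussian noise scaled by lambda(eps)."""
    y = np.zeros(26)
    y[ord(letter.lower()) - ord("a")] = 1.0
    cov = sensory_quality(eccentricity) ** -0.5 * np.eye(26)   # diagonal covariance lambda^(-1/2) I
    return rng.multivariate_normal(mean=y, cov=cov)

rng = np.random.default_rng(1)
print(sensory_quality(0.0), sensory_quality(6.0))              # quality falls off with eccentricity
percept = sample_letter_percept("t", eccentricity=2.0, rng=rng)
```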
Perceptual Inference. Inference is performed over the full vocabulary. This is represented as a matrix that can be seen as a stack of word vectors, y_1, y_2, ..., y_V, each obtained by concatenating the letter vectors of a word. The vocabulary is thus a V × d matrix, with V the number of words in the vocabulary and d the dimensionality of the word vectors (determined by the length of the longest word: d = 26 × l_max).

To perform inference, we use the belief-updating scheme from Duan and Bicknell (2020), in which the posterior at sample t is expressed as a (V − 1)-dimensional log-odds vector x^{(t)}, in which each entry x_i^{(t)} represents the log-odds of y_i relative to the final word y_V. In this formulation, the initial value of x_i is simply the prior log-odds, x_i^{(0)} = log p(w_i) − log p(w_V), and updating is done by summing the prior log-odds and the log-odds likelihood. This procedure is repeated for T samples, each time taking the posterior of the previous timestep as the prior for the current timestep. Note that using log-odds in this way avoids renormalisation:

$$
x_i^{(t)} = \log \frac{p\big(w_i \mid I^{(0,\dots,t)}\big)}{p\big(w_V \mid I^{(0,\dots,t)}\big)}
= \log \frac{p\big(w_i \mid I^{(0,\dots,t-1)}\big)\, p\big(I^{(t)} \mid w_i\big)}{p\big(w_V \mid I^{(0,\dots,t-1)}\big)\, p\big(I^{(t)} \mid w_V\big)}
= \log \frac{p\big(w_i \mid I^{(0,\dots,t-1)}\big)}{p\big(w_V \mid I^{(0,\dots,t-1)}\big)} + \log \frac{p\big(I^{(t)} \mid w_i\big)}{p\big(I^{(t)} \mid w_V\big)}
= x_i^{(t-1)} + \Delta x_i^{(t)}. \qquad (7)
$$

In other words, as visual sample I^{(t)} comes in, beliefs are updated by summing the prior log-odds x_i^{(t−1)} and the log-odds likelihood of the new information, Δx_i^{(t)}.

For a given word w_i, the log-odds likelihood of each new sample is the difference between two multivariate Gaussian log-likelihoods, one centred on y_i and one on the last vector y_V. This can be formulated as a linear transformation of I:

$$
\Delta x_i = \log p(I \mid w_i) - \log p(I \mid w_V)
= \log p\big(I \mid \mathcal{N}(y_i, \Sigma)\big) - \log p\big(I \mid \mathcal{N}(y_V, \Sigma)\big)
= -\tfrac{1}{2}(I - y_i)^{T}\Sigma^{-1}(I - y_i) + \tfrac{1}{2}(I - y_V)^{T}\Sigma^{-1}(I - y_V)
= \frac{y_V^{T}\Sigma^{-1}y_V - y_i^{T}\Sigma^{-1}y_i}{2} + (y_i - y_V)^{T}\Sigma^{-1} I, \qquad (8)
$$

which implies that updating can be implemented by sampling from a multivariate normal. To perform inference on a given word, we ran this sampling scheme until convergence (using T = 50), and then transformed the posterior log-odds into the log posterior, from which we computed the Shannon entropy as a metric of parafoveal identifiability.

To compute the parafoveal entropy for each word in the corpus, we make the simplifying assumption that parafoveal preview only occurs during the last fixation prior to a saccade, thus computing the entropy as a function of the word itself and its distance to the last fixation location within the previously fixated word (which is not always the previous word). Because this distance differs between participants, it was computed separately for each word and each participant. Moreover, because the inference scheme is based on sampling, we repeated it 3 times and averaged the results to compute the posterior entropy of the word. The amount of information obtained from the preview is then simply the difference between the prior and posterior entropy. The ideal observer was implemented in custom Python code, which can be found in the data sharing collection (see below).
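A minimal sketch of this belief-updating scheme is given below. It is our own simplified illustration: it assumes isotropic noise with a fixed standard deviation rather than the full eccentricity-dependent covariance, and all function names are hypothetical.

```python
import numpy as np

def parafoveal_posterior(prior_logp, vocab_vecs, true_idx, noise_sd, n_samples=50, rng=None):
    """Accumulate noisy visual samples and return posterior log-probabilities over the vocabulary.

    prior_logp : (V,) log prior over word identities (frequency- or GPT-2-based)
    vocab_vecs : (V, d) concatenated one-hot letter vectors, one row per word
    true_idx   : index of the word actually displayed in the parafovea
    """
    rng = rng or np.random.default_rng()
    x = prior_logp[:-1] - prior_logp[-1]          # log-odds relative to the last word (Equation 7)
    for _ in range(n_samples):
        I = rng.normal(vocab_vecs[true_idx], noise_sd)                 # one noisy sample
        ll = -0.5 * ((I - vocab_vecs) ** 2).sum(axis=1) / noise_sd**2  # Gaussian log-likelihoods
        x = x + (ll[:-1] - ll[-1])                                     # Equation 8, summed log-odds
    logpost = np.concatenate([x, [0.0]])
    logpost -= np.logaddexp.reduce(logpost)       # normalise back to a log posterior
    return logpost

def entropy_bits(logp):
    """Shannon entropy of a log-probability vector, in bits."""
    p = np.exp(logp)
    return -(p * logp).sum() / np.log(2)
```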
Contextual vs. Non-Contextual Prior

We considered two observers: one with a non-contextual prior, capturing the overall probability of a word in the language, and one with a contextual prior, capturing the probability of a word in a specific context. For the non-contextual prior, we simply used lexical frequencies, from which we computed the (log-)odds prior used in Equation 7. For the contextual prior, we derived the prior from the log-probabilities of GPT-2. This effectively involves constructing a new Bayesian model for each word, for each participant, in each dataset. To simplify this process, we did not take the full predicted distribution of GPT-2, but only the 'nucleus' of the top k predicted words with a cumulative probability of 0.95, truncating the (less reliable) tail of the distribution. Further, we simply assumed that the rest of the tail was 'flat', with a uniform probability. Since the prior odds can be derived from relative frequencies, we can think of the probabilities in the flat tail as having a 'pseudocount' of 1. If we similarly express the prior probabilities in the nucleus as implied 'pseudofrequencies', the cumulative implied nucleus frequency is then complementary to the length of the tail, which is simply the difference between the vocabulary size and the nucleus size (V − k). As such, for word i in the text, we can express the nucleus probabilities as implied frequencies as follows:

$$\mathrm{freqs}_{\psi} = \frac{V - k}{1 - \sum_{j=1}^{k} P\big(w^{(i)} = w_j \mid \mathrm{context}\big)}\; P_{tr}\big(w^{(i)} \mid \mathrm{context}\big), \qquad (9)$$

where P_tr(w^{(i)} | context) is the truncated lexical prediction, and P(w^{(i)} = w_j | context) is the predicted probability that word i in the text is word j in the sorted vocabulary. Note that using this flat tail not only simplifies the computation, but also deals with the fact that the vocabulary of GPT-2 is smaller than that of the ideal observer: using the tail, we can still use the full vocabulary (e.g., to capture orthographic uniqueness effects), while using 95% of the density from GPT-2.
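The sketch below shows one way such a pseudo-frequency prior could be constructed from a language-model next-word distribution, following Equation 9. It is our own illustration; the variable names are hypothetical and the actual implementation may differ in detail.

```python
def contextual_pseudofrequencies(word_probs: dict, vocab: list, nucleus_mass: float = 0.95) -> dict:
    """Map a next-word distribution onto a pseudo-frequency prior over the observer's vocabulary.

    word_probs : {word: P(word | context)} from the language model
    vocab      : the ideal observer's full vocabulary (may be larger than the model's)
    """
    V = len(vocab)
    # nucleus: most probable words up to a cumulative probability of `nucleus_mass`
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for w, p in ranked:
        nucleus.append((w, p))
        cum += p
        if cum >= nucleus_mass:
            break
    k = len(nucleus)
    scale = (V - k) / (1.0 - cum)             # Equation 9: implied frequency per unit probability
    freqs = {w: 1.0 for w in vocab}           # flat tail: pseudocount of 1 per word
    freqs.update({w: p * scale for w, p in nucleus if w in freqs})
    total = sum(freqs.values())
    return {w: f / total for w, f in freqs.items()}   # normalised prior probabilities
```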
Data Selection

In our analyses, we focus on first-pass reading (i.e., progressive eye movements), analysing only those fixations or skips for which none of the subsequent words had been fixated before. Moreover, we excluded return sweeps (i.e., line transitions), which are very different from within-line saccades. We extensively preprocessed the corpora so that we could include as many words as possible, but we did have to impose some additional restrictions. Specifically, we did not include words if (a) they contained non-alphabetic characters; (b) they were adjacent to blinks; or (c) the distance to the prior fixation location was more than 24 characters (about 8°); moreover, for the gaze duration analysis we excluded (d) words with implausibly short (<70 ms) or long (>900 ms) gaze durations. Criterion (c) was chosen because some participants occasionally skipped long sequences of words, up to entire lines or more. Such 'skipping' (indicated by saccades much larger than the perceptual span) is clearly different from the skipping of words during normal reading, and was therefore excluded. Note that these criteria are comparatively mild (cf. Duan & Bicknell, 2020; Smith & Levy, 2013), and leave approximately 1.1 million observations for the skipping analysis and 593,000 observations for the reading time analysis.

Regression Models: Skipping

Skipping was modelled via logistic regression in scikit-learn (Pedregosa et al., 2011), with three sets of explanatory variables (or 'models'), each formalising a different explanation for why a word might be skipped.

First, a word might be skipped because it could be confidently predicted from context. We formalised this via linguistic entropy, quantifying the information conveyed by the prediction from GPT-2. We used entropy, not (log-)probability, because using the next word's probability directly would presuppose that the word is identified, undermining the dissociation of prediction and preview. By contrast, prior entropy specifically probes the information available from prediction only.

Secondly, a word might be skipped because it could be identified from a parafoveal preview. This was formalised via parafoveal entropy, which quantifies the uncertainty of the parafoveal preview (or, inversely, the amount of information conveyed by the preview). This is a complex function integrating low-level visual information (e.g., decreasing visibility as a function of eccentricity), higher-level information (e.g., frequency or orthographic effects), and their interaction (see Figure A.3). Here, too, we did not use lexical features (e.g., frequency) of the next word to model skipping directly, as this presupposes that the word is identified; and to the extent that these factors are expected to influence identifiability, this is already captured by the parafoveal entropy (Figure A.3).

Finally, a word might be skipped simply because it is too short and/or too close to the prior fixation location, such that a fixation of average length would overshoot the word. This autonomous oculomotor account was formalised by modelling skipping probability purely as a function of a word's length and its distance to the previous fixation location.

Note that these explanations are not mutually exclusive, so we also evaluated their combinations (see below).

Regression Models: Reading Time

As an index of reading time, we analysed first-pass gaze duration: the sum of a word's first-pass fixation durations. We analysed gaze durations because they arguably reflect most comprehensively how long a word is looked at, and because they are the focus of similar model-based analyses of contextual effects in reading (Goodkind & Bicknell, 2018; Smith & Levy, 2013). For reading times, we used linear regression, and again considered three sets of explanatory variables, each formalising a different kind of explanation.

First, a word may be read more slowly because it is unexpected in context. We formalised this using surprisal, −log(p), a metric of a word's unexpectedness, or how much information is conveyed by a word's identity in light of a prior expectation about that identity. To capture spillover (Rayner et al., 2006; Smith & Levy, 2013), we included not just the surprisal of the current word, but also that of the previous two words.

Secondly, a word might be read more slowly because it was difficult to discern from the parafoveal preview. This was formalised using the parafoveal entropy (see above).

Finally, a word might be read more slowly because of non-contextual attributes of the word itself. This is an aggregate baseline explanation, aimed at capturing all relevant non-contextual word attributes, which we contrast with the two major contextual sources of information about a word's identity that might affect reading times (prediction and preview). We included word class, length, log-frequency, and the relative landing position (quantified as the distance to word centre, both as a fraction and in characters). For log-frequency we used the UK or US version of SUBTLEX, depending on the corpus, and included the log-frequency of the past two words to capture spillover effects.

The full model was defined as the union of all three sets of explanatory variables. For a tabular overview of all explanatory variables, see Tables A.1–A.3.
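As a sketch of how these regression models could be set up in scikit-learn, assuming a pandas DataFrame with one row per (participant, word) observation and the hypothetical column names below (this is not the authors' analysis code):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical column names for the three explanatory-variable sets
FEATURE_SETS = {
    "oculomotor": ["word_length", "eccentricity"],
    "prediction": ["linguistic_entropy"],
    "preview":    ["parafoveal_entropy"],
}
FEATURE_SETS["full"] = sum(FEATURE_SETS.values(), [])

def fit_skipping_models(df: pd.DataFrame) -> dict:
    """Fit one logistic regression per explanatory-variable set (outcome: skipped, 0/1)."""
    return {
        name: make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
              .fit(df[cols], df["skipped"])
        for name, cols in FEATURE_SETS.items()
    }

def fit_reading_time_model(df: pd.DataFrame):
    """Gaze durations: ordinary least squares on surprisal, preview and baseline features."""
    cols = ["surprisal", "surprisal_prev1", "surprisal_prev2",
            "parafoveal_entropy", "word_length", "log_frequency", "landing_position"]
    return make_pipeline(StandardScaler(), LinearRegression()).fit(df[cols], df["gaze_duration"])
```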
Model Evaluation

We compared the ability of each model to account for the variation in the data by probing prediction performance in a 10-fold cross-validation scheme, in which we quantified how much of the observed variation in skipping rates and gaze durations could be explained. For reading times, we did this using the coefficient of determination, defined via the ratio of residual and total sum of squares: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. This ratio relates the error of the model ($SS_{res}$) to the error of a 'null' model predicting just the mean ($SS_{tot}$), and gives the proportion of variance explained. For skipping, we used a closely related metric, the McFadden $R^2$. Like the $R^2$, it is computed by comparing the error of the model to the error of a null model with only an intercept: $R^2_{McF} = 1 - \frac{L}{L_{null}}$, where L indicates the (log-)loss.

While $R^2$ and $R^2_{McF}$ are not identical, they are formally tightly related: critically, both are zero when the prediction is constant (no variation explained), and both go towards one proportionally as the error decreases to zero (i.e., towards all variation explained). Note that in a cross-validated setting, both metrics can become negative when the prediction of the model is worse than the prediction of a constant null model.

Variation Partitioning

To assess relative importance, we used variation partitioning to estimate how much of the explained variation could be attributed to each set of explanatory variables. This technique is also known as variance partitioning, as it was originally based on partitioning sums of squares; here we use the more general term 'variation', following Legendre (2008).

Variation partitioning builds on the insight that when two (groups of) explanatory variables, A and B, both explain some variation in the data y, and A and B are independent, the variation explained by combining A and B will be approximately additive. By contrast, when A and B are fully redundant (e.g., when B only has an apparent effect on y through its correlation with A), a model combining A and B will not explain more than either of the two alone. Following de Heer et al. (2017), we generalise this logic to up to three (sets of) explanatory variables, by testing each individually and in all combinations, and we use set-theory notation and graphical representation for simplicity and clarity.

A two-way partition of two sets of explanatory variables A and B (Figures 4, A.12, A.13) involves fitting three models: two partial models with each feature set alone (A and B), and a joint model with both (A ∪ B). The variation uniquely explained by A (denoted A*) or B (B*) is derived via the difference between the partial models and the joint model:

$$A^{*} = A \setminus B = A \cup B - B$$
$$B^{*} = B \setminus A = A \cup B - A \qquad (10)$$

And the intersection is derived from the joint model and the sum of the partial models:

$$A \cap B = A + B - A \cup B \qquad (11)$$

For three groups of explanatory variables (A, B, and C), the situation is a bit more complex. We first evaluate each separately and in all combinations, resulting in 7 models: A; B; C; A ∪ B; A ∪ C; B ∪ C; A ∪ B ∪ C. From these 7 models we obtain 7 'empirical' scores (of variation explained), from which we derive the 7 'theoretical' partitions: 4 overlap partitions and 3 unique partitions. The first overlap partition is the variation explained by all models, which we can derive as:

$$A \cap B \cap C = A \cup B \cup C + A + B + C - A \cup B - A \cup C - B \cup C. \qquad (12)$$

The next three overlap partitions contain all pairwise intersections of models, excluding the third model:

$$(A \cap B) \setminus C = A + B - A \cup B - A \cap B \cap C$$
$$(A \cap C) \setminus B = A + C - A \cup C - A \cap B \cap C \qquad (13)$$
$$(B \cap C) \setminus A = B + C - B \cup C - A \cap B \cap C.$$

The last three partitions are those explained exclusively by each model. These are the relative complements: the partition unique to A is the relative complement of B ∪ C, which for simplicity we also denote with a star (A*). They are derived as follows:

$$A^{*} = A \cup B \cup C - B \cup C$$
$$B^{*} = A \cup B \cup C - A \cup C \qquad (14)$$
$$C^{*} = A \cup B \cup C - A \cup B.$$

Note that, in the cross-validated setting, the results can become paradoxical and depart from what is possible in classical statistical theory, such as partitioning of sums of squares. For instance, due to over-fitting, a model that combines multiple explanatory variables could explain less variance than each of the variable sets alone, in which case some partitions would become negative. However, following de Heer et al. (2017), we believe that the advantages of using cross-validation outweigh the risk of potentially paradoxical results in some subjects. Partitioning was carried out for each subject, allowing us to statistically assess whether the additional variation explained by a given model was significant. On average, none of the partitions were paradoxical.
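The bookkeeping of Equations 12–14 can be written compactly; the sketch below is our own illustration, assuming a dictionary that maps each (combination of) explanatory-variable set(s) to its cross-validated variation explained.

```python
def partition_three_way(scores: dict) -> dict:
    """Derive unique and shared partitions from cross-validated scores.

    `scores` maps frozensets of model names to variation explained, e.g.
    scores[frozenset({"oculomotor"})], ..., scores[frozenset({"oculomotor", "prediction", "preview"})].
    """
    A, B, C = "oculomotor", "prediction", "preview"
    s = lambda *names: scores[frozenset(names)]
    parts = {}
    # shared by all three models (Equation 12)
    parts["A∩B∩C"] = (s(A, B, C) + s(A) + s(B) + s(C)
                      - s(A, B) - s(A, C) - s(B, C))
    # pairwise overlaps excluding the third model (Equation 13)
    for x, y, z in [(A, B, C), (A, C, B), (B, C, A)]:
        parts[f"({x}∩{y}) minus {z}"] = s(x) + s(y) - s(x, y) - parts["A∩B∩C"]
    # unique partitions (Equation 14)
    for x, others in [(A, (B, C)), (B, (A, C)), (C, (A, B))]:
        parts[f"{x}*"] = s(A, B, C) - s(*others)
    return parts
```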
Simulating Effect Sizes

Regression-based preview benefits were defined as the expected difference in gaze duration after a preview of average informativeness versus after no preview at all. This best corresponds to an experiment in which the preceding preview was masked (e.g., XXXX) rather than invalid (see Discussion). To compute this, we took the difference between the parafoveal entropy of an average preview and the prior entropy. Because we standardised our explanatory variables, this difference was transformed to subject-specific z-scores and then multiplied by the regression weights to obtain an expected effect size.

For the predictability benefit, we computed the expected difference in gaze duration between 'high' and 'low' probability words. 'High' and 'low' were empirically defined based on the human-normed cloze probabilities in Provo (using the ORTHOMATCHMODEL definition for additional granularity; Luke & Christianson, 2018), which we divided into thirds using percentiles. The resulting cut-off points (low < 0.02; high > 0.25) were log-transformed, applied to the surprisal values from GPT-2, and multiplied by the weights to predict effect sizes. Note that these definitions of 'low' and 'high' may appear low compared to those in the literature; however, most studies collect cloze ratings only for specific 'target' words in relatively predictable contexts, which biases the definition of 'low' vs. 'high' probability. By contrast, we analysed cloze probabilities for all words, yielding these values.

Statistical Testing

Statistical testing was performed across participants within each dataset. Because two of the three corpora had a low number of participants (10 and 14, respectively), we used data-driven, non-analytical bootstrap t-tests, which involve resampling a null distribution with zero mean (by removing the mean) and counting across bootstraps how likely a t-value at least as extreme as the true t-value is to occur. Each test used at least 10⁵ bootstraps; p values were computed without assuming symmetry (equal-tail bootstrap; Rousselet et al., 2019). Confidence intervals (in the figures and text) were also based on bootstrapping.
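The sketch below illustrates such an equal-tail bootstrap t-test of a per-participant effect against zero (our own minimal implementation with made-up example numbers, not the published analysis code).

```python
import numpy as np

def bootstrap_t_test(values, n_boot: int = 100_000, rng=None) -> float:
    """Equal-tail bootstrap t-test of the across-participant mean against zero."""
    rng = rng or np.random.default_rng()
    values = np.asarray(values, dtype=float)
    n = len(values)
    t_obs = values.mean() / (values.std(ddof=1) / np.sqrt(n))
    centred = values - values.mean()                 # enforce the null hypothesis of zero mean
    idx = rng.integers(0, n, size=(n_boot, n))       # resample participants with replacement
    boot = centred[idx]
    t_boot = boot.mean(axis=1) / (boot.std(axis=1, ddof=1) / np.sqrt(n))
    # equal-tail p value: double the smaller one-sided proportion, capped at 1
    p_low = (t_boot <= t_obs).mean()
    p_high = (t_boot >= t_obs).mean()
    return min(1.0, 2 * min(p_low, p_high))

# example: unique variation explained by prediction, one value per participant (made-up numbers)
print(bootstrap_t_test([0.011, 0.009, 0.014, 0.008, 0.012, 0.010]))
```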
ACKNOWLEDGMENTS

We thank Maria Barrett, Yunyan Duan, and Benedikt Ehinger for useful input and inspiring discussions during various stages of this project.

FUNDING INFORMATION

This work was supported by The Netherlands Organisation for Scientific Research (NWO Research Talent grant to M.H.; NWO Vidi 452-13-016 to F.P.d.L.; Gravitation Program Grant Language in Interaction no. 024.001.006 to P.H.) and the European Union Horizon 2020 Program (ERC Starting Grant 678286, "Contextvision", to F.P.d.L.).

AUTHOR CONTRIBUTIONS

Conceptualisation: MH. Data wrangling and preprocessing: JvH. Formal analysis: MH, JvH. Statistical analysis and visualisations: JvH, MH. Supervision: FPdL, PH. Initial draft: MH. Final draft: MH, JvH, PH, FPdL.

DATA AND CODE AVAILABILITY STATEMENT

The Provo and Geco corpora are freely available (Cop et al., 2017; Luke & Christianson, 2018). All additional data and code needed to reproduce the results will be made public on the Donders Repository at https://doi.org/10.34973/kgm8-6z09.

REFERENCES
Andrews, S., & Veldre, A. (2021). Wrapping up sentence comprehension: The role of task demands and individual differences. Scientific Studies of Reading, 25(2), 123–140. https://doi.org/10.1080/10888438.2020.1817028
Angele, B., Laishley, A. E., Rayner, K., & Liversedge, S. P. (2014). The effect of high- and low-frequency previews and sentential fit on word skipping during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(4), 1181–1203. https://doi.org/10.1037/a0036396
Angele, B., & Rayner, K. (2013). Processing the in the parafovea: Are articles skipped automatically? Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(2), 649–662. https://doi.org/10.1037/a0029294
Balota, D. A., Pollatsek, A., & Rayner, K. (1985). The interaction of contextual constraints and parafoveal visual information in reading. Cognitive Psychology, 17(3), 364–390. https://doi.org/10.1016/0010-0285(85)90013-1
Becker, W., & Jürgens, R. (1979). An analysis of the saccadic system by means of double step stimuli. Vision Research, 19(9), 967–983. https://doi.org/10.1016/0042-6989(79)90222-0
Bicknell, K., & Levy, R. (2010). A rational model of eye movement control in reading. In J. Hajič (Ed.), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1168–1178). Association for Computational Linguistics.
Bouma, H., & de Voogd, A. H. (1974). On the control of eye saccades in reading. Vision Research, 14(4), 273–284. https://doi.org/10.1016/0042-6989(74)90077-7
Brodbeck, C., Bhattasali, S., Cruz Heredia, A. A. L., Resnik, P., Simon, J. Z., & Lau, E. (2022). Parallel processing in speech perception with local and global representations of linguistic context. eLife, 11, e72056. https://doi.org/10.7554/eLife.72056
Brysbaert, M., & Drieghe, D. (2003). Please stop using word frequency data that are likely to be word length effects in disguise. Behavioral and Brain Sciences, 26(4), 479. https://doi.org/10.1017/S0140525X03240103
Brysbaert, M., Drieghe, D., & Vitu, F. (2005). Word skipping: Implications for theories of eye movement control in reading. In G. Underwood (Ed.), Cognitive processes in eye guidance (pp. 53–78). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198566816.003.0003
Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977
Buswell, G. T. (1920). An experimental study of the eye-voice span in reading. University of Chicago.
Clifton, C., Jr., Ferreira, F., Henderson, J. M., Inhoff, A. W., Liversedge, S. P., Reichle, E. D., & Schotter, E. R. (2016). Eye movements in reading and information processing: Keith Rayner's 40 year legacy. Journal of Memory and Language, 86, 1–19. https://doi.org/10.1016/j.jml.2015.07.004
Cop, U., Dirix, N., Drieghe, D., & Duyck, W. (2017). Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, 49(2), 602–615. https://doi.org/10.3758/s13428-016-0734-0
de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L., & Theunissen, F. E. (2017). The hierarchical cortical organization of human speech processing. Journal of Neuroscience, 37(27), 6539–6557. https://doi.org/10.1523/JNEUROSCI.3267-16.2017
Dearborn, W. F. (1906). The psychology of reading: An experimental study of the reading pauses and movements of the eye. Columbia University Contributions to Philosophy and Psychology, 4, 1–134.
Dehaene, S. (2009). Reading in the brain: The new science of how we read. Penguin.
Deubel, H., O'Regan, J. K., & Radach, R. (2000). Commentary on section 2—Attention, information processing, and eye movement control. In A. Kennedy, R. Radach, D. Heller, & J. Pynte (Eds.), Reading as a perceptual process (pp. 355–374). North-Holland/Elsevier. https://doi.org/10.1016/B978-008043642-5/50017-6
Duan, Y., & Bicknell, K. (2020). A rational model of word skipping in reading: Ideal integration of visual and linguistic information. Topics in Cognitive Science, 12(1), 387–401. https://doi.org/10.1111/tops.12485
Ehrlich, S. F., & Rayner, K. (1981). Contextual effects on word perception and eye movements during reading. Journal of Verbal Learning and Verbal Behavior, 20(6), 641–655. https://doi.org/10.1016/S0022-5371(81)90220-6
Engbert, R., & Kliegl, R. (2003). The game of word skipping: Who are the competitors? Behavioral and Brain Sciences, 26(4), 481–482. https://doi.org/10.1017/S0140525X03270102
Engbert, R., Nuthmann, A., Richter, E. M., & Kliegl, R. (2005). SWIFT: A dynamical model of saccade generation during reading. Psychological Review, 112(4), 777–813. https://doi.org/10.1037/0033-295X.112.4.777
Findlay, J. M., & Walker, R. (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behavioral and Brain Sciences, 22(4), 661–721. https://doi.org/10.1017/S0140525X99002150
Frank, S. L., Fernandez Monsalve, I., Thompson, R. L., & Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45(4), 1182–1190. https://doi.org/10.3758/s13428-012-0313-y
Goodkind, A., & Bicknell, K. (2018). Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018) (pp. 10–18). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-0102
Goodman, K. S. (1967). Reading: A psycholinguistic guessing game. Journal of the Reading Specialist, 6(4), 126–135. https://doi.org/10.1080/19388076709556976
Hahn, M., & Keller, F. (2023). Modeling task effects in human reading with neural network-based attention. Cognition, 230, 105289. https://doi.org/10.1016/j.cognition.2022.105289
Hanes, D. P., & Schall, J. D. (1995). Countermanding saccades in macaque. Visual Neuroscience, 12(5), 929–937. https://doi.org/10.1017/S0952523800009482
Heilbron, M., Armeni, K., Schoffelen, J.-M., Hagoort, P., & de Lange, F. P. (2022). A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences, 119(32), e2201968119. https://doi.org/10.1073/pnas.2201968119
Heilbron, M., Richter, D., Ekman, M., Hagoort, P., & de Lange, F. P. (2020). Word contexts enhance the neural representation of individual letters in early visual cortex. Nature Communications, 11(1), 321. https://doi.org/10.1038/s41467-019-13996-4
Hohenstein, S., & Kliegl, R. (2014). Semantic preview benefit during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(1), 166–190. https://doi.org/10.1037/a0033670
Inhoff, A. W. (1984). Two stages of word processing during eye fixations in the reading of prose. Journal of Verbal Learning and Verbal Behavior, 23(5), 612–624. https://doi.org/10.1016/S0022-5371(84)90382-7
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4), 329–354. https://doi.org/10.1037/0033-295X.87.4.329
Kennedy, A. (2003). The Dundee Corpus [CD-ROM]. Psychology Department, University of Dundee.
Kennedy, A., Pynte, J., Murray, W. S., & Paul, S.-A. (2013). Frequency and predictability effects in the Dundee Corpus: An eye movement analysis. Quarterly Journal of Experimental Psychology, 66(3), 601–618. https://doi.org/10.1080/17470218.2012.676054
Kliegl, R., Grabner, E., Rolfs, M., & Engbert, R. (2004). Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology, 16(1–2), 262–284. https://doi.org/10.1080/09541440340000213
Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135(1), 12–35. https://doi.org/10.1037/0096-3445.135.1.12
Kliegl, R., Risse, S., & Laubrock, J. (2007). Preview benefit and parafoveal-on-foveal effects from word n + 2. Journal of Experimental Psychology: Human Perception and Performance, 33(5), 1250–1255. https://doi.org/10.1037/0096-1523.33.5.1250
Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7), 1434–1448. https://doi.org/10.1364/JOSAA.20.001434
Legendre, P. (2008). Studying beta diversity: Ecological variation partitioning by multiple regression and canonical analysis. Journal of Plant Ecology, 1(1), 3–8. https://doi.org/10.1093/jpe/rtm001
Luke, S. G., & Christianson, K. (2016). Limits on lexical prediction during reading. Cognitive Psychology, 88, 22–60. https://doi.org/10.1016/j.cogpsych.2016.06.002
Luke, S. G., & Christianson, K. (2018). The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods, 50(2), 826–833. https://doi.org/10.3758/s13428-017-0908-4
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1–86. https://doi.org/10.1016/0010-0285(86)90015-0
McClelland, J. L., & O'Regan, J. K. (1981). Expectations increase the benefit derived from parafoveal visual information in reading words aloud. Journal of Experimental Psychology: Human Perception and Performance, 7(3), 634–644. https://doi.org/10.1037/0096-1523.7.3.634
McConkie, G. W., Kerr, P. W., Reddix, M. D., & Zola, D. (1988). Eye movement control during reading: I. The location of initial eye fixations on words. Vision Research, 28(10), 1107–1118. https://doi.org/10.1016/0042-6989(88)90137-X
McConkie, G. W., & Rayner, K. (1975). The span of the effective stimulus during a fixation in reading. Perception & Psychophysics, 17(6), 578–586. https://doi.org/10.3758/BF03203972
Morton, J. (1964). The effects of context upon speed of reading, eye movements and eye-voice span. Quarterly Journal of Experimental Psychology, 16(4), 340–354. https://doi.org/10.1080/17470216408416390
Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2), 327–357. https://doi.org/10.1037/0033-295X.113.2.327
Norris, D. (2009). Putting it all together: A unified account of word recognition and reaction-time distributions. Psychological Review, 116(1), 207–219. https://doi.org/10.1037/a0014259
O'Regan, J. K. (1980). The control of saccade size and fixation duration in reading: The limits of linguistic control. Perception & Psychophysics, 28(2), 112–117. https://doi.org/10.3758/BF03204335
O'Regan, J. K. (1992). Optimal viewing position in words and the strategy-tactics theory of eye movements in reading. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading (pp. 333–354). Springer. https://doi.org/10.1007/978-1-4612-2852-3_20
Pan, Y., Frisson, S., & Jensen, O. (2021). Neural evidence for lexical parafoveal processing. Nature Communications, 12(1), 5234. https://doi.org/10.1038/s41467-021-25571-x
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pimentel, T., Meister, C., Wilcox, E. G., Levy, R., & Cotterell, R. (2022). On the effect of anticipation on reading times. arXiv:2211.14301. https://doi.org/10.48550/arXiv.2211.14301
Pynte, J., & Kennedy, A. (2006). An influence over eye movements in reading exerted from beyond the level of the word: Evidence from reading English and French. Vision Research, 46(22), 3786–3801. https://doi.org/10.1016/j.visres.2006.07.004
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Rayner, K. (1975). The perceptual span and peripheral cues in reading. Cognitive Psychology, 7(1), 65–81. https://doi.org/10.1016/0010-0285(75)90005-5
Rayner, K. (1977). Visual attention in reading: Eye movements reflect cognitive processes. Memory & Cognition, 5(4), 443–448. https://doi.org/10.3758/BF03197383
Rayner, K. (1979). Eye guidance in reading: Fixation locations within words. Perception, 8(1), 21–30. https://doi.org/10.1068/p080021
Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62(8), 1457–1506. https://doi.org/10.1080/17470210902816461
Rayner, K., Juhasz, B. J., & Brown, S. J. (2007). Do readers obtain preview benefit from word N + 2? A test of serial attention shift versus distributed lexical processing models of eye movement control in reading. Journal of Experimental Psychology: Human Perception and Performance, 33(1), 230–245. https://doi.org/10.1037/0096-1523.33.1.230
Rayner, K., & Pollatsek, A. (1987). Eye movements in reading: A tutorial review. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading (pp. 327–362). Lawrence Erlbaum Associates.
Rayner, K., Reichle, E. D., Stroud, M. J., Williams, C. C., & Pollatsek, A. (2006). The effect of word frequency, word predictability, and font difficulty on the eye movements of young and older readers. Psychology and Aging, 21(3), 448–465. https://doi.org/10.1037/0882-7974.21.3.448
Rayner, K., & Well, A. D. (1996). Effects of contextual constraint on eye movements in reading: A further examination. Psychonomic Bulletin & Review, 3(4), 504–509. https://doi.org/10.3758/BF03214555
Reicher, G. M. (1969). Perceptual recognition as a function of meaningfulness of stimulus material. Journal of Experimental Psychology, 81(2), 275–280. https://doi.org/10.1037/h0027768
Reichle, E. D., Rayner, K., & Pollatsek, A. (2003). The E-Z Reader model of eye-movement control in reading: Comparisons to other models. Behavioral and Brain Sciences, 26(4), 445–526. https://doi.org/10.1017/S0140525X03000104
Reilly, R. G., & O'Regan, J. K. (1998). Eye movement control during reading: A simulation of some word-targeting strategies. Vision Research, 38(2), 303–317. https://doi.org/10.1016/S0042-6989(97)87710-3
Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2019). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. PsyArXiv. https://doi.org/10.31234/osf.io/h8ft7
Schall, J. D., & Cohen, J. Y. (2011). The neural basis of saccade target selection. In S. Liversedge, I. Gilchrist, & S. Everling (Eds.), The Oxford handbook of eye movements (pp. 357–381). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199539789.013.0019
Schotter, E. R., Angele, B., & Rayner, K. (2012). Parafoveal processing in reading. Attention, Perception, & Psychophysics, 74(1), 5–35. https://doi.org/10.3758/s13414-011-0219-2
Schotter, E. R., Lee, M., Reiderman, M., & Rayner, K. (2015). The effect of contextual constraint on parafoveal processing in reading. Journal of Memory and Language, 83, 118–139. https://doi.org/10.1016/j.jml.2015.04.005
Shain, C., Meister, C., Pimentel, T., Cotterell, R., & Levy, R. (2022). Large-scale evidence for logarithmic effects of word predictability on reading time. PsyArXiv. https://doi.org/10.31234/osf.io/4hyna
Smith, N. J., & Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition, 128(3), 302–319. https://doi.org/10.1016/j.cognition.2013.02.013
Staub, A. (2015). The effect of lexical predictability on eye movements in reading: Critical review and theoretical interpretation. Language and Linguistics Compass, 9(8), 311–327. https://doi.org/10.1111/lnc3.12151
Tiffin-Richards, S. P., & Schroeder, S. (2015). Children's and adults' parafoveal processes in German: Phonological and orthographic effects. Journal of Cognitive Psychology, 27(5), 531–548. https://doi.org/10.1080/20445911.2014.999076
Veldre, A., & Andrews, S. (2018). Parafoveal preview effects depend on both preview plausibility and target predictability. Quarterly Journal of Experimental Psychology, 71(1), 64–74. https://doi.org/10.1080/17470218.2016.1247894
Vitu, F., O'Regan, J. K., Inhoff, A. W., & Topolski, R. (1995). Mindless reading: Eye-movement characteristics are similar in scanning letter strings and reading texts. Perception & Psychophysics, 57(3), 352–364. https://doi.org/10.3758/BF03213060
Vitu, F., O'Regan, J. K., & Mittau, M. (1990). Optimal landing position in reading isolated words and continuous text. Perception & Psychophysics, 47(6), 583–600. https://doi.org/10.3758/BF03203111
Wheeler, D. D. (1970). Processes in word recognition. Cognitive Psychology, 1(1), 59–85. https://doi.org/10.1016/0010-0285(70)90005-8
Wilcox, E. G., Gauthier, J., Hu, J., Qian, P., & Levy, R. (2020). On the predictive power of neural language models for human real-time comprehension behavior. arXiv:2006.01912. https://doi.org/10.48550/arXiv.2006.01912
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., … Rush, A. M. (2020). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771. https://doi.org/10.48550/arXiv.1910.03771
Woolnough, O., Donos, C., Rollo, P. S., Forseth, K. J., Lakretz, Y., Crone, N. E., Fischer-Baum, S., Dehaene, S., & Tandon, N. (2021). Spatiotemporal dynamics of orthographic and lexical processing in the ventral visual pathway. Nature Human Behaviour, 5(3), 389–398. https://doi.org/10.1038/s41562-020-00982-w
Yan, M., Kliegl, R., Shu, H., Pan, J., & Zhou, X. (2010). Parafoveal load of word N + 1 modulates preprocessing effectiveness of word N + 2 in Chinese reading. Journal of Experimental Psychology: Human Perception and Performance, 36(6), 1669–1676. https://doi.org/10.1037/a0019329
Yeatman, J. D., & White, A. L. (2021). Reading: The confluence of vision and language. Annual Review of Vision Science, 7, 487–517. https://doi.org/10.1146/annurev-vision-093019-113509
Zwitserlood, P. (1989). The locus of the effects of sentential-semantic context in spoken-word processing. Cognition, 32(1), 25–64. https://doi.org/10.1016/0010-0277(89)90013-9

Similarly, one may ask what drives next-word identification: is identifying the next word mostly driven by linguistic predictions (Goodman, 1967) or by parafoveal perception? Remarkably, while it is well-established that both linguistic and oculomotor, and both predictive and parafoveal processing, all affect eye-movements (Brysbaert et al., 2005; Kliegl et al., 2004; Schotter et al., 2012; Staub, 2015), a comprehensive picture of their relative explanatory power is currently missing, perhaps because they are sel- dom studied all at the same time. To arrive at such a comprehensive picture we focus on natural reading, analysing three large datasets of participants reading passages, long articles, and even an entire novel—together encompassing 1.5 million (un)fixated words, across 108 individuals (Cop et al., 2017; Kennedy, 2003; Luke & Christianson, 2018). We use a model-based approach: instead of manipulating word predictability or perturbing parafoveal perceptibility, we com- bine deep neural language modelling (Radford et al., 2019) and Bayesian ideal observer anal- ysis (Duan & Bicknell, 2020) to quantify how much information about next-word identity is OPEN MIND: Discoveries in Cognitive Science 758 Lexical Processing Affects Reading Times Not Skipping Heilbron et al. conveyed by both prediction and preview, on a moment-by-moment basis. Our model-based analysis is quite different from the experimental approach, especially in the case of parafoveal preview which is generally studied with a boundary paradigm. However, the underlying logic is the same: in the boundary paradigm, eye movements are compared between conditions in which the preview is informative (valid) and when it conveys no (or incorrect) information about word identity. We—following (Bicknell & Levy, 2010; Duan & Bicknell, 2020)—simply replace this categorical contrast with a more continuous analysis, quantifying the subtle word-by-word variation in the amount of information conveyed by the prior preview. In this sense, our approach can be seen as an extension and refinement of the seminal analyses by Brysbaert and colleagues, allowing for instance to quantify not just the effect of word length on skipping—but also, simultaneously, estimate and control for the effect word length has on a word’s prior parafoveal identifiability. In this way, our word-by-word, information-theoretic analysis brings us closer to the under- lying mechanisms than analysing effect sizes in the aggregate. However, we want to stress we use these models as normative models to estimate how much information is in principle avail- able from prediction and preview at each moment, but do not take these as processing models of human cognition (see Methods and Discussion for a more extensive comparison of our model-based approach and traditional methods). Such a broad-coverage model-based approach has been applied to predictability effects on reading before (Frank et al., 2013; Goodkind & Bicknell, 2018; Kliegl et al., 2004; Luke & Christianson, 2016;Shain et al., 2022; Smith & Levy, 2013), but either without considering preview or only through coarse heuristics such as using word frequency as a proxy for parafoveal identifiability (Kennedy et al., 2013; Kliegl et al., 2006; Pynte & Kennedy, 2006) (but see Duan & Bicknell, 2020). By contrast, we explicitly model both, in addition to low-level explanations like autonomous oculomotor control. 
To assess explanatory power, we use set theory to derive the unique and shared variation in eye movements explained by each model. To preview the results, this revealed a striking dissociation between skipping and reading times. For word skipping, the overwhelming majority of explained variation could be explained—mostly uniquely explained—by a non-linguistic oculomotor model, that explained word skipping just as a function of a word’s distance to the prior fixation position and its length. These two low-level variables explained much more skipping variation than the degree to which a word was identifiable or predictable prior to fixation. For reading times, by contrast, we did find that factors determining a word’s lexical processing explained most variance. In line with dominant models, we found strong effects of both prediction and preview, matching effect sizes from controlled designs. Interestingly, prediction and parafoveal preview seem to operate independently: we found strong evidence against Bayes-optimal integration of the two. Together, these results support and extend the earlier conclusions of Brysbaert and col- leagues, while challenging dominant cognitive models of reading, showing that skipping (or the decision of where to fixate) and reading times (i.e., how long to fixate) are governed by different principles, and that for word skipping, the link between eye movements and cogni- tion is less direct than commonly thought. RESULTS We analysed eye movements from three large datasets of participants reading texts ranging from isolated paragraphs to an entire novel. Specifically, we considered three datasets: Dundee (Kennedy, 2003)(N = 10, 51.502 words per participant), Geco (Cop et al., 2017) (N = 14, 54.364 words per participant) and Provo (Luke & Christianson, 2018)(N = 84, OPEN MIND: Discoveries in Cognitive Science 759 Lexical Processing Affects Reading Times Not Skipping Heilbron et al. 2.689 words per participant). In each corpus, we analysed both skipping and reading times (indexed by gaze duration), as they are thought to reflect separate processes: the decision of where vs. how long to fixate, respectively (Brysbaert et al., 2005; Reichle et al., 2003). For more descriptive details about the data across participants and datasets, see Methods and Figures A.5–A.7. To estimate the effect of linguistic prediction and parafoveal preview, we quantified the amount of information conveyed by both factors for each word in the corpus (for preview, this was tailored to each individual participant, since each word was previewed at a different eccentricity by each participant). To this end, we formalised both processes as a probabilistic belief about the identity of the next word, given either the preceding words (prediction) or a noisy parafoveal percept (preview; see Figure 1A). As such, we could describe these disparate cognitive processes using a common information-theoretic currency. To compute the proba- bility distributions, we used GPT-2 for prediction (Radford et al., 2019) and a Bayesian ideal observer for preview (Duan & Bicknell, 2020) (see Figure 1B and Methods). Note that we use both computational models as normative models; tools to estimate how much information is in principle available from linguistic context (prediction) or parafoveal perception (preview) on a moment-by-moment basis. 
In other words, we use these models much in the same way as we rely on the counting algorithms used to aggregate lexical frequency statistics: in both cases we are interested in the computed statistic (e.g., lexical surprisal or entropy, or lexical frequency), but we do not want to make any cognitive claim about the underlying algorithm that we happened to use to compute this statistic (e.g., GPT-2 for lexical surprisal, or a counting algorithm for lexical frequency). For more details on the exact choice, and the relation to alternative metrics (e.g., GPT-3 or cloze probabilities), see Methods and Discussion.

Figure 1. Quantifying two types of context during natural reading. (A) Readers can infer the identity of the next word before fixation either by predicting it from context or by discerning it from the parafovea. Both can be cast as a probabilistic inference about the next word, either given the preceding words (prediction, blue) or given a parafoveal percept (preview, orange). (B) To model prediction, we use GPT-2, one of the most powerful publicly available language models (Radford et al., 2019). For preview, we use an ideal observer (Duan & Bicknell, 2020) based on well-established 'Bayesian Reader' models (Bicknell & Levy, 2010; Norris, 2006, 2009). Importantly, we do not use either model as a cognitive model per se, but rather as a tool to quantify how much information is in principle available from prediction or preview on a moment-by-moment basis.

Prediction and Preview Increase Skipping Rates and Reduce Reading Times

We first asked whether our formalisations allowed us to observe the expected effects of prediction and preview, while statistically controlling for other explanatory variables. This was done by performing a multiple regression analysis and statistically testing whether the coefficients were in the expected direction. Word skipping was modelled with a logistic regression; reading times (gaze durations) were predicted using ordinary least squares regression. Because the decisions of whether to skip and how long to fixate a word are made at different moments, when different types of information are available, we modelled each separately with a different set of explanatory variables. But in both cases, for inference on the coefficients, we considered the full model (variables motivated and detailed below; see Tables A.1 and A.2 for a tabular overview of all variables).

As expected, we found in all datasets that words were more likely to be skipped if there was more information available from the linguistic prediction (Bootstrap: Dundee, p = 0.023; Geco, p = 0.034; Provo, p < 10⁻⁵) and/or the parafoveal preview (Bootstrap: Dundee, p = 4 × 10⁻⁵; Geco, p < 10⁻⁵; Provo, p < 10⁻⁵). Similarly, reading times were reduced for words that were more predictable (all p's < 3.2 × 10⁻⁴) or more identifiable from the parafovea (all p's < 4 × 10⁻⁵). Together, this confirms that our model-based approach can capture the expected effects of both prediction (Clifton et al., 2016) and preview (Schotter et al., 2012) in natural reading, while statistically controlling for other variables.

Word Skipping is Largely Independent of Online Lexical Processing

After confirming that prediction and preview had a statistically significant influence on word skipping and reading times, we went on to assess their relative explanatory power.
That is, we asked how important these factors were, by examining how much variance was explained by each. To this end, we grouped the variables from the full regression model into different types of explanations, and assessed how well each type accounted for the data, in terms of the unique and overlapping amount of variation explained by each explanation. This was measured by the cross-validated R² for reading times, and R²_McF for skipping, which both quantify the proportion of variation explained (see Methods).

For skipping, we considered three explanations. First, a word might be skipped purely because it could be predicted from context—that is, purely as a function of the amount of information about word identity conveyed by the prediction. Secondly, a word might be skipped because its identity could be gleaned from a parafoveal preview—that is, purely as a function of the amount of information about word identity conveyed by the preview. Finally, a word might be skipped simply because it is so short or so close to the prior fixation location that an autonomously generated saccade will likely overshoot it, irrespective of its linguistic properties—in other words, purely as a function of length and eccentricity. Note that we did not include often-used lexical attributes like frequency to predict skipping, because using attributes of word n+1 already presupposes parafoveal identification. Moreover, to the extent that a lexical attribute like frequency might influence a word's parafoveal identifiability, this should already be captured by the parafoveal entropy (see Figure A.3 and Methods for more details).

For each word, we thus modelled the probability of skipping either as a function of prediction, preview, or oculomotor information (i.e., eccentricity and length), or any combination of the three. Then we partitioned the unique and shared cross-validated variation explained by each account. Strikingly, this revealed that the overwhelming majority of explained skipping variation (94%) could be accounted for by the oculomotor baseline that consisted just of eccentricity and length (Figure 2). Moreover, the majority of the variation was only explained by the baseline, which explained 10 times more unique variation than prediction and preview combined. There was a large degree of overlap between preview and the oculomotor baseline, which is unsurprising since a word's identifiability decreases as a function of its eccentricity and length.

Figure 2. Variation in skipping explained by predictive, parafoveal and autonomous oculomotor processing. (A) Proportions of cross-validated variation explained by prediction (blue), preview (orange), the oculomotor baseline (grey) and their overlap; averaged across datasets (each dataset weighted equally). (B) Variation partitions for each individual dataset, including statistical significance of variation uniquely explained by predictive, parafoveal or oculomotor processing. Stars indicate significance levels of the cross-validated unique variation explained (bootstrap t-test against zero): p < 0.05 (*), p < 0.01 (**), p < 0.001 (***). For results of individual participants, and their consistency, see Figure A.9.
Interestingly, there was even more overlap between the prediction and baseline model: almost all skipping variation that could be explained by contextual constraint could be equally well explained by the oculomotor baseline factors.

Importantly, while the contribution of prediction and preview was small, it was significant, both for prediction (Dundee: 0.015%, bootstrap 95% CI: 0.003–0.029%; bootstrap t-test against zero, p = 0.014; Geco: 0.039%, 95% CI: 0.018–0.065%, p = 0.0001; Provo: 0.20%, 95% CI: 0.14–0.28%, p < 10⁻⁵) and preview (Dundee: 2.14%, 95% CI: 1.66–2.60%, p < 10⁻⁵; Geco: 1.71%, 95% CI: 1.20–2.29%, p < 10⁻⁵; Provo: 0.56%, 95% CI: 0.36–0.79%, p < 10⁻⁵), confirming that both factors do affect skipping. Crucially, however, the vast majority of skipping that could be explained by either prediction or preview was equally well explained by the more low-level and computationally frugal oculomotor model—which also explained much more of the skipping data overall. This challenges the idea that word identification is the main driver behind skipping, instead pointing to a more low-level, computationally simpler strategy.

What might this simpler strategy be? One possibility is a 'blind' random walk: generating saccades of some average length, plus oculomotor noise. However, we find that saccades are tailored to word length and exhibit a well-known preferred landing position, slightly left of a word's centre (see Figure A.8; compare McConkie et al., 1988; Rayner, 1979). This suggests that the decision of where to look next is not 'blind' but is based on a coarse low-level visual analysis of the parafovea, for instance conveying just the location of the next word 'blob' within a preferred range (i.e., skipping words that are too close or too short; cf. Brysbaert et al., 2005; Deubel et al., 2000; Reilly & O'Regan, 1998). Presumably, such a simple strategy would on average sample visual input conveniently, yielding saccades large enough to read efficiently but small enough for comprehension to keep track. However, if such an 'autopilot' is indeed largely independent of online comprehension, one would expect it to occasionally go out of step, such that a skipped word cannot be recognised or guessed, derailing comprehension. In line with this suggestion, we find evidence for a compensation strategy: the probability that an initially skipped word is subsequently (regressively) fixated is significantly and inversely related to its parafoveal identifiability before skipping (see Figure A.10; logistic regression on prior parafoveal entropy: all β's > 0.15; bootstrap test on coefficients: all p's < 10⁻⁵). Together, this suggests that initial skipping decisions are primarily driven by a low-level oculomotor 'autopilot', which is kept in line with online comprehension by correcting saccades that outrun word recognition (much in line with the suggestions by Brysbaert et al., 2005).

Reading Times are Strongly Modulated by Lexical Processing Difficulty

For reading times (defined as gaze durations, so considering foveal reading time only), we similarly considered three broad explanations. First, a word might be read faster because it was predictable from the preceding context, which we formalised via lexical surprisal. Second, a word might be read faster if it could already be partly identified from the parafoveal preview (before fixation).
This informativeness of the preview was again formalised via the parafoveal preview entropy. Finally, a word might be read faster due to non-contextual attributes of the fixated word itself, such as frequency, word class, or the viewing position. This last explanatory factor functioned as a baseline that captured key non-contextual attributes, both linguistic and non-linguistic (see Methods).

In all datasets, we again found that all explanations accounted for some unique variation: prediction (Dundee: 0.80%, bootstrap 95% CI: 0.55–1.09%, bootstrap t-test against zero: p < 6 × 10⁻⁵; Geco: 0.68%, 95% CI: 0.55–0.83%, p = 0.0001; Provo: 0.35%, 95% CI: 0.20–0.43%, p < 10⁻⁵), preview (Dundee: 1.91%, 95% CI: 1.00–3.14%, p = 0.00012; Geco: 1.59%, 95% CI: 0.96–2.30%, p = 5 × 10⁻⁵; Provo: 0.93%, 95% CI: 0.70–1.98%, p < 10⁻⁵) and the non-contextual word attributes (Dundee: 8.06%, 95% CI: 5.84–10.32%, p = 5 × 10⁻⁵; Geco: 1.99%, 95% CI: 1.32–2.81%, p < 10⁻⁵; Provo: 5.38%, 95% CI: 4.48–6.83%, p < 10⁻⁵).

The non-contextual baseline explained the most variance, which shows—unsurprisingly—that properties of the fixated word itself are more important than contextual factors in determining how long a word is fixated. Critically however, compared to skipping, the unique contribution of prediction and preview was more than three times higher (see Figure 3). Specifically, while prediction and preview could only uniquely account for 6% of explained word skipping variation, they uniquely accounted for more than 18% of explained variation in reading times.

Figure 3. Variation in reading times explained by predictive, parafoveal and non-contextual information. (A) Grand average of partitions of cross-validated variance in reading times (indexed by gaze durations) across datasets (each dataset weighted equally) explained by non-contextual factors (grey), parafoveal preview (orange), and linguistic prediction (blue). (B) Variance partitions for each individual dataset, including statistical significance of the cross-validated variance explained uniquely by the predictive, parafoveal or non-contextual explanatory variables. Stars indicate significance levels of the cross-validated unique variance explained (bootstrap t-test against zero): p < 0.01 (**), p < 0.001 (***). For results of individual participants, see Figure A.11. Note that the baseline model here contains both lexical attributes (e.g., frequency) and oculomotor factors (relative viewing/landing position). For a direct contrast between lexical processing-based explanations and purely oculomotor explanations, see Figure 4.

This suggests that while for skipping most explained variation can be accounted for by purely oculomotor variables, this is not the case for reading times. However, this comparison (between oculomotor and lexical processing-based accounts) is difficult to make based on Figures 2 and 3 alone. This is because in the reading times analysis, the baseline model contained both oculomotor (i.e., viewing position) and lexical factors (notably lexical frequency). Therefore, we performed an additional analysis, grouping the explanatory variables differently to contrast purely oculomotor explanatory variables versus variables affecting lexical processing ease (such as predictability, parafoveal identifiability, and lexical frequency; see Tables A.3 and A.4).
This shows that for skipping, purely oculomotor explanations can account for much more than a lexical processing-based explanation—but for reading times, it is exactly the other way around (Figure 4). Note that in Figure 4, the oculomotor model for reading times only contains variables quantifying viewing/landing position, because this is the primary oculomotor explanation for reading time differences (O'Regan, 1980, 1992; Vitu et al., 1990). If we also include word length in the oculomotor model for reading times, there is much more overlapping variance explained by the lexical and oculomotor models, presumably due to the correlation between word length and (log-)frequency, which may inflate the importance of the oculomotor account (see Figure A.13). However, even with this potentially inflated estimate, the overall dissociation persists: if we compare the ratios of unique variation explained by the oculomotor vs. lexical processing-based models, there is still more than a 30-fold difference between the skipping and reading times analyses (Figure A.13A). Together, this supports that for skipping, most explained variation is captured by purely oculomotor rather than lexical processing-based explanations, whereas for reading times, it is the other way around.

Figure 4. Comparing oculomotor and lexical processing-based explanations for skipping and reading times. Analysis with the same explanatory variables as Figures 2 and 3, grouped differently to directly contrast purely oculomotor explanatory variables and those that affect lexical processing ease (such as predictability, parafoveal identifiability, and lexical frequency; see Methods). Venn diagrams represent the proportions of unique and overlapping amounts of explained variation (in R² for reading times, and R²_McF for skipping) by each explanation (grand average across datasets). For partitions for individual datasets with statistics, see Figure A.12; for an alternative partitioning that includes word length in the oculomotor model of reading times, see Figure A.13.

Model-Based Estimates of Naturalistic Prediction and Preview Benefits Match Experimental Effect Sizes

The reading times results confirm that reading times are highly sensitive to factors influencing a word's lexical processing ease, including contextual factors like linguistic and parafoveal context. This is in line with the scientific consensus and decades of experimental research on eye movements in reading (Rayner, 2009). But how well do our model-based, correlational results compare exactly to findings from the experimental literature?

To directly address this question, we quantitatively derived, for each participant, the effect size of two well-established effects that would be expected to be obtained if we were to conduct a well-controlled factorial experiment. While we did not actually perform such a factorial experiment, we can derive this from the regression model, because we quantitatively estimated how much additional information from either prediction or preview (in bits) reduced reading times (in milliseconds). Therefore, the regression analyses allow us to estimate the expected difference in reading times for words that are expected vs. unexpected (predictability benefit; Rayner & Well, 1996; Staub, 2015) or that have a valid vs. invalid preview (i.e., preview benefit; Schotter et al., 2012).
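The arithmetic behind this derivation is simple: a coefficient expressed in milliseconds per bit, multiplied by a difference in bits between two conditions, gives the expected reading-time difference for that contrast. The sketch below illustrates the logic only; all coefficient and probability values are hypothetical placeholders, not the fitted values from the paper.

```python
from math import log2

# Hypothetical placeholder coefficients (ms of gaze duration per bit), not fitted values.
beta_surprisal_ms_per_bit = 4.0      # assumed effect of surprisal on gaze duration
beta_preview_ms_per_bit = -3.0       # assumed effect of preview information on gaze duration

# Predictability benefit: expected gaze-duration difference between
# low- and high-probability words (e.g., cloze tertiles; cut-offs are made up here).
surprisal_low = -log2(0.05)          # ~4.3 bits for a low-probability word
surprisal_high = -log2(0.60)         # ~0.7 bits for a high-probability word
predictability_benefit = beta_surprisal_ms_per_bit * (surprisal_low - surprisal_high)

# Preview benefit: expected gaze-duration difference between a preview of
# average informativeness and no preview at all (0 bits of information).
mean_preview_bits = 4.3              # hypothetical average information gained from preview
preview_benefit = -beta_preview_ms_per_bit * (mean_preview_bits - 0.0)

print(f"predictability benefit ~ {predictability_benefit:.0f} ms")   # ~14 ms with these numbers
print(f"preview benefit ~ {preview_benefit:.0f} ms")                 # ~13 ms with these numbers
```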
Interestingly, the model-derived effect sizes are very well in line with those observed in experimental studies (see Figure 5). This suggests that our analysis does not strongly underfit or otherwise underestimate the effect of prediction or preview. Moreover, it shows that these effect sizes, which are well established in controlled designs, generalise to natural reading. This last point is especially interesting for the preview benefit, because it implies that the effect can be largely explained in terms of parafoveal lexical identifiability (Pan et al., 2021; Rayner, 2009), and that other factors, such as low-level visual 'preprocessing' or interference between the (invalid) parafoveal percept and the foveal percept, may only play a minor role (cf. Reichle et al., 2003; Schotter et al., 2012).

Figure 5. Model-derived effect sizes match experimentally observed effect sizes. Preview (left) and predictability benefits (right) inferred from our analysis of each dataset, and observed in a sample of studies (see Table A.5). In this analysis, preview benefit was derived from the regression model as the expected difference in gaze duration after a preview of average informativeness versus after no preview at all. Predictability benefit was defined as the difference in gaze duration for high versus low probability words; 'high' and 'low' were defined by subdividing the cloze probabilities from Provo into equal thirds of 'low', 'medium' and 'high' probability (see Methods). In each plot, small dots with dark edges represent either individual subjects within one dataset or individual studies in the sample of the literature; larger dots with error bars represent the mean effect across individuals or studies, plus the bootstrapped 99% CI.

No Integration of Prediction and Preview

So far, we have treated prediction and preview as being independent. However, it might be that these processes, while using different information, are integrated—such that a word is parafoveally more identifiable when it is also more predictable in context. Bayesian probability theory proposes an elegant and mathematically optimal way to integrate these sources of information: the prediction of the next word could be incorporated as a prior in perceptual inference. Such a contextual prior fits into hierarchical Bayesian models of vision (Lee & Mumford, 2003), and has been observed in speech perception, where a contextual prior guides the recognition of words from a partial sequence of phonemes (Brodbeck et al., 2022; Heilbron et al., 2022). Does such a prior also guide word recognition in reading, based on a partial parafoveal percept?

To test this, we recomputed the parafoveal identifiability of each word for each participant, but now with an ideal observer using the prediction from GPT-2 as a prior. As expected, Bayesian integration enhanced perceptual inference: on average, the observer using the linguistic prediction as a prior extracted more information from the preview (± 6.25 bits) than the observer not taking the prediction into account (± 4.30 bits; t-test: p ≈ 0). Interestingly, however, it provided a worse fit to the human reading data. This was established by comparing two versions of the full regression model: one with parafoveal entropy from the (theoretically superior) contextual ideal observer and one from the non-contextual ideal observer.
In all datasets, both skipping and reading times were better explained by the model including parafoveal identifiability from the non-contextual observer (skipping: all p's < 10⁻⁵; reading times: all p's < 10⁻⁵; see Figure 6). This replicates Duan and Bicknell (2020), who performed a similar analysis comparing a contextual (5-gram) and non-contextual prior in natural reading. Our findings replicate and significantly extend theirs, since Duan and Bicknell (2020) only analysed skipping in the Dundee corpus. Our analysis not only investigates additional datasets, but it finds exactly the same result for reading times (for which the importance of both prediction and preview is decidedly larger). Together, this suggests that while both linguistic prediction and parafoveal preview influence online reading behaviour, the two sources of information are not integrated, but instead operate independently—highlighting a remarkable sub-optimality in reading.

Figure 6. Evidence against Bayesian integration of linguistic prediction and parafoveal preview. Cross-validated prediction performance of the full reading times (top) and skipping (bottom) model (including all variables), equipped with parafoveal preview information either from the contextual observer or from the non-contextual observer. Dots with connecting lines indicate participants; stars indicate significance: p < 0.001 (***).

DISCUSSION

Eye movements during reading are highly variable. Across three large datasets, we assessed the relative importance of different explanations for this variability. In particular, we quantified the importance of two major contextual determinants of a word's lexical processing difficulty—linguistic prediction and parafoveal preview—and compared such lexical processing-based explanations to alternative (non-linguistic) explanations. This revealed a stark dissociation: for word skipping, a simple low-level oculomotor model (using just word length and distance to the prior fixation location) could account for much more (unique) variation than lexical processing-based explanations, whereas for reading times, it was exactly the other way around. Interestingly, preview effects were best captured by a non-contextual observer, suggesting that while readers use both linguistic prediction and preview, these do not appear to be integrated on-line. Together, the results underscore the dissociation between skipping and reading times, and show that for word skipping, the link between eye movements and cognition is less direct than commonly thought.

Our results on skipping strongly support the earlier findings and theoretical perspective of Brysbaert et al. (2005). They analysed effect sizes from studies on skipping and found a disproportionately large effect of length, compared to proxies of processing difficulty like frequency and predictability. We significantly extend their findings by modelling skipping itself (rather than effect sizes from studies) and making a direct link to processing mechanisms. For instance, based on their analysis it was unclear how much of the length effect could be attributed to the lower visibility of longer words—that is, how much of the length effect may be an identifiability effect (Brysbaert et al., 2005, p. 19).
We show that length and eccentricity alone explained three times as much variation as parafoveal identifiability—and that most of the variation explained by identifiability was equally well explained by length and eccentricity. This demonstrates that length and eccentricity themselves—not just to the extent that they reduce identifiability—are key drivers of skipping.

This conclusion challenges dominant, cognitive models of eye movements, which describe lexical identification as the primary driver behind skipping (Engbert & Kliegl, 2003; Engbert et al., 2005; Reichle et al., 2003). Importantly, our results do not challenge predictive or parafoveal word identification itself. Rather, they challenge the notion that moment-to-moment decisions of whether to skip individual words are primarily driven by the recognition of those words. Instead, our results suggest a simpler strategy in which a coarse (e.g., dorsal stream) visual representation is used to reflexively select the next saccade target, following the simple heuristic of moving forward to the next word 'blob' within a certain range (see also Brysbaert et al., 2005; Deubel et al., 2000; Reilly & O'Regan, 1998).

Given that readers use both prediction and preview, why would these strongly affect reading times but hardly word skipping? We suggest this is because these different decisions—of where versus how long to fixate—are largely independent and are made at different moments (Findlay & Walker, 1999; Hanes & Schall, 1995; Schall & Cohen, 2011). Specifically, the decision of where to fixate—and hence whether to skip the next word—is made early in saccade programming, which can take 100–150 ms (Becker & Jürgens, 1979; Brysbaert et al., 2005; Hanes & Schall, 1995). Although the exact sequence of operations leading to a saccade remains debated, given that readers on average only look at a word for some 250 ms, it is clear that skipping decisions are made under strong time constraints, especially given the lower processing rate of parafoveal information. We suggest that the brain meets this constraint by resorting to a computationally frugal 'move forward' policy. How long to fixate, by contrast, depends on saccade initiation. This process is separate from target selection, as indicated by physiological evidence that variation in target selection time only weakly explains variation in initiation times, which are affected by more factors and can be adjusted later (Findlay & Walker, 1999; Schall & Cohen, 2011). This can allow initiation to be informed by foveal information, which is processed more rapidly and may thus more directly influence the decision to either keep dwelling or execute the saccade.

One simplifying assumption we made is that during natural reading, the relative importance of identification-related processes (like prediction and preview) and oculomotor processing is relatively stable within a single reader. However, it might be that underneath the average, aggregate relative importance we estimated, there is variability between specific moments in a text, or even within a sentence, during which the relative importance might be quite different. One such moment could be sentence transitions, where due to end-of-sentence 'wrap-up' effects (Andrews & Veldre, 2021; Just & Carpenter, 1980), the relative importance of, for instance, preview may be reduced.
Here we did not treat sentence transitions (or other such moments) as special, but looking into the possibility of moment-to-moment variability in relative importance is an interesting avenue for future research.

A distinctive feature of our analysis is that we focus on a few sets of computationally explicit variables, each forming a coherent explanation, and quantify the shared and unique variation accounted for by each explanation. The advantage of this approach is interpretability. However, a limitation of the partitioning analysis is that it is not always possible to add all potentially statistically significant variables (or interactions) to the regression, because partitioning requires that each variable can be assigned to a single explanation. When this is not possible (e.g., when a variable is (indirectly) associated with multiple explanations), this requires making a decision: either the variable is omitted, and the regression may not capture all explainable variance; or the variable is assigned to just one explanation, which may distort the results by inflating the importance of that explanation.

A primary example of a variable requiring such a decision is (log-)frequency in the context of skipping. Frequency is sometimes used to predict skipping, as a proxy for a word's parafoveal identifiability. However, this relationship is indirect, and (log-)frequency is also—and much more strongly—correlated with length, and is hence also associated with the oculomotor explanation. Therefore, if one uses frequency as a proxy for parafoveal identifiability, one may find apparent preview effects which are in fact length effects, and strongly overestimate the importance of preview (Brysbaert & Drieghe, 2003). To avoid such overestimates, and especially because the effect of frequency on identifiability should already be captured by the Ideal Observer (see Figure A.3 and Methods), we did not include frequency in our skipping analysis, nor did we include any other attribute that is sometimes used as a 'proxy' for either prediction/constraint or preview. A conceptually related problem is posed by interactions between variables from different explanations, such as between prediction/preview entropy and oculomotor predictors. These are impossible to assign to a single explanation, and were hence excluded from the regression.

As a result, the regression model did not include some variables or interactions that were used in prior regression analyses of skipping or reading times. This means that our regression may leave some explainable variation unexplained, and that our importance estimates are specific to the variables we consider, and to our modelling thereof. However, this is a limitation that we believe trades off favourably against the advantages afforded by the analysis.
In particular, for both skipping and reading times, (1) we included all the factors deemed most important by prior regression-based studies (e.g., Duan & Bicknell, 2020; Hahn & Keller, 2023; Kliegl et al., 2006); (2) the amount of overall (cross-validated) explained variation is in line with prior regression-based analyses (e.g., Duan & Bicknell, 2020; Kliegl et al., 2006); and (3) our model-based effect sizes of prediction and preview effects are well in line with those from the experimental literature, suggesting that our modelling of prediction and preview does not fail to capture major aspects of either (Figure 5). In sum, we therefore do not believe that our selective and computationally explicit regression analysis significantly underestimates major factors of importance, and we are optimistic that our analysis yielded the comprehensive, interpretable picture that we aimed for.

To quantify predictability (surprisal) and constraint (lexical entropy) we used a neural language model (GPT-2), instead of the more standard cloze procedure. The reason for this is that we are interested in natural texts, where many words have relatively low predictability values (e.g., below p = 0.01), which are inherently difficult to estimate in a cloze task. Since the effect of word predictability is logarithmic (Shain et al., 2022; Smith & Levy, 2013), differences between small probabilities (e.g., between p = 0.001 and p = 0.0001) can have non-negligible effects, which is why, for natural texts, language models are superior to cloze metrics for capturing predictability effects. Since the Provo corpus includes a cloze probability for every word, we could confirm this empirically, finding that model-derived surprisal indeed predicts reading times much better (Figure A.1).

We used this specific language model (GPT-2) simply because it was among the best publicly available ones, and prior work demonstrates that better language models (measured in perplexity) also predict human reading behaviour better (Goodkind & Bicknell, 2018; Wilcox et al., 2020). This raises the question whether an even better model (e.g., GPT-3, GPT-4, GPT-5, etc.) could predict human behaviour even better, and whether this might change the results. However, we do not believe this is likely. First, compared to the increase in quality (decrease in perplexity) from n-grams to GPT, further model improvements will be very subtle when quantified in the aggregate, and since reading behaviour is itself a noisy metric, it is not obvious that such improvements will have a measurable impact. Second, one recent study even suggested that models larger than GPT-2 (GPT-J and GPT-3) predicted reading slightly worse, perhaps due to super-human memorisation capacities (Shain et al., 2022). In short, we used GPT-2 simply because it is a strong measure of lexical predictability (in English). Our analyses do not depend on GPT-2 specifically, in the same way that we do not believe the results would change if we had used different (but similar-quality) lexical frequency statistics.

One apparent complication is that the skipping and reading times analyses use different metrics for explained variation (R²_McF for skipping and R² for reading times). This is due to the difference between discrete and continuous dependent variables. As a result, directly numerically comparing the two (e.g., interpreting 4% R²_McF as 'less' than 5% R²) is difficult.
However, our comparisons between skipping and reading times are not based on such absolute, numerical comparisons. Instead, the conclusions only rely on comparing the relative importance of different explanations; in other words, on comparing the relative size and overlap of the Venn diagrams in Figures 2, 3 and 4 (and hence only directly comparing quantities of the same metric).

If one does look at absolute numerical values across Figures 2, 3 and 6, the R² values of the reading times regression may seem rather small. This could indicate a poor fit, which would potentially undermine our claim that reading times are to a large degree explained by cognitive factors. However, we do not believe this is the case, since our R²'s for gaze durations are not lower than the R²'s reported by other regression analyses of natural reading (e.g., Kliegl et al., 2006), and because we find effect sizes in line with the experimental literature (Figure 5). Therefore, we do not believe we overfit or underfit gaze durations. Instead, what the relatively low R² values indicate, we suggest, is that gaze durations are inherently noisy: only a limited amount of the variation is systematic variation. While this noisiness might be interesting in itself (e.g., reflecting an autonomous timer; Engbert et al., 2005), it is not of interest in this study, which focusses on systematic variation, and hence only on the relative importance of different explanations, not on absolute R² values.

A final notable finding is that preview was best explained by a non-contextual observer. This replicates the only other study that compared contextual and non-contextual models of preview (Duan & Bicknell, 2020). That study focussed on skipping; the fact that we obtain the same result for reading times and in different datasets strengthens the conclusion that context does not inform preview. This is also in line with a number of studies on skipping, suggesting no or only a weak effect of contextual fit (Angele et al., 2014; Angele & Rayner, 2013; Hahn & Keller, 2023). However, it contradicts a possibly larger range of experimental studies on preview more broadly, which do find interactions between contextual constraint/prediction and preview (e.g., Balota et al., 1985; McClelland & O'Regan, 1981; Schotter et al., 2015; Veldre & Andrews, 2018). One explanation for this discrepancy stems from how the effect is measured. Experimental studies looked at the effect of context on the difference in reading time after valid versus invalid preview (Schotter et al., 2015; Veldre & Andrews, 2018). This may reveal a context effect not on recognition, but at a later stage (e.g., priming between context, preview and the foveal word). Arguably, these yield different predictions: if context affects recognition, it may allow identification of otherwise unidentifiable words; but if the interaction occurs later, it may only amplify processing of recognisable words. Constructing a model that formally reconciles this discrepancy is an interesting challenge for future work.

Given that readers use both prediction and preview, why doesn't contextual prediction inform preview? One explanation stems from time constraints imposed by eye movements.
Given that readers on average only look at a word for some 250 ms, in which they have to recognise the foveal word and process the parafoveal percept, this perhaps leaves too little time to fully let the foveal word and context inform the parafoveal preview. On the other hand, word recognition based on partial input also occurs in speech perception, under significant time constraints. But despite those constraints, sentence context does influence auditory word recognition (McClelland & Elman, 1986; Zwitserlood, 1989), a process best modelled by a contextual prior (i.e., the opposite of what we find here; Brodbeck et al., 2022; Heilbron et al., 2022). Therefore, rather than being related to time constraints per se, it might also be related to the underlying circuitry: more precisely, to the fact that, contrary to auditory word recognition, visual word recognition is a laboriously acquired skill that occurs in areas of the visual system that are repurposed (not evolved) for reading (Dehaene, 2009; Yeatman & White, 2021). Global sentence context might therefore be able to dynamically influence the recognition of speech sounds in temporal cortex, but not that of words in visual cortex; there, context effects might be confined to simpler, more local context, like lexical context effects on letter perception (Heilbron et al., 2020; Reicher, 1969; Wheeler, 1970; Woolnough et al., 2021).

In conclusion, we have found that two important contextual sources of information about next-word identity in reading, linguistic prediction and parafoveal preview, strongly drive variation in reading times, but hardly affect word skipping, which is largely based on low-level factors. Our results show that as readers, we do not always use all information available to us; and that we are, in a sense, of two minds: consulting complex inferences to decide how long to look at a word, while employing semi-mindless scanning routines to decide where to look next. It is striking that these disparate strategies operate mostly in harmony. Only occasionally do they go out of step—then we notice that our eyes have moved too far and we have to look back, back to where our eyes left cognition behind.

METHODS

We analysed eye-tracking data from three large naturalistic reading corpora, in which native English speakers read texts while eye movements were recorded (Cop et al., 2017; Kennedy, 2003; Luke & Christianson, 2016).

Stimulus Materials

We considered the English-native portions of the Dundee, Geco and Provo corpora. The Dundee corpus comprises eye movements from 10 native speakers from the UK (Kennedy, 2003), who read a total of 56,212 words across 20 long articles from The Independent newspaper. Secondly, the English portion of the Ghent Eye-tracking Corpus (Geco; Cop et al., 2017) is a collection of eye movement data from 14 UK English speakers who each read Agatha Christie's The Mysterious Affair at Styles in full (54,364 words per participant). Lastly, the Provo corpus (Luke & Christianson, 2018) is a collection of eye movement data from 84 US English speakers, who each read a total of 55 paragraphs (extracted from diverse sources), for a total of 2,689 words.

Eye Tracking Apparatus and Procedure

In all datasets, eye movements were recorded monocularly, from the right eye. In Geco and Provo, recordings were made using an EyeLink 1000 (SR Research, Canada) with
a spatial resolution of 0.01° and a temporal resolution of 1000 Hz. For Dundee, a Dr. Bouis oculometer (Dr. Bouis, Karlsruhe, Germany), with a spatial resolution of <0.1° and a temporal resolution of 1000 Hz, was used. To minimize head movement, participants' heads were stabilised with a chinrest (Geco, Provo) or a bite bar (Dundee). In each experiment, texts were presented in 'screens' with either five lines (Dundee) or one paragraph per screen (Geco and Provo), presented using a font size of 0.33° per character. Each screen began with a fixation mark (gaze trigger) that was replaced by the initial word when stable fixation was achieved. In all datasets, a 9-point calibration was performed prior to the recording. In the longer experiments, a recalibration was performed every three screens (Dundee), or either every 10 minutes or whenever the drift correction exceeded 0.5° (Geco). For Dundee and Provo, the order of the different texts was randomized across participants. In Geco, the entire novel was read from start to finish with breaks between chapters, during which participants answered comprehension questions.

For each corpus, the x, y values per fixation were converted into a word-by-word format. In Dundee, raw x, y values were smoothed by rounding to single-character precision. In Geco and Provo, raw x, y values for each within-word or within-letter fixation were preserved and available for each word. Across the three datasets we redefined the bounding boxes around each word, such that they subtended the area between the first and the last character of the word, with the boundary set halfway to the neighbouring character (i.e., halfway between the word and the character before or after it). Punctuation before or after the word was left out, and words for which the bounding box was inconsistently defined were ignored. For distributions of saccade and fixation data, see Figures A.5–A.7.

Language Model

Contextual predictions were formalised using a language model—a model computing the probability of each word given the preceding words. Here, we used GPT-2 (XL), currently among the best publicly released English language models. GPT-2 is a transformer-based model that, in a single pass, turns a sequence of tokens U = (u_1, …, u_k) into a sequence of conditional probabilities, (p(u_1), p(u_2 | u_1), …, p(u_k | u_1, …, u_{k−1})).

Roughly, this happens in three steps: first, an embedding encodes the sequence of symbolic tokens as a sequence of vectors, which form the first hidden state h_0. Then, a stack of n transformer blocks each applies a series of operations, resulting in a new set of hidden states h_l for each block l. Finally, a (log-)softmax layer is applied to compute (log-)probabilities over target tokens. In other words, the model can be summarised as follows:

h_0 = U W_e + W_p    (1)
h_l = transformer_block(h_{l−1})   ∀ l ∈ [1, n]    (2)
P(u) = softmax(h_n W_e^T),    (3)

where W_e is the token embedding and W_p is the position embedding.

The key component of the transformer block is masked multi-headed self-attention. This transforms a sequence of input vectors (x_1, x_2, …, x_k) into a sequence of output vectors (y_1, y_2, …, y_k). Fundamentally, each output vector y_i is simply a weighted average of the input vectors: y_i = Σ_{j=1}^{k} w_{ij} x_j. Critically, the weight w_{ij} is not a parameter, but is derived from a
dot product between the input vectors, x_i^T x_j, passed through a softmax and scaled by a constant 1/√d_k determined by the dimensionality d_k: w_{ij} = exp(x_i^T x_j) / Σ_{j'=1}^{k} exp(x_i^T x_{j'}). Because this is done for each position, each input vector x is used in three ways: first, to derive the weights for its own output, y_i (as the query); second, to derive the weight for any other output y_j (as the key); finally, it is used in the weighted sum (as the value). Different linear transformations are applied to the vectors in each case, resulting in Query, Key and Value matrices (Q, K, V). Putting this all together, we obtain:

self_attention(Q, K, V) = softmax(Q K^T / √d_k) V.    (4)

To be used as a language model, two elements are added. First, to make the operation position-sensitive, a position embedding W_p is added in the embedding step (Equation 1). Second, to enforce that the model only uses information from the past, attention from future vectors is masked out. To give the model more flexibility, each transformer block contains multiple instances ('heads') of the self-attention mechanism from Equation 4. In total, GPT-2 (XL) contains n = 48 blocks, with 12 heads each, a dimensionality of d = 1600, and a context window of k = 1024 tokens, yielding a total of 1.5 × 10⁹ parameters. We used the PyTorch implementation of GPT-2 from the Transformers package (Wolf et al., 2020).

One complication of deriving word probabilities from GPT-2 is that it doesn't operate on words but on tokens. Tokens can be whole words (as with most common words) or sub-words. To derive word probabilities, we take the token probability for a single-token word, and the joint probability for words spanning multiple tokens, as is standard practice in psycholinguistics (Pimentel et al., 2022; Shain et al., 2022; Wilcox et al., 2020). However, because GPT-2 marks word boundaries (i.e., spaces) at the beginning of a token, the 'end of word' decision is technically made at the next token. Defining word probabilities via the (joint) constituent token probabilities doesn't take this into account; it will therefore, on average, slightly overestimate word probabilities (underestimate surprisal). However, this slight underestimation is not likely to affect any of our conclusions, since our regression analyses primarily depend on differences between the relative predictabilities of different words, and since GPT-2, when used in this way, has been shown to yield probabilities that predict reading behaviour very well compared to other language models that do use whole-word tokenisation (Wilcox et al., 2020), and even compared to larger models like GPT-3 (Shain et al., 2022).

We chose GPT-2 because it is a high-quality language model (measured in perplexity on English texts), and better language models generally predict reading behaviour better (Goodkind & Bicknell, 2018; Wilcox et al., 2020). Our analysis does not depend on GPT-2 specifically; it could be swapped for any similarly high-quality language model, much in the same way as we do not believe our results to be specific to the exact lexical frequency estimates we used (see Discussion).

Ideal Observer

To compute parafoveal identifiability, we implemented an ideal observer based on the formalism of Duan and Bicknell (2020). This model formalises parafoveal word identification using Bayesian inference and builds on previous, well-established 'Bayesian Reader' models (Bicknell & Levy, 2010; Norris, 2006, 2009). It computes the probability of the next word given
It computes the probability of the next word given OPEN MIND: Discoveries in Cognitive Science 773 Lexical Processing Affects Reading Times Not Skipping Heilbron et al. a noisy percept by combining a prior over possible words with a likelihood of the noisy percept, given a word identity: pwðÞ jI ∝pwðÞpðÞ Ijw ; (5) where I represents the noisy visual input, and w represents a word identity. We considered two priors (see Figure 6): a non-contextual prior (the overall probability of words in English based on their frequency in Subtlex (Brysbaert & New, 2009), and a contextual prior based on GPT2 (see below). Below we describe how visual information is represented and percep- tual inference is performed. For a graphical schematic of the model, see Figure A.2; for some distinctive simulations showing how the model captures key effects of linguistic and visual characteristics on word recognition, see Figure A.3. Sampling Visual Information. Like in other Bayesian Readers (Bicknell & Levy, 2010; Norris, 2006, 2009), noisy visual input is accumulated by sampling from a multivariate Gaussian which is centred on a one-hot ‘true’ letter vector—here representedinanuncased −1/2 26-dimensional encoding—with a diagonal covariance matrix (ε)= λ() I. The shape of is thus scaled by the sensory quality λ(ε) for a letter at eccentricity ε. Sensory quality is com- puted as a function of the perceptual span: this uses a Gaussian integral based follows the perceptual span or processing rate function from the SWIFT model (Engbert et al., 2005). Spe- cifically, for a letter at eccentricity ε, λ is given by the integral within the bounding box of the letter: εþ:5 1 x λεðÞ ¼ pffiffiffiffiffiffiffiffiffiffiffi exp − dx; (6) 2σ 2πσ ε−:5 which, following Bicknell and Levy (2010) and Duan and Bicknell (2020), is scaled by a scal- ing factor Λ. Unlike SWIFT, the Gaussian in Equation 6 is symmetric, since we only perform inference on information about the next word. By using one-hot encoding and a diagonal covariance matrix, the ideal observer ignores similarity structure between letters. This is clearly a simplification, but one with significant computational benefits; moreover, it is a simplifica- tion shared by all Bayesian Reader-like models (Bicknell & Levy, 2010; Duan & Bicknell, 2020; Norris, 2006), which can nonetheless capture many important aspects of visual word recognition and reading. To determine parameters Λ and σ, we performed a grid search on a subset of Dundee and Geco (see Figure A.4), resulting in Λ = 1 and σ = 3. Note that this σ value is close to the average σ value of SWIFT and (3.075) and corresponds well to prior literature on the size of the perceptual span (±15 characters; Bicknell & Levy, 2010; Engbert et al., 2005; Schotter et al., 2012). Perceptual Inference. Inference is performed over the full vocabulary. This is represented as a matrix which can be seen as a stack of word vectors, y , y , …, y , obtained by concatenating 1 2 v the letter vectors. The vocabulary is thus a V × d matrix, with V the number of words in the vocabulary and d the dimensionality of the word vectors (determined by the length of the longest word: d =26× l ). max To perform inference, we use the belief-updating scheme from Duan and Bicknell (2020), in (t) which the posterior at sample t is expressed as a (V − 1) dimensional log-odds vector x ,in ðÞ t which each entry x represents the log-odds of y relative to the final word y . 
In this formulation, the initial value of x_i is thus simply the prior log odds, x_i^(0) = log p(w_i) − log p(w_V), and updating is done by summing the prior log-odds and the log-odds likelihood. This procedure is repeated for T samples, each time taking the posterior of the previous timestep as the prior in the current timestep. Note that using log odds in this way avoids renormalization:

x_i^(t) = log [ p(w_i | I^(0,…,t)) / p(w_V | I^(0,…,t)) ]
        = log [ p(w_i | I^(0,…,t−1)) p(I^(t) | w_i) / ( p(w_V | I^(0,…,t−1)) p(I^(t) | w_V) ) ]
        = log [ p(w_i | I^(0,…,t−1)) / p(w_V | I^(0,…,t−1)) ] + log [ p(I^(t) | w_i) / p(I^(t) | w_V) ]
        = x_i^(t−1) + Δx_i^(t).    (7)

In other words, as visual sample I^(t) comes in, beliefs are updated by summing the prior log odds x_i^(t−1) and the log-odds likelihood of the new information, Δx_i^(t).

For a given word w_i, the log-odds likelihood of each new sample is the difference of two multivariate Gaussian log-likelihoods, one centred on y_i and one on the last vector y_V. This can be formulated as a linear transformation of I:

Δx_i = log p(I | w_i) − log p(I | w_V)
     = log p(I | N(y_i, Σ)) − log p(I | N(y_V, Σ))
     = −(1/2)(I − y_i)^T Σ^{−1} (I − y_i) + (1/2)(I − y_V)^T Σ^{−1} (I − y_V)
     = (y_V^T Σ^{−1} y_V − y_i^T Σ^{−1} y_i)/2 + (y_i − y_V)^T Σ^{−1} I,    (8)

which implies that updating can be implemented by sampling from a multivariate normal. To perform inference on a given word, we performed this sampling scheme until convergence (using T = 50), and then transformed the posterior log-odds into the log posterior, from which we computed the Shannon entropy as a metric of parafoveal identifiability.

To compute the parafoveal entropy for each word in the corpus, we make the simplifying assumption that parafoveal preview only occurs during the last fixation prior to a saccade, thus computing the entropy as a function of the word itself and its distance to the last fixation location within the previously fixated word (which is not always the previous word). Because this distance is different for each participant, it was computed separately for each word, for each participant. Moreover, because the inference scheme is based on sampling, we repeated it 3 times, and averaged the results to compute the posterior entropy of the word. The amount of information obtained from the preview is then simply the difference between prior and posterior entropy. The ideal observer was implemented in custom Python code, and can be found in the data sharing collection (see below).

Contextual vs. Non-Contextual Prior

We considered two observers: one with a non-contextual prior capturing the overall probability of a word in the language, and one with a contextual prior, capturing the probability of a word in a specific context. For the non-contextual prior, we simply used lexical frequencies, from which we computed the (log-)odds prior used in Equation 7. For the contextual prior, we derived the prior from the log-probabilities from GPT-2. This effectively involves constructing a new Bayesian model for each word, for each participant, in each dataset. To simplify this process, we did not take the full predicted distribution of GPT-2, but only the 'nucleus' of the top k predicted words with a cumulative probability of 0.95, and truncated the (less reliable) tail of the distribution.
Further, we simply assumed that the rest of the tail was 'flat' and had a uniform probability. Since the prior odds can be derived from relative frequencies, we can think of the probabilities in the flat tail as having a 'pseudocount' of 1. If we similarly express the prior probabilities in the nucleus as implied 'pseudofrequencies', the cumulative implied nucleus frequency is then complementary to the length of the tail, which is simply the difference between the vocabulary size and the nucleus size (V − k). As such, for word i in the text, we can express the nucleus as implied frequencies as follows:

freqs_ψ = [ (V − k) / (1 − Σ_{j=1}^{k} P(w_j^(i) | context)) ] · P_tr(w^(i) | context),    (9)

where P_tr(w^(i) | context) is the truncated lexical prediction, and P(w_j^(i) | context) is the predicted probability that word i in the text is word j in the sorted vocabulary. Note that using this flat tail not only simplifies the computation, but also deals with the fact that the vocabulary of GPT-2 is smaller than that of the ideal observer—using this tail we can still use the full vocabulary (e.g., to capture orthographic uniqueness effects), while using 95% of the density from GPT-2.

Data Selection

In our analyses, we focus on first-pass reading (i.e., progressive eye movements), analysing only those fixations or skips for which none of the subsequent words had been fixated before. Moreover, we exclude return sweeps (i.e., line transitions), which are very different from within-line saccades. We extensively preprocessed the corpora so that we could include as many words as possible. However, we had to impose some additional restrictions. Specifically, we did not include words if (a) they contained non-alphabetic characters; (b) they were adjacent to blinks; or (c) the distance to the prior fixation location was more than 24 characters (±8); moreover, for gaze durations we excluded (d) words with implausibly short (< 70 ms) or long (> 900 ms) gaze durations. Criterion (c) was chosen because some participants occasionally skipped long sequences of words, up to entire lines or more. Such 'skipping'—indicated by saccades much larger than the perceptual span—is clearly different from the skipping of words during normal reading, and was therefore excluded. Note that these criteria are comparatively mild (cf. Duan & Bicknell, 2020; Smith & Levy, 2013), and leave approximately 1.1 million observations for the skipping analysis and 593,000 reading time observations.

Regression Models: Skipping

Skipping was modelled via logistic regression in scikit-learn (Pedregosa et al., 2011), with three sets of explanatory variables (or 'models'), each formalising a different explanation for why a word might be skipped.

First, a word might be skipped because it could be confidently predicted from context. We formalise this via linguistic entropy, quantifying the information conveyed by the prediction from GPT-2. We used entropy, not (log) probability, because using the next word's probability directly would presuppose that the word is identified, undermining the dissociation of prediction and preview. By contrast, prior entropy specifically probes the information available from prediction only. Secondly, a word might be skipped because it could be identified from a parafoveal preview.
Data Selection

In our analyses, we focus on first-pass reading (i.e., progressive eye movements), analysing only those fixations or skips for which none of the subsequent words had been fixated before. Moreover, we exclude return sweeps (i.e., line transitions), which are very different from within-line saccades. We extensively preprocessed the corpora so that we could include as many words as possible, but we had to impose some additional restrictions. Specifically, we did not include words if they a) contained non-alphabetic characters, b) were adjacent to blinks, or c) were more than 24 characters (±8) from the prior fixation location; moreover, for the gaze duration analysis we d) excluded words with implausibly short (< 70 ms) or long (> 900 ms) gaze durations. Criterion c) was chosen because some participants occasionally skipped long sequences of words, up to entire lines or more. Such 'skipping' (indicated by saccades much larger than the perceptual span) is clearly different from the skipping of words during normal reading, and was therefore excluded. Note that these criteria are comparatively mild (cf. Duan & Bicknell, 2020; Smith & Levy, 2013), and leave approximately 1.1 million observations for the skipping analysis and 593,000 reading time observations.

Regression Models: Skipping

Skipping was modelled via logistic regression in scikit-learn (Pedregosa et al., 2011), with three sets of explanatory variables (or 'models'), each formalising a different explanation for why a word might be skipped.

First, a word might be skipped because it could be confidently predicted from context. We formalised this via linguistic entropy, quantifying the information conveyed by the prediction from GPT-2. We used entropy, not (log) probability, because using the next word's probability directly would presuppose that the word is identified, undermining the dissociation of prediction and preview. By contrast, prior entropy specifically probes the information available from prediction only.

Secondly, a word might be skipped because it could be identified from a parafoveal preview. This was formalised via parafoveal entropy, which quantifies the uncertainty of the parafoveal preview (or, inversely, the amount of information conveyed by the preview). This is a complex function integrating low-level visual information (e.g., decreasing visibility as a function of eccentricity) and higher-level information (e.g., frequency or orthographic effects) and their interaction (see Figure A.3). Here, too, we did not use lexical features (e.g., frequency) of the next word to model skipping directly, as this presupposes that the word is identified; and to the extent that these factors are expected to influence identifiability, this is already captured by the parafoveal entropy (Figure A.3).

Finally, a word might be skipped simply because it is too short and/or too close to the prior fixation location, such that a fixation of average length would overshoot the word. This autonomous oculomotor account was formalised by modelling skipping probability purely as a function of a word's length and its distance to the previous fixation location.

Note that these explanations are not mutually exclusive, so we also evaluated their combinations (see below).

Regression Models: Reading Time

As an index of reading time, we analysed first-pass gaze duration, the sum of a word's first-pass fixation durations. We analyse gaze durations because they arguably most comprehensively reflect how long a word is looked at, and because they are the focus of similar model-based analyses of contextual effects in reading (Goodkind & Bicknell, 2018; Smith & Levy, 2013). For reading times we used linear regression, and again considered three sets of explanatory variables, each formalising a different kind of explanation.

First, a word may be read more slowly because it is unexpected in context. We formalised this using surprisal, −log(p), a metric of a word's unexpectedness, or how much information is conveyed by a word's identity in light of a prior expectation about that identity. To capture spillover (Rayner et al., 2006; Smith & Levy, 2013) we included not just the surprisal of the current word, but also that of the previous two words.

Secondly, a word might be read more slowly because it was difficult to discern from the parafoveal preview. This was formalised using the parafoveal entropy (see above).

Finally, a word might be read more slowly because of non-contextual factors of the word itself. This is an aggregate baseline explanation, aimed at capturing all relevant non-contextual word attributes, which we contrast with the two major contextual sources of information about a word's identity that might affect reading times (prediction and preview). We included word class, length, log-frequency, and the relative landing position (quantified as the distance to word centre, both as a fraction and in characters). For log-frequency we used the UK or US version of SUBTLEX, depending on the corpus, and included the log-frequency of the past two words to capture spillover effects.

The full model was defined as the joint model containing all explanatory variables. For a tabular overview of all explanatory variables, see Tables A.1–A.3.
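To illustrate how these sets of explanatory variables could be organised and fit, here is a schematic sketch using scikit-learn, which the text names for the skipping models. The data frame and column names are hypothetical, word class would require categorical encoding, and the evaluation follows the cross-validation scheme described under Model Evaluation below; for skipping, the paper scores models with McFadden's R², which would require a custom scorer rather than the default shown here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical per-word feature table; column names are illustrative only.
# df = pd.read_csv("word_level_features.csv")

SKIPPING_MODELS = {
    "prediction": ["linguistic_entropy"],
    "preview":    ["parafoveal_entropy"],
    "oculomotor": ["word_length", "distance_to_last_fixation"],
}

READING_TIME_MODELS = {
    "prediction":     ["surprisal", "surprisal_prev1", "surprisal_prev2"],
    "preview":        ["parafoveal_entropy"],
    "non_contextual": ["word_class",  # categorical: would need one-hot encoding
                       "word_length", "log_frequency",
                       "log_frequency_prev1", "log_frequency_prev2",
                       "landing_position_frac", "landing_position_chars"],
}

def cv_score(df, target, columns, classify):
    """Mean cross-validated score for one set of explanatory variables (10 folds)."""
    est = LogisticRegression(max_iter=1000) if classify else LinearRegression()
    # default scoring: R^2 for regression, accuracy for classification
    return cross_val_score(est, df[columns], df[target], cv=10).mean()

# e.g. skipping explained by the oculomotor variables alone:
# cv_score(df, "skipped", SKIPPING_MODELS["oculomotor"], classify=True)
```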
Model Evaluation

We compared the ability of each model to account for the variation in the data by probing prediction performance in a 10-fold cross-validation scheme, in which we quantified how much of the observed variation in skipping rates and gaze durations could be explained. For reading times, we did this using the coefficient of determination, defined via the ratio of residual and total sum of squares: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. This ratio relates the error of the model ($SS_{res}$) to the error of a 'null' model predicting just the mean ($SS_{tot}$), and gives the variance explained. For skipping, we used a closely related metric, the McFadden $R^2$. Like the $R^2$, it is computed by comparing the error of the model to the error of a null model with only an intercept: $R^2_{McF} = 1 - \frac{L}{L_{null}}$, where $L$ indicates the loss. While $R^2$ and $R^2_{McF}$ are not identical, they are formally closely related: critically, both are zero when the prediction is constant (no variation explained) and both approach one proportionally as the error decreases to zero (i.e., towards all variation explained). Note that in a cross-validated setting, both metrics can become negative when the prediction of the model is worse than the prediction of a constant null model.

Variation Partitioning

To assess relative importance, we used variation partitioning to estimate how much explained variation could be attributed to each set of explanatory variables. This is also known as variance partitioning, as it was originally based on partitioning sums of squares; here we use the more general term 'variation', following Legendre (2008).

Variation partitioning builds on the insight that when two (groups of) explanatory variables (A and B) both explain some variation in the data y, and A and B are independent, then the variation explained by combining A and B will be approximately additive. By contrast, when A and B are fully redundant (e.g., when B only has an apparent effect on y through its correlation with A), then a model combining A and B will not explain more than either alone. Following de Heer et al. (2017), we generalise this logic to up to three (sets of) explanatory variables, by testing each individually and in all combinations, using set theory notation and graphical representation for simplicity and clarity.

A two-way partition of two sets of explanatory variables, A and B (Figures 4, A.12, A.13), involves fitting three models: two partial models with each set of features alone (A and B), and a joint model with both (A ∪ B). The unique variation explained by each set (A* and B*) is derived from the difference between the joint model and the other partial model:

$$
\begin{aligned}
A^{*} &= A \setminus B = A \cup B - B \\
B^{*} &= B \setminus A = A \cup B - A,
\end{aligned} \tag{10}
$$

and the intersection is derived from the joint model and the sum of the partial models:

$$
A \cap B = A + B - A \cup B. \tag{11}
$$

For three groups of explanatory variables (A, B, and C), the situation is a bit more complex. We first evaluate each separately and all combinations, resulting in 7 models:

$$
A;\; B;\; C;\; A \cup B;\; A \cup C;\; B \cup C;\; A \cup B \cup C.
$$

From these 7 models we obtain 7 'empirical' scores (of variation explained), from which we derive the 7 'theoretical' partitions: 4 overlap partitions and 3 unique partitions. The first overlap partition is the variation explained by all models, which we can derive as:

$$
A \cap B \cap C = A \cup B \cup C + A + B + C - A \cup B - A \cup C - B \cup C. \tag{12}
$$

The next three overlap partitions contain the pairwise intersections of models, excluding the third model:

$$
\begin{aligned}
(A \cap B) \setminus C &= A + B - A \cup B - A \cap B \cap C \\
(A \cap C) \setminus B &= A + C - A \cup C - A \cap B \cap C \\
(B \cap C) \setminus A &= B + C - B \cup C - A \cap B \cap C.
\end{aligned} \tag{13}
$$

The last three partitions are those explained exclusively by each model. This is the relative complement: the partition unique to A is the relative complement of B ∪ C, written $BC^{RC}$. For simplicity we also use a star notation, indicating the unique partition of A as A*. These are derived as follows:

$$
\begin{aligned}
A^{*} &= BC^{RC} = A \cup B \cup C - B \cup C \\
B^{*} &= AC^{RC} = A \cup B \cup C - A \cup C \\
C^{*} &= AB^{RC} = A \cup B \cup C - A \cup B.
\end{aligned} \tag{14}
$$

Note that, in the cross-validated setting, the results can become paradoxical and depart from what is possible in classical statistical theory, such as partitioning sums of squares. For instance, due to over-fitting, a model that combines multiple explanatory variables could explain less variance than each of them alone, in which case some partitions would become negative. However, following de Heer et al. (2017), we believe that the advantages of using cross-validation outweigh the risk of potentially paradoxical results in some subjects. Partitioning was carried out for each subject, allowing us to statistically assess whether the additional variation explained by a given model was significant. On average, none of the partitions were paradoxical.
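The following is a small sketch of how the three-way partition of Equations 12–14 could be computed from the seven cross-validated scores; the dictionary keys and function name are illustrative, not taken from the released code.

```python
def partition_three(scores):
    """Variation partitioning for three sets of explanatory variables (Equations 12-14).

    `scores` maps each model combination to its cross-validated variation explained,
    e.g. {'A': .., 'B': .., 'C': .., 'AB': .., 'AC': .., 'BC': .., 'ABC': ..}.
    """
    s = scores
    abc = s["ABC"] + s["A"] + s["B"] + s["C"] - s["AB"] - s["AC"] - s["BC"]  # A∩B∩C (Eq. 12)
    ab_only = s["A"] + s["B"] - s["AB"] - abc                                # (A∩B)\C (Eq. 13)
    ac_only = s["A"] + s["C"] - s["AC"] - abc                                # (A∩C)\B
    bc_only = s["B"] + s["C"] - s["BC"] - abc                                # (B∩C)\A
    a_unique = s["ABC"] - s["BC"]                                            # A* (Eq. 14)
    b_unique = s["ABC"] - s["AC"]                                            # B*
    c_unique = s["ABC"] - s["AB"]                                            # C*
    return {"A*": a_unique, "B*": b_unique, "C*": c_unique,
            "(A∩B)\\C": ab_only, "(A∩C)\\B": ac_only, "(B∩C)\\A": bc_only,
            "A∩B∩C": abc}
```

When the three sets of explanatory variables are independent and their contributions additive, all overlap partitions are zero and each unique partition equals the score of that model alone, as the logic above requires.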
Simulating Effect Sizes

Regression-based preview benefits were defined as the expected difference in gaze duration after a preview of average informativeness versus after no preview at all. This best corresponds to an experiment in which the preceding preview was masked (e.g., XXXX) rather than invalid (see Discussion). To compute this, we took the difference in parafoveal entropy between an average preview and the prior entropy. Because we standardised our explanatory variables, this difference was transformed to subject-specific z-scores and then multiplied by the regression weights to obtain an expected effect size.

For the predictability benefit, we computed the expected difference in gaze duration between 'high' and 'low' probability words. 'High' and 'low' were empirically defined based on the human-normed cloze probabilities in Provo (using the ORTHOMATCHMODEL definition for additional granularity; Luke & Christianson, 2018), which we divided into thirds using percentiles. The resulting cutoff points (low < 0.02; high > 0.25) were log-transformed, applied to the surprisal values from GPT-2, and multiplied by the weights to predict effect sizes. Note that these definitions of 'low' and 'high' may appear low compared to those in the literature; however, most studies collect cloze norms only for specific 'target' words in relatively predictable contexts, which biases the definition of 'low' vs. 'high' probability. By contrast, we analysed cloze probabilities for all words, yielding these values.

Statistical Testing

Statistical testing was performed across participants within each dataset. Because two of the three corpora had a low number of participants (10 and 14, respectively), we used data-driven, non-analytical bootstrap t-tests, which involve resampling a null distribution with zero mean (by removing the mean) and counting across bootstraps how often a t-value at least as extreme as the true t-value occurs. Each test used at least 10 bootstraps; p values were computed without assuming symmetry (equal-tail bootstrap; Rousselet et al., 2019). Confidence intervals (in the figures and text) were also based on bootstrapping.
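As an illustration of this testing procedure, the sketch below implements an equal-tail bootstrap t-test against zero along the lines described above (cf. Rousselet et al., 2019). The function name, the number of bootstrap samples, and other details are placeholders rather than the paper's exact settings.

```python
import numpy as np

def bootstrap_t_test(x, n_boot=10_000, seed=0):
    """Equal-tail bootstrap t-test of the hypothesis that the mean of x is zero.

    The null distribution is built by resampling the mean-centred data; the p value
    counts how often a bootstrap t-value is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    t_obs = x.mean() / (x.std(ddof=1) / np.sqrt(n))

    centred = x - x.mean()                      # enforce a zero-mean null distribution
    t_null = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(centred, size=n, replace=True)
        t_null[b] = xb.mean() / (xb.std(ddof=1) / np.sqrt(n))

    # equal-tail p value: no symmetry of the null distribution is assumed
    p_low = (t_null <= t_obs).mean()
    p_high = (t_null >= t_obs).mean()
    return t_obs, 2 * min(p_low, p_high)
```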
ACKNOWLEDGMENTS

We thank Maria Barrett, Yunyan Duan, and Benedikt Ehinger for useful input and inspiring discussions during various stages of this project.

FUNDING INFORMATION

This work was supported by The Netherlands Organisation for Scientific Research (NWO Research Talent grant to M.H.; NWO Vidi 452-13-016 to F.P.d.L.; Gravitation Program Grant Language in Interaction no. 024.001.006 to P.H.) and the European Union Horizon 2020 Program (ERC Starting Grant 678286, "Contextvision" to F.P.d.L.).

AUTHOR CONTRIBUTIONS

Conceptualisation: MH. Data wrangling and preprocessing: JvH. Formal analysis: MH, JvH. Statistical analysis and visualisations: JvH, MH. Supervision: FPdL, PH. Initial draft: MH. Final draft: MH, JvH, PH, FPdL.

DATA AND CODE AVAILABILITY STATEMENT

The Provo and GECO corpora are freely available (Cop et al., 2017; Luke & Christianson, 2018). All additional data and code needed to reproduce the results will be made public on the Donders Repository at https://doi.org/10.34973/kgm8-6z09.

REFERENCES

Andrews, S., & Veldre, A. (2021). Wrapping up sentence comprehension: The role of task demands and individual differences. Scientific Studies of Reading, 25(2), 123–140. https://doi.org/10.1080/10888438.2020.1817028

Angele, B., Laishley, A. E., Rayner, K., & Liversedge, S. P. (2014). The effect of high- and low-frequency previews and sentential fit on word skipping during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(4), 1181–1203. https://doi.org/10.1037/a0036396, PubMed: 24707791

Angele, B., & Rayner, K. (2013). Processing the in the parafovea: Are articles skipped automatically? Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(2), 649–662. https://doi.org/10.1037/a0029294, PubMed: 22799285

Balota, D. A., Pollatsek, A., & Rayner, K. (1985). The interaction of contextual constraints and parafoveal visual information in reading. Cognitive Psychology, 17(3), 364–390. https://doi.org/10.1016/0010-0285(85)90013-1, PubMed: 4053565

Becker, W., & Jürgens, R. (1979). An analysis of the saccadic system by means of double step stimuli. Vision Research, 19(9), 967–983. https://doi.org/10.1016/0042-6989(79)90222-0, PubMed: 532123

Bicknell, K., & Levy, R. (2010). A rational model of eye movement control in reading. In J. Hajič (Ed.), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1168–1178). Association for Computational Linguistics. https://doi.org/10.1037/e520602012-979

Bouma, H., & de Voogd, A. H. (1974). On the control of eye saccades in reading. Vision Research, 14(4), 273–284. https://doi.org/10.1016/0042-6989(74)90077-7, PubMed: 4831591

Brodbeck, C., Bhattasali, S., Cruz Heredia, A. A. L., Resnik, P., Simon, J. Z., & Lau, E. (2022). Parallel processing in speech perception with local and global representations of linguistic context. eLife, 11, Article e72056. https://doi.org/10.7554/eLife.72056, PubMed: 35060904

Brysbaert, M., & Drieghe, D. (2003). Please stop using word frequency data that are likely to be word length effects in disguise. Behavioral and Brain Sciences, 26(4), 479. https://doi.org/10.1017/S0140525X03240103

Brysbaert, M., Drieghe, D., & Vitu, F. (2005). Word skipping: Implications for theories of eye movement control in reading. In G. Underwood (Ed.), Cognitive processes in eye guidance (pp. 53–78). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198566816.003.0003

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977, PubMed: 19897807

Buswell, G. T. (1920). An experimental study of the eye-voice span in reading. University of Chicago.

Clifton, Jr., C., Ferreira, F., Henderson, J. M., Inhoff, A. W., Liversedge, S. P., Reichle, E. D., & Schotter, E. R. (2016). Eye movements in reading and information processing: Keith Rayner's 40 year legacy. Journal of Memory and Language, 86, 1–19. https://doi.org/10.1016/j.jml.2015.07.004

Cop, U., Dirix, N., Drieghe, D., & Duyck, W. (2017). Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, 49(2), 602–615. https://doi.org/10.3758/s13428-016-0734-0, PubMed: 27193157

de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L., & Theunissen, F. E. (2017). The hierarchical cortical organization of human speech processing. Journal of Neuroscience, 37(27), 6539–6557. https://doi.org/10.1523/JNEUROSCI.3267-16.2017, PubMed: 28588065

Dearborn, W. F. (1906). The psychology of reading: An experimental study of the reading pauses and movements of the eye. Columbia University Contributions to Philosophy and Psychology, 4, 1–134.

Dehaene, S. (2009). Reading in the brain: The new science of how we read. Penguin.

Deubel, H., O'Regan, J. K., & Radach, R. (2000). Commentary on section 2—Attention, information processing, and eye movement control. In A. Kennedy, R. Radach, D. Heller, & J. Pynte (Eds.), Reading as a perceptual process (pp. 355–374). North-Holland/Elsevier Science Publishers. https://doi.org/10.1016/B978-008043642-5/50017-6

Duan, Y., & Bicknell, K. (2020). A rational model of word skipping in reading: Ideal integration of visual and linguistic information. Topics in Cognitive Science, 12(1), 387–401. https://doi.org/10.1111/tops.12485, PubMed: 31823454

Ehrlich, S. F., & Rayner, K. (1981). Contextual effects on word perception and eye movements during reading. Journal of Verbal Learning and Verbal Behavior, 20(6), 641–655. https://doi.org/10.1016/S0022-5371(81)90220-6

Engbert, R., & Kliegl, R. (2003). The game of word skipping: Who are the competitors? Behavioral and Brain Sciences, 26(4), 481–482. https://doi.org/10.1017/S0140525X03270102

Engbert, R., Nuthmann, A., Richter, E. M., & Kliegl, R. (2005). SWIFT: A dynamical model of saccade generation during reading. Psychological Review, 112(4), 777–813. https://doi.org/10.1037/0033-295X.112.4.777, PubMed: 16262468

Findlay, J. M., & Walker, R. (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behavioral and Brain Sciences, 22(4), 661–721. https://doi.org/10.1017/S0140525X99002150, PubMed: 11301526

Frank, S. L., Fernandez Monsalve, I., Thompson, R. L., & Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45(4), 1182–1190. https://doi.org/10.3758/s13428-012-0313-y, PubMed: 23404612

Goodkind, A., & Bicknell, K. (2018). Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018) (pp. 10–18). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-0102

Goodman, K. S. (1967). Reading: A psycholinguistic guessing game. Journal of the Reading Specialist, 6(4), 126–135. https://doi.org/10.1080/19388076709556976

Hahn, M., & Keller, F. (2023). Modeling task effects in human reading with neural network-based attention. Cognition, 230, Article 105289. https://doi.org/10.1016/j.cognition.2022.105289, PubMed: 36208565

Hanes, D. P., & Schall, J. D. (1995). Countermanding saccades in macaque. Visual Neuroscience, 12(5), 929–937. https://doi.org/10.1017/S0952523800009482, PubMed: 8924416

Heilbron, M., Armeni, K., Schoffelen, J.-M., Hagoort, P., & de Lange, F. P. (2022). A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences, 119(32), Article e2201968119. https://doi.org/10.1073/pnas.2201968119, PubMed: 35921434

Heilbron, M., Richter, D., Ekman, M., Hagoort, P., & de Lange, F. P. (2020). Word contexts enhance the neural representation of individual letters in early visual cortex. Nature Communications, 11(1), Article 321. https://doi.org/10.1038/s41467-019-13996-4, PubMed: 31949153

Hohenstein, S., & Kliegl, R. (2014). Semantic preview benefit during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(1), 166–190. https://doi.org/10.1037/a0033670, PubMed: 23895448

Inhoff, A. W. (1984). Two stages of word processing during eye fixations in the reading of prose. Journal of Verbal Learning and Verbal Behavior, 23(5), 612–624. https://doi.org/10.1016/S0022-5371(84)90382-7

Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87(4), 329–354. https://doi.org/10.1037/0033-295X.87.4.329, PubMed: 7413885

Kennedy, A. (2003). The Dundee Corpus [CD-ROM]. Psychology Department, University of Dundee.

Kennedy, A., Pynte, J., Murray, W. S., & Paul, S.-A. (2013). Frequency and predictability effects in the Dundee Corpus: An eye movement analysis. Quarterly Journal of Experimental Psychology, 66(3), 601–618. https://doi.org/10.1080/17470218.2012.676054, PubMed: 22643118

Kliegl, R., Grabner, E., Rolfs, M., & Engbert, R. (2004). Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology, 16(1–2), 262–284. https://doi.org/10.1080/09541440340000213

Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135(1), 12–35. https://doi.org/10.1037/0096-3445.135.1.12, PubMed: 16478314

Kliegl, R., Risse, S., & Laubrock, J. (2007). Preview benefit and parafoveal-on-foveal effects from word n + 2. Journal of Experimental Psychology: Human Perception and Performance, 33(5), 1250–1255. https://doi.org/10.1037/0096-1523.33.5.1250, PubMed: 17924820

Lee, T. S., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7), 1434–1448. https://doi.org/10.1364/JOSAA.20.001434, PubMed: 12868647

Legendre, P. (2008). Studying beta diversity: Ecological variation partitioning by multiple regression and canonical analysis. Journal of Plant Ecology, 1(1), 3–8. https://doi.org/10.1093/jpe/rtm001

Luke, S. G., & Christianson, K. (2016). Limits on lexical prediction during reading. Cognitive Psychology, 88, 22–60. https://doi.org/10.1016/j.cogpsych.2016.06.002, PubMed: 27376659

Luke, S. G., & Christianson, K. (2018). The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods, 50(2), 826–833. https://doi.org/10.3758/s13428-017-0908-4, PubMed: 28523601

McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1–86. https://doi.org/10.1016/0010-0285(86)90015-0, PubMed: 3753912

McClelland, J. L., & O'Regan, J. K. (1981). Expectations increase the benefit derived from parafoveal visual information in reading words aloud. Journal of Experimental Psychology: Human Perception and Performance, 7(3), 634–644. https://doi.org/10.1037/0096-1523.7.3.634

McConkie, G. W., Kerr, P. W., Reddix, M. D., & Zola, D. (1988). Eye movement control during reading: I. The location of initial eye fixations on words. Vision Research, 28(10), 1107–1118. https://doi.org/10.1016/0042-6989(88)90137-X, PubMed: 3257013

McConkie, G. W., & Rayner, K. (1975). The span of the effective stimulus during a fixation in reading. Perception & Psychophysics, 17(6), 578–586. https://doi.org/10.3758/BF03203972

Morton, J. (1964). The effects of context upon speed of reading, eye movements and eye-voice span. Quarterly Journal of Experimental Psychology, 16(4), 340–354. https://doi.org/10.1080/17470216408416390

Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2), 327–357. https://doi.org/10.1037/0033-295X.113.2.327, PubMed: 16637764

Norris, D. (2009). Putting it all together: A unified account of word recognition and reaction-time distributions. Psychological Review, 116(1), 207–219. https://doi.org/10.1037/a0014259, PubMed: 19159154

O'Regan, J. K. (1980). The control of saccade size and fixation duration in reading: The limits of linguistic control. Perception & Psychophysics, 28(2), 112–117. https://doi.org/10.3758/BF03204335, PubMed: 7432983

O'Regan, J. K. (1992). Optimal viewing position in words and the strategy-tactics theory of eye movements in reading. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading (pp. 333–354). Springer. https://doi.org/10.1007/978-1-4612-2852-3_20

Pan, Y., Frisson, S., & Jensen, O. (2021). Neural evidence for lexical parafoveal processing. Nature Communications, 12(1), Article 5234. https://doi.org/10.1038/s41467-021-25571-x, PubMed: 34475391

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Pimentel, T., Meister, C., Wilcox, E. G., Levy, R., & Cotterell, R. (2022). On the effect of anticipation on reading times. arXiv:2211.14301. https://doi.org/10.48550/arXiv.2211.14301

Pynte, J., & Kennedy, A. (2006). An influence over eye movements in reading exerted from beyond the level of the word: Evidence from reading English and French. Vision Research, 46(22), 3786–3801. https://doi.org/10.1016/j.visres.2006.07.004, PubMed: 16938333

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Rayner, K. (1975). The perceptual span and peripheral cues in reading. Cognitive Psychology, 7(1), 65–81. https://doi.org/10.1016/0010-0285(75)90005-5

Rayner, K. (1977). Visual attention in reading: Eye movements reflect cognitive processes. Memory & Cognition, 5(4), 443–448. https://doi.org/10.3758/BF03197383, PubMed: 24203011

Rayner, K. (1979). Eye guidance in reading: Fixation locations within words. Perception, 8(1), 21–30. https://doi.org/10.1068/p080021, PubMed: 432075

Rayner, K. (2009). Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62(8), 1457–1506. https://doi.org/10.1080/17470210902816461, PubMed: 19449261

Rayner, K., Juhasz, B. J., & Brown, S. J. (2007). Do readers obtain preview benefit from word N + 2? A test of serial attention shift versus distributed lexical processing models of eye movement control in reading. Journal of Experimental Psychology: Human Perception and Performance, 33(1), 230–245. https://doi.org/10.1037/0096-1523.33.1.230, PubMed: 17311490

Rayner, K., & Pollatsek, A. (1987). Eye movements in reading: A tutorial review. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading (pp. 327–362). Lawrence Erlbaum Associates, Inc.

Rayner, K., Reichle, E. D., Stroud, M. J., Williams, C. C., & Pollatsek, A. (2006). The effect of word frequency, word predictability, and font difficulty on the eye movements of young and older readers. Psychology and Aging, 21(3), 448–465. https://doi.org/10.1037/0882-7974.21.3.448, PubMed: 16953709

Rayner, K., & Well, A. D. (1996). Effects of contextual constraint on eye movements in reading: A further examination. Psychonomic Bulletin & Review, 3(4), 504–509. https://doi.org/10.3758/BF03214555, PubMed: 24213985

Reicher, G. M. (1969). Perceptual recognition as a function of meaningfulness of stimulus material. Journal of Experimental Psychology, 81(2), 275–280. https://doi.org/10.1037/h0027768, PubMed: 5811803

Reichle, E. D., Rayner, K., & Pollatsek, A. (2003). The E-Z reader model of eye-movement control in reading: Comparisons to other models. Behavioral and Brain Sciences, 26(4), 445–526. https://doi.org/10.1017/S0140525X03000104, PubMed: 15067951

Reilly, R. G., & O'Regan, J. K. (1998). Eye movement control during reading: A simulation of some word-targeting strategies. Vision Research, 38(2), 303–317. https://doi.org/10.1016/S0042-6989(97)87710-3, PubMed: 9536356

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2019). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. PsyArXiv. https://doi.org/10.31234/osf.io/h8ft7

Schall, J. D., & Cohen, J. Y. (2011). The neural basis of saccade target selection. In S. Liversedge, I. Gilchrist, & S. Everling (Eds.), The Oxford handbook of eye movements (pp. 357–381). Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199539789.013.0019

Schotter, E. R., Angele, B., & Rayner, K. (2012). Parafoveal processing in reading. Attention, Perception, & Psychophysics, 74(1), 5–35. https://doi.org/10.3758/s13414-011-0219-2, PubMed: 22042596

Schotter, E. R., Lee, M., Reiderman, M., & Rayner, K. (2015). The effect of contextual constraint on parafoveal processing in reading. Journal of Memory and Language, 83, 118–139. https://doi.org/10.1016/j.jml.2015.04.005, PubMed: 26257469

Shain, C., Meister, C., Pimentel, T., Cotterell, R., & Levy, R. (2022). Large-scale evidence for logarithmic effects of word predictability on reading time. PsyArXiv. https://doi.org/10.31234/osf.io/4hyna

Smith, N. J., & Levy, R. (2013). The effect of word predictability on reading time is logarithmic. Cognition, 128(3), 302–319. https://doi.org/10.1016/j.cognition.2013.02.013, PubMed: 23747651

Staub, A. (2015). The effect of lexical predictability on eye movements in reading: Critical review and theoretical interpretation. Language and Linguistics Compass, 9(8), 311–327. https://doi.org/10.1111/lnc3.12151

Tiffin-Richards, S. P., & Schroeder, S. (2015). Children's and adults' parafoveal processes in German: Phonological and orthographic effects. Journal of Cognitive Psychology, 27(5), 531–548. https://doi.org/10.1080/20445911.2014.999076

Veldre, A., & Andrews, S. (2018). Parafoveal preview effects depend on both preview plausibility and target predictability. Quarterly Journal of Experimental Psychology, 71(1), 64–74. https://doi.org/10.1080/17470218.2016.1247894, PubMed: 27734767

Vitu, F., O'Regan, J. K., Inhoff, A. W., & Topolski, R. (1995). Mindless reading: Eye-movement characteristics are similar in scanning letter strings and reading texts. Perception & Psychophysics, 57(3), 352–364. https://doi.org/10.3758/BF03213060, PubMed: 7770326

Vitu, F., O'Regan, J. K., & Mittau, M. (1990). Optimal landing position in reading isolated words and continuous text. Perception & Psychophysics, 47(6), 583–600. https://doi.org/10.3758/BF03203111, PubMed: 2367179

Wheeler, D. D. (1970). Processes in word recognition. Cognitive Psychology, 1(1), 59–85. https://doi.org/10.1016/0010-0285(70)90005-8

Wilcox, E. G., Gauthier, J., Hu, J., Qian, P., & Levy, R. (2020). On the predictive power of neural language models for human real-time comprehension behavior. arXiv:2006.01912. https://doi.org/10.48550/arXiv.2006.01912

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., … Rush, A. M. (2020). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771. https://doi.org/10.48550/arXiv.1910.03771

Woolnough, O., Donos, C., Rollo, P. S., Forseth, K. J., Lakretz, Y., Crone, N. E., Fischer-Baum, S., Dehaene, S., & Tandon, N. (2021). Spatiotemporal dynamics of orthographic and lexical processing in the ventral visual pathway. Nature Human Behaviour, 5(3), 389–398. https://doi.org/10.1038/s41562-020-00982-w, PubMed: 33257877

Yan, M., Kliegl, R., Shu, H., Pan, J., & Zhou, X. (2010). Parafoveal load of word N + 1 modulates preprocessing effectiveness of word N + 2 in Chinese reading. Journal of Experimental Psychology: Human Perception and Performance, 36(6), 1669–1676. https://doi.org/10.1037/a0019329, PubMed: 20731511

Yeatman, J. D., & White, A. L. (2021). Reading: The confluence of vision and language. Annual Review of Vision Science, 7, 487–517. https://doi.org/10.1146/annurev-vision-093019-113509, PubMed: 34166065

Zwitserlood, P. (1989). The locus of the effects of sentential-semantic context in spoken-word processing. Cognition, 32(1), 25–64. https://doi.org/10.1016/0010-0277(89)90013-9, PubMed: 2752705
