Does insufficient sleep affect how you learn from reward or punishment? Reinforcement learning after 2 nights of sleep restriction

Abstract To learn from feedback (trial and error) is essential for all species. Insufficient sleep has been found to reduce the sensitivity to feedback as well as increase reward sensitivity. To determine whether insufficient sleep alters learning from positive and negative feedback, healthy participants (n = 32, mean age 29.0 years, 18 women) were tested once after normal sleep (8 hr time in bed for 2 nights) and once after 2 nights of sleep restriction (4 hr/night) on a probabilistic selection task where learning behaviour was evaluated in three ways: as generalised learning, short‐term win–stay/lose–shift learning strategies, and trial‐by‐trial learning rate. Sleep restriction did not alter the sensitivity to either positive or negative feedback on generalised learning. Also, short‐term win–stay/lose–shift strategies were not affected by sleep restriction. Similarly, results from computational models that assess the trial‐by‐trial update of stimuli value demonstrated no difference between sleep conditions after the first block. However, a slower learning rate from negative feedback when evaluating all learning blocks was found after sleep restriction. Despite a marked increase in sleepiness and slowed learning rate for negative feedback, sleep restriction did not appear to alter strategies and generalisation of learning from positive or negative feedback.


Participants
To measure visual and auditory acuity we used Snellen's visual acuity evaluation (Snellen, 1862) and a computerized whispered voice test (Pirozzo, 2003). The wash-out period between the test sessions was 7-8 days for all participants except two, for whom the period was 12 and 17 days. Two participants stood out with wake-up times on the day of the normal sleep test session at around 09:30. To evaluate the influence of delayed wake-up times, and the fact that test times differed between individuals, we ran supplementary analyses including hours awake before the test session (test time − sleep end) as a population-level and group-varying effect (see sections 3.2 and 4). Although hours awake had some influence on the results, keeping the variable constant did not change the conclusion of no meaningful difference between the sleep conditions.

Statistical analyses details
For all statistical analyses we used Stan (Stan Development Team, 2018) via R (R Core Team, 2018) to fit Bayesian generalized linear mixed-effects models (GLMM) using Markov chain Monte Carlo (MCMC) sampling. For all behavioral analyses we used the brms package (Bürkner, 2017).
Weakly informative priors, intended to aid model fitting while carrying little weight in the posterior, were set on the intercept and fixed coefficients as a Student's t-distribution with 3 degrees of freedom, a mean of 0 and a standard deviation of 2.5 (Ghosh et al., 2017), and a Cauchy prior with a scale of 1 was placed on sigma. All model predictors were dummy coded and centered around zero. Posterior distributions, the highest maximum a posteriori probability estimate (MAP) and the 95% highest density interval (HDI) were calculated for each parameter, together with Bayes factors. The MAP and the HDI summarize the peak and the uncertainty of the posterior by indicating the most probable values given the data (Kruschke, 2015). To directly test the hypotheses, we calculated Bayes factors (Makowski et al., 2019): likelihood ratios of the experimental hypothesis over the null (BF10), and the reverse (BF01 = 1/BF10). A region of practical equivalence (ROPE) was used as a proxy for the null hypothesis, with limits set to reflect half of a small effect size (Kruschke, 2018). BF10 > 1 (equivalently, BF01 < 1) indicates evidence for the experimental hypothesis, and BF10 < 1 (BF01 > 1) indicates evidence for the null hypothesis, with the level of evidence considered moderate if above 3 or below 1/3, strong if above 10 or below 1/10, and extreme if above 100 or below 1/100 (Beard et al., 2016).
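As an illustration, the evidence thresholds above can be expressed as a small classifier. This is our own sketch in Python (function and label names are ours), not part of the analysis code:

```python
def evidence_label(bf10):
    """Classify a Bayes factor into the evidence levels described above.

    BF10 > 1 favours the experimental hypothesis, BF10 < 1 the null;
    thresholds follow the moderate (3), strong (10), extreme (100) cut-offs.
    """
    bf01 = 1.0 / bf10                      # support for the null hypothesis
    strongest = max(bf10, bf01)            # whichever hypothesis is favoured
    if strongest >= 100:
        level = "extreme"
    elif strongest >= 10:
        level = "strong"
    elif strongest >= 3:
        level = "moderate"
    else:
        level = "anecdotal"                # below the moderate cut-off
    favoured = "experimental" if bf10 > 1 else "null"
    return favoured, level
```

For example, a BF10 of 0.017 (BF01 ≈ 58.8) would be classified as strong evidence for the null hypothesis.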

Sleepiness and stress
To analyze sleepiness according to the Karolinska Sleepiness Scale (KSS; Åkerstedt and Gillberg, 1990) and self-rated stress (see Table S1 for observed data), we fitted a Bayesian GLMM with a Gaussian link function. Priors were set to a Student's t-distribution with df = 3, mu = 0 and sigma = 2.5 on the intercept and slope. A Cauchy prior with location = 0 and scale = 1 was used on the SD. We used 4 chains with 4,000 iterations (including 1,000 for warm-up). Diagnostics indicated no divergent transitions and no parameters exceeding the Rhat limit (< 1.1). The model was fitted with KSS and stress as dependent variables, respectively, sleep (centered) as a within-participant predictor and a varying intercept for each participant. Model estimates were drawn from the posterior distributions (Table S2).

Number of blocks
To leave the learning phase and proceed to the test phase, the participants had to reach the learning criteria for all the symbol pairs after a block (≥ 65% A choices for A/B trials, ≥ 60% C choices for C/D trials and ≥ 40% E choices for E/F trials). However, to reduce the time on task we restricted the maximum number of learning blocks to 6. After normal sleep, 2 out of 32 individuals did not reach the criteria and after sleep restriction 6 out of 32 individuals did not reach the criteria. The mean number of blocks needed to reach the learning criteria (excluding the individuals that failed to reach the criteria) was 1.97 ± 1.07 after normal sleep and 1.85 ± 1.41 after sleep restriction.
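The block-level criteria can be sketched as a simple check (a hypothetical helper; the pair labels and accuracy thresholds are those stated above):

```python
# Proportion of optimal choices required per symbol pair to pass a block,
# as described in the text (≥ 65% A, ≥ 60% C, ≥ 40% E).
CRITERIA = {"AB": 0.65, "CD": 0.60, "EF": 0.40}

def passed_block(choices):
    """Return True if every pair meets its accuracy criterion.

    `choices` maps each pair label to the proportion of optimal
    choices (A, C or E respectively) made in that block.
    """
    return all(choices[pair] >= threshold
               for pair, threshold in CRITERIA.items())
```

A participant failing any single pair (e.g., 64% A choices) would repeat the learning block, up to the six-block maximum.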
To analyze the number of blocks completed in the learning phase (see Table S3 for observed data) we fitted a Bayesian generalized linear censored model (Hilbe, 2011) with a Poisson family log-link function. Censoring was applied because the maximum number of blocks was limited to 6, so the true number of blocks needed to reach the learning criteria was unknown for those who did not pass. Priors were set to a Student's t-distribution with df = 3, mu = 0 and sigma = 2.5 on the intercept and slope. A Cauchy prior with location = 0 and scale = 1 was used on the SD. We used 4 chains with 4,000 iterations (including 1,000 for warm-up). Diagnostics indicated no divergent transitions and no parameters exceeding the Rhat limit (< 1.1). After comparing different models, the best-fitting model included number of blocks as the dependent variable with passing the criteria as the censoring variable, sleep (centered) as a within-participant predictor, order as a between-participant covariate, and a varying intercept and slope for each participant by sleep condition. Model estimates were drawn from the posterior distributions (Table S4 and Figure S1). In addition, we analyzed whether the number of individuals passing the criteria differed, using a Bayesian GLMM with a Bernoulli logit link function with the same priors and MCMC settings as in the number-of-blocks model. The posterior distributions predicted a decrease in the probability of passing the criteria after sleep restriction (Table S4 and Figure S1).
Figure S1. Boxplot with observed number of blocks until reaching the learning criteria and bar plot of percent passing the criteria per sleep condition.
Histograms show posterior distributions of the difference between the sleep conditions, with highest density intervals (HDI; thick black horizontal line), highest maximum a posteriori probability estimates (MAP; grey solid vertical line) and the regions of practical equivalence (ROPE; red shading) including zero (dotted line), supporting no meaningful difference between the sleep conditions. Bars above the histograms show Bayes factors, with the level of support for either hypothesis (BF10: red; BF01: grey) indicated by the length of the bar; black vertical lines mark the levels of evidence from moderate (BF > 3) and strong (BF > 10) to very strong (BF > 100).

Win-stay/Lose-shift
Win-stay indicates the tendency to select the same symbol that rendered positive feedback in the previously presented (one trial back in the sequence) symbol pair, and lose-shift the tendency to select the opposite symbol to the one that led to negative feedback in the previous symbol pair. Win-stay and lose-shift were coded as binomial (1, 0) outcome variables, and a generalized linear model with a Bernoulli logit link function was fitted for each outcome. Observed data are presented in Table S5.
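A minimal sketch of this coding, simplified to a sequence of trials from a single symbol pair (in the task the rule is applied one trial back in the full sequence across pairs; all names are ours):

```python
def code_win_stay_lose_shift(trials):
    """Code trials (from the second onward) as win-stay / lose-shift.

    `trials` is a sequence of (choice, reward) tuples, with reward 1 for
    positive and 0 for negative feedback. Returns two lists of 0/1
    outcomes, as would feed the Bernoulli models described below.
    """
    win_stay, lose_shift = [], []
    for (prev_choice, prev_reward), (choice, _) in zip(trials, trials[1:]):
        if prev_reward == 1:                       # previous trial was a win
            win_stay.append(int(choice == prev_choice))
        else:                                      # previous trial was a loss
            lose_shift.append(int(choice != prev_choice))
    return win_stay, lose_shift
```

For instance, repeating a rewarded choice codes as win-stay = 1, while switching after negative feedback codes as lose-shift = 1.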
Priors were set to a Student's t-distribution with df = 3, mu = 0 and sigma = 2.5 on the intercept and slope. A Cauchy prior with location = 0 and scale = 1 was used on the SD. We used 4 chains with 4,000 iterations (including 1,000 for warm-up) for the MCMC sampling. Diagnostics indicated no divergent transitions, no parameters exceeding the Rhat limit (< 1.1) and a good fit to the data (Figure S2). The models included win-stay or lose-shift as dependent variables, sleep (centered) as a within-participant predictor, order as a between-participant covariate, and a varying intercept and slope for each participant by sleep condition. Model estimates were drawn from the posterior distributions (Table S6 and Figure S3). Bayes factors were estimated from the log-odds distributions. To evaluate the influence of wake-up time, test time and time awake before the test, we fitted a model accounting for time awake before the test (test time − wake-up time) by adding it as a population-level effect with group-varying slopes for each participant for win-stay and lose-shift.
Keeping the time awake constant had no meaningful impact on the results (see Table S7).

Response times
We also ran an exploratory analysis of response times in the learning phase. First, response times were normalized and centered around zero. We then fitted a generalized linear model with an ex-Gaussian distribution to account for the expected positive skew. Otherwise the procedure was the same as for the win-stay/lose-shift modelling. Figure S4 shows the results, indicating no effect of sleep restriction on response times (BF10 = 0.017, BF01 = 58.59) in the learning phase.

Computational analyses
To estimate trial-by-trial learning behavior we modelled the data from the learning phase using a Q-learning algorithm via rstan (Stan Development Team, 2018). Overall, we followed the analytic procedure described in McCoy et al. (2019). Group and individual means were fitted using a weakly informative prior of a normal distribution with a mean of 0 and a standard deviation of 1, and for standard deviations we used half-Cauchy priors (location = 0, scale = 5). Q-values were initialized at 0.5, assuming equal expected value of the symbols in each pair at the first trial. Inverse-probit-transformed parameter estimates are presented in Table S8, and posterior distributions of the sleep conditions for non-transformed data are visualized in Figure S5. We also compared the model with two learning rates to a model with a single learning rate; for both the first-block data and the all-blocks data, the two-learning-rates model was the better fit (Table S9).
Figure S5. Posterior distributions of computational model data with normal sleep (green) and sleep restriction (yellow). Top panels show data estimated from the first learning block and bottom panels show data estimated from all learning blocks.
Table S9. Model comparison of models with a single learning rate and two learning rates for data derived from the first learning block or all learning blocks using leave-one-out cross-validation (LOO; Vehtari et al., 2017, 2020). Both for the first block and all blocks, the two-alpha model indicates a better fit, as indicated by the difference in ELPD (expected log pointwise predictive density).
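The trial-by-trial update can be sketched as follows. This assumes a standard softmax choice rule and a reward prediction-error update with separate learning rates for positive and negative feedback (a common formulation of this model class; the actual Stan implementation follows McCoy et al., 2019, and may differ in detail, and all parameter names are ours):

```python
import math

def q_learning_trial(q, choice, other, reward, alpha_pos, alpha_neg, beta):
    """One trial of a two-learning-rate Q-learning model.

    `q` maps symbols to their current values (initialised at 0.5);
    positive and negative prediction errors are scaled by separate
    learning rates. Returns the softmax probability of the made choice
    and updates `q` in place.
    """
    # softmax choice probability with inverse temperature beta
    p_choice = 1.0 / (1.0 + math.exp(-beta * (q[choice] - q[other])))
    delta = reward - q[choice]                  # reward prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    q[choice] += alpha * delta                  # value update for the chosen symbol
    return p_choice
```

With both Q-values at 0.5 the first choice is at chance (p = 0.5); after a rewarded choice the chosen symbol's value rises by alpha_pos times the prediction error, making it more likely to be chosen again.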
To further evaluate the predictive quality of the two selected models (with two learning rates) we took the maximum a posteriori probability (MAP) estimates of the parameters (αPositive, αNegative and β) from each individual's posterior distribution and fitted each parameter in a new model, with the parameter estimate as the dependent variable and a sleep × subject interaction as a fixed predictor. For the learning rates we used a beta family with a logit link function, and for β we used a Gaussian link function. From these posterior distributions we sampled 200 datasets and compared each individual's MAP from those datasets with the original model's MAP estimate. All correlations were between 0.70 and 1.00 (Figure S6).

Test phase
In the test phase a correct response was defined as a response made to the more probable winner symbol in a pair. To measure generalized learning, we used the number of choices of symbol A, which gave positive feedback 80% of the time when paired with the more neutral symbols (C, D, E, F) during the learning phase, and the number of choices of the same neutral symbols when paired with symbol B, which gave positive feedback on only 20% of the trials during the learning phase. That is, a correct response was to choose symbol A over C, D, E and F, and to not choose symbol B over C, D, E and F. Observed data are presented in Table S10. We used a binomial outcome (1, 0) and fitted a Bayesian GLMM with a Bernoulli logit link function. Priors were set to a Student's t-distribution with df = 3, mu = 0 and sigma = 2.5 on the intercept and slope. A Cauchy prior with location = 0 and scale = 1 was used on the SD. We used 4 chains with 4,000 iterations (including 1,000 for warm-up) for the MCMC sampling. Diagnostics indicated no divergent transitions, no parameters exceeding the Rhat limit (< 1.1) and a good fit to the data (Figure S8). The model included correct response as the dependent variable; sleep (centered), symbol (centered) and the sleep × symbol interaction as within-participant predictors; order as a between-participant covariate; and a varying intercept and slope for each participant by sleep, symbol and the interaction. Model estimates were drawn from the posterior distributions (Table S11). Bayes factors were estimated from the log-odds distributions (Figure S9). Supplementary analyses excluding the individuals that did not reach the criteria after six blocks did not change the results (Tables S12 and S13).
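The Choose-A/Avoid-B scoring rule above can be sketched as (a hypothetical helper; names are ours):

```python
def score_test_trial(pair, choice):
    """Score one generalisation test trial; returns 1 (correct) or 0.

    `pair` is a tuple of the two presented symbols, one of which is A
    (the 80% winner) or B (the 20% winner); the other is a neutral
    symbol (C, D, E or F).
    """
    if "A" in pair:
        return int(choice == "A")   # Choose-A trials: picking A is correct
    if "B" in pair:
        return int(choice != "B")   # Avoid-B trials: not picking B is correct
    raise ValueError("pair must contain symbol A or B")
```

So, for example, choosing D over B scores 1 (correct avoidance), while choosing B scores 0.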

Response times
We explored response times in the test phase using the same setup as for response times in the learning phase, but included Choose A/Avoid B as a fixed parameter. Response times were normalized and centered around zero, and an ex-Gaussian link function was applied to the generalized linear model. Figure S9 shows the results, and just as in the learning phase there was no effect of sleep restriction on response times. See Table S14 for posterior estimates.
Figure S9. Response times during the test phase, separately for Choose A and Avoid B. Histograms next to the boxplots show posterior distributions of the response time difference between the sleep conditions, with highest density intervals (HDI; thick black horizontal line), highest maximum a posteriori probability estimates (MAP; grey solid vertical line) and the regions of practical equivalence (ROPE; red shading) including zero (dotted line), supporting no meaningful difference between the sleep conditions. Bars above the histograms show Bayes factors, with the level of support for either hypothesis (BF10: red; BF01: grey) indicated by the length of the bar; black vertical lines mark the levels of evidence from moderate (BF > 3), strong (BF > 10) to very strong (BF > 100). Histograms to the right show posterior distributions by sleep condition.