Improving sensory representations using episodic memory

The medial temporal lobe (MTL) is well known to be essential for declarative memory. However, a growing body of research suggests that MTL structures might be involved in perceptual processes as well. Our previous modeling work suggests that sensory representations in cortex influence the accuracy of episodic memory retrieved from the MTL. We adopt that model here to show that, conversely, episodic memory can also influence the quality of sensory representations. We model the effect of episodic memory as (a) repeatedly replaying episodes from memory and (b) recombining episode fragments to form novel sequences that are more informative for learning sensory representations than the original episodes. We demonstrate that the performance in visual discrimination tasks is superior when episodic memory is present and that this difference is due to episodic memory driving the learning of a more optimized sensory representation. We conclude that the MTL can, even if it has only a purely mnemonic function, influence perceptual discrimination indirectly.


| INTRODUCTION
We use episodic memory to remember events that we have experienced ourselves (Tulving, 1972). However, while we may remember the basic events and their sequential relation in an episode, we cannot recall the detailed sensory information that we experienced during the episode. Indeed, experimental studies have found that episodic memory in humans preserves mostly the gist of the experienced episode and few of the details (Koutstaal & Schacter, 1997;Sachs, 1967).
We have suggested previously that this property of the episodic memory system results from the fact that episodes are stored in terms of a higher order representation of sensory input and not the sensory input itself (Cheng, Werning, & Suddendorf, 2016;Fang, Demic, & Cheng, 2018;Fang, Rüther, Bellebaum, Wiskott, & Cheng, 2018), which is compatible with the indexing theory by Teyler and DiScenna (1986). Moreover, this is reminiscent of Tulving's SPI model, which posits that sensory information has to pass through the semantic system before being stored by the episodic system (Tulving, 1995). SPI stands for serial encoding, parallel storage and independent retrieval, which describes the phase-dependent information flow among the perceptual, semantic, and episodic components.
This higher order representation is generated during the processing of sensory information, as the information is high-dimensional and has to be represented by patterns of neural activity in cortex. Due to biophysical constraints (Ganguli & Sompolinsky, 2012), for example, metabolic costs (Lennie, 2003) or the space required for neuronal connections, the dimensionality of the cortical representation is reduced along the stream of sensory processing (Beyeler, Rounds, Carlson, Dutt, & Krichmar, 2019). At every stage of processing the sensory representation has to contain meaningful features that are informative of the content in the input. The nature of these features can be learned from statistical regularities in the input.
For example, consider the sensory representation of the visual system, which has been shown to process information hierarchically (Felleman & Van Essen, 1991). In primary visual cortex, input is represented by simple and complex cells (Hubel & Wiesel, 1959), that is, by the location, orientation, and spatial frequency of elements in view. On the other end of the abstraction spectrum, in inferior temporal areas, the shape and identity of entire objects are represented (Gross, 1992).
The higher-level visual areas directly project to medial temporal lobe (MTL) structures, which are well known to be crucial for episodic memory (Scoville & Milner, 1957; Squire & Zola, 1998). We have previously suggested that episodic memories are best represented by sequences of neural activity patterns (Cheng, 2013). This representational format seems well supported by the hippocampus, which has been shown to be important for storing and retrieving sequences. Specifically, previous modeling work has suggested that the recurrent network in the hippocampal region CA3 encodes sequences during the experience and replays them during retrieval (Bayati et al., 2018; Buhry, Azizi, & Cheng, 2011; Cheng, 2013; Levy, 1996; Lisman, 1999). In summary, we have suggested that a perceptual-semantic representational network in the neocortex provides higher order representations of the sensory information, while the episodic memory trace only stores the gist of scenes and their temporal evolution.
According to the traditional view, the MTL exclusively subserves mnemonic processes (Squire, Stark, & Clark, 2004; Squire & Zola-Morgan, 1991). However, a growing body of results from the last two decades suggests that the MTL may also play a critical role in high-level perception (perceptual-mnemonic hypothesis; Buckley, Booth, Rolls, & Gaffan, 2001; Bussey & Saksida, 2007). Most of these studies apply variants of two basic types of perceptual tasks in lesion and fMRI experiments to test that hypothesis.
1. Discrimination tasks, in which participants have to compare images and judge similarity. Lee et al. applied a morphing technique to generate image pairs with five different similarity levels containing faces, objects, natural scenes, or art (A. C. H. Lee, Fischer, et al., 2005; but see Shrager, Gold, Hopkins, & Squire, 2006). Participants had to decide which of the two images is more similar to a reference image. Patients with broad MTL damage (including hippocampus and perirhinal cortex) were strongly impaired in scene discrimination and less so in face and object discrimination. Patients with specific hippocampal damage were impaired only in scene discrimination, and only slightly.
2. Oddity judgment tasks, in which the participant has to pick the odd-one-out of a number of shown objects. These objects can be simple geometrical shapes, faces, familiar or novel objects, or artificial scenes, often shown from different angles. Buckley et al. (2001) conducted such an oddity judgment study with monkeys and found that subjects with perirhinal cortex lesions were impaired. While Stark and Squire (2000) were not able to replicate this result in humans, A. C. H. Lee, Buckley, et al. (2005) found similar impairments in patients with MTL damage, especially when the stimuli were shown from differing viewing angles.
A different task worth mentioning was used in a study by A. C. H. Lee and Rudebeck (2010), in which participants had to judge whether or not a line drawing of a novel object is geometrically possible. The results show, first, that a patient with broad MTL lesions performed worse on the task than controls. Second, the fixation patterns of the MTL patient, when responding incorrectly, differed from those of controls, and hence the authors concluded that a deficit of visual processing and not a memory deficit is responsible.
Overall, the studies on MTL function in visual perception have been interpreted to suggest that the perirhinal cortex is involved in the visual perception of complex objects and faces by processing complex conjunctions of features, and that the hippocampus is involved in the visual processing of scenes, although there are alternative theories and contradicting evidence.
Based on results of a computational study and a preliminary experiment, it has recently been suggested that episodic memory retrieval is facilitated by an appropriate sensory representation (Fang, Rüther, et al., 2018). We adopt this model and propose that, conversely, episodic memory also leads to more optimized sensory representations. We do not aim to provide a detailed model of the MTL memory system, but focus on how episodic memory can serve to indirectly improve perception. We hypothesize that the sensory representations are initially learned through sensory experience but can be improved further by replaying experiences from memory, perhaps during a process of systems consolidation (Cheng, 2017). Using these sensory representations, we model a simple visual discrimination task and show that, after training with episodic memory, performance is better than without episodic memory. We discuss the modeling results with respect to the studies mentioned above, and explain how the model can account for experimental results.

| METHODS
The model consists of three components: sensory input, the representational system, and the episodic memory system. In the following, we first give a brief overview of the model and then provide more details about the individual components below. For the sensory input, we use a stream of images. We use slow feature analysis (SFA; Wiskott & Sejnowski, 2002) to train the representational system to extract more abstract representations of the input images. SFA is an unsupervised learning algorithm that extracts slowly varying features (e.g., identity and position of an object in the input) from quickly varying data (e.g., pixel values). SFA training is based on changes of the input in time, but the learned feature representation is extracted from a single input image by an instantaneous function. This is consistent with the assumption in our modeling framework that semantic memory is represented as near-instantaneous patterns of neural activity, as opposed to episodic memory, which is represented as sequences of activity patterns that are stored in hippocampus (Cheng, 2013). A sequential episodic memory system can be used to perform at least two manipulations on the original data: (a) Episodes can be replayed verbatim from memory multiple times. While the resulting data do not contain more information than the original data, learning systems can profit from a simple repetition of the training data.
Indeed, neural activity in the rat hippocampus has been shown to replay previously experienced sequences during sleep (A. K. Lee & Wilson, 2002;Louie & Wilson, 2001) or even during periods of rest in the awake state (Diba & Buzsáki, 2007;Foster & Wilson, 2006). Furthermore, there is evidence for coordinated replay in hippocampus and visual cortex (Ji & Wilson, 2007). (b) Episode fragments are recombined to form novel sequences that have smoother transitions than the original episodes. It is well conceivable that the retrieval of memories leads to the retrieval of other memories that are related. Indeed, offline hippocampal sequences corresponding to never-experienced paths have been observed in experiments (Gupta, van der Meer, Touretzky, & Redish, 2010). These novel paths were stitched together from fragments that individually were experienced before. We hypothesize that the novel, recombined sequences are more informative for learning sensory representations than the original episodes.
While in a biological system repeated replay and the generation of novel sequences certainly are intermingled, we model both processes separately in order to disambiguate the effects. In the replay condition, the total number of training patterns is increased by repetition and the original episodes are not altered. In the novel sequences condition, the total number of patterns is held constant, but their sequential ordering is different from the sequences in the stored episodes.
We do not store sensory input directly in episodic memory but a lower-dimensional representation of it (SFA lo, Figure 1). To study the potential effect of episodic memory on tuning sensory representations, we use a second layer of sensory representation (SFA hi, Figure 1) that processes the output of SFA lo. The structure of the representational system in our model can be viewed as a simplified model of a hierarchical visual system. Before the actual experiment takes place, SFA lo is pretrained (double dashed line in Figure 1) and then fixed for the remainder of the study. Then SFA hi is trained on a set of training data. These training episodes are first fed through SFA lo, with subsequent whitening, and the resulting data are used to train SFA hi (dashed lines in Figure 1).
There are two different instances of SFA hi, which we will compare. One instance is trained directly on the output of SFA lo (SFA hi [S], "simple") and the other on sequences that were stored and retrieved from episodic memory (SFA hi [E], "episodic") after passing through SFA lo. This notation is used for both memory models, verbatim replay and generation of novel sequences. After training SFA hi, a set of test data is used to assess the quality of the sensory representation that the two SFA hi instances have extracted (solid lines in Figure 1). The assessment criteria are described further below (Section 2.4). A different set of test data is used to simulate a visual discrimination task.
FIGURE 1 Structure of the representational system. The diagram depicts the information flow in the model, illustrating how each of the three data sets (bottom) is used in the three different stages of the simulation. A rhombus at the end of a line denotes that the data are used for training the particular module. An arrowhead indicates that the data are fed through the module. Pretraining: Before the actual experiments, SFA lo is pretrained (double dashed line). Training: Training data are fed through SFA lo, which extracts low-level visual features that are used to train SFA hi (dashed lines). In our study, we contrast two SFA hi instances. In the simple scenario SFA hi [S] is trained on the output of SFA lo directly, while in the episodic scenario data are stored in episodic memory first and then retrieved to train SFA hi [E]. Test: Finally, the quality of the features the SFA hi instances extract is evaluated by feeding a set of testing data first through SFA lo and then through SFA hi (solid lines). For the role of the whitening see the main text.

Whitening, which is performed on the output of SFA lo, is a linear transformation that normalizes the data to have zero mean and variance one in all directions. The output of SFA lo on the pretraining data is already whitened (Equations 2 and 3). However, because the training data are slightly different from the pretraining data, the output of SFA lo can have a mean and variance different from zero and one, respectively, when fed with training data. In our experiments we found that training of incremental SFA worked more consistently on whitened data, so we included the additional whitening step. The whitening matrix is trained on the output of SFA lo on the training data and the same whitening matrix is used during testing.
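The whitening step described above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names `fit_whitening` and `apply_whitening`, the eigenvalue threshold, and the synthetic stand-in data are assumptions made for illustration. The sketch only shows the two properties stated in the text, namely that the transform is linear and is estimated on the training data, and that the same transform is reused for the test data.

```python
import numpy as np

def fit_whitening(train):
    """Estimate the mean and a whitening matrix from data of shape (T, d)."""
    mean = train.mean(axis=0)
    centered = train - mean
    cov = centered.T @ centered / (len(train) - 1)
    evals, evecs = np.linalg.eigh(cov)
    # Assumption: directions with numerically zero variance are dropped.
    keep = evals > 1e-10
    W = evecs[:, keep] / np.sqrt(evals[keep])
    return mean, W

def apply_whitening(data, mean, W):
    """Apply a previously fitted whitening transform to new data."""
    return (data - mean) @ W

# Synthetic stand-in for SFA lo output (hypothetical data, 4 dimensions).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 4))
train = rng.normal(size=(5000, 4)) @ mixing + 3.0
test = rng.normal(size=(5000, 4)) @ mixing + 3.0

mean, W = fit_whitening(train)
train_w = apply_whitening(train, mean, W)  # exactly whitened
test_w = apply_whitening(test, mean, W)    # only approximately whitened
```

Because the transform is fitted on the training data only, the test data come out approximately, not exactly, whitened.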

| Sensory input
As input, we use streams of grayscale images with 30 × 30 pixels containing a single object, which is either the letter "T" or the letter "L". The black objects with smoothed edges are moving and rotating on a white background according to a random walk. The square images are represented in x- and y-coordinates ranging from −1 to 1. Changes in position and angle are drawn from a Gaussian distribution with zero mean and a standard deviation of 0.25. Independent Gaussian noise with zero mean and SD 0.1 is added to each pixel in each frame. Pixel values range from 0 to 1.
Each data set consists of several episodes of the same length that are strung together and presented as one long stream (Figure 2). For each episode, the starting position and angle are randomly initialized. After each episode, the object identity changes, that is, the two letters are presented alternately (Figure 2). In the following, we refer to object identity and x,y-coordinates as latent variables. While the pixel values themselves can be directly observed, latent variables can only be inferred from pixel values.
Pretraining and test data consist of 100 episodes of length 50. The length and number of episodes in the training data vary between the experiments.
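The structure of such a data set can be sketched at the level of the latent variables. The following is a minimal illustration, not the authors' generator: it produces only the latent trajectories (identity, x, y, angle) and does not render the 30 × 30 images or add pixel noise. The function name `generate_latents`, the uniform initialization, and the clipping of positions to the range from −1 to 1 are assumptions, since the text does not state how the random walk is initialized or handled at the image border.

```python
import numpy as np

def generate_latents(n_episodes, episode_length, step_sd=0.25, seed=0):
    """Return an array of shape (n_episodes * episode_length, 4).

    Columns: identity (0 = "T", 1 = "L"), x, y, angle in radians.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for episode in range(n_episodes):
        identity = episode % 2  # the two letters are presented alternately
        # Assumption: start position and angle are drawn uniformly.
        x, y = rng.uniform(-1.0, 1.0, size=2)
        angle = rng.uniform(0.0, 2.0 * np.pi)
        for _ in range(episode_length):
            rows.append((identity, x, y, angle))
            dx, dy, dangle = rng.normal(0.0, step_sd, size=3)
            # Assumption: positions are clipped to the image coordinates.
            x = float(np.clip(x + dx, -1.0, 1.0))
            y = float(np.clip(y + dy, -1.0, 1.0))
            angle = (angle + dangle) % (2.0 * np.pi)
    return np.array(rows)

# Same size as the pretraining and test sets: 100 episodes of length 50.
latents = generate_latents(n_episodes=100, episode_length=50)
```

The jump in identity and position at every episode boundary is what later distinguishes the original stream from sequences retrieved from memory.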

| Sensory representation
A well supported hypothesis about the function of the visual system is that it generates representations of the visual inputs that are invariant to many transformations, for example, changes of object orientation and position, view angle, lighting, and so on (Logothetis & Sheinberg, 1996;Rolls, 2000). SFA proposes that invariant representations are most likely those that are varying slowly in time. Therefore, the objective of SFA is to extract features from the input that vary as slowly as possible. Indeed, SFA applied to sequences of moving objects naturally learns a compact representation of object identity and pose (Franzius, Wilbert, & Wiskott, 2011), akin to that found in inferior temporal cortex. Experimental evidence supports the hypothesis that slowness may play an important role in forming neural representations (Li & DiCarlo, 2010). While the features being extracted by SFA on such a high level are usually easy to analyze and interpret, namely object identity and pose, the functions that extract the features are much less accessible. However, on the lowest level of the visual hierarchy, the functions can be analyzed and have been found to correspond to complex cell receptive fields (Berkes & Wiskott, 2005). Here we use a linear version of this model to learn plausible invariant representations of the visual input. Note that we model the representations formed in the neocortex with SFA, not the hippocampal mechanisms, although studies have demonstrated slowly changing neural signals (Cai et al., 2016;Tsao et al., 2018) as well as time cells (MacDonald, Lepage, Eden, & Eichenbaum, 2011;Pastalkova, Itskov, Amarasingham, & Buzsáki, 2008) in the hippocampus. As opposed to the representations generated by SFA, these are attributable to a putative explicit representation of time in the hippocampus.
FIGURE 2 Example input. Shown are two episodes containing four images each (top), and the corresponding relevant latent variables (object identity and x,y-coordinates, bottom). Episodes are strung together to form one data set. The object is switched and its position is randomized at the start of each episode, hence the latent variables exhibit a jump at the transition between episodes.

SFA finds instantaneous scalar functions that generate slowly varying output from quickly varying input. Given a multidimensional input $x(t)$ and a function space $F$, SFA finds a set of functions $\{g^{(1)}(x), g^{(2)}(x), \ldots\}$ in $F$ such that the output signals $y^{(i)}(t) := g^{(i)}(x(t))$ minimize

$$\Delta(y^{(i)}) := \left\langle \left(\dot{y}^{(i)}(t)\right)^2 \right\rangle_t \tag{1}$$

under the following constraints:

$$\left\langle y^{(i)}(t) \right\rangle_t = 0 \tag{2}$$

$$\left\langle \left(y^{(i)}(t)\right)^2 \right\rangle_t = 1 \tag{3}$$

$$\left\langle y^{(i)}(t)\, y^{(j)}(t) \right\rangle_t = 0 \quad \forall\, j < i \tag{4}$$

Here $\langle \cdot \rangle_t$ denotes the average over time and $\dot{y}$ the time derivative. The delta value defined by Equation (1) is a measure of the slowness of the signal $y^{(i)}(t)$. It will also be used as one of the criteria for comparing the quality of different sensory representations. Equations (2) and (3) ensure that SFA does not generate the trivial solution of a constant function (for which $\Delta = 0$). The constraint in Equation (4) ensures that SFA does not yield the same feature twice and that the features are ordered according to the degree of their slowness, that is, $y^{(1)}(t)$ has the smallest delta value and $\Delta(y^{(i)}) < \Delta(y^{(j)})$ for $i < j$.
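For the linear function space, the optimization can be solved in closed form by whitening the input and then finding the directions in which the temporal differences have the smallest variance. The following NumPy sketch illustrates this standard batch construction together with the delta value. It is not the MDP implementation or the incremental algorithm used for SFA hi; the function names `linear_sfa` and `delta_value`, the finite-difference approximation of the time derivative, and the toy signal are assumptions made for illustration.

```python
import numpy as np

def delta_value(y):
    """Slowness per feature: mean squared temporal difference of each column."""
    return np.mean(np.diff(y, axis=0) ** 2, axis=0)

def linear_sfa(x, n_features):
    """Batch linear SFA on data x of shape (T, d).

    Returns the input mean and a projection P such that the columns of
    (x - mean) @ P have zero mean, unit variance, are decorrelated,
    and are ordered from slowest to fastest.
    """
    mean = x.mean(axis=0)
    xc = x - mean
    # Step 1: whiten the input (zero-mean and unit-variance constraints).
    cov = xc.T @ xc / (len(xc) - 1)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > 1e-10
    whiten = evecs[:, keep] / np.sqrt(evals[keep])
    z = xc @ whiten
    # Step 2: find directions in which the temporal differences are smallest.
    dz = np.diff(z, axis=0)
    dcov = dz.T @ dz / len(dz)
    _, dvecs = np.linalg.eigh(dcov)  # eigenvalues in ascending order
    return mean, whiten @ dvecs[:, :n_features]

# Toy input: a slow and a fast sine wave, linearly mixed.
t = np.linspace(0.0, 2.0 * np.pi, 2000)
slow, fast = np.sin(t), np.sin(29.0 * t)
x = np.column_stack([slow + 0.5 * fast, 0.3 * slow - fast])

mean, P = linear_sfa(x, n_features=2)
y = (x - mean) @ P
deltas = delta_value(y)
```

In this toy case the first output feature recovers the slow sine wave up to sign, and `deltas` is sorted in ascending order, matching the ordering of features by slowness.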
The standard batch implementation of SFA is available from the Python library Modular Toolkit for Data Processing (MDP; Zito, Wilbert, Wiskott, & Berkes, 2008). There is also an incremental version of the algorithm (Kompella, Luciw, & Schmidhuber, 2012), which has been shown to asymptotically reach the same performance as batch SFA with sufficient training and has the advantage of being a more plausible model for a biological learning system. In an incremental learning system, samples from the training data are presented one by one and the model parameters are updated every time a sample is presented. No memory of the previous samples is available, only the information stored in the model parameters of the learning system. By contrast, all samples are available at the same time in a batch learning algorithm and training is done once on the entire training data set.
Since we want to study the effects of memory on the learning of representations, we have to dissociate memory from the learning process. Hence, for the high-level sensory representation (SFA hi ) we use the incremental algorithm which does not require holding the entire training data set in memory. The learning rate is set to 0.005, which yielded the best asymptotic performance in our scenarios. However, because the low-level representation (SFA lo ) serves only preprocessing purposes and remains unchanged during the experiments, SFA lo is implemented with the batch algorithm to be sure to reach optimal performance.
For simplicity, we operate in a linear function space ("linear SFA"), which is sufficient to reliably extract the features of interest (position and identity of the object). We use several identical SFA lo nodes with overlapping receptive fields as a simple model of receptive fields in visual cortex, as is common in work on SFA. As illustrated in Figure 3, the receptive field of each node of SFA lo spans an 18 × 18 pixel area of the input image and has an overlap of 6 pixels with the receptive fields of the neighboring nodes. Thus, 3 × 3 nodes jointly cover the image space. Each node of SFA lo generates 32 features, and all 9 × 32 SFA lo features are strung together in a single vector. These features are further processed by SFA hi, which is a single node of linear incremental SFA that extracts 16 features.
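The tiling of the input image into receptive fields can be sketched as follows. This is an illustration under a stated assumption, not the authors' code: the sketch places neighboring 18 × 18 fields at an offset of 6 pixels, because that is the offset at which three fields per axis span a 30-pixel image exactly. How this offset relates to the "overlap of 6 pixels" stated in the text is not resolved here. The function name `receptive_fields` is hypothetical.

```python
import numpy as np

def receptive_fields(image, field=18, stride=6):
    """Cut a square image into overlapping square patches (row-major order)."""
    size = image.shape[0]
    starts = range(0, size - field + 1, stride)
    return [image[r:r + field, c:c + field] for r in starts for c in starts]

image = np.arange(900, dtype=float).reshape(30, 30)  # stand-in for one frame
patches = receptive_fields(image)

# Each SFA lo node yields 32 features; all of them are concatenated.
n_features_lo = len(patches) * 32
```

With these settings the image is covered by 3 × 3 patches, and the concatenated SFA lo output has 9 × 32 = 288 dimensions, which is the input dimensionality of SFA hi.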

| Episodic memory
The training data are stored in episodic memory after passing through SFA lo . The focus of our model is the function of a sequential episodic memory in the formation of an optimal sensory representation and not the functioning of the memory system itself. Therefore, our episodic memory model is highly simplified. Its sequential nature is reminiscent of theoretical considerations proposing that the hippocampus holds a record of past stimuli and episodic memory recall results in a "jump back in time" (Howard & Eichenbaum, 2013). Experimental data indeed show that the population vector in the hippocampus changes gradually over time and on successful memory retrieval, the population activity at the time of encoding is reinstated (Folkerts, Rutishauser, & Howard, 2018). While the population vector in the hippocampus might encode past, present and even future to some extent at the same time, our model is simplified to focus on the sequential nature of episodic memory without an explicit representation of time or context. However, recalling a past memory reinstates the activity (= the retrieved pattern) that was present at the time of experience in the model as well. Furthermore, since the stored patterns are SFA features, and these features change slowly in time, the patterns in the hippocampus change gradually at the time of experience.
When modeling storage and retrieval of episodic memory, we distinguish two different modes as mentioned above.

FIGURE 3 Structure of the slow feature analysis (SFA) network. The black dots represent SFA nodes. The gray patches represent the receptive fields that partially overlap in the SFA lo network.

| Repeated verbatim replay of the episodes

| Generating novel sequences from episode fragments
The episodic system is represented by a highly simplified algorithmic model for sequence storage and retrieval, which was inspired by iterative retrieval from a hetero-associative network. In this model, each episode element or pattern $y_i$ is stored individually. The sequential information is preserved by storing a retrieval key $y^*_{i+1}$ pointing to the next pattern in the episode. Hence, the information about sequential order is only available on a pairwise basis $(y_i, y^*_{i+1})$, not on a global level.
Because the last element of an episode does not have a succeeding element that could be used as a key, it is not stored as a pattern explicitly. It is only stored as a key associated with the second-to-last episode element (Figures 4 and 5).
Retrieval of a sequence is initiated by providing a retrieval cue $\hat{y}_{t=0}$ to the system. The algorithm calculates the Euclidean distance of $\hat{y}_t$ to all patterns in memory and retrieves the one with the smallest distance, $y'_i$. The key $y^*_{i+1}$ associated with $y'_i$ is then used as a cue for the next retrieval step: $\hat{y}_{t+1} = y^*_{i+1}$. The process described so far is able to retrieve episodes from memory perfectly (except for the last pattern of an episode, which cannot be retrieved, see above), unless two or more patterns from different episodes are identical. Sequence retrieval in biological neural networks, however, is subject to internal and external noise. We therefore add a noise term $\epsilon_t \sim \mathcal{N}(0, \sigma)$, $\sigma = 0.2$, to the cue in each retrieval step (Equation 5). To avoid getting stuck in short loops during retrieval, a depression term is introduced: Every pattern $y_i$ in memory is associated with a depression value $a_i$ that is added to the distance to the cue during retrieval. $a_i$ is initialized with 0 and is increased by a fixed amount $\alpha$ every time $y_i$ is retrieved. Depression values decay exponentially with a decay constant of $1/b$. Equation (7) defines the depression term, with $\hat{u}(i_{t+1})$ being a unit vector in which the element at position $i_{t+1}$ is 1.
In our simulations α and b are both set to 400. This provides enough immediate depression to avoid short retrieval loops, but the decay still allows one pattern to be retrieved multiple times during one simulation.
Because the last element of each episode is stored in memory only as a key (Figure 5, empty rings), it cannot be retrieved from memory. Therefore, when cued with one such pattern, the algorithm will retrieve a pattern that is similar to the cue, thus continuing the sequence in a smooth manner where the input stream normally would have exhibited a jump. These transitions are more frequent when the stored episodes are shorter.
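The storage and retrieval procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. The class name `EpisodicMemory` and its methods are hypothetical. Two details are assumptions: σ is treated as the standard deviation of the noise, and the exponential decay of the depression values is implemented as a multiplication by exp(−1/b) per retrieval step, since the exact update rule is not reproduced in the text above.

```python
import numpy as np

class EpisodicMemory:
    """Pattern-key store with noisy nearest-neighbor sequence retrieval."""

    def __init__(self, alpha=400.0, b=400.0, sigma=0.2, seed=0):
        self.alpha, self.b, self.sigma = alpha, b, sigma
        self.rng = np.random.default_rng(seed)
        self.patterns, self.keys = [], []

    def store(self, episode):
        """Store each element with its successor as key; the last element
        of the episode is stored only as a key."""
        episode = np.asarray(episode, dtype=float)
        for current, successor in zip(episode[:-1], episode[1:]):
            self.patterns.append(current)
            self.keys.append(successor)

    def retrieve(self, cue, n_steps):
        patterns, keys = np.array(self.patterns), np.array(self.keys)
        depression = np.zeros(len(patterns))
        cue = np.asarray(cue, dtype=float)
        retrieved = []
        for _ in range(n_steps):
            noisy_cue = cue + self.rng.normal(0.0, self.sigma, size=cue.shape)
            # Depression is added to the Euclidean distance to the cue.
            distance = np.linalg.norm(patterns - noisy_cue, axis=1) + depression
            i = int(np.argmin(distance))
            retrieved.append(patterns[i])
            depression *= np.exp(-1.0 / self.b)  # assumed form of the decay
            depression[i] += self.alpha
            cue = keys[i]  # the stored key becomes the next cue
        return np.array(retrieved)

# Noise-free example with two short episodes in two dimensions.
episode_1 = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
episode_2 = [[3.2, 0.1], [4.0, 0.0], [5.0, 0.0], [6.0, 0.0]]
memory = EpisodicMemory(sigma=0.0)
memory.store(episode_1)
memory.store(episode_2)
sequence = memory.retrieve(cue=episode_1[0], n_steps=6)
```

In this noise-free example the retrieved sequence follows the first episode up to its second-to-last element and then continues with the start of the second episode, because the key at the end of the first episode is closest to that pattern. This is the kind of smooth transition between episode fragments described above.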

| Comparison of feature quality
After training SFA hi with and without episodic memory, we assess the quality of their features using the following two different criteria.
1. The delta value (Equation 1) of the SFA hi output on a set of testing data is used as a measurement of feature quality (e.g., Figure 8a). Since the delta value is the objective function of SFA, features with a lower delta value are better in terms of the algorithm.

FIGURE 4 Illustration of the simplified model of episodic memory. The two episodes at the top are stored in the model of episodic memory as a set of pattern-key pairs. A key to the next pattern is stored along with each pattern. The last pattern of each episode (K, L) is not stored explicitly, that is, it only appears in memory as a key. During retrieval, noise is added to the retrieval cue and the closest memory pattern (Euclidean distance) to the noisy cue is retrieved. The corresponding key is used as the next retrieval cue. Due to the noise, retrieval can yield an incorrect transition (G → H in this example).

| Visual discrimination performance
We follow the approach from one of the first studies that showed perceptual deficits in patients with MTL damage (A. C. H. Lee, Fischer, et al., 2005). Patients and controls viewed pairs of test images, along with a sample image to compare to. The task was to decide which one of the two test images was more similar to the sample image. The test image pairs were linear combinations of the shown sample image and a second sample image, which was not shown.
To simulate such a visual discrimination task, we use two different paradigms for image generation, each of which takes into account one of the latent variable types: Paradigm 1 mixes the sample images based on the coordinates of the object, whereas Paradigm 2 mixes based on object identity, which models the original task more closely. By varying the level of similarity of the mixtures ("mixing level"), the difficulty of the task can be controlled.

| Paradigm 1 (position discrimination)
The two sample images (S1/S2) show the letter T in different spatial positions. The distance between these two positions is fixed at 1 unit. Based on these samples, test images (T1/T2) are generated by using different mixing levels. For example, for a mixing level of 0, T1 is identical to S1 and T2 is identical to S2. When the mixing level is increased, the letters in T1 and T2 are moved along a line toward each other until, for a mixing level of 0.5, T1 and T2 are identical. An example set of image pairs is shown in Figure 6a.

FIGURE 5 Visualization of noise-free sequence retrieval from episodic memory. Retrieval is compared for stored episodes of different lengths. Example patterns are visualized in two-dimensional space. Patterns and keys in memory are represented by filled circles and rings, respectively. The pattern-key associations stored in memory are depicted by solid lines. The initial retrieval cue is the key marked by an arrowhead. In each retrieval step the pattern closest to the cue is retrieved from memory. In the noise-free case this is the pattern identical to the key (the filled circle in the ring). However, if the end of a stored episode is reached, there is no pattern identical to the key in memory (circle empty) and the most similar pattern is retrieved (dashed line). After retrieving a pattern, the key associated (solid line) to that pattern is used as a cue for the next retrieval step. Dashed lines and black filled circles represent the retrieved sequence. (a) Episodic memory contains three episodes of length six. A sequence of length six is retrieved from memory. During retrieval, the end of a stored episode is reached twice. (b) Episodic memory contains six episodes of length three. A sequence of length six is retrieved from memory. During retrieval, the end of a stored episode is reached four times.

| Paradigm 2 (object discrimination)
Two noise-free sample images, one showing the letter T at a random position (S1) and the other showing the letter L at the same position (S2), are generated. Based on these samples, test images (T1/T2) for different values of the mixing level $m$ are generated. Test images are a linear combination of the two sample images:

$$T_1 = (1 - m)\, S_1 + m\, S_2, \qquad T_2 = m\, S_1 + (1 - m)\, S_2$$

Gaussian noise with zero mean and a SD of 0.1 is added to the pixels of the images after mixing (gray values range from 0 to 1). An example set of image pairs according to Paradigm 2 is shown in Figure 6b.
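The image mixing of Paradigm 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The mixing weights are chosen so that a mixing level of 0 gives T1 = S1 and T2 = S2 and a mixing level of 0.5 gives identical test images, as described for the stimuli. The decision rule in `choose`, which picks the test image whose features are closer to those of S1 in Euclidean distance, is hypothetical, as are the function names and the random stand-in images. Pixel values are not clipped after adding noise.

```python
import numpy as np

def mix_pair(s1, s2, m, noise_sd=0.1, rng=None):
    """Return test images (T1, T2) for mixing level m between 0 and 0.5."""
    rng = np.random.default_rng() if rng is None else rng
    t1 = (1.0 - m) * s1 + m * s2
    t2 = m * s1 + (1.0 - m) * s2
    # Independent pixel noise is added after mixing.
    t1 = t1 + rng.normal(0.0, noise_sd, size=t1.shape)
    t2 = t2 + rng.normal(0.0, noise_sd, size=t2.shape)
    return t1, t2

def choose(features, s1, t1, t2):
    """Hypothetical decision rule: return 1 if T1 is closer to S1 in
    feature space than T2 is, otherwise 2."""
    d1 = np.linalg.norm(features(t1) - features(s1))
    d2 = np.linalg.norm(features(t2) - features(s1))
    return 1 if d1 < d2 else 2

# Random stand-ins for the two 30 x 30 sample images.
rng = np.random.default_rng(0)
s1, s2 = rng.random((30, 30)), rng.random((30, 30))

# Noise-free checks of the two limiting mixing levels.
easy_t1, easy_t2 = mix_pair(s1, s2, m=0.0, noise_sd=0.0, rng=rng)
hard_t1, hard_t2 = mix_pair(s1, s2, m=0.5, noise_sd=0.0, rng=rng)

# Identity "features" (raw pixels) as a placeholder for a learned representation.
t1, t2 = mix_pair(s1, s2, m=0.2, noise_sd=0.0, rng=rng)
answer = choose(lambda img: img.ravel(), s1, t1, t2)
```

In a full simulation, `features` would be replaced by the trained SFA lo, whitening, and SFA hi pipeline, and the hit rate would be the fraction of trials in which the answer is 1.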
In both paradigms, the task is to decide which image, T1 or T2, is more similar to S1. T1 is always the correct answer.

Hence, the memory in this case is not completely absent, but retrieval is less reliable. This might be a more realistic model for the effect of partial lesions.
FIGURE 6 Sample stimuli for the simulated visual discrimination. (a) Example stimulus set for Paradigm 1. Two sample images (S1/S2) that only differ in the position of the letter T are generated (left-most images). The distance between the letter positions is fixed. The discrimination task is to decide which one of two different mixtures T1/T2 (shown as vertical pairs) is more similar to S1. The mixing level indicates the difficulty of the task. For mixing level 0.5 both T1 and T2 are identical (apart from noise). Reducing mixing levels increases the distance between T1 and T2 until, for mixing level 0, T1 = S1 and T2 = S2.

FIGURE 7 Sensory discrimination performance is superior with intact episodic memory. The dashed lines represent the hit rate of the sensory representation trained with lesioned memory, the solid lines represent performance with intact memory. The mixing level represents the level of similarity of the images to discriminate, which amounts to the level of difficulty of the discrimination. Each row of the figure shows the results from a different memory model as follows: (a and b) Repeated verbatim replay of the episodes. (c and d) Generating novel sequences from a hetero-associative sequence storage model. (e and f) Elevated versus no retrieval noise (σ) in the sequence storage model. The columns represent the two discrimination paradigms: (a, c, and e) Results for Paradigm 1 (position discrimination). (b, d, and f) Results for Paradigm 2 (object discrimination).

For all memory models, the discrimination performance is better with intact episodic memory in both discrimination paradigms (position as well as object discrimination, see Section 2.5; Figure 7).

This analysis will eventually allow us to account for the performance difference between intact and lesioned memory, which, first, is more pronounced for simple replay than for generated novel sequences and, second, more pronounced in Paradigm 1 than in Paradigm 2.
Anticipating the results of our analysis below, the first effect is due to the fact that, in our simplified model, verbatim replay improves learning more than the generation of novel sequences. This might explain why replay is frequently observed in the hippocampus (Diba & Buzsáki, 2007;Louie & Wilson, 2001) and why it is important for learning (Girardeau et al., 2009
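The stimulus construction described in the caption of Figure 6 can be illustrated with a short sketch. It assumes that the mixtures are linear blends in pixel space, that the noise is additive Gaussian noise, and that similarity is measured as Euclidean distance in the output of a feature map; the function names and the `features` argument are hypothetical placeholders for the trained sensory representation and are not taken from the published code.

```python
import numpy as np

def make_test_pair(s1, s2, mixing_level, noise_std, rng):
    """Blend the sample images S1 and S2 into the test images T1 and T2.

    Assumption: linear mixing in pixel space plus Gaussian pixel noise.
    mixing_level = 0   -> T1 = S1 and T2 = S2 (easiest case)
    mixing_level = 0.5 -> T1 and T2 are identical apart from noise
    """
    t1 = (1.0 - mixing_level) * s1 + mixing_level * s2
    t2 = mixing_level * s1 + (1.0 - mixing_level) * s2
    t1 = t1 + rng.normal(0.0, noise_std, size=s1.shape)
    t2 = t2 + rng.normal(0.0, noise_std, size=s2.shape)
    return t1, t2

def chooses_t1(s1, t1, t2, features=lambda img: img.ravel()):
    """Return True if T1 is judged more similar to S1 than T2 is.

    `features` stands in for the sensory representation (identity by default).
    """
    d1 = np.linalg.norm(features(t1) - features(s1))
    d2 = np.linalg.norm(features(t2) - features(s1))
    return bool(d1 < d2)

def hit_rate(s1, s2, mixing_level, noise_std, n_trials, rng,
             features=lambda img: img.ravel()):
    """Fraction of trials in which the correct answer (T1) is chosen."""
    hits = 0
    for _ in range(n_trials):
        t1, t2 = make_test_pair(s1, s2, mixing_level, noise_std, rng)
        hits += chooses_t1(s1, t1, t2, features)
    return hits / n_trials
```

Evaluating `hit_rate` over a range of mixing levels, once with the representation trained with intact memory and once with the lesioned one, would yield curves of the kind summarized in Figure 7.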

| Repeated replay of the episodes
Replaying episodes faithfully from memory repeatedly is possibly the simplest form of sequential memory recall. It provides the neocortex with overall more training data and more opportunities to learn from the experienced episodes. Thus, we expect better sensory representations after multiple repetitions of memory replay. This is reminiscent of a process of systems consolidation, in which information from hippocampal memories is gradually extracted into neocortical memory stores with repeated retrievals.
Indeed, we find that the more often the training data are rep- For comparison, we trained another SFA hi instance by generating 40 unique data sets, instead of using the same data set 40 times. This new network was thus trained on the same overall amount of training data, but it has been exposed to more unique sensory stimuli than the network trained by replaying a limited amount of experience from memory. Interestingly, the replay network outperforms the network trained on more data (Figure 8a,b). This additionally emphasizes the importance of memory replay for learning.
In order to visualize the linear transformation that SFA hi [E] learned, which we call the sensory representation, we plotted the response of the entire network to stimuli at different positions (Figure 9). As

| Generating novel sequences by tying together episode fragments
Episodic memory can also be used to improve training performance without increasing the total amount of data. When modeled with a hetero-associative sequence storage model (as described above, Section 2.3), the hippocampal system can associate similar patterns with each other that were experienced at different times. It has been shown that this sequential or associative property of the hippocampal memory is exploited for inference or generalization (Bunsey & Eichenbaum, 1996;Wimmer & Shohamy, 2012;Zeithamova, Dominick, & Preston, 2012). We hypothesize that these properties can also be used in our case of training the sensory representations. By associating similar episode fragments with each other, novel sequences can be generated from memory that are more useful to the representational learning system than the original episodes.
In our model, the two different objects are presented alternately, that is, at the start of a new episode the object identity is switched. Also, the position of the object is randomly re-initialized. When the episodes are strung together to form the training data set, the latent variables (object coordinates, identity) jump at the transitions between episodes (see also Section 2.1; Figure 2). These discontinuities make it harder for SFA to learn a representation of the latent variables, because SFA learns features that vary slowly in time. The shorter the episodes in the training data are, the more discontinuities with regard to the latent variables are present in the data (given that the overall amount of data is constant). This is reflected in the delta values (average rate of change) of the latent variables in the training data (Figure 10c, dashed lines). Hence, we expect the quality of the sensory representation learned to be lower with shorter episodes.
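The relation between episode length and the delta value of a latent variable can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not the movement model of the paper: the delta value is taken to be the mean squared temporal difference of the signal after normalization to zero mean and unit variance (the usual SFA convention), and the object coordinate is modeled as a clipped random walk that is re-initialized uniformly at the start of each episode. The function names and the step size are hypothetical.

```python
import numpy as np

def delta_value(signal):
    """Mean squared temporal difference of the normalized signal.

    Assumption: zero-mean, unit-variance normalization, as is common for SFA.
    """
    z = (signal - signal.mean()) / signal.std()
    return float(np.mean(np.diff(z) ** 2))

def latent_coordinate(n_frames, episode_length, step_std, rng):
    """One object coordinate over a concatenation of episodes.

    Within an episode the coordinate follows a random walk clipped to [0, 1];
    at the start of every episode it is re-initialized uniformly at random,
    which produces the jumps between episodes.
    """
    position = np.empty(n_frames)
    for start in range(0, n_frames, episode_length):
        n = min(episode_length, n_frames - start)
        steps = rng.normal(0.0, step_std, size=n)
        steps[0] = 0.0  # the first frame sits at the re-initialized position
        walk = rng.uniform(0.0, 1.0) + np.cumsum(steps)
        position[start:start + n] = np.clip(walk, 0.0, 1.0)
    return position
```

With the total amount of data held at 30,000 frames, `delta_value(latent_coordinate(30000, 2, 0.01, rng))` comes out far larger than the value for episodes of length 600, because almost every second transition is a jump between unrelated positions.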
The hetero-associative sequence storage model of episodic memory, however, counteracts the discontinuities in the original data by associating similar episodes or episode fragments with each other, generating smooth sequences of fixed length. A data set composed of these sequences has a fixed number of discontinuities, which is reflected in the delta values (Figure 10c, solid lines). Hence, if the sequences retrieved from episodic memory are used to train SFA we would expect the sensory representation to be independent of the episode length in the training data.
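How such a memory might string episode fragments together can be sketched as follows. This is a deliberately simplified stand-in for the sequence storage model of Section 2.3, whose details are not reproduced here. It assumes that each stored pattern cues its stored successor, that the last pattern of an episode has no stored successor and therefore cues with itself, that retrieval returns the stored pattern closest to the (optionally noisy) cue, and that a jump at the end of an episode must land in a different episode. All names are hypothetical.

```python
import numpy as np

def generate_sequence(patterns, episode_ids, start, length, sigma, rng):
    """Generate a sequence of stored-pattern indices from a simplified memory.

    patterns    : (n_patterns, n_dims) array in storage order
    episode_ids : (n_patterns,) array; equal ids mark the same episode
    start       : index of the initial retrieval cue
    sigma       : standard deviation of the Gaussian retrieval noise
    Assumes that the memory holds more than one episode.
    """
    n = len(patterns)
    sequence = [start]
    current = start
    for _ in range(length - 1):
        has_successor = (current + 1 < n
                         and episode_ids[current + 1] == episode_ids[current])
        # Within an episode the key is the stored successor; at the end of an
        # episode no successor exists, so the current pattern serves as key.
        key = patterns[current + 1] if has_successor else patterns[current]
        cue = key + rng.normal(0.0, sigma, size=key.shape)
        dist = np.linalg.norm(patterns - cue, axis=1)
        if has_successor:
            dist[current] = np.inf  # never retrieve the current pattern again
        else:
            # Force the jump into a different episode (assumption).
            dist[episode_ids == episode_ids[current]] = np.inf
        current = int(np.argmin(dist))
        sequence.append(current)
    return sequence
```

With `sigma = 0` the sketch replays an episode faithfully until its end and then continues with the most similar pattern from another episode, so the number of large jumps depends on the similarity structure of the stored patterns rather than on the episode length. Larger values of `sigma` make incorrect retrievals increasingly likely.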
We trained SFA hi instances on episodes of different lengths between 2 and 600 while keeping the overall amount of data constant. SFA hi [S] is trained on these episodes directly, whereas SFA hi [E] is trained on sequences retrieved from episodic memory. This process is averaged over 16 repetitions to smooth out fluctuations that are introduced by randomness in the movement statistics of the input data, the selection of initial retrieval cues and the retrieval noise in episodic memory.
As expected, the longer the episodes in the training data are, the more precisely the features generated by SFA hi [S] represent the latent variables of the input, that is, delta values decline (Figure 10a) and This result shows that episodic memory can improve the sensory representations by providing training data in a more useful order.
While this main effect is only substantial for shorter episodes in our model, it is conceivable that it is more general for real sensory inputs, which are much more complex. In order to optimize representations for different aspects of sensory percepts, episodic memory content could be reorganized according to different criteria. Notably, the identity of the object is harder to extract for the SFA algorithm than the position, thus optimally representing the object identity in the feature output requires longer training episodes than representing the x- and y-coordinates (Figure 10b). Gupta et al. (2010)  instances from the trials where training episodes were of length 2, because the strongest effect of episodic memory was observed in that case.

Aside
In addition to the main effect described above, the data show two other, more nuanced, effects. Even though they do not affect our main results in any way, we explore the origin of these effects in the following to precisely understand the behavior of the model. This section can be skipped on a first reading.  [E] for episodes of a length larger than 80, which is the length of sequences generated by episodic memory. Since episodic memory is cued randomly, it will repeat at least parts of the episodes multiple times with a high probability. Because there are as many unique patterns as there are patterns in the training set, this repetition of some patterns implies that other patterns are not used at all, thus there are fewer unique patterns presented in the episodic than in the simple scenario. We show above (Section 3.2.1) that the system profits more from a repetition of input data than from being trained on more unique patterns. This explains the described effect: in the episodic scenario the training data contains repetitions, while in the simple scenario every input pattern is unique.
2. Counterintuitively, the delta value of the SFA hi features increases slightly and the precision of their object coordinate representation decreases slightly with increasing training episode length in the episodic scenario (Figure 10a,b).
FIGURE 9 Sensory representations of SFA hi [E] for stimuli at different positions. SFA hi training was conducted 40 times. The value of each pixel is the response to an object centered on the respective position of the input image. (a) Legend depicting how to read the plots (b)-(e). The legend shows three example input images and illustrates which pixels of the plot represent the responses to these images. The top left pixel of the plots, for instance, is the response of the system to an object in the top left corner of the input image. Hence, if the pixel values of a plot display a gradient, that means the system responds differently to objects at different positions; that is, the feature "codes" for object position. (b) Stimuli were single black pixels. Data were normalized by subtracting the representation of a zero image. From left to right: Features 1, 2, 3. (c) Stimuli were noisy images with either the letter T (top row) or L (bottom row). From left to right: Features 1, 2, 3. All plots in (b) and (c) display a spatial gradient, especially feature 1 and 2 code for object position. Feature 3 additionally displays a clear response difference between objects T and L, hence coding for object identity. (d and e) The same information as in (b) and (c) is depicted for the predictions of the linear regressor for x-coordinate, y-coordinate, and object identity (from left to right). Response gradients for the letters in Column 1 and 2 (x,y) and object specificity for Column 3 (identity) are more pronounced than in (b). This was expected since the SFA features are learned in an unsupervised manner, whereas the linear regressors were trained using ground truth as supervisory signal. SFA, slow feature analysis.
We found that this effect arises because of the property described above as well: These two effects further emphasize that replaying the same episodes causes a larger improvement of the sensory representations than providing the model with more unique episodes, a result we found above (Section 3.2.1; Figure 8a,b).

| Noise in episodic memory
The above approaches have compared a memory-free model to a model with a fully functioning memory system. In reality, when the MTL is damaged or in the case of age-related impairment, episodic memory might not be completely absent. Also, memory models of forgetting in the healthy suggest that memory is not completely lost, but that retrieval can fail or be incorrect due to interference of memory traces (Anderson, 2000;Underwood, 1957) or inaccurate retrieval cues (Tulving, 1974). We model these various effects with increased levels of retrieval noise in our hetero-associative memory. Gradually increasing the retrieval noise in the hetero-associative sequence storage model will lead to associations of more and more dissimilar patterns, up to the point at which retrieval is completely random. All other aspects of the model remain unchanged. As before, we gener-
FIGURE 10 Generating novel sequences using episodic memory improves sensory representations. The overall amount of training data was fixed at 30,000 frames, but the number and length of individual episodes varied from 15,000 episodes of length two up to 50 episodes of length 600. From these episodes, episodic memory always generated 375 sequences of length 80. We contrast the feature quality between the simple and the episodic scenario, depending on the length of the training episodes. In the simple scenario, those episodes were used for training SFA hi [S] directly, whereas in the episodic scenario, they were stored in episodic memory first, which then generated sequences with fixed length to train SFA hi [E] on. (a) Average of the delta values for the three slowest features on the test data. Triangular markers in the plot denote the SFA hi instances that were used for the visual discrimination. (b) Correlation of SFA hi feature output with latent variables of the input on test data. As in Figure 8, this is the Pearson correlation between latent variables and estimates by the regressors. (c) Average delta value of the latent variables, that is, coordinates and binary object category in the training data. In the episodic scenario, this is the output of episodic memory. SFA, slow feature analysis.
the training data was varied (x-axis in Figure 11), whereas the sequences retrieved from episodic memory always had a length of 80 patterns. Note that training SFA hi [E] with large noise in episodic memory is qualitatively different from training SFA hi [S] (without episodic memory), hence varying the level of retrieval noise does not yield a gradual transition from the episodic to the simple scenario.
As expected, with higher retrieval noise the feature quality is generally lower, that is, the features vary quicker (higher delta values) and represent the latent variables of the input less precisely (lower feature-latent-variables correlations; Figure 11). Retrieval noise leads to jumps to incorrect elements that are farther away from the previous element than the correct next element is. As a result, the delta values in the latent variables of the retrieved sequences are higher for larger noise (Figure 11c).  Figure 11c). In the presence of intermediate noise, the closest element to the cue = key + noise is in many cases different from the element closest to the key and therefore further away from the previous element (Figure 12). This leads to a bigger jump at the end of an episode, which does not occur in the middle of an episode, because in the latter case, the correct element is stored in the system and will be retrieved correctly, as long as the noise is not too large. Thus, these bigger jumps occur more frequently the shorter the episodes in memory are, which is reflected in the delta values of the latent variables (Figure 11c) and the resulting feature quality (Figure 11a,b). For very large retrieval noise (σ = 6), the noise vector is larger than the average inter-item distance and so it makes no difference whether the element associated with the key is stored in the system or not (Figure 12). Hence, for large noise the quality of the extracted features does not depend on the length of the stored episodes (Figure 11).
In simulated visual discrimination tasks, the systems that were trained using episodic memory outperformed the systems without episodic memory.
While our results were obtained with a particular kind of sensory representation extracted using SFA, there is a case to be made that our finding would extend to other types of models as well. Using episodic memory to replay the memory of the sensory experience faithfully is equivalent to repeated training iterations on the same data set. It is likely that replay benefits any incremental learning algorithm, since many machine learning algorithms and biological agents require multiple repetitions to converge to an optimum when extracting information from training data. For instance, reinforcement learning algorithms profit greatly from experience replay (Lin, 1993;Mnih et al., 2015). Hence, our finding that replay helps improve sensory discrimination is likely not limited to the particular algorithm that we have used here.
The benefit of retrieving sequences composed of episode fragments from memory might not be quite as general. The generation of novel sequences is equivalent to changing the order of presentation of individual samples. This transposition only makes a difference for learning algorithms that are sensitive to the temporal order in the training data. Such learning algorithms are probably employed by biological systems that have to represent temporal and possibly causal relationships. For instance, it has been suggested that the hippocampus helps to establish associations between spatially and temporally discontiguous events (Pyka & Cheng, 2014;Wallenstein, Eichenbaum, & Hasselmo, 1998). In our model, this is reflected in the sequential nature of episodic memory and the generation of novel sequences during which temporally discontiguous patterns can be associated. The particular algorithm we used in this study, SFA, is just one example of an algorithm that is sensitive to temporal order. It was previously shown to be well-suited for modeling realistic neuronal responses in the visual system (Berkes & Wiskott, 2005) as well as the hippocampus (Franzius, Sprekeler, & Wiskott, 2007), and so appears to be a reasonable choice for an algorithmic model of sensory representations.
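The sensitivity of SFA to temporal order can be illustrated with a minimal linear SFA. The sketch below is not the hierarchical network used in this study; it is the textbook linear variant (whitening followed by an eigendecomposition of the covariance of the temporal differences), and the function name is hypothetical.

```python
import numpy as np

def linear_sfa(x, n_features):
    """Minimal linear slow feature analysis.

    x : (n_frames, n_dims) time series, one frame per row, in temporal order.
    Returns the projection matrix (to be applied to the mean-centered input)
    and the delta values of the extracted features, slowest first.
    """
    x = x - x.mean(axis=0)
    # Step 1: whiten the input so that every projection has unit variance.
    cov = x.T @ x / len(x)
    evals, evecs = np.linalg.eigh(cov)
    whitener = evecs / np.sqrt(evals)  # scales each eigenvector column
    z = x @ whitener
    # Step 2: find the directions in which the whitened signal changes least.
    dz = np.diff(z, axis=0)
    dcov = dz.T @ dz / len(dz)
    devals, devecs = np.linalg.eigh(dcov)  # eigenvalues in ascending order
    weights = whitener @ devecs[:, :n_features]
    return weights, devals[:n_features]
```

Shuffling the rows of `x` leaves the covariance matrix in step 1 unchanged, so an order-insensitive method such as principal component analysis would return the same result. The temporal differences in step 2 do change, however, and with them the extracted features and their delta values. This is the sense in which reordering samples only matters for algorithms that are sensitive to temporal order.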
Similarly, the memory in our model need not necessarily be episodic memory. In our study, we are looking at how the MTL can have an influence on perceptual tasks, and since memory in the human MTL has been linked strongly to episodic memory, we only refer to episodic memory. However, other types of memory could have a similar effect on neocortical representations by improving the quantity or the quality of the training data. Our model, though, has features that are consistent with episodic memory, namely that it has sensory content and that it is sequential.

| Sensory representations for visual discrimination
Our results suggest that visual discrimination is more accurate when the sensory representation has been tuned by intact episodic memory.
After applying noise, the noisy retrieval cue can be anywhere within the respective ring. Retrieval always proceeds by retrieving the pattern closest to the noisy cue. If the noise is low, the noisy cue will not deviate much from the stored key, hence the retrieved pattern will always be the one closest to the key (left panel: No. 0, right panel: No. 1). When the noise is increased, the variability of the noisy cue is at some point so high that a number of different patterns can be retrieved that have a larger distance to the previous pattern. For the retrieval steps within an episode (left panel), this point is reached only for high levels of noise, but for the retrieval steps at the end of an episode (right panel), this happens already at intermediate noise levels. This differential effect is strongest for short episodes, when there are more episode endings. Thus, feature quality is impaired more by intermediate noise for short episodes than for long episodes.
computational reasons, the same principles that we have identified should apply to more complex images as well. Processing those more complex images would require a more advanced sensory representation, which would take longer to train and perhaps consist of more layers. Hence, it could probably profit even more from episodic memory than our very simple representation.
The perceptual task that we modeled was a discrimination task, which involves only two comparisons. However, the procedure to perform an oddity judgment task (Barense, Gaffan, & Graham, 2007;Lech & Suchan, 2014;A. C. H. Lee, Buckley, et al., 2005)  H. Lee, Scahill, & Graham, 2008). While the perceptual-mnemonic hypothesis attributes this activation to the direct involvement of the MTL in perception, incidental memory encoding processes could be an alternative explanation, especially because the presented stimuli are usually trial-unique (A. C. H. Lee et al., 2008). In some studies, the authors acknowledge this possibility while others are designed to control for that. For instance, Lech and Suchan (2014) controlled for incidental encoding by conducting an additional recognition task after the visual task and comparing the recognition of the studied items to the recognition of the items from an oddity judgment task. Since significantly fewer of the items from the oddity task were recognized as compared to the studied items, the authors concluded that no incidental encoding had occurred. However, although the recognition rate on the studied items (~50%) is indeed higher than on the items from the oddity task (~30%), a comparison to random performance is not possible because the false alarm rate was not given. Hence, it cannot be excluded that items from the oddity task have been stored and successfully retrieved from memory. Furthermore, it is not surprising that the memory of incidentally encoded items is weaker than that of explicitly encoded items, especially because the memory had been maintained over a delay period of 1 week.
In another fMRI study by Lee et al., the pool of initially trial-unique stimuli was used three times in an oddity judgment task in order to investigate whether MTL activation decreases when a stimulus is presented repeatedly (A. C. H. Lee et al., 2008). Indeed, the authors found clear evidence of incidental memory encoding in posterior hippocampus and parahippocampus during oddity judgments. For perirhinal cortex and anterior hippocampus the evidence is less clear, but incidental encoding could not be ruled out, especially because the participants' performance on the task improved across the three sessions.

| Functional subdivision of the MTL
Studies often suggest differential roles for the hippocampus and the perirhinal cortex in perception, depending on the stimulus material (A. C. H. Lee, Buckley, et al., 2005; A. C. H. Lee, Fischer, et al., 2005; A. C. H. Lee et al., 2008). It has been proposed that the hippocampus is involved in the perception of scenes, whereas the perirhinal cortex is involved in the perception of faces and complex objects. In lesion studies, performance is compared between patients with localized hippocampal damage and patients with extensive MTL damage, which indeed includes the perirhinal cortex, but also other structures (A. C. H. Lee, Fischer, et al., 2005;Shrager et al., 2006;Suzuki, 2009). Moreover, the hippocampus is usually damaged to a larger degree in MTL patients as compared to hippocampal patients. The influence of the perirhinal cortex on perception is then derived by subtracting the effect of hippocampal lesions from that of MTL lesions. It is possible that the resulting difference reflects the specific contribution of the perirhinal cortex to perception. Alternatively, the lesion size alone could account for the different impairments of hippocampal and MTL patients (Suzuki, 2009). According to this view, scene perception is already impaired by a small lesion, while it requires a larger lesion to impair the perception of objects and faces.
Furthermore, the influence of lesion size on task performance might not be linear, such that a small increase in lesion size could have a large effect on visual perception, or vice versa.
Studies using fMRI are inconclusive in this regard. Some do report differential activation of the hippocampus and the perirhinal cortex depending on the type of stimulus (A. C. H. Lee et al., 2008), while others could not reproduce that finding (Lech & Suchan, 2014; A. C. H. Lee et al., 2006). This leaves room for speculation that the reported activation differences are not attributable to different stimulus categories, but to different stimulus complexities or differences in the low-level features, for example, round or sharp edges, textures, and so on. For instance, it has been shown that second-order image statistics differ between image categories (Torralba & Oliva, 2003).
We conclude that the influence of the MTL on perceptual processes might stem from its mnemonic function, not from a direct role in the perceptual process. Increased activation in MTL areas during perceptual tasks might reflect task-irrelevant, memory-related activity.
Patients with damage to the MTL might be impaired in perceptual tasks because their sensory representation is less optimized, due to the limited availability of episodic memory for the tuning of sensory representations.

ACKNOWLEDGMENTS
The authors thank Boris Suchan for helpful comments on the manuscript. This work was supported by grants from the German Research

DATA AVAILABILITY STATEMENT
The code for the computational model is openly available at https://doi.