Visuo-manual interaction in visual short-term memory (VSTM) has received little investigation, despite its importance in everyday tasks that require the coordination of visual perception and manual action. This study examines the influence of a manual action performed during stimulus learning on a subsequent VSTM test for object appearance. The memory display comprised a sequence of briefly presented 1/f noise discs (i.e., stimuli with spectral properties akin to those of natural images), with each new stimulus presented at a unique screen location. Participants either performed or did not perform a concurrent manual action (spatial tapping) task, in which a hand-held stylus was moved to a position on a touch tablet that either corresponded or did not correspond to the screen position of each new stimulus as it appeared. At test, a single stimulus was presented, either at one of the original screen positions or at a new position. Two factors were examined: whether spatial tapping was executed and, if so, whether at a corresponding or non-corresponding position; and whether test stimuli were presented at their original spatial positions or at new positions. We find that spatial tapping at corresponding positions elevates VSTM performance by more than 15%, but only when stimulus positions are matched between the memory and test displays. Our findings suggest that multimodal attentional focus during stimulus encoding (incorporating visual, spatial, and manual components) leads to stronger, more robust memory representations. We propose several possible explanations for this effect.