
C64 Bank-Switched Double Buffer Scrolling

Bank-switching combined with workload distribution enables dramatic CPU overhead reduction for parallax background scrollers on the Commodore 64. This analysis documents the optimization progression for a multi-layer mountain terrain scroller.

Multi-layer parallax scrolling creates compelling depth perception but demands substantial CPU resources. Each layer requires character data manipulation, color RAM updates, and scroll register management—operations that multiply across layers and accumulate into significant per-frame overhead. This technical study examines the evolution from an unoptimized baseline through progressive refinements yielding order-of-magnitude improvements.

The test case involves a mountain landscape with multiple depth layers: distant peaks, intermediate foothills, and near-ground terrain. Each layer scrolls at different speeds relative to player movement, creating the parallax depth effect. The challenge: achieving smooth multi-layer scrolling while reserving adequate CPU cycles for game logic, sprites, and audio.

Initial Conditions

The baseline implementation consumed approximately 37 raster lines per frame—prohibitive overhead for a background layer. Memory-optimized loop structures prioritized RAM conservation at significant execution cost, creating timing bottlenecks within interrupt handlers.

The original code structure used compact loop constructs—a common approach when RAM conservation takes priority over execution speed. Loops iterate through character rows, performing scroll calculations and memory updates within each iteration. Loop control overhead (counter manipulation, conditional branching) consumes cycles that accumulate across iterations.

With 25 visible character rows and multiple operations per row, even small per-iteration overhead multiplies significantly. The baseline consumed approximately 2,343 cycles per scroll update. Assuming PAL timing (63 cycles per raster line, 312 lines per frame), that is about 37 raster lines, or roughly 12% of the whole frame and nearly 19% of the 200-line visible region. This left insufficient time for other game operations without causing visible frame rate degradation.
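The cycle counts and raster-line figures quoted here are tied together by the number of CPU cycles per raster line. The sketch below is a minimal conversion helper that assumes PAL timing (63 cycles per line, 312 lines per frame); the article does not state the video standard, but its figures are consistent with that assumption. It ignores cycles stolen by badlines and sprite DMA.

```python
# Assumed PAL C64 timing; the article does not name the video standard.
CYCLES_PER_LINE = 63
LINES_PER_FRAME = 312


def raster_lines(cycles):
    """Convert a CPU cycle count to raster lines (badlines and sprite DMA ignored)."""
    return cycles / CYCLES_PER_LINE


def frame_share(cycles):
    """Fraction of one full frame's cycles consumed by `cycles`."""
    return cycles / (CYCLES_PER_LINE * LINES_PER_FRAME)


baseline = 2343
print(round(raster_lines(baseline), 1))       # 37.2
print(round(frame_share(baseline) * 100, 1))  # 11.9
```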

Additionally, the scroll update routine executed within interrupt context, blocking other interrupt handlers during its extended runtime. This created cascading timing problems across the interrupt system, producing visible jitter in sprite multiplexing and parallax layer boundaries.

Phase 1: Loop Optimization

Initial improvements converted memory-efficient rolled loops to unrolled sequential instructions, eliminating iteration overhead:

Implementation    Cycle Count    Raster Consumption
Baseline          2,343          ~37 lines
Unrolled          1,920          ~30.5 lines

Outcome: roughly 18% fewer cycles (423 of 2,343). However, interrupt time remained excessive while RAM consumption increased substantially.

Loop unrolling replaces iteration with repetition: instead of looping 25 times with loop control overhead, the code contains 25 sequential copies of the loop body. This eliminates the counter decrements, the conditional branches, and the extra cycle the 6502 spends on every taken branch, reducing per-row cost to the operations themselves.

The trade-off manifests as increased code size. Where a looped implementation might occupy 50-80 bytes, the unrolled equivalent expands to 300-500 bytes depending on operation complexity. For a single scroll layer, this trade-off typically proves acceptable. However, multiple layers with multiple unrolled routines strain available RAM.
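Unrolled code of this kind is usually generated rather than typed by hand. The following is a minimal sketch of such a generator, not the author's tool: it emits 6502 source that shifts a single screen column left by one character across all 25 rows. The function name and the screen base of $0400 are assumptions for illustration.

```python
ROWS = 25
ROW_STRIDE = 40  # bytes per character row in screen RAM


def unrolled_column_shift(screen_base, column):
    """Emit 6502 source that moves column+1 into column on every row."""
    lines = []
    for row in range(ROWS):
        addr = screen_base + row * ROW_STRIDE + column
        lines.append(f"    lda ${addr + 1:04x}")
        lines.append(f"    sta ${addr:04x}")
    return lines


code = unrolled_column_shift(0x0400, 0)  # $0400 is a hypothetical screen base
size_bytes = len(code) * 3   # LDA abs and STA abs are 3 bytes each
cycles = len(code) * 4       # and 4 cycles each, with no loop overhead
```

For one column this gives 50 instructions, 150 bytes, and 200 cycles, which shows how quickly code size grows once the same treatment is applied to many columns or several layers.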

Despite the improvement, about 30 raster lines per scroll update remained problematic. With multiple parallax layers, a 60 raster line status bar, and sprite multiplexing requirements, the timing budget showed clear deficits. More aggressive optimization was necessary.

Self-Modifying Code Considerations

The 6502 architecture permits self-modifying code: instructions whose operands are rewritten at run time by other instructions. Scroll routines can patch the address operands within the unrolled sequence, eliminating index calculations entirely. This technique provides additional cycle savings but complicates debugging, and it cannot be used for code that executes directly from ROM, such as cartridge code that is not copied to RAM first.
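The operand patching can be modelled outside the machine. The sketch below builds raw machine code for a few unrolled LDA/STA pairs (opcodes $AD and $8D, absolute addressing) and then rewrites the high byte of every store operand, which is the effect a 6502 routine achieves by storing into its own code. The helper names and the addresses are hypothetical.

```python
LDA_ABS, STA_ABS = 0xAD, 0x8D  # 6502 opcodes for absolute-mode LDA and STA


def assemble_copy(src, dst, count):
    """Build `count` unrolled LDA src+i / STA dst+i pairs as raw machine code."""
    code = bytearray()
    for i in range(count):
        for opcode, addr in ((LDA_ABS, src + i), (STA_ABS, dst + i)):
            code += bytes((opcode, addr & 0xFF, addr >> 8))
    return code


def retarget_stores(code, page_delta):
    """Patch the high operand byte of every STA so stores land in another buffer."""
    for i in range(0, len(code), 3):  # every instruction here is 3 bytes long
        if code[i] == STA_ABS:
            code[i + 2] = (code[i + 2] + page_delta) & 0xFF


routine = assemble_copy(0x0401, 0x0400, 3)
retarget_stores(routine, 4)  # stores now target $0800 onward instead of $0400
```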

Phase 2: Bank-Switched Distribution

The primary optimization allocates dual screen buffers (2K in total, 1K more than a single screen) and distributes software scroll operations across hardware scroll transitions.

The fundamental insight: hardware scroll provides 8 pixels of movement through register manipulation alone (the X-scroll bits in $D016). Only at the 8-pixel boundary does software intervention become necessary to reposition character data. By maintaining two complete screen buffers and distributing character updates across the 8-frame hardware scroll cycle, per-frame overhead drops dramatically.
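A small model of the fine-scroll bookkeeping may help. The X-scroll field occupies bits 0-2 of $D016, so the register update is a mask and merge, and a character boundary is crossed whenever the 3-bit value wraps. The function names below are illustrative, and the sign convention for `delta` is an assumption.

```python
def step_scroll(fine_x, delta):
    """Advance the fine scroll by `delta` pixels.

    Returns (new_fine_x, coarse_steps), where coarse_steps is the number of
    whole character columns crossed (negative when moving the other way).
    """
    total = fine_x + delta
    return total & 7, total >> 3


def d016_value(current, fine_x):
    """Merge the 3-bit X scroll into $D016 without disturbing its other bits."""
    return (current & 0xF8) | (fine_x & 7)
```

Only a non-zero `coarse_steps` calls for moving character data; every other frame needs nothing more than the register write.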

Architecture

  1. Dual Buffers: Display alternates between banks while off-screen buffer receives scroll updates
  2. Distributed Workload: Column updates distributed across hardware positions (6-7-7-7-7-4 pattern)
  3. Double-Step Scrolling: Software scroll advances 2 columns (16 pixels) per cycle rather than single-column increments
  4. Buffer Swap: Visible bank toggles during hardware scroll register resets

The dual-buffer concept separates display from update operations. While one buffer displays on screen, the other receives scroll updates off-screen. At the hardware scroll boundary (when X-scroll wraps from 7 back to 0), buffers swap roles—the updated buffer becomes visible while the former display buffer receives the next update cycle.
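The article does not say which register performs the swap. One common arrangement, assumed here, keeps both screens in the same 16K VIC bank and selects between them with the upper four bits of $D018, which address screen memory in 1K steps. The buffer offsets and class name below are hypothetical.

```python
SCREENS = (0x0400, 0x0800)  # hypothetical buffer offsets within the VIC bank


def d018_for_screen(current_d018, screen_offset):
    """Return a $D018 value pointing at a 1K screen, keeping the low nibble."""
    if screen_offset % 0x400 or not 0 <= screen_offset < 0x4000:
        raise ValueError("screen offset must be a 1K multiple inside the bank")
    return (current_d018 & 0x0F) | ((screen_offset // 0x400) << 4)


class DoubleBuffer:
    """Tracks which of two screens is visible and which receives updates."""

    def __init__(self):
        self.visible = 0

    @property
    def hidden(self):
        return 1 - self.visible

    def swap(self, d018):
        """Make the hidden buffer visible and return the new $D018 value."""
        self.visible = self.hidden
        return d018_for_screen(d018, SCREENS[self.visible])
```

Starting from a $D018 value of $15, one swap yields $25 and the next returns to $15.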

Workload distribution recognizes that a full-screen character scroll (40 columns × 25 rows = 1000 bytes) need not occur in a single frame. Spreading the update across six steps means each step processes between 4 and 7 columns, or 100 to 175 bytes, roughly one sixth of the original workload. The 6-7-7-7-7-4 pattern totals 38 columns; it reflects the 8-pixel hardware scroll cycle with minor adjustments for execution timing balance.
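The per-step workload follows directly from the pattern. This sketch partitions the columns according to the article's 6-7-7-7-7-4 split and reports the bytes touched in each step; how the steps map onto particular hardware scroll positions is not specified in the article and is left out here.

```python
PATTERN = (6, 7, 7, 7, 7, 4)  # columns per step, as given in the article
ROWS = 25


def column_slices(pattern):
    """Turn per-step column counts into consecutive column ranges."""
    start = 0
    slices = []
    for width in pattern:
        slices.append(range(start, start + width))
        start += width
    return slices


slices = column_slices(PATTERN)
bytes_per_step = [len(s) * ROWS for s in slices]
print(bytes_per_step)       # [150, 175, 175, 175, 175, 100]
print(sum(bytes_per_step))  # 950
```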

Double-step scrolling compounds the optimization. Rather than scrolling one column (8 pixels) per hardware cycle, the system scrolls two columns (16 pixels). This halves the frequency of software scroll operations while maintaining smooth visual motion at typical scroll speeds.

Results

  • Raster consumption reduced below 25% of baseline (~9 lines versus 37)
  • 75% total raster time recovery
  • 4x improvement over baseline, 3x over unrolled version

Subsequent refinements compressed execution to 1-2 raster lines.

The 75% time recovery transforms the parallax layer from a timing burden into a manageable background operation. At 1-2 raster lines per frame, scroll updates execute almost invisibly within the frame budget, leaving ample resources for game logic, audio, and multiple simultaneous layers.

Resource Considerations

The second screen buffer consumed an additional 1K within the 16K VIC-II bank, displacing 16 sprite definitions. Compensation strategies:

  • Runtime sprite mirroring through code execution rather than pre-defined graphics
  • Dynamic sprite streaming from regions outside active VIC bank boundaries

The VIC-II addresses a 16K memory window for all graphics operations—screen RAM, character definitions, and sprite graphics must all fit within this bank. Adding a second 1K screen buffer reduces space available for other graphics by 1K, equivalent to 16 sprite frames (64 bytes each).

Sprite mirroring addresses graphics that are needed in both facings. Rather than storing separate sprite frames for left-facing and right-facing characters, the engine stores only one direction and generates the mirror image through runtime bit-reversal algorithms. This halves memory requirements for such sprites while adding minor CPU overhead during sprite setup.
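As an illustration of the bit-reversal approach, the sketch below mirrors a hi-res sprite with a 256-entry lookup table: each of the 21 rows has its three bytes swapped end to end and each byte bit-reversed. It is not the author's routine, and it covers hi-res sprites only; multicolor sprites use 2-bit pixels and would need the bit pairs kept intact.

```python
# 256-entry lookup table: each byte value with its bit order reversed.
REVERSE = bytes(int(f"{b:08b}"[::-1], 2) for b in range(256))


def mirror_hires_sprite(sprite):
    """Horizontally flip a 63-byte hi-res sprite (21 rows of 3 bytes)."""
    if len(sprite) != 63:
        raise ValueError("expected 63 bytes of sprite data")
    out = bytearray()
    for row in range(0, 63, 3):
        left, mid, right = sprite[row:row + 3]
        out += bytes((REVERSE[right], REVERSE[mid], REVERSE[left]))
    return bytes(out)
```

Mirroring twice returns the original data, which makes a convenient sanity check.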

Dynamic streaming enables sprites located in main RAM (outside the VIC bank) to appear on screen. During vertical blanking or other non-display periods, the engine copies sprite data from main RAM into VIC-accessible sprite buffers. This expands effective sprite memory to the full 64K address space at the cost of transfer time.
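The streaming step amounts to a 64-byte copy plus a pointer update. In the sketch below, memory is modelled as a 64K bytearray; the function name, the bank base, and the slot number are hypothetical. A sprite pointer holds the sprite's offset within the VIC bank divided by 64, so the slot number doubles as the pointer value.

```python
SPRITE_SIZE = 64  # 63 data bytes plus one padding byte per slot


def stream_sprite(memory, source_addr, bank_base, slot):
    """Copy one sprite from anywhere in RAM into a slot inside the VIC bank.

    Returns the value to write into the sprite pointer.
    """
    if not 0 <= slot < 256:
        raise ValueError("a 16K bank holds 256 sprite slots")
    dest = bank_base + slot * SPRITE_SIZE
    memory[dest:dest + SPRITE_SIZE] = memory[source_addr:source_addr + SPRITE_SIZE]
    return slot


memory = bytearray(65536)
memory[0x8000:0x8040] = bytes(range(64))              # sprite stored outside the bank
pointer = stream_sprite(memory, 0x8000, 0x4000, 0x20)  # lands at $4800
```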

Sprite Coordination

Bank transitions occurring near sprite rendering boundaries introduced potential display corruption. Resolution required NMI-triggered bank switching scheduled prior to affected screen split positions.

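The VIC-II fetches sprite data during specific scanlines based on sprite Y-positions. If a screen bank switch occurs while the VIC-II expects sprite data from the old bank, it reads from the new bank's corresponding addresses, typically producing garbage graphics or invisible sprites. The sprite pointers are a particular hazard: they occupy the last eight bytes of the active 1K screen area, so switching screens also switches the pointers the VIC-II sees.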

The coordination strategy schedules bank switches during safe windows—scanlines where no sprites are being fetched. NMI-based timing provides the precision necessary to place bank switches within these narrow safe zones. The engine calculates safe windows based on current sprite Y-positions and schedules switch operations accordingly.
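The safe-window calculation can be sketched as an interval search. The version below treats each sprite as busy from its Y value for 21 lines (unexpanded height) plus a one-line margin on either side; that mapping from Y register to raster line is a simplifying assumption, and the function names are hypothetical.

```python
SPRITE_HEIGHT = 21  # unexpanded sprite height in raster lines


def busy_lines(sprite_ys, margin=1):
    """Raster lines on which any sprite is assumed to be fetched or shown."""
    busy = set()
    for y in sprite_ys:
        busy.update(range(y - margin, y + SPRITE_HEIGHT + margin))
    return busy


def safe_windows(sprite_ys, first_line, last_line, margin=1):
    """Return inclusive (start, end) line ranges free of sprite activity."""
    busy = busy_lines(sprite_ys, margin)
    windows, start = [], None
    for line in range(first_line, last_line + 1):
        if line not in busy:
            if start is None:
                start = line
        elif start is not None:
            windows.append((start, line - 1))
            start = None
    if start is not None:
        windows.append((start, last_line))
    return windows


print(safe_windows([100, 150], 50, 250))
# [(50, 98), (122, 148), (172, 250)]
```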

For sprites spanning bank switch boundaries, additional complexity arises. A sprite whose rendering begins before the switch and ends after experiences mid-render bank changes. Solutions include: restricting sprite Y-positions to avoid boundary regions, or executing double-speed bank switches that complete within single scanlines.

Complexity Factors

Bidirectional scrolling introduces substantial complexity beyond single-direction implementations. Wraparound synchronization and direction-change handling required significant development iteration to achieve consistent behavior.

Unidirectional scrolling maintains monotonic buffer state—new columns always enter from one edge while old columns exit from the other. Bidirectional scrolling must handle direction reversals, where column entry/exit patterns invert instantaneously. The dual-buffer system must track which buffer contains valid data for each direction and swap appropriately when direction changes.

Wraparound synchronization ensures seamless screen boundaries when scrolling loops. The level map forms a continuous ring, with the right edge connecting to the left edge. Buffer updates must correctly wrap screen RAM addresses at boundaries, and scroll position tracking must handle 16-bit arithmetic with proper carry propagation.
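The position arithmetic for a ring-shaped map can be sketched as follows. The scroll position is kept in pixels and wrapped at the map width, then split into a map column and a fine offset; `add16` mirrors the byte-wise add-with-carry a 6502 routine would perform. The names and the 256-column map width are assumptions for illustration.

```python
def advance(position, delta, map_width_columns):
    """Move a pixel scroll position along a ring-shaped map, in either direction."""
    map_width_pixels = map_width_columns * 8
    return (position + delta) % map_width_pixels


def split(position):
    """Return (map column, fine pixel offset) for a pixel position."""
    return position >> 3, position & 7


def add16(lo, hi, delta):
    """Add an 8-bit delta to a 16-bit value held as two bytes, with carry."""
    total = lo + delta
    carry = total >> 8
    return total & 0xFF, (hi + carry) & 0xFF
```

With a hypothetical 256-column map, stepping one pixel past position 2047 wraps to 0, and stepping back from 0 gives 2047, which is column 255 at fine offset 7.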

Direction-change handling demands immediate response to player input while maintaining visual stability. Abrupt direction reversals can produce a single frame of incorrect scroll direction if buffer swaps and scroll updates execute in wrong order. The engine validates direction state at multiple points within the scroll update sequence to prevent such glitches.

Implementation Notes

Complete source remains unpublished given implementation complexity, undocumented opcode dependencies, and integrated NMI timing requirements. The architectural documentation enables independent implementation by experienced developers while preserving technical challenge.

The implementation uses several undocumented 6502 opcodes for cycle-critical operations. The behavior of some of these opcodes varies slightly across processor revisions, which creates compatibility considerations: code tested on one C64 variant may behave differently on another. Production implementations should verify behavior on both 6510 and 8500 CPUs.

NMI integration adds another layer of system-wide coupling. The scroll system’s timing assumptions depend on NMI handler execution at predicted intervals. Changes to NMI handler content or timing can propagate unexpected effects into scroll behavior. This interdependency demands holistic system design rather than modular component development.

Applicability and Extensions

The bank-switched double-buffer technique applies broadly to scrolling systems beyond this specific implementation. Vertical scrolling, diagonal scrolling, and multi-directional scrolling all benefit from similar workload distribution and buffer management principles.

Color RAM presents a notable exception—the C64’s color RAM occupies fixed addresses ($D800-$DBFF) that cannot be bank-switched. For layers requiring per-character color changes, color RAM updates must execute conventionally during each scroll cycle. The techniques documented here apply fully to character/screen RAM while color RAM follows different optimization paths.

See also: raster budget optimization case study · VSP scrolling alternatives · Wild Wood parallax analysis