CPU bus, clocks and memory latency
In this new technical article, I will present how central elements like the ST bus, memory and clocks are implemented in zeST. This is a crucial part, since the choices made here constrain all the rest of the hardware implementation. These choices also depend heavily on the hardware platform we are working on.
I will try to explain things as clearly as I can, so anyone with very basic notions of synchronous logic and timing should be able to follow.
A suitable hardware platform
When I decided to jump into hardware design, and to start a re-implementation of the Atari ST as a support project for it, I first had to choose an FPGA board. I really did not want to make my own board, but rather use an off-the-shelf board with the minimum necessary components:
- a chip from Xilinx, because I already had prior experience with the Xilinx development tools,
- a system-on-chip (SoC), with both a CPU and FPGA programmable logic, so the CPU could handle everything that can not be implemented on the FPGA, especially the input/output peripherals like keyboard/mouse and storage devices,
- a decent amount of on-chip memory,
- USB host connectivity,
- an HDMI video output.
After a bit of research I stumbled upon the Z-Turn board, which offers all of this for a decent price.
One of the main questions I did not yet have an answer to was whether the main memory was fast enough to handle the kind of memory accesses the 68000 processor performs. To answer it, I had to study in detail how the 68000 works, and how memory accesses can be managed in a way that is compatible with the 68000’s timings.
Timing of memory accesses on the 68000
So let’s begin with the 68000. Bus accesses (to memory, but also to memory-mapped peripheral registers) are done through what are called bus cycles: sequences of signal exchanges between the processor and the addressed peripherals. The MC68000 user’s manual gives a very precise description of how signals are handled during the different kinds of bus cycles.
Write cycle
Let’s start with the bus write cycle. Here is a timing diagram I generated from actual values coming from the 68000 core in zeST:
Before we go any further, let me explain how to read timing diagrams, in case you are not familiar with them. Each row corresponds to a signal, which can be either a binary signal with only high (1) and low (0) states, or a group of bits that together form a number, usually written in hexadecimal. Such a group of bits is called a bus. For instance, on the above diagram, clk_8 is a binary signal that constantly toggles between 0 and 1 (that is what clocks are supposed to do), and A is a bus. Such diagrams show how signal values evolve over time, with time increasing from left to right. On this diagram, the time values (2376, 2378, etc.) are sample numbers, but in other cases they can be actual time values, in microseconds or nanoseconds for instance. The sampling rate here is two samples per clock cycle, so we can see both the high and low states of the clk_8 clock.
The write cycle is divided into eight states (numbered S0 to S7 in the diagram above). You can notice that the 68000 uses both rising and falling clock transitions, unlike most modern electronics. This means there are two different states in a single clock cycle. The important states are the following:
- In State S1, the processor puts the write address on the A (address) bus.
- In S2, the processor sets the ASn (address strobe) signal to 0, meaning there is a valid address on the address bus. It also sets RWn (read / not write) signal to zero, meaning the current bus cycle is a write cycle. The 16-bit data value to be written is set on the D (data) bus.
- In S4, the processor sets either LDSn, UDSn or both to zero. These are the data strobe signals, enabling the lower byte, the upper byte, or both bytes of the data bus. This makes it possible to write either a single byte (8-bit writes) or both bytes (16-bit writes). The addressed device should set the DTACKn (data transfer acknowledge) signal to 0 as soon as the write order has been processed. On the ST, DTACKn can be asserted by the MMU if the write goes to memory, or by the GLUE if the write goes to a memory-mapped peripheral register, like those of the MFP, the sound chip, etc.
- In S5, if the value of DTACKn is zero, the transaction is validated. Otherwise, the CPU inserts wait cycles until a peripheral asserts DTACKn or a bus error occurs (this article does not cover bus errors, so here we consider all transactions valid).
- In S7, the ASn, LDSn and UDSn signals are set back to 1, their idle value.
- Before the end of S7, RWn is set back to 1, and DTACKn is deasserted by the peripheral.
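The state-by-state description above can be condensed into a small model. The following Python sketch is my own illustrative summary (the signal names come from the diagram, but the table itself is not generated from the core):

```python
# Illustrative summary of the 68000 write bus cycle described above.
# Signals are active-low: 0 means "asserted". States S0/S3/S6 bring no
# new CPU-side signal changes, so they are omitted from the table.
WRITE_CYCLE = {
    "S1": {"A": "write address"},                   # address placed on the bus
    "S2": {"ASn": 0, "RWn": 0, "D": "write data"},  # address valid, write cycle
    "S4": {"LDSn/UDSn": 0},                         # byte lane(s) selected
    "S5": {"DTACKn sampled": "0 = proceed, 1 = wait states"},
    "S7": {"ASn": 1, "LDSn/UDSn": 1, "RWn": 1},     # back to idle
}

def trace_write_cycle():
    """Print the signal changes of a write cycle, state by state."""
    for state in ("S0", "S1", "S2", "S3", "S4", "S5", "S6", "S7"):
        print(state, WRITE_CYCLE.get(state, "(no change)"))
```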
Read cycle
Here is the time diagram for read cycles:
The read cycle happens between the blue and yellow bars. As you can see, it is not very different from the write cycle, apart from the fact that LDSn/UDSn are asserted at the same time as ASn, and RWn stays at 1 during the whole cycle.
Just like for the write bus cycle, the DTACKn signal must be asserted before S5, but this time the peripheral must send the value to be read on the D bus (yes, the data bus is bidirectional), and DTACKn can only be asserted once the data value has been fetched.
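For comparison, the read cycle can be summarized the same way. Again, this table is my own illustration of the description above, not output from the core:

```python
# Illustrative summary of the 68000 read bus cycle. Active-low signals:
# 0 means "asserted". Note the differences from the write cycle: the
# data strobes are asserted together with ASn, and RWn stays at 1.
READ_CYCLE = {
    "S1": {"A": "read address"},
    "S2": {"ASn": 0, "LDSn/UDSn": 0},   # RWn untouched: this is a read
    "S4": {"peripheral": "drives D, then asserts DTACKn"},
    "S5": {"DTACKn sampled": "0 = data latched, 1 = wait states"},
    "S7": {"ASn": 1, "LDSn/UDSn": 1},
}
```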
Constraints resulting from the 68000 bus timings
As you can see from the analysis of the write and read bus cycles, the time window between the actual read/write order (that is, when ASn and UDSn/LDSn are asserted) and the moment the resulting values must be sent and received (confirmed when DTACKn is asserted) is much shorter than the four clock cycles of the whole bus cycle. For a read, the order begins at State S2 and must complete before S5 if we do not want additional wait states, which makes a time window of only 1.5 clock cycles. At the 8 MHz clock frequency, that is a window of 187.5 ns (nanoseconds). Worse, for the write cycle, UDSn/LDSn are only asserted at State S4, so our time window is only half a cycle, which is 62.5 ns. The memory of the ST supports such short delays, so there is no problem on the real hardware. The architecture of the ST is also built so that the cycles outside of memory accesses (corresponding to States S6, S7, S0 and S1) are used to fetch the data for the video display. So two memory accesses can actually be issued during a single bus cycle of the 68000!
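The window figures above follow directly from the 8 MHz clock period; a couple of lines of Python make the arithmetic explicit:

```python
# Time windows of the 68000 bus cycle at 8 MHz.
CLK_HZ = 8_000_000
cycle_ns = 1e9 / CLK_HZ            # one 8 MHz clock cycle = 125 ns

read_window_ns = 1.5 * cycle_ns    # S2 to S5: one and a half cycles
write_window_ns = 0.5 * cycle_ns   # S4 to S5: half a cycle

print(read_window_ns, write_window_ns)  # 187.5 62.5
```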
Timing of the FPGA board’s memory
I measured the duration of memory accesses from the Z-Turn board’s FPGA to the main memory. Memory reads typically take between 180 and 200 ns. So in the best case, we are already at the upper limit of the 68000’s read time window! In the worst case, we are beyond it. Moreover, these access times are not constant, and some memory accesses can last 500 ns! Such exceedingly long accesses are quite rare, though. Memory writes are a little shorter than reads, but still much longer than the 62.5 ns time window.
So what is happening? Why are memory access times worse on a 2010s-era platform than on a computer from the 1980s?
For a long time I thought it was because of the DDR3 SDRAM used on the board, which is known for its comparatively long access latencies. But reading the specifications, this does not add up: typical accesses take less than 20 ns on this kind of SDRAM. The actual main reason is that our FPGA does not deal directly with the RAM, but accesses it through the Zynq’s on-chip memory controller, via a dedicated AXI memory bus. Also, in my design, I use a simplified bus mechanism that I later convert to AXI transactions.
So we need to add delays for the “simplified bus to AXI” signal conversion, which takes about 1 or 2 cycles of the 100 MHz main FPGA clock, each cycle being 10 ns; the total quickly adds up. Then come the internal delays of the on-chip controller, which takes AXI bus requests and converts them into DDR3 transactions. I honestly have no idea what delay is added here. And finally, what I suppose is the worst part: the memory controller is shared between the FPGA and the ARM CPU of the Zynq chip. This means that memory requests are multiplexed, and that our requests have to wait in line until the requests already issued by the CPU have been fulfilled. This also explains the variability of memory access times, which depends on what the CPU is doing. And it explains why some other FPGA-based retro computing projects like MiST and MiSTer rely on dedicated SDRAM that is directly controlled by the FPGA.
Now let me explain to you how I solved the memory latency problem in zeST.
Clock conversion in zeST
In zeST, we are actually dealing with several clocks:
- The ST main clock, which is 8 MHz. As explained above, we actually manage two states per clock cycle, so it is effectively handled like a 16 MHz clock.
- The video clock, which is 32 MHz and corresponds to the pixel clock in ST high resolution mode. Its states are the two states of the ST clock plus two intermediate states.
- The MFP clock which is 2.4576 MHz.
- Other clocks that are 4, 2 and 1 MHz, used by the ACIAs, the YM2149 sound chip or the keyboard processor.
- The main FPGA clock, which is set to 100 MHz, and is the one clock to rule them all.
Actually, all the clocks in zeST are handled using clock enables. These are signals that are synchronous to the FPGA main clock, and determine which edges (transitions from 0 to 1 or from 1 to 0) of the FPGA clock correspond to which edges of the different, slower clocks. So, for instance, the ST main clock, which is 8 MHz, is not a clock signal by itself: it is handled as two clock enables, one for the rising edge (0 to 1 transition) and one for the falling edge (1 to 0), because the 68000 uses both.
The following diagram shows the correspondence between the main 8 MHz clock enables and the resulting simulated clock signal:
Here the sampling period is one sample per FPGA clock period, so each sample corresponds to 10 ns. You can see the clken1 and clken2 clock enables recur every 12 or 13 cycles, giving an average period of 125 ns, which is the period of an 8 MHz clock. The resulting clk_8 signal is shown here only to illustrate the corresponding 8 MHz clock; it is not actually used in the design.
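This enable pattern can be reproduced with a phase accumulator, a standard fractional-divider technique. The sketch below is my own Python illustration of the idea, not the actual VHDL used in zeST:

```python
def clock_enables(n_ticks, fpga_hz=100_000_000, target_hz=8_000_000):
    """Yield (clken1, clken2) for each FPGA tick, using a phase accumulator.

    The accumulator advances by two edges' worth of phase per tick; each
    time it wraps, one edge of the simulated 8 MHz clock is due, and the
    rising/falling enables alternate.
    """
    acc = 0
    edge_is_rising = True
    for _ in range(n_ticks):
        clken1 = clken2 = 0
        acc += 2 * target_hz          # two edges per target clock period
        if acc >= fpga_hz:            # an edge of the slow clock is due
            acc -= fpga_hz
            if edge_is_rising:
                clken1 = 1            # rising edge (0 -> 1)
            else:
                clken2 = 1            # falling edge (1 -> 0)
            edge_is_rising = not edge_is_rising
        yield clken1, clken2
```

Over any long window, clken1 recurs every 12 or 13 FPGA cycles (12.5 on average), which is exactly the 125 ns period of an 8 MHz clock.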
The cycle-exact 68000 core by Ijor that I use in zeST takes those two clock enables as inputs. The rest of zeST’s implementation of the ST components is also based on clock enables, so everything is completely synchronous, paced by the main FPGA clock and by the sequence of clock enables generated by a dedicated piece of logic in the design.
Dealing with memory latency
So now that you know how the different clocks of the ST are managed, let me ask a question: what happens if the clock enables are not evenly distributed over time? Could I, for instance, issue clken2 just one cycle after clken1? Could I issue a burst of several alternating clken1/clken2 pairs, one enable every cycle, then do nothing for several hundred nanoseconds? The answer is yes, because what matters is that you actually get 8 million clock cycles every second, 512 cycles on every video scanline (in 50 Hz PAL), 313 lines per frame, and so on. What happens within short periods of a few microseconds is not visible at our macroscopic scale. The only issue arises when you need to produce a video signal with a steady 32 MHz pixel rate, but that is solved using an asynchronous FIFO buffer.
So, zeST’s solution to memory access latency is to delay some clock enables, namely clken2, since it is responsible for triggering the S5 state of the 68000 bus cycle. When a memory access is pending, we can just wait for it to finish before triggering clken2 and the DTACKn signal that validates the memory transfer. This way we never introduce any wait states in the 68000 bus cycles. The lost time can be caught up by accelerating the clock enable rate until the proper number of clock enables has been issued.
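The principle can be illustrated with a small simulation. This is a deliberately simplified sketch of my own: the real design distinguishes clken1 and clken2 and only delays clken2, while here both are collapsed into a single stream of enables paced by a busy flag:

```python
def elastic_enables(busy, fpga_hz=100_000_000, target_hz=8_000_000):
    """Simulate elastic clock-enable pacing.

    'busy[i]' is True while a memory transaction is pending at FPGA tick i.
    An enable that falls due during a busy tick is deferred; once the bus
    is free again, the deferred enables are issued one per FPGA tick until
    the count has caught up.
    """
    acc = 0
    owed = 0                       # enables due but not yet issued
    out = []
    for b in busy:
        acc += 2 * target_hz       # two edges per target clock period
        if acc >= fpga_hz:
            acc -= fpga_hz
            owed += 1              # one more enable becomes due
        fired = 0
        if owed and not b:
            owed -= 1              # catch up: at most one enable per tick
            fired = 1
        out.append(fired)
    return out
```

Stalling only moves enables later in time; the total count over a long window is unchanged, which is exactly why the macroscopic 8 MHz behaviour is preserved.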
Here is what our memory accesses actually look like:
This diagram shows a few 68000 bus cycles actually running on the FPGA. In the first two rows, you can see the two clock enables, then the resulting clk_8 clock. You can notice it is completely distorted compared to the simulation shown in the previous section. At the bottom you can see the r, r_done, w and w_done signals of my simplified memory bus protocol: a read transaction is in progress (r), the read transaction has finished (r_done), a write transaction is in progress (w), and the write transaction has finished (w_done). You can see that the clk_8 clock signal remains high for long periods; these correspond to pending read or write transactions that are being waited for before enabling clken2, which triggers the low state of the clock.
Conclusion
This clocking mechanism shows that you can use cheap Zynq 7000-based boards to implement an Atari ST without relying on dedicated, FPGA-controlled memory. It should also be enough to implement any kind of retro machine based on a 68000 processor at 8 MHz. I cannot guarantee this would work for faster machines, even though there is room for improvement. Many parts were implemented while I was still learning VHDL and protocols like AXI, and some parts of the design could be rewritten more efficiently. I currently make no use of the fast burst transfer rates of DDR3 memory, so I could implement some basic caching and prefetch mechanism, reducing the total number of DDR3 transactions, thus saving some bandwidth and reducing the delays to be caught up by the clock enable logic.
If you have read this article up to this point, thank you very much. It is a very important article for understanding the technical challenges of implementing an Atari ST on a low-cost, off-the-shelf FPGA board. I had been planning to write it for more than a year, but I never found a satisfactory way of explaining such technical aspects to the general public. I still have doubts about whether it will be fully, or at least partly, understood. If you have comments or questions, please post them by replying to the dedicated post on Twitter.