Mandelbrot Explorer v2 for the Altera Cyclone II FPGA
October 10, 2010
In my free time for the past several weekends, I've been working on a complete revamp of the FPGA-based Mandelbrot generator our group created for our digital design class last year.
My goal was to take all that I've learned about hardware design for the past several months as a research assistant in the Oakland University Nano-imaging laboratory, and apply it towards a redesign of this old project, to see how far I could push it. My two design goals: (1) at least double the data width while (2) meeting or exceeding the performance of the original.
The impetus for goal 1 stems from the fact that the precision of numbers used in the calculation has a direct impact on depth to which you can zoom into the Mandelbrot before the image becomes pixely and distorted. As was discussed in the posting for the first generator, 32 bits of fixed-point precision allows for about 23 zooms (where each zoom is a factor of 2), before the number representation falls apart and the fractal loses it's resolution. Consider that the starting Mandelbrot is 19 inches across (size of monitor), and that each zoom doubles the effective size of the Mandelbrot. Therefore, the monitor required to view the entire fractal after 23 zooms is 2^23 * 19 inches = 2516 miles across! Now, suppose that we increased the data width to 64. This would allow for about 55 total zooms (9 bits are needed to differentiate the points the screen). 2^55 * 19 inches = 10.8 trillion miles or 1.8 light years! So, at a minimum, I decided to pursue a data width of 64 bits, twice that used on the original ECE378 project. A light-year wide, or bust!
So, to accomplish this, I decided to take a closer look at the FPGA used and what it's capable of. Instead of simply using the VHDL fixed point libraries and letting the compiler decide which multiplier hardware to infer from that, I decided to generate the multiplier component directly, so as to best utilize the available multiplier blocks on the FPGA. The first stop was Xilinx CoreGen, where I configured a 64x64 signed multiplier, the widest possible. According to CoreGen this utilized 16 MULT18X18 blocks. My Xilinx Spartan3E-500 has 20 MULT18X18, so this chip could only hold a single 64x64 multiplier. The project would certainly work, but I hoped to fit more, to parallelize the calculation as much as possible. So, I looked toward my Altera Cyclone II-35 board instead as an alternative. A 64-bit multiplier consumed 32 DSP_9BIT components (the Cyclone II-35 has 70), so I could easily fit two multipliers. (Towards the end of this project, I thought about this a bit more, and it occurred to me that I should be able to increase this width even further without using any more multiplier blocks...64 is not evenly divisible by 9, but 72 is! So, I increased the width to 72, and while there were more LUTs consumed, it still used 32 DSP_9BIT blocks. Cool!)
So, to accomplish this, I decided to take a closer look at the FPGA used and what it's capable of. Instead of simply using the VHDL fixed point libraries and letting the compiler decide which multiplier hardware to infer from that, I decided to generate the multiplier component directly, so as to best utilize the available multiplier blocks on the FPGA. The first stop was Xilinx CoreGen, where I configured a 64x64 signed multiplier, the widest possible. According to CoreGen this utilized 16 MULT18X18 blocks. My Xilinx Spartan3E-500 has 20 MULT18X18, so this chip could only hold a single 64x64 multiplier. The project would certainly work, but I hoped to fit more, to parallelize the calculation as much as possible. So, I looked toward my Altera Cyclone II-35 board instead as an alternative. A 64-bit multiplier consumed 32 DSP_9BIT components (the Cyclone II-35 has 70), so I could easily fit two multipliers. (Towards the end of this project, I thought about this a bit more, and it occurred to me that I should be able to increase this width even further without using any more multiplier blocks...64 is not evenly divisible by 9, but 72 is! So, I increased the width to 72, and while there were more LUTs consumed, it still used 32 DSP_9BIT blocks. Cool!)
Next was the architecture of the calculation itself. The old design was a little simplistic. It performed all of the operations necessary for the completion of a single iteration (3 multiplies, 2 adds, 1 subtract, 1 shift) in a single clock cycle. This was easy to implement, but all that logic means it cannot be clocked very quickly. The maximum frequency of the single-cycle 32-bit Mandelbrot core was only 25MHz. 1/(25MHz) yields a clock period of 40ns, and we were able to fit two of these on a chip, yielding 2 iterations/40ns = 1 iteration/20ns, translating to a theoretical maximum of 5*10^7 iterations per second.
72-bit wide multiplications are even harder to finish in a single clock cycle. I got a rough idea of their timing properties by compiling a few different configurations in Quartus. A single-cycle 72x72 multiplier had a maximum frequency of around 20MHz. If the 3 multiplies were performed serially, followed by a cycle for the smaller operations, this would equate to 4 cycles at a clock period of 50ns, or an entire 200ns for only 1 iteration! I experimented a bit, and found that one with a pipeline depth of 6 could function safely at 100MHz. So I ran with it, building a 12-stage state machine around this multiplier. There are 3 multiplies necessary for the iteration of a single point. These are fed in and the answers are available by the end of 9 clock cycles. (See pipeline diagram.) As the answers become available, the necessary addition, subtraction, shift and compare operations are carried out (each of these operations is small enough to do in a single clock cycle), before it is time to iterate that point again. The state machine runs around and keeps the multipliers fed with the calculations for four points at a time, until all of them are completed, and it is free to operate on the next quartet of points. (It should be noted that there is some inefficiency possible with this design, as if one point completes before the other three, the state machine waits until all are done before proceeding. I hypothesized that the small gains made by implementing a more sophisticated multiprocessing scheme to keep the calculation circuitry truly continuously fed with available points would probably be outweighed by the overhead needed to implement it on-chip. Having said that, this is an area that I would like to explore further in the future!) Once the pipeline is filled, this Mandelbrot core will churn out 4 iterations every 12 clock cycles = 4 iterations/120ns. There are two Mandelbrot cores present on the FPGA, yielding 8 iterations/120ns = 1 iteration/15ns, so this translates to a theoretical performance of 6.67*10^7 iterations per second, 33% faster than the old version! Huzzah! In real-world benchmarks, the gains ended up being even greater, probably due to some major optimizations made to the plotting state machine.
I benchmarked the time to render the following screen on both systems:
Both systems function at 640x480 resolution, and were set to 255 maximum iterations per point.
Old generator: 693ms
New generator: 391ms => 1.77x as fast!
I also added the ability to increase the calculation effort, if desired. This becomes necessary during the deeper zooms made possible by the increased precision. At the left is a screen drawn with the max iterations set to hex FF (255), the hardcoded max of the old Mandelbrot generator. At the right is a screen drawn with the max iterations set to hex 1FF (511). The one at the left took 1096ms to generate, while the one at the right took 1277ms to generate. With a little extra time spent, we were able to obtain that much more detail. The new Mandelbrot generator can be set to take as many as 65535 iterations per pixel per frame.
At the end of the day, the new 72-bit Mandelbrot generator is capable of zooms of up to 2^62 or 4.61*10^18 times! Again, assuming you start with the full set visible on 19 inch monitor, this means that when we've fully zoomed in, you would need a monitor 238 light years across to view the entire Mandelbrot set! It's pretty safe to say that when you're zoomed in that far, you're very likely looking at something that no other human being has ever seen before! All explored very quickly in custom hardware on this little 1" by 1" piece of silicon sitting on your desk. It's hard not to sit there in awe that what you're experiencing is even possible. There is, in this simple equation, a hidden world of infinite detail and beauty, one we are free to explore, only limited by the precision of our tools and the speed at which they operate.
UPDATE 2016/3/10:
Check out version 3, targeting a Xilinx Spartan-3A FPGA on the Mercury development board!