I've now moved on to other topics, but I still want to mention some of my thoughts on the still quite poor NAND performance on the GTA02 (and generally, the S3C24xx).
It seems like we are down to a point where the CPU is 100% busy reading from NAND, which is odd. Why would reading from a mass storage device make the CPU so busy? Well, because Samsung "forgot" to add DMA support to all of their integrated NAND controllers, from the old 2410 through the 2440, 2442, 2443 and up to the shiny new 6410, all the NAND controllers don't support DMA. In fact, they don't even have a FIFO or some kind of internal buffer for the received data. This is really weird, considering the facts that
- every other peripheral (SD/MMC, SPI, UART, ...) can use DMA
- Samsung as provider of both NAND flash and SoC should be experts in providing good flash performance
- I cannot see any strong architectural limitation. The data is read into a register. The register should be replaced with a FIFO, and a DMA can regularly read from that register or FIFO and put it somewhere else into memory. It's not any different from e.g. SPI or UART DMA.
In the current mainline Linux driver for S3C2440 NAND, we busy-wait (poll) for completion of the read request. This is of course sub-optimal. I implemented a version that uses the Read/Busy line IRQ and a 'struct complation' based mechanism and to my big surprise the CPU usage and throughput was identical. It seems like that NAND flash in the GTA02 is fast enough to max out the CPU.
So probably all that can be done is to optimize the actual code that is executed during the NAND read. It might be worth implementing some small hand-optimized assembly implementation as standalone code (not using the existing driver) to see how far the hardware can actually go.
Isn't it sad that you use Samsungs SoC and Samsungs NAND (packaged together in one component as multi-chip-package by Samsung) and still get less than 50% of the performance of the NAND chip, according to the data sheets :(