Like Flash, 3D XPoint Enters The Datacenter As Cache
Tech-News 10-Apr-2017


In the datacenter, flash memory took off first as a caching layer between processors, with their cache memories and main memory, and the ridiculously slow disk drives hanging off the systems. It wasn’t until the price of flash came way down and the capacities of flash cards and drives went up that companies could think about going completely to flash for some, much less all, of their workloads.

So it will be with Intel’s Optane 3D XPoint non-volatile memory, which Intel is starting to roll out in its initial datacenter-class SSDs and will eventually deliver in DIMM, U.2 drive, and possibly M.2 form factors for servers, just as it did with 2D and 3D NAND flash drives so many years ago.

This time, though, the performance and cost of 3D XPoint will be a lot closer to DRAM – and, significantly, it will be addressable as memory – in the systems. This is going to change the way architects design systems and the way programmers push them as they try to find a better balance of components to move more data in and out of systems faster and more predictably. We think, as does Intel, that certain systems where read and write latency is critical will have Optane sprinkled into various memory tiers, and we do not think that 3D XPoint will be a replacement for the much more capacious and much less expensive flash cards, SSDs, and sticks, which probably have much better sequential, as opposed to random, read and write performance.

There could come a day when 3D XPoint replaces NAND flash, just as there will be a day when the last disk drive is sold; it is just that no one knows when that day will be. Enterprises, with their relatively modest datasets, will have a much easier time going all-flash than the hyperscalers will, which is why Google is still talking about disk drive innovation. That said, we think the hyperscalers will be right at the front of the line for Optane storage in whatever form factor they can get, because of the low latency and the more predictable performance that 3D XPoint brings to the table.

Sliding In Between DRAM And Flash – And Next To DRAM

As we have talked about many times before, the memory hierarchy in systems has been getting deeper and wider over the years, and despite the desire for greater simplicity, we are simply never going to get it, particularly with Moore’s Law running up against the limits of process technology and with the prospect that, at some point in the future, compute could get more, rather than less, expensive over time. That is why expanding the memory hierarchy and staging data intelligently is so critical in current systems and will be even more important in the coming years.

While 3D XPoint, the non-volatile memory that Intel unveiled in July 2015 in conjunction with memory partner Micron Technology, has taken decades to come to market, it is coming along at exactly the right time, bridging the gap between flash storage (inside the server and out on the network) and the various levels of cache and main memory in the system in a way that flash has tried to do but just does not.

As you can see from the chart above, 3D XPoint slides right in between DRAM and flash in terms of its latency and data capacity, and it is about an order of magnitude slower and fatter than DRAM, much as DRAM is relative to the on-chip SRAM that makes up the L3 cache on a processor. 3D XPoint latency is three orders of magnitude lower than that of NAND flash SSDs hanging off the PCI-Express bus and, at least in theory, it will be able to offer about the same capacity as NAND flash drives – and it can be addressed with the same load/store semantics as DRAM memory. Application software will no doubt have to be tuned to accommodate the difference in performance between real DRAM DIMMs and 3D XPoint DIMMs, but the net gain in main memory capacity will be a boon for customers who want to keep terabytes of data in main memory, as close to the CPU as possible for performance reasons.
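When 3D XPoint arrives in DIMM form, that load/store addressability means software will touch it the way it touches DRAM, rather than through read() and write() block I/O. Here is a minimal sketch of that programming model, assuming a persistent-memory region exposed as a file on a DAX-mounted filesystem; the path is hypothetical, and this illustrates the general technique, not Intel’s driver stack.

```c
/* Minimal sketch of load/store access to persistent memory via mmap().
 * The path below is hypothetical; on Linux, persistent memory regions
 * are typically exposed as files on a DAX-mounted filesystem. This
 * shows the programming model, not Intel's actual driver stack. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/pmem/example";  /* hypothetical DAX file */
    size_t len = 4096;

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* Map the region; ordinary CPU loads and stores now touch the
     * media directly, with no block I/O in the data path. */
    uint64_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    mem[0] = 42;                  /* a store, not a write() call */
    uint64_t v = mem[0];          /* a load, not a read() call   */
    printf("read back %llu\n", (unsigned long long)v);

    msync(mem, len, MS_SYNC);     /* flush to media for persistence */
    munmap(mem, len);
    close(fd);
    return 0;
}
```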

Disk drives are so slow that they might as well be on the other side of the galaxy, but they are still, when configured properly, the lowest cost high capacity storage method available for extremely large datasets like those at hyperscalers and cloud builders. A judicious mix of 3D XPoint and flash memory front ending disk-based capacity can mask some of the very low performance that disks offer, just as caches in the processor mask the fact that core clock speeds have been stuck between 2 GHz and 3 GHz for years and that we need to add more cores to a chip to get more throughput out of the increasing number of transistors we can etch onto the chip with each process shrink.

As you might expect with Intel, the company is not just kicking out an Optane 3D XPoint SSD and calling it a day. The company has worked with Micron to design and manufacture 3D XPoint memory chips, has designed and built its own storage form factors employing these chips, and is also working on its own interconnect features to make the chips hum in the devices where they are employed. As it turns out, there is a software component to 3D XPoint as well, and Intel has created an interesting tool that allows 3D XPoint SSDs to look like a kind of extended DRAM to the systems that employ them – even if the 3D XPoint memory is not plugged into DRAM DIMM slots, but is working over the PCI-Express bus.

The very first 3D XPoint device coming out of Intel that is aimed at the datacenter is a PCI-Express card called the Optane SSD DC P4800X, which will compete against, and augment the capabilities of, its DC P3700 datacenter-class devices. Intel will eventually ship U.2 2.5-inch form factors of its 3D XPoint devices; U.2 is an NVM-Express link using four lanes of PCI-Express (x4), just like the M.2 memory stick form factor. The P3700s also come in PCI-Express card form factors (known as AIC, short for “add-in card”) and in 2.5-inch U.2 drive form factors.

Before we get into the weeds with the feeds and speeds, let’s talk about which Optane SSDs will be available when. The initial PCI-Express card went into limited availability on March 19 and comes in a 375 GB capacity; it will be more broadly available in the second half of this year, which is also when Intel will be ramping up its “Skylake” Xeon processors and the “Purley” server platform, which has been tuned for both the Skylake processors and the 3D XPoint memory. In the second quarter, Intel will double up the capacity on these Optane PCI-Express cards to 750 GB, and in the second half, it will double that again to 1.5 TB. Intel will also roll out a 2.5-inch U.2 form factor SSD in the second quarter of this year with a 375 GB capacity and then roll out fatter 750 GB and 1.5 TB U.2 devices in the second half.

While James Myers, director of NVM solutions architecture in Intel’s Non-Volatile Memory Solutions Group, who briefed us on the Optane SSDs, did not say that server makers would be getting access to M.2 memory stick devices, there is nothing to stop any of them from putting the 16 GB and 32 GB Optane M.2 devices that were previewed earlier this year at the Consumer Electronics Show in Las Vegas inside of servers. Many hyperscalers use a pair of M.2 NAND flash drives as boot devices, and where speed is important on reboots, they could switch to consumer-grade Optane M.2 storage and get the added benefit of being able to directly address the memory in the M.2 device in a load/store fashion. It is the much higher endurance and lower latency of 3D XPoint compared to NAND flash that makes this memory-style addressability possible.

The Feeds And Speeds

With Optane memory, Intel is keen to make sure customers are gauging it correctly against flash and are therefore using it for its intended purposes. That is why we will look at the performance first and then talk about use cases.

With flash devices, which have many channels, each linking to chunks of flash storage, the queue depth – a measure of how many access operations the device can perform at the same time – can be quite high, and that helps to significantly boost the theoretical performance, in terms of I/O operations per second, that the device can sustain. But with the Optane SSDs that Intel is delivering, the queue depths are quite low by comparison, and yet Intel says this is not only sufficient to get screaming performance, it is actually more indicative of the real-world queue depths that Intel sees with datacenter applications.

“In all of the analysis that we have done with enterprise applications – and we have been saying these things for years now – the majority of applications only see a queue depth of 4 to 8 in most active workloads,” Myers explains. “What is amazing with the Optane SSD compared to the P3700 is that the queue depth 1 performance is eight times higher than what you will see with a NAND SSD. In data sheets, you will see in the fine print that the NAND SSDs have a queue depth as high as 128 or 256, and very high transfer rates when they achieve these results. We are actually showing that the Optane drive will saturate performance around a queue depth of 12 with a 70-30 mix of random 4 KB reads and writes. This provides breakthrough performance, up to 5X to 8X faster, where the Optane drive is actually going to be used.”

Myers adds that putting higher queue depths on the Optane devices would just add latency, so Intel is not going to do that. Because the Optane SSD can be saturated with a queue depth of 12, the company will be specifying performance at queue depths of 1 and 16 on its spec sheets, and comparisons with NAND flash drives should be done at the same queue depths. (It looks like 1, 12, and 16 are the important queue depths to keep in mind for the first generation of Optanes.) Getting such information out of flash suppliers might not be easy, but Intel has done so for its P3700 NAND flash SSDs as part of the datacenter-class Optane rollout.
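To see why such a fast device saturates at such a shallow queue, a back-of-the-envelope Little’s Law calculation (sustained IOPS is roughly queue depth divided by per-operation latency) is all you need. The latencies in the sketch below are illustrative placeholders of ours, not Intel’s specifications.

```c
/* Back-of-the-envelope queue depth math using Little's Law:
 * sustained IOPS ~= queue depth / per-operation latency.
 * The latencies below are illustrative, not vendor specs. */
#include <stdio.h>

int main(void)
{
    double lat_optane_s = 10e-6;  /* ~10 us per 4 KB op (illustrative) */
    double lat_nand_s   = 90e-6;  /* ~90 us per 4 KB op (illustrative) */

    for (int qd = 1; qd <= 16; qd *= 2) {
        printf("QD %2d: optane ~%8.0f IOPS, nand ~%8.0f IOPS\n",
               qd, qd / lat_optane_s, qd / lat_nand_s);
    }
    /* A device with ~10 us latency posts hundreds of thousands of IOPS
     * at single-digit queue depths; a ~90 us device needs a far deeper
     * queue to reach the same headline number. */
    return 0;
}
```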

Here is an example comparing performance of the two devices at low queue depths:

In the tests above, the benchmarks were run on a two-socket “Wildcat Pass” PCSD server made by Intel itself using its top-bin “Broadwell” Xeon E5-2699 v4 processors, which have 22 cores running at 2.2 GHz. The system was configured with 384 GB of main memory and ran CentOS 7.2; the machine was fitted alternately with a 375 GB Optane SSD and a 1.6 TB P3700 SSD.

Speaking very generally, using its synthetic SSD benchmarks, the Optane SSD delivered under 10 microseconds of read latency with that mixed 70/30 workload at 4 KB block sizes, which is about 10X better than the P3700 SSD can do. As for quality of service, which is a big differentiator that Intel is focusing on with the Optane SSDs, the 99.999th percentile access time was about 100X better at under 200 microseconds, so on the flash there are some big, big outliers that can really slow down performance. At the 99th percentile of reads, the quality of service is about 60X better, with the flash-based P3700 hovering at just under 3 milliseconds. Here is the scatter graph of reads in a mixed workload comparing the two devices:

The blue dots show the reads on the P3700 flash device as the I/O benchmark runs; the tiny orange scatter at the bottom shows the distribution of read response times for the Optane SSD. There is just no comparison at all for this workload: Optane 3D XPoint is so much better than NAND flash.
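For readers who want to run the same kind of tail-latency analysis on their own devices, the percentile arithmetic behind these quality-of-service numbers is simple. The sketch below uses the common nearest-rank method on made-up latency samples; it is a generic illustration, not Intel’s benchmark methodology.

```c
/* Generic tail-latency percentile calculation over a sample set.
 * This sketches the arithmetic behind QoS figures like "99th
 * percentile read latency"; it is not Intel's methodology. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: sort, then take element ceil(p/100 * n). */
static double percentile(double *sorted, size_t n, double p)
{
    size_t rank = (size_t)ceil(p / 100.0 * (double)n);
    if (rank == 0) rank = 1;
    return sorted[rank - 1];
}

int main(void)
{
    /* Made-up latency samples in microseconds. */
    double lat[] = { 9.1, 9.8, 10.2, 9.5, 11.0, 9.9, 10.4, 180.0,
                     9.7, 10.1, 9.6, 10.0, 9.4, 9.3, 10.5, 9.2 };
    size_t n = sizeof(lat) / sizeof(lat[0]);

    qsort(lat, n, sizeof(double), cmp_double);
    printf("p50 = %6.1f us\n", percentile(lat, n, 50.0));
    printf("p99 = %6.1f us\n", percentile(lat, n, 99.0));
    /* One outlier (180 us) dominates the tail even though the median
     * barely moves -- which is exactly what the scatter plot shows. */
    return 0;
}
```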

The new way of testing performance that Intel is putting out there with its Optane SSDs is what it calls responsiveness under load. Here, the system emulates a multitenant environment, specifically a set of high IOPS applications running on a machine that is also spinning up and down a lot of virtual machines with workloads on them. A random write workload pushes against the drive while, at the same time, a single-threaded read workload pulls at the drive.

In this particular test, the write rate keeps getting stepped up over time, and the average read latency on the NAND flash and 3D XPoint drives is shown in the same blue and orange colors. As the write load from VM starts increases over the half hour that this test was run, you can see how the average read response time on the flash climbs steadily while the 3D XPoint read response time stays fairly consistent – from 8X better at the low end of the load to 40X better at the high end. Myers says that the P3700 flash SSD could not take on more work at a write rate above 650 MB/sec, by the way, while the Optane SSD kept handling workloads above 800 MB/sec. In fact, the Optane response time was well under 30 microseconds for the whole test, all the way out to 2 GB/sec of write workload. (The chart above does not show that.)
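The shape of that experiment is easy to approximate in a crude way: keep write pressure on a device from one thread while timing small reads from another. The sketch below is our own toy version using buffered I/O on a made-up file path, not Intel’s harness; a rigorous test would use O_DIRECT against a raw block device to take the page cache out of the picture.

```c
/* Crude "responsiveness under load" sketch: one thread writes at
 * random offsets while another times 4 KB reads. A real harness
 * would use O_DIRECT on a raw block device; this toy version just
 * illustrates the shape of the experiment. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (64 * 1024 * 1024)            /* 64 MB test file */
#define BLOCK     4096

static const char *path = "/tmp/rul_test.dat";  /* made-up path */
static volatile int done = 0;

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *writer(void *arg)
{
    (void)arg;
    int fd = open(path, O_WRONLY);
    char buf[BLOCK];
    memset(buf, 0xA5, sizeof(buf));
    while (!done) {  /* keep random-offset write pressure on the file */
        off_t off = (rand() % (FILE_SIZE / BLOCK)) * (off_t)BLOCK;
        pwrite(fd, buf, BLOCK, off);
    }
    close(fd);
    return NULL;
}

int main(void)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0) { perror(path); return 1; }

    pthread_t wr;
    pthread_create(&wr, NULL, writer, NULL);

    char buf[BLOCK];
    for (int i = 0; i < 10; i++) {  /* time reads while writes rage */
        off_t off = (rand() % (FILE_SIZE / BLOCK)) * (off_t)BLOCK;
        double t0 = now_s();
        pread(fd, buf, BLOCK, off);
        printf("read %d: %.1f us\n", i, (now_s() - t0) * 1e6);
    }

    done = 1;
    pthread_join(wr, NULL);
    close(fd);
    return 0;
}
```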

For workloads where cost per unit of capacity, deep queues, or high sequential read performance is the requirement, Myers says that a NAND flash device is a better bet. (Just the same, we would have loved to see sequential performance data for the Optane SSDs.)

Thanks to the much higher endurance of 3D XPoint compared to flash, it will be a better option for caching and main memory extension jobs that were not as well suited to flash (but which have been built from flash and other components just the same in a number of cases, such as Memory1 from Diablo Technologies or NVDIMMs from Micron, just to name two options).

Flash memory has to do a lot of juggling to write data, and that juggling eventually wears the flash cells out. 3D XPoint takes a “write in place” approach, says Myers, which helps give it higher endurance, as does the automatic load balancing across devices, which boosts throughput, too. A NAND flash drive might be able to support anywhere from one half to ten drive writes per day, but the Optane P4800X SSD is starting out with the ability to support 30 drive writes per day – a rate so high that it is not even physically possible to push that many writes through the current form factors, but why not?

“This drive is specified at 30 drive writes per day, and we believe that this should meet the majority of all of the uses in both storage and caching as well as using this device to extend memory,” says Myers. “And in those memory usages, it will take quite a bit more workload, but based on our testing, this is the right metric.” Myers added that the DWPD level will increase over time, and as it does, the device will become ever more suitable for memory expansion jobs.
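To put 30 drive writes per day in perspective, the standard endurance arithmetic is total bytes written = DWPD x capacity x days in service. The five-year service life in the sketch below is our own illustrative assumption, not a figure from Intel’s spec sheet.

```c
/* Standard SSD endurance arithmetic: lifetime bytes written =
 * DWPD x capacity x days in service. The five-year service life
 * is an illustrative assumption, not a vendor spec. */
#include <stdio.h>

int main(void)
{
    double capacity_tb = 0.375;  /* 375 GB P4800X        */
    double years       = 5.0;    /* assumed service life */

    double dwpd[]       = { 0.5, 10.0, 30.0 };
    const char *label[] = { "NAND (0.5 DWPD) ", "NAND (10 DWPD)  ",
                            "Optane (30 DWPD)" };

    for (int i = 0; i < 3; i++) {
        double pbw = dwpd[i] * capacity_tb * 365.0 * years / 1000.0;
        printf("%s -> %6.2f PB written over %.0f years\n",
               label[i], pbw, years);
    }
    /* 30 DWPD on a 375 GB drive is about 11.25 TB per day, or
     * roughly 20.5 PB over five years. */
    return 0;
}
```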

Horses For Courses

Intel sees the Optane SSDs that it is rolling out this year being used in two main ways. One is for fast storage and caching of slower storage, and the other is for extending main memory.

The caching use, particularly for latency-sensitive metadata applications, is an obvious one, and given the very consistent performance levels, it is a no-brainer to expect many customers to pile on here. By sprinkling 3D XPoint memory into systems, customers in financial services, intelligence, and other economic sectors where applications have latency thresholds are going to want to add Optane cards or DIMMs or SSDs, or maybe even M.2 sticks, even if Intel does not officially offer a server-class variant of the sticks.

Moreover, Intel is stressing the caching use case because it cannot yet make 3D XPoint memory in the kinds of volumes it is planning for the coming years, and therefore it cannot expect to push the devices on raw capacity. In a way, 3D XPoint is the new flash, flash is the new disk, disk is the new tape, and tape is something akin to geology. . . .

To give a real-world example of how Optane SSDs might be used for raw storage, Myers trotted out MySQL, which he says is the backbone for most of the instant messaging among the major hyperscalers in China and which is also a big part of the infrastructure of hyperscalers and cloud builders in the United States. Intel fired up the Sysbench MySQL benchmark using a 100 GB database on a two-socket Xeon server with the same E5-2699 v4 processors, 384 GB of DDR4 memory, and a single 400 GB Intel S3710 SSD as a boot device; this machine ran CentOS 7.2 and MySQL Server 5.7.14. Intel put one P3700 flash SSD in the machine and ran the test, then swapped it out for an Optane SSD. Here is how they stacked up:

The machine with the Optane SSD was able to pump through 11.8X as many SQL transactions with roughly the same latency for the 99th percentile of transactions. That means companies could boost their throughput by more than an order of magnitude by switching storage in their MySQL servers, or cut the number of nodes by an order of magnitude, or do some combination of both to reduce costs and increase performance at the same time.

Another way to use the 3D XPoint acceleration is to get a fixed amount of work done faster. In another test, Intel measured the time it took to complete 1 million Sysbench transactions. With the Optane SSD acting as a cache layer for a flash-based MySQL database, it took about 350 seconds to finish the job on a single two-socket Xeon box, but it took 5.6X longer to do so without that Optane SSD.

Intel also unveiled some stats on accelerating and expanding Ceph object storage arrays by using 3D XPoint instead of flash to house the metadata that describes where objects are located on an all-flash array running Ceph. Take a look:

Just by swapping the single flash SSD holding that metadata for an Optane SSD in the two-socket node running Ceph, Intel was able to push more than 4.5X more metadata operations through the setup, which also allowed the capacity of the storage array to be expanded by a factor of 4.5X while yielding 10X better responsiveness to boot.

The last workload that Intel showed off using Optane SSDs was memory extension, and the idea here is that customers do not have to wait for 3D XPoint DIMMs to do this – something that perhaps many customers were expecting to have to do. Intel has created a software layer – a kind of memory hypervisor – that gloms together DRAM DIMM capacity on the memory bus and Optane 3D XPoint capacity on the PCI-Express bus and runs them as an aggregated pool of memory. This software, called Memory Drive Technology, uses the load/store semantics of main memory and does not require changes to the operating systems or the applications running on the machines. While Memory Drive Technology can in theory run on any processor architecture, Intel is only supporting it on its own Xeon chips. So write your own, ARM and Power and Opteron vendors.

Intel ran a bunch of tests on this memory pool setup. In one test, it ran the SGEMM matrix multiplication benchmark on that two-socket system with the top-bin Broadwell Xeons plus 768 GB of main memory; this system delivered 2,322 gigaflops. Intel then unplugged all but 128 GB of the DDR4 memory and put in four of the 375 GB Optane SSDs, for a total addressable space of 1.6 TB. Because more of the dataset could be loaded into “memory” and because of the data placement and load balancing features of the Optane setup (which sometimes does better than NUMA data placement in memory, even on two-socket servers), this latter setup could push 2,605 gigaflops on the SGEMM test.

For transaction processing workloads on the memory-extended machine, Intel fired up that Sysbench MySQL workload on these same two configurations; the plain vanilla DDR4 machine was able to do 1,077 transactions per second, and the memory-extended version, with its mix of DDR4 and Optane SSDs, was able to do 870 transactions per second. That is 81 percent of the work, presumably on a memory complex that is a lot less expensive. The performance could presumably be better still with Optane DIMMs and native load/store semantics on the actual memory bus, but Intel might charge a lot for those Optane DIMMs, so this SSD approach to memory extension might provide better bang for the buck.

The Broadwell Xeons top out at 24 DDR4 memory slots, and at 128 GB per stick, that works out to a maximum of 3 TB of actual main memory in the system. Memory Drive Technology is currently capped at 12 TB of addressable space per socket; the 46-bit physical addressing ceiling in the Xeon architecture – which Intel is not expected to raise during the Skylake generation, by the way – means memory tops out at 64 TB, no matter the media. In any event, with this generation of Xeons and Memory Drive Technology, a two-socket server can have 24 TB of mixed DDR4-Optane memory, which is a factor of 8X over the plain vanilla DDR4 maximum using very, very expensive 128 GB memory sticks. A four-socket Xeon E7 server tops out at 12 TB of real DDR4 memory, but can have an Optane-extended space that hits 48 TB.
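The capacity math is easy to sanity check; this little sketch just replays the figures from the paragraph above.

```c
/* Replaying the memory capacity arithmetic from the paragraph above. */
#include <stdio.h>

int main(void)
{
    /* Plain DDR4 maximum on a two-socket Broadwell Xeon machine. */
    int slots = 24, gb_per_dimm = 128;
    printf("DDR4 max:           %d GB (%d slots x %d GB)\n",
           slots * gb_per_dimm, slots, gb_per_dimm);   /* 3072 GB = 3 TB */

    /* Memory Drive Technology cap: 12 TB per socket, so 8X the
     * DDR4 maximum on a two-socket box. */
    int tb_per_socket = 12;
    printf("2S Optane-extended: %d TB\n", 2 * tb_per_socket);  /* 24 TB */
    printf("4S Optane-extended: %d TB\n", 4 * tb_per_socket);  /* 48 TB */

    /* Physical addressing ceiling: 46 bits -> 2^46 bytes = 64 TB. */
    printf("46-bit ceiling:     %llu TB\n",
           (unsigned long long)((1ULL << 46) / (1ULL << 40)));
    return 0;
}
```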

Intel is charging $1,520 for the P4800X Optane SSD with 375 GB of capacity. Over at Hewlett Packard Enterprise, a 128 GB DDR4 memory module running at 2.4 GHz will run somewhere around $6,500 to $7,200, depending on who you ask; a 64 GB module looks to be anywhere from $1,800 to $4,500, which is a ridiculous spread. The more common and more popular 32 GB sticks cost somewhere under $1,200 or so. That works out to about $4 per GB for the Optane SSD and about 10X that for DDR4 main memory built from relatively skinny sticks. Intel could charge somewhere between $4 per GB and $40 per GB for Optane DIMMs and, so long as the performance is decent, take a very big bite out of the DDR memory business.
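The dollars-per-gigabyte comparison works out as follows, using only the street price ranges quoted above:

```c
/* Price-per-GB arithmetic using the street prices quoted above. */
#include <stdio.h>

int main(void)
{
    printf("Optane P4800X 375 GB @ $1,520:  $%.2f/GB\n", 1520.0 / 375.0);
    printf("DDR4 128 GB @ $6,500-$7,200:    $%.2f-$%.2f/GB\n",
           6500.0 / 128.0, 7200.0 / 128.0);
    printf("DDR4  64 GB @ $1,800-$4,500:    $%.2f-$%.2f/GB\n",
           1800.0 / 64.0, 4500.0 / 64.0);
    printf("DDR4  32 GB @ ~$1,200:          $%.2f/GB\n", 1200.0 / 32.0);
    /* Optane lands around $4/GB; DDR4 lands anywhere from about
     * $28/GB to $70/GB depending on stick size -- hence the rough
     * 10X gap cited above. */
    return 0;
}
```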

One last thing: a bundle of the P4800X Optane SSD and the Memory Drive Technology software costs $1,951, and that memory hypervisor has to be licensed for every Optane SSD that is to be part of a memory subsystem.