Power Management

Idle power management for SSDs can be surprisingly complicated, especially for NVMe drives. But it is also vitally important for any battery-powered system. Real-world client storage workloads leave SSDs idle most of the time, so idle behavior is a big factor in how battery-friendly a drive is. Power draw when idle isn't the only thing that matters; how quickly a drive can enter or wake up from a low-power state can have a big impact on how effective its power management is.

For SATA SSDs, the host system doesn't have a lot of say in how the drive manages power. Using the SATA Aggressive Link Power Management (ALPM) feature to mostly power down the SATA connection is usually sufficient to put a drive to sleep. But the lowest-power sleep state supported by SATA devices (DevSleep) requires extra signaling on a pin that's part of the SATA power connector. This means that DevSleep is in practice only supported on laptops, and our desktop testbeds cannot use or measure this sleep state.

NVMe includes numerous features pertaining to power and thermal management. Most of them are optional in the NVMe spec, but there's a common subset supported by most consumer SSDs. NVMe drives can support several different power states, including multiple active and multiple idle power states. The drive's firmware provides information about its capabilities to the host system:

Samsung 980 PRO NVMe Power States
(Controller: Samsung Elpis, Firmware: 1B2QGXA7)

Power State   Maximum Power   Active/Idle   Entry Latency   Exit Latency
PS 0          8.49 W          Active        -               -
PS 1          4.48 W          Active        -               0.2 ms
PS 2          3.18 W          Active        -               1.0 ms
PS 3          40 mW           Idle          2.0 ms          1.2 ms
PS 4          5 mW            Idle          0.5 ms          9.5 ms

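On Linux, these power state descriptors can be read with the nvme-cli utility (a quick sketch; it assumes nvme-cli is installed and the drive is at /dev/nvme0):

    # Dump the controller's identify data. The "ps 0" through "ps 4"
    # entries at the end list each state's maximum power (mp) and its
    # entry/exit latencies (enlat/exlat, in microseconds).
    sudo nvme id-ctrl /dev/nvme0
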
When a drive and the host OS both support the Autonomous Power State Transition (APST) feature in NVMe 1.1 or later, the host system can give the drive a set of rules for how long it should wait while idle before dropping down to a lower-power state. Operating systems choose these delays based on the power state entry and exit latencies claimed by the drive, and on other configuration information about the system's overall tolerance for increased disk access times.
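
To see what rules an OS has actually programmed into a drive, nvme-cli can decode the APST table; feature ID 0x0c is the Autonomous Power State Transition feature (the device path is again an assumption):

    # Print the drive's current APST rules in human-readable form. Each
    # entry pairs an idle time with a target power state, along the
    # lines of "enter PS4 after N milliseconds of idleness".
    sudo nvme get-feature /dev/nvme0 -f 0x0c -H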

One common problem with the NVMe APST feature is that the NVMe spec doesn't really say anything about how APST interacts with PCIe Active State Power Management (ASPM). SSD vendors tend to assume that, for example, a system which configures the drive to use its deepest idle state will also fully support PCIe ASPM. Most of the time things work out, but it's also possible to end up with a drive that goes to sleep and never wakes up, or a drive that falls back to its highest power state if anything goes wrong when it tries to go to sleep.

Using our Coffee Lake testbed, which has fully functional PCIe power management, we test SSD power in three states. Active idle is when the drive is not using any externally configurable power management features: SATA or PCIe link power management is disabled, and NVMe APST is off. We're now using a more reliable and broadly compatible method for disabling APST, going through the Linux kernel rather than directly poking the drive's registers. This means that some drives will probably end up showing higher active idle power draw than we have previously measured.
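
As an illustration of the kernel-level approach (a sketch of the general technique, not necessarily our exact procedure): the Linux nvme_core module exposes a latency-tolerance parameter, and setting it to zero prevents the kernel from enabling APST at all:

    # Add to the kernel command line and reboot to keep NVMe drives in
    # their active idle state:
    #   nvme_core.default_ps_max_latency_us=0
    # Verify the value currently in effect:
    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us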

Even though there are many combinations of power management settings and power states that can be used with a typical consumer NVMe SSD, we condense them down to just two low-power configurations to test. What we call "Desktop Idle" uses the features that are almost always available and working on desktop platforms, even if they're off by default: SATA ALPM, NVMe APST, and PCIe ASPM.

Next, we have the "Laptop Idle" state, with all the power-saving features fully enabled. For SATA SSDs this would include DevSleep, so we can't fairly measure the Laptop Idle power draw of SATA drives. For NVMe SSDs, this includes enabling PCIe L1 substates.
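
For reference, rough Linux equivalents of these two configurations look something like the following (a sketch; host0, the policy names, and their availability depend on the kernel version and hardware):

    # "Desktop Idle": SATA ALPM plus PCIe ASPM (NVMe APST is enabled by
    # default when the kernel supports it). Run as root.
    echo med_power_with_dipm > /sys/class/scsi_host/host0/link_power_management_policy
    echo powersave > /sys/module/pcie_aspm/parameters/policy

    # "Laptop Idle": additionally allow the PCIe L1 substates.
    echo powersupersave > /sys/module/pcie_aspm/parameters/policy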

[Charts: Idle Power Consumption - No PM / Idle Power Consumption - Desktop / Idle Power Consumption - Laptop]

Accurately measuring the time it takes for a drive to enter a low-power state is tricky, but measuring the time taken to wake up is straightforward. We run a synthetic test that performs a single 4kB random read once every 10 seconds. When power management features are disabled and the drive stays in its active idle state, the random read latency will be determined mainly by the speed of the NAND flash. When the drive is in the Desktop Idle or Laptop Idle state, it will go to sleep between each random read, so we can repeatedly sample the time taken to wake up and perform a random read. The difference between this time and the random read latency from the drive in the active idle state is due almost entirely to the overhead of waking up the drive from a sleep state, and this difference is what we report as a drive's wake-up latency.
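
A sketch of this kind of probe using fio (the exact tool and options aren't spelled out above, so treat the parameters as illustrative):

    # Issue one 4kB random read, then pause 10,000,000 us (10 s), and
    # repeat for ten minutes. The completion latencies, compared against
    # the same run with power management disabled, give the wake-up
    # overhead.
    fio --name=wakeup --filename=/dev/nvme0n1 --direct=1 \
        --rw=randread --bs=4k --iodepth=1 \
        --thinktime=10000000 --time_based --runtime=600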

[Chart: Idle Wake-Up Latency]

Conclusions

In this article we hope we've given you some insight into how much goes into testing a modern solid state storage drive - something more than just running CrystalDiskMark and noting peak sequential speeds! The new suite is not only more in-depth, but we've also streamlined it somewhat for automation, enabling fewer sleepless nights as deadlines loom on the horizon (or, put another way, more reviews to come). We're obviously keen to hear additional feedback on the testing, so please leave a comment below.

Comments

  • thestryker - Monday, February 1, 2021 - link

    The explanations along with all of the data were a great way to show insight into both the why and how for the new bench setup.

    The only thing I'd like to see is the 900p/905p Optane drives added in; wishful thinking would be the P5800X. Even though they're at relatively unattainable prices due to low volume, discontinuation, and/or being enterprise parts, they do represent NAND alternatives with a rather different performance profile. Hopefully Intel will opt to bring another consumer version out once they have broad PCIe 4+ support across their consumer product line.
  • Billy Tallis - Monday, February 1, 2021 - link

    The 900P and 905P are in line for their turn on the testbed. Those two and the WD Black AN1500 will be the first drives I use the new Quarch PAM for, since that currently only has a fixture for the PCIe add-in card form factor. (I did also run some tests on one of the Optane drives while experimenting to develop this suite, but they haven't run through the final version of the full suite.)

    I do think it's a lot more likely that I'll get a P5800X sample than a Micron X100. But it wouldn't surprise me if Intel holds off on press samples until they're ready for it to be reviewed on an Intel platform.
  • thestryker - Monday, February 1, 2021 - link

    Great to hear, I figured that they just hadn't had their turn yet.

    I've assumed that Intel + PCIe 4+ drives were all going to be waiting on Intel's own platforms which is why I still hold out hope for a future consumer Optane.
  • Greg100 - Monday, February 1, 2021 - link

    I too would love to see tests of the Intel P5800X and Micron X100 on this new test platform.
    In fact, these drives interest me most for OS and software installation.

    The 900P and 905P are rather a historical curiosity nowadays due to their slow sequential transfer rates.
  • p1esk - Tuesday, February 2, 2021 - link

    AFAIK 905p still destroys any modern drive in random reads/writes. Also, it does not degrade like other drives when their cache is full.
  • Oxford Guy - Tuesday, February 2, 2021 - link

    Shame about the ridiculous pricing, though.

    Intel also refused to validate its 'storage' drives like the 800p, 900p, and 905p for use as a disk cache. This site posted some interesting results for the 118GB 800p. It made it seem like that drive might actually be relevant for people on a tight budget who need a lot of storage and don't mind the noise of a mechanical hard disk. But, Intel's site has very clear statements saying that only the 'memory' Optane products (which seem to be very pointless for consumers) are supported for cache use.
  • ksec - Monday, February 1, 2021 - link

    Oh, thank you. Came here just to comment on the P5800X. I am wondering about the power usage; the idle power of SSDs is so much higher than what I expected.
  • pexxie - Monday, February 1, 2021 - link

    Wow, very thorough. Phenomenal work.
    Would there be any interest in testing synchronized writing? (i.e. bypassing the device's volatile write-back cache). In Linux this can be done with the oflag values sync or dsync.
    E.g. 4K random writes: "dd if=/dev/urandom of=/testfile bs=4k count=10k oflag=sync"
    Without the oflag, or using oflag=direct, you'd be using the write-back cache, which looks great but comes with reliability risk. See a write-up here: https://www.postgresql.org/docs/current/wal-reliab...
  • linuxgeex - Tuesday, February 2, 2021 - link

    That's not 4K random writes you're testing. It's sequential writing of pseudorandom data generated by the kernel.
  • pexxie - Wednesday, February 3, 2021 - link

    I think I get what you're saying/writing. It's random data, but not being written to random locations on the disk. Any clue how to do the latter? Gosh, that's a plot-thickener of note for me. :-P How random is "random"? :-O
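
    One way to get both - random data written to random locations, with every write flushed through the drive's volatile cache - is fio with sync=1 (a suggestion with illustrative parameters, not something from the thread):

        # --sync=1 opens the target with O_SYNC so each write must be
        # durable before the next is issued; --rw=randwrite picks a
        # random 4kB offset for every write.
        # WARNING: this destroys data on the target device.
        fio --name=syncwrite --filename=/dev/nvme0n1 --direct=1 --sync=1 \
            --rw=randwrite --bs=4k --iodepth=1 --time_based --runtime=60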
