AMD PC Restarting Randomly

Marcelo Fantinati Elias · May 27, 2020

Hey everyone, how's it going? I need some help...

My PC is restarting out of nowhere, sometimes during a game, sometimes just while using Chrome, and occasionally even in the BIOS with everything at default.

This started about 3 months ago, after I bought:
Ryzen 3600, new
Teamgroup Delta RGB RAM, 2x8 3000MHz, new
And an MSI B350 Tomahawk Arctic, used, with 3 months of warranty

Rest of the components:
Cooler Master MWE White 650W 80 Plus power supply
Sangue Frio 120mm cooler
Zotac RTX 2060 Twin Fan
1 Western Digital 120GB SSD
1 Samsung HDD and 1 Toshiba HDD, both 1TB
Gamemax Elisyum case

Things I have already tried:
*BIOS and all drivers were updated the day I built the PC
*Cleared CMOS several times (I did not reset the BIOS because only)
*Tried overclocks and underclocks (the latter usually makes it more stable when it is restarting nonstop)
*Temperatures are excellent, even with an overclock to 4.3GHz at 1.35V
*It is NOT the power supply: 3 different units were tested, and I even bought a new one just to be sure.

PS: I know the problem is related to RAM because the resets become more and more frequent until I remove one of the sticks and move the other to a different slot. I already tested with a friend's Ballistic memory and the problem persisted

Thanks a lot for your attention!

Brc · May 27, 2020

1.35V is a lot for a static overclock on 7nm processors. There is no consensus on a safe voltage yet, but nobody likes going above 1.275V, because the processor degrades over time.

Tell me something: did this resetting start recently? Your processor may have degraded because of the high voltage...

Reviewing Voltage Recommendations for Zen 2

There have been one or two reports of degradation on Zen 2 at surprisingly low voltages. We need to review this and figure out if the advice needs to change, and if so what to. But first, let’s cover how ‘safe voltages’ work, why 1.325V was recommended, and why these reports are very surprising.

#Part 1: What makes voltage “safe” or “unsafe”

There are two (and a half) main types of failure related to voltage: electromigration, and oxide breakdown (or dielectric breakdown). Electromigration is most commonly considered, but with how aggressive Zen 2’s Precision Boost is it’s important to understand oxide breakdown in order to understand what can make voltage safe or unsafe on Zen 2. There are of course many other failure mechanisms for silicon chips, some of which do relate to voltage, but these two should be enough to understand for a sensible discussion.

##1: Electromigration

Electromigration is a process where an electric current leads to physical damage to the conductor. This is described as being due to moving electrons physically hitting atoms – I don’t know if this is a theory or proven, but it’s a good way to understand it. You can see the physical effect of electromigration under an electron microscope here: https://upload.wikimedia.org/wikipedia/commons/5/5f/In_situ_electromigration.gif

Electromigration is described by Black’s Equation, which doesn’t include voltage as a parameter but does include current and temperature. As far as I know there isn’t a way to figure out the exact relationships between voltage and lifetime or temperature and lifetime without either experimentation or more information than is publicly available, but in general;

* The higher the temperature, the faster electromigration does damage

* The higher the current, the faster electromigration does damage

Note that this is about speed of damage, not ‘damage or no damage’. Unless you’re running at absolute zero, electromigration will always be taking place. More on this later.
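
To make the two trends concrete, here is a minimal sketch of Black's equation, MTTF = A · J^(-n) · exp(Ea / (k·T)), evaluated as a ratio between two operating points so that the unknown constant A cancels. The exponent `n = 2` and the activation energy `ea_ev = 0.9` are illustrative assumptions, not values for Matisse or any real process, so only the direction of the result should be read into it.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_lifetime(j_ratio, temp_c_base, temp_c_new, n=2.0, ea_ev=0.9):
    """Return MTTF_new / MTTF_base according to Black's equation.

    j_ratio is the new current density divided by the baseline one.
    n and ea_ev are assumed, illustrative parameters (not measured data).
    """
    t_base = temp_c_base + 273.15  # convert to kelvin
    t_new = temp_c_new + 273.15
    current_term = j_ratio ** (-n)
    thermal_term = math.exp(ea_ev / BOLTZMANN_EV * (1.0 / t_new - 1.0 / t_base))
    return current_term * thermal_term


if __name__ == "__main__":
    # Same current, 10 C hotter: the ratio drops below 1.
    print(relative_lifetime(1.0, 70.0, 80.0))
    # 10% more current at the same temperature: the ratio also drops below 1.
    print(relative_lifetime(1.1, 70.0, 70.0))
```

A ratio below 1 means a shorter expected lifetime than the baseline, which matches the "speed of damage" framing: neither case flips from "no damage" to "damage".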

Electromigration could in principle occur anywhere on a chip – ranging from a weak spot in a specific part of a chip, to a big power plane, or even parts of the package not on the silicon itself. Voltage affects electromigration directly because current is typically directly proportional to voltage. However, as long as the specific parts active aren’t too delicate, it’s possible that a lighter load that draws less current could be run at a much much higher voltage before electromigration would become a concern. Taking a simplified view, if the cores are robust but the internal power plane is delicate then you could run a very high voltage through one or two cores as this still wouldn’t be pulling too much current through the power plane, but an all-core load would need reduced voltage to keep the current under control.

##1.5: Dielectric Breakdown

Dielectric breakdown, also known as oxide breakdown when applied to semiconductors, happens when the voltage across an insulator is enough to forcibly turn it into a conductor. For example, we can see air being subjected to a form of dielectric breakdown when a spark happens. Immediate dielectric breakdown shouldn’t occur in normal overclocking, but when a high enough voltage instantly kills a CPU, this is why.

##2: Time-dependent Gate Oxide Breakdown

Time-dependent gate oxide breakdown happens as a result of a transistor being subjected to high voltage that isn’t enough for immediate dielectric breakdown, regardless of current. The mechanism doesn’t seem to be well understood, but the practical result is that random damage adds up over time. The main consideration for time-dependent gate oxide breakdown seems to be voltage, though it seems probable that temperature would have an effect as well.

There are physical mechanisms by which voltage leads to oxide damage, such as hot-carrier injection; I would characterise time-dependent gate oxide breakdown as the observed effect of all the various mechanisms.

Because voltage affects time-dependent gate oxide breakdown directly, it doesn’t seem to me that how heavy the load is would affect it directly. However, any parts that are completely power gated would not be subject to time-dependent gate oxide breakdown while power gated. A single-core load would not intrinsically be any less at risk of time-dependent gate oxide breakdown. The risk could be mitigated by hopping the load between cores, but only to a fairly limited extent.

##“Safe” vs “Unsafe”

Both electromigration and time-dependent gate oxide breakdown are processes that are always happening to a certain extent, as long as the chip is powered on. Reducing voltage reduces their rate, and increasing voltage increases their rate. Electromigration is also slowed down by reducing temperature, and both are reduced by a lighter workload, since less current is drawn and more of the chip can be power gated.

In a way this means there is no such thing as a “safe” voltage. Simply running a chip damages it, even at stock, regardless of voltage. So what we mean when we say “safe voltage” is effectively something like;

**“The voltage at which it’s expected that the chip will not be damaged or destroyed so fast that we regret overclocking it”**

“Safe voltage”, unless you’re *undervolting*, does not mean and has never meant that the lifespan of the chip is unaffected nor that degradation will never take place. However the expectation is that degradation should take many years to appear. “Safe voltage” is also, and I cannot stress this enough, not a magic value where there’s no damage below it and loads above it. It’s just an arbitrary line in the sand.

This terminology and the concept of “safe” is in itself something that may need to be revisited at some point. In the past the tradeoff when overclocking was mainly increased power consumption above an arbitrary TDP, or eating into voltage margins. We now live in an era where turbo at stock takes chips well beyond their arbitrary TDPs anyway, and platform design is getting better at narrowing the voltage margins with Intel’s tightly regulated IVR/FIVR on some platforms and AMD’s clock stretching to deal with transients on a cycle-by-cycle basis.

##LLC digression or “playing lawyer against the laws of physics”

In relation to “safe voltage” it’s also worth pointing out that there’s always an assumption that a sensibly low LLC (load-line calibration) level with a decent bit of Vdroop will be used. Some people will set the “safe” voltage then start pushing up LLC to get more and more load voltage. The chip is of course affected by the voltage and current actually going through it, not the BIOS settings, and will take damage just as it would at a higher set voltage with more Vdroop. When people think there’s somehow a loophole that lets them raise voltage without it being “unsafe voltage”, they can tell the chip that all they want, but it won’t listen to them and un-degrade itself. This isn’t directly relevant but is worth noting when talking about “safe voltage”. Similarly if you go the other way and have very high droop but tune to hit precisely the “safe voltage” according to monitoring software under load, with a set voltage well in excess, you’re running higher voltage at both idle and load than someone saying “X volts is safe” intends. Anyway…
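
The two cases in this digression can be sketched with a simple linear load-line model, where load voltage is the set voltage minus current times the load-line resistance. The model, the function names, and every number below (80 A, 1.0 mOhm, 0.2 mOhm) are assumptions for illustration, not measurements from any board.

```python
def load_voltage(v_set, current_a, loadline_mohm):
    """Voltage at the CPU under load for a linear load-line (Vdroop) model."""
    return v_set - current_a * loadline_mohm / 1000.0


def set_voltage_for_target(v_target_load, current_a, loadline_mohm):
    """Set voltage needed for the load voltage to land on v_target_load."""
    return v_target_load + current_a * loadline_mohm / 1000.0


if __name__ == "__main__":
    current = 80.0  # hypothetical all-core current in amps

    # Case 1: same "safe" set voltage, but a flatter load-line (higher LLC)
    # leaves the chip with more voltage under load.
    print(load_voltage(1.325, current, 1.0))  # lots of droop
    print(load_voltage(1.325, current, 0.2))  # aggressive LLC

    # Case 2: heavy droop, but the set voltage is raised until monitoring
    # reads 1.325 V under load; light-load voltage is then well above it.
    print(set_voltage_for_target(1.325, current, 1.0))
```

In both cases the chip sees more voltage than the person quoting a "safe" number had in mind, even though one BIOS field or one monitoring reading still shows the quoted value.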

#Part 2: Where voltage recommendations come from

Traditionally there are three ways for the community to find out what voltage is “safe” for a particular chip;

* The manufacturer just says what they think is “safe” directly. Examples include OCZ, who included a max voltage in their memory warranties, and AMD in the FX and Phenom II days who did not warranty overclocking but did include some conservative voltage advice in their Dragon platform performance tuning guide and FX performance tuning guide.

* Employees or associates of the manufacturer with access to privileged information develop a personal opinion as to what is “safe” informed by that information, then share that opinion as nothing more than a belief that they personally sincerely hold. Intel are currently an example of this, with employees in their OC division willing and allowed to share personal opinions on what they run their personal systems at in a way that Intel can’t be held to.

* Guesswork based on what similar chips can take, and feedback from people who have been able to degrade or kill chips. Related to this is a “rule of thumb” that OC power should be less than 2x stock power, but that isn’t necessarily reliable as some chips (especially at the top end of core counts) are constrained on stock power by sanity and not safety, whereas others might be pushed closer to their limit at stock than 2x implies.
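
As a rough illustration of the 2x rule of thumb, the sketch below estimates how power scales using the common first-order approximation that dynamic power is proportional to V² × f. That approximation, the function names, and the sample numbers are assumptions for illustration; it ignores leakage, and the rule itself is not reliable for every chip.

```python
def dynamic_power_ratio(v_stock, f_stock_ghz, v_oc, f_oc_ghz):
    """Estimated OC power divided by stock power, assuming P ~ V^2 * f."""
    return (v_oc / v_stock) ** 2 * (f_oc_ghz / f_stock_ghz)


def within_rule_of_thumb(stock_power_w, oc_power_w, limit=2.0):
    """True if overclocked power stays under `limit` times stock power."""
    return oc_power_w < limit * stock_power_w


if __name__ == "__main__":
    # Hypothetical chip: 1.20 V at 4.0 GHz stock, 1.35 V at 4.3 GHz overclocked.
    ratio = dynamic_power_ratio(1.20, 4.0, 1.35, 4.3)
    stock_w = 90.0  # hypothetical measured stock package power
    print(ratio, within_rule_of_thumb(stock_w, stock_w * ratio))
```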

Recently two more methods have been employed;

* Looking for appropriate information in datasheets. This does not work – unless you think every single non-IVR Intel CPU back to at least 32nm Sandy Bridge has been able to take exactly the voltage that just *happens* to be the maximum possible with their VID table. There are also issues with mistaking stress ratings for 24/7 operation, such as with DDR4 where the 1.5V stress rating has been wrongly conflated with “1.5V, plus whatever noise comes out of the VRM above that, all day every day”.

* Reverse engineering firmware. This is a fairly new thing with modern AMD CPUs, which are programmed with a complex boost algorithm. The idea is that given the boost behaviour shouldn’t apply an unsafe voltage, see what you can get it to apply with a given load and that should be safe for that load.

Variance between chips happens – different leakage means different current draw and therefore different electromigration at the same voltage, oxide layers will vary in thickness, and so on – and is rarely explicitly addressed, but when a manufacturer has had a hand in recommendations you can bet that it’s accounted for and the number given is close to being a lower bound. When numbers come from the community, ultimately there’s information about degradation that has been fed back, so that’s also reacting to the more delicate chips.

#Part 3: Where 1.325V for Matisse comes from

I’m going to use more active voice for this part, as I’m talking about my personal thinking and choices.

If you’re reading this, you’re probably aware of The Stilt’s excellent “strictly technical” articles. The immediate reason for the 1.325V recommendation will seem obvious – The Stilt quoted 1.325V as the *average* value that a Matisse chip’s own firmware will allow for intensive loads. A subreddit user (I can’t remember the spelling of their name but they have a strong history of good contributions) also linked to this with a title along the lines of “Maximum all-core voltage for ryzen 3rd gen is 1.325V”. I also pinned this post for a while. The obvious reading is that this was seen out of context, taken as gospel, and the nuance discarded.

In fact, some information had already made its way to me from very credible sources. This was very limited and what I can responsibly share is even more limited, because for some reason this is treated as super duper secret, and while I disagree with that it’s not my choice. I’ll also say now that I won’t be answering any questions at all about this, or engaging in conversation about it, nor will I be blinking a set amount of times or anything else. I understand that’s frustrating but I’m lucky to be able to even mention this.

The information I have indicated that the 1.325V value that was getting popular was if anything conservative. To be clear, this related to values given as friendly advice by people in the know – certainly assuming long-term use, but also assuming high-end cooling. And definitely not something anyone should be held to (I suspect not being held to anything is why this is so locked down). However this gave me an indication of what not to be alarmed by, that placed 1.325V squarely in the “not alarmed” box. I’m talking about this because it’s the truth behind what happened, and it does not take precedence over actual experiences.

It’s also worth noting that many credible media outlets had shown overclocking results with much higher voltages, such as [TechPowerUp, who tested their 3700X with 1.4V](https://www.techpowerup.com/review/amd-ryzen-7-3700x/21.html).

The choice was therefore, rather than declaring a *higher* voltage than 1.325V or looking further into it, to accept and endorse the value that was getting popular on the basis that, as well as not going against the tide, it should be on the safe side. This was also supported by the public information – after all, The Stilt didn’t talk about messing with “reliability scalars” for Matisse. 1.325V average for Matisse appeared, in terms of safety, equivalent to 1.33V for Pinnacle Ridge – and short-term degradation for Pinnacle Ridge was only reported, to my knowledge, above 1.38V.

#Part 4: Why 1.325V might not be a good recommendation for Matisse

There have been multiple reports of chips degrading in surprising ways. For example;

* A 3900X “died” (no longer stable at stock) from 1.36V set voltage with droop to a software measurement of 1.32V under load. This was “at 80C most of the time”. (report from the AHOC supporters Discord)

* A 3700X run at 1.29V lost 0.05GHz: https://www.reddit.com/r/overclocking/comments/eojepz/did_my_motherboard_degrade_my_cpu/

* A 3700X run at 1.325V lost stability at previously stable settings: https://www.reddit.com/r/overclocking/comments/ex7kk9/why_cpu_degrade/

* A pair of 3600s exhibited degradation below 1.325V: https://www.reddit.com/r/overclocking/comments/eu3fbl/r5_3600_degradation_testing/

Now, these chips will all limit themselves to different voltages at stock when governed by SenseMI – different “FIT voltages” as they’re called – and it certainly seems likely for them to be less than 1.325V. However, as explored above, it’s not in itself unexpected that 1.325V would be above what’s practically a stock voltage. Again, even setting aside any whispers from any sources, even a chip with a very low 1.275V “FIT voltage” would be expected to last well at 1.325V based on how Pinnacle Ridge behaved. This means one way or another there’s something unexpected happening.

What it doesn't seem to be is user error. There are too many reports for that at this point, and while some may be a little above 1.325V the effects still wouldn't be expected this soon.

#Part 4.5: Things about Matisse that might complicate the situation

Fair warning, there’s going to be speculation in this section.

I want to try and enumerate the possibilities to figure out what’s going on, and specifically avoid jumping to any conclusions. Frankly, I don’t think there is a single obvious conclusion. But first I want to talk about what we know.

What we know about how Matisse differs from past CPUs:

* Matisse is on a different process. This means resistance to electromigration and resistance to oxide breakdown will be different. Obviously this affects what voltage is safe, and will have different effects in low-current and high-current conditions. However, electromigration is also affected by temperature and a change to the process means the nature of the temperature scaling will be different.

* Matisse’s package uses narrow copper pillars with smaller solder bumps rather than traditional larger solder bumps. This is unlikely to be relevant to failures as AMD have historically been very good when it comes to packaging, but the fundamental shift in how the package works is nonetheless better to mention than disregard as package can sometimes be implicated in premature failures. https://www.anandtech.com/show/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/5

* The speeds Matisse boosts to depend heavily on temperature. Gamers Nexus has covered this at https://www.gamersnexus.net/guides/3491-explaining-precision-boost-overdrive-benchmarks-auto-oc and even when a Matisse chip is within its temperature spec, it will boost higher at lower temperatures. This would be expected to also mean higher voltages at lower temperatures.

* Matisse selects clocks very quickly, apparently every 1ms, which will be faster than monitoring utilities poll. HWiNFO64 for example polls every 2 seconds by default, and has a minimum polling interval of 50ms.

* It’s possible that Matisse might have onboard dLDOs that are active when running PBO. Older Ryzen generations have these but they’re locked in bypass mode on desktop SKUs, since they wouldn’t be able to take the current. If they were activated on Matisse it would potentially allow for a higher all-core voltage while protecting more sensitive cores.
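
The polling point in the list above can be illustrated with a small simulation. The sketch below builds a hypothetical per-millisecond voltage trace with brief 1 ms spikes, then compares what a monitor would see at different polling intervals. The trace shape, the voltage levels, and the spike spacing are invented for illustration and do not describe real Matisse behaviour.

```python
import random


def simulate_trace(duration_ms, base_v=1.30, spike_v=1.45,
                   spike_every_ms=137, seed=0):
    """Hypothetical per-millisecond voltage trace with 1 ms spikes."""
    rng = random.Random(seed)
    trace = []
    for t in range(duration_ms):
        v = base_v + rng.uniform(-0.01, 0.01)  # small noise around the base
        if t % spike_every_ms == 5:            # brief excursion, 1 ms long
            v = spike_v
        trace.append(v)
    return trace


def polled_max(trace, interval_ms):
    """Highest voltage seen by a monitor that samples every interval_ms."""
    return max(trace[::interval_ms])


def spikes_seen(trace, interval_ms, threshold=1.40):
    """Number of samples above threshold at the given polling interval."""
    return sum(1 for v in trace[::interval_ms] if v >= threshold)


if __name__ == "__main__":
    trace = simulate_trace(10_000)  # 10 seconds at 1 ms resolution
    for interval in (1, 50, 2000):
        print(interval, polled_max(trace, interval), spikes_seen(trace, interval))
```

A monitor polling every 2 seconds takes only five samples of this 10 second trace, so whether it ever reports a spike is largely down to chance, and a low reported maximum says little about the true peak.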

#Part 5: Where do we go from here?

What we won’t do is say “don’t overclock”. There are always ways to tune performance – trading away power, eating into voltage margins by using a better than baseline motherboard, or improving cooling. But we need to look at the options.

##Option 1: Reduce the suggestion for a “safe” fixed voltage

We could drop the blanket recommendation, say to 1.3V or 1.25V. If we go low enough it has to be safe – but it would probably make a fixed OC worse even all-core than stock on many chips. This does retain the benefit of giving users a single straightforward value.

##Option 2: Tell people to determine their own safe voltage

This is the option that seems to be gaining mass popularity, and I guess would be the ‘path of least resistance’ now in the way 1.325V was previously. The idea is to still treat every chip as having its own concrete “safe voltage”, just per individual chip rather than per family as was done in the past and with option 1.

The way this works is having assumed there’s a specific “safe voltage” per individual sample, we then assume that an end user can experimentally determine this for themselves with reliable results. The recommended method I’ve seen is to enable PBO (thus lifting power and current limits), run the heaviest possible all-core workload, and then use software voltage monitoring to see what voltage the chip is getting.

There are a couple of problems with this. Firstly, user error exists. Someone might pick the wrong prime95 setting, not assign enough threads, or have a background load that reduces the overall stress on the CPU. Secondly, if temperatures have an effect then someone might be testing at lower temperatures than they see under real load and still end up with an excessive voltage number. I also worry about this option because I already see some users arguing an obviously excessive voltage like 1.4V is fine because their chip runs it with PBO, and I can foresee some of the “Prime95 is overkill” crowd wilfully deciding to base their fixed voltage on a load that leads to a higher voltage.

Reasons temperatures could change include change of room temperature, increases in dust affecting the cooling, and extra GPU load dumping heat into the case. It’s also possible an impatient user would just not let the system reach equilibrium.

It’s my belief that this method is valid in many cases, but for the reasons above is not for everyone.
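
One way to guard against the "impatient user" problem with Option 2 is to ignore voltage readings until the logged temperature has stopped climbing. The sketch below assumes a hypothetical log of `(temp_c, vcore)` pairs exported from monitoring software; the log format, the function name, and the drift threshold are all assumptions, and it does nothing about the other sources of user error listed above.

```python
from statistics import median


def steady_state_voltage(samples, window=3, max_temp_drift_c=1.0):
    """Median vcore after temperature has settled, or None if it never does.

    samples: list of (temp_c, vcore) tuples in time order (hypothetical format).
    Temperature counts as settled at the first run of `window` consecutive
    samples whose spread is at most max_temp_drift_c.
    """
    for start in range(len(samples) - window + 1):
        temps = [temp for temp, _ in samples[start:start + window]]
        if max(temps) - min(temps) <= max_temp_drift_c:
            return median([vcore for _, vcore in samples[start:]])
    return None


if __name__ == "__main__":
    # Hypothetical run: voltage sags as the chip heats up to equilibrium.
    log = [(50, 1.40), (60, 1.38), (68, 1.36), (74, 1.34), (78, 1.33),
           (80, 1.32), (80, 1.32), (80, 1.32), (80, 1.32), (80, 1.32)]
    print(steady_state_voltage(log))
```

In this made-up log, reading the voltage during the first few samples would give a noticeably higher number than the settled value, which is the failure mode described above.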

##Option 3: Recommend PBO tweaking over fixed overclocks

Matisse chips seem to have a lot of room for tweaking without setting a fixed clock and voltage, and this also lets the chips boost higher for light loads compared to a fixed voltage as well as helping with idle behaviour. This also doesn’t mean we’d be banning discussion of fixed overclocks, just that it wouldn’t be the immediate recommendation for people with Matisse-based daily systems.

The safe voltage FAQ entry for Matisse would say something like;

>For Matisse it is recommended NOT to use a manual overclock in most cases. The technologies AMD collectively refers to as SenseMI, including Precision Boost 2, provide a very aggressive boost in lighter workloads while maintaining safety in heavier workloads in a way that a fixed manual voltage cannot compete with.

>You can of course still overclock Matisse chips but this is best achieved with Precision Boost Overdrive (PBO), which expands Precision Boost 2 power limits to trade power for performance (manual PBO limits can also trade performance to restrict power below stock).

It's my belief that this method would be the best to recommend. However I'd appreciate any and all constructive feedback people have.

Marcelo Fantinati Elias · May 27, 2020

@Brc could be, but normally what gets me out of the reset loop (when the time between resets gets very short) is running 1 memory stick in a non-preferred slot. A sign of degradation would be needing more voltage for the same clock, which has not happened (yet)

I believe the problem is between the motherboard and the CPU, because I tested with another memory kit and the problem came back, while my own memory was tested in another system and showed no defects...