Dedicated Server Hosting: CPU overheating
Oct. 31st, 2021 12:43 am
At PostJobFree we are considering a potential migration to OvhCloud dedicated server hosting.
1) ~6 weeks ago we ordered an Advance-2-LE server.
We started running the ElasticSearch percolator on that server.
2) 5 weeks later, the server hung.
OvhCloud support rebooted the server, and it continued working.
3) We reviewed /var/log/messages on the server and found "Core temperature above threshold, cpu clock throttled" messages.
The CPU throttling lasted less than a second, but throttling events repeated several times per day:
Oct 26 03:19:48 pesovh kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU6: Core temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU9: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU7: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU6: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU8: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU4: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU10: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU11: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU6: Core temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU9: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU10: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU1: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU5: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU6: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU7: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU4: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU11: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU8: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU2: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU0: Core temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU3: Package temperature/speed normal
4) We created a ticket for OvhCloud support.
5) OvhCloud support responded:
Date 2021-10-26 12:02:04 EDT (UTC -04:00), Server check:
We have detected an issue with the cooling system, which has now been corrected.
The temperatures reported are now normal.
The server is booted from disk and is on the login screen. Ping OK and services are up.
6) But fixing the cooling system created a disk problem [that OvhCloud fixed 1 day later]:
Date 2021-10-27 10:30:36 EDT (UTC -04:00), Extensive diagnosis:
Server would not boot successfully into customer OS. It was found a disk would not detect properly.
Disk issue has been rectified. Server will now boot to customer OS. Booting to the customer OS does not provide ping.
Server booted into customer rescue. Ping 'OK'. Hardware verified.
7) The cooling system fix made the server temperature more stable.
Under heavy stress, CPU temperature could temporarily go up to 86C, but it did not reach 100C, the point where throttling starts.
/var/log/messages no longer reported "Core temperature above threshold, cpu clock throttled".
8) The server kept reporting other warnings in /var/log/messages, so I asked OvhCloud support:
If the network adapter is fine - why do I see many "eno2" messages in /var/log/messages ?
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2118] device (eno2): Activation: failed for connection 'System eno2'
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2119] device (eno2): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2215] dhcp4 (eno2): canceled DHCP transaction
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2215] dhcp4 (eno2): state changed timeout -> done
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2219] policy: auto-activating connection 'System eno2' (b186f945-cc80-911d-668c-b51be8596980)
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2223] device (eno2): Activation: starting connection 'System eno2' (b186f945-cc80-911d-668c-b51be8596980)
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2224] device (eno2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2226] device (eno2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2234] device (eno2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]:[1635505351.2235] dhcp4 (eno2): activation: beginning transaction (timeout in 45 seconds)
9) After 1 more day (2021-10-29 20:52 EDT) OvhCloud reported:
Unfortunately, the NIC interface was flapping so the motherboard had to be replaced due to the NIC itself being onboard.
The cabling was swapped when the motherboard was replaced as a secondary precaution.
For the eno2 warnings in your logs, it is possibly due to the interface flapping and dropping when trying to reconnect. If you currently do not plan on using vRack and would like to prevent seeing your interface from attempting to connect, we recommend disabling the adapter entirely via at the OS level and through the BIOS.
Noman M.
10) This motherboard replacement damaged the cooling system, so the server started to overheat under stress again.
Temperature spikes under stress reached 97C+ and caused occasional CPU throttling.
11) We asked to cancel that "Advance-2-LE" server and requested an Advance-4 server instead.
We continue to evaluate OvhCloud...
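The throttling messages from step 3 are easy to miss when they are scattered through /var/log/messages. Here is a minimal sketch of how they could be summarized per timestamp. The function name `summarize_throttling` and the regular expression are my own illustration; the pattern is matched against the message format quoted above and may need adjusting for other kernels or syslog formats.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   Oct 26 03:19:48 pesovh kernel: CPU0: Core temperature above threshold,
#   cpu clock throttled (total events = 3)
THROTTLE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"CPU(?P<cpu>\d+):\s+(?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarize_throttling(lines):
    """Group throttling messages by syslog timestamp.

    Returns {timestamp: {"cpus": set of CPU numbers,
                         "max_total": largest "total events" value seen}}.
    Lines that are not throttling messages (for example
    "temperature/speed normal") are ignored.
    """
    summary = defaultdict(lambda: {"cpus": set(), "max_total": 0})
    for line in lines:
        m = THROTTLE_RE.match(line)
        if not m:
            continue
        entry = summary[m["ts"]]
        entry["cpus"].add(int(m["cpu"]))
        entry["max_total"] = max(entry["max_total"], int(m["total"]))
    return dict(summary)
```

To run it against the real log, pass an open file, for example `summarize_throttling(open("/var/log/messages", errors="replace"))`, and print one line per timestamp. Grouping by timestamp makes it obvious when all cores report throttling in the same second, as in the Oct 26 excerpt.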
Update: AMD EPYC 7313 - how to measure temperature?
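On Linux, CPU temperatures are usually exposed through the hwmon sysfs interface, where each `tempN_input` file holds a reading in millidegrees Celsius. Below is a rough sketch of reading those files directly. It assumes the kernel has a matching sensor driver loaded (typically `coretemp` on Intel or `k10temp` on AMD); the function names and the 90C limit are hypothetical choices for illustration, not something OvhCloud or the kernel prescribes.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return a list of (chip, label, degrees_celsius) tuples.

    Reads every temp*_input file under base/hwmon*; values are stored
    by the kernel in millidegrees Celsius.
    """
    readings = []
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            # A matching temp*_label file is optional; fall back to the file name.
            label_file = temp_file.with_name(temp_file.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else temp_file.name
            try:
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable right now; skip it
            readings.append((chip, label, millideg / 1000.0))
    return readings


def over_threshold(readings, limit_c=90.0):
    """Keep only readings at or above limit_c (an illustrative limit)."""
    return [r for r in readings if r[2] >= limit_c]
```

Calling `over_threshold(read_hwmon_temps())` from a cron job or a loop during a stress test would show whether spikes approach the throttling point before the kernel starts logging throttle events.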
no subject
Date: 2021-10-31 06:29 am (UTC)
Fire
Date: 2021-10-31 11:39 am (UTC)
=======
https://www.datacenterdynamics.com/en/news/fire-destroys-ovhclouds-sbg2-data-center-strasbourg/
March 10, 2021
The data center, at the Rue du Bassin de l'Industrie at the Port du Rhin, caught fire at 12:40 am, according to a report by Antoine Bonin of local news site DNA, who reports: "When help arrived, the structure was completely set on fire, with flames bursting out several tens of meters in height."
The fire spread to two other buildings, damaging one other data center on the site. "A part of SBG1 is destroyed," said a tweet from OVHcloud founder and chairman Octave Klaba, who recommended that customers activate disaster plans, as "the whole site has been isolated, which impacts all services in SBG1-4."
=======
OVHcloud founder and chairman Octave Klaba - video:
https://www.ovh.com/fr/images/sbg/index-en.html
no subject
Date: 2021-10-31 08:41 pm (UTC)
no subject
Date: 2021-11-01 04:14 am (UTC)
no subject
Date: 2021-11-01 06:50 am (UTC)
Would be interested to know if the issues are normal, or if OvhCloud is not that great, or something's gone wrong for some reason.
Hope you manage to find one that works.
no subject
Date: 2021-11-01 07:11 am (UTC)
OvhCloud promised to prepare the Advance-4 server within 72 hours (3 days).
We ordered the Advance-4 server more than 4 days ago, so OvhCloud has already missed that promise.