dennisgorelik: 2020-06-13 in my home office (Default)
[personal profile] dennisgorelik
At PostJobFree we are considering a potential migration to OvhCloud dedicated server hosting.

1) ~6 weeks ago we ordered an Advance-2-LE server.
We started running ElasticSearch percolator on that server.

2) 5 weeks later, the server hung.
OvhCloud support rebooted the server and the server continued working.

3) We reviewed /var/log/messages on the server and found "Core temperature above threshold, cpu clock throttled" messages.
Each CPU throttling event lasted less than a second, but the events repeated several times per day:
Oct 26 03:19:48 pesovh kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU6: Core temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU9: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU7: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU6: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU8: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU4: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU10: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU11: Package temperature above threshold, cpu clock throttled (total events = 3)
Oct 26 03:19:48 pesovh kernel: CPU6: Core temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU9: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU10: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU1: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU5: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU6: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU7: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU4: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU11: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU8: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU2: Package temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU0: Core temperature/speed normal
Oct 26 03:19:48 pesovh kernel: CPU3: Package temperature/speed normal
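
To see how often throttling happens, log lines like the ones above can be counted per CPU. This is a minimal sketch, not the method we actually used: the function name `count_throttle_lines` is made up for illustration, and the regular expression assumes the kernel message format stays exactly as shown in this excerpt.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   ... kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 3)
# The pattern is an assumption based only on the log excerpt in this post.
THROTTLE_RE = re.compile(
    r"kernel: CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def count_throttle_lines(lines):
    """Count throttling log lines per (scope, cpu) pair."""
    counts = Counter()
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            counts[(match.group("scope"), int(match.group("cpu")))] += 1
    return counts


# Example usage against the real log (needs read permission on the file):
#   with open("/var/log/messages", errors="replace") as log:
#       for (scope, cpu), n in sorted(count_throttle_lines(log).items()):
#           print(f"{scope} CPU{cpu}: {n} throttle line(s)")
```

"temperature/speed normal" lines are ignored on purpose, so the counts reflect only the moments when throttling started.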

4) We created a ticket for OvhCloud support.

5) OvhCloud support responded:
Date 2021-10-26 12:02:04 EDT (UTC -04:00), Server check:
We have detected an issue with the cooling system, which has now been corrected.

The temperatures reported are now normal.

The server is booted from disk and is on the login screen. Ping OK and services are up.

6) But fixing the cooling system created a disk problem [that OvhCloud fixed 1 day later]:
Date 2021-10-27 10:30:36 EDT (UTC -04:00), Extensive diagnosis:
Server would not boot successfully into customer OS. It was found a disk would not detect properly.

Disk issue has been rectified. Server will now boot to customer OS. Booting to the customer OS does not provide ping.

Server booted into customer rescue. Ping 'OK'. Hardware verified.

7) The cooling system fix made the server temperature more stable.
Under heavy stress the CPU temperature could temporarily go up to 86C, but it did not reach 100C, the point at which throttling starts.
/var/log/messages no longer reported "Core temperature above threshold, cpu clock throttled".
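
One way to watch temperatures like these during a stress test is to read the Linux hwmon sysfs files. This is a rough sketch under assumptions: it presumes the server exposes sensors as `/sys/class/hwmon/hwmon*/temp*_input` (values in millidegrees Celsius), which depends on the kernel drivers loaded, and the helper names `read_hwmon_temps` and `hottest` are made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sensor_file_path: degrees Celsius} for every readable temp*_input file."""
    temps = {}
    for sensor in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            # hwmon reports temperatures in millidegrees Celsius.
            temps[str(sensor)] = int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable or returned non-numeric data
    return temps


def hottest(temps):
    """Return the highest reading, or None when no sensor was readable."""
    return max(temps.values(), default=None)


# Example usage:
#   peak = hottest(read_hwmon_temps())
#   print("no sensors found" if peak is None else f"hottest sensor: {peak:.1f}C")
```

Polling `hottest(read_hwmon_temps())` every few seconds while the stress load runs would show whether the peak stays near 86C or climbs toward the 100C throttling point.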

8) The server kept reporting other warnings in /var/log/messages, so I asked OvhCloud support:
If the network adapter is fine - why do I see many "eno2" messages in /var/log/messages ?
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2118] device (eno2): Activation: failed for connection 'System eno2'
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2119] device (eno2): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2215] dhcp4 (eno2): canceled DHCP transaction
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2215] dhcp4 (eno2): state changed timeout -> done
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2219] policy: auto-activating connection 'System eno2' (b186f945-cc80-911d-668c-b51be8596980)
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2223] device (eno2): Activation: starting connection 'System eno2' (b186f945-cc80-911d-668c-b51be8596980)
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2224] device (eno2): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2226] device (eno2): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2234] device (eno2): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')

Oct 29 11:02:31 testovh NetworkManager[1233]: [1635505351.2235] dhcp4 (eno2): activation: beginning transaction (timeout in 45 seconds)
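
To get a sense of how noisy an interface is, the "Activation: failed" lines can be counted per device. This is a small sketch only: the function name `count_activation_failures` is hypothetical, and the pattern assumes the NetworkManager message format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... NetworkManager[1233]: [...] device (eno2): Activation: failed for connection 'System eno2'
# The pattern is an assumption based only on the log excerpt in this post.
FAILED_RE = re.compile(
    r"NetworkManager\[\d+\]: .*device \((?P<dev>[^)]+)\): Activation: failed for connection"
)


def count_activation_failures(lines):
    """Count failed connection activations per network device name."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Example usage against the real log (needs read permission on the file):
#   with open("/var/log/messages", errors="replace") as log:
#       print(count_activation_failures(log).most_common())
```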

9) After 1 more day (2021-10-29 20:52 EDT) OvhCloud reported:
Unfortunately, the NIC interface was flapping so the motherboard had to be replaced due to the NIC itself being onboard.

The cabling was swapped when the motherboard was replaced as a secondary precaution.

For the eno2 warnings in your logs, it is possibly due to the interface flapping and dropping when trying to reconnect. If you currently do not plan on using vRack and would like to prevent seeing your interface from attempting to connect, we recommend disabling the adapter entirely via at the OS level and through the BIOS.

Noman M.

10) This motherboard replacement damaged the cooling system, so the server started to overheat under stress again.
Temperature spikes under stress reached 97C+ and caused occasional CPU throttling.

11) We asked to cancel that "Advance-2-LE" server and requested an Advance-4 server instead.

We continue to evaluate OvhCloud...

Update: AMD EPYC 7313 - how to measure temperature?

Date: 2021-10-31 06:29 am (UTC)
From: [personal profile] krivye_ru4ki
This reminded me of something: I call the datacenter and say, look, monitoring is showing abnormally high temperature readings, is the hardware glitching or what? Oh no, everything is fine, they say, we just have a fire in the datacenter))

Date: 2021-10-31 08:41 pm (UTC)
From: [personal profile] edrevol
Issue after issue...

Date: 2021-11-01 06:50 am (UTC)
From: [personal profile] edrevol
I know next to nothing about hosting services. Can't help you there, sorry (feel bad commenting with no suggestions).

Would be interested to know if the issues are normal, or if OvhCloud is not that great, or something's gone wrong for some reason.

Hope you manage to find one that works.
