TimeKeeper can handle the unique environment of the cloud and virtual hosts. However, it’s important to understand how time sync behaves in virtualized systems in order to configure TimeKeeper and set expectations properly. On virtual hardware the clock is virtualized and may show discontinuous gaps in time, which do not happen on physical systems. As a result, applications that depend on legacy time synchronization technology in the cloud see time jumps, lurches, and divergences that cause misbehavior. TimeKeeper recognizes this environment and is designed to synchronize the clock properly. With care and proper configuration it’s possible to achieve time accuracy close to that of bare-metal hardware. However, virtual host operations can affect the accuracy of time on these systems, so expectations should be set appropriately and users must be aware of these issues. We describe them below.
Common cloud/virtual issues and solutions
TimeKeeper detects when a virtual server instance is migrated, or stopped and then restarted, and keeps time consistent for all applications. This means that timestamps, log files and application output are correlated throughout the network as images move around. In general we recommend disabling migration and not stopping the virtual instance for long periods of time, because after such an operation the time will immediately show an error equal to the time the operation took. For example, if it takes 3 seconds to migrate the virtual instance to a new host, TimeKeeper will show a 3 second error. The virtual clock was stopped for that long, so when the instance wakes up its time is incorrect.
TimeKeeper will immediately recognize this and adjust the clock rate as quickly as possible to correct the error smoothly. By default TimeKeeper will not “jump” the time to make corrections. It smoothly speeds up and slows down the clock to adjust time via a controlled slew. Traditional and non-VM-aware time synchronization tools allow VM-introduced errors to accumulate, correct the error all at once, and then allow more error to accumulate. Applications are often not tolerant of time jumping, lurching, or changing abruptly. Logs will show incorrect times, and multiple servers tracking the same time source will show wildly varying times. TimeKeeper makes time predictable and stops that behavior. When these errors are introduced, the time offset graphs can show a sawtooth pattern. The image below shows that.
Pauses in the operation of the virtual instance can be dramatic, lasting many seconds, or can last only a few milliseconds at a time. The latter is common when physical hardware is overloaded and cannot service all virtual hosts. The time errors described above can appear frequently in those cases. It’s best to reduce the load on the physical hardware so that the virtual machines can run properly without the pauses that introduce time errors.
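To give a feel for why a smooth correction "can take a while", the following is a minimal sketch of the arithmetic behind a slew. It assumes a fixed maximum slew rate; the 500 ppm figure and the function name are hypothetical illustrations, not TimeKeeper's actual rate or API.

```python
def slew_duration_seconds(error_s: float, max_slew_ppm: float) -> float:
    """Time needed to remove error_s by running the clock fast or slow.

    max_slew_ppm is how far the clock rate may deviate from nominal,
    in parts per million (1 ppm = 1 microsecond of correction per second).
    """
    if max_slew_ppm <= 0:
        raise ValueError("max_slew_ppm must be positive")
    return abs(error_s) / (max_slew_ppm * 1e-6)


# Hypothetical slew limit (an assumption for illustration only):
# 500 ppm means 0.5 ms of correction per elapsed second.
ASSUMED_MAX_SLEW_PPM = 500.0

for error in (0.003, 0.050, 3.0):
    duration = slew_duration_seconds(error, ASSUMED_MAX_SLEW_PPM)
    print(f"{error:.3f} s error -> about {duration:,.0f} s of slewing")
```

Under that assumed rate, the 3 second migration error from the example above would take well over an hour to slew away, while a few milliseconds of pause is absorbed in seconds.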
It is possible to configure TimeKeeper to immediately correct large time errors rather than smoothly adjust the time (which can take a while). Enabling the option below will cause TimeKeeper to correct any large time errors immediately. By default, an error of 5 seconds or more is required for TimeKeeper to make this correction.
TimeKeeper also provides the SET_TIME_THRESHOLD option to specify a different threshold. Here is an example where an error of 50 ms or more will be immediately corrected:
Using this option allows VM instances held up by common virtualization overhead (resource contention, migration, etc.) to be quickly brought back into sync rather than letting TimeKeeper slowly correct for the error. In situations where periodic delays are unavoidable, this option can make sure clock accuracy remains within regulatory limits. Because TimeKeeper will validate an offset before allowing the clock to step, this process can take as long as is required to process several PTP/NTP queries. If it’s important to step the clock as quickly as possible, it’s recommended that clients be configured to query at a higher than normal sampling rate so that those validation steps can be completed faster.
Note that SET_TIME_THRESHOLD can also affect startup behavior. If SET_TIME_AT_STARTUP is not set, TimeKeeper steers the clock (rather than stepping it) to incrementally correct the offset at startup whenever that offset is less than the value of SET_TIME_THRESHOLD. This means that changing SET_TIME_THRESHOLD from its 5 second default, in order to change when the clock is stepped after startup due to VM delays, also changes the threshold at which TimeKeeper will step the clock at startup. If SET_TIME_AT_STARTUP is set, TimeKeeper always sets the clock on startup regardless of offset, and SET_TIME_THRESHOLD only affects post-startup behavior.
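The effect of the sampling rate on how quickly a step can happen can be sketched with simple arithmetic. The number of samples needed for validation is not stated in this section, so the value of 4 below is purely an assumption, as are the function name and the example polling intervals.

```python
def validation_delay_seconds(samples_needed: int, poll_interval_s: float) -> float:
    """Rough wait for enough PTP/NTP samples to validate an offset before a step."""
    if samples_needed < 1 or poll_interval_s <= 0:
        raise ValueError("samples_needed must be >= 1 and poll_interval_s > 0")
    return samples_needed * poll_interval_s


# Hypothetical number of consecutive samples used for validation (assumption).
ASSUMED_SAMPLES = 4

for interval in (16.0, 1.0, 0.125):
    delay = validation_delay_seconds(ASSUMED_SAMPLES, interval)
    print(f"poll every {interval} s -> step possible after roughly {delay} s")
```

Whatever the real sample count is, the delay scales linearly with the polling interval, which is why a higher sampling rate shortens the time before the clock can be stepped.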
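The interaction between these options can be summarized as a small decision function. This is only a model of the behavior as described in this section, not TimeKeeper code: the `StepPolicy` class, its field names, and the `step_large_errors` flag (a stand-in for the step-enabling option, whose name is not shown here) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepPolicy:
    # Stand-in for the option that enables immediate correction of large
    # errors after startup (hypothetical name).
    step_large_errors: bool = False
    # Models SET_TIME_AT_STARTUP.
    set_time_at_startup: bool = False
    # Models SET_TIME_THRESHOLD, in seconds (default of 5 seconds).
    set_time_threshold_s: float = 5.0


def correction(policy: StepPolicy, offset_s: float, at_startup: bool) -> str:
    """Return "step" or "slew" for an observed clock offset."""
    exceeds = abs(offset_s) >= policy.set_time_threshold_s
    if at_startup:
        if policy.set_time_at_startup:
            return "step"  # always set the clock at startup
        return "step" if exceeds else "slew"
    if policy.step_large_errors and exceeds:
        return "step"
    return "slew"


# Lowering the threshold to 50 ms for VM pauses also lowers the startup threshold.
vm_policy = StepPolicy(step_large_errors=True, set_time_threshold_s=0.05)
print(correction(vm_policy, 0.2, at_startup=False))    # step
print(correction(vm_policy, 0.2, at_startup=True))     # step
print(correction(StepPolicy(), 0.2, at_startup=True))  # slew
```

The second call shows the side effect described above: a 200 ms startup offset that would be slewed under the default threshold is stepped once the threshold is lowered.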
Good time sources
Something often overlooked in cloud and virtual environments but not with physical hardware is the quality of what is providing time to the client. Time servers in most cloud environments, if they are even available, provide very poor accuracy. Users often have very little choice in what to use, too. In some self-hosted virtual environments that may not be the case due to availability of local time servers managed by a local team. In cases where a 3rd party provides time and manages the platforms one must take care in picking where time comes from.
Some cloud providers, such as AWS, provide local time servers but our testing has shown that these may not be the best choice. The accuracy of these sources can vary greatly and be erratic so thorough testing is suggested. It’s sometimes best to use internal time servers even when going across the WAN to get to them. See more in the “Timing Architecture” section below.
Cross-check of time
The cloud presents a unique challenge when trying to verify time sync. Some cloud providers do not offer traceability to UTC (required for some regulatory reporting). Cross-checking against other time sources often requires going over the same network links that the primary time source uses, which means the cross-check suffers from the same problems as the main time source.
In all setups, but especially with cloud and virtual systems, one needs to check that the time one is being fed is accurate. A single bad network segment or bad time server should not cause you to have bad time, and TimeKeeper won’t allow that to happen if configured properly. TimeKeeper is able to track multiple time sources at the same time, monitor the primary time source for quality, alert when there is a problem, and optionally take corrective action immediately, even in the presence of jamming or spoofing attacks. How best to do this is often a case-by-case issue that depends on the particular setup and the sources available.
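The idea of cross-checking a primary source against others can be illustrated with a short sketch. This is not TimeKeeper's algorithm; it is one simple approach (comparing the primary against the median of the secondary sources), and the function name, tolerance, and sample offsets are all hypothetical.

```python
from statistics import median


def primary_agrees(primary_offset_s: float,
                   secondary_offsets_s: list[float],
                   tolerance_s: float) -> bool:
    """Check the primary source against the median of the secondary sources.

    Each offset is the local clock's measured offset from that source, so the
    difference between two offsets is the disagreement between the sources.
    """
    if not secondary_offsets_s:
        raise ValueError("need at least one secondary source")
    disagreement = abs(primary_offset_s - median(secondary_offsets_s))
    return disagreement <= tolerance_s


# Hypothetical measurements, in seconds.
secondaries = [0.0003, 0.0001, 0.0004]
print(primary_agrees(0.0002, secondaries, tolerance_s=0.001))  # True
print(primary_agrees(0.2500, secondaries, tolerance_s=0.001))  # False
```

Using the median means a single bad secondary source cannot by itself trigger or mask an alert, but if all sources share the same network path they can still be wrong together, as noted above.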
Timing architecture in cloud environments
Accurate time synchronization in cloud environments is not difficult to set up, but there are some things one must know. New financial regulations (MiFID II and FINRA changes) require 1 millisecond and sometimes 100 microsecond accuracy to UTC time as well as traceability to UTC. Many social media, gaming and other applications require similarly accurate timing. In the past that was not easy, since these are virtualized environments where system time does not always match “real time”. Getting past that is not difficult now, though. Below we describe some possible configurations and tradeoffs using a cloud service to show how to use TimeKeeper to achieve those time accuracy milestones.
It’s not unusual to see 10 minutes/day of drift on AWS EC2 instances that are not time synchronized, given the harsh environment for time sync in the cloud. Precise time requires the sophisticated algorithms, filters, and network timing models that TimeKeeper provides.
It’s also very important to properly set up your clients to get time from a good quality time server. Below you can see how a typical setup normally works. In this setup no internal AWS time server is used, so one has to get time from the internet. Often that means whatever time servers a particular version of the Windows or Linux time sync client defaults to. That’s often a very bad choice. Even with GPS-backed TimeKeeper NTP (equivalent to UTC here) as the original time source over the internet, there may be 3 intermediate time servers adding their own errors before the time reaches the final client system. TimeKeeper automatically maps time sources, which allows you to visualize this and check for similar setups.
Below are typical cloud configurations. They use external (outside of the hosting cloud) servers which often are themselves getting 2nd or 3rd hand time from other servers. This is both the typical default setup and the one that gives the worst quality time.
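To put that drift figure in context, the sketch below converts 10 minutes/day into a rate and estimates how quickly an uncorrected clock would exceed the 1 millisecond and 100 microsecond limits mentioned above. It assumes the drift is constant, which is a simplification; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def drift_ppm(drift_seconds_per_day: float) -> float:
    """Convert a daily drift into parts per million."""
    return drift_seconds_per_day / SECONDS_PER_DAY * 1e6


def seconds_until_error(limit_s: float, drift_seconds_per_day: float) -> float:
    """Elapsed time before a constant drift accumulates limit_s of error."""
    return limit_s / (drift_seconds_per_day / SECONDS_PER_DAY)


drift = 10 * 60  # 10 minutes/day, expressed in seconds per day

print(f"drift rate: about {drift_ppm(drift):.0f} ppm")
for limit in (1e-3, 100e-6):
    elapsed_ms = seconds_until_error(limit, drift) * 1000
    print(f"{limit * 1e6:.0f} microsecond limit exceeded after about {elapsed_ms:.1f} ms")
```

At roughly 6,900 ppm, an unsynchronized clock would pass the 1 millisecond limit in a fraction of a second, which is why continuous correction rather than occasional resetting is needed.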
Below is a far better setup. A single cloud instance is used to track a good time server. That in turn provides time to the other systems within the cloud. This will allow close tracking of GPS/UTC on one server that can clean up the time signal (remove jitter, stabilize any WAN noise, etc) while also allowing for a very tight time sync between instances.
The setup above achieves the most common need in cloud environments: keeping a group of servers as tightly synchronized with each other as possible, with that shared time being reasonably close to UTC (wall-clock time). TimeKeeper can handle these competing requirements by tracking an absolute time source (UTC) and providing that time to peer systems in the cloud. This allows a collection of servers to be managed as a single unit.
This is an ideal setup, but it is only possible when a GPS time server is available inside your own network. TimeKeeper is configured to source time directly from that system over a leased line or other direct link from within the cloud.