Python: Comparing timestamp strings

This article discusses how you should (and shouldn't) compare string representations of timestamps in Python.

When programmers process external data, it often contains timestamps (a date and a time). These usually come into Python as strings, although sometimes they arrive as integers or floats. The international standard for exchanging timestamps is ISO 8601. One of the formats it specifies looks like this:

"2022-04-24T11:33+00:00"

This format provides a date, a time, and a UTC offset (loosely speaking, the time zone). The Python standard library has a data type to represent this: the timezone-aware datetime.datetime type. However, external data is often serialized as JSON, and JSON doesn't have a timestamp data type, so we usually receive timestamps as strings. The JSON string '{"name": "Sheba", "weight": 3.6, "admitted": "2022-04-24T11:33+00:00"}' would get decoded from JSON as

{
    'name': 'Sheba',
    'weight': 3.6,
    'admitted': '2022-04-24T11:33+00:00',
}
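
As a minimal sketch of that decoding step, here is the same JSON string run through the standard library's json module. The variable names raw and record are only illustrative.

```python
import json

raw = '{"name": "Sheba", "weight": 3.6, "admitted": "2022-04-24T11:33+00:00"}'
record = json.loads(raw)

# JSON numbers become Python numbers, but the timestamp stays a plain string.
print(type(record["weight"]).__name__)    # float
print(type(record["admitted"]).__name__)  # str
```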

Now let's say we receive data like the JSON above, except it's a list of items that each contain a timestamp string. And let's say we want to get the most recent item according to the timestamp, for example:

data = [
    {
        'name': 'Sheba',
        'admitted': '2022-04-24T11:33+00:00',
    },
    {
        'name': 'Amber',
        'admitted': '2020-09-24T09:14+00:00',
    },
    {
        'name': 'Mittens',
        'admitted': '2021-11-07T16:08+00:00',
    },
]

Here, the temptation might be to do something like this:

most_recent_admission = max(data, key=lambda i: i["admitted"])

This gives us the result we expect (Sheba). However, one should think about the data type being compared in the max() call. It is comparing strings character by character, which, in the above example, happens to give the same ordering as comparing actual timestamps. The danger is that this is not always the case. Consider the following timestamp strings:

>>> d1 = '2022-04-25 12:15:39.410116+00:00'
>>> d2 = '2022-04-25 07:15:39.410116-05:00'
>>> d1 > d2
True

As strings, d1 compares as greater than d2, which we might interpret to mean that d1 is the more recent timestamp. In reality, however, these times are equal: 07:15 at an offset of -05:00 is 12:15 in UTC.

>>> import datetime
>>> d1 = datetime.datetime(2022, 4, 25, 12, 15, 39, 410116, tzinfo=datetime.timezone.utc)
>>> d2 = datetime.datetime(2022, 4, 25, 7, 15, 39, 410116, tzinfo=datetime.timezone(datetime.timedelta(hours=-5)))
>>> d1 > d2
False
>>> d1 == d2
True
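
Rather than building the datetime objects by hand, we could also parse the same two strings with datetime.datetime.fromisoformat() and compare the results. This is a small sketch; the names s1, s2, t1 and t2 are only illustrative.

```python
import datetime

s1 = '2022-04-25 12:15:39.410116+00:00'
s2 = '2022-04-25 07:15:39.410116-05:00'

# Parse each string into a timezone-aware datetime.
t1 = datetime.datetime.fromisoformat(s1)
t2 = datetime.datetime.fromisoformat(s2)

print(s1 > s2)   # True: string comparison goes character by character
print(t1 > t2)   # False
print(t1 == t2)  # True: both are the same instant
```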

Hopefully the above example has convinced you that comparing timestamps as strings is not ideal. A quick way to overcome this would be to change our call to max() to something like this:

import datetime as dt

most_recent_admission = max(data, key=lambda i: dt.datetime.fromisoformat(i["admitted"]))
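
To see what parsing in the key function buys us, here is a small sketch with two made-up records (not from the data above) whose timestamps carry different UTC offsets. With these, the string comparison and the datetime comparison pick different winners.

```python
import datetime as dt

# Hypothetical records whose timestamps use different UTC offsets.
mixed = [
    {"name": "Sheba", "admitted": "2022-04-24T11:33+00:00"},
    {"name": "Amber", "admitted": "2022-04-24T09:14-05:00"},  # 14:14 in UTC
]

by_string = max(mixed, key=lambda i: i["admitted"])
by_datetime = max(mixed, key=lambda i: dt.datetime.fromisoformat(i["admitted"]))

print(by_string["name"])    # Sheba (wrong: "11:33" merely sorts after "09:14")
print(by_datetime["name"])  # Amber (right: 14:14 UTC is later than 11:33 UTC)
```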

Though this works fine and overcomes our problem, there is a common belief that "external" data formats should be converted to "internal" data formats as soon as the data is consumed (and converted back to "external" data formats just before it is exported). This is often useful if you plan to do other things with the data later on. So this could be rewritten further as

my_data = [{**i, "admitted": dt.datetime.fromisoformat(i["admitted"])} for i in data]
most_recent_admission = max(my_data, key=lambda i: i["admitted"])

Then, if we need to do any subsequent datetime manipulation on the "admitted" field, it's already in the data type we need. There are also higher-level tools for modeling external data as richer Python types. Some examples include dataclasses in the standard library as well as the third-party pydantic library.
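
As a rough sketch of the dataclasses approach, the example below models one record and converts the timestamp at the boundaries. The Admission class and its from_external() and to_external() methods are hypothetical names made up for this illustration; they are not part of any library.

```python
import datetime as dt
import json
from dataclasses import dataclass


@dataclass
class Admission:
    """Hypothetical internal model for one record from the example data."""

    name: str
    admitted: dt.datetime

    @classmethod
    def from_external(cls, item: dict) -> "Admission":
        # Parse the timestamp string once, at the point the data is consumed.
        return cls(
            name=item["name"],
            admitted=dt.datetime.fromisoformat(item["admitted"]),
        )

    def to_external(self) -> dict:
        # Convert back to a string only when the data is about to be exported.
        return {"name": self.name, "admitted": self.admitted.isoformat()}


data = [
    {"name": "Sheba", "admitted": "2022-04-24T11:33+00:00"},
    {"name": "Amber", "admitted": "2020-09-24T09:14+00:00"},
    {"name": "Mittens", "admitted": "2021-11-07T16:08+00:00"},
]

admissions = [Admission.from_external(i) for i in data]
most_recent = max(admissions, key=lambda a: a.admitted)

print(most_recent.name)  # Sheba
print(json.dumps(most_recent.to_external()))
```

Note that isoformat() writes the seconds out explicitly, so the exported string reads "2022-04-24T11:33:00+00:00" rather than matching the input character for character. It still represents the same instant.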