This article discusses how you should (and shouldn't) compare string representations of timestamps in Python.
When programmers process external data, it often contains timestamps (a date and time). This data usually arrives in Python as a string, although sometimes it is an integer or float. The international standard for exchanging timestamps is ISO 8601. One of the formats it specifies looks like this:
"2022-04-24T11:33+00:00"
This format provides a date, a time, and a UTC offset (i.e. time zone information). The Python standard library has a data type to represent this: the timezone-aware datetime.datetime type. However, external sources often serialize their data as JSON, and JSON has no timestamp data type, so the timestamps usually reach us as strings. The JSON string
'{"name": "Sheba", "weight": 3.6, "admitted": "2022-04-24T11:33+00:00"}'
would get decoded from JSON as
{
'name': 'Sheba',
'weight': 3.6,
'admitted': '2022-04-24T11:33+00:00',
}
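As a quick check of the decoding step described above, here is a minimal sketch using the standard library json module. The variable names raw and record are only illustrative.

```python
import json

raw = '{"name": "Sheba", "weight": 3.6, "admitted": "2022-04-24T11:33+00:00"}'
record = json.loads(raw)

# JSON has a number type, so "weight" arrives as a float,
# but the timestamp has no JSON equivalent and stays a plain str.
print(type(record["weight"]).__name__)    # float
print(type(record["admitted"]).__name__)  # str
```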
Now let's say we receive data like the JSON above, except that it's a list of items, each containing a timestamp string. And let's say we want to get the most recent item according to the timestamp, for example:
data = [
{
'name': 'Sheba',
'admitted': '2022-04-24T11:33+00:00',
},
{
'name': 'Amber',
'admitted': '2020-09-24T09:14+00:00',
},
{
'name': 'Mittens',
'admitted': '2021-11-07T16:08+00:00',
},
]
Here, the temptation might be to do something like this:
most_recent_admission = max(data, key=lambda i: i["admitted"])
This gives us the result we expect (Sheba). However, one should think about the data type being compared in the max() call. It is comparing strings, which in the above example happen to compare the same way the actual timestamps would, because every string uses the same format and the same UTC offset. The danger is that this is not always the case. Consider the following timestamp strings:
>>> d1 = '2022-04-25 12:15:39.410116+00:00'
>>> d2 = '2022-04-25 07:15:39.410116-05:00'
>>> d1 > d2
True
As strings, d1 compares as greater than d2, which we might interpret to mean that d1 is the more recent timestamp. However, in reality these times are equal!
>>> import datetime
>>> d1 = datetime.datetime(2022, 4, 25, 12, 15, 39, 410116, tzinfo=datetime.timezone.utc)
>>> d2 = datetime.datetime(2022, 4, 25, 7, 15, 39, 410116, tzinfo=datetime.timezone(datetime.timedelta(hours=-5)))
>>> d1 > d2
False
>>> d1 == d2
True
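The same comparison can be made starting from the strings themselves. The sketch below parses them with datetime.datetime.fromisoformat and then applies the idea to a small list of records. The two records and their names are hypothetical, invented only to show a case where the string ordering and the chronological ordering disagree.

```python
import datetime

s1 = "2022-04-25 12:15:39.410116+00:00"
s2 = "2022-04-25 07:15:39.410116-05:00"

t1 = datetime.datetime.fromisoformat(s1)
t2 = datetime.datetime.fromisoformat(s2)

print(s1 > s2)   # True: character-by-character comparison, "1" sorts after "0"
print(t1 == t2)  # True: both describe the same instant

# Hypothetical records with different UTC offsets.
records = [
    {"name": "Tiger", "admitted": "2022-04-25T12:00+00:00"},
    {"name": "Smokey", "admitted": "2022-04-25T09:00-05:00"},  # 14:00 in UTC
]

by_string = max(records, key=lambda r: r["admitted"])
by_time = max(records, key=lambda r: datetime.datetime.fromisoformat(r["admitted"]))

print(by_string["name"])  # Tiger
print(by_time["name"])    # Smokey
```

Comparing the strings picks the earlier admission here, while comparing parsed datetimes picks the later one.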
Hopefully the above example has convinced you that comparing timestamps as strings is not ideal. A quick way to overcome this would be to change our call to max() to something like this:
import datetime as dt
max(data, key=lambda i: dt.datetime.fromisoformat(i["admitted"]))
Though this works fine and overcomes our problem, there is a common belief that "external" data formats should be converted to "internal" data formats as soon as the data is consumed (and converted back to "external" data formats just before it is exported). This is often useful if you plan to do other things with the data later on. So this could be rewritten further as
my_data = [{**i, "admitted": dt.datetime.fromisoformat(i["admitted"])} for i in data]
most_recent_admission = max(my_data, key=lambda i: i["admitted"])
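The reverse direction, converting back to the "external" format just before export, might be sketched as follows. This assumes the data is exported as JSON again. The single record in my_data is only a stand-in for the converted list, and timespec="minutes" is chosen here so the output matches the minute-precision strings used earlier.

```python
import datetime as dt
import json

# Stand-in for the converted data: "admitted" is already a datetime.
my_data = [
    {"name": "Sheba", "admitted": dt.datetime(2022, 4, 24, 11, 33, tzinfo=dt.timezone.utc)},
]

# Convert datetimes back to ISO 8601 strings only at the export boundary.
exported = json.dumps(
    [{**i, "admitted": i["admitted"].isoformat(timespec="minutes")} for i in my_data]
)
print(exported)
# [{"name": "Sheba", "admitted": "2022-04-24T11:33+00:00"}]
```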
With the conversion done up front, any subsequent datetime manipulation on the "admitted" field finds it already in the data type we need. There are also higher-level tools for modeling external data as richer data types in Python. Some examples include dataclasses in the standard library as well as the third-party pydantic library.
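As a rough illustration of the dataclasses approach, here is a minimal sketch. The Admission class and its from_dict helper are hypothetical names introduced for this example. A plain dataclass does not convert values on its own, so the parsing is done explicitly in the helper.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass
class Admission:
    """Hypothetical internal model for one record."""
    name: str
    admitted: dt.datetime

    @classmethod
    def from_dict(cls, raw: dict) -> "Admission":
        # Parse the external string into a timezone-aware datetime on the way in.
        return cls(name=raw["name"], admitted=dt.datetime.fromisoformat(raw["admitted"]))


data = [
    {"name": "Sheba", "admitted": "2022-04-24T11:33+00:00"},
    {"name": "Amber", "admitted": "2020-09-24T09:14+00:00"},
    {"name": "Mittens", "admitted": "2021-11-07T16:08+00:00"},
]

admissions = [Admission.from_dict(i) for i in data]
most_recent_admission = max(admissions, key=lambda a: a.admitted)
print(most_recent_admission.name)  # Sheba
```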