Tutorial 1: Loading and Sampling Trajectory Data
Real-world mobility files vary widely in structure and formatting. Timestamps may be recorded as UNIX integers or ISO-formatted strings, with or without timezone offsets. Coordinate columns may follow different naming conventions, and files may be stored either as flat CSVs or as partitioned Parquet directories. This notebook demonstrates how nomad.io.base standardizes data loading across these variations using two example datasets: a CSV file (gc-data.csv) and a partitioned Parquet directory (gc-data/).
Inspecting schemas
Let's start by inspecting the schemas of both datasets with table_columns, a nomad helper function from the io module. This function reports column names for both flat files and partitioned datasets without reading the full content into memory.
from nomad.io import base as loader
print(loader.table_columns("gc-data.csv", format="csv"))
print(loader.table_columns("gc-data/", format="parquet"))
Index(['identifier', 'unix_timestamp', 'device_lat', 'device_lon', 'date',
'offset_seconds', 'local_datetime'],
dtype='object')
Index(['user_id', 'timestamp', 'latitude', 'longitude', 'tz_offset',
'datetime', 'date'],
dtype='object')
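To illustrate why a schema can be inspected without loading the data, here is a minimal sketch that reads only the header row of a CSV using the Python standard library. It is not nomad's implementation of table_columns. The helper name csv_columns and the stand-in file sample.csv are assumptions made for this example.

```python
import csv
import tempfile
from pathlib import Path


def csv_columns(path):
    """Hypothetical helper: return column names by reading only the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        # next() consumes a single row, so the data rows are never parsed.
        return next(csv.reader(f))


# Build a tiny stand-in file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text(
        "identifier,unix_timestamp,device_lat,device_lon\n"
        "wizardly_joliot,1704119340,38.321711,-36.667334\n",
        encoding="utf-8",
    )
    columns = csv_columns(sample)

print(columns)
# ['identifier', 'unix_timestamp', 'device_lat', 'device_lon']
```

The same idea applies to Parquet, where column names live in file metadata rather than in the first row.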
Loading data
Reading data with pandas or Parquet readers does not enforce any particular schema, but spatiotemporal data often contains columns that must follow specific formats. The from_file function applies consistent type casting: it converts temporal fields to datetime objects, ensures coordinates are numeric, and optionally creates a tz_offset column to store timezone offsets when parsing datetime strings. Keeping the offset in its own column preserves compatibility with engines like Spark, in which Timestamp objects cannot store timezone information. When column names differ from the expected defaults, from_file accepts a traj_cols dictionary that maps standard names to the dataset's column names, allowing downstream functions to locate required fields without renaming or altering the data.
traj_cols = {
"user_id": "identifier",
"timestamp": "unix_timestamp",
"latitude": "device_lat",
"longitude": "device_lon",
"datetime": "local_datetime",
"tz_offset": "offset_seconds",
"date": "date"
}
df_mapped = loader.from_file("gc-data.csv", format="csv", traj_cols=traj_cols)
df_mapped.head()
| | identifier | unix_timestamp | device_lat | device_lon | date | offset_seconds | local_datetime |
|---|---|---|---|---|---|---|---|
| 0 | wizardly_joliot | 1704119340 | 38.321711 | -36.667334 | 2024-01-01 | 0 | 2024-01-01 14:29:00 |
| 1 | wizardly_joliot | 1704119700 | 38.321676 | -36.667365 | 2024-01-01 | 0 | 2024-01-01 14:35:00 |
| 2 | wonderful_swirles | 1704121560 | 38.321017 | -36.667869 | 2024-01-01 | -7200 | 2024-01-01 13:06:00 |
| 3 | youthful_galileo | 1704098820 | 38.321625 | -36.666612 | 2024-01-01 | 0 | 2024-01-01 08:47:00 |
| 4 | youthful_galileo | 1704103140 | 38.321681 | -36.666841 | 2024-01-01 | 0 | 2024-01-01 09:59:00 |
This mapping makes the dataset compatible with nomad tools without modifying its original structure: algorithms that expect standard names such as timestamp, latitude, or user_id use the dictionary to find the corresponding columns.
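The lookup idea can be sketched in plain Python. This is only an illustration of the pattern, not nomad's internal code: resolve_column, first_ping_per_user, and DEFAULT_COLS are hypothetical names, and the rows are copied from the table above.

```python
# Hypothetical defaults: each standard name maps to itself unless overridden.
DEFAULT_COLS = {
    "user_id": "user_id",
    "timestamp": "timestamp",
    "latitude": "latitude",
    "longitude": "longitude",
}


def resolve_column(standard_name, traj_cols=None):
    """Return the dataset's column name for a standard field name."""
    mapping = {**DEFAULT_COLS, **(traj_cols or {})}
    return mapping[standard_name]


def first_ping_per_user(rows, traj_cols=None):
    """Earliest timestamp per user, written against standard names only."""
    uid = resolve_column("user_id", traj_cols)
    ts = resolve_column("timestamp", traj_cols)
    first = {}
    for row in rows:
        user = row[uid]
        if user not in first or row[ts] < first[user]:
            first[user] = row[ts]
    return first


traj_cols = {"user_id": "identifier", "timestamp": "unix_timestamp"}
rows = [
    {"identifier": "wizardly_joliot", "unix_timestamp": 1704119340},
    {"identifier": "wizardly_joliot", "unix_timestamp": 1704119700},
    {"identifier": "wonderful_swirles", "unix_timestamp": 1704121560},
    {"identifier": "youthful_galileo", "unix_timestamp": 1704098820},
    {"identifier": "youthful_galileo", "unix_timestamp": 1704103140},
]

earliest = first_ping_per_user(rows, traj_cols)
print(earliest)
```

The function body never mentions identifier or unix_timestamp, so the same code works on a dataset with default names by omitting traj_cols.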
# This dataset has default column names, so no traj_cols argument is necessary
df_pq = loader.from_file("gc-data/", format="parquet", parse_dates=True)
df_pq.head()
| | user_id | timestamp | latitude | longitude | tz_offset | datetime | date |
|---|---|---|---|---|---|---|---|
| 0 | wizardly_joliot | 1704119340 | 38.321711 | -36.667334 | 0 | 2024-01-01 14:29:00 | 2024-01-01 |
| 1 | wizardly_joliot | 1704119700 | 38.321676 | -36.667365 | 0 | 2024-01-01 14:35:00 | 2024-01-01 |
| 2 | wonderful_swirles | 1704121560 | 38.321017 | -36.667869 | -7200 | 2024-01-01 13:06:00 | 2024-01-01 |
| 3 | youthful_galileo | 1704098820 | 38.321625 | -36.666612 | 0 | 2024-01-01 08:47:00 | 2024-01-01 |
| 4 | youthful_galileo | 1704103140 | 38.321681 | -36.666841 | 0 | 2024-01-01 09:59:00 | 2024-01-01 |
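The relationship between the timestamp, tz_offset, and datetime columns can be checked by hand. The sketch below uses only the standard library and assumes that tz_offset is expressed in seconds relative to UTC, which matches the rows shown above. The function name local_datetime is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_datetime(unix_ts, offset_seconds):
    """Convert a UNIX timestamp to a timezone-aware datetime at a fixed UTC offset."""
    tz = timezone(timedelta(seconds=offset_seconds))
    return datetime.fromtimestamp(unix_ts, tz=tz)


# Row 2 of the table: timestamp 1704121560 with an offset of -7200 seconds.
aware = local_datetime(1704121560, -7200)
# Dropping tzinfo gives the naive local time stored in the datetime column.
naive = aware.replace(tzinfo=None)

print(aware.isoformat())  # 2024-01-01T13:06:00-02:00
print(naive)              # 2024-01-01 13:06:00
```

Storing the UNIX timestamp together with the offset keeps the instant unambiguous even when the local datetime column is timezone-naive.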
from_file also reads GPS data stored in partitioned directories (e.g. date=2024-01-01/), so users familiar with pandas can inspect partitioned Parquet datasets without worrying about type casting.
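In a Hive-style layout, the partition value is encoded in the directory name rather than inside the files. A minimal sketch of how such key=value directory names can be parsed follows. The file names (part-0.parquet) and the helper partition_values are hypothetical, and this is not how nomad or any Parquet reader is actually implemented.

```python
from pathlib import PurePosixPath


def partition_values(relative_path):
    """Extract Hive-style key=value pairs from the directory parts of a path."""
    values = {}
    for part in PurePosixPath(relative_path).parent.parts:
        key, sep, value = part.partition("=")
        if sep:  # only directory names containing '=' are partition keys
            values[key] = value
    return values


# Hypothetical file listing for a dataset partitioned by date.
files = [
    "date=2024-01-01/part-0.parquet",
    "date=2024-01-02/part-0.parquet",
]
parsed = {f: partition_values(f) for f in files}

for path, values in parsed.items():
    print(path, values)
```

A reader that understands this layout can restore date as a regular column, which is why the date column appears in the table above even though it is stored in directory names.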
Working on smaller samples and persistence
Large mobility datasets usually should not be loaded fully into memory during interactive analysis, so subsampling by user is a common early step. nomad's sample_users selects a reproducible subset of user IDs, and sample_from_file filters the input dataset to include only those users' records. The resulting sample can be written to disk using to_file, partitioned by date in Hive-style directories to preserve compatibility with distributed engines. Reading the output back with from_file confirms that the sample was saved correctly and remains compatible with the same loading functions.
users = loader.sample_users("gc-data/", format="parquet", size=10, seed=42)
sample_df = loader.sample_from_file("gc-data/", users=users, format="parquet")
loader.to_file(
    sample_df,
    "/tmp/nomad_sample",
    format="parquet",
    partition_by=["date"],
    existing_data_behavior="delete_matching",
)
round_trip = loader.from_file("/tmp/nomad_sample", format="parquet")
round_trip.head()
C:\Users\pacob\Documents\notebooks\daphme\nomad\io\base.py:613: UserWarning: The 'datetime' column has timezone-naive records consider localizing or using unix timestamps.
warnings.warn(f"The '{col}' column has timezone-naive records consider localizing or using unix timestamps.")
| | user_id | timestamp | latitude | longitude | tz_offset | datetime | date |
|---|---|---|---|---|---|---|---|
| 0 | wizardly_joliot | 1704119340 | 38.321711 | -36.667334 | 0 | 2024-01-01 14:29:00 | 2024-01-01 |
| 1 | wizardly_joliot | 1704119700 | 38.321676 | -36.667365 | 0 | 2024-01-01 14:35:00 | 2024-01-01 |
| 2 | competent_torvalds | 1704114840 | 38.320659 | -36.667228 | -7200 | 2024-01-01 11:14:00 | 2024-01-01 |
| 3 | competent_torvalds | 1704117060 | 38.322056 | -36.667541 | -7200 | 2024-01-01 11:51:00 | 2024-01-01 |
| 4 | competent_torvalds | 1704117120 | 38.322075 | -36.667592 | -7200 | 2024-01-01 11:52:00 | 2024-01-01 |
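The idea behind reproducible user-level sampling can be sketched with the standard library. This is an illustration of the general technique, not the implementation of sample_users or sample_from_file. The helper names sample_user_ids and filter_to_users are hypothetical.

```python
import random


def sample_user_ids(user_ids, size, seed):
    """Pick `size` distinct user IDs; the same seed always yields the same subset."""
    # Sorting first makes the result independent of the order of the input records.
    unique = sorted(set(user_ids))
    return random.Random(seed).sample(unique, size)


def filter_to_users(rows, users, user_col="user_id"):
    """Keep only the records that belong to the sampled users."""
    keep = set(users)
    return [row for row in rows if row[user_col] in keep]


rows = [
    {"user_id": "wizardly_joliot", "timestamp": 1704119340},
    {"user_id": "wizardly_joliot", "timestamp": 1704119700},
    {"user_id": "wonderful_swirles", "timestamp": 1704121560},
    {"user_id": "youthful_galileo", "timestamp": 1704098820},
    {"user_id": "competent_torvalds", "timestamp": 1704114840},
]

users = sample_user_ids([r["user_id"] for r in rows], size=2, seed=42)
sample = filter_to_users(rows, users)
print(users)
print(len(sample), "records kept")
```

Sampling whole users, rather than individual rows, keeps each selected trajectory complete, which matters for analyses that depend on the sequence of a user's pings.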