
When did your data REALLY arrive in BigQuery?

May 1, 2025

A little while ago, I was lucky enough to collaborate with some engineers who were working on Google Cloud. They were building a data pipeline that streamed millions of rows of event data into BigQuery every few seconds. The pipeline would also incrementally identify and process the latest data for reporting and distribution to downstream systems.

They needed to be sure they were always working with a complete and accurate dataset for each processing window, preventing missed or duplicate data. To do this, it was crucial for them to accurately pinpoint two moments in time: when the data arrived in BigQuery, and when their batch processing made this data available. It’s a challenge that crops up often and highlights the importance of ingestion time — which in its most accurate form, is the moment data is committed to storage and becomes queryable within BigQuery.

Ingestion time is fundamental for building reliable data systems. Beyond ensuring your dashboards, reports and ETL processes use the freshest data, it’s also important for understanding how your data progresses through your pipeline. The ability to compare event and ingestion timestamps can help identify precisely where latency originates — be it the event source, network, or ingestion step. Critically, this understanding allows you to build a profile of late arriving data, and once you have this profile, you’ll be able to determine what action, if any, you need to take. You can optimize your ETL windows for completeness, set up alerts for unusual data delays, or investigate and improve upstream data sources, making your data more valuable and trustworthy.

This post introduces the approaches you can take to track BigQuery ingestion time. It seems straightforward, but there are a few gotchas you’ll want to keep in mind.

Application Timestamp

If you’re writing to BigQuery from an application (for example, a Dataflow pipeline), or streaming using BigQuery’s Storage Write API, it’s common practice to manually add a timestamp column during processing at the application level.

This timestamp usually reflects when the application processed the data, and in many cases that’s a good enough indicator of ingestion time.

However, this timestamp might differ significantly from when BigQuery committed the data to storage due to network latency, API queuing, internal BigQuery processing, or application-level batching and retry mechanisms.

Note: with streaming inserts via the Storage Write API, a column default value (see the CURRENT_TIMESTAMP() section below) is often a very good approximation of the commit time, because writes typically happen in frequent, small transactions.

So, what approaches can be used when this application-level timestamp isn’t suitable, or when you’re not using an application to write to BigQuery?

Ingestion-Time Partitioning

You can partition a BigQuery table by ingestion time: BigQuery automatically assigns rows to partitions based on the time that it ingests the data. Partitions can have a granularity of hourly, daily, monthly, or yearly.
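As a minimal sketch (the table name and schema here are hypothetical, not from the original pipeline), an hourly ingestion-time partitioned table can be created like this:

CREATE TABLE mydataset.events_by_ingestion (
  event_id STRING,
  data JSON
)
-- Truncating the _PARTITIONTIME pseudocolumn to HOUR gives hourly partitions;
-- PARTITION BY _PARTITIONDATE would give daily partitions instead.
PARTITION BY TIMESTAMP_TRUNC(_PARTITIONTIME, HOUR);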

Tables partitioned by ingestion-time have a pseudocolumn named _PARTITIONTIME that stores metadata about the ingestion time for each row.

However, the timestamp is truncated to the partition boundary, so it will not give you an exact ingestion time. For example, in an hourly partitioned table, every row has a partition time of the start of the hour in which it was inserted:
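A minimal sketch, assuming the hypothetical events_by_ingestion table above:

SELECT
  event_id,
  -- Returns 2025-04-30 18:00:00 UTC for any row ingested between 18:00 and 19:00
  _PARTITIONTIME AS partition_time
FROM
  mydataset.events_by_ingestion
WHERE
  _PARTITIONTIME = TIMESTAMP('2025-04-30 18:00:00');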

This approach is best suited to optimizing query performance and cost by filtering data on partitions, or to cases where an approximate, coarse timestamp is good enough.

It’s not suitable if you need a more precise ingestion timestamp.

CURRENT_TIMESTAMP() Function

BigQuery’s CURRENT_TIMESTAMP() function returns the current date and time as a timestamp object. When you create a table, you can define a timestamp column and assign CURRENT_TIMESTAMP() as its default value:

CREATE TABLE mydataset.raw_events (
  event_id STRING,
  event_timestamp TIMESTAMP,
  ingestion_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
  data JSON);

When you insert a row and do not provide a value for this particular column, the default value is written to the column instead, which in this case will be the current time.

INSERT INTO
  mydataset.raw_events (event_id, event_timestamp, ingestion_time, data)
VALUES (
  'event_123',
  TIMESTAMP('2025-04-30T18:55:00Z'),
  DEFAULT,
  JSON '{"sensor_id": "sensor_A", "value": 75.5, "status": "active"}'
);

This approach is great for getting an ingestion time with microsecond precision, and works a treat for streaming use cases that have frequent, small transactions.

However, you have to be mindful when using the CURRENT_TIMESTAMP() function in an INSERT statement that takes a while to run. The value of the CURRENT_TIMESTAMP() function is captured only once at the very start of the query statement, not individually for each row at the point of writing to BigQuery. For example, let’s imagine a statement like the below takes a few minutes to run:

INSERT INTO mydataset.processed_events
SELECT
  CURRENT_TIMESTAMP() AS processing_time,
  s.event_timestamp AS original_event_time,
  s.event_id,
  JSON_VALUE(s.data, '$.payload.product_code') AS product_code
FROM
  mydataset.raw_events s

All rows inserted by this statement will receive an identical timestamp for the processing_time column based on when the query statement began, not the time they were physically written to storage.
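You can see this single-evaluation behaviour for yourself; both columns in the query below always return an identical value, captured once when the statement starts:

SELECT
  CURRENT_TIMESTAMP() AS t1,
  -- Always equal to t1: the value is captured once per statement, not per expression
  CURRENT_TIMESTAMP() AS t2;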

So while this might reflect the commit time for single-row inserts or very fast statements quite accurately, it becomes increasingly inaccurate as the duration of the INSERT statement increases. And keep in mind, using INSERT statements for streaming data isn’t generally recommended anyway.

So it’s a few minutes off, who cares? This inaccuracy can cause issues for downstream processes relying on precise ingestion time. Having a timestamp that predates the data actually being available and queryable in BigQuery could lead to missed or duplicate data processing. For example:

  • Let’s say we have a very complex query that inserts data into a table and takes 8 minutes to run.
  • It starts at 20:55:00 and finishes (at which point the data has been committed to storage and is available for querying) at 21:03:00.
  • The processing_time on each and every row will be, let’s say, 20:55:02.
  • We also have the below script, which runs at one minute past the hour, every hour, and attempts to get all the rows inserted in the last hour:

SELECT
  *
FROM
  mydataset.processed_events
WHERE
  processing_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)

  • When this script runs at 21:01:00, our data with a processing_time of 20:55:02 will not be returned, because it is not yet available.
  • When this script runs again at 22:01:00, our data will again not be returned, because its processing_time of 20:55:02 is outside the time window of this query.

One way to get around this is to query the change history of the table.

BigQuery Change History

We have a handful of change history functions that let you track the history of changes to a BigQuery table. In our particular case, we can use the APPENDS time series function to return all the rows appended (inserted) to a table for a given time range:

SELECT
  *
FROM
  APPENDS(
    TABLE mydataset.processed_events,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 DAY),
    CURRENT_TIMESTAMP()
  )
  • _CHANGE_TIMESTAMP is the commit time of the insert transaction, with microsecond precision.
  • _CHANGE_TYPE will always be INSERT in the case of appends. You can capture other change types (UPDATE, DELETE) by using the CHANGES function instead.

While APPENDS doesn’t store the time persistently in the row where your data is located, it is able to extract the commit timestamp of the transaction that inserted the rows — the true time at which the data became queryable.
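This also gives you a way to build the profile of late-arriving data mentioned earlier. A minimal sketch, assuming the mydataset.raw_events table from above: compare each row’s event timestamp with the commit timestamp of the transaction that inserted it.

SELECT
  event_id,
  event_timestamp,
  _CHANGE_TIMESTAMP AS commit_time,
  -- How long each row took to become queryable after the event occurred
  TIMESTAMP_DIFF(_CHANGE_TIMESTAMP, event_timestamp, SECOND) AS latency_seconds
FROM
  APPENDS(
    TABLE mydataset.raw_events,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY),
    CURRENT_TIMESTAMP()
  )
ORDER BY latency_seconds DESC;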

However, there are some limitations. APPENDS cannot be executed as part of multi-statement transactions and doesn’t support clones, snapshots, (materialized) views or external tables. Also, the time travel window (the furthest back in time you can apply this query) is limited to 7 days.

Summary

Ultimately, the best technique for capturing the ingestion time in BigQuery depends on how you’re writing to BigQuery and what your requirements are:

  • Application timestamp: easy to add during processing, but can drift from the true commit time due to network latency, batching and retries.
  • Ingestion-time partitioning (_PARTITIONTIME): automatic, but truncated to the partition boundary, so only suitable when a coarse timestamp is good enough.
  • CURRENT_TIMESTAMP() column default: microsecond precision that closely tracks commit time for frequent, small writes, but captured only once per statement for long-running INSERTs.
  • Change history (APPENDS): the true commit time of each insert, but limited to a 7-day time travel window and not supported for all table types.

Check out these different approaches to see which is right for you.


Published in Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.


Written by Rachael Deacon-smith

I'm a Developer Advocate at Google Cloud, helping practitioners navigate the world of Data and AI.