# BigQuery Storage API

Read from and write to BigQuery tables using the BigQuery Storage API, which offers higher throughput than the standard query API for large datasets.

## Storage Read

Reads table data directly via the BigQuery Storage Read API and returns results as Arrow RecordBatches.

```yaml
- gcp_bigquery_storage_read:
    name: read_accounts
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    selected_fields:
      - id
      - name
      - industry
    row_restriction: "industry = 'Technology'"
```

### Read fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Task name. |
| `credentials_path` | string | required | GCP service account credentials. |
| `project_id` | string | required | GCP project ID. |
| `dataset_id` | string | required | BigQuery dataset. |
| `table_id` | string | required | BigQuery table. |
| `selected_fields` | list | — | Columns to read (all if omitted). |
| `row_restriction` | string | — | WHERE clause for filtering rows. |
| `sample_percentage` | float | — | Random sampling percentage. |
| `snapshot_time` | string | — | Time-travel query timestamp (RFC 3339). |
| `max_stream_count` | int | — | Max parallel read streams. |
| `data_format` | string | `arrow` | Result format: `arrow` or `avro`. |
| `depends_on` | list | — | Upstream task names. |
| `retry` | object | — | Retry configuration. |
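The optional read fields compose. A hedged sketch (the task name, timestamp, and values below are illustrative, not from the source) that randomly samples 10% of rows as the table stood at a past snapshot, caps parallelism, and returns Avro instead of Arrow:

```yaml
- gcp_bigquery_storage_read:
    name: read_accounts_snapshot    # illustrative name
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    sample_percentage: 10.0                  # read roughly 10% of rows at random
    snapshot_time: "2024-01-01T00:00:00Z"    # time-travel to this RFC 3339 timestamp
    max_stream_count: 4                      # limit parallel read streams
    data_format: avro                        # return Avro rather than the default arrow
```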

## Storage Write

Streams data into BigQuery tables via the BigQuery Storage Write API. Accepts Arrow RecordBatches as input.

```yaml
- gcp_bigquery_storage_write:
    name: write_accounts
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
```

### Write fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Task name. |
| `credentials_path` | string | required | GCP service account credentials. |
| `project_id` | string | required | GCP project ID. |
| `dataset_id` | string | required | BigQuery dataset. |
| `table_id` | string | required | BigQuery table. |
| `change_type` | string | — | CDC change type: `upsert` or `delete`. |
| `depends_on` | list | — | Upstream task names. |
| `retry` | object | — | Retry configuration. |
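For CDC-style loads, `change_type` and `depends_on` typically appear together: the write task consumes the RecordBatches produced by an upstream read task and applies them as changes. A hedged sketch (both task names are illustrative, not from the source):

```yaml
- gcp_bigquery_storage_write:
    name: upsert_accounts       # illustrative name
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    change_type: upsert         # apply incoming rows as CDC upserts
    depends_on:
      - read_accounts           # upstream task supplying the RecordBatches
```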