# BigQuery Storage API

Read from and write to BigQuery tables using the BigQuery Storage API, which offers higher throughput than the standard query API for large datasets.

## Storage Read

Reads table data directly via the BigQuery Storage Read API and returns results as Arrow RecordBatches.

```yaml
- gcp_bigquery_storage_read:
    name: read_accounts
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    selected_fields:
      - id
      - name
      - industry
    row_restriction: "industry = 'Technology'"
```

### Read fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Task name. |
| `credentials_path` | string | required | GCP service account credentials. |
| `project_id` | string | required | GCP project ID. |
| `dataset_id` | string | required | BigQuery dataset. |
| `table_id` | string | required | BigQuery table. |
| `selected_fields` | list | — | Columns to read (all if omitted). |
| `row_restriction` | string | — | WHERE clause for filtering rows. |
| `sample_percentage` | float | — | Random sampling percentage. |
| `snapshot_time` | string | — | Time-travel query timestamp (RFC 3339). |
| `max_stream_count` | int | — | Max parallel read streams. |
| `data_format` | string | `arrow` | Result format: `arrow` or `avro`. |
| `depends_on` | list | — | Upstream task names. |
| `retry` | object | — | Retry configuration. |
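The optional read fields compose. A hedged sketch (the task name, timestamp, and values below are illustrative, not from the source) that randomly samples 10% of rows as the table stood at a past snapshot, caps parallelism, and returns Avro instead of Arrow:

```yaml
- gcp_bigquery_storage_read:
    name: read_accounts_snapshot    # illustrative name
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    sample_percentage: 10.0                  # read roughly 10% of rows at random
    snapshot_time: "2024-01-01T00:00:00Z"    # time-travel to this RFC 3339 timestamp
    max_stream_count: 4                      # limit parallel read streams
    data_format: avro                        # return Avro rather than the default arrow
```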

## Storage Write

Streams data into BigQuery tables via the BigQuery Storage Write API. Accepts Arrow RecordBatches as input.

```yaml
- gcp_bigquery_storage_write:
    name: write_accounts
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
```

### Write fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | required | Task name. |
| `credentials_path` | string | required | GCP service account credentials. |
| `project_id` | string | required | GCP project ID. |
| `dataset_id` | string | required | BigQuery dataset. |
| `table_id` | string | required | BigQuery table. |
| `change_type` | string | — | CDC change type: `upsert` or `delete`. |
| `depends_on` | list | — | Upstream task names. |
| `retry` | object | — | Retry configuration. |
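For CDC-style loads, `change_type` and `depends_on` typically appear together: the write task consumes the RecordBatches produced by an upstream read task and applies them as changes. A hedged sketch (both task names are illustrative, not from the source):

```yaml
- gcp_bigquery_storage_write:
    name: upsert_accounts       # illustrative name
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    change_type: upsert         # apply incoming rows as CDC upserts
    depends_on:
      - read_accounts           # upstream task supplying the RecordBatches
```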