BigQuery

BigqueryUtility

class gwrappy.bigquery.BigqueryUtility(**kwargs)[source]

Initializes an object for interacting with the BigQuery API.

By default, Application Default Credentials are used.
If the gcloud SDK isn't installed, credential files have to be specified using the kwargs json_credentials_path and client_id.
Parameters:
  • max_retries (integer) – Argument specified with each API call to natively handle retryable errors.
  • client_secret_path – File path for client secret JSON file. Only required if credentials are invalid or unavailable.
  • json_credentials_path – File path for automatically generated credentials.
  • client_id – Credentials are stored as a key-value pair per client_id to facilitate multiple clients using the same credentials file. For simplicity, using one’s email address is sufficient.
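
A minimal initialization sketch; the file paths and client_id value are hypothetical:

    from gwrappy.bigquery import BigqueryUtility

    # Application Default Credentials (gcloud SDK installed)
    bq = BigqueryUtility(max_retries=3)

    # explicit credential files (hypothetical paths)
    bq = BigqueryUtility(
        max_retries=3,
        client_secret_path='secrets/client_secret.json',
        json_credentials_path='secrets/credentials.json',
        client_id='me@example.com'
    )
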
list_projects(max_results=None, filter_exp=None)[source]

Abstraction of projects().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/projects/list]

Parameters:
  • max_results (integer) – If None, all results are iterated over and returned.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing project resources.
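
A short sketch of filtering listed projects (see the filter_exp parameter above); friendlyName is a field of the API's project resource:

    # keep only projects whose friendly name starts with 'prod-'
    projects = bq.list_projects(
        max_results=50,
        filter_exp=lambda p: p.get('friendlyName', '').startswith('prod-')
    )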

list_jobs(project_id, state_filter=None, show_all=False, projection='full', max_results=None, earliest_date=None, filter_exp=None)[source]

Abstraction of jobs().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/list]

Note: BigQuery retains all job history, so set max_results or earliest_date to limit the amount of data returned.

Parameters:
  • project_id (string) – Unique project identifier.
  • state_filter (string) – Pre-filter API request for job state. Acceptable values are “done”, “pending” and “running”. [Equivalent API param: stateFilter]
  • show_all (boolean) – Whether to display jobs owned by all users in the project. [Equivalent API param: allUsers]
  • max_results (integer) – If None, all results are iterated over and returned.
  • projection (string) – Acceptable values are ‘full’, ‘minimal’. ‘full’ includes job configuration.
  • earliest_date (datetime object, or string representation of a datetime in %Y-%m-%d format) – Only returns jobs after this date.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing job resources.
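
A sketch of a bounded job listing; the project id is hypothetical, and status.errorResult is a field of the API's job resource:

    jobs = bq.list_jobs(
        'my-project',
        state_filter='done',
        earliest_date='2016-01-01',  # or a datetime object
        filter_exp=lambda j: j['status'].get('errorResult') is None
    )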

list_datasets(project_id, show_all=False, max_results=None, filter_exp=None)[source]

Abstraction of datasets().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/datasets/list]

Parameters:
  • project_id (string) – Unique project identifier.
  • show_all (boolean) – Include hidden datasets generated when running queries in the UI.
  • max_results (integer) – If None, all results are iterated over and returned.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing dataset resources.

list_tables(project_id, dataset_id, max_results=None, filter_exp=None)[source]

Abstraction of tables().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/tables/list]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • max_results (integer) – If None, all results are iterated over and returned.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing table resources.
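
A sketch chaining list_datasets() and list_tables() to enumerate every table in a project; identifiers are hypothetical, and datasetReference and id are fields of the API's dataset and table resources:

    for dataset in bq.list_datasets('my-project'):
        dataset_id = dataset['datasetReference']['datasetId']
        for table in bq.list_tables('my-project', dataset_id):
            print(table['id'])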

get_job(project_id, job_id)[source]

Abstraction of jobs().get() method. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/get]

Parameters:
  • project_id (string) – Unique project identifier.
  • job_id (string) – Unique job identifier.
Returns:

Dictionary object representing job resource.

get_table_info(project_id, dataset_id, table_id)[source]

Abstraction of tables().get() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/get]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • table_id (string) – Unique table identifier.
Returns:

Dictionary object representing table resource.

delete_table(project_id, dataset_id, table_id)[source]

Abstraction of tables().delete() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/delete]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • table_id (string) – Unique table identifier.
Raises:

AssertionError if unsuccessful; a successful delete returns an empty response body.
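
A sketch pairing get_table_info() with delete_table(); identifiers are hypothetical, and numRows is a field of the API's table resource:

    info = bq.get_table_info('my-project', 'scratch', 'temp_table')
    print(info['numRows'])

    # raises AssertionError if the delete is unsuccessful
    bq.delete_table('my-project', 'scratch', 'temp_table')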

poll_job_status(job_resp, sleep_time=1)[source]

Checks a job's status until it is “DONE”.

Parameters:
  • job_resp (dictionary) – Representation of job resource.
  • sleep_time (integer) – Time to pause (seconds) between polls.
Returns:

Dictionary object representing job resource’s final state.

Raises:

JobError object if an error is discovered after job finishes running.
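
A sketch of re-polling a previously submitted job via get_job(); the job id is hypothetical:

    job_resp = bq.get_job('my-project', 'job_abc123')
    final_resp = bq.poll_job_status(job_resp, sleep_time=2)  # raises JobError on failure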

sync_query(project_id, query, return_type='list', sleep_time=1, dry_run=False, **kwargs)[source]

Abstraction of jobs().query() method, iterating and parsing query results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/query]

Parameters:
  • project_id (string) – Unique project identifier.
  • query (string) – SQL query.
  • return_type (string) – Format for result to be returned. Accepted types are “list”, “dataframe”, and “json”.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • dry_run (boolean) – If True, returns basic statistics about the query without actually running it. Useful for testing or estimating the amount of data processed.
  • useLegacySql – Toggle between Legacy and Standard SQL.
Returns:

If not dry_run: result in the specified type, plus a JobResponse object. If dry_run: dictionary object representing expected query statistics.

Raises:

JobError object if an error is discovered after job finishes running.
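
A sketch of a dry run followed by a real run, per the documented return of a result plus a JobResponse object; totalBytesProcessed is assumed to be surfaced from the API's query response:

    query = 'SELECT word, word_count FROM [publicdata:samples.shakespeare] LIMIT 10'

    # estimate data processed without running the query
    stats = bq.sync_query('my-project', query, dry_run=True)
    print(stats['totalBytesProcessed'])

    # run it, parsing results into a dataframe
    df, job_resp = bq.sync_query(
        'my-project', query,
        return_type='dataframe',
        useLegacySql=True
    )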

async_query(project_id, query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, return_type='list', sleep_time=1, **kwargs)[source]

Abstraction of jobs().insert() method for query job, iterating and parsing query results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Asynchronous queries always write to an intermediate (destination) table.

This query method is preferable to sync_query if:
1. Large results are returned.
2. UDFs are required.
3. The returned results also need to be stored in a table.
Parameters:
  • project_id (string) – Unique project identifier.
  • query (string) – SQL query.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • udf (string or list) – One or more UDFs, if required by the query.
  • return_type (string) – Format for result to be returned. Accepted types are “list”, “dataframe”, and “json”.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • useLegacySql – Toggle between Legacy and Standard SQL.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns:

Result in the specified type, plus a JobResponse object.

Raises:

JobError object if an error is discovered after job finishes running.
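
A sketch of an asynchronous query whose results are also persisted to a destination table; identifiers are hypothetical:

    results, job_resp = bq.async_query(
        'my-project',
        'SELECT word, SUM(word_count) AS n FROM [publicdata:samples.shakespeare] GROUP BY word',
        dest_project_id='my-project',
        dest_dataset_id='scratch',
        dest_table_id='word_counts',
        return_type='json',
        useLegacySql=True,
        writeDisposition='WRITE_TRUNCATE'
    )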

write_table(project_id, query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, wait_finish=True, sleep_time=1, **kwargs)[source]

Abstraction of jobs().insert() method for query job, without returning results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • project_id (string) – Unique project identifier.
  • query (string) – SQL query.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • udf (string or list) – One or more UDFs, if required by the query.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • useLegacySql – Toggle between Legacy and Standard SQL.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns:

If wait_finish: JobResponse object representing the job's final state. If not wait_finish: JobResponse object for the submitted job.

Raises:

If wait_finish: JobError object if an error is discovered after job finishes running.
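
A sketch of batching several write jobs with wait_finish=False, then polling them together (see poll_resp_list() below); identifiers are hypothetical:

    responses = []
    for month in ('201601', '201602', '201603'):
        resp = bq.write_table(
            'my-project',
            'SELECT * FROM [my-project:sales.orders_%s]' % month,
            dest_project_id='my-project',
            dest_dataset_id='reporting',
            dest_table_id='orders_%s' % month,
            wait_finish=False,
            useLegacySql=True,
            writeDisposition='WRITE_TRUNCATE'
        )
        responses.append(resp)

    final_states = bq.poll_resp_list(responses)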

write_view(query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, overwrite_existing=True, **kwargs)[source]

A view is analogous to a virtual table: it functions as a table, but results are computed from the underlying query each time the view is referenced.

Parameters:
  • query (string) – SQL query.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • udf (string or list) – One or more UDFs, if required by the query.
  • overwrite_existing – Safety flag; an HttpError is raised if the table already exists and overwrite_existing is False.
  • useLegacySql – Toggle between Legacy and Standard SQL.
Returns:

TableResponse object for the newly inserted table
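
A sketch of creating a view; identifiers are hypothetical:

    view_resp = bq.write_view(
        'SELECT user_id, SUM(amount) AS total FROM [my-project:sales.orders] GROUP BY user_id',
        dest_project_id='my-project',
        dest_dataset_id='reporting',
        dest_table_id='user_totals_v',
        overwrite_existing=False,  # raise instead of replacing an existing view
        useLegacySql=True
    )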

load_from_gcs(dest_project_id, dest_dataset_id, dest_table_id, schema, source_uris, wait_finish=True, sleep_time=1, **kwargs)[source]

For loading data from Google Cloud Storage.

Abstraction of jobs().insert() method for load job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
  • source_uris (string or list) – One or more URIs referencing GCS objects.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • writeDisposition – Determines table writing behaviour.
  • sourceFormat – Indicates format of input data.
  • skipLeadingRows – Leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
  • fieldDelimiter – Indicates field delimiter.
  • allowQuotedNewlines – Indicates presence of quoted newlines in fields.
  • allowJaggedRows – Accept rows that are missing trailing optional columns. (Only CSV)
  • ignoreUnknownValues – Allow extra values that are not represented in the table schema.
  • maxBadRecords – Maximum number of bad records that BigQuery can ignore when running the job.
Returns:

JobResponse object
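
A sketch of loading CSV files from a GCS bucket; the bucket, identifiers and schema are hypothetical, with schema entries following the API's schema.fields[] format:

    schema = [
        {'name': 'user_id', 'type': 'STRING', 'mode': 'REQUIRED'},
        {'name': 'amount', 'type': 'FLOAT', 'mode': 'NULLABLE'},
    ]

    job_resp = bq.load_from_gcs(
        dest_project_id='my-project',
        dest_dataset_id='sales',
        dest_table_id='orders',
        schema=schema,
        source_uris='gs://my-bucket/orders/*.csv',
        sourceFormat='CSV',
        skipLeadingRows=1,
        writeDisposition='WRITE_APPEND',
        maxBadRecords=10
    )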

export_to_gcs(source_project_id, source_dataset_id, source_table_id, dest_uris, wait_finish=True, sleep_time=1, **kwargs)[source]

For exporting data into Google Cloud Storage.

Abstraction of jobs().insert() method for extract job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • source_project_id (string) – Unique project identifier of source table.
  • source_dataset_id (string) – Unique dataset identifier of source table.
  • source_table_id (string) – Unique table identifier of source table.
  • dest_uris (string or list) – One or more URIs referencing GCS objects.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • destinationFormat – (Optional) Config kwarg that indicates format of output data.
  • compression – (Optional) Config kwarg for type of compression applied.
  • fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
  • printHeader – (Optional) Config kwarg indicating if table headers should be written.
Returns:

JobResponse object
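
A sketch of exporting a table to compressed CSV shards; the bucket and identifiers are hypothetical:

    job_resp = bq.export_to_gcs(
        source_project_id='my-project',
        source_dataset_id='sales',
        source_table_id='orders',
        dest_uris='gs://my-bucket/exports/orders-*.csv.gz',  # wildcard shards large exports
        destinationFormat='CSV',
        compression='GZIP',
        printHeader=True
    )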

copy_table(source_data, dest_project_id, dest_dataset_id, dest_table_id, wait_finish=True, sleep_time=1, **kwargs)[source]

For copying existing table(s) to a new or existing table.

Abstraction of jobs().insert() method for copy job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • source_data (dictionary or list of dictionaries) – Representations of single or multiple existing tables to copy from.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns:

JobResponse object
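
A sketch of consolidating two tables into one; identifiers are hypothetical, and the source_data key names are assumed to mirror the API's sourceTables[] representation:

    sources = [
        {'projectId': 'my-project', 'datasetId': 'sales', 'tableId': 'orders_2015'},
        {'projectId': 'my-project', 'datasetId': 'sales', 'tableId': 'orders_2016'},
    ]

    job_resp = bq.copy_table(
        sources,
        dest_project_id='my-project',
        dest_dataset_id='sales',
        dest_table_id='orders_all',
        writeDisposition='WRITE_TRUNCATE'
    )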

load_from_string(dest_project_id, dest_dataset_id, dest_table_id, schema, load_string, wait_finish=True, sleep_time=1, **kwargs)[source]

For loading data from the string representation of a file/object.

Can be used in conjunction with gwrappy.bigquery.utils.file_to_string().

Parameters:
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
  • load_string (string) – String representation of an object.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
  • sourceFormat – (Optional) Config kwarg that indicates format of input data.
  • skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
  • fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
  • allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
  • allowJaggedRows – Accept rows that are missing trailing optional columns. (Only CSV)
  • ignoreUnknownValues – Allow extra values that are not represented in the table schema.
  • maxBadRecords – Maximum number of bad records that BigQuery can ignore when running the job.
Returns:

JobResponse object
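
A sketch pairing file_to_string() with load_from_string(); the file path, identifiers and schema are hypothetical:

    from gwrappy.bigquery.utils import file_to_string

    load_string = file_to_string('data/orders.csv', source_format='csv')

    job_resp = bq.load_from_string(
        dest_project_id='my-project',
        dest_dataset_id='sales',
        dest_table_id='orders',
        schema=schema,  # schema.fields[] list, as in load_from_gcs above
        load_string=load_string,
        sourceFormat='CSV',
        writeDisposition='WRITE_APPEND'
    )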

write_federated_table(dest_project_id, dest_dataset_id, dest_table_id, schema, source_uris, overwrite_existing=True, **kwargs)[source]

Analogous to a view over Google Cloud Storage object(s): the table reads directly from GCS each time it is queried.

Abstraction of jobs().insert() method for load job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
  • source_uris (string or list) – One or more URIs referencing GCS objects.
  • overwrite_existing – Safety flag; an HttpError is raised if the table already exists and overwrite_existing is False.
  • sourceFormat – (Optional) Config kwarg that indicates format of input data.
  • skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
  • fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
  • compression – (Optional) Config kwarg for type of compression applied.
  • allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
Returns:

TableResponse object
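
A sketch of defining a federated table over CSV objects in GCS; the bucket and identifiers are hypothetical:

    table_resp = bq.write_federated_table(
        dest_project_id='my-project',
        dest_dataset_id='staging',
        dest_table_id='orders_external',
        schema=schema,  # schema.fields[] list, as in load_from_gcs above
        source_uris='gs://my-bucket/orders/*.csv',
        sourceFormat='CSV',
        skipLeadingRows=1
    )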

update_table_info(project_id, dataset_id, table_id, table_description=None, schema=None)[source]

Abstraction of tables().patch() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/patch]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • table_id (string) – Unique table identifier.
  • table_description (string) – Optional description for the table. If None, the existing description is not overwritten.
  • schema (list of dictionaries) – Schema fields to change (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
Returns:

TableResponse object
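
A sketch of patching a table's description and field metadata; identifiers are hypothetical:

    table_resp = bq.update_table_info(
        'my-project', 'sales', 'orders',
        table_description='Raw order data, loaded nightly from GCS',
        schema=[
            {'name': 'user_id', 'type': 'STRING', 'description': 'internal user key'},
            {'name': 'amount', 'type': 'FLOAT'},
        ]
    )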

poll_resp_list(response_list, sleep_time=1)[source]

Convenience function for iterating over and polling a list of responses collected from jobs submitted with wait_finish=False.

If any job fails, its respective Error object is returned in its place, so that a single failure does not break polling of subsequent responses.

Parameters:
  • response_list (list of dicts or JobResponse objects) – List of response objects
  • sleep_time (integer) – Time to pause (seconds) between polls.
Returns:

List of JobResponse or Error (JobError/HttpError) objects representing job resource’s final state.
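
Continuing the write_table() sketch above, the stored responses can be polled in one pass:

    final_states = bq.poll_resp_list(responses, sleep_time=5)
    for state in final_states:
        print(state)  # JobResponse if successful, JobError/HttpError if not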

Misc Classes/Functions

class gwrappy.bigquery.utils.JobResponse(resp, description=None)[source]

Wrapper for BigQuery job responses, mainly for calculating/parsing job statistics into human-readable formats for logging.

Parameters:
  • resp (dictionary) – Dictionary representation of a job resource.
  • description – Optional string descriptor for specific function of job.
class gwrappy.bigquery.utils.TableResponse(resp, description=None)[source]

Wrapper for BigQuery table resources, mainly for calculating/parsing table statistics into human-readable formats for logging.

Parameters:
  • resp (dictionary) – Dictionary representation of a table resource.
  • description – Optional string descriptor for table.
gwrappy.bigquery.utils.read_sql(read_path, **kwargs)[source]

Reads a text file, performing string substitution with the str.format() method if necessary.

Parameters:
  • read_path – File path containing SQL query.
  • kwargs – Key-Value pairs referencing {key} within query for substitution.
Returns:

Query string.
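
A sketch with a hypothetical template file containing SELECT * FROM [my-project:sales.orders] WHERE date = '{ds}':

    from gwrappy.bigquery.utils import read_sql

    query = read_sql('sql/daily_orders.sql', ds='2016-01-01')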

gwrappy.bigquery.utils.bq_schema_from_df(input_df)[source]

Derives a BigQuery schema from a Pandas DataFrame object.

Parameters:
  • input_df – Pandas DataFrame object.
Returns:

List of dictionaries which can be fed directly as BigQuery schemas.
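
A sketch of deriving a schema from a small DataFrame; the output format shown in the comment is illustrative:

    import pandas as pd
    from gwrappy.bigquery.utils import bq_schema_from_df

    df = pd.DataFrame({'user_id': ['a', 'b'], 'amount': [1.5, 2.0]})
    schema = bq_schema_from_df(df)  # e.g. [{'name': 'user_id', 'type': 'STRING'}, ...]
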
gwrappy.bigquery.utils.file_to_string(f, source_format='csv')[source]

Converts an object to its string representation, specifically for BigqueryUtility().load_from_string().

Parameters:
  • f (file path, list of lists/dicts, dataframe, or string representation of json list) – Object to convert to string.
  • source_format (string) – Indicates format of input data. Accepted values are “csv” and “json”.
Returns:

String representation of the object/file contents.