BigQuery

BigqueryUtility

class gwrappy.bigquery.BigqueryUtility(**kwargs)[source]

Initializes an object for interacting with the BigQuery API.

By default, Application Default Credentials are used.
If the gcloud SDK isn't installed, credential files have to be specified using the kwargs json_credentials_path and client_id.
Parameters:
  • max_retries (integer) – Argument specified with each API call to natively handle retryable errors.
  • client_secret_path – File path for client secret JSON file. Only required if credentials are invalid or unavailable.
  • json_credentials_path – File path for automatically generated credentials.
  • client_id – Credentials are stored as a key-value pair per client_id to facilitate multiple clients using the same credentials file. For simplicity, using one’s email address is sufficient.
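
A minimal initialization sketch; the file paths and client_id value are hypothetical:

    from gwrappy.bigquery import BigqueryUtility

    # Application Default Credentials (gcloud SDK installed)
    bq = BigqueryUtility(max_retries=3)

    # explicit credential files (hypothetical paths)
    bq = BigqueryUtility(
        max_retries=3,
        client_secret_path='secrets/client_secret.json',
        json_credentials_path='secrets/credentials.json',
        client_id='me@example.com'
    )
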
list_projects(max_results=None, filter_exp=None)[source]

Abstraction of projects().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/projects/list]

Parameters:
  • max_results (integer) – If None, all results are iterated over and returned.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing project resources.
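
A short sketch of filtering listed projects (see the filter_exp parameter above); friendlyName is a field of the API's project resource:

    # keep only projects whose friendly name starts with 'prod-'
    projects = bq.list_projects(
        max_results=50,
        filter_exp=lambda p: p.get('friendlyName', '').startswith('prod-')
    )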

list_jobs(project_id, state_filter=None, show_all=False, projection='full', max_results=None, earliest_date=None, filter_exp=None)[source]

Abstraction of jobs().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/list]

Note: BigQuery retains all job history, so set max_results or earliest_date to limit the amount of data returned.

Parameters:
  • project_id (string) – Unique project identifier.
  • state_filter (string) – Pre-filter API request for job state. Acceptable values are “done”, “pending” and “running”. [Equivalent API param: stateFilter]
  • show_all (boolean) – Whether to display jobs owned by all users in the project. [Equivalent API param: allUsers]
  • max_results (integer) – If None, all results are iterated over and returned.
  • projection (string) – Acceptable values are ‘full’, ‘minimal’. ‘full’ includes job configuration.
  • earliest_date (datetime object, or string representation of a datetime in %Y-%m-%d format) – Only returns jobs after this date.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing job resources.
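
A sketch of a bounded job listing; the project id is hypothetical, and status.errorResult is a field of the API's job resource:

    jobs = bq.list_jobs(
        'my-project',
        state_filter='done',
        earliest_date='2016-01-01',  # or a datetime object
        filter_exp=lambda j: j['status'].get('errorResult') is None
    )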

list_datasets(project_id, show_all=False, max_results=None, filter_exp=None)[source]

Abstraction of datasets().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/datasets/list]

Parameters:
  • project_id (string) – Unique project identifier.
  • show_all (boolean) – Include hidden datasets generated when running queries in the UI.
  • max_results (integer) – If None, all results are iterated over and returned.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing dataset resources.

list_tables(project_id, dataset_id, max_results=None, filter_exp=None)[source]

Abstraction of tables().list() method with inbuilt iteration functionality. [https://cloud.google.com/bigquery/docs/reference/v2/tables/list]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • max_results (integer) – If None, all results are iterated over and returned.
  • filter_exp (function) – Function applied to each entry; only entries for which filter_exp evaluates to True are returned.
Returns:

List of dictionary objects representing table resources.
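
A sketch chaining list_datasets() and list_tables() to enumerate every table in a project; identifiers are hypothetical, and datasetReference and id are fields of the API's dataset and table resources:

    for dataset in bq.list_datasets('my-project'):
        dataset_id = dataset['datasetReference']['datasetId']
        for table in bq.list_tables('my-project', dataset_id):
            print(table['id'])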

get_job(project_id, job_id)[source]

Abstraction of jobs().get() method. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/get]

Parameters:
  • project_id (string) – Unique project identifier.
  • job_id (string) – Unique job identifier.
Returns:

Dictionary object representing job resource.

get_table_info(project_id, dataset_id, table_id)[source]

Abstraction of tables().get() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/get]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • table_id (string) – Unique table identifier.
Returns:

Dictionary object representing table resource.

delete_table(project_id, dataset_id, table_id)[source]

Abstraction of tables().delete() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/delete]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • table_id (string) – Unique table identifier.
Raises:

AssertionError if unsuccessful; a successful delete returns an empty response body.
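
A sketch pairing get_table_info() with delete_table(); identifiers are hypothetical, and numRows is a field of the API's table resource:

    info = bq.get_table_info('my-project', 'scratch', 'temp_table')
    print(info['numRows'])

    # raises AssertionError if the delete is unsuccessful
    bq.delete_table('my-project', 'scratch', 'temp_table')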

poll_job_status(job_resp, sleep_time=1)[source]

Checks a job's status until it is “DONE”.

Parameters:
  • job_resp (dictionary) – Representation of job resource.
  • sleep_time (integer) – Time to pause (seconds) between polls.
Returns:

Dictionary object representing job resource’s final state.

Raises:

JobError object if an error is discovered after job finishes running.
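
A sketch of re-polling a previously submitted job via get_job(); the job id is hypothetical:

    job_resp = bq.get_job('my-project', 'job_abc123')
    final_resp = bq.poll_job_status(job_resp, sleep_time=2)  # raises JobError on failure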

sync_query(project_id, query, return_type='list', sleep_time=1, dry_run=False, **kwargs)[source]

Abstraction of jobs().query() method, iterating and parsing query results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/query]

Parameters:
  • project_id (string) – Unique project identifier.
  • query (string) – SQL query.
  • return_type (string) – Format for result to be returned. Accepted types are “list”, “dataframe”, and “json”.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • dry_run (boolean) – If True, returns basic statistics about the query without actually running it. Useful for testing or estimating the amount of data processed.
  • useLegacySql – Toggle between Legacy and Standard SQL.
Returns:

If not dry_run: result in the specified type, plus a JobResponse object. If dry_run: dictionary object representing expected query statistics.

Raises:

JobError object if an error is discovered after job finishes running.
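
A sketch of a dry run followed by a real run, per the documented return of a result plus a JobResponse object; totalBytesProcessed is assumed to be surfaced from the API's query response:

    query = 'SELECT word, word_count FROM [publicdata:samples.shakespeare] LIMIT 10'

    # estimate data processed without running the query
    stats = bq.sync_query('my-project', query, dry_run=True)
    print(stats['totalBytesProcessed'])

    # run it, parsing results into a dataframe
    df, job_resp = bq.sync_query(
        'my-project', query,
        return_type='dataframe',
        useLegacySql=True
    )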

async_query(project_id, query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, return_type='list', sleep_time=1, **kwargs)[source]

Abstraction of jobs().insert() method for query job, iterating and parsing query results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Asynchronous queries always write to an intermediate (destination) table.

This query method is preferable to sync_query if:
1. Large results are returned.
2. UDFs are required.
3. The returned results also need to be stored in a table.
Parameters:
  • project_id (string) – Unique project identifier.
  • query (string) – SQL query.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • udf (string or list) – One or more UDFs, if required by the query.
  • return_type (string) – Format for result to be returned. Accepted types are “list”, “dataframe”, and “json”.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • useLegacySql – Toggle between Legacy and Standard SQL.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns:

Result in the specified type, plus a JobResponse object.

Raises:

JobError object if an error is discovered after job finishes running.
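
A sketch of an asynchronous query whose results are also persisted to a destination table; identifiers are hypothetical:

    results, job_resp = bq.async_query(
        'my-project',
        'SELECT word, SUM(word_count) AS n FROM [publicdata:samples.shakespeare] GROUP BY word',
        dest_project_id='my-project',
        dest_dataset_id='scratch',
        dest_table_id='word_counts',
        return_type='json',
        useLegacySql=True,
        writeDisposition='WRITE_TRUNCATE'
    )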

write_table(project_id, query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, wait_finish=True, sleep_time=1, **kwargs)[source]

Abstraction of jobs().insert() method for query job, without returning results. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • project_id (string) – Unique project identifier.
  • query (string) – SQL query.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • udf (string or list) – One or more UDFs, if required by the query.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • useLegacySql – Toggle between Legacy and Standard SQL.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns:

If wait_finish: JobResponse object representing the job's final state. If not wait_finish: JobResponse object for the submitted job.

Raises:

If wait_finish: JobError object if an error is discovered after job finishes running.
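
A sketch of batching several write jobs with wait_finish=False, then polling them together (see poll_resp_list() below); identifiers are hypothetical:

    responses = []
    for month in ('201601', '201602', '201603'):
        resp = bq.write_table(
            'my-project',
            'SELECT * FROM [my-project:sales.orders_%s]' % month,
            dest_project_id='my-project',
            dest_dataset_id='reporting',
            dest_table_id='orders_%s' % month,
            wait_finish=False,
            useLegacySql=True,
            writeDisposition='WRITE_TRUNCATE'
        )
        responses.append(resp)

    final_states = bq.poll_resp_list(responses)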

write_view(query, dest_project_id, dest_dataset_id, dest_table_id, udf=None, overwrite_existing=True, **kwargs)[source]

A view is analogous to a virtual table: it functions as a table, but results are computed from the underlying query each time the view is referenced.

Parameters:
  • query (string) – SQL query.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • udf (string or list) – One or more UDFs, if required by the query.
  • overwrite_existing – Safety flag; an HttpError is raised if the table already exists and overwrite_existing is False.
  • useLegacySql – Toggle between Legacy and Standard SQL.
Returns:

TableResponse object for the newly inserted table
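
A sketch of creating a view; identifiers are hypothetical:

    view_resp = bq.write_view(
        'SELECT user_id, SUM(amount) AS total FROM [my-project:sales.orders] GROUP BY user_id',
        dest_project_id='my-project',
        dest_dataset_id='reporting',
        dest_table_id='user_totals_v',
        overwrite_existing=False,  # raise instead of replacing an existing view
        useLegacySql=True
    )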

load_from_gcs(dest_project_id, dest_dataset_id, dest_table_id, schema, source_uris, wait_finish=True, sleep_time=1, **kwargs)[source]

For loading data from Google Cloud Storage.

Abstraction of jobs().insert() method for load job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
  • source_uris (string or list) – One or more URIs referencing GCS objects.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • writeDisposition – Determines table writing behaviour.
  • sourceFormat – Indicates format of input data.
  • skipLeadingRows – Leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
  • fieldDelimiter – Indicates field delimiter.
  • allowQuotedNewlines – Indicates presence of quoted newlines in fields.
  • allowJaggedRows – Accept rows that are missing trailing optional columns. (Only CSV)
  • ignoreUnknownValues – Allow extra values that are not represented in the table schema.
  • maxBadRecords – Maximum number of bad records that BigQuery can ignore when running the job.
Returns:

JobResponse object
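
A sketch of loading CSV files from a GCS bucket; the bucket, identifiers and schema are hypothetical, with schema entries following the API's schema.fields[] format:

    schema = [
        {'name': 'user_id', 'type': 'STRING', 'mode': 'REQUIRED'},
        {'name': 'amount', 'type': 'FLOAT', 'mode': 'NULLABLE'},
    ]

    job_resp = bq.load_from_gcs(
        dest_project_id='my-project',
        dest_dataset_id='sales',
        dest_table_id='orders',
        schema=schema,
        source_uris='gs://my-bucket/orders/*.csv',
        sourceFormat='CSV',
        skipLeadingRows=1,
        writeDisposition='WRITE_APPEND',
        maxBadRecords=10
    )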

export_to_gcs(source_project_id, source_dataset_id, source_table_id, dest_uris, wait_finish=True, sleep_time=1, **kwargs)[source]

For exporting data into Google Cloud Storage.

Abstraction of jobs().insert() method for extract job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • source_project_id (string) – Unique project identifier of source table.
  • source_dataset_id (string) – Unique dataset identifier of source table.
  • source_table_id (string) – Unique table identifier of source table.
  • dest_uris (string or list) – One or more URIs referencing GCS objects.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • destinationFormat – (Optional) Config kwarg that indicates format of output data.
  • compression – (Optional) Config kwarg for type of compression applied.
  • fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
  • printHeader – (Optional) Config kwarg indicating if table headers should be written.
Returns:

JobResponse object
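
A sketch of exporting a table to compressed CSV shards; the bucket and identifiers are hypothetical:

    job_resp = bq.export_to_gcs(
        source_project_id='my-project',
        source_dataset_id='sales',
        source_table_id='orders',
        dest_uris='gs://my-bucket/exports/orders-*.csv.gz',  # wildcard shards large exports
        destinationFormat='CSV',
        compression='GZIP',
        printHeader=True
    )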

copy_table(source_data, dest_project_id, dest_dataset_id, dest_table_id, wait_finish=True, sleep_time=1, **kwargs)[source]

For copying existing table(s) to a new or existing table.

Abstraction of jobs().insert() method for copy job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • source_data (dictionary or list of dictionaries) – Representations of single or multiple existing tables to copy from.
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
Returns:

JobResponse object
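
A sketch of consolidating two tables into one; identifiers are hypothetical, and the source_data key names are assumed to mirror the API's sourceTables[] representation:

    sources = [
        {'projectId': 'my-project', 'datasetId': 'sales', 'tableId': 'orders_2015'},
        {'projectId': 'my-project', 'datasetId': 'sales', 'tableId': 'orders_2016'},
    ]

    job_resp = bq.copy_table(
        sources,
        dest_project_id='my-project',
        dest_dataset_id='sales',
        dest_table_id='orders_all',
        writeDisposition='WRITE_TRUNCATE'
    )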

load_from_string(dest_project_id, dest_dataset_id, dest_table_id, schema, load_string, wait_finish=True, sleep_time=1, **kwargs)[source]

For loading data from the string representation of a file/object.

Can be used in conjunction with gwrappy.bigquery.utils.file_to_string().

Parameters:
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
  • load_string (string) – String representation of an object.
  • wait_finish (boolean) – Flag whether to poll the job till completion. If set to False, multiple jobs can be submitted and their responses stored, iterated over and polled till completion afterwards.
  • sleep_time (integer) – Time to pause (seconds) between polls.
  • writeDisposition – (Optional) Config kwarg that determines table writing behaviour.
  • sourceFormat – (Optional) Config kwarg that indicates format of input data.
  • skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
  • fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
  • allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
  • allowJaggedRows – Accept rows that are missing trailing optional columns. (Only CSV)
  • ignoreUnknownValues – Allow extra values that are not represented in the table schema.
  • maxBadRecords – Maximum number of bad records that BigQuery can ignore when running the job.
Returns:

JobResponse object
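
A sketch pairing file_to_string() with load_from_string(); the file path, identifiers and schema are hypothetical:

    from gwrappy.bigquery.utils import file_to_string

    load_string = file_to_string('data/orders.csv', source_format='csv')

    job_resp = bq.load_from_string(
        dest_project_id='my-project',
        dest_dataset_id='sales',
        dest_table_id='orders',
        schema=schema,  # schema.fields[] list, as in load_from_gcs above
        load_string=load_string,
        sourceFormat='CSV',
        writeDisposition='WRITE_APPEND'
    )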

write_federated_table(dest_project_id, dest_dataset_id, dest_table_id, schema, source_uris, overwrite_existing=True, **kwargs)[source]

Analogous to a view over Google Cloud Storage object(s): the table reads directly from GCS each time it is queried.

Abstraction of jobs().insert() method for load job. [https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert]

Parameters:
  • dest_project_id (string) – Unique project identifier of destination table.
  • dest_dataset_id (string) – Unique dataset identifier of destination table.
  • dest_table_id (string) – Unique table identifier of destination table.
  • schema (list of dictionaries) – Schema of input data (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
  • source_uris (string or list) – One or more URIs referencing GCS objects.
  • overwrite_existing – Safety flag; an HttpError is raised if the table already exists and overwrite_existing is False.
  • sourceFormat – (Optional) Config kwarg that indicates format of input data.
  • skipLeadingRows – (Optional) Config kwarg for leading rows to skip. Defaults to 1 to account for headers if sourceFormat is CSV or default, 0 otherwise.
  • fieldDelimiter – (Optional) Config kwarg that indicates field delimiter.
  • compression – (Optional) Config kwarg for type of compression applied.
  • allowQuotedNewlines – (Optional) Config kwarg indicating presence of quoted newlines in fields.
Returns:

TableResponse object
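
A sketch of defining a federated table over CSV objects in GCS; the bucket and identifiers are hypothetical:

    table_resp = bq.write_federated_table(
        dest_project_id='my-project',
        dest_dataset_id='staging',
        dest_table_id='orders_external',
        schema=schema,  # schema.fields[] list, as in load_from_gcs above
        source_uris='gs://my-bucket/orders/*.csv',
        sourceFormat='CSV',
        skipLeadingRows=1
    )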

update_table_info(project_id, dataset_id, table_id, table_description=None, schema=None)[source]

Abstraction of tables().patch() method. [https://cloud.google.com/bigquery/docs/reference/v2/tables/patch]

Parameters:
  • project_id (string) – Unique project identifier.
  • dataset_id (string) – Unique dataset identifier.
  • table_id (string) – Unique table identifier.
  • table_description (string) – Optional description for the table. If None, the existing description is not overwritten.
  • schema (list of dictionaries) – Schema fields to change (schema.fields[]) [https://cloud.google.com/bigquery/docs/reference/v2/tables]
Returns:

TableResponse object
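
A sketch of patching a table's description and field metadata; identifiers are hypothetical:

    table_resp = bq.update_table_info(
        'my-project', 'sales', 'orders',
        table_description='Raw order data, loaded nightly from GCS',
        schema=[
            {'name': 'user_id', 'type': 'STRING', 'description': 'internal user key'},
            {'name': 'amount', 'type': 'FLOAT'},
        ]
    )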

poll_resp_list(response_list, sleep_time=1)[source]

Convenience function for iterating over and polling a list of responses collected from jobs submitted with wait_finish=False.

If any job fails, its respective Error object is returned in its place, so that a single failure does not break polling of subsequent responses.

Parameters:
  • response_list (list of dicts or JobResponse objects) – List of response objects
  • sleep_time (integer) – Time to pause (seconds) between polls.
Returns:

List of JobResponse or Error (JobError/HttpError) objects representing job resource’s final state.
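
Continuing the write_table() sketch above, the stored responses can be polled in one pass:

    final_states = bq.poll_resp_list(responses, sleep_time=5)
    for state in final_states:
        print(state)  # JobResponse if successful, JobError/HttpError if not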

Misc Classes/Functions

class gwrappy.bigquery.utils.JobResponse(resp, description=None)[source]

Wrapper for BigQuery job responses, mainly for calculating/parsing job statistics into human-readable formats for logging.

Parameters:
  • resp (dictionary) – Dictionary representation of a job resource.
  • description – Optional string descriptor for specific function of job.
class gwrappy.bigquery.utils.TableResponse(resp, description=None)[source]

Wrapper for BigQuery table resources, mainly for calculating/parsing table statistics into human-readable formats for logging.

Parameters:
  • resp (dictionary) – Dictionary representation of a table resource.
  • description – Optional string descriptor for table.
gwrappy.bigquery.utils.read_sql(read_path, **kwargs)[source]

Reads a text file, performing string substitution with the str.format() method if necessary.

Parameters:
  • read_path – File path containing SQL query.
  • kwargs – Key-Value pairs referencing {key} within query for substitution.
Returns:

Query string.
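
A sketch with a hypothetical template file containing SELECT * FROM [my-project:sales.orders] WHERE date = '{ds}':

    from gwrappy.bigquery.utils import read_sql

    query = read_sql('sql/daily_orders.sql', ds='2016-01-01')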

gwrappy.bigquery.utils.bq_schema_from_df(input_df)[source]

Derives a BigQuery schema from a Pandas DataFrame object.

Parameters:
  • input_df – Pandas DataFrame object.
Returns:

List of dictionaries which can be fed directly as BigQuery schemas.
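
A sketch of deriving a schema from a small DataFrame; the output format shown in the comment is illustrative:

    import pandas as pd
    from gwrappy.bigquery.utils import bq_schema_from_df

    df = pd.DataFrame({'user_id': ['a', 'b'], 'amount': [1.5, 2.0]})
    schema = bq_schema_from_df(df)  # e.g. [{'name': 'user_id', 'type': 'STRING'}, ...]
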
gwrappy.bigquery.utils.file_to_string(f, source_format='csv')[source]

Converts an object to its string representation, specifically for BigqueryUtility().load_from_string().

Parameters:
  • f (file path, list of lists/dicts, dataframe, or string representation of json list) – Object to convert to string.
  • source_format (string) – Indicates format of input data. Accepted values are “csv” and “json”.
Returns:

String representation of the object/file contents.