Dataproc

DataprocUtility

class gwrappy.dataproc.DataprocUtility(project_id, **kwargs)[source]

Initializes object for interacting with Dataproc API.

By default, Application Default Credentials are used.
If the gcloud SDK isn’t installed, credential files must be specified using the kwargs json_credentials_path and client_id.
Parameters:
  • project_id – Project ID linked to Dataproc.
  • client_secret_path – File path for client secret JSON file. Only required if credentials are invalid or unavailable.
  • json_credentials_path – File path for automatically generated credentials.
  • client_id – Credentials are stored as a key-value pair per client_id to facilitate multiple clients using the same credentials file. For simplicity, using one’s email address is sufficient.
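
A minimal usage sketch (the project ID, file paths, and email address below are placeholders):

    from gwrappy.dataproc import DataprocUtility

    # With Application Default Credentials (gcloud SDK installed)
    dataproc = DataprocUtility('my-project-id')

    # Without the gcloud SDK, pointing at stored credential files
    dataproc = DataprocUtility(
        'my-project-id',
        client_secret_path='client_secret.json',
        json_credentials_path='credentials.json',
        client_id='user@example.com'
    )
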
list_clusters(max_results=None, filter=None)[source]

Abstraction of projects().regions().clusters().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/list]

Parameters:
  • max_results – If set, limits the number of clusters returned.
  • filter – Filter constraint string limiting the clusters listed, as described in the linked REST documentation.
Returns:

List of dictionary objects representing cluster resources.
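
For example, a sketch that assumes the field names of the cluster resource in the linked REST documentation:

    clusters = dataproc.list_clusters()
    for cluster in clusters:
        print(cluster['clusterName'], cluster['status']['state'])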

get_cluster(cluster_name)[source]

Abstraction of projects().regions().clusters().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/get]

Parameters: cluster_name – Cluster name.
Returns: Dictionary object representing cluster resource.
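
For example (cluster name is a placeholder):

    cluster = dataproc.get_cluster('my-cluster')
    print(cluster['status']['state'])
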
diagnose_cluster(cluster_name)[source]

Abstraction of projects().regions().clusters().diagnose() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/diagnose]

Parameters: cluster_name – Cluster name.
Returns: Dictionary object representing operation resource.
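
Since diagnose returns a long-running operation, a typical sketch polls it to completion (cluster name is a placeholder):

    op_resp = dataproc.diagnose_cluster('my-cluster')
    result = dataproc.poll_operation_status(op_resp)
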
list_operations(max_results=None, filter=None)[source]

Abstraction of projects().regions().operations().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/list]

Parameters:
  • max_results – If set, limits the number of operations returned.
  • filter – Filter constraint string limiting the operations listed, as described in the linked REST documentation.
Returns:

List of dictionary objects representing operation resources.
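
For example, a sketch that assumes the field names of the operation resource in the linked REST documentation:

    operations = dataproc.list_operations(max_results=10)
    for op in operations:
        print(op['name'], op.get('done', False))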

get_operation(operation_name)[source]

Abstraction of projects().regions().operations().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/get]

Parameters: operation_name – Name of operation resource.
Returns: Dictionary object representing operation resource.
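
For example (the operation name below is a placeholder for a fully qualified resource name):

    op = dataproc.get_operation(
        'projects/my-project-id/regions/global/operations/my-operation-id'
    )
    print(op.get('done', False))
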
poll_operation_status(operation_resp, sleep_time=3)[source]

Polls an operation until completion by repeatedly calling the projects().regions().operations().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/get]

Parameters:
  • operation_resp – Representation of operation resource.
  • sleep_time – Time in seconds to wait between polls.
Returns:

Dictionary object representing operation resource.
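
Typically used on a response from a method called with wait_finish=False, e.g.:

    op_resp = dataproc.create_cluster('us-central1-a', 'my-cluster', wait_finish=False)
    final_resp = dataproc.poll_operation_status(op_resp, sleep_time=5)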

create_cluster(zone, cluster_name, wait_finish=True, sleep_time=3, **kwargs)[source]

Abstraction of projects().regions().clusters().create() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create]

Parameters:
  • zone – Dataproc zone.
  • cluster_name – Cluster name.
  • wait_finish – If set to True, the operation is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
  • config_bucket – Google Cloud Storage staging bucket used for sharing generated SSH keys and config.
  • network – Google Compute Engine network to be used for machine communications.
  • master_boot_disk – Size in GB of the master boot disk.
  • master_num – The number of master VM instances in the instance group.
  • master_machine_type – Google Compute Engine machine type used for master cluster instances.
  • worker_boot_disk – Size in GB of the worker boot disk.
  • worker_num – The number of VM worker instances in the instance group.
  • worker_machine_type – Google Compute Engine machine type used for worker cluster instances.
  • init_actions – Google Cloud Storage URI of executable file(s).
Returns:

Dictionary object or OperationResponse representing the operation resource.
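
A sketch of a typical call; the zone, cluster name, and config values are placeholders, and the kwargs are those listed above:

    resp = dataproc.create_cluster(
        'us-central1-a',
        'my-cluster',
        wait_finish=True,
        master_machine_type='n1-standard-4',
        worker_num=2,
        worker_machine_type='n1-standard-4'
    )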

delete_cluster(cluster_name, wait_finish=True, sleep_time=3)[source]

Abstraction of projects().regions().clusters().delete() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete]

Parameters:
  • cluster_name – Cluster name.
  • wait_finish – If set to True, the operation is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
Returns:

Dictionary object or OperationResponse representing the operation resource.
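
For example (cluster name is a placeholder):

    resp = dataproc.delete_cluster('my-cluster', wait_finish=True)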

list_jobs(cluster_name=None, job_state='ACTIVE', max_results=None, filter=None)[source]

Abstraction of projects().regions().jobs().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list]

Parameters:
  • cluster_name – If set, only lists jobs from the specified cluster.
  • job_state – Matches jobs in the given state. Defaults to ‘ACTIVE’.
  • max_results – If set, limits the number of jobs returned.
  • filter – Filter constraint string limiting the jobs listed, as described in the linked REST documentation.
Returns:

List of dictionary objects representing job resources.
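
For example, a sketch that assumes the field names of the job resource in the linked REST documentation:

    jobs = dataproc.list_jobs(cluster_name='my-cluster', job_state='ACTIVE')
    for job in jobs:
        print(job['reference']['jobId'], job['status']['state'])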

get_job(job_id)[source]

Abstraction of projects().regions().jobs().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/get]

Parameters: job_id – Job ID.
Returns: Dictionary object representing job resource.
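
For example (job ID is a placeholder):

    job = dataproc.get_job('my-job-id')
    print(job['status']['state'])
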
poll_job_status(job_resp, sleep_time=3)[source]

Polls a job resource until the job completes.

Parameters:
  • job_resp – Representation of job resource.
  • sleep_time – Time in seconds to wait between polls.
Returns:

Dictionary object representing job resource.
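
Typically used on a response from a submit method called with wait_finish=False, e.g. (cluster name and URI are placeholders):

    job_resp = dataproc.submit_pyspark_job(
        'my-cluster',
        'gs://my-bucket/main.py',
        wait_finish=False
    )
    final_resp = dataproc.poll_job_status(job_resp, sleep_time=5)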

submit_spark_job(cluster_name, main_class, wait_finish=True, sleep_time=5, **kwargs)[source]

Abstraction of projects().regions().jobs().submit() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit]

Body parameters can be found here: [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job]

Parameters:
  • cluster_name – The name of the cluster where the job will be submitted.
  • main_class – The name of the driver’s main class.
  • wait_finish – If set to True, the job is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
  • args – The arguments to pass to the driver.
  • jar_uris – HCFS URIs of jar files to add to the CLASSPATHs of the Spark driver and tasks.
  • file_uris – HCFS URIs of files to be copied to the working directory of Spark drivers and distributed tasks.
  • archive_uris – HCFS URIs of archives to be extracted in the working directory of Spark drivers and tasks.
  • properties – A mapping of property names to values, used to configure Spark.
Returns:

Dictionary object or JobResponse representing job resource.
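
A sketch of a typical call; the cluster name, main class, and URIs are placeholders:

    resp = dataproc.submit_spark_job(
        'my-cluster',
        'com.example.MyMainClass',
        jar_uris=['gs://my-bucket/my-spark-job.jar'],
        args=['--input', 'gs://my-bucket/input/']
    )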

submit_pyspark_job(cluster_name, main_py_uri, wait_finish=True, sleep_time=5, **kwargs)[source]

Abstraction of projects().regions().jobs().submit() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit]

Body parameters can be found here: [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job]

Parameters:
  • cluster_name – The name of the cluster where the job will be submitted.
  • main_py_uri – The HCFS URI of the Python file to use as the driver.
  • wait_finish – If set to True, the job is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
  • args – The arguments to pass to the driver.
  • python_uris – HCFS file URIs of Python files to pass to the PySpark framework.
  • jar_uris – HCFS URIs of jar files to add to the CLASSPATHs of the Python driver and tasks.
  • file_uris – HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks.
  • archive_uris – HCFS URIs of archives to be extracted in the working directory of Spark drivers and tasks.
  • properties – A mapping of property names to values, used to configure PySpark.
Returns:

Dictionary object or JobResponse representing job resource.
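
A sketch of a typical call; the cluster name and URIs are placeholders:

    resp = dataproc.submit_pyspark_job(
        'my-cluster',
        'gs://my-bucket/main.py',
        python_uris=['gs://my-bucket/helpers.py'],
        properties={'spark.executor.memory': '4g'}
    )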

Misc Classes/Functions

class gwrappy.dataproc.utils.OperationResponse(resp)[source]

Wrapper for Dataproc Operation responses, mainly for calculating/parsing statistics into human readable formats for logging.

Parameters: resp (dictionary) – Dictionary representation of an operation resource.
class gwrappy.dataproc.utils.JobResponse(resp)[source]

Wrapper for Dataproc Job responses, mainly for calculating/parsing statistics into human readable formats for logging.

Parameters: resp (dictionary) – Dictionary representation of a job resource.
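
Both wrappers are constructed from the raw dictionary response, e.g. (op_dict and job_dict stand in for dictionaries returned by the API):

    from gwrappy.dataproc.utils import OperationResponse, JobResponse

    op_resp = OperationResponse(op_dict)
    job_resp = JobResponse(job_dict)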