Dataproc

DataprocUtility

class gwrappy.dataproc.DataprocUtility(project_id, **kwargs)[source]

Initializes object for interacting with Dataproc API.

By default, Application Default Credentials are used.
If the gcloud SDK isn’t installed, credential files must be specified using the kwargs json_credentials_path and client_id.
Parameters:
  • project_id – Project ID linked to Dataproc.
  • client_secret_path – File path for client secret JSON file. Only required if credentials are invalid or unavailable.
  • json_credentials_path – File path for automatically generated credentials.
  • client_id – Credentials are stored as a key-value pair per client_id to facilitate multiple clients using the same credentials file. For simplicity, using one’s email address is sufficient.
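
A minimal usage sketch (the project ID, file paths, and email address below are placeholders):

    from gwrappy.dataproc import DataprocUtility

    # With Application Default Credentials (gcloud SDK installed)
    dataproc = DataprocUtility('my-project-id')

    # Without the gcloud SDK, pointing at stored credential files
    dataproc = DataprocUtility(
        'my-project-id',
        client_secret_path='client_secret.json',
        json_credentials_path='credentials.json',
        client_id='user@example.com'
    )
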
list_clusters(max_results=None, filter=None)[source]

Abstraction of projects().regions().clusters().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/list]

Parameters:
  • max_results – If set, limits the number of clusters returned.
  • filter – Filter constraint string limiting the clusters listed, as described in the linked REST documentation.
Returns:

List of dictionary objects representing cluster resources.
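
For example, a sketch that assumes the field names of the cluster resource in the linked REST documentation:

    clusters = dataproc.list_clusters()
    for cluster in clusters:
        print(cluster['clusterName'], cluster['status']['state'])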

get_cluster(cluster_name)[source]

Abstraction of projects().regions().clusters().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/get]

Parameters: cluster_name – Cluster name.
Returns: Dictionary object representing cluster resource.
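
For example (cluster name is a placeholder):

    cluster = dataproc.get_cluster('my-cluster')
    print(cluster['status']['state'])
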
diagnose_cluster(cluster_name)[source]

Abstraction of projects().regions().clusters().diagnose() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/diagnose]

Parameters: cluster_name – Cluster name.
Returns: Dictionary object representing operation resource.
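
Since diagnose returns a long-running operation, a typical sketch polls it to completion (cluster name is a placeholder):

    op_resp = dataproc.diagnose_cluster('my-cluster')
    result = dataproc.poll_operation_status(op_resp)
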
list_operations(max_results=None, filter=None)[source]

Abstraction of projects().regions().operations().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/list]

Parameters:
  • max_results – If set, limits the number of operations returned.
  • filter – Filter constraint string limiting the operations listed, as described in the linked REST documentation.
Returns:

List of dictionary objects representing operation resources.
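
For example, a sketch that assumes the field names of the operation resource in the linked REST documentation:

    operations = dataproc.list_operations(max_results=10)
    for op in operations:
        print(op['name'], op.get('done', False))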

get_operation(operation_name)[source]

Abstraction of projects().regions().operations().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/get]

Parameters: operation_name – Name of operation resource.
Returns: Dictionary object representing operation resource.
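
For example (the operation name below is a placeholder for a fully qualified resource name):

    op = dataproc.get_operation(
        'projects/my-project-id/regions/global/operations/my-operation-id'
    )
    print(op.get('done', False))
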
poll_operation_status(operation_resp, sleep_time=3)[source]

Polls an operation until completion by repeatedly calling the projects().regions().operations().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/get]

Parameters:
  • operation_resp – Representation of operation resource.
  • sleep_time – Time in seconds to wait between polls.
Returns:

Dictionary object representing operation resource.
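
Typically used on a response from a method called with wait_finish=False, e.g.:

    op_resp = dataproc.create_cluster('us-central1-a', 'my-cluster', wait_finish=False)
    final_resp = dataproc.poll_operation_status(op_resp, sleep_time=5)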

create_cluster(zone, cluster_name, wait_finish=True, sleep_time=3, **kwargs)[source]

Abstraction of projects().regions().clusters().create() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create]

Parameters:
  • zone – Dataproc zone.
  • cluster_name – Cluster name.
  • wait_finish – If set to True, the operation is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
  • config_bucket – Google Cloud Storage staging bucket used for sharing generated SSH keys and config.
  • network – Google Compute Engine network to be used for machine communications.
  • master_boot_disk – Size in GB of the master boot disk.
  • master_num – The number of master VM instances in the instance group.
  • master_machine_type – Google Compute Engine machine type used for master cluster instances.
  • worker_boot_disk – Size in GB of the worker boot disk.
  • worker_num – The number of VM worker instances in the instance group.
  • worker_machine_type – Google Compute Engine machine type used for worker cluster instances.
  • init_actions – Google Cloud Storage URI of executable file(s).
Returns:

Dictionary object or OperationResponse representing the operation resource.
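
A sketch of a typical call; the zone, cluster name, and config values are placeholders, and the kwargs are those listed above:

    resp = dataproc.create_cluster(
        'us-central1-a',
        'my-cluster',
        wait_finish=True,
        master_machine_type='n1-standard-4',
        worker_num=2,
        worker_machine_type='n1-standard-4'
    )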

delete_cluster(cluster_name, wait_finish=True, sleep_time=3)[source]

Abstraction of projects().regions().clusters().delete() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete]

Parameters:
  • cluster_name – Cluster name.
  • wait_finish – If set to True, the operation is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
Returns:

Dictionary object or OperationResponse representing the operation resource.
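
For example (cluster name is a placeholder):

    resp = dataproc.delete_cluster('my-cluster', wait_finish=True)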

list_jobs(cluster_name=None, job_state='ACTIVE', max_results=None, filter=None)[source]

Abstraction of projects().regions().jobs().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list]

Parameters:
  • cluster_name – If set, only lists jobs from the specified cluster.
  • job_state – Matches jobs in the given state. Defaults to ‘ACTIVE’.
  • max_results – If set, limits the number of jobs returned.
  • filter – Filter constraint string limiting the jobs listed, as described in the linked REST documentation.
Returns:

List of dictionary objects representing job resources.
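
For example, a sketch that assumes the field names of the job resource in the linked REST documentation:

    jobs = dataproc.list_jobs(cluster_name='my-cluster', job_state='ACTIVE')
    for job in jobs:
        print(job['reference']['jobId'], job['status']['state'])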

get_job(job_id)[source]

Abstraction of projects().regions().jobs().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/get]

Parameters: job_id – Job ID.
Returns: Dictionary object representing job resource.
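
For example (job ID is a placeholder):

    job = dataproc.get_job('my-job-id')
    print(job['status']['state'])
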
poll_job_status(job_resp, sleep_time=3)[source]

Polls a job resource until the job completes.

Parameters:
  • job_resp – Representation of job resource.
  • sleep_time – Time in seconds to wait between polls.
Returns:

Dictionary object representing job resource.
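
Typically used on a response from a submit method called with wait_finish=False, e.g. (cluster name and URI are placeholders):

    job_resp = dataproc.submit_pyspark_job(
        'my-cluster',
        'gs://my-bucket/main.py',
        wait_finish=False
    )
    final_resp = dataproc.poll_job_status(job_resp, sleep_time=5)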

submit_spark_job(cluster_name, main_class, wait_finish=True, sleep_time=5, **kwargs)[source]

Abstraction of projects().regions().jobs().submit() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit]

Body parameters can be found here: [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job]

Parameters:
  • cluster_name – The name of the cluster where the job will be submitted.
  • main_class – The name of the driver’s main class.
  • wait_finish – If set to True, the job is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
  • args – The arguments to pass to the driver.
  • jar_uris – HCFS URIs of jar files to add to the CLASSPATHs of the Spark driver and tasks.
  • file_uris – HCFS URIs of files to be copied to the working directory of Spark drivers and distributed tasks.
  • archive_uris – HCFS URIs of archives to be extracted in the working directory of Spark drivers and tasks.
  • properties – A mapping of property names to values, used to configure Spark.
Returns:

Dictionary object or JobResponse representing job resource.
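
A sketch of a typical call; the cluster name, main class, and URIs are placeholders:

    resp = dataproc.submit_spark_job(
        'my-cluster',
        'com.example.MyMainClass',
        jar_uris=['gs://my-bucket/my-spark-job.jar'],
        args=['--input', 'gs://my-bucket/input/']
    )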

submit_pyspark_job(cluster_name, main_py_uri, wait_finish=True, sleep_time=5, **kwargs)[source]

Abstraction of projects().regions().jobs().submit() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit]

Body parameters can be found here: [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job]

Parameters:
  • cluster_name – The name of the cluster where the job will be submitted.
  • main_py_uri – The HCFS URI of the Python file to use as the driver.
  • wait_finish – If set to True, the job is polled until completion.
  • sleep_time – If wait_finish is True, sets the polling interval in seconds.
  • args – The arguments to pass to the driver.
  • python_uris – HCFS file URIs of Python files to pass to the PySpark framework.
  • jar_uris – HCFS URIs of jar files to add to the CLASSPATHs of the Python driver and tasks.
  • file_uris – HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks.
  • archive_uris – HCFS URIs of archives to be extracted in the working directory of Spark drivers and tasks.
  • properties – A mapping of property names to values, used to configure PySpark.
Returns:

Dictionary object or JobResponse representing job resource.
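
A sketch of a typical call; the cluster name and URIs are placeholders:

    resp = dataproc.submit_pyspark_job(
        'my-cluster',
        'gs://my-bucket/main.py',
        python_uris=['gs://my-bucket/helpers.py'],
        properties={'spark.executor.memory': '4g'}
    )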

Misc Classes/Functions

class gwrappy.dataproc.utils.OperationResponse(resp)[source]

Wrapper for Dataproc Operation responses, mainly for calculating/parsing statistics into human readable formats for logging.

Parameters: resp (dictionary) – Dictionary representation of an operation resource.
class gwrappy.dataproc.utils.JobResponse(resp)[source]

Wrapper for Dataproc Job responses, mainly for calculating/parsing statistics into human readable formats for logging.

Parameters: resp (dictionary) – Dictionary representation of a job resource.
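
Both wrappers are constructed from the raw dictionary response, e.g. (op_dict and job_dict stand in for dictionaries returned by the API):

    from gwrappy.dataproc.utils import OperationResponse, JobResponse

    op_resp = OperationResponse(op_dict)
    job_resp = JobResponse(job_dict)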