Dataproc¶
DataprocUtility¶
class gwrappy.dataproc.DataprocUtility(project_id, **kwargs)[source]¶
Initializes object for interacting with the Dataproc API.
By default, Application Default Credentials are used. If the gcloud SDK isn't installed, credential files have to be specified using the kwargs json_credentials_path and client_id.
Parameters: - project_id – Project ID linked to Dataproc.
- client_secret_path – File path for client secret JSON file. Only required if credentials are invalid or unavailable.
- json_credentials_path – File path for automatically generated credentials.
- client_id – Credentials are stored as a key-value pair per client_id to facilitate multiple clients using the same credentials file. For simplicity, using one’s email address is sufficient.
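A minimal instantiation sketch under the assumptions above (the project id, file path, and email are placeholders; the live calls are shown as comments because they require credentials):

```python
# Keyword arguments used when gcloud SDK credentials are not available.
credential_kwargs = {
    "json_credentials_path": "dataproc_credentials.json",  # placeholder path
    "client_id": "me@example.com",  # any stable key works; an email is simplest
}

# With the gcloud SDK installed, Application Default Credentials suffice:
#   utility = DataprocUtility("my-project-id")
# Otherwise, pass the credential kwargs explicitly:
#   utility = DataprocUtility("my-project-id", **credential_kwargs)
```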
list_clusters(max_results=None, filter=None)[source]¶
Abstraction of projects().regions().clusters().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/list]
Parameters: - max_results (integer) – If None, all results are iterated over and returned.
- filter (String) – Query param [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/list#query-parameters]
Returns: List of dictionary objects representing cluster resources.
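A usage sketch with a filter string (the filter grammar is described in the query-parameters link above; the label and state values here are placeholders, and the live call is commented out):

```python
# Filter clusters by state and label, per the documented filter grammar.
cluster_filter = "status.state = ACTIVE AND labels.env = staging"

# clusters = utility.list_clusters(max_results=10, filter=cluster_filter)
# for cluster in clusters:
#     print(cluster["clusterName"], cluster["status"]["state"])
```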
get_cluster(cluster_name)[source]¶
Abstraction of projects().regions().clusters().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/get]
Parameters: cluster_name – Cluster name.
Returns: Dictionary object representing cluster resource.
diagnose_cluster(cluster_name)[source]¶
Abstraction of projects().regions().clusters().diagnose() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/diagnose]
Parameters: cluster_name – Cluster name.
Returns: Dictionary object representing operation resource.
list_operations(max_results=None, filter=None)[source]¶
Abstraction of projects().regions().operations().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/list]
Parameters: - max_results (integer) – If None, all results are iterated over and returned.
- filter (String) – Query param [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/list#query-parameters]
Returns: List of dictionary objects representing operation resources.
get_operation(operation_name)[source]¶
Abstraction of projects().regions().operations().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/get]
Parameters: operation_name – Name of operation resource.
Returns: Dictionary object representing operation resource.
poll_operation_status(operation_resp, sleep_time=3)[source]¶
Polls projects().regions().operations().get() until the operation completes. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.operations/get]
Parameters: - operation_resp – Representation of operation resource.
- sleep_time – Polling wait time in seconds.
Returns: Dictionary object representing operation resource.
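The polling pattern can be sketched as follows. This is a simplified re-implementation with a stubbed get_operation, not the library's actual code; it relies on the documented fact that operation resources expose a boolean "done" field when finished:

```python
import time

def poll_until_done(get_operation, operation_resp, sleep_time=3):
    # Re-fetch the operation resource until the API reports completion.
    while not operation_resp.get("done"):
        time.sleep(sleep_time)
        operation_resp = get_operation(operation_resp["name"])
    return operation_resp

# Stubbed get_operation for illustration: completes on the second fetch.
_responses = iter([
    {"name": "op-1", "done": False},
    {"name": "op-1", "done": True},
])

def fake_get_operation(name):
    return next(_responses)

result = poll_until_done(fake_get_operation,
                         {"name": "op-1", "done": False},
                         sleep_time=0)
```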
create_cluster(zone, cluster_name, wait_finish=True, sleep_time=3, **kwargs)[source]¶
Abstraction of projects().regions().clusters().create() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create]
Parameters: - zone – Dataproc zone.
- cluster_name – Cluster name.
- wait_finish – If set to True, operation will be polled till completion.
- sleep_time – If wait_finish is set to True, sets polling wait time.
- config_bucket – Google Cloud Storage staging bucket used for sharing generated SSH keys and config.
- network – Google Compute Engine network to be used for machine communications.
- master_boot_disk – Size in GB of the master boot disk.
- master_num – The number of master VM instances in the instance group.
- master_machine_type – Google Compute Engine machine type used for master cluster instances.
- worker_boot_disk – Size in GB of the worker boot disk.
- worker_num – The number of VM worker instances in the instance group.
- worker_machine_type – Google Compute Engine machine type used for worker cluster instances.
- init_actions – Google Cloud Storage URI of executable file(s).
Returns: Dictionary object or OperationResponse representing cluster resource.
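An illustrative set of keyword arguments for create_cluster, using the parameter names listed above (all values are placeholders; the live call is commented out):

```python
# Placeholder cluster configuration passed through **kwargs.
cluster_config = {
    "config_bucket": "my-staging-bucket",
    "network": "default",
    "master_num": 1,
    "master_machine_type": "n1-standard-4",
    "master_boot_disk": 100,  # GB
    "worker_num": 2,
    "worker_machine_type": "n1-standard-4",
    "worker_boot_disk": 100,  # GB
    "init_actions": ["gs://my-bucket/init.sh"],
}

# resp = utility.create_cluster("us-east1-b", "my-cluster",
#                               wait_finish=True, **cluster_config)
```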
delete_cluster(cluster_name, wait_finish=True, sleep_time=3)[source]¶
Abstraction of projects().regions().clusters().delete() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete]
Parameters: - cluster_name – Cluster name.
- wait_finish – If set to True, operation will be polled till completion.
- sleep_time – If wait_finish is set to True, sets polling wait time.
Returns: Dictionary object or OperationResponse representing cluster resource.
list_jobs(cluster_name=None, job_state='ACTIVE', max_results=None, filter=None)[source]¶
Abstraction of projects().regions().jobs().list() method with inbuilt iteration functionality. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list]
Parameters: - cluster_name – Cluster name, if unset, will return jobs from all clusters.
- job_state – Category of jobs to return. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list#JobStateMatcher]
- max_results (integer) – If None, all results are iterated over and returned.
- filter (String) – Query param [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list#query-parameters]
Returns: List of dictionary objects representing job resources.
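A usage sketch (the matcher values below follow the JobStateMatcher enum linked above; the validation helper is hypothetical, added only for illustration, and the live call is commented out):

```python
# Valid job_state values per the JobStateMatcher enum.
JOB_STATE_MATCHERS = ("ALL", "ACTIVE", "NON_ACTIVE")

def validate_job_state(job_state):
    # A small guard one might put in front of list_jobs.
    if job_state not in JOB_STATE_MATCHERS:
        raise ValueError(f"job_state must be one of {JOB_STATE_MATCHERS}")
    return job_state

# done_jobs = utility.list_jobs(cluster_name="my-cluster",
#                               job_state=validate_job_state("NON_ACTIVE"))
```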
get_job(job_id)[source]¶
Abstraction of projects().regions().jobs().get() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/get]
Parameters: job_id – Job Id.
Returns: Dictionary object representing job resource.
poll_job_status(job_resp, sleep_time=3)[source]¶
Polls a job resource until it reaches a terminal state.
Parameters: - job_resp – Representation of job resource.
- sleep_time – Polling wait time in seconds.
Returns: Dictionary object representing job resource.
submit_spark_job(cluster_name, main_class, wait_finish=True, sleep_time=5, **kwargs)[source]¶
Abstraction of projects().regions().jobs().submit() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit]
Body parameters can be found here: [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job]
Parameters: - cluster_name – The name of the cluster where the job will be submitted.
- main_class – The name of the driver’s main class.
- wait_finish – If set to True, operation will be polled till completion.
- sleep_time – If wait_finish is set to True, sets polling wait time.
- args – The arguments to pass to the driver.
- jar_uris – HCFS URIs of jar files to add to the CLASSPATHs of the Spark driver and tasks.
- file_uris – HCFS URIs of files to be copied to the working directory of Spark drivers and distributed tasks.
- archive_uris – HCFS URIs of archives to be extracted in the working directory of Spark drivers and tasks.
- properties – A mapping of property names to values, used to configure Spark.
Returns: Dictionary object or JobResponse representing job resource.
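A submission sketch using the keyword arguments listed above (the main class, bucket URIs, and property values are placeholders; the live call is commented out):

```python
# Placeholder Spark job arguments passed through **kwargs.
spark_kwargs = {
    "args": ["gs://my-bucket/input", "gs://my-bucket/output"],
    "jar_uris": ["gs://my-bucket/jobs/my-job.jar"],
    "properties": {"spark.executor.memory": "4g"},
}

# resp = utility.submit_spark_job("my-cluster", "com.example.MySparkJob",
#                                 wait_finish=True, **spark_kwargs)
```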
submit_pyspark_job(cluster_name, main_py_uri, wait_finish=True, sleep_time=5, **kwargs)[source]¶
Abstraction of projects().regions().jobs().submit() method. [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit]
Body parameters can be found here: [https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#resource-job]
Parameters: - cluster_name – The name of the cluster where the job will be submitted.
- main_py_uri – The HCFS URI of the Python file to use as the driver.
- wait_finish – If set to True, operation will be polled till completion.
- sleep_time – If wait_finish is set to True, sets polling wait time.
- args – The arguments to pass to the driver.
- python_uris – HCFS file URIs of Python files to pass to the PySpark framework.
- jar_uris – HCFS URIs of jar files to add to the CLASSPATHs of the Python driver and tasks.
- file_uris – HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks.
- archive_uris – HCFS URIs of archives to be extracted in the working directory of Spark drivers and tasks.
- properties – A mapping of property names to values, used to configure PySpark.
Returns: Dictionary object or JobResponse representing job resource.
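The PySpark variant follows the same shape, with the driver given as an HCFS URI rather than a class name (all URIs and args below are placeholders; the live call is commented out):

```python
# Placeholder PySpark job arguments passed through **kwargs.
pyspark_kwargs = {
    "args": ["--date", "2017-01-01"],
    "python_uris": ["gs://my-bucket/jobs/helpers.py"],
}

# resp = utility.submit_pyspark_job("my-cluster",
#                                   "gs://my-bucket/jobs/main.py",
#                                   wait_finish=True, **pyspark_kwargs)
```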