AWS Glue Job Using Scala

Yesaswi Avula
Jul 8, 2021

Running an AWS Glue job using Scala

AWS Glue is a serverless service that can be used for many data engineering and analysis tasks, such as cleaning, extracting, transforming, and combining data, and for maintaining data in different locations.

The purpose of this blog is to show an example of using Scala to run a Glue job that parses through the list of tables in a given set of databases in the Glue catalog and prints their contents to the console.

Steps
1. Sign in to the Glue console.

2. Go to the “Jobs” section and click “Add job”.

3. A “Configure the job properties” box opens.

4. Fill in the required details, starting with a job name and an IAM role (one that has permission to your S3 directories and scripts, and to any libraries used by the job).

5. Let the “Type” be Spark and set the “Glue version” to a Scala option (the options name the Spark and Scala versions, e.g. “Spark 2.4, Scala 2 (Glue Version 2.0)”).

6. In the “This job runs” box, select “A new script to be authored by you” (the easiest option for this task), and provide a script location in the “S3 path where the script is stored” box.

7. Give any “Scala class name” and remember the value; it is used later in the Scala program.

8. Click Next and then click “Save job and edit script”.

Now, an empty script editor opens.

1. Inside the script section, start typing the Scala code.

In this case, the idea is to create a Glue client, retrieve the required databases from the Glue Catalog, and then parse through the tables inside those database(s).
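As a minimal sketch of that idea, assuming the AWS SDK for Java (v1), which the Glue job environment provides, the catalog can be walked with a Glue client. The object name CatalogWalk is illustrative, and result pagination is omitted for brevity:

```scala
import com.amazonaws.services.glue.AWSGlueClientBuilder
import com.amazonaws.services.glue.model.{GetDatabasesRequest, GetTablesRequest}
import scala.collection.JavaConverters._

// Walk the Glue Data Catalog: list every database, then every table in it.
// Pagination (NextToken) is omitted to keep the sketch short.
object CatalogWalk {
  def main(args: Array[String]): Unit = {
    // The client picks up the job's IAM role credentials automatically.
    val glue = AWSGlueClientBuilder.defaultClient()

    val databases = glue.getDatabases(new GetDatabasesRequest()).getDatabaseList.asScala
    for (db <- databases) {
      val tables = glue
        .getTables(new GetTablesRequest().withDatabaseName(db.getName))
        .getTableList.asScala
      tables.foreach(t => println(s"${db.getName}.${t.getName}"))
    }
  }
}
```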

The name “Program” is the “Scala class name” that was set in Step 7 above.

Use getCatalogSource() to read each table from the Glue catalog as a DynamicFrame, then convert it to a DataFrame and print it.
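Putting the pieces together, a minimal sketch of the full script might look like the following, assuming “Program” was the “Scala class name” given in Step 7. It walks every database in the catalog; in practice you would filter that list down to your given set of databases.

```scala
import com.amazonaws.services.glue.{AWSGlueClientBuilder, GlueContext}
import com.amazonaws.services.glue.model.{GetDatabasesRequest, GetTablesRequest}
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

// "Program" must match the "Scala class name" entered in Step 7.
object Program {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Array("JOB_NAME"))
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Glue client for catalog metadata (pagination omitted for brevity).
    val glue = AWSGlueClientBuilder.defaultClient()
    val databases = glue.getDatabases(new GetDatabasesRequest()).getDatabaseList.asScala

    for (db <- databases) {
      val tables = glue
        .getTables(new GetTablesRequest().withDatabaseName(db.getName))
        .getTableList.asScala

      for (table <- tables) {
        // Read the catalog table as a DynamicFrame, convert it to a
        // DataFrame, and print its contents to the job's console output.
        val dynamicFrame = glueContext
          .getCatalogSource(database = db.getName, tableName = table.getName)
          .getDynamicFrame()
        dynamicFrame.toDF().show()
      }
    }

    Job.commit()
  }
}
```

Note that getCatalogSource() reads the data that each catalog table points to, so the job’s IAM role also needs read access to the underlying storage (e.g. the S3 locations behind the tables).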

2. Save and run the job to see the required output in the console.
