Reading / Extracting Data from Databricks Database (hive_metastore) with PySpark

by
Ali Hasan
apache-spark-sql azure-databricks llama-cpp-python pyspark

Quick Fix: Use the following code to access the table in the samples catalog:

df = spark.table("samples.nyctaxi.trips")

The Problem:

How to read data from a table in the Databricks Database (hive_metastore) using PySpark? The table is located in the nyctaxi database and is named trips. The user is unsure how to specify the metastore and database when reading the data.

The Solutions:

Solution 1: Using `spark.table("db.table")`

To read data from a Databricks metastore database using PySpark, you can use the `spark.table("db.table")` method. Here’s an example:

df = spark.table("nyctaxi.trips")

In this example, df is a DataFrame that contains the data from the trips table in the nyctaxi database.

Note: If you are working directly in Databricks notebooks, the Spark session is already available as spark. You do not need to get or create a session.
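
A table name passed to spark.table() can be written as schema.table or, as in the quick fix, as catalog.schema.table. The following is a minimal sketch of assembling either form. The helpers `qualified_table_name` and `read_table` are hypothetical names introduced for illustration and are not part of PySpark, and using `hive_metastore` as a catalog prefix assumes a Unity Catalog-enabled workspace.

```python
def qualified_table_name(schema, table, catalog=None):
    """Join the name parts with dots, e.g. 'nyctaxi.trips'.

    Hypothetical helper for illustration; not part of PySpark.
    """
    parts = [catalog, schema, table] if catalog else [schema, table]
    return ".".join(parts)


def read_table(spark, schema, table, catalog=None):
    """Return whatever spark.table() gives back for the assembled name."""
    return spark.table(qualified_table_name(schema, table, catalog))


# In a Databricks notebook, where `spark` already exists:
# df = read_table(spark, "nyctaxi", "trips")
# Assumption: the legacy metastore is exposed as the hive_metastore catalog.
# df = read_table(spark, "nyctaxi", "trips", catalog="hive_metastore")
```

Passing the session in as an argument keeps the helper usable both in notebooks, where spark is predefined, and in scripts that build their own session.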

Solution 2: Databricks hive_metastore data extraction

To read data from Databricks Hive Metastore using PySpark, follow these steps:

  1. Enable Hive Support: Ensure that Hive support is enabled in the Spark session using enableHiveSupport().
from pyspark.sql import SparkSession

# Build (or reuse) a session with Hive metastore support enabled
spark = (
    SparkSession.builder
    .appName("HiveTest")
    .enableHiveSupport()
    .getOrCreate()
)
  2. Verify Hive Metastore: Run the following command to verify that databases are accessible from the Hive Metastore:
spark.sql("show databases").show()
  3. Read Data from Hive Metastore: Use the sql() function to read data from a table in the Hive Metastore, replacing db_name and table_name with your own (for example nyctaxi and trips).
df = spark.sql("select * from db_name.table_name")
df.show()
  4. Handle Errors: If errors occur, check the Hive Metastore configuration properties for your Databricks environment against the Databricks documentation.
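
The verify and read steps above can be combined into one function, sketched here under some assumptions: `read_if_database_exists` is a hypothetical helper name, the session is passed in as `spark`, and the database name is taken from the first column of the `SHOW DATABASES` result because that column's name can differ between Spark versions.

```python
def read_if_database_exists(spark, db_name, table_name):
    """Read db_name.table_name, or return None if the database is not listed.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # SHOW DATABASES yields one row per database; take the first column.
    databases = [row[0] for row in spark.sql("SHOW DATABASES").collect()]
    if db_name not in databases:
        print(f"Database {db_name!r} not found; available: {databases}")
        return None
    # Names are interpolated directly, so pass only trusted identifiers.
    return spark.sql(f"SELECT * FROM {db_name}.{table_name}")


# In a Databricks notebook, where `spark` already exists:
# df = read_if_database_exists(spark, "nyctaxi", "trips")
# if df is not None:
#     df.show()
```

Returning None rather than raising lets the caller decide how to handle a missing database. A missing table would still raise an error from spark.sql().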

Q&A

How to read in the table using PySpark from Databricks Database?

The samples catalog can be accessed using spark.table("catalog.schema.table"), for example spark.table("samples.nyctaxi.trips").
