Reading and Writing Data in Azure Databricks

In this blog, we are going to cover Reading and Writing Data in Azure Databricks. Azure Databricks supports day-to-day data-handling functions, such as reading, writing, and querying.

Topics we’ll Cover:

  • Azure Databricks
  • File types for reading and writing data in Databricks
  • Table batch read and write
  • Perform read and write operations in Azure Databricks

We use Azure Databricks to read multiple file types, both with and without a schema, to combine inputs from files and data stores such as Azure SQL Database, and to transform and store that data for advanced analytics.

What is Azure Databricks?

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments for developing data-intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.

Check out our related blog here: Azure Databricks For Beginners

Azure Databricks is a fully managed service that provides powerful ETL, analytics, and machine learning capabilities. Unlike other vendors' offerings, it is a first-party service on Azure that integrates seamlessly with other Azure services such as Event Hubs and Cosmos DB.

Read: Structured Vs Unstructured Data

File Types for Reading and Writing Data in Azure Databricks

  • CSV Files
  • JSON Files
  • Parquet Files

CSV Files

When reading CSV files with a specified schema, it is possible that the data in the files does not match the schema. For example, a field containing the name of the city will not parse as an integer. The consequences depend on the mode that the parser runs in:

  • PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
  • DROPMALFORMED: drops lines that contain fields that could not be parsed
  • FAILFAST: aborts reading if any malformed data is found
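
As a quick sketch, here is how a schema and parser mode are passed to the CSV reader in PySpark. The schema, column names, and file path below are hypothetical, not from this post:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema: a city name and an integer population column
schema = StructType([
    StructField("city", StringType(), True),
    StructField("population", IntegerType(), True),
])

# mode can be PERMISSIVE (default), DROPMALFORMED, or FAILFAST
df = spark.read.format("csv") \
  .schema(schema) \
  .option("header", "true") \
  .option("mode", "FAILFAST") \
  .load("/FileStore/tables/cities.csv")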

JSON Files

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel; in multi-line mode, a file is loaded as a whole entity and cannot be split.

Multi-Line Mode

This JSON object occupies multiple lines:

[
  {"string": "string1", "int": 1, "array": [1, 2, 3], "dict": {"key": "value1"}},
  {"string": "string2", "int": 2, "array": [2, 4, 6], "dict": {"key": "value2"}},
  {"string": "string3", "int": 3, "array": [3, 6, 9], "dict": {"key": "value3", "extra_key": "extra_value3"}}
]

Single-Line Mode

In this example, there is one JSON object per line:

{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
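
As a minimal sketch (the file paths are hypothetical), the multiLine option switches the reader between the two modes:

# Single-line mode (default): one JSON object per line, so the file can be split and read in parallel
df_single = spark.read.json("/tmp/json/single_line.json")

# Multi-line mode: the whole file is parsed as one JSON document
df_multi = spark.read.option("multiLine", "true").json("/tmp/json/multi_line.json")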

Parquet Files

Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON.
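
As a minimal sketch (the paths are hypothetical), reading and writing Parquet follows the same DataFrame reader/writer pattern:

# Read a Parquet file; the schema travels with the file, so none needs to be supplied
df = spark.read.parquet("/tmp/data/people.parquet")

# Write the DataFrame back out as Parquet
df.write.mode("overwrite").parquet("/tmp/data/people_out.parquet")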

Table Batch Reads and Writes

Delta Lake supports most of the options provided by Apache Spark DataFrame read and write APIs for performing batch reads and writes on tables.

1.) Read a Table

You can load a Delta table as a DataFrame by specifying a table name or a path:

spark.table("default.people10m")  # query table in the metastore
spark.read.format("delta").load("/tmp/delta/people10m")  # query table by path

2.) Write to a Table

To atomically add new data to an existing Delta table, use append mode:

df.write.format("delta").mode("append").save("/tmp/delta/people10m")
df.write.format("delta").mode("append").saveAsTable("default.people10m")
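
Delta Lake writes also support atomically replacing all the data in a table; as a minimal sketch, swap append for overwrite:

# Atomically replace the table contents instead of appending
df.write.format("delta").mode("overwrite").save("/tmp/delta/people10m")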

Perform Read and Write Operations in Azure Databricks

The steps below walk through a read and write operation in Azure Databricks.

1. Provision the Required Resources

1. From the Azure portal, provision an Azure Databricks workspace: select Create a resource → Analytics → Azure Databricks. Enter the required details and click Review + Create.

2. Create a Spark Cluster

1. Open the Azure Databricks workspace and click Create Compute.

2. Give the cluster a meaningful name, select the runtime version and worker type based on your preference, and click Create Cluster.

3. Upload the sample file to Databricks (DBFS): open the Databricks workspace and click ‘Import Data’.

4. Click ‘Drop files to upload’ and select the file you want to process.

5. The Country sales data file is uploaded and ready to use.

3. Read and Write The Data

1. Open the Azure Databricks workspace and create a notebook.

2. Now it's time to write some Python code to read the ‘Country_Sales_Records.csv’ file and create a DataFrame.

# File location and type
file_location = "/FileStore/tables/Country_Sales_Records.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

Copy and paste the above code into the cell, change the file name to your file name, and make sure the cluster is running and attached to the notebook.

3. Run it by clicking Run Cell or pressing CTRL + ENTER. The code executes successfully, and the cluster creates two Spark jobs to read and display the data from the ‘Country Sales’ data file. Notice, though, that the schema is not quite right: every column shows as string, and the header doesn't look right (_c0, _c1, etc.).
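
That is expected, because inferSchema and header were both set to "false" above. As a minimal sketch, re-reading the same file with both options enabled picks up the real column names and types:

# Re-read with header detection and schema inference turned on
df = spark.read.format("csv") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .option("sep", ",") \
  .load("/FileStore/tables/Country_Sales_Records.csv")

display(df)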

4. Create a Table and Query The Data Using SQL

1. Create a temporary view from the DataFrame and query the data using SQL.

2. Add a new cell to the notebook, paste the code below, and then run the cell.

# Create a view or table
tblCountrySales = "Country_Sales"
df.createOrReplaceTempView(tblCountrySales)

%sql
select * from `Country_Sales`

Now you can use regular SQL on top of the temporary view and query the data in whatever way you want. But the view is temporary in nature, which means it is only available to this particular notebook and will not be available once the cluster restarts.

If you create a new notebook and try to access the view we just created, it is not accessible from that notebook.

So, to make it available across notebooks and to all users, we have to create a permanent table. Let's create one by executing the code below:

tbl_name = "tbl_Country_Sales"
df.write.format("parquet").saveAsTable(tbl_name)

Now the permanent table is created. It persists across cluster restarts, and users across different notebooks can query this data; we can access the table from other notebooks as well.
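
For example, from any other notebook attached to a cluster in the workspace, the table can now be queried by name (a minimal sketch):

# Look up the permanent table by name from a different notebook
df_sales = spark.table("tbl_Country_Sales")
display(df_sales)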

Related/References

  • Azure Data Lake For Beginners: All you Need To Know
  • Batch Processing Vs Stream Processing: All you Need To Know
  • Microsoft Power BI VS Tableau | Which one is Better?
  • Introduction To Data Analysis Expression (DAX) In Power BI
  • Introduction to Big Data and Big Data Architectures

Next Task For You

In our Azure Data on Cloud Job-Oriented training program, we will cover 50+ Hands-On Labs. If you want to begin your journey towards becoming a Microsoft Certified Associate and get a high-paying job, check out our FREE CLASS.

FAQs

How do I read data from Azure?

On the lower ribbon of your KQL database, select Get Data. In the Get data window, the Source tab is selected. Select the data source from the available list. In this example, you're ingesting data from Azure storage.

How do I get data into Azure Databricks?

Add data from local files
  1. Click Create or modify table to upload CSV, TSV, JSON, XML, Avro, Parquet, or text files into Delta Lake tables. ...
  2. Click Upload files to volume to upload files in any format to a Unity Catalog volume, including structured, semi-structured, and unstructured data.

How do you write a notebook in Databricks?

Creating a new Notebook
  1. Click the triangle on the right side of a folder to open the folder menu.
  2. Select Create > Notebook.
  3. Enter the name of the notebook, the language (Python, Scala, R or SQL) for the notebook, and a cluster to run it on.

How to write SQL in a Databricks notebook?

With Databricks SQL, this is simple: there is a format tool built into the editor. The keyboard shortcut is Shift + Command + F, or you can click the kebab menu next to the warehouse drop-down for the Format button, or check out the other keyboard shortcuts.

Does Databricks have an ETL tool?

Azure Databricks is a collaborative analytics platform that combines Apache Spark with Azure services, and ETL is one of its core features: you can create ETL pipelines using Databricks notebooks, which allow you to write Spark code (Scala, Python, or SQL).

How to read a CSV file in Azure Databricks?

Databricks recommends the read_files table-valued function for SQL users to read CSV files. read_files is available in Databricks Runtime 13.3 LTS and above. You can also use a temporary view.
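
As a minimal sketch (assuming Databricks Runtime 13.3 LTS or above, and reusing the sample file path from this post), read_files can also be called from a notebook via spark.sql:

# Read the CSV through the read_files table-valued function
df = spark.sql("SELECT * FROM read_files('/FileStore/tables/Country_Sales_Records.csv', format => 'csv', header => true)")
display(df)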

How to upload Excel data into Databricks?

  • Click on the "Data" tab in the Databricks workspace and select the folder where you want to upload the file.
  • Click the "Upload" button and select your Excel file from your local machine.
Make sure to replace `"/FileStore/your_excel_file.

What coding languages does Databricks use?

With Databricks notebooks, you can: Develop code using Python, SQL, Scala, and R. Customize your environment with the libraries of your choice. Create regularly scheduled jobs to automatically run tasks, including multi-notebook workflows.

How to write text in a Databricks notebook?

Create cells

%md ### Libraries
Import the necessary libraries.

To create a new cell, hover over a cell at the top or bottom. Click Code or Text to create a code or Markdown cell, respectively.

How to use Databricks step by step?

  1. Sign up for a free trial.
  2. Set up your first workspace.
  3. Navigate the workspace.
  4. Create a table.
  5. Query and visualize data from a notebook.
  6. Import and visualize CSV data from a notebook.
  7. Ingest and insert additional data.
  8. Cleanse and enhance data.

How is data read and written?

The hard drive contains a spinning platter with a thin magnetic coating. A "head" moves over the platter, writing 0's and 1's as tiny areas of magnetic North or South on the platter. To read the data back, the head goes to the same spot, notices the North and South spots flying by, and so deduces the stored 0's and 1's.

How to read a file from Databricks?

Databricks File System (DBFS):
  1. Databricks provides a distributed file system called DBFS.
  2. You can use the dbfs prefix to read files from DBFS.
  3. For example:
import pandas as pd
dbfs_file_location = "/dbfs/Workspace/Users/[email protected]/csv files/f1.csv"
df = pd.read_csv(dbfs_file_location)

How do I read Excel data in Databricks?

How to read excel file using databricks
  1. Step 1: Set Up Databricks Environment. ...
  2. Step 2: Upload Excel File to DBFS. ...
  3. Step 3: Create a Databricks Notebook. ...
  4. Step 4: Import Required Libraries In your Databricks notebook, import the required libraries to work with Excel files.

What is read data and write data?

Reading data means looking at it. Writing data means changing it. This is fairly basic computing jargon. For example, when you look at your bank statement online, that is a read; when you send money to someone, that is a write.
