<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Extracting data from excel in datalake storage using openpyxl in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-excel-in-datalake-storage-using-openpyxl/m-p/2999#M198</link>
    <description>&lt;P&gt;I am trying to extract some data into Databricks but tripping all over openpyxl; I am a newish user of Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;from openpyxl import load_workbook

directory_id = "hidden"
scope = "hidden"
client_id = "hidden"
service_credential_key = "hidden"
container_name = "hidden"
storage_account = "hidden"

application_id = dbutils.secrets.get(scope=scope, key=client_id)
service_credential = dbutils.secrets.get(scope=scope, key=service_credential_key)

adls_path = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net/"
spark.sql("set spark.databricks.delta.optimizeWrite.enabled = True")

project = "hidden/hidden/"
path = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net/"
raw_path = f"{path}raw/{project}"
raw_schema_path = f"{raw_path}/schema/"
bronze_path = f"{path}bronze/{project}"
silver_path = f"{path}silver/{project}"
gold_path = f"{path}gold/{project}"
temp_path = f"{path}{project}"

print(temp_path)
dbutils.fs.ls(temp_path)
user = load_workbook(f"{temp_path}staff.xlsx")&lt;/PRE&gt;&lt;P&gt;FileNotFoundError: [Errno 2] No such file or directory: 'hidden'&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The file definitely exists and the path is correct, but I am not quite understanding why it is not being picked up. If I convert it to CSV and use&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;user = spark.read.format("csv").options(**{"sep": ",", "header": True, "inferSchema": True}).load(f"{temp_path}staff.csv")&lt;/PRE&gt;&lt;P&gt;it works fine, but I am trying to avoid the extra hop. Working with Excel seems a bit painful, however. I am fairly new on Databricks, so I might be missing something obvious.&lt;/P&gt;</description>
    <pubDate>Thu, 15 Jun 2023 17:49:09 GMT</pubDate>
    <dc:creator>vanessafvg</dc:creator>
    <dc:date>2023-06-15T17:49:09Z</dc:date>
    <item>
      <title>Extracting data from excel in datalake storage using openpyxl</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-excel-in-datalake-storage-using-openpyxl/m-p/2999#M198</link>
      <description>&lt;P&gt;I am trying to extract some data into Databricks but tripping all over openpyxl; I am a newish user of Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;from openpyxl import load_workbook

directory_id = "hidden"
scope = "hidden"
client_id = "hidden"
service_credential_key = "hidden"
container_name = "hidden"
storage_account = "hidden"

application_id = dbutils.secrets.get(scope=scope, key=client_id)
service_credential = dbutils.secrets.get(scope=scope, key=service_credential_key)

adls_path = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net/"
spark.sql("set spark.databricks.delta.optimizeWrite.enabled = True")

project = "hidden/hidden/"
path = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net/"
raw_path = f"{path}raw/{project}"
raw_schema_path = f"{raw_path}/schema/"
bronze_path = f"{path}bronze/{project}"
silver_path = f"{path}silver/{project}"
gold_path = f"{path}gold/{project}"
temp_path = f"{path}{project}"

print(temp_path)
dbutils.fs.ls(temp_path)
user = load_workbook(f"{temp_path}staff.xlsx")&lt;/PRE&gt;&lt;P&gt;FileNotFoundError: [Errno 2] No such file or directory: 'hidden'&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The file definitely exists and the path is correct, but I am not quite understanding why it is not being picked up. If I convert it to CSV and use&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;user = spark.read.format("csv").options(**{"sep": ",", "header": True, "inferSchema": True}).load(f"{temp_path}staff.csv")&lt;/PRE&gt;&lt;P&gt;it works fine, but I am trying to avoid the extra hop. Working with Excel seems a bit painful, however. I am fairly new on Databricks, so I might be missing something obvious.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jun 2023 17:49:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-excel-in-datalake-storage-using-openpyxl/m-p/2999#M198</guid>
      <dc:creator>vanessafvg</dc:creator>
      <dc:date>2023-06-15T17:49:09Z</dc:date>
    </item>
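A note on the question above: `load_workbook` opens its argument with ordinary local-file I/O, so an `abfss://` URL is passed straight to `open()` and fails with `FileNotFoundError` — unlike `spark.read`, openpyxl has no ADLS driver. However, `load_workbook` also accepts a file-like object, so one workaround is to fetch the workbook's bytes (e.g. with the Azure SDK) and wrap them in `io.BytesIO`. A minimal sketch of that pattern — the workbook here is built in memory purely to keep the example self-contained, standing in for bytes downloaded from the lake:

```python
import io
from openpyxl import Workbook, load_workbook

# Stand-in for staff.xlsx: build a tiny workbook and serialize it to bytes,
# as if those bytes had just been downloaded from ADLS.
wb = Workbook()
ws = wb.active
ws.append(["name", "role"])
ws.append(["alice", "engineer"])

buf = io.BytesIO()
wb.save(buf)   # openpyxl can write to any file-like object
buf.seek(0)

# load_workbook accepts a file-like object, so no local path is needed.
loaded = load_workbook(buf)
rows = [[cell.value for cell in row] for row in loaded.active.iter_rows()]
print(rows)  # [['name', 'role'], ['alice', 'engineer']]
```

On Databricks itself, an alternative that avoids the byte-fetching code is to copy the file to the driver's local disk first, e.g. `dbutils.fs.cp(f"{temp_path}staff.xlsx", "file:/tmp/staff.xlsx")` and then `load_workbook("/tmp/staff.xlsx")` (paths illustrative, assuming the cluster's ADLS credentials are configured).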
    <item>
      <title>Re: Extracting data from excel in datalake storage using openpyxl</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-excel-in-datalake-storage-using-openpyxl/m-p/3000#M199</link>
      <description>&lt;P&gt;Hi @Vanessa Van Gelder,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Great to meet you, and thanks for your question!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let's see if your peers in the community have an answer to your question. Thanks.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Jun 2023 09:30:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-excel-in-datalake-storage-using-openpyxl/m-p/3000#M199</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-17T09:30:22Z</dc:date>
    </item>
  </channel>
</rss>

