I have data like below, and when reading it as CSV I don't want a comma inside quotes to be treated as a separator, even when the quotes are not immediately next to the separator (like record #2). Records 1 and 3 parse fine with the separator, but it is failing on the 2nd record...
Hi, I think you can use this option for the CSV reader: spark.read.options(header = True, sep = ",", unescapedQuoteHandling = "BACK_TO_DELIMITER").csv("your_file.csv"), especially the unescapedQuoteHandling option. You can search for the other options at this l...
I am trying to read data from an Azure SQL database from Databricks. The Azure SQL database is created with a private link endpoint. Using a DBR 10.4 LTS cluster, the expectation is that the connector is pre-installed as per the documentation. Using the below code to fetch...
It seems that .option("databaseName", "test") is redundant here, as you need to include the database name in the URL. Please verify that you use a connector compatible with your cluster's Spark version: Apache Spark connector: SQL Server & Azure SQL
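For illustration, a minimal sketch of such a read, assuming the com.microsoft.sqlserver.jdbc.spark connector is installed on the cluster; the server, credentials, and table name below are placeholders:

# Database name carried in the JDBC URL itself, not as a separate option
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;databaseName=test"

df = (spark.read
      .format("com.microsoft.sqlserver.jdbc.spark")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.my_table")   # hypothetical table
      .option("user", "<username>")
      .option("password", "<password>")
      .load())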
March Madness + Data. Here at Databricks we like to use (you guessed it) data in our daily lives. Today kicks off a series called Databrags. Databrags are glimpses into how Bricksters and community folks like you use data to solve everyday problems, e...
Folks, when I want to push data to Snowflake I need to use a stage for files before copying the data over. However, when I utilise the net.snowflake.spark.snowflake.Utils library and do a spark.write, as in...
spark.read.format("csv")
    .option("header", ...
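For reference, a hedged sketch of writing directly with the spark-snowflake connector, which (as far as I understand) moves the data through a temporary internal stage on its own; all connection values and the target table name below are placeholders:

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

(df.write
   .format("snowflake")             # shorthand for net.snowflake.spark.snowflake
   .options(**sf_options)
   .option("dbtable", "MY_TABLE")   # hypothetical target table
   .mode("append")
   .save())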
Hi Team, I am trying to run a streaming job in Databricks, using the Autoloader approach to read files in parquet format from Azure Data Lake Gen2. I have created a new checkpoint, so the first offset is getting created, but it is throwing an erro...
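For comparison, a minimal Auto Loader sketch with parquet input; all paths below are placeholders:

# Read parquet files incrementally with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")   # hypothetical
      .load("abfss://<container>@<account>.dfs.core.windows.net/<path>"))

# Write to a Delta target with a fresh checkpoint location
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/autoloader_demo")    # hypothetical
   .start("/mnt/delta/target"))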
When attempting to edit the schedule cron expression on one of our jobs, we receive the following error message: Cluster validation error: Validation failed for spark_conf, spark.databricks.acl.dfAclsEnabled must be false (is "true"). The spark.databric...
Hi there, I am trying to build a delta live tables pipeline that ingests gzip compressed archives as they're uploaded to S3. The archives contain 2 files in a proprietary format, and one is needed to determine how to parse the other. Once the file co...
So Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM. But in terms of operationalizing them, I am really confused about the best practice. Should we enable "optimized writes" by setting the following at a workspace level? spark.conf.set...
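For illustration, a hedged sketch of the two levels at which optimized writes can be enabled; note the spark.conf.set calls apply to the current cluster/session rather than the whole workspace, and the table name is a placeholder:

# Session/cluster level
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Per table, via table properties (table name hypothetical)
spark.sql("""
  ALTER TABLE my_db.my_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
  )
""")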
@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know! Thanks to @Werner Stinckens for jumping in, as always!
Hi Team, we have a scenario where we have to connect to Databricks SQL instance 1 from another Databricks instance 2 using a notebook or Azure Data Factory. Can you please help?
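One possible approach (a sketch, not a definitive recipe) is to query the SQL warehouse of instance 1 from a notebook in instance 2 using the databricks-sql-connector package; the hostname, HTTP path, token, and table name below are placeholders:

from databricks import sql   # pip install databricks-sql-connector

with sql.connect(server_hostname="<instance1-host>.azuredatabricks.net",
                 http_path="/sql/1.0/warehouses/<warehouse-id>",
                 access_token="<personal-access-token>") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM my_db.my_table LIMIT 10")   # hypothetical table
        rows = cursor.fetchall()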
Looking for best practices/examples on how to pull data (epics, features, PBIs) from Azure Boards into Databricks for analysis. Any ideas/help appreciated!
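One possible starting point (a sketch, assuming a personal access token and the Azure DevOps REST API are acceptable); the organization, project, token, and selected fields are placeholders:

import requests

org, project, pat = "<org>", "<project>", "<personal-access-token>"
wiql = {"query": ("SELECT [System.Id] FROM WorkItems "
                  "WHERE [System.WorkItemType] IN ('Epic', 'Feature', 'Product Backlog Item')")}

# Run a WIQL query to get matching work item ids
ids = [str(w["id"]) for w in requests.post(
    f"https://dev.azure.com/{org}/{project}/_apis/wit/wiql?api-version=7.0",
    json=wiql, auth=("", pat)).json()["workItems"]]

# Fetch the work item details (batch endpoint is limited to 200 ids per call)
items = requests.get(
    f"https://dev.azure.com/{org}/_apis/wit/workitems?ids={','.join(ids[:200])}&api-version=7.0",
    auth=("", pat)).json()["value"]

rows = [(i["id"],
         i["fields"].get("System.WorkItemType"),
         i["fields"].get("System.Title"),
         i["fields"].get("System.State")) for i in items]
df = spark.createDataFrame(rows, ["id", "type", "title", "state"])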
I have a DataFrame that I created from a couple of datasets and multiple operations. The DataFrame has multiple columns, one of which is an array of strings. But when I take the DataFrame and try to filter based upon the size of this array co...
Strange, it works fine here. What version of Databricks are you on? What you could do to identify the issue is output the query plan (.explain). Also, creating a new df for each transformation could help; that way you can check step by step where...
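For illustration, a small sketch of filtering on the size of an array column and printing the plan; df and the column name "tags" are placeholders:

from pyspark.sql import functions as F

filtered = df.filter(F.size(F.col("tags")) > 2)   # keep rows whose array has more than 2 elements
filtered.explain(True)                            # print the extended query plan to trace the issue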
We are building a delta live pipeline where we ingest csv files in AWS S3 using cloudFiles. And it is necessary to access the file modification timestamp of the file. As documented here, we tried selecting `_metadata` column in a task in delta live p...
Update: We were able to test the `_metadata` column feature in DLT "preview" mode (which uses DBR 11.0). Databricks doesn't recommend the "preview" channel for production workloads, but nevertheless, we're glad to be using this feature in DLT.
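For reference, a minimal sketch of selecting the file modification timestamp from the _metadata column in a cloudFiles read; the file format and paths below are placeholders, and inside a DLT table function you would simply return this DataFrame:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/tmp/schema")   # hypothetical
      .load("s3://<bucket>/<prefix>/")
      .select("*", "_metadata.file_path", "_metadata.file_modification_time"))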
Hello, currently we have a process that builds the bronze and silver zones with Delta tables, and when it reaches gold we must create specific zones for each client because the schema changes. For this we create separate databases and tables, but when ...
Hi @alexander grajales vanegas, are you creating all the databases and tables in the gold zone manually? If so, please check out DLT https://docs.databricks.com/data-engineering/delta-live-tables/index.html; it will take care of your complete pipeline by ...
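For illustration, a minimal DLT sketch that generates one gold table per client instead of creating databases and tables by hand; the client list, table names, and column names are hypothetical:

import dlt
from pyspark.sql import functions as F

clients = ["client_a", "client_b"]   # hypothetical client list

def make_gold_table(client):
    # Define a separate gold table for each client
    @dlt.table(name=f"gold_{client}")
    def gold():
        return dlt.read("silver_table").filter(F.col("client_id") == client)

for c in clients:
    make_gold_table(c)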
We have a Denodo big data platform hosted on Databricks. Recently we have been facing an exception with the message '[Simba][SparkJDBCDriver](500550)', which interrupts the Databricks connection after a certain time interval, usuall...
Hi All, we are also experiencing the same behavior: [Simba][SimbaSparkJDBCDriver] (500550) The next rowset buffer is already marked as consumed. The fetch thread might have terminated unexpectedly. Foreground thread ID: xxxx. Background thread ID: yyyy...
Hi Team, I am trying to get the latest files from an ADLS mount point directory. I am not sure how to extract the latest files and their last modified date using PySpark from an ADLS Gen2 storage account. Please let me know asap. Thanks!
I am looking forward to your re...
Hi @pankaj92, I wrote Python code to pick the latest file from the mnt location:

import os

path = "/dbfs/mnt/xxxx"
filelist = []
for file_item in os.listdir(path):
    filelist.append(file_item)
file = len(filelist)
print(filelist[file - 1])

Thanks
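A hedged variation on the same idea that picks the most recently modified file rather than the last entry returned by os.listdir (whose order is not guaranteed); the mount path is a placeholder:

import os

path = "/dbfs/mnt/xxxx"
files = [os.path.join(path, f) for f in os.listdir(path)]
latest = max(files, key=os.path.getmtime)   # newest file by modification time
print(latest)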