Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hello, I have a remote Azure SQL warehouse serverless instance that I can access using databricks-sql-connector; I can read/write/update tables with no problem. But I'm also trying to read/write/update tables using local PySpark + JDBC drivers, and when I ...
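For the local-PySpark side of a question like this, the usual shape is a JDBC read through the Databricks JDBC driver. A minimal sketch, assuming the warehouse hostname, HTTP path, token, and driver-jar path below are all placeholders you would substitute:

```python
# Hedged sketch: reading a Databricks SQL warehouse table from local PySpark
# over JDBC. Hostname, HTTP path, token, and jar path are placeholders.

def build_databricks_jdbc_url(host: str, http_path: str, token: str) -> str:
    """Assemble a JDBC URL in the format the Databricks JDBC driver expects."""
    return (
        f"jdbc:databricks://{host}:443/default;"
        f"transportMode=http;ssl=1;httpPath={http_path};"
        f"AuthMech=3;UID=token;PWD={token}"
    )

def read_table(url: str, table: str):
    # Runs only where pyspark and the Databricks JDBC driver jar are available.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .config("spark.jars", "/path/to/DatabricksJDBC42.jar")  # placeholder
             .getOrCreate())
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("driver", "com.databricks.client.jdbc.Driver")
            .option("dbtable", table)
            .load())
```

The URL builder is pure string assembly, so it can be sanity-checked without a cluster; the actual `read_table` call needs the driver jar on the local Spark classpath.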
Hi @amelia1 how are you?
What you got was indeed the top 5 rows (note the Row class). What does it show when you run display(df)?
I'm thinking it might be something related to your schema; since you did not define one, it can read the da...
This code fails with exception: [NOT_COLUMN_OR_STR] Argument `col` should be a Column or str, got Column.
File <command-4420517954891674>, line 7
      4 spark = DatabricksSession.builder.getOrCreate()
      6 df = spark.read.table("samples.nyctaxi.trips")
---->...
We are also seeing this error in 14.3 LTS from a simple example:
from pyspark.sql.functions import col
df = spark.table('things')
things = df.select(col('thing_id')).collect()
[NOT_COLUMN_OR_STR] Argument `col` should be a Column or str, got Column.
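One common cause of this particular message (my reading, not confirmed in the thread) is a mismatch between two different `Column` classes: a `Column` built by classic `pyspark.sql.functions.col` handed to a DataFrame backed by a Spark Connect / DatabricksSession. Passing plain string names sidesteps the mismatch entirely, since both flavors accept `str`:

```python
# Hedged sketch: select by string column names instead of Column objects,
# which avoids mixing classic and Spark Connect Column classes.

def as_column_names(*names: str) -> list[str]:
    """Validate that column references are plain strings before select()."""
    for n in names:
        if not isinstance(n, str):
            raise TypeError(f"expected a column name string, got {type(n).__name__}")
    return list(names)

def collect_ids(spark):
    # Runs only where a Spark session is available.
    df = spark.table("things")
    return df.select(*as_column_names("thing_id")).collect()
```

Alternatively, importing `col` from the same pyspark installation that created the session (rather than a second locally installed pyspark) should also resolve it.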
If you have a large dataset, you might want to export it to a bucket in Parquet format from your notebook:%python
df = spark.sql("select * from your_table_name")
df.write.parquet(your_s3_path)
I am trying to send email alerts to a non-Databricks user, using the Alerts feature available in SQL. Can someone help me with the steps? Do I first need to add a Notification Destination through Admin settings and then use this newly added desti...
Hi @Mitali Lad, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers yo...
I'm using Databricks RLS functions on my tables, and I need to run some MERGE INTO statements, but tables with RLS functions do not support merge operations (https://docs.databricks.com/en/data-governance/unity-catalog/row-and-column-filters.html#limitation...
Hi all, I'm facing an issue with my Spark Streaming job: it gets stuck in the "Stream Initializing" phase for more than 3 hours. I need your help to understand what happens internally in the "Stream Initializing" phase of a Spark Streaming job tha...
Background and requirements: We are reading data from our factory and storing it in a DLT table called telemetry with columns sensorid, timestamp, and value. We need to get rows where sensorid is “qrreader-x” and join them with some other data from that sam...
Hi @Mathias,
I'd say that watermarking might be a good solution for your use case. Please check Control late data threshold with multiple watermark policy in Structured Streaming.
If you want to dig in further, there's also: Spark Structured Streami...
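The watermarking suggestion above can be sketched roughly as follows. The table name, `sensorid`/`timestamp` columns, and the “qrreader-x” filter come from the post; the second table name, the 10-minute threshold, and the 1-minute join window are assumptions for illustration:

```python
# Hedged sketch of watermarked stream-stream joining for the telemetry case.
from datetime import datetime, timedelta

def is_dropped_by_watermark(event_time: datetime, max_event_time: datetime,
                            threshold: timedelta) -> bool:
    """Watermark semantics: events older than (max event time seen - threshold)
    may be dropped from streaming state instead of joined."""
    return event_time < max_event_time - threshold

def build_join(spark):
    # Runs only where a Spark session is available.
    from pyspark.sql.functions import expr
    telemetry = (spark.readStream.table("telemetry")
                 .filter("sensorid = 'qrreader-x'")
                 .withWatermark("timestamp", "10 minutes")  # assumed threshold
                 .alias("t"))
    other = (spark.readStream.table("other_events")  # hypothetical second table
             .withWatermark("timestamp", "10 minutes")
             .alias("o"))
    # Time-bounded join condition keeps state finite; window is an assumption.
    return telemetry.join(
        other,
        expr("t.sensorid = o.sensorid AND "
             "o.timestamp BETWEEN t.timestamp - interval 1 minute "
             "AND t.timestamp + interval 1 minute"))
```

The time-bounded join condition is what lets Spark expire old state; without it, a stream-stream join must buffer both sides indefinitely.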
Greetings community, I am new to using Databricks and for some time I have tried some scripts in notebooks. I would like your help on a task: carry out a personalized mailing where, first, a query of the number of records in the test table is performe...
Hi @EcuaCrisCar,
To query the number of records in your test table, you can use SQL or the DataFrame APIs in Databricks. Next, you'll need to check whether the record count falls within the specified range (80,000 to 90,000). If it does, proceed with the note...
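The count-then-branch step described above can be sketched as below. The table name `test` and the 80,000–90,000 bounds come from the thread; the downstream notebook path and the `dbutils.notebook.run` hand-off are assumptions:

```python
# Hedged sketch: count records, then proceed only when the count is in range.

def count_in_range(count: int, low: int = 80_000, high: int = 90_000) -> bool:
    """True when the record count falls inside the inclusive window."""
    return low <= count <= high

def run_if_in_range(spark, dbutils):
    # Runs only in a Databricks notebook, where spark and dbutils exist.
    count = spark.table("test").count()
    if count_in_range(count):
        # Hypothetical downstream notebook that sends the personalized mailing.
        dbutils.notebook.run("/Shared/send_personalized_mail", timeout_seconds=600)
    return count
```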
I am trying to convert a JSON string stored in a variable into a Spark DataFrame without specifying a schema, because I have a large number of different tables, so it has to be dynamic. I managed to do it with sc.parallelize, but since we are moving to Uni...
Hi @filipjankovic, Since you have multiple tables and need dynamic schema inference, I recommend using the following approach:
Schema Inference from JSON String: You can infer the schema from the JSON string and then create a DataFrame.
Schema I...
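A minimal sketch of the inference approach that avoids the RDD API entirely (relevant because `sc.parallelize` is unavailable on Unity Catalog shared clusters): parse the JSON string into plain Python rows first, then let `spark.createDataFrame` infer the schema. Variable names are illustrative:

```python
# Hedged sketch: JSON string -> DataFrame with inferred schema, no sc.parallelize.
import json

def parse_json_records(json_str: str) -> list[dict]:
    """Accept either a JSON array of objects or a single JSON object."""
    data = json.loads(json_str)
    return data if isinstance(data, list) else [data]

def json_string_to_df(spark, json_str: str):
    # Runs only where a Spark session is available; schema is inferred
    # from the Python values, so it adapts per table.
    return spark.createDataFrame(parse_json_records(json_str))
```

One caveat worth noting: inference from Python objects maps JSON numbers to long/double and cannot recover finer types (dates, decimals), so downstream casts may still be needed.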
Hi, I applied for the Databricks Certified Data Engineer Professional certification on 5th July 2023. The test was going fine for me, but suddenly there was an alert from the system (I think I was at a proper angle in front of the camera and was genuinely givin...
Hi @NikhilK1998, I'm sorry to hear your exam was suspended. Thank you for filing a ticket with our support team. Please allow the support team 24-48 hours to resolve.
In the meantime, you can review the following documentation:
Room requirements
Beh...
Despite following the steps mentioned in the provided link to create an instance profile, we encountered a problem in step 6, where we couldn't successfully add the instance profile to Databricks (Step 6: Add the instance profile to Databricks). https:/...
Hi @Avinash_Narala, The error message you provided indicates that the verification of the instance profile failed due to an AWS authorization issue. Specifically, the user associated with the assumed role arn:aws:sts::755231362028:assumed-role/databr...
Background: I'm working on a data pipeline to insert JSON files as quickly as possible. Here are the details of my setup: File size: 1.5–2 kB each. File volume: approximately 30,000 files per hour. Pipeline: using Databricks Delta Live Tables (DLT) in c...
Hi @MiBjorn,
Confirm that you're using the appropriate DLT product edition (Core, Pro, or Advanced) based on your workload requirements. You'll receive an error message if your pipeline includes features that are not supported by the selected edit...
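For the many-small-JSON-files shape of this workload (~30,000 files/hour per the post), the usual DLT ingestion pattern is an Auto Loader source. A minimal sketch, assuming the landing path and schema location are placeholders and that the code runs inside a DLT pipeline where `dlt` and `spark` are provided:

```python
# Hedged sketch: DLT + Auto Loader for a high-volume small-JSON landing zone.

def files_per_trigger(files_per_hour: int, trigger_seconds: int) -> float:
    """Rough expected file count per micro-batch at a given trigger interval."""
    return files_per_hour * trigger_seconds / 3600

def define_table():
    # Runs only inside a Delta Live Tables pipeline.
    import dlt

    @dlt.table(name="raw_json")
    def raw_json():
        return (spark.readStream.format("cloudFiles")  # noqa: F821 (spark is provided by DLT)
                .option("cloudFiles.format", "json")
                .option("cloudFiles.schemaLocation", "/tmp/schemas/raw_json")  # placeholder
                .load("/mnt/landing/json/"))  # placeholder landing path
```

At 30,000 files/hour, a one-minute trigger sees roughly 500 files per batch, which is worth keeping in mind when sizing the pipeline and choosing file-notification vs. directory-listing mode.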
Hi @sukanya09,
The query you provided includes a LocalTableScan node, which Photon does not fully support. The specific node you mentioned has several attributes, such as path, partitionValues, size, modificationTime, and more. Unfortunately, Photon e...
We tried upgrading to JDK 17. Using Spark version 3.0.5 and runtime 14.3 LTS. Getting this exception using parallelStream(). With Java 17 I am not able to parallel-process different partitions at the same time. This means when there is more than 1 partiti...