Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Prashanth24
by New Contributor III
  • 1538 Views
  • 2 replies
  • 0 kudos

Databricks Autoloader processing old files

I have implemented Databricks Autoloader and found that every time I execute the code, it still reads all old existing files plus the new files. As per the concept of Autoloader, it should read and process only new files. Below is the code. Please hel...

Latest Reply
RameshChejarla
New Contributor III
  • 0 kudos

Hi Prashanth, Auto Loader is reading only new files for me; can you please go through the below script. df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv").option("cloudFiles.schemaLocation", "path").option("recursiveFileLooku...
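
A minimal sketch of an Auto Loader stream that should pick up only new files, assuming a CSV landing directory; the volume paths and table name below are hypothetical placeholders. New-file tracking lives in the stream checkpoint, so reprocessing old files on every run usually points to a checkpoint location that changes or gets deleted between runs:

    # Read new CSV files incrementally with Auto Loader (paths are placeholders).
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/Volumes/main/default/schemas/orders")
        .load("/Volumes/main/default/landing/orders")
    )

    # The checkpoint records which files were already ingested; keep it stable across runs.
    (
        df.writeStream
        .option("checkpointLocation", "/Volumes/main/default/checkpoints/orders")
        .trigger(availableNow=True)
        .toTable("main.default.orders_bronze")
    )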

1 More Replies
kaushalshelat
by New Contributor II
  • 892 Views
  • 2 replies
  • 4 kudos

Resolved! I cannot see the output when using pandas_api() on spark dataframe

Hi all, I started learning Spark and Databricks recently, along with Python. While running the below lines of code, it did not throw any error and seemed to run OK, but it didn't show me the output either. test = cust_an_inc1.pandas_api() test.show() where cust_an_inc1 is...

Latest Reply
RiyazAliM
Honored Contributor
  • 4 kudos

Hi @kaushalshelat Ideally, `test.show()` should've thrown an error, as test is a pandas-on-Spark dataframe now. `.show()` is a Spark DF method and wouldn't work with pandas. If you want to see a subset of the data, try `.head()` or `.tail(n)` rather than `.show...
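
A small sketch of the suggested pattern, using a hypothetical DataFrame in place of cust_an_inc1: convert to pandas-on-Spark with pandas_api() and inspect it with head() instead of show():

    # Hypothetical stand-in for cust_an_inc1.
    sdf = spark.range(5).withColumnRenamed("id", "customer_id")

    pdf = sdf.pandas_api()   # pandas-on-Spark DataFrame, not a Spark DataFrame
    print(pdf.head(3))       # pandas-style head()/tail() work here; .show() does not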

1 More Replies
jeremy98
by Honored Contributor
  • 1146 Views
  • 4 replies
  • 0 kudos

How to fall back the entire job in case of cluster failure?

Hi community, My team and I are using a job that is triggered based on dynamic scheduling, with the schedule defined within some of the job's tasks. However, this job is attached to a cluster that is always running and never terminated. I understand th...

Latest Reply
RiyazAliM
Honored Contributor
  • 0 kudos

Hey @jeremy98 Have you had a chance to experiment with the Databricks Serverless offering? Typically, serverless spin-up times are around ~1 minute. It has built-in autoscaling based on the workload, which seems like a good fit for your use case. Check out more info f...

3 More Replies
suja
by New Contributor
  • 885 Views
  • 1 replies
  • 0 kudos

Exploring parallelism for multiple tables

I am new to Databricks. The app we need to build reads from Hive tables, goes through bronze, silver, and gold layers, and stores the results in relational DB tables. There are multiple Hive tables with no dependencies. What is the best way to achieve parallelism? Do w...

Latest Reply
lingareddy_Alva
Honored Contributor III
  • 0 kudos

Hi @suja Use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for a table. Tasks can: 1. Run in parallel ...
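
A hedged sketch of that layout using the Databricks SDK for Python (assuming it is installed and configured); the job name, table list, and notebook path are hypothetical, and compute settings are omitted for brevity. Tasks with no depends_on entries run in parallel:

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()
    tables = ["customers", "orders", "payments"]  # independent Hive tables

    # One task per table; no dependencies between them, so they run concurrently.
    w.jobs.create(
        name="bronze_silver_gold_per_table",
        tasks=[
            jobs.Task(
                task_key=f"process_{t}",
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Workspace/etl/process_table",
                    base_parameters={"table_name": t},
                ),
            )
            for t in tables
        ],
    )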

ABINASH
by New Contributor
  • 715 Views
  • 1 replies
  • 0 kudos

Flattening VARIANT column.

Hi Team, I am facing an issue. I have a JSON file which is around 700 KB and contains only 1 record; after reading the data and flattening the file, the record count is now 620 million. Now, while I am writing the DataFrame into Delta Lake, it is taking ...

Latest Reply
samshifflett46
New Contributor III
  • 0 kudos

Hey @ABINASH, the JSON file being flattened to 620 million records suggests the area of optimization would be to restructure the JSON file. My initial thought is that the JSON file is extremely nested, which is causing a large amount of redundant...
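
A synthetic illustration (not the poster's actual file) of why flattening nested sibling arrays multiplies rows: exploding two independent 1,000-element arrays in a single record yields 1,000 x 1,000 rows:

    from pyspark.sql import functions as F

    # One record containing two independent 1,000-element arrays.
    df = spark.createDataFrame(
        [(1, list(range(1000)), list(range(1000)))],
        "id INT, a ARRAY<INT>, b ARRAY<INT>",
    )

    # Each explode multiplies the row count by the array length.
    flat = df.withColumn("a", F.explode("a")).withColumn("b", F.explode("b"))
    print(flat.count())  # 1,000,000 rows from a single source record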

sondergaard
by New Contributor II
  • 1266 Views
  • 2 replies
  • 0 kudos

Simba ODBC driver // .Net Core

Hi, I have been looking into the Simba Spark ODBC driver to see if it can simplify our integration with .Net Core. The first results were promising, but when I started to process larger queries I started to notice out-of-memory exceptions in the conta...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

Something we're considering for a similar purpose (.NET Core service pulling data from Databricks) is the ADO.NET connector from CData: Databricks Driver: ADO.NET Provider | Create & integrate .NET apps

1 More Replies
ashraf1395
by Honored Contributor
  • 1171 Views
  • 1 replies
  • 0 kudos

Fetching the catalog and schema which is set in the DLT pipeline configuration

I have a DLT pipeline, and the notebook which is running in the DLT pipeline has some requirements. I want to get the catalog and schema which are set by my DLT pipeline. Reason for it: I have to specify my volume file paths etc., and my volume is on the sa...

Latest Reply
SP_6721
Honored Contributor
  • 0 kudos

Hi @ashraf1395 Can you try this to get the catalog and schema set by your DLT pipeline in the notebook: catalog = spark.conf.get("pipelines.catalog") and schema = spark.conf.get("pipelines.schema")
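
A short sketch of how those values could then be used inside the notebook, assuming the config keys behave as suggested in the reply above; the volume name raw_files is a hypothetical placeholder:

    catalog = spark.conf.get("pipelines.catalog")
    schema = spark.conf.get("pipelines.schema")

    # Build a Volumes path in the same catalog/schema the pipeline publishes to.
    volume_path = f"/Volumes/{catalog}/{schema}/raw_files"
    print(volume_path)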

ankit001mittal
by New Contributor III
  • 718 Views
  • 1 replies
  • 0 kudos

DLT Pipeline Stats on Object level

Hi Guys, I want to create a table where I store information about each DLT pipeline at the object/table ID level: details about how much time it took waiting for resources, how much time it took to run each object, and the number of records...

Data Engineering
dlt
system
Latest Reply
RiyazAliM
Honored Contributor
  • 0 kudos

Hi @ankit001mittal DLT event logs help you gather most of the information you've mentioned above. Below is the documentation for the DLT event logs: https://docs.databricks.com/aws/en/dlt/observability Let me know if you have any questions. Best,
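
A hedged sketch of pulling per-flow timing and row counts from the event log with the event_log() table-valued function; the pipeline ID is a placeholder, and the exact fields inside the details column may vary by release:

    # flow_progress events carry per-table run metrics such as output row counts.
    events = spark.sql("SELECT * FROM event_log('<pipeline-id>')")
    display(
        events.filter("event_type = 'flow_progress'")
              .select("timestamp", "origin.flow_name", "details")
    )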

Ekaterina_Paste
by New Contributor III
  • 20415 Views
  • 12 replies
  • 2 kudos

Resolved! Can't login to databricks community edition

I enter my valid login and password here https://community.cloud.databricks.com/login.html but it says "Invalid email address or password"

Latest Reply
Venkat124488
New Contributor II
  • 2 kudos

Databricks cluster is terminating every 15 seconds in Community Edition. Could you please help me with this issue?

11 More Replies
madrhr
by New Contributor III
  • 4890 Views
  • 4 replies
  • 3 kudos

Resolved! SparkContext lost when running %sh script.py

I need to execute a .py file in Databricks from a notebook (with arguments, which for simplicity I exclude here). For this I am using: %sh script.py. script.py: from pyspark import SparkContext def main(): sc = SparkContext.getOrCreate() print(sc...

Data Engineering
%sh
.py
bash shell
SparkContext
SparkShell
Latest Reply
madrhr
New Contributor III
  • 3 kudos

I eventually got it working with a combination of: from databricks.sdk.runtime import *, spark.sparkContext.addPyFile("/path/to/your/file"), and sys.path.append("path/to/your")
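
A sketch of that approach spelled out, assuming the script lives at a placeholder path and exposes a main() function: instead of launching it as a separate %sh process (which has no SparkContext of its own), register the file with the driver and import it so it reuses the notebook's Spark session:

    import sys
    from databricks.sdk.runtime import *  # provides the notebook's spark and dbutils

    spark.sparkContext.addPyFile("/path/to/your/script.py")  # ship the file to executors
    sys.path.append("/path/to/your")                          # make it importable on the driver

    import script  # hypothetical module name matching script.py
    script.main()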

3 More Replies
cookiebaker
by New Contributor III
  • 3543 Views
  • 7 replies
  • 6 kudos

Resolved! Some DLT pipelines suddenly seem to use runtime 16.1 instead of 15.4 since last night (CET)

Hello, Suddenly since last night on some of our DLT pipelines we're getting failures saying that our hive_metastore control table cannot be found. All of our DLT's are set up the same (serverless), and one Shared Compute on runtime version 15.4. For ...

Latest Reply
cookiebaker
New Contributor III
  • 6 kudos

@voo-rodrigo Hello, thanks for updating the progress on your end! I've tested as well and confirmed that the DLT can read the hive_metastore via Serverless again. 

6 More Replies
BrendanTierney
by New Contributor II
  • 6055 Views
  • 6 replies
  • 3 kudos

Resolved! Community Edition is not allocating Cluster

I've been trying to use the Community Edition for the past 3 days without success. I go to run a notebook and it begins to allocate the cluster, but it never finishes. Sometimes it times out after 15 minutes. Waiting for cluster to start: Finding i...

Latest Reply
JD2001
New Contributor II
  • 3 kudos

I am running into the same issue since today. It worked fine till yesterday.

5 More Replies
ZacayDaushin
by New Contributor
  • 2685 Views
  • 3 replies
  • 0 kudos

How to access system.access.table_lineage

I try to run a select from system.access.table_lineage, but I am not able to see the table. What permissions do I need to have?

Latest Reply
Nivethan_Venkat
Contributor III
  • 0 kudos

Hi @ZacayDaushin, To query a table in the system catalog, you need SELECT permission on the table in order to query it and see the results. Best Regards, Nivethan V
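
A hedged sketch of the grants an admin might issue so another principal can query that system table; the e-mail address is a placeholder, and granting on system schemas requires the appropriate admin rights:

    spark.sql("GRANT USE CATALOG ON CATALOG system TO `user@example.com`")
    spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA system.access TO `user@example.com`")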

2 More Replies
smpa01
by Contributor
  • 742 Views
  • 1 replies
  • 1 kudos

Resolved! tbl name as parameter marker

I am getting an error here when I do this: // this works fine: declare sqlStr = 'select col1 from catalog.schema.tbl LIMIT (?)'; declare arg1 = 500; EXECUTE IMMEDIATE sqlStr USING arg1; // this does not: declare sqlStr = 'select col1 from (?) LIMIT (?)';...

Latest Reply
lingareddy_Alva
Honored Contributor III
  • 1 kudos

@smpa01 In SQL EXECUTE IMMEDIATE, you can only parameterize values, not identifiers like table names, column names, or database names. That is, placeholders (?) can only replace constant values, not object names (tables, schemas, columns, etc.). SELECT...
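
A hedged sketch of one workaround from Python (Databricks Runtime 13.3+ assumed): bind the value with a named parameter marker and turn the table-name string into an object name with the IDENTIFIER clause; the table name below is a placeholder:

    # LIMIT is bound as a value; the table name goes through IDENTIFIER().
    df = spark.sql(
        "SELECT col1 FROM IDENTIFIER(:tbl) LIMIT :n",
        args={"tbl": "catalog.schema.tbl", "n": 500},
    )
    df.show()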

BF7
by Contributor
  • 1011 Views
  • 2 replies
  • 2 kudos

Resolved! Using cloudFiles.inferColumnTypes with inferSchema and without defining schema checkpoint

Two issues: 1. What is the behavior of cloudFiles.inferColumnTypes with and without cloudFiles.inferSchema? Why would you use both? 2. When can cloudFiles.inferColumnTypes be used without a schema checkpoint? How does that affect the behavior of cloud...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Behavior of cloudFiles.inferColumnTypes with and without cloudFiles.inferSchema: When cloudFiles.inferColumnTypes is enabled, Auto Loader attempts to identify the appropriate data types for columns instead of defaulting everything to strings, which i...
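
A minimal sketch of the option in context, assuming a JSON source and placeholder paths; cloudFiles.schemaLocation persists the inferred schema so it can evolve across runs:

    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")   # infer ints/doubles/timestamps instead of all strings
        .option("cloudFiles.schemaLocation", "/Volumes/main/default/schemas/events")
        .load("/Volumes/main/default/landing/events")
    )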

1 More Replies
