Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

AzureDatabricks
by New Contributor III
  • 13244 Views
  • 7 replies
  • 2 kudos

Resolved! Can we store 300 million records and what is the preferable compute type and config?

How can we persist 300 million records? What is the best option for persisting the data: the Databricks Hive metastore, Azure storage, or a Delta table? What limitations do Databricks Delta tables have in terms of data volume? We have a use case where testers should be...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

You can certainly store 300 million records without any problem. The best option kind of depends on the use case. If you want to do a lot of online querying on the table, I suggest using Delta Lake, which is optimized (using Z-order, bloom filter, par...
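A minimal sketch of the Delta Lake route mentioned in the reply, assuming a Databricks SparkSession is passed in and that `OPTIMIZE ... ZORDER BY` is available on the cluster. The function names, table name, and column names are hypothetical and only for illustration.

```python
def optimize_sql(table_name, zorder_columns):
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    columns = ", ".join(zorder_columns)
    return f"OPTIMIZE {table_name} ZORDER BY ({columns})"


def save_and_optimize(spark, df, table_name, zorder_columns):
    """Write df as a Delta table, then compact and Z-order it.

    Assumption: `spark` is a Delta-enabled SparkSession (for example the
    one provided in a Databricks notebook) and `df` is a Spark DataFrame.
    """
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    spark.sql(optimize_sql(table_name, zorder_columns))
```

In a notebook this might be called as `save_and_optimize(spark, df, "qa.records", ["customer_id"])`, picking as Z-order columns the ones the testers filter on most often.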

6 More Replies
AzureDatabricks
by New Contributor III
  • 8054 Views
  • 8 replies
  • 4 kudos

Resolved! Need to see all the records in DeltaTable. Exception - java.lang.OutOfMemoryError: GC overhead limit exceeded

Truncate=False is not working on a Delta table: df_delta.show(df_delta.count(), False). Compute size: Single Node, Standard_F4S, 8 GB memory, 4 cores. What is the maximum amount of data we can persist in a Delta table as Parquet files, and how fast can we retrieve it?
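One likely cause of the error is that `show(df.count(), False)` asks Spark to bring every row back to the driver, which a small single-node cluster may not hold in memory. A hedged sketch of the usual alternative, printing only a bounded sample without truncation, is below; `show_sample` is a hypothetical helper name.

```python
def show_sample(df, n=20):
    """Print at most n rows, untruncated, instead of the whole table.

    Assumption: `df` is a Spark DataFrame. Only n rows are fetched to
    the driver, so memory use stays bounded regardless of table size.
    """
    df.show(n, truncate=False)
```

For example, `show_sample(df_delta, 100)` prints the first 100 rows with full column values.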

Latest Reply
AzureDatabricks
New Contributor III
  • 4 kudos

thank you !!!

7 More Replies
Hola1801
by New Contributor
  • 3314 Views
  • 3 replies
  • 3 kudos

Resolved! Float Value change when Load with spark? Full Path?

Hello, I have created my table in Databricks. At this point everything is perfect: I get the same values as in my CSV. For my column "Exposure" I have: 0 0,00 1 0,00 2 0,00 3 0,00 4 0,00 ... But when I load my fi...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi @Anis Ben Salem, how do you read your CSV file? Do you use the Pandas or PySpark APIs? Also, how did you create your table? Could you share more details on the code you are trying to run?

2 More Replies
Abela
by New Contributor III
  • 3970 Views
  • 3 replies
  • 3 kudos

Resolved! Specify cluster name in notebook

Is there any way to specify a particular cluster to use from the Python cell of a notebook?

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Alina Bella​ - If werners' answer solved the issue, would you be happy to mark their answer as best? That will help others find the solution more easily in the future.

2 More Replies
sarvesh
by Contributor III
  • 8679 Views
  • 3 replies
  • 6 kudos

Resolved! Can we use spark-stream to read/write data from mysql? I can't find an example.

If someone can link me to an example where a stream is used to read from or write to MySQL, please do.

Latest Reply
Hubert-Dudek
Databricks MVP
  • 6 kudos

Writing (sink) is possible without problems via foreachBatch. I use it in production: a stream auto-loads CSVs from the data lake and writes them via foreachBatch to SQL (inside the foreachBatch function you have a temporary dataframe with the records and just use w...
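The foreachBatch pattern described in the reply could be sketched roughly as follows. This assumes a MySQL JDBC driver is installed on the cluster; the helper names, URL, table, and credentials are hypothetical placeholders, not taken from the thread.

```python
def jdbc_options(url, table, user, password):
    """Collect the Spark JDBC writer options for a MySQL target."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Assumption: MySQL Connector/J 8.x is available on the cluster.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def make_batch_writer(options):
    """Return a function suitable for DataStreamWriter.foreachBatch."""
    def write_batch(batch_df, batch_id):
        # batch_df is an ordinary (non-streaming) DataFrame for one micro-batch,
        # so the regular batch JDBC writer can be used on it.
        batch_df.write.format("jdbc").options(**options).mode("append").save()
    return write_batch


def start_stream(stream_df, options, checkpoint_dir):
    """Start the streaming query that appends each micro-batch to MySQL."""
    return (
        stream_df.writeStream
        .foreachBatch(make_batch_writer(options))
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Note that foreachBatch gives at-least-once delivery by default, so a batch may be written twice after a failure unless the target table is written idempotently.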

2 More Replies
AzureDatabricks
by New Contributor III
  • 9682 Views
  • 5 replies
  • 1 kudos

Parallel processing of json files in databricks pyspark

How can we read files from Azure Blob Storage and process them in parallel in Databricks using PySpark? As of now we are reading all 10 files at once into a dataframe and flattening it. Thanks & Regards, Sujata

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

spark.read.json("/mnt/dbfs/<ENTER PATH OF JSON DIR HERE>/*.json") You first have to mount your blob storage to Databricks; I assume that is already done. https://spark.apache.org/docs/latest/sql-data-sources-json.html
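As a small sketch of the wildcard read above: Spark distributes the matched files across tasks, so a single `spark.read.json` call on a glob already reads them in parallel. The helper names and the `multi_line` parameter are hypothetical; the mount path is whatever directory is already mounted.

```python
def json_glob(directory):
    """Build a glob that matches every .json file directly under directory."""
    return directory.rstrip("/") + "/*.json"


def read_json_dir(spark, directory, multi_line=False):
    """Read all JSON files in a mounted directory into one DataFrame.

    Assumption: `spark` is an active SparkSession and `directory` is
    already mounted. Set multi_line=True when each file holds a single
    JSON document spread over several lines.
    """
    return spark.read.json(json_glob(directory), multiLine=multi_line)
```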

4 More Replies
Anonymous
by Not applicable
  • 5664 Views
  • 5 replies
  • 0 kudos

Resolved! How to use a library installed in Databricks DBR from a standalone Spark JAR running from IntelliJ IDEA?

Hello, I tried without success to use several libraries installed by us in the Databricks 9.1 cluster (not provided by default in DBR) from a standalone Spark application run from IntelliJ IDEA. For instance, for connecting to Redshift it works onl...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Unfortunately, I did not find any solution. We have to package the JAR and run it from a Databricks job to test and debug. Not efficient, but no solution for remote debugging has been found or provided.

4 More Replies
Vibhor
by Contributor
  • 9550 Views
  • 5 replies
  • 13 kudos

Resolved! ADF Pipeline - Notebook Run time

In an ADF pipeline, can we specify that a notebook should exit and the pipeline proceed to the next notebook after some threshold, like 15 minutes? For example, I have a pipeline with notebooks scheduled in sequence and want the pipeline to keep running that notebook for a cert...

Latest Reply
jose_gonzalez
Databricks Employee
  • 13 kudos

Hi @Vibhor Sethi, there is a global timeout in Azure Data Factory (ADF) that you can use to stop the pipeline. In addition, you can use the notebook timeout in case you want to control it from your Databricks job.
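One way to control the timeout from the Databricks side, sketched here as an assumption rather than what the reply necessarily had in mind, is a driver notebook that calls `dbutils.notebook.run` with a timeout for each child notebook and moves on when one fails or times out. `run_with_timeout` and the notebook paths are hypothetical.

```python
def run_with_timeout(dbutils, notebook_paths, timeout_seconds=15 * 60):
    """Run notebooks in sequence, giving each one a time limit.

    Assumption: `dbutils` is the object available in a Databricks notebook.
    dbutils.notebook.run raises if the child notebook fails or does not
    finish within timeout_seconds; that error is recorded and the loop
    continues with the next notebook.
    """
    results = {}
    for path in notebook_paths:
        try:
            results[path] = dbutils.notebook.run(path, timeout_seconds, {})
        except Exception as exc:
            results[path] = f"failed or timed out: {exc}"
    return results
```

Calling `run_with_timeout(dbutils, ["/Shared/step1", "/Shared/step2"])` from a single ADF notebook activity would then keep the sequencing logic inside Databricks.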

4 More Replies
pantelis_mare
by Contributor III
  • 11745 Views
  • 2 replies
  • 1 kudos

Resolved! Dynamic Partition Pruning override

Hello everybody, another strange issue I have, and I would like you to confirm whether this is a bug or expected behaviour: I'm joining a large dataset with a dimension table and, as expected, DPP is activated. I was trying to deactivate the feature as it change...

Latest Reply
pantelis_mare
Contributor III
  • 1 kudos

Hello @Kaniz Fatma, thank you for taking the time to answer. The issue in this case was that spark.databricks.optimizer.deltaTableFilesThreshold was activating DPP even if it was formally deactivated by setting all available "enabled" properties to f...

1 More Replies
chrisreve89
by New Contributor II
  • 2109 Views
  • 1 replies
  • 2 kudos

Resolved! Databricks Spark Certification

Hello, I have been preparing for the certification for a while. I have seen here that the exam is mostly about remembering syntax details and some general understanding of Spark's internal architecture. I am just wondering if there are some exa...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 2 kudos

I recommend the practice tests on Udemy. There is also a practice exam available from Databricks training. I haven't found others.

Mahalakshmi
by New Contributor II
  • 2167 Views
  • 1 replies
  • 1 kudos

Resolved! Spark UI is not working for completed jobs

Spark UI is not working for completed jobs

Latest Reply
Hubert-Dudek
Databricks MVP
  • 1 kudos

Jobs executed from the Jobs API or Azure Data Factory, for example, are not available in the Spark management console. It can also be an issue with Community Edition or Spark settings.

lprevost
by Contributor III
  • 3827 Views
  • 1 replies
  • 1 kudos

Resolved! Schema inference CSV picks up \r carriage returns

I'm using: frame = spark.read.csv(path=bucket+folder, inferSchema=True, header=True, multiLine=True) to read in a series of CSV ...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 1 kudos

Files saved on the Windows operating system contain a carriage return and line feed at the end of every line. Please add the following option, it can help: .option("ignoreTrailingWhiteSpace", true)
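To illustrate where the stray `\r` comes from, and how the suggested option would be applied in PySpark, here is a short sketch. The sample data and the `read_csv_trimmed` helper are hypothetical; whether the option fully removes the carriage returns for a given file is something to verify on your own data.

```python
# A Windows-style CSV ends each line with "\r\n". Splitting on "\n" alone
# leaves the "\r" attached to the last field of every line.
windows_csv = "id,name\r\n1,alice\r\n"
header = windows_csv.split("\n")[0]
last_column = header.split(",")[-1]  # the column name still carries "\r"


def read_csv_trimmed(spark, path):
    """Read CSV files while asking Spark to drop trailing whitespace.

    Assumption: `spark` is an active SparkSession. The other options mirror
    the ones used in the question.
    """
    return (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .option("multiLine", True)
        .option("ignoreTrailingWhiteSpace", True)
        .csv(path)
    )
```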

missyT
by New Contributor III
  • 4410 Views
  • 1 replies
  • 4 kudos

Resolved! How to distinguish arrow-key from escape character with getch in C?

I want to know whether an arrow key or the escape character has been pressed. But in order to check which arrow key has been pressed I need to do multiple blocking getch calls, because the arrow-key sequence is longer than 1 char. This is a problem when I c...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 4 kudos

For arrow keys, getch() returns a sequence of keycodes rather than a single one: '\033', then '[', then a letter from A to D (up, down, right, left). So the code will be something like: if (getch() == '\033') { getch(); /* skip the '[' */ switch (getch()) { ...
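The decoding logic can be sketched independently of the terminal I/O. The version below is in Python and assumes the characters for one key press have already been collected (for example with a non-blocking read and a short timeout, which is not shown); under that assumption a lone escape character is simply an ESC with nothing after it. `decode_key` and `ARROWS` are hypothetical names, and the same branching carries over to C.

```python
ARROWS = {"A": "up", "B": "down", "C": "right", "D": "left"}


def decode_key(chars):
    """Classify the characters received for a single key press.

    chars is a string such as "q", "\033" (Escape alone) or "\033[A".
    """
    if not chars:
        return None
    if chars[0] != "\033":
        return chars[0]          # ordinary key
    if len(chars) == 1:
        return "escape"          # ESC with nothing following it
    if len(chars) >= 3 and chars[1] == "[" and chars[2] in ARROWS:
        return ARROWS[chars[2]]  # ESC [ A..D
    return "unknown"
```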

sarvesh
by Contributor III
  • 6850 Views
  • 3 replies
  • 4 kudos

Resolved! Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot modify the value of a Spark config: spark.executor.memory;

I am trying to read a 16 MB Excel file and I was getting a "GC overhead limit exceeded" error. To resolve that, I tried to increase my executor memory with spark.conf.set("spark.executor.memory", "8g"), but I got the following stack: Using Spark's default l...

Latest Reply
Prabakar
Databricks Employee
  • 4 kudos

On the cluster configuration page, go to the advanced options. Click it to expand the field. There you will find the Spark tab and you can set the values there in the "Spark config".
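Settings such as `spark.executor.memory` are fixed when the cluster starts, which is why they belong in the cluster's Spark config rather than in `spark.conf.set`. A hedged sketch of guarding runtime changes with `spark.conf.isModifiable` is below; `set_conf_if_modifiable` is a hypothetical helper.

```python
def set_conf_if_modifiable(spark, key, value):
    """Set a Spark config at runtime only if Spark allows it.

    Assumption: `spark` is an active SparkSession (Spark 2.4 or later,
    where RuntimeConfig.isModifiable is available). Returns True when
    the value was set, False when the key must instead be configured
    at cluster start.
    """
    if spark.conf.isModifiable(key):
        spark.conf.set(key, value)
        return True
    return False
```

With this guard, `set_conf_if_modifiable(spark, "spark.executor.memory", "8g")` would be expected to return False instead of raising the AnalysisException from the question.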

2 More Replies