Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Anonymous
by Not applicable
  • 4019 Views
  • 3 replies
  • 7 kudos

Resolved! How does 73% of the data go unused for analytics or decision-making?

Is Lakehouse the answer? Here's a good resource that was just published: https://dbricks.co/3q3471X

Latest Reply
Anonymous
Not applicable
  • 7 kudos

@Alexis Lopez - If @Dan Zafar's or @Harikrishnan Kunhumveettil's answer solved the issue, would you be happy to mark one of their answers as best so other members can find the solution more easily?

2 More Replies
Jreco
by Contributor
  • 18716 Views
  • 13 replies
  • 3 kudos

Event Hub streaming: improve processing rate

Hi all, I'm working with Event Hubs and Databricks to process and enrich data in real time. Doing a "simple" test, I'm getting some weird values (input rate vs. processing rate) and I think I'm losing data: as you can see, there is a peak with 5k record...

Latest Reply
jose_gonzalez
Databricks Employee
  • 3 kudos

Hi @Jhonatan Reyes​, how many Event Hubs partitions are you reading from? Your micro-batch takes a few milliseconds to complete, which I think is a good time, but I would like to understand better what you are trying to improve here. Also, in this case ...
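
A minimal sketch of the usual throughput knob for this source, assuming the azure-event-hubs-spark connector; the connection string and numbers are placeholders. The connector creates one Spark task per Event Hubs partition, and maxEventsPerTrigger caps how many events each micro-batch pulls when the input rate outpaces the processing rate:

conn_str = "Endpoint=sb://..."  # placeholder Event Hubs connection string
eh_conf = {
    # the connector expects the connection string encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
    "maxEventsPerTrigger": 50000,  # cap on events per micro-batch (placeholder value)
}

stream_df = (spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load())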

12 More Replies
Data_Bricks1
by New Contributor III
  • 6612 Views
  • 7 replies
  • 0 kudos

Data from 10 BLOB containers and multiple hierarchical folders (every-day and every-hour folders) in each container to a Delta Lake table in Parquet format - incremental loading of the latest data only, insert only, no updates

I am able to load data for a single container by hard-coding it, but not from multiple containers. I used a for loop, but the DataFrame ends up holding only the last container's last folder records. One more issue is that I have to flatten the data, when I ...

Latest Reply
Hubert-Dudek
Databricks MVP
  • 0 kudos

For sure the function (def) should be declared outside the loop; move it up after the library imports. The logic is a bit complicated, so you need to debug it by calling display(Flatten_df2) (or .show()) and validating the JSON after each iteration (using break or sleep, etc.).
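
A minimal sketch of that pattern, with the function defined once outside the loop and the per-container DataFrames unioned instead of overwritten (overwriting is why only the last container's data survives); container and account names are placeholders, and flatten stands in for the asker's Flatten_df2 logic:

from functools import reduce
from pyspark.sql import DataFrame

def flatten(df):
    # placeholder for the actual flattening logic; declared once, outside the loop
    return df

frames = []
for container in ["container1", "container2"]:  # placeholders for the 10 containers
    path = f"wasbs://{container}@mystorageacct.blob.core.windows.net/"  # hypothetical account
    frames.append(flatten(spark.read.parquet(path)))

combined = reduce(DataFrame.unionByName, frames)  # accumulate, don't overwrite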

6 More Replies
BorislavBlagoev
by Valued Contributor III
  • 3800 Views
  • 4 replies
  • 7 kudos

Resolved! Visualization of Structured Streaming in job.

Does Databricks have a feature or a good pattern to visualize data from Structured Streaming? Something like display in the notebook.

Latest Reply
BorislavBlagoev
Valued Contributor III
  • 7 kudos

I didn't know about that. Thanks!
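
The accepted answer is not quoted in the preview, but on Databricks the display() function accepts a streaming DataFrame and renders a live-updating table or chart, which is what the question asks for. A self-contained sketch using the built-in rate test source:

stream_df = (spark.readStream
    .format("rate")                  # test source emitting timestamp/value rows
    .option("rowsPerSecond", 5)
    .load())

display(stream_df)  # Databricks notebooks only; refreshes as micro-batches arrive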

3 More Replies
FMendez
by New Contributor III
  • 18581 Views
  • 3 replies
  • 6 kudos

Resolved! How can you mount an Azure Data Lake (gen2) using abfss and Shared Key?

I wanted to mount an ADLS Gen2 on Databricks and take advantage of the abfss driver, which should be better for large analytical workloads (is that even true in the context of DB?). Setting up OAuth is a bit of a pain, so I wanted to take the simpler approach...

Latest Reply
User16753724663
Databricks Employee
  • 6 kudos

Hi @Fernando Mendez​, the document below will help you mount ADLS Gen2 using abfss: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started.html
Could you please check if this helps?
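
For reference, a hedged sketch of the Shared Key mount the asker describes; the account, container, and secret-scope names are placeholders, and the OAuth flow in the linked document remains the documented route:

account = "mystorageacct"   # placeholder storage account
container = "raw"           # placeholder container
key = dbutils.secrets.get("my-scope", "storage-key")  # hypothetical secret scope

dbutils.fs.mount(
    source=f"abfss://{container}@{account}.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs={f"fs.azure.account.key.{account}.dfs.core.windows.net": key},
)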

2 More Replies
fymaterials_199
by New Contributor II
  • 1795 Views
  • 1 replies
  • 0 kudos

PySpark intermediate DataFrame consumes too much memory

I have PySpark code running on my local Mac, which has 6 cores and 16 GB. I run it in PyCharm to do a first test.

spark = (
    SparkSession.builder.appName("loc")
    .master("local[2]")
    .config("spark.driver.bindAddress", "localhost")
    .config("...

Latest Reply
fymaterials_199
New Contributor II
  • 0 kudos

Here is my input file:

EID,EffectiveTime,OrderHistory,dummy_col,Period_Start_Date
11,2019-04-19T02:50:42.6918667Z,"[{'Codes': [{'CodeSystem': 'sys_1', 'Code': '1-2'}], 'EffectiveDateTime': '2019-04-18T23:48:00Z', 'ComponentResults': [{'Codes': [{'CodeSy...

User16826994223
by Databricks Employee
  • 3370 Views
  • 1 replies
  • 1 kudos
Latest Reply
User16826994223
Databricks Employee
  • 1 kudos

# providing a starting version
spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 0) \
  .table("myDeltaTable")

# providing a starting timestamp
spark.readStream.format("delta") \
  .option("readCh...

User16826994223
by Databricks Employee
  • 2888 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

You must explicitly enable the change data feed option using one of the following methods:

New table: set the table property delta.enableChangeDataFeed = true in the CREATE TABLE command.

CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIE...
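
The preview cuts off mid-statement; a short sketch of both documented ways to set the property, wrapped in spark.sql() so it runs from a notebook (the student table is the example from the reply):

# enable the change data feed on a new table
spark.sql("""
    CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# or enable it on an existing table
spark.sql("""
    ALTER TABLE student
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")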

User16826994223
by Databricks Employee
  • 1798 Views
  • 1 replies
  • 0 kudos

Spark is reading data from the source even though I am persisting the data

Hi all, I am reading data, caching it, and then performing a count action to get the data into memory, but in the DAG I still see that it reads the data from the source every time.

Latest Reply
User16826994223
Databricks Employee
  • 0 kudos

It looks like the Spark memory is not sufficient to cache all the data, so it always reads from the source.
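
A minimal sketch of working around that, assuming a hypothetical source path: persisting with MEMORY_AND_DISK spills blocks to local disk instead of dropping them, so later actions reuse the cache rather than re-reading the source (verify under the Spark UI's Storage tab):

from pyspark import StorageLevel

df = spark.read.parquet("/path/to/source")      # placeholder path
df = df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk instead of evicting
df.count()                                      # action that materializes the cache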

User16826987838
by Databricks Employee
  • 1847 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 0 kudos

Yes, in your write stream you can save it as a table in the Delta format without a problem. In DBR 8, the default table format is Delta. See this code; please note that the "..." is supplied to show that additional options may be required: df.writeSt...
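
A sketch of where that code is likely heading, with the checkpoint path and table name as placeholders; toTable() is available from DBR 8 / Spark 3.1:

(df.writeStream
    .format("delta")        # the default on DBR 8+, stated here for clarity
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
    .toTable("events"))     # placeholder table name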

User16826992666
by Databricks Employee
  • 3053 Views
  • 1 replies
  • 1 kudos

Resolved! Does Databricks integrate with Immuta?

My company uses Immuta for data governance. Will Databricks be able to fit into our existing security patterns?

Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 1 kudos

Yes, check out the Immuta web page on the Databricks integration: https://www.immuta.com/integrations/databricks

brickster_2018
by Databricks Employee
  • 2597 Views
  • 1 replies
  • 0 kudos

Resolved! Why do I see data loss with Structured streaming jobs?

I have a Spark Structured Streaming job reading data from Kafka and loading it into a Delta table. I have some transformations and aggregations on the streaming data before writing to the Delta table.

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The typical reason for data loss in a Structured Streaming application is an incorrect value set for the watermark. Watermarking is done to ensure the application does not accumulate state over a long period. However, it should be ensured ...
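
A minimal sketch of setting a watermark; the DataFrame, column names, and thresholds are placeholders. Too tight a watermark silently drops late events (perceived as data loss), while too loose a one lets state grow:

from pyspark.sql.functions import window

agg = (events                                   # hypothetical streaming DataFrame
    .withWatermark("eventTime", "10 minutes")   # accept events up to 10 minutes late
    .groupBy(window("eventTime", "5 minutes"), "userId")
    .count())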

User16826994223
by Databricks Employee
  • 1642 Views
  • 1 replies
  • 0 kudos

How do we manage data recency in Databricks

I want to know how Databricks maintains data recency.

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

When using Delta tables in Databricks, you have the advantage of the Delta cache, which accelerates data reads by creating copies of remote files in the nodes' local storage, using a fast intermediate data format. At the beginning of each query, Delta tables au...
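
For reference, the cache is toggled with a cluster-level Spark conf, and Databricks SQL can pre-warm it for a table; the table name below is a placeholder:

spark.conf.set("spark.databricks.io.cache.enabled", "true")  # enable the Delta cache

# optionally pre-load a table into the cache (Databricks SQL extension)
spark.sql("CACHE SELECT * FROM my_delta_table")  # placeholder table name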

brickster_2018
by Databricks Employee
  • 3241 Views
  • 1 replies
  • 0 kudos

Resolved! Z-order or Partitioning? Which is better for Data skipping?

For Delta tables, which of Z-ordering and partitioning is the recommended technique for efficient data skipping?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Partition pruning is the most efficient way to ensure data skipping. However, choosing the right column for partitioning is very important. It's common to see the wrong partitioning column cause a large number of small-file problems ...
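
A short sketch contrasting the two techniques, with table and column names as placeholders: partition on a low-cardinality column that most queries filter on, and Z-order within files on a high-cardinality filter column:

(df.write.format("delta")
    .partitionBy("event_date")   # low-cardinality, filtered in most queries
    .saveAsTable("events"))      # placeholder table name

spark.sql("OPTIMIZE events ZORDER BY (user_id)")  # high-cardinality filter column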
