06-17-2023 03:53 AM
I'm learning Databricks for the first time, following a book with a 2020 copyright, so I imagine it might be a little outdated at this point. What I am trying to do is move data from an online source (in this specific case using a shell script, but I want to do it with Python too) into data storage so that I can use it later in the workflow. Attached is the code that the book gives. Everything runs just fine, but when I go to add the data using the DBFS method, the data does not exist. Is there another way to do this? I am set up on Azure, and my data storage is the hive_metastore with default as the database. Putting it there is fine for now. I have attached the .dbc file; I hope that helps. If not, I can attach something else. This is my first time here, so I'll have to figure out how to change this stuff later. Thank you in advance.
06-17-2023 05:28 PM
Well, you can do a put, writing the file to local storage, but it's recommended to set up an external location so files and tables live in your own managed cloud storage. For now, try this (note that dbutils.fs.put takes the target path first, the file contents as a string second, and an optional overwrite flag)...
dbutils.fs.put("/your/desired/path/your_file", "<file contents as a string>", True)
Let me know if you have any issues. As you start setting up the external location, let me know and I'll assist you.
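Roughly, the whole flow in Python would look like the sketch below (the URL, file name, and DBFS path are placeholders I made up, so swap in your own): download the file to the driver's local disk, then copy it into DBFS so the rest of the workflow can read it.

# Sketch only -- the URL and paths are placeholders.
import urllib.request

# 1. Download from the online source to the driver's local filesystem.
local_path = "/tmp/my_data.csv"
urllib.request.urlretrieve("https://example.com/my_data.csv", local_path)

# 2. Copy the local file into DBFS so it persists and is visible to the workspace.
dbutils.fs.cp(f"file:{local_path}", "dbfs:/FileStore/my_data.csv")

# 3. Confirm it landed where expected.
display(dbutils.fs.ls("dbfs:/FileStore/"))

The same two steps work from a shell cell too: fetch the file under %sh, then copy it in with dbutils.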
06-18-2023 04:44 AM
Hi @Chris Sarrico
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
06-18-2023 05:13 AM
The response came in the middle of the night, so I had not had a chance to try it. No, I have not been successful with this. I set my desired path to 'hive_metastore'.'default', got a response of OK, but the file never showed up, so this didn't work either. It seems as if Databricks needs some actual training videos or decent documentation on how to do things. I need to be able to access my data. Please help.
06-18-2023 05:20 AM
The file won't show up in blob storage until you set up the external location and use the blob path to write/read files. The hive metastore is only internal to the Databricks workspace. Did you set one up yet? Let me know.
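Once the external location exists, reads and writes just use the cloud storage path directly. Here is a rough sketch (the storage account, container, and folder names are placeholders, not your real ones):

# Sketch only -- replace the account, container, and path with your own external location.
ext_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/raw/my_data.csv"

# A tiny example DataFrame just to have something to write.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Write to the external location...
df.write.mode("overwrite").option("header", True).csv(ext_path)

# ...and read it back later in the workflow.
display(spark.read.option("header", True).csv(ext_path))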
06-18-2023 05:35 AM
No, I do not know how to do that. I think that is what I am trying to do with the data lake issue that you are helping me with in the other thread. I did, however, figure out how to import from MySQL, but not how to export to it yet, so there is some progress.
06-18-2023 05:50 AM
Okay, good. Just read the docs I sent you and work through that other thread, and it should solve both. Let me know what roadblocks you hit.
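For the MySQL export side specifically, the same Spark JDBC connector you used for the import can write back out. A rough sketch (the hostname, database, table names, and credentials below are all placeholders, and the MySQL JDBC driver has to be installed on the cluster):

# Sketch only -- hostname, database, tables, and credentials are placeholders.
jdbc_url = "jdbc:mysql://myhost.example.com:3306/mydb"
props = {
    "user": "myuser",
    "password": "mypassword",
    "driver": "com.mysql.cj.jdbc.Driver",  # requires the MySQL Connector/J jar on the cluster
}

# Import: read a MySQL table into a Spark DataFrame.
df = spark.read.jdbc(url=jdbc_url, table="source_table", properties=props)

# Export: write a DataFrame back out to MySQL.
df.write.jdbc(url=jdbc_url, table="target_table", mode="append", properties=props)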
01-18-2024 06:53 AM - edited 01-18-2024 06:54 AM
In Databricks, you can install external libraries by going to the Clusters tab, selecting your cluster, and adding the Maven coordinates for Deequ under the cluster's Libraries section.
In your notebook or script, Deequ has to be on the classpath when the Spark session is created. On Databricks the cluster library install above takes care of this; if you create the session yourself, you can pass the dependency through the spark.jars.packages configuration option at session build time (setting it on an already-running session has no effect).
// Only effective when the session is created; on Databricks the cluster library install above handles this.
val spark = SparkSession.builder().config("spark.jars.packages", "com.amazon.deequ:deequ:1.4.0").getOrCreate()
Write your data quality checks using Deequ functions. For example:
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.checks.{Check, CheckLevel}

val verificationResult: VerificationResult = VerificationSuite()
  .onData(yourDataFrame)                      // the DataFrame you want to validate
  .addCheck(
    Check(CheckLevel.Error, "data quality checks")
      .isComplete("yourColumn")               // define your data quality constraints here
  )
  .run()
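If I remember the Deequ API correctly, the returned verificationResult exposes a status you can compare against CheckStatus.Success to decide whether downstream steps should run, and the per-constraint results can be inspected from the same object.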