Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to get data scraped from the web into your data storage

ChrisS
New Contributor III

I am learning Databricks for the first time, following a book copyrighted in 2020, so I imagine it might be a little outdated at this point. What I am trying to do is move data from an online source (in this specific case using a shell script, but I want to do it with Python too) into data storage so that I can use it later in the workflow. Attached is the code the book gives. Everything runs just fine, but when I go to add the data using the DBFS method, the data does not exist. Is there another way to do this? I am set up on Azure, and my data storage is the hive_metastore with default as the database. Putting it there is fine for now. I have attached the .dbc file; I hope that helps. If not, I can attach something else. This is my first time here, so I have to figure out how this stuff works later. Thank you in advance.

7 REPLIES

etsyal1e2r3
Honored Contributor

Well, you can do a put, writing the file to DBFS, but it's recommended to set up an external location so files and tables live in your own managed cloud storage. For now, try this (note that dbutils.fs.put takes the destination path first and the file contents second):

dbutils.fs.put("/your/desired/path/file.txt", "<file contents as a string>", True)

Let me know if you have any issues. As you start setting up the external location let me know and ill assist you.
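For the broader goal of pulling a file from the web into storage with Python, a minimal sketch might look like this (the URL and paths are placeholders, and this assumes a workspace where dbfs:/ is writable):

import urllib.request

# Hypothetical source URL and local staging path -- replace with your own.
url = "https://example.com/data.csv"
local_path = "/tmp/data.csv"

# Download to the driver's local disk first.
urllib.request.urlretrieve(url, local_path)

# Copy from the driver's local filesystem into DBFS;
# the "file:" prefix marks the source as local.
dbutils.fs.cp(f"file:{local_path}", "dbfs:/FileStore/data.csv")

# Confirm the file actually landed.
display(dbutils.fs.ls("dbfs:/FileStore/"))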

Anonymous
Not applicable

Hi @Chris Sarrico,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

ChrisS
New Contributor III

The response came in the middle of the night, so I had not had a chance to try it until now. No, I have not been successful with this. I put 'hive_metastore'.'default' as my desired path and got a response of "ok", but the file never showed up, so this didn't work either. It seems as if Databricks needs some actual training videos or decent documentation on how to do things. I need to be able to access my data. Please help.
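One source of confusion worth noting here: 'hive_metastore'.'default' is a catalog and database for tables, not a filesystem path, so dbutils.fs.put cannot write into it. A sketch of the distinction, with hypothetical file and table names:

# A filesystem path is what dbutils.fs.put expects.
dbutils.fs.put("dbfs:/FileStore/example.csv", "id,name\n1,alice\n", True)

# To make the data appear under hive_metastore.default, load the file
# into a DataFrame and save it as a table.
df = spark.read.option("header", True).csv("dbfs:/FileStore/example.csv")
df.write.mode("overwrite").saveAsTable("default.example_table")

# The table is now visible in the metastore.
spark.sql("SHOW TABLES IN default").show()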

etsyal1e2r3
Honored Contributor

The file won't show up in blob storage until you set up the external location and use the blob path to write/read files. The hive metastore is only internal to the Databricks workspace. Did you set one up yet? Let me know.
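On Azure, one common pattern for this (a sketch, assuming an ADLS Gen2 account authenticated with an account key held in a secret scope; the account, container, and scope names are placeholders) is to set the storage credential on the Spark session and then write with an abfss:// path:

# Hypothetical storage account and container names.
storage_account = "mystorageaccount"
container = "mycontainer"

# Authenticate with the account key; service principals or Unity Catalog
# external locations are the preferred options for production.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"))

# Write a DataFrame directly to your own cloud storage.
path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/raw/data"
df.write.mode("overwrite").parquet(path)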

ChrisS
New Contributor III

No, I do not know how to do that. I think that is what I am trying to do with the data lake issue you are helping me with on the other thread. I did, however, figure out how to import from MySQL, but not how to export to it yet, so there is some progress.
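For the MySQL export piece, the JDBC connector used for importing also works in the other direction via df.write (a sketch; the host, database, table, and secret names are placeholders):

# Write a DataFrame out to a MySQL table over JDBC.
df.write.format("jdbc") \
    .option("url", "jdbc:mysql://<host>:3306/<database>") \
    .option("dbtable", "<target_table>") \
    .option("user", dbutils.secrets.get("my-scope", "mysql-user")) \
    .option("password", dbutils.secrets.get("my-scope", "mysql-password")) \
    .mode("append") \
    .save()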

etsyal1e2r3
Honored Contributor

Okay, good. Just read the docs I sent you and work through that other thread, and it should solve both. Let me know what roadblocks you hit.

CharlesReily
New Contributor III

In Databricks, you can install external libraries by going to the Clusters tab, selecting your cluster, and then adding the Maven coordinates for Deequ under Libraries.

In your notebook or script, the Deequ library has to be on the cluster's classpath before the Spark session starts. Setting spark.jars.packages with spark.conf.set at runtime has no effect, so install the Maven coordinate on the cluster as described above, or pass the package when launching Spark yourself, for example:

spark-shell --packages com.amazon.deequ:deequ:1.4.0

Write your data quality checks using Deequ functions. For example:

import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

val verificationResult: VerificationResult = VerificationSuite()
  .onData(yourDataFrame)                  // the DataFrame to validate
  .addCheck(
    Check(CheckLevel.Error, "data quality checks")
      .isComplete("yourColumn")           // e.g. the column must contain no nulls
  )
  .run()

// Inspect the overall outcome.
if (verificationResult.status != CheckStatus.Success) {
  println("Some data quality checks failed")
}
