Get Started Discussions

SQL table convert to R dataframe

tt_mmo
New Contributor

I have a table with ~6 million rows. I am attempting to convert it from a SQL table in my catalog to an R dataframe so I can use the tableone package. I split my table into 3 tables of about 2 million rows each, ran each one through tbl() and as.data.frame(), and finally merged the 3 tables back together. This worked, but is there a way to do this without subsetting my original 6-million-row dataset? I've been troubleshooting for about a week now. Thanks.

Code below:

 

```R
%r
# Install packages if not already installed
if (!require("sparklyr")) install.packages("sparklyr")
if (!require("dplyr")) install.packages("dplyr")

library(sparklyr)
library(dplyr)

# Configure Spark memory settings
config <- spark_config()
config$spark.driver.memory <- "32G"
config$spark.executor.memory <- "32G"
config$spark.memory.fraction <- 0.8

# Connect to Databricks
sc <- spark_connect(method = "databricks", config = config)

# Reference the three pre-split tables in the catalog
m1 <- tbl(sc, "final_lab_mstr_3_1")
m2 <- tbl(sc, "final_lab_mstr_3_2")
m3 <- tbl(sc, "final_lab_mstr_3_3")

# Collect each Spark DataFrame into an R data frame
df_1 <- as.data.frame(collect(m1))
df_2 <- as.data.frame(collect(m2))
df_3 <- as.data.frame(collect(m3))

# Merge the three data frames back together
df3 <- rbind(df_1, df_2, df_3)
```

 

1 REPLY

BigRoux
Databricks Employee

To handle a large SQL table (~6 million rows) and convert it into an R dataframe without splitting it into smaller subsets, you can use more efficient strategies and tools that are optimized for large datasets. Here are some recommendations:

1. Use `dbplyr` for Lazy Loading
Instead of loading the entire dataset into R memory, you can use the `dbplyr` package to interact with the SQL table directly. This allows you to perform operations on the database without fully loading the data into R. The `collect()` function can then be used to retrieve only the required subset.

```R
library(dplyr)
library(dbplyr)

# Connect to the database
con <- DBI::dbConnect(
  odbc::odbc(),
  Driver   = "SQL Server",
  Server   = "your_server",
  Database = "your_database",
  UID      = "your_username",
  PWD      = "your_password"
)

# Reference the SQL table
sql_table <- tbl(con, "final_lab_mstr")

# Perform operations directly on the database
filtered_data <- sql_table %>%
  filter(some_condition) %>%  # Apply filters if needed
  select(column1, column2)    # Select specific columns if possible

# Collect data into an R dataframe when ready
df <- collect(filtered_data)
```

This approach minimizes memory usage by pushing computations to the database and only pulling the required data into R.
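Since the original code already runs inside a Databricks notebook with a sparklyr connection, the same lazy-evaluation idea works without a separate ODBC connection: dplyr verbs on a `tbl(sc, ...)` reference are translated to Spark SQL and executed on the cluster, and only the final result is collected into R. A minimal sketch, assuming the `sc` connection from the original post and placeholder filter/column names:

```R
library(sparklyr)
library(dplyr)

# Lazy reference to the full 6-million-row table (no data is pulled yet)
lab <- tbl(sc, "final_lab_mstr")

# Filtering and column pruning run on the Spark cluster;
# the column names here are placeholders for the real schema
slim <- lab %>%
  filter(lab_date >= "2023-01-01") %>%
  select(patient_id, lab_name, lab_value)

# Only the reduced result crosses into R memory
df <- collect(slim)
```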

2. Optimize Memory Usage with `data.table`
If you need to load the entire dataset into R, consider using the `data.table` package, which is highly optimized for handling large datasets in memory.

```R
library(data.table)

# Convert SQL query result to a data.table directly
df <- setDT(collect(tbl(con, "final_lab_mstr")))
```

The `data.table` package is faster and more memory-efficient than base R data frames.

3. Increase Memory Efficiency
Ensure your R session has sufficient memory allocated. If you're using Spark as in your example, configure Spark with higher memory limits (as you've done with `spark.driver.memory` and `spark.executor.memory`). Additionally:
- Use column selection in your SQL query or R script to reduce unnecessary data transfer.
- Avoid unnecessary `as.data.frame()` calls; stick with `data.table` or tibbles (see the sketch after this list).
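Here is a small sketch of that last point, assuming the `sc` connection from the original post: `collect()` already returns a tibble, so wrapping it in `as.data.frame()` only makes an extra copy, and `data.table::setDT()` converts in place without copying.

```R
library(sparklyr)
library(dplyr)
library(data.table)

# collect() already returns a tibble; no as.data.frame() copy is needed
m1 <- tbl(sc, "final_lab_mstr_3_1") %>% collect()

# If data.table semantics are wanted, convert in place rather than copying
setDT(m1)
```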

4. Use Chunked Reading
If you cannot avoid splitting due to memory constraints, use chunked reading via SQL queries that limit the rows returned per chunk (or `data.table::fread()` when reading from files on disk).

```R
library(data.table)
library(DBI)

# Example: read chunks of 1 million rows at a time
chunk_size <- 1e6
offset <- 0
df <- data.table()  # accumulator for the collected chunks

repeat {
  # NOTE: add an ORDER BY clause so LIMIT/OFFSET paging is deterministic
  query <- paste0("SELECT * FROM final_lab_mstr LIMIT ", chunk_size, " OFFSET ", offset)
  chunk <- dbGetQuery(con, query)

  if (nrow(chunk) == 0) break

  # Append the chunk to the accumulated data.table
  df <- rbindlist(list(df, setDT(chunk)), use.names = TRUE, fill = TRUE)

  offset <- offset + chunk_size
}
```

5. Consider Using Arrow or Parquet Formats
For very large datasets, consider exporting the SQL table as a Parquet file and then using the `arrow` package in R for efficient reading.

```R
library(arrow)

# Load Parquet file into an R dataframe
df <- read_parquet("path_to_parquet_file")
```

Parquet files are highly compressed and optimized for large-scale data processing.
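If you go this route from a Databricks notebook, one way to produce the Parquet data is to write the catalog table out with sparklyr and then read it back with `arrow`. A rough sketch, with a placeholder output path; note that Spark writes a directory of part files, so `arrow::open_dataset()` is usually more convenient than `read_parquet()` on a single file:

```R
library(sparklyr)
library(arrow)
library(dplyr)

# Write the table to Parquet at a path you control (placeholder path below)
spark_write_parquet(
  tbl(sc, "final_lab_mstr"),
  path = "/Volumes/my_catalog/my_schema/my_volume/final_lab_mstr_parquet"
)

# Spark writes a directory of part files; open it as a dataset and collect
ds <- open_dataset("/Volumes/my_catalog/my_schema/my_volume/final_lab_mstr_parquet")
df <- collect(ds)
```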

6. Pre-Aggregate Data in SQL
If possible, perform summarization or filtering directly in SQL before pulling data into R. This reduces both memory usage and transfer time.

```SQL
SELECT column1, column2, AVG(column3) AS avg_col3
FROM final_lab_mstr
GROUP BY column1, column2;
```

Then load the aggregated result into R using `dbGetQuery()` or similar functions.
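For example, a minimal sketch of pulling that aggregated result straight into R over the same `con` connection used earlier:

```R
library(DBI)

# The aggregation runs on the database; only the summary rows come back to R
agg <- dbGetQuery(con, "
  SELECT column1, column2, AVG(column3) AS avg_col3
  FROM final_lab_mstr
  GROUP BY column1, column2
")
```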

By leveraging these techniques, you can handle large datasets more efficiently without splitting them manually. The best choice depends on your specific use case and the computational resources available.
