Women in Data & AI
Group Hub Activity
Hello, Data & AI Enthusiasts!
In today's data-driven world, organizations manage enormous volumes of data. With that volume comes the responsibility to protect personal information, which is increasingly important under data privacy regulations such as GDPR and CCPA. For us, as women in tech, it's vital to be at the forefront of understanding and implementing these regulations to ensure both privacy and innovation.
What are GDPR and CCPA?
GDPR (General Data Protection Regulation):
Introduced by the European Union (EU), GDPR regulates how the personal data of EU citizens is collected, processed, and stored.
GDPR defines eight fundamental data subject rights, including the rights to be informed, of access, to rectification, to erasure, and to withdraw consent.
Non-compliance can result in fines of up to €20 million or 4% of global annual turnover, whichever is higher.
CCPA (California Consumer Privacy Act):
Similar to GDPR but focused on California residents. CCPA gives individuals control over how their personal information is used.
CCPA grants data subject rights similar to GDPR's, such as the right to know, the right to delete, the right to opt out of the sale of personal information, and the right to limit the use and disclosure of sensitive personal information.
Both regulations are designed to protect individuals' privacy and ensure ethical handling of personal data.
Challenges in Managing Personal Data in Data Lakes
While data lakes store vast amounts of information, they pose several challenges:
Data Spread: Personal data is often difficult to locate across large datasets.
Slow Queries: Finding personal data can be slow and costly.
No Row-level Updates/Deletes: Data lakes lack efficient ways to delete or update personal data.
Lack of ACID Transactions: Updates can cause data inconsistencies.
Data Hygiene Issues: Ensuring clean, compliant data is difficult.
How Delta Lake Addresses These Challenges
Databricks Delta Lake offers solutions to these challenges, making it easier to stay compliant with GDPR and CCPA while managing large volumes of data:
Data Anonymization: Personal data can be pseudonymized or anonymized in place, so records can no longer be traced back to an individual.
Automatic Data Removal: Pipelines and bucket retention policies can be set up to expire raw data automatically, helping meet retention requirements.
Locate and Remove Personal Identifiers: Provides mechanisms to find and delete personal identifiers, destroying the linkage to the data subject.
ACID Transactions: Ensures safe, consistent row-level updates and deletions of personal data.
Data Hygiene: Supports ongoing data cleanup and auditing in line with privacy regulations.
As we address these challenges, it's crucial to increase women's participation in AI and data privacy. Women's unique perspectives can lead to more ethical and effective privacy solutions in our data-driven world. Let's discuss and share our experiences to collectively improve our data management practices.
As we embrace the season of joy and reflection, let's take a moment to celebrate the amazing women in Data and AI who are shaping the future!
What's been your most memorable achievement in Data & AI this year? Who's been your biggest source of inspiration or support on this journey?
This season is all about gratitude and connection, so let's celebrate your successes, honor your mentors, and spotlight the community that empowers us all.
Tag a fellow woman in tech who inspires you, or share your story in the comments below. Together, let's spread the cheer and break even more barriers in the year ahead!
List of Transformations and Actions Used in Apache Spark DataFrames
Transformations:
Transformations are operations on DataFrames that return a new DataFrame. They are lazily evaluated: they do not execute immediately, but build a logical plan that runs only when an action is performed.
1. Basic Transformations:
select(): Select specific columns.
filter() / where(condition): Filter rows based on a condition.
withColumn(): Add or replace a column.
drop(*cols): Return a new DataFrame with the given columns removed.
distinct(): Remove duplicate rows.
sort() / orderBy(): Sort the DataFrame by one or more columns.
2. Aggregation and Grouping:
groupBy(): Group rows by column values.
agg(): Aggregate grouped data using functions.
count(): Count rows per group.
sum(*cols), avg(*cols), min(*cols), max(*cols): Compute the sum, average, minimum, or maximum for each numeric column.
3. Joining DataFrames:
join(other, on=None, how=None): Join with another DataFrame using the given join expression.
union(): Combine two DataFrames with the same schema.
intersect(): Return the rows common to both DataFrames.
4. Advanced Transformations:
withColumnRenamed(): Rename a column.
dropDuplicates(): Drop duplicate rows based on selected columns.
sample(): Sample a fraction of rows.
limit(): Limit the number of rows.
5. Window Functions:
over(windowSpec): Apply a ranking or aggregate function over the given window specification.
row_number().over(windowSpec): Assign a row number, starting at 1, within each window partition.
rank().over(windowSpec): Assign the rank of each row within its window partition.
Actions:
Actions trigger execution of the accumulated transformations and either return a result to the driver program or write data to an external storage system.
1. Basic Actions:
show(): Display the top rows of the DataFrame.
collect(): Return all rows as an array.
count(): Count the number of rows.
take(n) / head(n): Return the first n rows as an array.
first(): Return the first row.
2. Writing Data:
write: Access the DataFrameWriter to write the DataFrame to external storage.
write.mode(): Specify the save mode (e.g., overwrite, append).
save(): Save the DataFrame to a specified path.
toJSON(): Convert the DataFrame to a dataset of JSON strings.
3. Other Actions:
foreach(): Apply a function to each row.
foreachPartition(): Apply a function to each partition.
Code Optimization in PySpark
While exploring, I discovered an excellent document by Deepa Vasanthkumar that effectively explains Spark code optimization techniques. Please refer to the attached document. Like and reshare!
Hello, amazing women of the Databricks AI community!
As we continue to break barriers and make strides in the AI space, it's essential that we come together to share our experiences, insights, and challenges. Whether you're a seasoned expert or just starting your journey in AI, we're all here to support and uplift one another.
Let's kick off a conversation:
What was the pivotal moment that inspired you to dive into AI?
What challenges have you faced as a woman in the tech space, and how did you overcome them?
Any tips or resources that have helped you in your AI journey?
Drop your thoughts below! Let's spark some inspiring conversations and build a network that champions women in AI.