
Author Posts - Data Science

4 Articles

Enhancing Data Quality with Cleanlab

Prakhar Mishra
11 Dec 2024
10 min read
Introduction

It is a well-established fact that your machine-learning model is only as good as the data it is fed. An ML model trained on bad-quality data usually suffers from a number of issues. Here are a few ways that bad data might affect machine-learning models:

1. Errors, missing values, or other irregularities in low-quality data can lead to wrong predictions. If the data used for training is unreliable, the model's predictions are likely to be inaccurate.
2. Bad data can also bias the model. If the data is not representative of real-world situations, the model can learn and reinforce these biases, which can result in discriminatory predictions.
3. Poor data limits the ability of the ML model to generalize to fresh data, because it may not faithfully depict the underlying patterns and relationships.
4. Models trained on bad-quality data might need more retraining and maintenance, which can increase the overall cost and complexity of model deployment.

As a result, it is critical to devote time and effort to data preprocessing and cleaning in order to reduce the impact of bad data on ML models. Furthermore, to ensure the model's dependability and performance, it is often necessary to use domain knowledge to recognize and address data quality issues.

It might come as a surprise, but even gold-standard datasets like ImageNet, CIFAR, MNIST, 20News, and more contain labeling issues. A couple of examples: in the Amazon sentiment review dataset, two reviews originally labeled Neutral were judged Positive by both Cleanlab and Mechanical Turk (which is correct); in the MNIST dataset, two images originally labeled 8 and 0 were judged by both Cleanlab and Mechanical Turk to be 9 and 6 respectively (which is correct). Feel free to check out labelerrors to explore more such cases in similar datasets.

Introducing Cleanlab

This is where Cleanlab can come in handy as your best bet. By automatically identifying problems in your ML dataset, it assists you in cleaning both data and labels. This data-centric AI software uses your existing models to estimate dataset problems that can be fixed to train even better models, following the typical data-centric AI model development cycle. Apart from the standard way of coding your way through finding data issues, it also offers Cleanlab Studio, a no-code platform for fixing all your data errors. For the purpose of this blog, we will go the former way on our sample use case.

Getting Hands-on with Cleanlab

Installation

Installing cleanlab is as easy as doing a pip install. I recommend installing the optional dependencies as well; you never know what you need and when. I also installed sentence-transformers, as I would be using them for vectorizing the text. Sentence-transformers come with a bag of many amazing models; we particularly use 'all-mpnet-base-v2' as our choice of sentence-transformer for vectorizing text sequences. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Feel free to check out this for the list of all models and their comparisons.

```
pip install 'cleanlab[all]'
pip install sentence-transformers
```
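As a quick sanity check of the claim above (this snippet is not part of the original article, and the sample sentence is made up), you can encode a single string and confirm the 768-dimensional output:

```python
from sentence_transformers import SentenceTransformer

# Load the same model used later for vectorizing the SMS text
model = SentenceTransformer('all-mpnet-base-v2')

# Encode one made-up SMS-style string and inspect the embedding shape
embedding = model.encode("Free entry in a weekly comp to win cup final tickets")
print(embedding.shape)  # expected: (768,)
```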
Dataset

We picked the SMS Spam Detection dataset as our choice of dataset for the experimentation. It is a public set of labeled SMS messages collected for mobile phone spam research, with roughly ~5.5k instances in total.

Code

Let's now delve into the code. For demonstration purposes, we inject 5% noise into the dataset and see if we are able to detect it and eventually train a better model.

Note: I have also annotated every segment of the code wherever necessary for better understanding.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from cleanlab.classification import CleanLearning
from sklearn.metrics import f1_score

# Reading and renaming data. We set sep='\t' because the data is tab-separated,
# and header=None so the two columns are numbered 0 and 1 before renaming.
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
data.rename({0: 'label', 1: 'text'}, inplace=True, axis=1)

# Dropping any duplicate instances that could exist
data.drop_duplicates(subset=['text'], keep=False, inplace=True)

# Original data distribution for spam and not-spam (ham) categories
print(data['label'].value_counts(normalize=True))
# ham     0.865937
# spam    0.134063

# Adding noise: switching 5% of ham data to the 'spam' label
tmp_df = data[data['label'] == 'ham']
examples_to_change = int(tmp_df.shape[0] * 0.05)
print(f'Changing examples: {examples_to_change}')
# Changing examples: 216

examples_text_to_change = tmp_df.head(examples_to_change)['text'].tolist()
changed_df = pd.DataFrame([[i, 'spam'] for i in examples_text_to_change])
changed_df.rename({0: 'text', 1: 'label'}, axis=1, inplace=True)

left_data = data[~data['text'].isin(examples_text_to_change)]
final_df = pd.concat([left_data, changed_df])
final_df.reset_index(drop=True, inplace=True)

# Modified data distribution for spam and not-spam (ham) categories
print(final_df['label'].value_counts(normalize=True))
# ham     0.840016
# spam    0.159984

raw_texts, raw_labels = final_df["text"].values, final_df["label"].values

# Splitting into train and test sets (an 80/20 split is assumed here)
raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(
    raw_texts, raw_labels, test_size=0.2, random_state=42
)

# Converting labels into integers
encoder = LabelEncoder()
encoder.fit(raw_train_labels)
train_labels = encoder.transform(raw_train_labels)
test_labels = encoder.transform(raw_test_labels)

# Vectorizing text sequences using sentence-transformers
transformer = SentenceTransformer('all-mpnet-base-v2')
train_texts = transformer.encode(raw_train_texts)
test_texts = transformer.encode(raw_test_texts)

# Instantiating the model instance
model = LogisticRegression(max_iter=200)

# Wrapping the scikit-learn model with CleanLearning
cl = CleanLearning(model)

# Finding label issues in the train set
label_issues = cl.find_label_issues(X=train_texts, labels=train_labels)

# Picking the top 50 samples based on label-quality scores
identified_issues = label_issues[label_issues["is_label_issue"] == True]
lowest_quality_labels = label_issues["label_quality"].argsort()[:50].to_numpy()

# Pretty-print the label issues detected by Cleanlab
def print_as_df(index):
    return pd.DataFrame(
        {
            "text": raw_train_texts,
            "given_label": raw_train_labels,
            "predicted_label": encoder.inverse_transform(label_issues["predicted_label"]),
        },
    ).iloc[index]

print_as_df(lowest_quality_labels[:5])
```
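The excerpt stops at inspecting the flagged examples. As a rough sketch of the "train a better model" step described next (this is not code from the original article; it simply drops the flagged rows and reuses the variables defined above):

```python
import numpy as np

# Indices of training rows that Cleanlab flagged as likely label issues
issue_idx = identified_issues.index.to_numpy()

# Keep only the rows that were not flagged
clean_mask = np.ones(len(train_labels), dtype=bool)
clean_mask[issue_idx] = False

# Retrain the same LogisticRegression on the cleaned training set
clean_model = LogisticRegression(max_iter=200)
clean_model.fit(train_texts[clean_mask], train_labels[clean_mask])

# Compare against the held-out test set
preds = clean_model.predict(test_texts)
print("Test F1 after cleaning:", f1_score(test_labels, preds))
```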
As we can see, Cleanlab assisted us in automatically removing the incorrect labels and training a better model with the same parameters and settings. In my experience, people frequently ignore data concerns in favor of building more sophisticated models to increase accuracy numbers. Improving data, on the other hand, is a pretty simple performance win, and thanks to products like Cleanlab, it has become really simple and convenient. Feel free to access and play around with the above code in the Colab notebook here.

Conclusion

In conclusion, Cleanlab offers a straightforward solution to enhance data quality by addressing label inconsistencies, a crucial step in building more reliable and accurate machine learning models. By focusing on data integrity, Cleanlab simplifies the path to better performance and underscores the significance of clean data in the ever-evolving landscape of AI. Elevate your model's accuracy by investing in data quality, and explore the provided code to see the impact for yourself.

Author Bio

Prakhar has a Master's in Data Science and over 4 years of industry experience across sectors such as Retail, Healthcare, and Consumer Analytics. His research interests include Natural Language Understanding and generation, and he has published multiple research papers in reputed international publications in the relevant domain. Feel free to reach out to him on LinkedIn.


Unlocking Insights: How Power BI Empowers Analytics for All Users

Gogula Aryalingam
29 Nov 2024
5 min read
Introduction

In today's data-driven world, businesses rely heavily on robust tools to transform raw data into actionable insights. Among these tools, Microsoft Power BI stands out as a leader, renowned for its versatility and user-friendliness. From its humble beginnings as an Excel add-in, Power BI has evolved into a comprehensive enterprise business intelligence platform, competing with industry giants like Tableau and Qlik. This journey of transformation reflects not only Microsoft's innovation but also the growing need for accessible, scalable analytics solutions.

As a data specialist who has transitioned from traditional data warehousing to modern analytics platforms, I've witnessed firsthand how Power BI empowers both technical and non-technical users. It has become an indispensable tool, offering capabilities that bridge the gap between data modeling and visualization, catering to everyone from citizen developers to seasoned data analysts. This article explores the evolution of Power BI, its role in democratizing data analytics, and its integration into broader solutions like Microsoft Fabric, highlighting why mastering Power BI is critical for anyone pursuing a career in analytics.

The Changing Tide for Data Analysts

When you think of business intelligence in the modern era, Power BI is often the first tool that comes to mind. However, this wasn't always the case. Originally launched as an add-in for Microsoft Excel, Power BI evolved into a comprehensive enterprise business intelligence platform within a few years, competing with the likes of Qlik and Tableau, a true testament to its capabilities. What really impresses me about Power BI's evolution is how Microsoft has continuously improved its user-friendliness, making both data modeling and visualization more accessible and catering to both technical professionals and business users.

As a data specialist, initially working with traditional data warehousing and now with modern data lakehouse-based analytics platforms, I've come to appreciate the capabilities that Power BI brings to the table. It empowers me to go beyond the basics, allowing me to develop detailed semantic layers and create impactful visualizations that turn raw data into actionable insights. This capability is crucial in delivering truly comprehensive, end-to-end analytics solutions. For technical folk like me, building on our experience with these architectures and a deep understanding of the technologies and concepts that drive them, integrating Power BI into the workflow is a smooth and intuitive process. The transition to including Power BI in my solutions feels almost like a natural progression, as it seamlessly complements and enhances the existing frameworks I work with. It has become an indispensable tool in my data toolkit, helping me push the boundaries of what's possible in analytics.

In recent years, there has been a noticeable increase in the number of citizen developers and citizen data scientists. These are non-technical professionals who are well-versed in their business domains and dabble with technology to create their own solutions. This trend has driven the development of a range of low-code/no-code, visual tools such as Coda, Appian, OutSystems, Shopify, and Microsoft's Power Platform. At the same time, the role of the data analyst has significantly expanded. More organizations are now entrusting data analysts with responsibilities that were traditionally handled by technology or IT departments. These include tasks like reporting, generating insights, data governance, and even managing the organization's entire analytics function. This shift reflects the growing importance of data analytics in driving business decisions and operations.

Microsoft has continuously refined Power BI, simplifying complex tasks and making it easy for users of all skill levels to connect, model, and visualize data. This focus on usability is what makes Power BI such a powerful tool, accessible to a wide range of users. For non-technical users, Power BI offers a short learning curve, enabling them to connect to and model data for reporting without needing to rely on Excel, which they might be more familiar with. Once the data is modeled, they can explore a variety of visualization options to derive insights. Moreover, Power BI's capabilities extend beyond simple reporting, allowing users to scale their work into a full-fledged enterprise business intelligence system.

Many data analysts are now looking to deepen their understanding of the broader solutions and technologies that support their work. This is where Microsoft Fabric becomes essential. Fabric extends Power BI into a comprehensive, end-to-end analytics platform, incorporating data lakes, data warehouses, data marts, data engineering, data science, and more. With these advanced capabilities, technical work becomes significantly easier, enabling data analysts to take their skills to the next level and realize their full potential in driving analytics solutions.

If you're considering a career in analytics and business intelligence, it's crucial to master the fundamentals and gain a comprehensive understanding of the necessary skills. With the field rapidly evolving, staying ahead means equipping yourself with the right knowledge to confidently join this dynamic industry. The Complete Power BI Interview Guide is designed to guide you through this process, providing the essential insights and tools you need to jump on board and thrive in your analytics journey.

Conclusion

Microsoft Power BI has redefined the analytics landscape by making advanced business intelligence capabilities accessible to a wide audience, from technical professionals to business users. Its seamless integration into modern analytics workflows and its ability to support end-to-end solutions make it an invaluable tool in today's data-centric environment. With the rise of citizen developers and expanded responsibilities for data analysts, tools like Power BI and platforms like Microsoft Fabric are paving the way for more innovative and comprehensive analytics solutions.

For aspiring professionals, understanding the fundamentals of Power BI and its ecosystem is key to thriving in the analytics field. If you're looking to master Power BI and gain the confidence to excel in interviews and real-world scenarios, The Complete Power BI Interview Guide is an invaluable resource. From core Power BI concepts to interview preparation and onboarding tips and tricks, it is the ultimate resource for beginners and aspiring Power BI job seekers who want to stand out from the competition.

Author Bio

Gogula is an analytics and BI architect born and raised in Sri Lanka.
His childhood was spent dreaming, while most of his adulthood was and is spent working with technology. He currently works for a technology and services company based out of Colombo. He has accumulated close to 20 years of experience working with a diverse range of customers across various domains, including insurance, healthcare, logistics, manufacturing, fashion, F&B, K-12, and tertiary education. Throughout his career, he has undertaken multiple roles, including managing delivery, architecting, designing, and developing data & AI solutions. Gogula is a recipient of the Microsoft MVP award more than 15 times, has contributed to the development and standardization of Microsoft certifications, and holds over 15 data & AI certifications. In his leisure time, he enjoys experimenting with and writing about technology, as well as organizing and speaking at technology meetups. 


How to Face a Critical RAG-driven Generative AI Challenge

Mr. Denis Rothman
06 Nov 2024
15 min read
This article is an excerpt from the book "RAG-Driven Generative AI" by Denis Rothman. Explore the transformative potential of RAG-driven LLMs, computer vision, and generative AI with this comprehensive guide, from basics to building a complex RAG pipeline.

Introduction

On a bright Monday morning, Dakota sits down to get to work and is called by the CEO of their software company, who looks quite worried. An important fire department needs a conversational AI agent to train hundreds of rookie firefighters nationwide on drone technology. The CEO looks dismayed because the data provided is spread over many websites around the country. Worse, the management of the fire department is coming over at 2 PM to see a demonstration and decide whether to work with Dakota's company or not. Dakota is smiling. The CEO is puzzled. Dakota explains that the AI team can put a prototype together in a few hours, be more than ready by 2 PM, and get to work.

The strategy is to divide the AI team into three sub-teams that will work in parallel on three pipelines based on the reference Deep Lake, LlamaIndex, and OpenAI RAG program* they had tested and approved a few weeks back.

Pipeline 1: Collecting and preparing the documents provided by the fire department for this Proof of Concept (POC).
Pipeline 2: Creating and populating a Deep Lake vector store with the first batch of documents while the Pipeline 1 team continues to retrieve and prepare the documents.
Pipeline 3: Index-based RAG with LlamaIndex's integrated OpenAI LLM performed on the first batch of vectorized documents.

The team gets to work at around 9:30 AM after devising their strategy. The Pipeline 1 team begins by fetching and cleaning a batch of documents. They run Python functions to remove punctuation except for periods and noisy references within the content. Leveraging the automated functions they already have through the educational program, the result is satisfactory.

By 10 AM, the Pipeline 2 team sees the first batch of documents appear on their file server. They run the code they got from the RAG program* to create a Deep Lake vector store and seamlessly populate it with an OpenAI embedding model, as shown in the following excerpt:

```python
from llama_index.core import StorageContext
# Import for the Deep Lake integration (not shown in the original excerpt)
from llama_index.vector_stores.deeplake import DeepLakeVectorStore

vector_store_path = "hub://denis76/drone_v2"
dataset_path = "hub://denis76/drone_v2"

# overwrite=True will overwrite the dataset, False will append to it
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
```

Note that the path of the dataset points to the online Deep Lake vector store. The fact that the vector store is serverless is a huge advantage: there is no need to manage its size or storage process, and you can begin to populate it in a few seconds! Also, to process the first batch of documents, overwrite=True forces the system to write the initial data. Then, starting with the second batch, the Pipeline 2 team can run overwrite=False to append the following documents. Finally, LlamaIndex automatically creates a vector store index:

```python
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create an index over the documents
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

By 10:30 AM, the Pipeline 3 team can visualize the vectorized (embedded) dataset in their Deep Lake vector store.
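The append step for later batches is only described in prose above. A minimal sketch of what it might look like, reusing the same variables and assuming a hypothetical next_batch_documents list (this is not code from the book excerpt):

```python
# Append a later batch instead of overwriting the existing vector store.
# next_batch_documents is a hypothetical list of LlamaIndex Document objects.
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(next_batch_documents, storage_context=storage_context)
```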
They create a LlamaIndex query engine on the dataset:

```python
from llama_index.core import VectorStoreIndex

vector_store_index = VectorStoreIndex.from_documents(documents)
...
vector_query_engine = vector_store_index.as_query_engine(
    similarity_top_k=k, temperature=temp, num_output=mt
)
```

Note that the OpenAI Large Language Model is seamlessly integrated into LlamaIndex with the following parameters:

k, in this case k=3, specifies the number of documents to retrieve from the vector store. The retrieval is based on the similarity between the embedded user input and the embedded vectors within the dataset.
temp, in this case temp=0.1, determines the randomness of the output. A low value such as 0.1 forces the similarity search to be precise. A higher value would allow for more diverse responses, which we do not want for this technological conversational agent.
mt, in this case mt=1024, determines the maximum number of tokens in the output.

A cosine similarity function was added to make sure that the outputs matched the sample user inputs:

```python
from sentence_transformers import SentenceTransformer
# Import needed for cosine_similarity (not shown in the original excerpt)
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_cosine_similarity_with_embeddings(text1, text2):
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)
    similarity = cosine_similarity([embeddings1], [embeddings2])
    return similarity[0][0]
```

By 11:00 AM, all three pipeline teams are warmed up and ready to go full throttle! While the Pipeline 2 team was creating the vector store and populating it with the first batch of documents, the Pipeline 1 team prepared the next several batches. At 11:00 AM, Dakota gave the green light to run all three pipelines simultaneously. Within a few minutes, the whole RAG-driven generative AI system was humming like a beehive!

By 1:00 PM, Dakota and the three pipeline teams were working on a PowerPoint slideshow with a copilot. Within a few minutes, it was automatically generated based on their scenario. At 1:30 PM, they had time to grab a quick lunch. At 2:00 PM, the fire department management, Dakota's team, and the CEO of their software company were in the meeting room.

Dakota's team ran the PowerPoint slide show and began the demonstration with a simple input:

user_input="Explain how drones employ real-time image processing and machine learning algorithms to accurately detect events in various environmental conditions."

The response displayed was satisfactory:

Drones utilize real-time image processing and machine learning algorithms to accurately detect events in various environmental conditions by analyzing data captured by their sensors and cameras. This technology allows drones to process visual information quickly and efficiently, enabling them to identify specific objects, patterns, or changes in the environment in real time. By employing these advanced algorithms, drones can effectively monitor and respond to different situations, such as wildfires, wildlife surveys, disaster relief efforts, and agricultural monitoring, with precision and accuracy.

Dakota's team then showed that the program could track and display the original documents the response was based on. At one point, the fire department's top manager, Taylor, exclaimed, "Wow, this is impressive! It's exactly what we were looking for!" Of course, Dakota's CEO began discussing the number of users, cost, and timelines with Taylor. In the meantime, Dakota and the rest of the fire department's team went out to drink some coffee and get to know each other.
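The excerpt shows the demonstration input and the generated answer, but not the call in between. A minimal sketch of how that query could be issued against the engine built above (not code from the book excerpt; it reuses the variables and helper defined earlier):

```python
# Run the demonstration input through the query engine
user_input = (
    "Explain how drones employ real-time image processing and machine learning "
    "algorithms to accurately detect events in various environmental conditions."
)
response = vector_query_engine.query(user_input)
print(response)

# The source documents the answer was based on are attached to the response
for source in response.source_nodes:
    print(source.node.metadata, source.score)

# Score the answer against the prompt with the cosine-similarity helper
print(calculate_cosine_similarity_with_embeddings(user_input, str(response)))
```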
Fire departments intervene efficiently at short notice for emergencies. So can expert-level AI teams!

*Reference notebook: https://github.com/Denis2054/RAG-Driven-Generative-AI/blob/main/Chapter03/Deep_Lake_LlamaIndex_OpenAI_RAG.ipynb

Conclusion

In facing a high-stakes, time-sensitive challenge, Dakota and their AI team demonstrated the power and efficiency of RAG-driven generative AI. By leveraging a structured, multi-pipeline approach with tools like Deep Lake, LlamaIndex, and OpenAI's advanced models, the team was able to integrate scattered data sources quickly and effectively, delivering a sophisticated, real-time conversational AI prototype tailored for firefighter training on drone technology. Their success showcases how expert planning, resourceful use of AI tools, and teamwork can transform a complex project into a streamlined solution that meets client needs. This case underscores the potential of generative AI to create responsive, practical solutions for critical industries, setting a new standard for rapid, high-quality AI deployment in real-world applications.

Author Bio

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, and as a student, he wrote and registered a patent for one of the earliest word2vector embeddings and word piece tokenization solutions. He started a company focused on deploying AI and went on to author one of the first AI cognitive NLP chatbots, applied as a language teaching tool for Moët et Chandon (part of LVMH) and more. Denis rapidly became an expert in explainable AI, incorporating interpretable, acceptance-based explanation data and interfaces into solutions implemented for major corporate projects in the aerospace, apparel, and supply chain sectors. His core belief is that you only really know something once you have taught somebody how to do it.


Unlocking Excel's Potential: Extend Your Spreadsheets with R and Python

Steven Sanderson, David Kun
17 Oct 2024
5 min read
Introduction

Are you an Excel user looking to push your data analysis capabilities beyond the familiar cells and formulas? If so, you're about to embark on a transformative journey. With the integration of R and Python, you can elevate Excel into a powerhouse of advanced data analysis and visualization. In this blog post, inspired by the book "Extending Excel with Python and R," co-authored by myself and David Kun, we will dive deep into practical implementation, focusing on how to automate data visualization in Excel using these powerful programming languages.

Practical Implementation: Creating Advanced Data Visualizations

In the world of data analysis, visual representation is key to understanding complex datasets. Excel, while equipped with basic charting tools, often requires enhancement for more sophisticated visuals. By integrating R and Python, you can create dynamic and detailed graphs that bring your data to life.

Task: Automating Data Visualization with Python and R

Step 1: Set Up Your Environment
Before jumping into visualization, ensure you have the necessary tools installed:
- Excel: Ensure you have a version that supports VBA (Visual Basic for Applications).
- Python: Install Python on your computer. You can download it from the official Python website.
- R: Similarly, install R from the Comprehensive R Archive Network (CRAN).
- Libraries: For Python, install `pandas`, `matplotlib`, and `openpyxl` using pip. For R, install `ggplot2` and `readxl`.

Step 2: Importing Data
Begin by importing your Excel data into Python using pandas, or into R using readxl.

Step 3: Creating Visualizations
In Python, Matplotlib lets you create a simple line plot from the imported data. In R, with ggplot2 the process is equally straightforward once your data frame is loaded.

Step 4: Integrating Visualizations into Excel
Once your visualization is created, the next step is to integrate it back into Excel. This can be done manually, or you can automate it using VBA or an API endpoint. In Python, openpyxl can embed the saved image into a workbook. For R, you might automate this process using R scripts that interact with Excel via VBA or other packages like `officer`.

Step 5: Automating the Entire Workflow
To automate, consider using Python scripts executed from Excel VBA or R scripts called through Excel's RExcel plugin. This way, you can refresh data and update visualizations with minimal effort. A consolidated Python sketch of Steps 2-4 follows below.
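The code snippets from the original post are not reproduced here, so the following is a hedged sketch, in Python only, of the read-plot-embed workflow described in Steps 2-4. The file names, sheet names, and column names (`sales.xlsx`, `Data`, `Month`, `Revenue`) are placeholders, not taken from the book:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render the chart to a file; no display is needed
import matplotlib.pyplot as plt
from openpyxl import load_workbook
from openpyxl.drawing.image import Image as XLImage

# Step 2: import the Excel data into a DataFrame (placeholder file/sheet names)
df = pd.read_excel("sales.xlsx", sheet_name="Data")

# Step 3: create a simple line plot with Matplotlib (placeholder column names)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["Month"], df["Revenue"], marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Monthly revenue")
fig.tight_layout()
fig.savefig("revenue.png", dpi=150)

# Step 4: embed the saved image back into the workbook with openpyxl
wb = load_workbook("sales.xlsx")
ws = wb.create_sheet("Charts")
ws.add_image(XLImage("revenue.png"), "B2")
wb.save("sales_with_chart.xlsx")
```

Run from Excel VBA or a scheduler, as suggested in Step 5, a script like this would refresh the chart whenever the workbook data changes.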
Conclusion

By integrating R and Python with Excel, you unlock a realm of possibilities for data visualization and analysis, turning Excel from a simple spreadsheet tool into a comprehensive data analytics suite. This guide provides a snapshot of what you can achieve, and with further exploration, the potential is limitless.

Author Bio

Steven Sanderson is a Manager of Applications with a deep passion for data and its complements: cleaning, analysis, visualization, and communication. He is known primarily for his work in R. After his MPH, Steven continued his work in the healthcare industry as a clinical decision support analyst, working his way up to Manager of Applications at Stony Brook Medicine for Patient Financial Services. He is currently focused on expanding the functions in his healthyverse suite of packages while also slimming them down and improving their robustness. He also enjoys mentoring junior employees to set them up for success.

David Kun is a mathematician and actuary who has always worked in the gray zone between quantitative teams and ICT, aiming to build a bridge. He is a co-founder and director of Functional Analytics and the creator of the ownR infinity platform. As a data scientist, he also uses ownR for his daily work. His projects include time series analysis for demand forecasting, computer vision for design automation, and visualization.

Looking to Master Excel with Python and R?

If you're excited about extending Excel's capabilities with powerful tools like Python and R, Extending Excel with Python and R, authored by Steven Sanderson and David Kun, offers an in-depth guide to seamlessly integrating these languages into your Excel workflow. It covers everything from automating data tasks to advanced visualizations, all tailored for Excel enthusiasts.