
How-To Tutorials - AI Tools

91 Articles
Jasmeet Bhatia, Kartik Chaudhary
04 Nov 2024
15 min read

Vertex AI Workbench: Your Complete Guide to Scaling Machine Learning with Google Cloud

This article is an excerpt from the book "The Definitive Guide to Google Vertex AI" by Jasmeet Bhatia and Kartik Chaudhary. The book is for ML practitioners who want to learn Google best practices, MLOps tooling, and turnkey AI solutions for solving large-scale, real-world AI/ML problems. It takes a hands-on approach to help you become an ML rockstar on Google Cloud Platform in no time.

Introduction

When we run a Jupyter Notebook in a local environment, or use a web-based Colab- or Kaggle-like kernel, we can perform quick experiments and get initial accuracy figures or results from ML algorithms very fast. But we hit a wall when it comes to performing large-scale experiments, launching long-running jobs, hosting a model, or monitoring a model. Additionally, if the project's data requires more granular security and privacy permissions (fine-grained control over who can view or access the data), that is not feasible in local or Colab-like environments. All these challenges can be addressed by moving to the cloud.

Vertex AI Workbench within Google Cloud is a JupyterLab-based environment that can be leveraged for all kinds of development needs in a typical data science project. The JupyterLab environment is very similar to the Jupyter Notebook environment, so we use the two terms interchangeably throughout the book.

Vertex AI Workbench has options for creating managed notebook instances as well as user-managed notebook instances. User-managed notebook instances give more control to the user, while managed notebooks come with some key extra features. We will discuss these in more detail later in this section.
Some key features of the Vertex AI Workbench notebook suite include the following:

Fully managed: Vertex AI Workbench provides a fully managed, Jupyter Notebook-based environment that offers enterprise-level scale, security, and user-management capabilities without our having to manage the infrastructure.
Interactive experience: Data exploration and model experiments are easier because managed notebooks can readily interact with other Google Cloud services such as storage systems, big data solutions, and so on.
Prototype to production AI: Vertex AI notebooks can readily interact with other Vertex AI tools and Google Cloud services, and thus provide an environment to run end-to-end ML projects from development to deployment with minimal transition.
Multi-kernel support: Workbench provides multi-kernel support in a single managed notebook instance, including kernels for tools such as TensorFlow, PyTorch, Spark, and R. Each of these kernels comes with useful ML libraries pre-installed and lets us install additional libraries as required.
Scheduling notebooks: Vertex AI Workbench lets us schedule notebook runs on an ad hoc or recurring basis. This is useful for setting up and running large-scale experiments quickly. The feature is available through managed notebook instances; more information is provided in the coming sections.

With this background, we can now start working with Jupyter Notebooks on Vertex AI Workbench. The next section provides basic guidelines for getting started with notebooks on Vertex AI.

Getting started with Vertex AI Workbench

Go to the Google Cloud console and open Vertex AI from the products menu in the left pane or by using the search bar at the top. Inside Vertex AI, click on Workbench; it opens a page very similar to the one shown in Figure 4.3. More information is available in the official documentation (https://cloud.google.com/vertex-ai/docs/workbench/introduction).
Figure 4.3 – Vertex AI Workbench UI within the Google Cloud console

As we can see, Vertex AI Workbench is essentially Jupyter Notebook as a service, with the flexibility of working with managed as well as user-managed notebooks. User-managed notebooks are suitable for use cases where we need a more customized environment with relatively higher control. Another advantage of user-managed notebooks is that we can choose a suitable Docker container based on our development needs; these notebooks also let us change the type or size of the instance later on with a restart.

To choose the best Jupyter Notebook option for a particular project, it's important to know the common differences between the two solutions. Table 4.1 describes some common differences between fully managed and user-managed notebooks:

Table 4.1 – Differences between managed and user-managed notebook instances

Let's create one user-managed notebook to check the available options:

Figure 4.4 – Jupyter Notebook kernel configurations

As we can see in the preceding screenshot, user-managed notebook instances come with several customized image options to choose from. Along with support for tools such as TensorFlow Enterprise, PyTorch, JAX, and so on, the dialog also lets us decide whether we want to work with GPUs (which can be changed later as needs evolve). These customized images come with all the useful libraries pre-installed for the desired framework, and they also let us install any third-party packages within the instance.

After choosing the appropriate image, we get more options to customize things such as the notebook name, notebook region, operating system, environment, machine type, accelerators, and so on (see the following screenshot):

Figure 4.5 – Configuring a new user-managed Jupyter Notebook

Once we click on the CREATE button, it can take a couple of minutes to create a notebook instance.
Once it is ready, we can launch the Jupyter instance in a browser tab using the link provided inside Workbench (see Figure 4.6). We also get the option to stop the notebook when we are not using it (to reduce cost):

Figure 4.6 – A running Jupyter Notebook instance

This Jupyter instance can be accessed by all team members who have access to Workbench, which helps in collaborating and sharing progress with teammates. Once we click on OPEN JUPYTERLAB, it opens a familiar Jupyter environment in a new tab (see Figure 4.7):

Figure 4.7 – A user-managed JupyterLab instance in Vertex AI Workbench

A Google-managed JupyterLab instance looks very similar (see Figure 4.8):

Figure 4.8 – A Google-managed JupyterLab instance in Vertex AI Workbench

Now that we can access the notebook instance in the browser, we can launch a new Jupyter Notebook or terminal and get started on the project. After the service account has been given sufficient permissions, many useful Google Cloud services such as BigQuery, GCS, Dataflow, and so on can be accessed from the Jupyter Notebook itself using their SDKs. This makes Vertex AI Workbench a one-stop tool for every ML development need.

Note: We should stop Vertex AI Workbench instances when we are not using them or don't plan to use them for a long period of time. This prevents us from incurring costs for instances that run unnecessarily.

In the next sections, we will learn how to create notebooks using custom containers and how to schedule notebooks with Vertex AI Workbench.

Custom containers for Vertex AI Workbench

Vertex AI Workbench gives us the flexibility of creating notebook instances based on a custom container as well. The main advantage of a custom container-based notebook is that it lets us customize the notebook environment based on our specific needs. Suppose we want to work with a new TensorFlow version (or any other library) that is currently not available as a predefined kernel.
We can create a custom Docker container with the required version and launch a Workbench instance using this container. Custom containers are supported by both managed and user-managed notebooks. Here is how to launch a user-managed notebook instance using a custom container:

1. Create a custom container based on the requirements. Most of the time, a derivative container (a container based on an existing deep learning container image) is the easiest to set up. In the following example Dockerfile, we first pull an existing TensorFlow GPU image and then install a newer TensorFlow version with pip:

    FROM gcr.io/deeplearning-platform-release/tf-gpu:latest
    RUN pip install --upgrade tensorflow

2. Build and push the container image to Container Registry so that it is accessible to the Google Compute Engine (GCE) service account. The following commands build and push the container image:

    export PROJECT=$(gcloud config list project --format "value(core.project)")
    docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/tf-custom:latest"
    docker push "gcr.io/${PROJECT}/tf-custom:latest"

Note that the service account should be given sufficient permissions to build and push the image to the container registry, and the respective APIs should be enabled.

3. Go to the User-managed notebooks page, click on the New Notebook button, and then select Customize. Provide a notebook name and select an appropriate Region and Zone value.

4. In the Environment field, select Custom Container.

5. In the Docker Container Image field, enter the address of the custom image; in our case, it would look like this: gcr.io/${PROJECT}/tf-custom:latest

6. Make the remaining appropriate selections and click the Create button.

We are all set now. While launching the notebook, we can select the custom container as a kernel and start working in the custom environment.
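Steps 2 and 5 above reuse the same image address, so it can help to assemble it in one place. The following is a minimal Python sketch, not taken from the book: the helper names `image_uri` and `build_and_push_commands` and the project ID `my-project` are hypothetical, and the script only prints the docker commands from step 2 (a dry run) rather than executing them.

```python
import shlex
from typing import List


def image_uri(project: str, name: str = "tf-custom", tag: str = "latest") -> str:
    """Return the Container Registry address of the custom image."""
    return f"gcr.io/{project}/{name}:{tag}"


def build_and_push_commands(
    project: str, dockerfile: str = "Dockerfile.example"
) -> List[List[str]]:
    """Assemble the docker build and docker push commands from step 2."""
    uri = image_uri(project)
    return [
        ["docker", "build", ".", "-f", dockerfile, "-t", uri],
        ["docker", "push", uri],
    ]


# Dry run: print the commands instead of executing them.
# "my-project" is a placeholder project ID (an assumption for illustration).
for command in build_and_push_commands("my-project"):
    print(shlex.join(command))
```

To actually run the commands, each list could be passed to `subprocess.run(command, check=True)` on a machine where Docker is installed and authenticated against the registry. The value returned by `image_uri` is the same string that goes into the Docker Container Image field in step 5.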
Conclusion

Vertex AI Workbench stands out as a powerful, cloud-based environment that streamlines machine learning development and deployment. By leveraging its managed and user-managed notebook options, teams can overcome local development limitations, ensuring better scalability, enhanced security, and integrated access to Google Cloud services. This guide has explored the foundational aspects of working with Vertex AI Workbench, including its customizable environments, scheduling features, and the use of custom containers. With Vertex AI Workbench, data scientists and ML practitioners can focus on innovation and productivity, confidently handling projects from inception to production.

Author Bio

Jasmeet Bhatia is a machine learning solution architect with over 18 years of industry experience, with the last 10 years focused on global-scale data analytics and machine learning solutions. In his current role at Google, he works closely with key GCP enterprise customers to provide them guidance on how to best use Google's cutting-edge machine learning products. At Google, he has also worked as part of the Area 120 incubator on building innovative data products such as Demand Signals, and he has been involved in the launch of Google products such as Time Series Insights. Before Google, he worked in similar roles at Microsoft and Deloitte.

When not immersed in technology, he loves spending time with his wife and two daughters, reading books, watching movies, and exploring the scenic trails of southern California.

He holds a bachelor's degree in electronics engineering from Jamia Millia Islamia University in India and an MBA from the University of California Los Angeles (UCLA) Anderson School of Management.

Kartik Chaudhary is an AI enthusiast, educator, and ML professional with 6+ years of industry experience. He currently works as a senior AI engineer with Google to design and architect ML solutions for Google's strategic customers, leveraging core Google products, frameworks, and AI tools. He previously worked with UHG as a data scientist and helped in making the healthcare system work better for everyone. Kartik has filed nine patents at the intersection of AI and healthcare.

Kartik loves sharing knowledge and runs his own blog on AI, titled Drops of AI.

Away from work, he loves watching anime and movies and capturing the beauty of sunsets.

05 Jul 2024
13 min read

Microsoft AI’s Skeleton Key, AutoML with AutoGluon, MultiOn AI's Retrieve API, Narrative BI’s Hybrid AI, Python's Duck Typing, Gibbs Diffusion

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!

👋 Hello,

Happy Friday! Welcome to DataPro#101: Your Essential Data Science & ML Update! 🚀

This week, we've curated the latest techniques in data extraction, transforming unstructured data into structured formats, best practices for prompt engineering in NL2SQL, and much more. Consider this your all-in-one guide to staying informed in the ever-evolving world of data science and machine learning. Now, dive in and explore these exciting new ideas!

⚡ Tech Highlights: Stay Updated!
Prompt Engineering with Claude 3: Learn hands-on techniques on Amazon Bedrock.
Accelerated PyTorch: Boost models with torch.compile on AWS Graviton.
BigQuery Data Canvas: Perfect your prompts.
Skeleton Key AI: New AI jailbreak method.
GraphRAG: Complex data discovery tool on GitHub.

📚 New from Packt Library
Data Science for Web3: Guide to blockchain data analysis and ML.

🔍 Latest in LLMs & GPTs
NASA-IBM's INDUS Models: Advanced science LLMs.
EvoAgent: Evolutionary multi-agent systems.
Kyutai's Moshi: Real-time AI model.
MultiOn AI's Retrieve API: Accurate web search.
Gibbs Diffusion (GDiff): Bayesian image denoising.
Narrative BI's Hybrid AI: Business data analysis.
WildGuard: Safe LLM interactions.
ProgressGym: Ethical AI alignment.
OmniParse: Structuring unstructured data for GenAI.

✨ What's Fresh
Claude 3.5 Sonnet Use Cases: Future AI capabilities.
Explainability in ML: Make models understandable.
Group-By Aggregation: Powerful EDA tool.
OpenAI and PandasAI: Series operations.
AutoML with AutoGluon: ML in four lines of code.
Python's Duck Typing: Flexible coding concept.

🔰 GitHub Finds: Add These Repos
fal/AuraSR
arcee-ai/Arcee-Spark-GGUF
pprp/Pruner-Zero
ruiyiw/patient-psi
hrishioa/rakis
ragapp/ragapp
Doriandarko/claude-engineer
hao-ai-lab/MuxServe

DataPro Newsletter is not just a publication; it's a complete toolkit for anyone serious about mastering the ever-changing landscape of data and AI.
Grab your copy and start transforming your data expertise today!

📥 Feedback on the Weekly Edition

Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." We appreciate your input and hope you enjoy the book!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

🔰 Data Science Tool Kit

➔ fal/AuraSR: AuraSR is a GAN-based super-resolution model for upscaling images. Implemented in PyTorch and inspired by the GigaGAN paper, it significantly enhances image quality.
➔ arcee-ai/Arcee-Spark-GGUF: Arcee Spark, a 7B model from Qwen2, excels with fine-tuning and DPO, outperforming GPT-3.5 on tasks, ideal for efficient AI deployment.
➔ pprp/Pruner-Zero: Pruner-Zero automates symbolic pruning metric discovery for Large Language Models, surpassing current methods in language modeling and zero-shot tasks.
➔ ruiyiw/patient-psi: Patient-Ψ uses Large Language Models to simulate patient interactions for training mental health professionals, emphasizing cognitive modeling and practical deployment.
➔ hrishioa/rakis: Rakis is a browser-based permissionless AI inference network enabling decentralized consensus without servers, emphasizing open-source and educational use.
➔ ragapp/ragapp: RAGapp simplifies enterprise use of Agentic RAG models, configurable like OpenAI's custom GPTs, deployable via Docker on cloud infrastructure.
➔ Doriandarko/claude-engineer: Claude Engineer, powered by Anthropic's Claude-3.5-Sonnet, aids software development through an interactive CLI blending AI model capabilities with file operations and web search.
➔ hao-ai-lab/MuxServe: MuxServe efficiently serves multiple LLMs using spatial-temporal multiplexing, optimizing memory and computation resources based on LLM popularity and characteristics.
📚 Expert Insights from Packt Community

Data Science for Web3: A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases
By Gabriela Castillo Areco

Understanding the blockchain ingredients

If you have a background in blockchain development, you may skip this section.

Web3 represents a new generation of the World Wide Web that is based on decentralized databases, permissionless and trustless interactions, and native payments. This new concept of the internet opens up various business possibilities, some of which are still in their early stages.

Currently, we are in the Web2 stage, where centralized companies store significant amounts of data sourced from our interactions with apps. The promise of Web3 is that we will interact with Decentralized Apps (dApps) that store only the relevant information on the blockchain, accessible to everyone.

As of the time of writing, Web3 has some limitations recognized by the Ethereum organization:

Velocity: The speed at which the blockchain is updated poses a scalability challenge. Multiple initiatives are being tested to try to solve this issue.
Intuition: Interacting with Web3 is still difficult to understand. The logic and user experience are not as intuitive as in Web2, and a lot of education will be necessary before users can start utilizing it on a massive scale.
Cost: Recording an entire business process on the chain is expensive. Having multiple smart contracts as part of a dApp costs a lot for the developer and the user.

Blockchain technology is a foundational technology that underpins Web3. It is based on Distributed Ledger Technology (DLT), which stores information once it is cryptographically verified. Once reflected on the ledger, each transaction cannot be modified and multiple parties have a complete copy of it.
Two structural characteristics of the technology are the following:

It is structured as a set of blocks, where each block contains information about the previous block (cryptographically hashed; we will learn more about this in this chapter), making it impossible to alter the previous block at a later stage. Each block is chained to the previous one by this cryptographic hashing mechanism.
It is decentralized. The copy of the entire ledger is distributed among several servers, which we will call nodes. Each node has a complete copy of the ledger and verifies consistency every time it adds a new block on top of the blockchain.

This structure provides the solution to double spending, enabling for the first time the decentralized transfer of value through the internet. This is why Web3 is known as the internet of value.

Since the complete version of the ledger is distributed among all the participants of the blockchain, any new transaction that contradicts previously stored information will not be successfully processed (there will be no consensus to add it). This characteristic facilitates transactions among parties that do not know each other without the need for an intermediary acting as a guarantor between them, which is why this technology is known as trustless.

The decentralized storage also takes control away from each server and, thus, there is no sole authority with sufficient power to change any data point once the transaction is added to the blockchain. Since taking down one node will not affect the network, a hacker who wants to attack the database would require such high computing power that the attempt would be economically unfeasible. This adds a security level that centralized servers do not have.

This excerpt is from the latest book, "Data Science for Web3: A comprehensive guide to decoding blockchain data with data analysis basics and machine learning cases", written by Gabriela Castillo Areco.
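The chaining of blocks described in the excerpt can be illustrated with a few lines of Python. This is a minimal sketch, not code from the book: it assumes SHA-256 as the hash function, and the helpers `block_hash`, `append_block`, and `is_valid`, as well as the payment strings, are hypothetical names chosen for illustration. Real blockchains add consensus, signatures, and networking on top of this idea.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's full contents (including its link to the previous block)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, data: str) -> None:
    """Add a block that stores the hash of the block before it."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "data": data, "prev_hash": prev_hash})


def is_valid(chain: list) -> bool:
    """Check that every block still points at the actual hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for payment in ("alice->bob:5", "bob->carol:2", "carol->dave:1"):
    append_block(chain, payment)

print(is_valid(chain))  # True

# Tampering with an earlier block breaks the link stored in the next block.
chain[1]["data"] = "bob->carol:200"
print(is_valid(chain))  # False
```

Because each block's hash covers its own `prev_hash` field, changing any earlier block would require recomputing every block after it. In this sketch nothing protects the most recent block; in a real network, that is where the distributed copies and consensus described above come in.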
Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz!

AWS

➤ Prompt engineering techniques and best practices: Learn by doing with Anthropic's Claude 3 on Amazon Bedrock. This blog post focuses on crafting effective prompts for generative AI models to achieve desired outputs. It emphasizes the importance of well-constructed prompts in guiding models like Claude 3 Haiku on Amazon Bedrock to produce accurate and relevant responses, showcasing examples of prompt variations and their impact.
➤ Accelerated PyTorch inference with torch.compile on AWS Graviton processors. This blog post describes how AWS optimized PyTorch's torch.compile feature for AWS Graviton3 processors, significantly enhancing performance for Hugging Face and TorchBench model inference compared to the default eager mode. These optimizations, available from PyTorch 2.3.1, aim to streamline model execution on Graviton3-based Amazon EC2 instances.

Google

➤ How to write prompts for BigQuery data canvas? This blog post focuses on leveraging generative AI, specifically Gemini in BigQuery, to perform data tasks via natural language queries (NL2SQL and NL2Chart). It highlights how refining NL prompts can enhance query accuracy, promoting collaboration and efficiency among data professionals using BigQuery's data canvas tool.

Microsoft

➤ Microsoft AI Unveils Skeleton Key: A Novel Generative AI Jailbreak Method. This blog post discusses a newly discovered type of attack in generative AI called Skeleton Key, also known as Master Key. It explores how this attack bypasses AI guardrails, allowing models to generate unauthorized content, and outlines Microsoft's mitigation strategies using Prompt Shields in Azure AI.
➤ GraphRAG: New tool for complex data discovery now on GitHub.
The update introduces GraphRAG, a graph-based approach to retrieval-augmented generation (RAG), now available on GitHub. It enhances information retrieval and response generation by automating knowledge graph extraction from text datasets, offering structured insights for global queries. An Azure-hosted API facilitates easy deployment without coding.

Email Forwarded? Join DataPro Here!

🔍 From Bits to BERT: Keeping Up with LLMs & GPTs

🔸 NASA-IBM Collaboration Develops INDUS Large Language Models for Advanced Science Research. The blog explores NASA's collaboration with IBM to develop INDUS, a suite of specialized large language models (LLMs) tailored for scientific domains. INDUS enhances data analysis, retrieval, and curation across Earth science, heliophysics, and more, advancing research capabilities in diverse scientific disciplines.
🔸 EvoAgent: Expanding Expert Agents to Multi-Agent Systems with Evolutionary Algorithms. EvoAgent automates the extension of expert agents to multi-agent systems using evolutionary algorithms, applicable to any LLM-based agent framework. It enhances agent diversity and performance across tasks, exemplified in debates by generating varied opinions and improving content quality dynamically.
🔸 Kyutai Releases Moshi: A Real-Time AI Model that Understands and Speaks. Kyutai introduces Moshi, a real-time native multimodal foundation model surpassing GPT-4o functionalities. Moshi understands emotions, speaks with accents like French, and handles dual audio streams, enabled by joint pre-training on text and audio. It supports open-source transparency and runs efficiently on consumer hardware.
🔸 MultiOn AI's Retrieve API Boosts Web Search with Real-Time Accuracy for Advanced Applications. MultiOn AI has launched the Retrieve API, a cutting-edge tool for autonomous web information retrieval.
It enhances data extraction from web pages with real-time processing, catering to diverse applications such as personalized shopping assistants, automated lead generation, and content creation tools, setting new standards in web data extraction technology.
🔸 Gibbs Diffusion (GDiff): A Bayesian Blind Denoising Method for Images and Cosmology. The study introduces Gibbs Diffusion (GDiff) as an innovative method for blind denoising with deep generative models. It enables simultaneous sampling of signal and noise parameters, improving Bayesian inference for scenarios like natural image denoising and cosmological data analysis, enhancing accuracy in noise characterization and signal recovery.
🔸 Narrative BI Introduces Hybrid AI Approach for Business Data Analysis. The research explores hybrid approaches in business data analysis, combining the precision of rule-based systems with the pattern recognition of Large Language Models (LLMs). This integration aims to generate actionable insights from complex datasets, improving efficiency and accuracy in decision-making processes for businesses.
🔸 WildGuard: A Lightweight Moderation Tool for User Safety in LLM Interactions. The paper introduces WildGuard, an open and lightweight moderation tool for enhancing safety in Large Language Models (LLMs). It focuses on identifying malicious intent in user prompts, detecting safety risks in model responses, and evaluating model refusal rates. WildGuard achieves state-of-the-art performance across these tasks, addressing critical gaps in existing moderation tools.
🔸 ProgressGym: ML Framework for Ethical Alignment in Frontier AI. This research addresses the influence of AI systems, particularly large language models (LLMs), on human epistemology and societal values. It introduces progress alignment as a technical solution to prevent AI reinforcement of problematic moral beliefs.
ProgressGym, an experimental framework, facilitates learning from historical data to address real-world moral decision-making challenges.
🔸 OmniParse: AI Platform for Structuring Unstructured Data for GenAI Applications. OmniParse tackles the challenge of managing diverse unstructured data types (documents, images, audio, video, and web content) by converting them into structured formats optimized for AI applications. It integrates various tools like Surya OCR and Florence-2 for accurate data extraction, enhancing workflow efficiency and data usability across platforms.

✨ On the Radar: Catch Up on What's Fresh

🔹 10 Use Cases of Claude 3.5 Sonnet: Unveiling the Future of Artificial Intelligence AI with Revolutionary Capabilities. Claude 3.5 Sonnet by Anthropic AI marks a leap forward in AI capabilities, showcasing versatility across diverse domains. It excels in generating n-body particle animations, interactive learning dashboards, escape room experiences, virtual psychiatry, interactive poster designs, educational visual demonstrations, customizable calendar applications, real-time object detection, financial tools, and advanced physics simulations.
🔹 Explainability, Interpretability and Observability in Machine Learning. The article explores the nuances of machine learning (ML) transparency through concepts like explainability, interpretability, and observability. It discusses their definitions, distinctions, and importance in fostering trust, accountability, and effective deployment of ML models across various industries and applications.
🔹 A Powerful EDA Tool: Group-By Aggregation. The article dives into Exploratory Data Analysis (EDA) techniques, focusing on group-by aggregation in Pandas.
Using the Metro Interstate Traffic dataset as an example, it demonstrates how to derive insights such as monthly traffic progression, daily traffic profiles, and hourly traffic patterns by weekday versus weekend, and how to identify the top weather conditions associated with congestion rates.
🔹 Using OpenAI and PandasAI for Series Operations. This article explores PandasAI, which leverages AI models like OpenAI's to enhance Pandas data manipulation tasks. It covers querying Series values, creating new Series, conditional value setting, and reshaping data using natural language commands. Examples include summarizing statistics, conditional operations, and efficiently reshaping COVID-19 and NLS youth study datasets.
🔹 AutoML with AutoGluon: ML workflow with Just Four Lines of Code. The article explores AutoGluon, an automated machine-learning framework developed by Amazon Web Services (AWS). It discusses how AutoGluon simplifies the entire machine-learning process, from data preprocessing to model selection and hyperparameter tuning, making it accessible and efficient for users across various data types like tabular, text, and image data.
🔹 Understanding Python's Duck Typing. The article explores the concept of duck typing in Python, which emphasizes behavior over type. It allows objects to be used based on their methods rather than explicit types, promoting flexibility and polymorphism. Duck typing simplifies code but requires careful handling to avoid runtime errors.

See you next time!

Merlyn Shelley
27 Jun 2024
14 min read

Top 100+ Essential Data Science Tools & Repos: Streamline Your Workflow Today!

Introduction

As data professionals, navigating the vast sea of Big Data often leaves us searching for the right tools to harness its potential. Whether we're defining intricate problems, identifying emerging trends, or crafting innovative solutions, the challenge is undeniable. Too often, this quest has us wandering aimlessly through the web, seeking elusive answers.

Here at the DataPro Newsletter team, we understand this all too well. That's why, in celebration of our 100th edition, we're thrilled to present a special gift to our valued readers: a thorough reference module brimming with resources. This carefully curated collection features over 100 of the most popular tools and GitHub repositories. Each one is not only widely used and trusted but is also consistently updated with the latest breakthroughs to enhance your data processing capabilities.

Think of this module as your treasure chest, designed to streamline your workflow and inspire innovative solutions. Bookmark this page for quick access whenever you encounter challenges in any area of data science and machine learning, from DataOps to Recommender Systems to Quantitative Finance. We've got it all covered!

So, dive into this one-stop reference module, explore its depths, and let the spirit of data kinship propel you forward. Here's to more empowering tools and transformative insights from your DataPro team. Cheers!

DataOps/MLOps

kestra-io/kestra: Kestra is an open-source orchestrator for scheduled and event-driven workflows, leveraging Infrastructure as Code for reliable management.
open-metadata/OpenMetadata: OpenMetadata is a unified platform for data discovery, observability, and governance, featuring a central repository, column lineage, and team collaboration.
dolthub/dolt: Dolt is a SQL database with Git-like version control features, accessible via MySQL or a command line interface.
iterative/dvc: DVC is a tool for reproducible machine learning, enabling data and model versioning, lightweight pipelines, experiment tracking, and easy sharing.
quiltdata/quilt: Quilt allows creating versioned datasets with Python and an S3 bucket. It supports data-driven teams, aiding rapid experimentation and collaboration.

Real-time Data Processing

allinurl/goaccess: GoAccess is a real-time web log analyzer for *nix systems and browsers, offering fast HTTP statistics. More details: goaccess.io.
feathersjs/feathers: Feathers is a TypeScript/JavaScript framework for building APIs and real-time apps, compatible with various backends and frontends.
apache/age: Apache AGE extends PostgreSQL with graph database capabilities, supporting both relational SQL and openCypher graph queries seamlessly.
zephyrproject-rtos/zephyr: Zephyr is a real-time OS for diverse hardware, from IoT sensors to smart watches, emphasizing scalability, security, and resource efficiency.
hazelcast/hazelcast: Hazelcast integrates stream processing and fast data storage for real-time insights, enabling immediate action on data-in-motion within a unified platform.

Data Quality Management

WeBankFinTech/Qualitis: Qualitis manages data quality through verification, notification, and management across various data sources, solving data processing-related quality issues.
raystack/optimus: Optimus is a robust workflow orchestrator for data transformation, modeling, pipelines, and quality management, emphasizing ease of use and reliability.
Toloka/crowd-kit: Crowd-Kit is a Python library for crowdsourced annotation, featuring aggregation methods, metrics, and datasets to simplify working with crowd data.
ydataai/ydata-profiling: ydata-profiling offers a streamlined, fast EDA solution akin to pandas' df.describe(), providing detailed DataFrame analysis exportable in formats like HTML and JSON.
cleanlab/cleanlab: cleanlab automates data and label cleaning by detecting issues in ML datasets, enhancing model training with real-world data.

Predictive Analytics
spring-cloud/spring-cloud-dataflow: Spring Cloud Data Flow enables microservices-driven data processing pipelines on Cloud Foundry and Kubernetes, supporting diverse use cases like streaming and batch processing.
ScottfreeLLC/AlphaPy: AlphaPy, a Python ML framework, caters to speculators and data scientists with scikit-learn, pandas, and additional tools for feature engineering and visualization.
retentioneering/retentioneering-tools: Retentioneering simplifies analyzing clickstreams and user paths, offering deeper insights than funnel analysis, benefiting data and marketing analysts.
genular/pandora: PANDORA offers advanced analytics for biomedical research, employing machine learning tools like clustering, PCA, UMAP, and interpretable models for discovery.
nabeel-oz/qlik-py-tools: Qlik's SSE integrates modern data science into Qlik Sense, enabling business users to leverage advanced analytics through Python-based functions.

Deep Learning
Lightning-AI/pytorch-lightning: Lightning 2.0 simplifies PyTorch workflows with a stable API, enabling scalable training and deployment of AI models efficiently.
ultralytics/yolov5: YOLOv5 by Ultralytics is a leading vision AI model, built on extensive open-source research and development for advanced performance.
hpcaitech/ColossalAI: Colossal-AI simplifies distributed deep learning with user-friendly tools, enabling easy parallel training and inference similar to local model development.
naptha/tesseract.js: Tesseract.js simplifies OCR with a WebAssembly-based Tesseract engine, supporting both browser and Node.js environments with easy integration and setup.
microsoft/DeepSpeed: DeepSpeed enables efficient training of models like ChatGPT with significant speed improvements and cost reductions across all scales.

Reinforcement Learning
ray-project/ray: Ray is a unified framework that scales AI and Python applications with a distributed runtime and specialized AI libraries.
d2l-ai/d2l-en: An open-source book using Jupyter notebooks to make deep learning accessible, blending concepts, context, and interactive code examples.
Unity-Technologies/ml-agents: Unity ML-Agents enables games and simulations for training intelligent agents with deep reinforcement learning and imitation learning, fostering innovation in AI.
google/trax: Trax is a Google Brain-endorsed deep learning library known for clear code and speed, demonstrated in a Colab notebook.
wandb/wandb: The repository includes a CLI and Python API for visualizing and tracking machine learning experiments effectively.
VowpalWabbit/vowpal_wabbit: Vowpal Wabbit advances machine learning with online, hashing, allreduce, and active learning techniques, pushing the frontier of ML capabilities.

Time Series Analysis
taosdata/TDengine: TDengine is a high-performance, open-source time-series database designed for IoT, connected cars, industrial IoT, and DevOps environments.
timescale/timescaledb: An open-source SQL database for time-series data, optimized for rapid data ingestion and complex querying, available as a PostgreSQL extension.
influxdata/telegraf: Telegraf is an agent for gathering and processing metrics, logs, and data, featuring 300+ plugins and community-driven development for flexibility.
questdb/questdb: QuestDB is an open-source time-series database known for high throughput ingestion, fast SQL queries, and operational simplicity, ideal for various high-cardinality datasets.
ccfos/nightingale: Nightingale is an all-in-one, open-source, cloud-native monitoring system combining data collection, visualization, and alerting capabilities seamlessly.

Data Engineering
PrefectHQ/prefect: Prefect simplifies Python data pipeline orchestration, transforming scripts into dynamic workflows that react to changes and ensure resilience.
airbytehq/airbyte: Airbyte, an open-source data integration platform, offers 300+ connectors for seamless ELT pipelines between diverse data sources and destinations.
argoproj/argo-workflows: Argo Workflows orchestrates parallel jobs on Kubernetes via container-native workflows, supporting DAGs and accelerating compute-intensive tasks like ML and data processing.
dagster-io/dagster: Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, declarative programming, and robust testability across the lifecycle.
Avaiga/taipy: Taipy simplifies web app development for data scientists & ML engineers using Python, focusing on AI algorithms with no extra languages.

Business Intelligence
ankane/blazer: SQL-based tool for data exploration, chart creation, dashboard sharing. Supports various data sources, variables, checks, audits, and security integrations.
evidence-dev/evidence: Open-source BI tool uses Markdown with SQL queries for data sourcing, rendering charts, and generating templated, dynamic web pages.
lightdash/lightdash: Empower teams with self-service data insights using dbt: define metrics, visualize data, and share dashboards seamlessly across your organization.
TuiQiao/CBoard: User-friendly open BI platform for self-service reporting and dashboards, simplifying data insights and sharing across teams effortlessly.
quarylabs/quary: BI platform for engineers to connect databases, write SQL for table transformations, create charts, dashboards, and reports with collaboration and deployment capabilities.

Data Visualization
netdata/netdata: Real-time metrics collection and visualization for servers, cloud, Kubernetes, and edge/IoT devices, scaling effortlessly across diverse environments.
directus/directus: Open-source API and dashboard for managing SQL database content with REST & GraphQL interfaces, supporting various databases, and customizable for on-premises or cloud deployment.
airbnb/visx: Reusable low-level visualization components combining d3's power with React's DOM updating capabilities for dynamic data visualization.
uber/react-vis: React component library for diverse data visualizations: line, bar, scatter, heatmaps, pie charts, sunbursts, radar charts, and more.
bokeh/bokeh: Interactive visualization library for web browsers, offering versatile graphics creation and high-performance interactivity for large datasets and dashboards.
apache/echarts: Free JavaScript library for intuitive, interactive, and customizable charts, ideal for enhancing commercial products with powerful visualizations.

Recommender Systems
NicolasHug/Surprise: Python scikit for building recommender systems with explicit rating data, emphasizing experiment control, dataset handling, and diverse prediction algorithms.
gorse-io/gorse: Open-source recommendation system in Go, designed for universal integration into online services, automating model training based on user interaction data.
recommenders-team/recommenders: Recommenders, a Linux Foundation project, offers Jupyter notebooks for building classic and cutting-edge recommendation systems, covering data prep, modeling, evaluation, optimization, and production deployment on Azure.
alibaba/Alink: Alink, developed by Alibaba's PAI team, integrates Flink for ML algorithms. PyAlink supports various Flink versions, maintaining compatibility up to Flink 1.13.
RUCAIBox/RecBole: RecBole, built on Python and PyTorch, facilitates research with 91 recommendation algorithms across general, sequential, context-aware, and knowledge-based categories.

Quantitative Finance
AI4Finance-Foundation/FinGPT: FinGPT is a cost-effective, adaptable financial large language model for quick updates and fine-tuning, enhancing accessibility compared to BloombergGPT.
google/tf-quant-finance: This library leverages TensorFlow's hardware acceleration and automatic differentiation for high-performance mathematical methods, mid-level functions, and pricing models support.
goldmansachs/gs-quant: GS Quant, a Python toolkit by Goldman Sachs, aids in developing quantitative trading strategies and risk management solutions with robust market experience.
domokane/FinancePy: A Python finance library specializing in pricing and managing financial derivatives across fixed-income, equity, FX, and credit markets.
romanmichaelpaolucci/Q-Fin: QFin is evolving with enhanced object-oriented principles, deprecating old modules like PDEs/SDEs, introducing 'stochastics' for model calibration and option pricing.
avhz/RustQuant: This Rust library for quantitative finance covers diverse modules from autodiff and data handling to instruments pricing and stochastic processes.

Responsible AI
microsoft/responsible-ai-toolbox: Responsible AI Toolbox offers interfaces and libraries for model and data exploration, enabling developers to monitor and improve AI responsibly.
Giskard-AI/giskard: Giskard, an open-source Python library, detects performance, bias, and security issues in AI applications, spanning LLMs to traditional ML models.
fairlearn/fairlearn: Fairlearn, a Python package, helps developers assess and mitigate fairness issues in AI systems with algorithms and assessment metrics provided.
Azure/PyRIT: PyRIT is an open-access Python tool for generative AI, aiding security professionals and ML engineers in identifying system risks.
ModelOriented/DALEX: DALEX enhances model transparency to prevent failure through its explainability tools, supporting understanding and trust in complex AI systems.
JohnSnowLabs/langtest: LangTest simplifies testing of AI models with over 60 tests in one line, covering robustness, bias, fairness, and accuracy across various NLP frameworks.

Explainable AI (XAI)
SeldonIO/alibi: Alibi is a Python library focused on machine learning model inspection, offering diverse explanation methods for classification and regression models.
Trusted-AI/AIX360: AI Explainability 360 offers an open-source Python toolkit for detailed model interpretability across various data types, supporting diverse explanation methods.
dssg/aequitas: Aequitas is an open-source toolkit for bias auditing and Fair ML, aiding data scientists and researchers in assessing and correcting model biases.
albermax/innvestigate: iNNvestigate is a Python library providing a unified interface for various methods to analyze neural networks' predictions and understand their internal workings.
mindsdb/lightwood: Lightwood is an AutoML framework simplifying machine learning pipelines with JSON-AI syntax, allowing customization and automation across diverse data types.

Anomaly Detection
SeldonIO/alibi-detect: Alibi Detect is a Python library for detecting outliers, adversarial attacks, and drift in tabular, text, image, and time series data.
datamllab/tods: TODS automates outlier detection in multivariate time-series data with modules for data processing, feature analysis, and diverse detection algorithms.
pygod-team/pygod: PyGOD is a Python library using PyTorch Geometric for graph outlier detection, offering 10+ algorithms and easy integration with PyOD.
Jingkang50/OpenOOD: This repository replicates methods from the Generalized Out-of-Distribution Detection Framework for fair comparison across anomaly, novelty, and out-of-distribution detection methods.
yzhao062/pyod: PyOD is a Python library for detecting anomalies in multivariate data, offering diverse algorithms for various project scales and datasets.
chaos-genius/chaos_genius: Chaos Genius is an open-source ML-powered analytics engine for outlier detection and root cause analysis at scale.

Supply Chain Analytics
guacsec/guac: GUAC creates a high fidelity graph database for software security, facilitating organizational outcomes like audit, policy, and risk management.
owasp-dep-scan/blint: BLint is a Binary Linter using lief to verify executable security and capabilities, now supporting SBOM generation for compatible binaries.
samirsaci/picking-route: This repository focuses on improving warehouse productivity through Python-based tools and methodologies, particularly addressing order batching and optimizing picking routes using the Single Picker Routing Problem (SPRP).
ragamarkely/scanalytics: Scanalytics automates Supply Chain Analytics & Design tasks in Python, streamlining analyses and reducing manual spreadsheet work for assignments.
aitechtools/SunFlow: SunFlow optimizes supply chain design with comprehensive modeling of materials, components, suppliers, manufacturers, and customers, integrating costs, capacities, and constraints.
CIOL-SUST/SupplyGraph: This repository introduces a benchmark dataset for applying Graph Neural Networks (GNNs) to supply chain networks, enabling research in optimization and prediction.

Network Optimization
ray-project/ray: Ray is a scalable framework with a distributed runtime and AI libraries designed to accelerate AI and Python applications.
svg/svgo: SVGO optimizes SVG files by removing redundant metadata, comments, and hidden elements to improve file efficiency and rendering performance.
zeux/meshoptimizer: meshoptimizer is a C/C++ library optimizing GPU rendering by reducing mesh complexity and storage overhead, compatible with Rust via meshopt crate.
cvxpy/cvxpy: CVXPY is a Python-based modeling language designed for convex optimization problems, providing a natural expression format aligned with mathematical conventions.
guofei9987/scikit-opt: The repository provides Python implementations of various swarm intelligence algorithms such as Genetic Algorithm, Particle Swarm Optimization, and others for optimization tasks.

Speech Processing
espnet/espnet: ESPnet is a detailed speech processing toolkit using PyTorch, covering recognition, synthesis, translation, enhancement, diarization, and understanding tasks.
mozilla/DeepSpeech: DeepSpeech is an open-source Speech-To-Text engine based on Baidu's research, implemented using TensorFlow for accessibility and performance.
microsoft/SpeechT5: The repository proposes SpeechT5, adapting T5's text-to-text approach for self-supervised speech and text representation learning using shared encoders and modality-specific nets.
sloria/TextBlob: Python library simplifying NLP tasks like POS tagging, sentiment analysis, and classification with a straightforward API for textual data.
pytorch/audio: Torchaudio integrates PyTorch with audio processing, emphasizing GPU acceleration, trainable features via autograd, and maintaining a consistent tensor-based style.

Graph Data Science
neo4j/graph-data-science: The Neo4j Graph Data Science (GDS) library offers graph algorithms, transformations, and ML pipelines, accessible via Cypher within Neo4j.
cncf/landscape-graph: This repository explores open source project dynamics, evolution, and collaboration using a Graph Data Model for insightful community analysis.
BlueBrain/nexus: Blue Brain Nexus organizes and enhances data with a Knowledge Graph ecosystem, featuring various products, libraries, and tools for comprehensive use.
lynxkite/lynxkite: LynxKite is a robust graph data science platform with a user-friendly interface and powerful Python API for large datasets.
dgraph-io/dgraph: Dgraph is a scalable GraphQL database optimized for performance, offering ACID transactions and distributed architecture for real-time queries.
arangodb/arangodb: ArangoDB is a versatile multi-model database supporting documents, graphs, and key-values, empowering high-performance applications with SQL-like queries and JavaScript extensions.

ETL/ELT (Extract, Transform, Load / Extract, Load, Transform)
redpanda-data/connect: Redpanda Connect is a robust stream processor for seamless data integration, featuring a powerful mapping language and easy deployment options.
turbot/steampipe: Steampipe simplifies data access from APIs with CLI, Postgres FDWs, SQLite extensions, export tools, and cloud-based Turbot Pipes.
risingwavelabs/risingwave: RisingWave is a cost-efficient streaming database compatible with Postgres, designed for real-time event streaming data processing and analysis.
apache/dolphinscheduler: Apache DolphinScheduler is a modern data orchestration platform with low-code workflow creation, robust task management, and cloud-native capabilities.
rudderlabs/rudder-server: RudderStack is a privacy-focused, Segment-alternative platform in Golang and React. It simplifies data collection and integrates with warehouses and tools for enriched customer data pipelines.

We hope this extensive collection of tools and techniques proves to be a valuable asset in your daily data practice. May it help you achieve smoother workflows and better outcomes!

Merlyn Shelley
29 Apr 2024
12 min read

Fabric’s Code-First AutoML and Hyperparameter Tuning, Google Cloud Cortex Framework, Snowflake’s Data Metric Functions, Qlik's AI Accelerator

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!

👋 Hello,
Welcome to BI-Pro #54: Your Premier Destination for Data and Business Intelligence Insights! 🌟 In this edition, we dive deep into the cutting-edge solutions of business intelligence, data modeling, and advanced analytics. Prepare to explore an array of transformative topics and industry insights that will redefine how you interact with technology and data.

🧩 Highlights of This Issue:
Python Practice Platforms: The top 7 platforms where you can sharpen your Python skills.
Innovative Experiments: Dive into hands-on experiments with MLFlow and Microsoft Fabric to enhance your project’s efficiency.
SAP Expertise: Master the complex data models of SAP and leverage them for optimal performance.
AI-Powered Business Management: Learn how to integrate AI to streamline and enhance business management functions.
Snowflake’s Surveillance: Monitor your data pipelines effectively using Snowflake’s Data Metric Functions.

🧬 Stay Informed with Industry Highlights:
Power BI: Learn about the significant deprecation of AutoML in Power BI using Dataflows V1.
Microsoft Fabric: Get the scoop on the new code-first AutoML and hyperparameter tuning, now available in public preview.
AWS BI: Discover how to build SAP Golden AMIs with EC2 Image Builder and Ansible and explore the transformative impact of Amazon Q on business experiences.
Google Cloud Data: Catch up with the latest updates from the Google Cloud Cortex Framework.
Tableau: Uncover how Einstein Copilot for Tableau is building the next generation of AI-driven analytics.
From the Experts at Packt Community: Gain insights from industry leaders on the fundamentals of Analytics Engineering.

🧮 What’s the Latest from the BI Community?
Explore real-time AI capabilities with Datorios’ new observability tool.
Learn about Snowflake's launch of Arctic, an enterprise-grade LLM.
Discover how Qlik's AI Accelerator is integrating generative AI to deliver customer outcomes.
Witness the future of AI with Avant Technologies’ new supercomputing advancements.

Join us as we unpack these topics to keep you at the forefront of the data and BI world. Stay curious, stay informed!

📥 Feedback on the Weekly Edition
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."

📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.

We appreciate your input and hope you enjoy the book! Share your thoughts and opinions here!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

Packt BI-Pro is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

🚀 GitHub's Most Sought-After Repos
🧩 pixiedust/pixiedust: PixieDust is an open-source library enhancing Jupyter notebooks, improving data work experience, particularly for cloud-hosted notebooks without configuration access.
🧩 plotly/plotly.py: plotly.py is an interactive, open-source graphing library for Python, offering over 30 chart types, including scientific, 3D, statistical, and financial charts.
🧩 AykutSarac/jsoncrack.com: JSON Crack is a free, open-source data visualization app for JSON, YAML, XML, CSV, etc., offering interactive graphs for easy data exploration and analysis.
🧩 apexcharts/apexcharts.js: ApexCharts is a JavaScript charting library with a simple API, 100+ samples, and over a dozen chart types for beautiful, responsive visualizations in apps and dashboards.
🧩 antvis/G2: G2 is a visualization library inspired by "The Grammar of Graphics," offering an introduction, examples, tutorials, and API reference for learning and using its core concepts.
🧩 visgl/deck.gl: deck.gl simplifies high-performance, WebGL2/WebGPU-based visualization of large datasets. It offers pre-built layers for easy setup or customizable architecture for tailored needs.

🔮 Revolutionizing Analytics: New BI Tools
🧬 7 Best Platforms to Practice Python: The article lists seven platforms (Practice Python, Edabit, CodeWars, Exercism, PYnative, LeetCode, and HackerRank) that offer various levels of programming challenges for learning and practicing Python, particularly for coding interviews and skill improvement.
🧬 Experimenting with MLFlow and Microsoft Fabric: The blog discusses the importance of systematic experimentation in machine learning (ML) to improve model performance, highlighting the use of MLFlow within Fabric for managing ML experiments. It covers setting up experiments, running them, logging results, and analyzing them, emphasizing the importance of tracking configurations and outcomes for iterative improvement in ML models.
🧬 Mastering SAP’s data models: The article discusses challenges faced in understanding SAP data models for analytics, focusing on integrating procurement data. It explains SAP's ERP software, data architecture basics, table types (master vs. transaction), and data mapping for procurement tables.
🧬 Building an AI-Powered Business Manager: The post explores the concept of consolidating business management into a single, chat-based platform powered by Large Language Models (LLMs). It discusses the advantages for small businesses, outlines project structure, sets up the database, and updates the Tool class to handle SQLModel instances.
🧬 Monitor Data Pipelines Using Snowflake’s Data Metric Functions: The article emphasizes the importance of data quality in gaining trust with stakeholders and focuses on using Google's Site Reliability Engineering principles to measure the health of data systems. It discusses defining service level indicators and objectives for data quality dimensions and provides a technical implementation example in Snowflake.

⚡ Stay Informed with Industry Highlights

Power BI
🧮 Deprecation of AutoML in Power BI using Dataflows V1: The update announces the deprecation of Power BI Automated Machine Learning (AutoML) models for Dataflows V1 in all regions as of the third week of April. Customers are encouraged to migrate to the AutoML solution based on Synapse Data Science in Microsoft Fabric, offering a more customizable AutoML experience with advanced tools and features.

Microsoft Fabric
🧮 Introducing Code-First AutoML and Hyperparameter Tuning: Now in Public Preview for Fabric Data Science: The update introduces code-first automated machine learning (AutoML) and hyperparameter tuning in Public Preview for Fabric Data Science. Users can access both AutoML and Tune capabilities seamlessly within the Fabric 1.2 runtime, enhancing machine learning model optimization and accessibility.
🧮 Fabric Change the Game: Embracing Azure Cosmos DB for NoSQL. The post explores setting up Azure Cosmos DB for NoSQL and leveraging Vector Search capabilities of AI Search Services through Microsoft Fabric's Lakehouse features. It also discusses integrating Cosmos DB Mirror and using Python coding facilitated through Lakehouse, highlighting Fabric's integration capabilities for search or data mirroring.
🧮 Microsoft Fabric April 2024 Update: The April 2024 update brings various enhancements and previews to Microsoft Fabric, including new visuals like the 100% Stacked Area Chart, improvements to reporting, data connectivity, administration features, analytics, real-time analytics, data factory, and data pipelines. Additionally, the update includes the availability of Exam DP-600 for Fabric Analytics Engineer certification and free learning sessions.

AWS BI
🧮 Build SAP Golden AMIs with EC2 Image Builder and Ansible: This blog post guides users on building a reusable Amazon Machine Image (AMI) for deploying Amazon Elastic Compute Cloud (EC2) instances for SAP installations. It covers using Terraform and Ansible to automate the process and provides sample code.
🧮 Transforming Business Experiences: The Impact of Amazon Q and Generative BI for AWS Partners. This post highlights how advances in AI, particularly Amazon Q and generative BI, are transforming business operations. It showcases how AWS partners like ZS Associates, Tiger Analytics, and Compass UOL are leveraging these innovations for industry-specific solutions.

Google Cloud Data
🧮 What’s new with Google Cloud Cortex Framework? The article discusses Google Cloud Cortex Framework, emphasizing its role in unifying enterprise data for AI-driven insights. It highlights new solutions for marketing, sustainability management, and finance, showcasing how Cortex Framework accelerates innovation, enhances decision-making, and drives business efficiency in the AI era.

Tableau
🧮 Einstein Copilot for Tableau: Building the Next Generation of AI-Driven Analytics. The post delves into the development of Einstein Copilot for Tableau, an AI-driven tool revolutionizing data analysis. It highlights the challenges and solutions in building its infrastructure, improving accuracy and efficiency, and enhancing AI and core capabilities through collaboration and continuous improvement.

✨ Expert Insights from Packt Community
Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more

The role of dbt in analytics engineering

dbt emerged as a solution to the challenges relating to data transformation faced in data analysis. Initially crafted as an open-source Python package, dbt aimed to bring software engineering best practices to the world of analytics. Over time, dbt matured beyond just a package, becoming a versatile cloud service. While the open-source package remains available and actively supported, dbt now offers a cloud-based version, packed with features such as an integrated development environment (IDE), scheduling tools, data lineage trackers, and hosted documentation. This is especially valuable for analysts who might not have a deep software engineering background. For more information on dbt’s history, read https://www.getdbt.com/blog/what-exactly-is-dbt.

We will use dbt Cloud, which offers a free tier for a single developer: that’s you! You can learn more about its pricing here: https://www.getdbt.com/pricing.

dbt seamlessly integrates into the ELT architecture. It does not store or process data but serves as a bridge between analysts and the data warehouse.

[Figure: dbt’s position in a data stack as an intermediary in the transformation layer.]

This is how it works: analysts draft SQL queries, enhanced with dbt’s unique capabilities. dbt then translates this specialized SQL into the native SQL of the data warehouse and dispatches it for execution. All the transformed data and results remain within the data warehouse, making dbt a lightweight yet powerful tool in the analytics toolkit.

Because of dbt’s pivotal position in analytics engineering, we will spend more time discussing its features and zooming in on best practices. First, we will set up dbt for our use case.

Setting up dbt Cloud

The following steps are required for dbt:
Creating a dbt Cloud account.
Setting up a connection from dbt Cloud to BigQuery.
Testing the connection by querying the data using dbt Cloud.

Follow the step-by-step instructions here: https://github.com/PacktPublishing/Fundamentals-of-Analytics-Engineering/blob/main/chapter_8/guides/setting_up_dbt_cloud.md. Now, let’s focus on the various data layers in dbt.

Data layers in dbt

It is a widespread practice to separate the data we use for analytics into layers. This helps data practitioners communicate the distinct parts of the data transformation process. Broadly speaking, the process falls into three layers in dbt: Raw, Preparation, and Business. Let’s take a closer look:
Raw layer: The source data is stored in the form it arrives in. Whenever you receive data, it should be stored as-is so that you have a backup in case something goes wrong during the transformations. When you copied the Excel sheets using Airbyte, they became part of the raw layer inside BigQuery.
Preparation layer: In the second layer, the raw data is cleaned, deduplicated, and transformed to conform to naming conventions and other rules. For our data, this could mean renaming fields for readability and standardizing sales figures from cents to euros.
Business layer: In the final layer, business rules are applied to the prepared data, and different data is joined and modeled into datasets that are ready for consumption by BI tools and stakeholders. In our case, we might add a business rule to disregard negative sales amounts when summing the total stroopwafels sold, as these are likely an error. The resulting data can then be served to the BI tool for dashboarding.

Discover more insights from Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

💡 What's the Latest Scoop from the BI Community?
🧠 Datorios unleashes real-time AI with the first observability tool for streaming data: Datorios introduces the first observability tool for Apache Flink, offering deep insights into streaming data processing. It enables faster AI innovation and thorough auditability, providing developers with event visualization, event search, state monitoring, window analysis, and more. Datorios is now publicly available for free.
🧠 Snowflake Launches Arctic: The Most Open, Enterprise-Grade Large Language Model: Snowflake introduces Snowflake Arctic, an open, enterprise-grade large language model (LLM) with a Mixture-of-Experts architecture, optimized for complex enterprise workloads. Arctic sets new openness standards for AI technology, offering weights under an Apache 2.0 license and enhancing AI innovation.
🧠 Introducing Qlik's AI Accelerator - Delivering Tangible Customer Outcomes in Generative AI Integration: Qlik is at the forefront of integrating generative AI, particularly Large Language Models (LLMs), into data analysis and decision-making. They address key challenges like data privacy, technical complexity, and cost, offering seamless integration of popular LLMs and an AI Accelerator program to quickly prove the benefits of AI integration with minimal barriers to entry.
🧠 Avant Technologies Launches Advanced AI Supercomputing: Avant Technologies, an AI company, introduces a supercomputing network and licensable dataset with Wired4Tech, aiming to accelerate AI adoption. The offerings include a versatile AI dataset, dynamic resource scaling, accelerated AI processing, robust security measures, and seamless integration, designed to empower developers and drive innovation across industries.

See you next time!
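
As a small aside on the "Data layers in dbt" excerpt earlier in this issue, the Raw, Preparation, and Business layers can be illustrated in a few lines. dbt itself expresses these steps as SQL models run inside the data warehouse; the sketch below swaps in plain Python purely to show the logic. The sample rows, field names, and function names are hypothetical and do not come from the book.

```python
from decimal import Decimal

# Raw layer: data kept exactly as it arrived (hypothetical sample rows).
RAW_SALES = [
    {"Order ID": "A1", "Sales (cents)": 1250},
    {"Order ID": "A1", "Sales (cents)": 1250},  # exact duplicate row
    {"Order ID": "A2", "Sales (cents)": -300},  # negative amount, likely an error
    {"Order ID": "A3", "Sales (cents)": 499},
]


def prepare(raw_rows):
    """Preparation layer: deduplicate, rename fields, convert cents to euros."""
    seen = set()
    prepared = []
    for row in raw_rows:
        key = tuple(sorted(row.items()))  # identity of the whole row
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        prepared.append(
            {
                "order_id": row["Order ID"],
                "sales_eur": Decimal(row["Sales (cents)"]) / Decimal(100),
            }
        )
    return prepared


def total_sales(prepared_rows):
    """Business layer: sum sales, disregarding negative amounts."""
    return sum(r["sales_eur"] for r in prepared_rows if r["sales_eur"] >= 0)


prepared = prepare(RAW_SALES)  # the raw rows themselves are never modified
total = total_sales(prepared)
print(total)  # 17.49
```

Keeping the raw rows untouched and deriving each later layer from the one before it mirrors the backup rationale given for the raw layer: if a transformation turns out to be wrong, it can be corrected and rerun from the original data.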

Merlyn Shelley
04 Apr 2024
11 min read

BI-Pro#49: Microsoft Fabric Lifecycle Management, Data Factory Adds CI/CD to Fabric Data Pipelines, Database Mirroring, AWS Well-Architected Data Analytics Lens

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,Welcome to BI-Pro #49, your ultimate guide to data and BI insights! 🚀 ⏩ What's Inside? Python Simplified: Master data validation with Pydantic. Visualize Like a Pro: 30+ tools for stunning data visuals. R for Bioinformatics: Custom visuals for bio data. Interactive Data: JavaScript meets Handsontable. Seaborn Stories: Craft data tales with line plots. MetaGPT Insights: Next-gen data solutions unveiled. 🏭 Industry Scoop: Power BI’s Latest: March's must-know features. Fabric Innovations: Updates and new tools from Microsoft Fabric. AWS Well-Architected Data Analytics Lens: Analytics strategies for the real world. Google Cloud Savings: Cut costs on ETL workflows. Tableau Journeys: From student to BI analyst. 💎 Expert Takes: Deep Dive into Python Deep Learning: The latest from Packt. 👉 Community Buzz: Twitch Chat Analysis, Graph Networks, LLM Data Quality, and Ethical AI: Key conversations this week! Dive into the trends shaping data and BI today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktThanks for reading Packt BI-Pro! Subscribe for free to receive new posts and support our work.Pledge your supportSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos🌀 man-group/ArcticDB: ArcticDB is a high-performance DataFrame database designed for Python Data Science, with a Python-centric API for Pandas DataFrames. 
🌀 gradio-app/gradio: Gradio is an open-source Python package for building demos or web apps for ML models or Python functions, with easy sharing via built-in features. 🌀 Sinaptik-AI/pandas-ai: PandasAI is a Python library using generative AI to explore, clean, and analyze data with natural language queries.🌀 OpenRefine/OpenRefine: OpenRefine is a powerful Java-based tool for loading, understanding, cleaning, reconciling, and augmenting data, accessible from a web browser. 🌀 Kanaries/pygwalker: PyGWalker simplifies Jupyter Notebook workflows by converting pandas dataframes into interactive user interfaces for data analysis and visualization. 🌀 cleanlab/cleanlab: cleanlab aids in data and label cleaning by identifying issues in ML datasets automatically, enabling better model training with real-world data.Email Forwarded? Join BI-Pro Here!🔮 Data Viz with Python Libraries  🌀 Pydantic Tutorial: Data Validation in Python Made Simple. This blog tutorial explains how to use Pydantic, a data validation and serialization library in Python, to validate and serialize data classes, offering support for custom validators and Python's type hints for field validation. 🌀 30+ Data Visualization Libraries, Frameworks and Apps, Mastering Data Presentation: Explore over 30 data visualization tools like Metabase, Gephi, and Grafana, offering a range of features to transform raw data into meaningful visualizations for better decision-making in industries like tech, healthcare, finance, and marketing. 🌀 Mastering Data Visualization in R for Bioinformatics:  The article delves into data visualization in R for bioinformatics, stressing its role in understanding complex biological data, communicating findings, hypothesis generation, and decision-making. It also discusses Anscombe's Quartet, highlighting the importance of visualizing data before analysis and the limitations of summary statistics. 
🌀 Integrating JavaScript charting libraries with Handsontable: The article guides developers on integrating Highcharts, Recharts, and Chart.js with Handsontable for data visualization. It explains the features of each library and provides demos for creating a stock portfolio with interactive charts. 🌀 Data Visualization with Seaborn Line Plot: The article introduces Seaborn, a Python library for data visualization, built on top of Matplotlib. It covers installation and demonstrates creating single line plots and customizing styles for better presentation of data. 🌀 MetaGPT’s Data Interpreter: SOTA Open Source LLM-based Data Solutions. MetaGPT introduces its Data Interpreter, a new agent for streamlined data interpretation and analysis. The Data Interpreter employs advanced techniques for real-time data adaptability, tool integration, and logical inconsistency identification, showcasing superior performance in machine learning tasks. ⚡Stay Informed with Industry HighlightsPower BI 🌀 Power BI March 2024 Feature Summary: The Power BI update introduces visual calculation editing, data model editing in the Power BI Service, and report subscription delivery to OneDrive SharePoint. A new Microsoft Fabric certification exam, DP-600, is also available, with free certification opportunities through the Fabric AI Skills Challenge. 🌀 Announcing the Public Preview of Database Mirroring in Microsoft Fabric: Mirroring, now in Public Preview, allows seamless integration of databases into Microsoft Fabric's OneLake, providing real-time insights without ETL. It simplifies data replication and warehousing, enabling easy data access and analysis across different sources, including data lakes and warehouses. 🌀 Get data with Power Query available in Power BI Report Builder (Preview): Power BI Report Builder now allows connecting to 100+ data sources like Snowflake, Databricks, and AWS Redshift. You can transform data using M-Query for paginated reports. 
Install the latest version and connect from the "Data" tab. Microsoft Fabric🌀 Microsoft Fabric March 2024 Update: This update brings new features like OneLake File Explorer, Autotune Query Tuning, and Test Framework for Power Query SDK in VS Code to Power BI, enhancing reporting, modeling, service, mobile, and developer experiences. 🌀 Data Factory Adds CI/CD to Fabric Data Pipelines: Fabric engineers with Azure Synapse Analytics and Azure Data Factory experience can now utilize Git integration and built-in Deployment Pipelines in Data Factory data pipelines in Fabric. This public preview offers source control, CI/CD features, and collaborative development environments, enhancing data analytics projects. 🌀 Microsoft Fabric Lifecycle Management – Getting started with Git Integration and Deployment Pipelines: Microsoft Fabric makes Lifecycle Management easy, enabling continuous releases through Git and Deployment Pipelines. Git allows reliable updates for supported items like Lakehouse, Notebooks, and Reports, while Deployment Pipelines clone content between stages like DEV, TEST, UAT, and PROD. AWS BI  🌀 Announcing the AWS Well-Architected Data Analytics Lens: The Data Analytics Lens helps assess and improve analytics platforms on AWS. It offers best practices, such as building ACID-compliant data lakes and leveraging Serverless for data pipelines, aligned with the AWS Well-Architected Framework's pillars for secure, efficient, and cost-effective solutions. 🌀 Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics. 
The post discusses how healthcare providers can improve patient care by leveraging AWS services for real-time analytics and personalized healthcare, focusing on a zero-ETL approach to data integration.

Google Cloud Data
🌀 Enrich streaming data in Bigtable with Dataflow: The post discusses the importance of event stream processing in data engineering and introduces Apache Beam's Enrichment transform, which simplifies the process of enriching streaming data with Bigtable, improving data context and enabling more meaningful analysis.
🌀 Dataflow at-least-once vs. exactly-once streaming modes: The post compares exactly-once and at-least-once processing modes in Dataflow Streaming Engine for streaming jobs. It explains the trade-offs between the two modes and provides guidance on choosing the right mode based on use case requirements.

Tableau
🌀 Data is both art and science - My Tableau Story: Andy Cotgreave. The post highlights Andy Cotgreave's journey from a data analyst at Oxford to becoming a Senior Technical Evangelist at Tableau. It emphasizes the importance of community engagement, innovation, building a portfolio, and having fun in data visualization.
🌀 Student to BI Analyst, How Tableau Can Lead to a Successful Data Career: This blog discusses Karolina Grodzinska's data visualization journey, from discovering Tableau to winning Iron Viz: Student Edition and becoming a Business Intelligence Analyst at Schneider Electric. Karolina emphasizes the importance of an active Tableau Public profile in career development and shares tips for building a strong portfolio and networking with the Tableau Community.

✨ Expert Insights from Packt Community
Python Deep Learning - Third Edition - By Ivan Vasilev
Developing NN models for edge devices with TF Lite
TF Lite is a TF-derived set of tools that allows us to run models on mobile, embedded, and edge devices.
Its versatility is part of TF’s appeal for industrial applications (as opposed to research applications, where PyTorch dominates). The key paradigm of TF Lite is that the models run on-device, in contrast to a client-server architecture, where the model is deployed on remote, more powerful hardware. This organization has the following implications (both good and bad):
Low-latency execution: The lack of a server round trip significantly reduces the model inference time and allows us to run real-time applications.
Privacy: The user data never leaves the device.
Internet connectivity: Internet connectivity is not required.
Small model size: The devices have limited computational ability, hence the need for small and computationally efficient models. More specifically, TF Lite models are stored in the efficient, portable FlatBuffers format (https://flatbuffers.dev/), identified by the .tflite file extension. Besides its small size, it allows us to access data directly without parsing/unpacking it first. TF Lite models support a subset of the TF Core operations and allow us to define custom ones.
Low power consumption: The devices often run on battery.
Divergent training and inference: NN training is a lot more computationally intensive compared to inference. Because of this, the model training runs on a different, more powerful piece of hardware than the actual devices where the models will run inference.
In addition, TF Lite has the following key features:
Multi-platform and multi-language support, including Android (Java), iOS (Objective-C and Swift) devices, web (JavaScript), and Python for all other environments. Google provides a TF Lite wrapper API called MediaPipe Solutions (https://developers.google.com/mediapipe, https://github.com/google/mediapipe/), which supersedes the previous TF Lite API.
Optimized for performance.
It has end-to-end solution pipelines. TF Lite is oriented toward practical applications, rather than research.
Because of this, it includes different pipelines for common ML tasks such as image classification, object detection, text classification, and question answering among others. The computer vision pipelines use modified versions of EfficientNet or MobileNet, and the natural language processing pipelines use BERT-based models. So, how does TF Lite model development work? First, we’ll select a model in one of the following ways:  An existing pre-trained .tflite model (https://tfhub.dev/s?deployment-format=lite). Use MediaPipe Model Maker (https://developers.google.com/mediapipe/solutions/model_maker) to apply feature engineering transfer learning on an existing .tflite model with a custom training dataset. Model Maker only works with Python. Convert a full-fledged TF model into .tflite format. Discover more insights from 'Python Deep Learning - Third Edition' by Ivan Vasilev. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here💡 What's the Latest Scoop from the BI Community? 🌀 Real-Time Twitch Chat Sentiment Analysis with Apache Flink: This blog explores building a real-time sentiment analysis application for Twitch chat using Apache Flink. It covers setting up the project, reading Twitch chat messages, performing sentiment analysis, and concludes with a demo. 🌀 Entity Type Prediction with Relational Graph Convolutional Network (PyTorch): This post discusses a Python setup for predicting entity types on heterogeneous graphs using the Relational Graph Convolutional Network (R-GCN) and the RGCNConv module from PyTorch. It explains knowledge graphs, entity type prediction, and the R-GCN model. 🌀 Data Quality Error Detection powered by LLMs: This article explores automating the identification of data errors in tabular datasets using Large Language Models (LLMs). It discusses the Data Dirtiness Score, challenges in data cleaning, and the potential of LLMs in detecting data quality issues. 
🌀 Building Ethical AI Starts with the Data Team — Here’s Why: This article discusses the ethical considerations of AI, focusing on model bias, AI usage, and data responsibility. It emphasizes the role of data teams in ensuring ethical AI and suggests steps for data teams to take towards a more ethical future. See you next time!

Merlyn Shelley
28 Mar 2024
12 min read

Elevate Your BI Dashboards with Figma

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!Partnering with Figma Want to take your BI dashboards to the next level? Figma is the way to go!  It's all about ramping up the design, making things work better, and giving your Power BI projects a real boost.  With Figma, you'll speed up your projects, get more creative, and see better performance. So, why not give your reports a makeover with Figma? It's where design and data come together to make a big impact! Here's what Figma offers: ✅ Figma Professional: An all-in-one tool for seamless team collaboration. ✅ FigJam: Enables real-time teamwork and brainstorming. ✅ FigJam AI: Integrates ChatGPT for smarter collaboration. Guess what? You also have the Power BI UI Kit from the Figma Community! Sign Up Now! 👋 Hello,Welcome to BI-Pro #48, your ultimate guide to data and BI insights! 🚀In this issue: 🔮 Python Data Viz Matplotlib Data Visualization Seaborn: Visualizing Data in Python Use pandas for CSV Data Visualization Guides on SQL, Python, Data Cleaning, and Analysis Build An AI App with Python in 10 Steps ⚡ Industry Highlights Power BI Hybrid Workforce Experience Report Lakeview Dashboards Overview Grouping and Binning in Power BI Desktop Dashboards in Operations Manager Microsoft Fabric Analyze Dataverse Tables Bridging Fabric Lakehouses AWS Big Data Multicloud Analytics with Amazon Athena Analyze Fastly CDN Logs with QuickSight Google Cloud Data Spark Procedures in BigQuery  Gemini Pro 1.0 in BigQuery via Vertex AI ✨ Expert Insights from Packt Community Unlocking the Secrets of Prompt Engineering 💡 BI Community Scoop Creating Interactive Power BI Dashboards Using Report Templates in Power BI Desktop 10 Analytics Dashboard Examples for SaaS Future of Data Storytelling: Actionable Intelligence Power BI: Transforming Banking Data Power BI vs Tableau vs Qlik Sense | 2024 Winner Get ready to supercharge your skills with BI-Pro! 
🌟 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos🌀 sdv-dev/SDV: The Synthetic Data Vault (SDV) is a Python library that creates tabular synthetic data by learning patterns from real data using machine learning algorithms. 🌀 hyperspy/hyperspy: HyperSpy is a Python library for analyzing multidimensional datasets, making it easy to apply analytical procedures and access tools. 🌀 hi-primus/optimus: Optimus is a Python library for loading, processing, plotting, and creating ML models that works with pandas, Dask, cuDF, dask-cuDF, Vaex, or Spark. It simplifies data processing and offers various functions for data quality, plotting, and cross-platform compatibility. 🌀 mingrammer/diagrams: Diagrams simplifies cloud system architecture design in Python, supporting major providers and tracking changes in version control. 🌀 kayak/pypika: PyPika simplifies building SQL queries in Python with a flexible, easy-to-use interface, leveraging the builder design pattern for clean, efficient queries. Email Forwarded? Join BI-Pro Here!Partnering with Webflow   Transform your BI reporting with Webflow Enterprise.  Create visually stunning, scalable websites without coding, using a visual canvas.Seamlessly integrate with popular BI platforms and let Webflow handle the code.Start building smarter, faster, and more reliable websites for your data-driven decisions today! Get Started for Free! 
🔮 Data Viz with Python Libraries  🌀 Matplotlib Data Visualization in Python: This blog introduces Matplotlib, a Python library for 2D visualizations, covering its capabilities and plot types like line, scatter, bar, histograms, and pie charts. It highlights Matplotlib's versatility, customization, and integration with other libraries, making it essential for data science and research. 🌀 Visualizing Data in Python With Seaborn:  This article introduces the seaborn library for statistical visualizations in Python. It covers creating various plots, such as bar, distribution, and relational plots, using seaborn's functional and objects interfaces. It emphasizes seaborn's clear and concise code for effective data visualization. 🌀 Use pandas to Visualize CSV Data in Python: This blog discusses using the CData Python Connector for CSV with pandas, Matplotlib, and SQLAlchemy to analyze and visualize live CSV data in Python. It highlights the ease of integration and superior performance of the connector, along with step-by-step instructions for connecting to CSV data, executing SQL queries, and visualizing the results in Python. 🌀 Collection of Guides on Mastering SQL, Python, Data Cleaning, Data Wrangling, and Exploratory Data Analysis: This guide is tailored for business intelligence professionals new to data science, offering step-by-step instructions on mastering SQL, Python, data cleaning, wrangling, and exploratory analysis. It emphasizes practical skills for extracting insights and showcases essential tools and techniques for effective data analysis. 🌀 Build An AI Application with Python in 10 Easy Steps: This blog outlines a 10-step guide to building and deploying AI applications with Python, covering objectives, data collection, model selection, training, evaluation, optimization, web app development, cloud deployment, and sharing the AI model, with practical advice for each step. 
⚡Stay Informed with Industry HighlightsPower BI 🌀 Hybrid Workforce Experience Power BI report: This tutorial explains using the Power BI Hybrid Workforce Experience report to analyze the impact of hybrid work models on employees working onsite, remotely, or in a hybrid manner. It covers setup, key metrics analysis, and improving employee experience, with prerequisites outlined. 🌀 What are Lakeview dashboards? This article discusses Lakeview dashboards, designed for creating and sharing data visualizations within teams. It highlights their advanced features, comparison with Databricks SQL dashboards, and dataset optimizations for better performance, including handling various dataset sizes and query efficiency. 🌀 Use grouping and binning in Power BI Desktop: This article explains how to use grouping and binning in Power BI Desktop to refine data visualization. Grouping allows you to combine data points into larger categories for clearer analysis, while binning lets you define the size of data chunks for more meaningful visualization. The article provides step-by-step instructions for creating, editing, and applying groups and bins to numerical and time fields, enhancing the exploration of data and trends in visuals. 🌀 Dashboards in Operations Manager: This article covers dashboard templates and widgets in Operations Manager, outlining their layouts and functions. It highlights various dashboard types, such as Service Level, Summary, and Object State, each with specific widgets. Users can create, share, and view dashboards across different consoles. Microsoft Fabric🌀 Analyze Dataverse tables from Microsoft Fabric: The article announces new features for Dynamics 365 and Power Apps customers, allowing easy integration of insights into Fabric. Users can now create shortcuts to Dataverse environments in Fabric for quick data access and analysis across multiple environments, enhancing business insights. 
🌀 Bridging Fabric Lakehouses: Delta Change Data Feed for Seamless ETL. This article explains using Delta Tables and the Delta Change Data Feed in Microsoft Fabric for efficient data synchronization across lakehouses. It highlights Delta Tables' features and demonstrates updating tables across Silver and Gold Lakehouses in a medallion architecture. AWS BI  🌀 Multicloud data lake analytics with Amazon Athena: This post discusses creating a unified query interface using Amazon Athena connectors to seamlessly query across multiple cloud data stores, simplifying analytics in organizations with data spread over different clouds. It also explores managing analytics costs using Athena workgroups and cost allocation tags. 🌀 How to Analyze Fastly Content Delivery Network Logs with Amazon QuickSight Powered by Generative BI? This post discusses using Fastly, a content delivery network (CDN), to enhance web performance and security. It highlights creating a dashboard with Amazon QuickSight for analyzing CDN logs, using AWS services like S3 and Glue for data storage and cataloging. Google Cloud Data 🌀 Apache Spark stored procedures in BigQuery are GA: BigQuery now supports Apache Spark stored procedures, enabling users to integrate Spark-based data processing with BigQuery's SQL capabilities. This simplifies using Spark within BigQuery, allowing seamless development, testing, and deployment of PySpark code, and installation of necessary packages in a unified environment. 🌀 Gemini Pro 1.0 available in BigQuery through Vertex AI: This post advocates for a unified platform to bridge data and AI teams, ensuring smooth workflows from data ingestion to ML training. It introduces BigQuery ML, enabling ML model creation, training, and execution in BigQuery using SQL. It supports various models, including Vertex AI-trained ones like PaLM 2 and Gemini Pro 1.0, and enables sharing trained models, promoting governed data usage and easy dataset discovery. 
Gemini Pro 1.0 integration into BigQuery via Vertex AI simplifies generative AI, enhancing collaboration, security, and governance in data workflows. ✨ Expert Insights from Packt CommunityUnlocking the Secrets of Prompt Engineering - By Gilbert Mizrahi Exploring LLM parameters LLMs such as OpenAI’s GPT-4 consist of several parameters that can be adjusted to control and fine-tune their behavior and performance. Understanding and manipulating these parameters can help users obtain more accurate, relevant, and contextually appropriate outputs. Some of the most important LLM parameters to consider are listed here: Model size: The size of an LLM typically refers to the number of neurons or parameters it has. Larger models can be more powerful and capable of generating more accurate and coherent responses. However, they might also require more computational resources and processing time. Users may need to balance the trade-off between model size and computational efficiency, depending on their specific requirements. Temperature: The temperature parameter controls the randomness of the output generated by the LLM. A higher temperature value (for example, 0.8) produces more diverse and creative responses, while a lower value (for example, 0.2) results in more focused and deterministic outputs. Adjusting the temperature can help users fine-tune the balance between creativity and consistency in the model’s responses. Top-k: The top-k parameter is another way to control the randomness and diversity of the LLM’s output. This parameter limits the model to consider only the top “k” most probable tokens for each step in generating the response. For example, if top-k is set to 5, the model will choose the next token from the five most likely options. By adjusting the top-k value, users can manage the trade-off between response diversity and coherence. 
A smaller top-k value generally results in more focused and deterministic outputs, while a larger top-k value allows for more diverse and creative responses. Max tokens: The max tokens parameter sets the maximum number of tokens (words or subwords) allowed in the generated output. By adjusting this parameter, users can control the length of the response provided by the LLM. Setting a lower max tokens value can help ensure concise answers, while a higher value allows for more detailed and elaborate responses. Prompt length: While not a direct parameter of the LLM, the length of the input prompt can influence the model’s performance. A longer, more detailed prompt can provide the LLM with more context and guidance, resulting in more accurate and relevant responses. However, users should be aware that very long prompts can consume a significant portion of the token limit, potentially truncating the model’s output. Discover more insights from 'Unlocking the Secrets of Prompt Engineering' by Gilbert Mizrahi. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here💡 What's the Latest Scoop from the BI Community? 🌀 Creating Interactive Power BI Dashboards That Engage Your Audience: This blog discusses the challenges faced by stakeholders and clients unfamiliar with using dashboards, preferring traditional tools like Excel. It emphasizes the importance of creating user-friendly and interactive dashboards to bridge this gap, offering techniques to enhance engagement and accessibility.🌀 Create and use report templates in Power BI Desktop: This tutorial explains how to create and use report templates in Power BI Desktop, enabling users to streamline report creation and standardize layouts, data models, and queries. Templates, saved with the .PBIT extension, help jump-start and share report creation processes across an organization. 
🌀 10 Analytics Dashboard Examples to Gain Data Insights for SaaS: This article discusses the importance of analytics dashboards in simplifying the tracking of SaaS metrics and extracting insights. It provides 10 examples of analytics dashboards, including web, digital marketing, and user behavior, and highlights the top 5 analytics tools. The article emphasizes the need for clear, customizable, and intuitive dashboards for effective decision-making. 🌀 The Future of Data Storytelling: Actionable Intelligence [AI, Power BI, and Office]: This blog post discusses Zebra BI's solutions for reporting, planning, and presenting, emphasizing the importance of clarity, consistency, and actionability in data visualization. It introduces the concept of a reporting-planning-presenting cycle and highlights upcoming features and innovations, including the integration of AI. The post also mentions Zebra BI's adherence to the IBCS standard for clear and consistent business communication. 🌀 Power BI: Transforming Banking Data. This blog post discusses how Power BI can help banks analyze complex data for better decision-making. It covers challenges in banking, how Power BI integrates data sources, develops dashboards, and optimizes analytics. Benefits include improved operations, customer experience, risk management, and cost savings. 🌀 Power BI vs Tableau vs Qlik Sense | Which Wins In 2024? This blog compares Power BI, Tableau, and Qlik Sense for business intelligence (BI) and analytics. It highlights Power BI's advantages in data management, Tableau's strong visualization capabilities, and Qlik Sense's modern self-service platform. The article concludes with a comparison of features and recommendations for different needs. See you next time!Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. 
We only recommend products and services we think will benefit our readers. Thanks for your support! 
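The temperature, top-k, and max tokens parameters described in the prompt engineering excerpt above can be illustrated with a toy sampler. This is a minimal sketch, not how any particular LLM API implements them: the function names, the `example_logits` scores, and the fixed (context-independent) scores are all assumptions made for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick one token from a {token: score} mapping."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Top-k: keep only the k highest-scoring candidates.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    # Temperature: divide scores before the softmax. Low values sharpen the
    # distribution (more deterministic), high values flatten it (more diverse).
    scaled = [score / temperature for _, score in ranked]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(logits, max_tokens, **kwargs):
    """Max tokens: stop after a fixed number of sampled tokens."""
    return [sample_next_token(logits, **kwargs) for _ in range(max_tokens)]


# Hypothetical scores for four candidate tokens.
example_logits = {"waffle": 2.0, "cookie": 1.0, "cake": 0.5, "pie": -1.0}
reply = generate(example_logits, max_tokens=5, temperature=0.8, top_k=2)
print(reply)
```

With `top_k=2`, only "waffle" and "cookie" can ever be chosen; with `top_k=1` the sampler always returns the highest-scoring token regardless of temperature.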
Merlyn Shelley
26 Mar 2024
14 min read

Transforming Web Data with Browse AI

Partnering with Browse AI

Turn Web Data into Your Business Superpower!

👉 Train a robot in 2 minutes, no coding needed. 🤖
👉 Ideal for web scraping and data monitoring. 🌐

Here’s what you get:
Monitor Websites for Changes ✅
Download Data from Any Website ✅
Turn Any Website into an API ✅
Product data extraction ✅

Also, extract data from news, stocks, jobs, social media, and more. Check out this 1-minute explainer video on how to extract data to Excel, Airtable, and connect to 5,000+ apps using Zapier! Start for free with up to 50 credits, and for a limited time, enjoy free setup and onboarding for Team and Company plans, saving up to 20% on Annual plans.

👋 Hello,

Welcome to DataPro#85 – Your one-stop shop for the latest in Data Science and ML Algorithms! 🚀

In this issue:

⚙️ Keeping Up with LLMs & GPTs
Meet Devin: The pioneering AI software engineer.
Google's Croissant: A fresh take on metadata for ML-ready datasets.
INSTRUCTIR by KAIST AI: Setting new standards in instruction-following for information retrieval models.
Spyx by Sussex AI: Turbocharging spiking neural networks with just-in-time compiled optimization.
SynCode by VMware: Enhancing LLM code generation with a touch of grammar.
Chatbot Arena: The ultimate battleground for evaluating LLMs by human preference.
Apollo: Bringing medical AI to the masses with a multilingual medical LLM.

✨ On the Radar
Top AI tools for code generation in 2024.
Setting up a PyPI mirror in AWS with Terraform.
Ensuring safer code changes with custom pre-commit hooks.
Deciphering the AQLM Quantization Algorithm.
AI's role in revolutionizing web browsing.
Tackling tensors through three tricky errors.
Running RStudio inside a container.
Harnessing PyTorch and MLX for Apple Silicon.

🏭 Industry Highlights
Google Research: Boosting LLMs with Cappy, evolving tables with Chain-of-table, and Scalable Instructable Multiworld Agent (SIMA).
AWS: Streamlining code review with generative AI using Amazon Bedrock.
OpenAI Updates: Leadership continuity and global news partnerships.

📚 New in Packt Library
Practical Guide to Applied Conformal Prediction in Python by Valery Manokhin.

DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today!

📥 Feedback on the Weekly Edition
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." We appreciate your input and hope you enjoy the book!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

🔰 GitHub Finds: Any of These Repos in Your Toolbox?

🛠️ deepseek-ai/DeepSeek-VL: Open-source Vision-Language (VL) model for real-world tasks, handling logical diagrams, web pages, formulas, scientific literature, and more.
🛠️ OpenGVLab/VideoMamba: VideoMamba enhances 3D CNNs and video transformers, excelling in long-term video understanding with scalability and modality compatibility.
🛠️ showlab/DragAnything: DragAnything uses entity representation for motion control in video generation, offering user-friendly interaction and outperforming existing methods.
🛠️ pkunlp-icler/FastV: FastV accelerates large vision language models by pruning redundant visual tokens, achieving 45% FLOPs reduction without performance loss.
🛠️ cnulab/RealNet: RealNet introduces SDAS for anomaly strength control, AFS for feature selection, and RRS for anomaly region identification.

Partnering with Surfshark

Surfshark is allowing our readers to enjoy a full 2 years of their award-winning VPN protection for 79% off, plus 2 months free. With Surfshark One, you get:
Unlimited devices and connections ✅
One account for the entire household ✅
Your online activity, made safe, secure, and invisible ✅
Plus, identity protection, ad blocking, antivirus, and data breach monitoring.

📚 Expert Insights from Packt Community

Practical Guide to Applied Conformal Prediction in Python - By Valery Manokhin

Basic components of a conformal predictor

We will now look at the basic components of a conformal predictor:

Nonconformity measure: The nonconformity measure is a function that evaluates how much a new data point differs from the existing data points. It compares the new observation to either the entire dataset (in the full transductive version of conformal prediction) or the calibration set (in the most popular variant, inductive conformal prediction, or ICP). The choice of nonconformity measure depends on the machine learning task, such as classification, regression, or time series forecasting, as well as on the underlying model. The book examines several nonconformity measures suitable for classification and regression tasks.

Calibration set: The calibration set is a portion of the dataset used to calculate nonconformity scores for the known data points. These scores are a reference for establishing prediction intervals or regions for new test data points. The calibration set should be a representative sample of the entire data distribution and is typically randomly selected. It should also contain a sufficient number of data points (at least 500). If the dataset is too small to reserve enough data for the calibration set, the user should consider other variants of conformal prediction, including transductive conformal prediction (TCP) (see, for example, Mastering Classical Transductive Conformal Prediction in Action – https://medium.com/@valeman/how-to-use-full-transductive-conformal-prediction-7ed54dc6b72b).

Test set: The test set contains new data points for generating predictions. For every data point in the test set, the conformal prediction model calculates a nonconformity score using the nonconformity measure and compares it to the scores from the calibration set. Using this comparison, the conformal predictor generates a prediction region that includes the target value with a user-defined confidence level.

All these components work in tandem to create a conformal prediction framework that facilitates valid and efficient uncertainty quantification in a wide range of machine learning tasks.

Discover more insights from 'Practical Guide to Applied Conformal Prediction in Python' by Valery Manokhin. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz!

AWS ML Made Easy

🌀 Enhance code review and approval efficiency with generative AI using Amazon Bedrock: This post discusses the challenges managers face in overseeing code review and approval processes in software development, such as lack of technical expertise, time constraints, volume of change requests, manual effort, and the need for documentation. It also introduces a solution that integrates generative AI with AWS deployment tools to streamline the review and approval process. The solution includes automated change analysis, summarization, and an approval workflow.

Google Research

🌀 Cappy: Outperforming and boosting large multi-task language models with a small scorer. This blog discusses advancements in large language models (LLMs) and their use in natural language processing (NLP). It introduces the concept of multi-task LLMs, such as T0, FLAN, and OPT-IML, which excel at understanding and solving various tasks. It also presents a new approach called Cappy, a lightweight pre-trained scorer that enhances the performance and efficiency of multi-task LLMs.
🌀 Chain-of-table: Evolving tables in the reasoning chain for table understanding. This research focuses on improving how large language models (LLMs) reason over tabular data, which is challenging due to the structured nature of tables. The proposed framework, Chain-of-Table, trains LLMs to iteratively update tables, mimicking human reasoning, resulting in improved performance on table understanding tasks.
🌀 Talk like a graph: Encoding graphs for large language models. This research explores how to teach large language models (LLMs) to reason with graph information, which is crucial for understanding interconnected data. The authors introduce GraphQA, a benchmark to evaluate LLMs on graph problems, revealing insights into effective graph encoding methods and improving LLM performance on graph tasks by up to 60%.
🌀 Scalable Instructable Multiworld Agent (SIMA): A generalist AI agent for 3D virtual environments. Google DeepMind has developed SIMA, a versatile AI agent trained on multiple video games to follow natural-language instructions much as a human would. Developed in collaboration with game studios, SIMA navigates various environments, showcasing the potential for AI to understand and execute diverse tasks.

OpenAI Updates

🌀 Review completed & Altman, Brockman to continue to lead OpenAI: The OpenAI Board completed a review by WilmerHale, expressing full confidence in Sam Altman and Greg Brockman's leadership. They also elected new board members and adopted governance enhancements. WilmerHale's review found a breakdown in trust between the prior Board and Mr. Altman, leading to his removal, but concluded that his conduct did not mandate removal. Following the review, the Board endorsed the decision to rehire Mr. Altman and Mr. Brockman.
🌀 Global news partnerships: Le Monde and Prisa Media: OpenAI has partnered with Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT. This partnership aims to enhance user interaction with news content and contribute to the training of OpenAI's models. Through these partnerships, users will access summaries and links to original articles, expanding their news consumption experience. This collaboration supports the news industry and its role in providing reliable information globally.

🔍 From Bits to BERT: Keeping Up with LLMs & GPTs

🌀 Introducing Devin, the first AI software engineer: Meet Devin, the autonomous AI software engineer, skilled in long-term reasoning and planning. Devin can learn new technologies, build and deploy apps, find and fix bugs, train AI models, and contribute to open source. Devin excels in resolving real-world GitHub issues, outperforming previous models. Cognition, the AI lab behind Devin, aims to unlock new possibilities beyond coding.
🌀 Google’s Croissant: a metadata format for ML-ready datasets. Croissant is a new metadata format for ML datasets, aiming to simplify the use of existing datasets for training ML models. It standardizes dataset descriptions and organization, supporting responsible AI practices. Croissant builds upon schema.org and is supported by major tools and repositories like Kaggle, Hugging Face, and OpenML. It includes a specification, example datasets, a Python library, and a visual editor to facilitate dataset usage and publication.
🌀 KAIST AI’s INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models. This research focuses on enhancing search accuracy by improving retrievers to understand users' intentions, similar to language models. It introduces INSTRUCTIR, a benchmark for evaluating retrievers' ability to follow user-aligned instructions in retrieval tasks. The study addresses limitations in existing benchmarks and highlights potential overfitting issues in instruction-aware retrieval datasets.
🌀 Sussex AI’s Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks. Advancements in large neural architectures have led to powerful AI accelerators for training deep neural networks. However, these networks often incur high costs. Neuromorphic computing with Spiking Neural Networks (SNNs) offers energy-efficient alternatives, but training SNNs is challenging. Spyx, a new lightweight SNN simulation and optimization library designed in JAX, aims to facilitate SNN architecture investigation by bridging Python-based deep learning frameworks with custom compute kernels, achieving optimal hardware utilization.
🌀 VMware’s SynCode: Improving LLM Code Generation with Grammar Augmentation. SynCode is a novel framework for efficient syntactical decoding of code with large language models (LLMs). It leverages the grammar of a programming language through an efficient, offline-constructed lookup table called the Deterministic Finite Automaton (DFA) mask store. SynCode integrates with any language defined by a context-free grammar (CFG), reducing syntax errors by 96.07% when combined with LLMs.
🌀 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Chatbot Arena is an open platform designed to evaluate Large Language Models (LLMs) by considering human preferences. Utilizing a pairwise comparison method and crowdsourced input, it assesses LLMs' alignment with user preferences. The platform, operational for months with over 240K votes, provides a credible and valuable resource for ranking LLMs.
🌀 Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. The project aims to develop medical Large Language Models (LLMs) in the six most spoken languages, benefiting 6.1 billion people. This includes creating the ApolloCorpora multilingual medical dataset and the XMedBench benchmark, with Apollo models achieving top performance among models of similar sizes. The project will open-source training data, code, model weights, and evaluation benchmarks.

✨ On the Radar: Catch Up on What's Fresh

🌀 Top Artificial Intelligence (AI) Tools That Can Generate Code To Help Programmers (2024): The article discusses how AI is changing programming, with tools like OpenAI Codex and GitHub Copilot generating code. It explores AI's impact on code quality and development speed, showcasing various AI-powered tools like Tabnine, CodeT5, and PolyCoder. Additionally, it mentions AI tools for code review, static code analysis, and AI-assisted coding in IDEs like PyCharm and Visual Studio.
🌀 PyPI mirror in a private AWS environment with Terraform: This article explains how to install Python packages in an AWS SageMaker Studio environment without internet access. It covers setting up SageMaker in VPC Only mode, using VPC interface endpoints for network communications, and accessing the PyPI package repository through AWS CodeArtifact, which allows defining PyPI as an upstream repository.
🌀 Custom pre-commit hooks for safer code changes: This blog post explains the importance of using pre-commit hooks in software development, particularly with the git version control system. It discusses the challenges of maintaining coding standards in collaborative projects and provides a step-by-step tutorial on how to set up and use custom pre-commit hooks for a Python project, using the example of validating dataflow definitions for the Hamilton library.
🌀 AQLM Quantization Algorithm, explained: A new quantization algorithm, AQLM (Additive Quantization of Language Models), was recently released and integrated into HuggingFace Transformers and HuggingFace PEFT. AQLM sets a new state-of-the-art for 2-bit quantization while providing improvements for 3-bit and 4-bit ranges, pushing the boundaries of model accuracy and memory footprint.
🌀 Revolutionize Web Browsing with AI: This article explores creating an AI agent using the gpt-4-vision-preview model from OpenAI, enabling it to navigate the web like a human. It discusses the agent's browser control, content browsing, and decision-making processes, showcasing potential use cases such as aiding visually challenged users and automating web browsing tasks.
🌀 Understanding Tensors: Learning a Data Structure Through 3 Pesky Errors. This article discusses transitioning from managing tabular data to working with tensors in TensorFlow, offering debugging tips and code recipes. It covers visualizing TensorFlow datasets, understanding tensor specs, and augmenting model summaries, while addressing common errors related to tensor rank and shape.
🌀 Running RStudio Inside a Container: This tutorial focuses on setting up RStudio using Docker, particularly leveraging the Rocker RStudio image. It covers pulling the image, launching RStudio in a container, and ensuring persistence of data by using volume mapping. The tutorial provides step-by-step instructions and explanations for each stage.
🌀 PyTorch and MLX for Apple Silicon: The blog discusses Apple's MLX framework, which is optimized for Apple Silicon and serves as a bridge between PyTorch, NumPy, and JAX. It details a comparison between MLX and PyTorch through a custom convolutional neural network implementation for image classification tasks. The discussion includes insights into MLX's features, such as its array class, lazy computation, and compilation for performance optimization. The post also highlights the ease of converting PyTorch code to MLX, despite some differences in API compatibility and coding conventions.

See you next time!

Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support!

Arshad Ali, Bradley Schacht
14 Mar 2024
9 min read

Get Started with Fabric: Create Your Workspace & Reports

This article is an excerpt from the book, Learn Microsoft Fabric, by Arshad Ali and Bradley Schacht. Harness the power of Microsoft Fabric to develop data analytics solutions for various use cases guided by step-by-step instructions.

Introduction

Embark on a journey to harness the full potential of Microsoft Fabric within Power BI. This article walks you through the steps to create your first Fabric workspace and then introduces the Copilot capabilities available in Power BI: report page creation, narrative generation, and synonym generation for Q&A.

Creating your first Fabric-enabled workspace

Once you have confirmed that Fabric is enabled in your tenant and you have access to it, the next step is to create your Fabric workspace. You can think of a Fabric workspace as a logical container that will contain items such as lakehouses, warehouses, notebooks, and pipelines. Follow these steps to create your first Fabric workspace:

1. Sign into Power BI (https://app.powerbi.com/).

2. Select Workspaces | + New workspace:

Figure 2.5 – Creating a new workspace

3. Fill out the Create a workspace form, as follows:

   Name: Enter Learn Microsoft Fabric and some characters for uniqueness.

   Description: Optionally, enter a description for the workspace:

   Figure 2.6 – Create a workspace – details

   Advanced: Select Fabric capacity under License mode and then choose a capacity you have access to. If you do not have access to one, you can start a trial license, as described earlier, and use it here.

4. Select Apply. The workspace will be created and opened.

5. You can click on Workspaces again and then search for your workspace by typing its name in the search box. You can also pin the selected workspace so that it always appears at the top:

Figure 2.7 – Searching for a workspace

6. Clicking on the name of the workspace will open that workspace. A link to it will become available in the left-hand side navigation bar, allowing you to switch from one item to another quickly. Since we haven’t created anything yet, there is nothing here. You can click on +New to start creating Fabric items:

Figure 2.8 – Switching to a workspace

With a Microsoft Fabric workspace set up, let’s review the different workloads that are available.

Copilot in Power BI

Power BI has several key components, including data transformation and data modeling, culminating in a visual report that end users will consume. The Copilot experience is centered around the visual storytelling and reporting aspects of Power BI. This materializes in three ways: report page creation, narrative generation, and improving Q&A.

Let’s look at each of these Copilot capabilities.

Creating reports with the Power BI Copilot

The most common use for Copilot with Power BI is likely to be for creating reports. There are two features that come together to build reports. The first analyzes the dataset to suggest content for your report by using table relationships and column names, while the second one helps you create intuitive reports quickly. Figure 11.30 shows an example where Copilot has suggested several report pages, each with a short description of what would be displayed:

Figure 11.30 – The Power BI Copilot page suggestions

If you like the page suggestions, simply click on the Create button and the report page will appear.

While a suggested set of report content is a good starting point, analysts often have a specific need to meet. You can also have Copilot create a report from criteria you provide in a prompt. These can be as simple as “create a page that shows customer analysis” or more specific, such as “create a page to show the impact of each sales territory on profit and quantity sold.”

Figure 11.31 – Sales impact report created by Copilot

Once the report page is generated, Copilot cannot update the report, but you can interact with and modify the report as necessary. This is a great way to reduce the time to get started building reports.

There are a few other limitations to note. In addition to not being able to modify reports, Copilot will not let you specify visual types, apply filters, or change the report layout. All of these can be changed manually after the initial report generation. For example, users should not expect Copilot to filter results to a specific time period based on their prompt.

Next, let’s look at the smart narrative.

Creating a narrative using Copilot

Visuals are a wonderful way to tell a story and give users the ability to explore data on their own. However, sometimes a narrative that summarizes what is being displayed in a report can be useful. It can not only tell a story but also provide some additional context and information for users.

To get started, open a report and add a narrative visualization to the report as shown in Figure 11.32. You will see two options; click on Copilot. Choose the type of summary you wish to produce and optionally select specific pages or visuals to include in the summary. Then click on Create.

Figure 11.32 – The report narrative generated by Copilot

After the narrative is generated, always review it for accuracy and adjust the prompt, if necessary, to produce more accurate results. In addition to summaries, you can ask it to highlight key information, customize the order in which the data is described to help convey importance, specify data points to include in the summary, and even generate impact analysis showing how different factors affect metrics on the report.

Report, page, and visual narratives are a great way to guide users through a report, especially if there isn’t a subject matter expert there to explain all the data.

Finally, let’s look at using Copilot to improve the Q&A visual.

Generating synonyms with Copilot

The Q&A visual has been dazzling users for years at this point. It is impressive to build a model, walk into the room, and tell users that they can use natural language to query their data without needing to build any visuals. This may not be as impressive as the Copilot functionality that we have today, but it is still a very useful tool in your Power BI visualization toolbelt.

One important ingredient in the success of Q&A is the synonym. Synonyms are end-user-specific ways to reference data. For example, a table in the data model may be called Dim Person, but you know that some report consumers always refer to these as “users.” Therefore, you would create a synonym that tells Q&A that when someone asks about users, they are really talking about persons. This can also be done on a column level. A synonym for “postal code” could be “zip code,” while a synonym for an “item” could be “product” or “finished good.”

Q&A itself may not use Copilot, but Power BI Desktop can leverage Copilot to generate synonyms. This can be done when creating a new Q&A visual by clicking on Add synonyms from the ribbon with the label Improve Q&A with synonyms from Copilot. They can also be generated from the Q&A settings menu by adding Copilot as a source from the Suggestion settings list.

The more synonyms that can be used to describe your data, the more likely you are to produce quality Q&A results. It is important to double-check the synonyms generated by Copilot to ensure they line up with your specific business terminology.

With these Copilot experiences for Power BI, you will be able to generate report ideas, report pages and visuals, summaries, and narratives, and improve Q&A.

Conclusion

By creating a Fabric workspace in Power BI, you have laid the foundation for the rest of your Fabric work. With Copilot in Power BI, you can generate report pages, summarize reports with narratives, and improve Q&A results with synonyms. Keep exploring the other Fabric workloads to refine your Power BI experience further.

Author bio

Arshad Ali is a principal product manager at Microsoft, working on the Microsoft Fabric product team in Redmond, WA. He focuses on Spark Runtime, which empowers both data engineering and data science experiences. In his previous role, he helped strategic customers and partners adopt Azure Synapse and Microsoft Fabric. Arshad has more than 20 years of industry experience and has been with Microsoft for over 16 years. He is the co-author of the book Big Data Analytics with Azure HDInsight and the author of over 200 technical articles and blogs on data and analytics. Arshad holds an MBA from the Foster School of Business at the University of Washington and an MCA from India.

Bradley Schacht is a principal program manager on the Microsoft Fabric product team based in Saint Augustine, Florida. Bradley is a former consultant and trainer and has co-authored five books on SQL Server and Power BI. As a member of the Microsoft Fabric product team, Bradley works directly with customers to solve some of their most complex data problems and helps shape the future of Microsoft Fabric. Bradley gives back to the community by speaking at events, such as the PASS Summit, SQL Saturday, Code Camp, and user groups across the country, including locally at the Jacksonville SQL Server User Group (JSSUG). He is a contributor on SQLServerCentral and blogs on his personal site, BradleySchacht.

Bahaaldine Azarmi, Jeff Vestal
12 Mar 2024
12 min read

Enhancing Image Search with Vector Similarity

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!This article is an excerpt from the book, Vector Search for Practitioners with Elastic, by Bahaaldine Azarmi and Jeff Vestal. Optimize your search capabilities in Elastic by operationalizing and fine-tuning vector search and enhance your search relevance while improving overall search performanceIntroductionVector similarity search plays a crucial role in image search. After images are transformed into vectors, a search query (also represented as a vector) is compared against the database of image vectors to find the most similar matches. This process is known as k-Nearest Neighbor (kNN) search, where “k” represents the number of similar items to retrieve.Several algorithms can be used for kNN search, including brute-force search and more efficient methods such as the Hierarchical Navigable Small World (HNSW) algorithm (see Chapter 7, Next Generation of Observability Powered, by Vectors for a more in-depth discussion on HNSW). Bruteforce search involves comparing the query vector with every vector in the database, which can be computationally expensive for large databases. On the other hand, HNSW is an optimized algorithm that can quickly find the nearest neighbors in a large-scale database, making it particularly useful for vector similarity search in image search systems.The tangible benefits of image search are observed across industries. Its flexibility and adaptability make it a tool of choice for enhancing user experiences, ensuring digital security, or even revolutionizing digital content interactions.Image search in practiceApplications of image search are varied and far-reaching. In e-commerce, for example, reverse image search allows customers to upload a photo of a product and find similar items for sale. 
In the field of digital forensics, image search can be used to find visually similar images across a database to detect illicit content. It is also used in the realm of social media for face recognition, image tagging, and content recommendation.As we continue to generate and share more visual content, the need for effective and efficient image search technology will only grow. The combination of artificial intelligence, machine learning, and vector similarity search provides a powerful toolkit to meet this demand, powering a new generation of image search capabilities that can analyze and understand visual content.Traditionally, image search engines use text-based metadata associated with images, such as the image’s filename, alt text, and surrounding text context, to understand the content of an image. This approach, however, is limited by the accuracy and completeness of the metadata, and it fails to analyze the actual visual content of the image itself.Over time, with advancements in artificial intelligence and machine learning, more sophisticated methods of image search have been developed that can analyze the visual content of images directly. This technique, known as content-based image retrieval (CBIR), involves extracting feature vectors from images and using these vectors to find visually similar images.Feature vectors are a numerical representation of an image’s visual content. They are generated by applying a feature extraction algorithm to the image. The specifics of the feature extraction process can vary, but in general, it involves analyzing the image’s colors, textures, and shapes. In recent years, CNNs have become a popular tool for feature extraction due to their ability to capture complex patterns in image data.Once feature vectors have been extracted from a set of images, these vectors can be indexed in a database. 
When a new query image is submitted, its feature vector is compared to the indexed vectors, and the images with the most similar vectors are returned as the search results. The similarity between vectors is typically measured using distance metrics such as Euclidean distance or cosine similarity.Despite the impressive capabilities of CBIR systems, there are several challenges in implementing them. For instance, interpreting and understanding the semantic meaning of images is a complex task due to the subjective nature of visual perception. Furthermore, the high dimensionality of image data can make the search process computationally expensive, particularly for large databases.To address these challenges, approximate nearest neighbor (ANN) search algorithms, such as the HNSW graph, are often used to optimize the search process. These algorithms sacrifice a small amount of accuracy for a significant increase in search speed, making them a practical choice for large-scale image search applications.With the advent of Elasticsearch’s dense vector field type, it is now possible to index and search highdimensional vectors directly within an Elasticsearch cluster. This functionality, combined with an appropriate feature extraction model, provides a powerful toolset for building efficient and scalable image search systems.In the following sections, we will delve into the details of image feature extraction, vector indexing, and search techniques. We will also demonstrate how to implement an image search system using Elasticsearch and a pre-trained CNN model for feature extraction. The overarching goal is to provide a comprehensive guide for building and optimizing image search systems using state-of-the-art technology.Vector search with imagesVector search is a transformative feature of Elasticsearch and other vector stores that enables a method for performing searches within complex data types such as images. 
Through this approach, images are converted into vectors that can be indexed, searched, and compared against each other, revolutionizing the way we can retrieve and analyze image data. The same idea of producing embeddings applies to other media types as well. This section provides an in-depth overview of the vector search process with images, including image vectorization, vector indexing in Elasticsearch, kNN search, vector similarity metrics, and fine-tuning the kNN algorithm.

Image vectorization

The first phase of the vector search process involves transforming the image data into a vector, a process known as image vectorization. Deep learning models, specifically CNNs, are typically employed for this task. CNNs are designed to understand and capture the intricate features of an image, such as color distribution, shapes, textures, and patterns. By processing an image through layers of convolutional, pooling, and fully connected nodes, a CNN can represent an image as a high-dimensional vector. This vector encapsulates the key features of the image, serving as its numerical representation.

The output of a pre-trained CNN (often referred to as an embedding or feature vector) is typically used for this purpose. Each dimension in this vector represents some learned feature from the image. For instance, one dimension might correspond to the presence of a particular color or texture pattern. The values in the vector quantify the extent to which these features are present in the image.

Figure 1: Layers of a CNN model

As seen in the preceding diagram, these are the layers of a CNN model:
1. Accepts raw pixel values of the image as input.
2. Each layer extracts specific features such as edges, corners, textures, and so on.
3. Introduces non-linearity, learns from errors, and approximates more complex functions.
4. Reduces the dimensions of feature maps through down-sampling to decrease the computational complexity.
5. Consists of the weights and biases from the previous layers for the classification process to take place.
6. Outputs a probability distribution over classes.

Indexing image vectors in Elasticsearch

Once the image vectors have been obtained, the next step is to index these vectors in Elasticsearch for future searching. Elasticsearch provides a special field type, the dense_vector field, to handle the storage of these high-dimensional vectors.

A dense_vector field is defined as an array of numeric values, typically floating-point numbers, with a specified number of dimensions (dims). At the time of writing, the maximum number of dimensions allowed for indexed vectors is 2,048, though this may be further increased in the future. It's essential to note that each dense_vector field is single-valued, meaning that it is not possible to store multiple values in one such field.

In the context of image search, each image (now represented as a vector) is indexed into an Elasticsearch document. A document can carry a single vector or, using separate fields, several vectors. The vector representing the image is stored in a dense_vector field within the document.
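To make the dense_vector description concrete, here is a minimal sketch of what an index mapping for image vectors might look like, written as a plain Python dictionary. The field names filename and image_vector match the ones used in this article; the value of 512 dimensions and the choice of cosine similarity are assumptions for illustration and should be checked against the model you actually use.

```python
import json

# Assumption: 512 dimensions, a common output size for CLIP ViT-B/32 style
# models. Verify the embedding size of your own model before indexing.
VECTOR_DIMS = 512


def build_image_index_mapping(dims: int = VECTOR_DIMS) -> dict:
    """Return an index body with one dense_vector field and one metadata field."""
    return {
        "mappings": {
            "properties": {
                # Metadata stored alongside the vector in the same document
                "filename": {"type": "keyword"},
                # Single-valued vector field used for kNN search
                "image_vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",  # assumed metric for this sketch
                },
            }
        }
    }


mapping = build_image_index_mapping()
print(json.dumps(mapping, indent=2))
```

A dictionary like this would typically be supplied when the index is created through the Elasticsearch Python client; the exact call depends on the client version you have installed.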
Additionally, other relevant information or metadata about the image can be stored in other fields within the same document.

The full example code can be found in the Jupyter Notebook available in the chapter 5 folder of this book's GitHub repository at https://github.com/PacktPublishing/Vector-Search-for-Practitioners-with-Elastic/tree/main/chapter5, but we'll discuss the relevant parts here.

First, we will initialize a pre-trained model using the SentenceTransformer library. The clip-ViT-B-32-multilingual-v1 model is discussed in detail later in this chapter:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

Next, we will prepare the image transformation function:

    from torchvision import transforms

    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        lambda image: image.convert("RGB"),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

transforms.Compose() combines all the following transformations:
• transforms.Resize(224): Resizes the shorter side of the image to 224 pixels while maintaining the aspect ratio.
• transforms.CenterCrop(224): Crops the center of the image so that the resultant image has dimensions of 224x224 pixels.
• lambda image: image.convert("RGB"): Converts the image to the RGB format. This is useful for grayscale images or images with an alpha channel, as deep learning models typically expect RGB inputs.
• transforms.ToTensor(): Converts the image (in the PIL image format) into a PyTorch tensor. This changes the data from a range of [0, 255] in the PIL image format to a float in the range [0.0, 1.0].
• transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)): Normalizes the tensor image with a given mean and standard deviation for each channel. In this case, the mean and standard deviation for all three channels (R, G, B) are 0.5. This normalization transforms the data range from [0.0, 1.0] to [-1.0, 1.0].

We can use the following code to apply the transform to an image file and then generate an image vector using the model. See the Python notebook for this chapter to run against actual image files:

    from PIL import Image

    img = Image.open("image_file.jpg")
    image = transform(img).unsqueeze(0)
    image_vector = model.encode(image)

The vector and other associated data can then be indexed into Elasticsearch for use with kNN search:

    # Create document
    document = {'_index': index_name,
                '_source': {"filename": filename,
                            "image_vector": vector  # snippet is cut off here in this excerpt
                            }}

See the complete code in the chapter 5 folder of this book's GitHub repository.

With vectors generated and indexed into Elasticsearch, we can move on to searching for similar images.

k-Nearest Neighbor (kNN) search

With the vectors now indexed in Elasticsearch, the next step is to make use of kNN search. You can refer back to Chapter 2, Getting Started with Vector Search in Elastic, for a full discussion on kNN and HNSW search.

As with text-based vector search, when performing vector search with images, we first need to convert our query image to a vector.
The process is the same as we used to convert images to vectors at index time.

We convert the image to a vector and include that vector in the query_vector parameter of the knn search function:

    knn = {
        "field": "image_vector",
        "query_vector": search_image_vector[0],
        "k": 1,
        "num_candidates": 10
    }

Here, we specify the following:
• field: The field in the index that contains the vector representations of the images we are searching against
• query_vector: The vector representation of our query image
• k: The number of closest images to return; here, we want only one
• num_candidates: The number of approximate nearest neighbor candidates to consider on each shard

With an understanding of how to convert an image to a vector representation and perform an approximate nearest neighbor search, let's discuss some of the challenges.

Challenges and limitations with image search

While vector search with images offers powerful capabilities for image retrieval, it also comes with certain challenges and limitations. One of the main challenges is the high dimensionality of image vectors, which can lead to computational inefficiencies and difficulties in visualizing and interpreting the data.

Additionally, while pre-trained models for feature extraction can capture a wide range of features, they may not always align with the specific features that are relevant to a particular use case. This can lead to suboptimal search results. One potential solution, not limited to image search, is to use transfer learning to fine-tune the feature extraction model on a specific task, although this requires additional data and computational resources.

Conclusion

In conclusion, vector similarity search revolutionizes image retrieval by harnessing advanced algorithms and machine learning. From e-commerce to digital forensics, its impact is profound, enhancing user experiences and content discovery. Leveraging techniques like k-Nearest Neighbor search and Elasticsearch's dense vector field, image search becomes more efficient and scalable. Despite challenges, such as high dimensionality and feature alignment, ongoing advancements promise even greater insights into visual data. As technology evolves, so does our ability to navigate and understand the vast landscape of images, ensuring a future of enhanced digital interactions and insights.

Author Bio

Bahaaldine Azarmi, Global VP Customer Engineering at Elastic, guides companies as they leverage data architecture, distributed systems, machine learning, and generative AI. He leads the customer engineering team, focusing on cloud consumption, and is passionate about sharing knowledge to build and inspire a community skilled in AI.

Jeff Vestal has a rich background spanning over a decade in financial trading firms and extensive experience with Elasticsearch. He offers a unique blend of operational acumen, engineering skills, and machine learning expertise. As a Principal Customer Enterprise Architect, he excels at crafting innovative solutions, leveraging Elasticsearch's advanced search capabilities, machine learning features, and generative AI integrations, adeptly guiding users to transform complex data challenges into actionable insights.
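As a companion to the similarity metrics and kNN search described in this article, the following sketch computes Euclidean distance and cosine similarity by hand and runs an exact (brute-force) nearest neighbor search over a tiny in-memory set of vectors. This is not how Elasticsearch executes a kNN query, which relies on an approximate HNSW index; it only illustrates what "closest vector" means. The three-dimensional vectors and file names are made up for illustration.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, indexed, k=1):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in indexed.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy "index"; real image embeddings have hundreds of dimensions.
indexed_vectors = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

top_matches = exact_knn(query_vector, indexed_vectors, k=2)
for name, score in top_matches:
    print(f"{name}: {score:.4f}")
```

An exact search like this compares the query against every stored vector, which is why approximate methods become attractive as the number of images grows.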

Gus Frazer
04 Mar 2024
9 min read

Streamlining Insights with Microsoft Copilot

This article is an excerpt from the book, Data Cleaning with Power BI, by Gus Frazer. Unlock the full potential of your data by mastering the art of cleaning, preparing, and transforming data with Power BI for smarter insights and data visualizations.

Introduction

For those who have never heard of Microsoft Copilot, it is a new technology that Microsoft has released across a number of its platforms that combines generative AI with your data to enhance productivity. Copilot for Power BI harnesses cutting-edge generative AI alongside your dataset, revolutionizing the process of uncovering and disseminating insights with unprecedented speed. Seamlessly integrated into your workflow, Copilot offers an array of functionalities aimed at streamlining your reporting experience.

When it comes to report creation, Copilot streamlines the process by allowing users to generate reports by describing the insights they seek or asking questions about their dataset in natural language. Copilot then analyzes the data, pulling together relevant information to craft visually striking reports, thereby transforming raw data into actionable insights. Moreover, Copilot can read your data and suggest the best position to begin your analysis, which can then be tailored to suit the direction you want to take the analysis in.

This is great, but how can it help you clean and prepare data for analysis? Well, Copilot can be leveraged on multiple data tools from within the Microsoft Fabric platform. For those who are not aware, Power BI has now become part of the Fabric platform. Depending on what type of license you have for Power BI, you might already have access to this.
Any customers with Premium capacity licensing for the Power BI service would have automatically been given access to Microsoft Fabric, and more importantly, Copilot.

That being said, currently, Copilot has only been made available to customers with a P1 (or above) Premium capacity or a Fabric license of F64 (or above), which is the equivalent licensing available directly from the Azure portal.

If you would like to follow along with the next example, you will need to set up a Fabric capacity within your Azure portal. Don't worry, you can pause this service when it's not being used to ensure you are only charged for the time you're using it. Alternatively, follow the steps to see the outcome:
1. Log in to the Azure portal that you set up in the previous section of this chapter.
2. Select the search bar at the top of the page and type in Microsoft Fabric. Select the service in the menu that appears below the search bar, which should take you to the page where you can manage your capacities.
3. Select Create a Fabric capacity. Note that you will need to use an organizational account in order to create a Fabric capacity as opposed to a personal account. You can sign up for a Microsoft Fabric trial for your organization within the window. Further details on how to do this are provided here: https://learn.microsoft.com/en-us/power-bi/enterprise/service-admin-signing-up-for-power-bi-with-a-new-office-365-trial.
4. Select the subscription and resource group you would like to use for this Fabric capacity.
5. Then, under capacity details, you can enter your capacity name. In this example, you can call it cleaningdata.
6. The Region field should populate with the region of your tenant, but you can change this if you like. However, this may have implications on performance, which it should warn you about with a message.
7. Set the capacity to F64.
8. Then, select Review + create.
9. Review the terms and then click on Create, which will begin the deployment of your capacity.
10. Once deployed, select Go to resource to view your Fabric capacity. Take note that this will be active once deployed. Make sure to return here after testing to pause or delete your Fabric capacity to prevent yourself from getting charged for this service.

Now you will need to ensure you have activated the Copilot settings from within your Fabric capacity. To do this, go to https://app.powerbi.com/admin-portal/ to log in and access the admin portal.

Important tip
If you can't see the Tenant settings tab, then you will need to ensure you have been set up as an admin within your Microsoft 365 admin center. If you have just created a new account, then you will need to set this up. Follow the next links to assign roles:
• https://learn.microsoft.com/en-us/microsoft-365/admin/add-users/assign-admin-roles
• https://learn.microsoft.com/en-us/fabric/admin/microsoft-fabric-admin

11. Scroll to the bottom of Tenant settings until you see the Copilot and Azure OpenAI service (preview) section as shown:

Figure – The tenant settings from within Power BI

12. Ensure both settings are set to Enabled and then click on Apply.

Now that you have created your Fabric capacity, let's jump into an example of how we can use Copilot to help with the cleaning of data. As we have created a new capacity, you will have to create a new workspace that uses this new capacity:
1. Navigate back to Workspaces using the left navigation bar. Then, select New Workspace.
2. Name your workspace CleaningData(Copilot), then select the dropdown for advanced configuration settings.
3. Ensure you have selected Fabric capacity in the license mode, which in turn will have selected your capacity below, and then select Apply. You have now created your workspace!
4. Now let's use Fabric to create a new dataflow using the latest update of Dataflow Gen2. Select New from within the workspace and then select More options.
5. This will navigate you to a page with all the possible actions to create items within your Fabric workspace.
Under Data Factory, select Dataflow Gen2.
6. This will load a Dataflow Gen2 instance called Dataflow 1. On the top row, you should now see the Copilot logo within the Home ribbon as highlighted:

Figure – The ribbon within a Dataflow Gen2 instance

7. Select Copilot to open the Copilot window on the right-hand side of the page. As you have not connected to any data, it will prompt you to select Get data.
8. Select Text/CSV and then enter the following into the File path or URL box:
    https://raw.githubusercontent.com/PacktPublishing/Data-Cleaning-with-Power-BI/main/Retail%20Store%20Sales%20Data.csv
9. Leave the rest of the settings as their defaults and click on Next.
10. This will then open a preview of the file data. Click on Create to load this data into your Dataflow Gen2 instance. You will see that the Copilot window has now changed to prompt you as to what you would like to do (if it hasn't, then simply close the Copilot window and reopen it):

Figure – Data loaded into Dataflow Gen2

11. In this example, we can see that the data includes a column called Order Date, but we don't have a field for the fiscal year. Enter the following prompt to ask Copilot to help with the transformation:
    There's a column in the data named Order Date, which shows when an order was placed. However, I need to create a new column from this that shows the Fiscal Year. Can you extract the year from the date and call this Fiscal Year? Set this new column to type number also.
12. Proceed using the arrow key or press Enter. Copilot will then begin working on your request. As you will see in the resulting output, the model has added a function (or step) called Custom to the query that we had selected.
13. Scroll to the far side and you will see that this has added a new column called Fiscal Year.
14. Now add the following prompt to narrow down our data and press Enter:
    Can you now remove all columns leaving me with just Order ID, Order Date, Fiscal year, category, and Sales?
15. This will then add another function or step called Choose columns. Finally, add the following prompt to aggregate this data and press Enter:
    Can you now group this data by Category, Fiscal year, and aggregated by Sum of Sales?

As you can see, Copilot has now added another function called Custom 1 to the applied steps in this query, resulting in this table:

Figure – The results from asking Copilot to transform the data

To view the M query that Copilot has added, select Advanced editor, which will show the functions that Copilot has added for you:

Figure – The resulting M query created by Copilot to carry out the requested transformations to clean the data

In this example, you explored the new technologies available with Copilot and how they help to transform the data using tools such as Dataflow Gen2.

While it's great to understand the amazing possibilities AI brings to data, it's also crucially important that you understand the challenges it presents.

Conclusion

In conclusion, Microsoft Copilot offers a groundbreaking approach to enhancing productivity and efficiency in data analysis and report generation within Power BI. By seamlessly integrating generative AI technology, Copilot revolutionizes the way insights are discovered and data is prepared, providing users with unprecedented speed and accuracy. Whether streamlining report creation or optimizing data management tasks, Copilot empowers users to unlock the full potential of their data, paving the way for more informed decision-making and actionable insights.

Author Bio

Gus Frazer is a seasoned Analytics Consultant focused on Business Intelligence solutions. With over 7 years of experience working for the two market-leading platforms, Power BI & Tableau, he has amassed a wealth of knowledge and expertise. Gus has helped hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and most important of all, cleanse data ready for analysis. Most recently, he helped to set up, organize, and run the Power BI UK community at Microsoft. He holds 6 Azure and Power BI certifications, including the PL-300 and DP-500 certifications. In this book, Gus offers readers invaluable guidance on ingesting, preparing, and cleansing data for analysis in Power BI.
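For readers who want to sanity-check what the three Copilot prompts in this walkthrough are meant to produce, the following sketch reproduces the same logic in plain Python on a handful of made-up rows: derive a Fiscal Year from Order Date, keep only the requested columns, and sum Sales by Category and Fiscal Year. The sample rows, the ISO date format, and the treatment of fiscal year as the calendar year of the order date are assumptions for illustration; this is not the M code that Copilot generates.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; column names follow those mentioned in the prompts.
orders = [
    {"Order ID": "A-1", "Order Date": "2021-03-14", "Category": "Furniture", "Sales": 120.0},
    {"Order ID": "A-2", "Order Date": "2021-11-02", "Category": "Furniture", "Sales": 80.0},
    {"Order ID": "A-3", "Order Date": "2022-01-20", "Category": "Furniture", "Sales": 50.0},
    {"Order ID": "A-4", "Order Date": "2022-05-05", "Category": "Technology", "Sales": 300.0},
]


def add_fiscal_year(row, date_format="%Y-%m-%d"):
    """Prompt 1: extract the year from Order Date as a number (assumed ISO dates)."""
    year = datetime.strptime(row["Order Date"], date_format).year
    return {**row, "Fiscal Year": year}


def keep_columns(row, columns):
    """Prompt 2: drop every column except the ones listed."""
    return {name: row[name] for name in columns}


def sum_sales_by_category_and_year(rows):
    """Prompt 3: group by Category and Fiscal Year, aggregating Sales by sum."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["Category"], row["Fiscal Year"])] += row["Sales"]
    return dict(totals)


wanted = ["Order ID", "Order Date", "Fiscal Year", "Category", "Sales"]
prepared = [keep_columns(add_fiscal_year(row), wanted) for row in orders]
totals = sum_sales_by_category_and_year(prepared)
print(totals)
```

Comparing a small hand-checked result like this against the table Copilot returns is one way to confirm that the generated steps did what the prompts asked.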
Gus Frazer
27 Feb 2024
10 min read

Revolutionize Power BI Queries with OpenAI

This article is an excerpt from the book, Data Cleaning with Power BI, by Gus Frazer. Unlock the full potential of your data by mastering the art of cleaning, preparing, and transforming data with Power BI for smarter insights and data visualizations.

Introduction

Discover the transformative potential of leveraging Azure OpenAI, integrated with ChatGPT functionality, to enhance Power BI's M query capabilities. In this article, we delve into how this powerful combination offers expert guidance, efficient solutions, and insightful recommendations for optimizing data transformation tasks. From generating M queries to streamlining complex transformations, explore how Azure OpenAI with ChatGPT empowers users to boost productivity and efficiency in Power BI.

Using OpenAI for M queries

Azure OpenAI, with ChatGPT functionality within it, can be a helpful tool for generating M queries in Power BI by providing suggestions, helping with syntax, and offering insights into data transformation tasks. In the following example, you will learn how you can leverage the chat playground within OpenAI to improve your productivity and efficiency when writing M queries. We will do this by asking a series of questions directly within Azure OpenAI.

Complete the next steps to follow along with the example in your own environment:
1. Click on Deployment on the left-hand side and then select Create new deployment to get started.
2. Select a model from the base models, in this case, select gpt-35-turbo, and then name your deployment. In this example, name it CleaningDataOpenAI.
3. Select Chat playground from the Azure OpenAI Studio home screen or from the Playground tab on the left of your screen. This should open the following view:

Figure 15.4 – Screenshot of the chat playground from within Azure OpenAI

Now that you have set up an instance of OpenAI, here's an example conversation taken from OpenAI, where the chat functionality has been used to assist with writing M queries in Power BI:

User: Hi, I'm working on a Power BI project, and I need to transform some data using M queries. I have a table with sales data, and I want to calculate the total sales for each product category. Can you help me with the M query for that?

ChatGPT: Of course! To calculate the total sales for each product category, you can use the Group By transformation in Power BI's Power Query Editor. Here's a sample M snippet for your scenario:

    let
        Source = YourDataSource,
        GroupedTable = Table.Group(Source, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}})
    in
        GroupedTable

Replace YourDataSource with the actual name of your data source. This code groups the data by the ProductCategory column and calculates the total sales for each group.

User: Thanks! That's a great starting point. What if I want to filter out products with sales less than $1,000 before calculating the total sales?

ChatGPT: You can add a filtering step before the Group By transformation. Here's an updated version of the M:

    let
        Source = YourDataSource,
        FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
        GroupedTable = Table.Group(FilteredTable, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}})
    in
        GroupedTable

This code filters out rows where sales are less than $1,000 before grouping the data by product category and calculating the total sales.

User: Perfect! This is really helpful. One more thing – what if I want to sort the results in descending order based on total sales?

ChatGPT: You can add a sorting step using the Table.Sort function. Here's the modified M:

    let
        Source = YourDataSource,
        FilteredTable = Table.SelectRows(Source, each [Sales] >= 1000),
        GroupedTable = Table.Group(FilteredTable, {"ProductCategory"}, {{"TotalSales", each List.Sum([Sales]), type number}}),
        SortedTable = Table.Sort(GroupedTable, {{"TotalSales", Order.Descending}})
    in
        SortedTable

This code sorts the grouped table in descending order based on the "TotalSales" column.

This interaction represents a simple example of how those who can leverage the power of tools such as OpenAI and ChatGPT will be able to quickly upskill in areas such as coding. It has to be said, though, that while this is still in its infancy, it's important to always test and validate the answers provided before implementing them in production. Also, ensure that you take precautions when using the publicly available ChatGPT model to avoid sharing sensitive data publicly. If you would like to use sensitive data or you want to ensure that requests are handled within a secured, governed environment, make sure to use the ChatGPT model within your own Azure OpenAI instance.

In more complex examples, optimizing Power Query transformations could involve efficient interaction with Azure OpenAI. This includes streamlining API calls, managing large datasets, and incorporating caching mechanisms for repetitive queries, ensuring a seamless and performant data cleaning process.

As we begin to explore the use cases where this technology can be most effective, there are a number of clear early winners:

• Optimizing query plans: ChatGPT's natural language understanding can assist in formulating more efficient Power Query plans. By describing the desired transformations in natural language, users can interact with ChatGPT to generate optimized query plans. This involves selecting the most suitable Power Query functions and structuring transformations for performance gains.

• Caching strategies for repetitive queries: ChatGPT can guide users in devising effective caching strategies.
By understanding the context of data transformations, it can recommend where to implement caching mechanisms to store and reuse intermediate results, minimizing redundant API calls and computations. The following is an example of just this, where I have asked Azure OpenAI to verify and optimize my query from the Power Query Advanced Editor. The model suggested I use the Table.Buffer function to help cache the table in memory and optimize the query.

Figure – An example request to OpenAI to help optimize my query for Power Query

Figure – An example response from OpenAI to help optimize my query for Power Query

Now, as we highlighted in Chapter 11, M Query Optimization, Table.Buffer can indeed improve the performance of your queries and refreshes, but this really depends on the data you are working with. In the previous example, the model doesn't take the characteristics, size, or complexity of your data into consideration as it isn't plugged into your data at this stage. Also, linking back to the example you walked through in Chapter 11, where you place Table.Buffer can really impact how your query performs. In the previous example, if you were connecting to a small dataset, you would likely cause it to run slower by adding the Table.Buffer function as the second variable in the query.

Lastly, it's worth mentioning that how you prompt these models is crucially important. In the previous example, we didn't specify what type of data source we were using in our query. As such, the model hasn't pointed out that using Table.Buffer on a data source supporting query folding will cause it to break the fold. Again, this is not so much of a problem if Table.Buffer is placed at the end of your query for smaller datasets, but it is a problem if you add it nearer to the beginning of the query, like in the previous example.

• Handling large datasets: Dealing with large datasets often poses a challenge in Power Query. OpenAI models, including ChatGPT, can provide insights into dividing and conquering large datasets. This includes strategies for parallel processing, filtering data early in the transformation pipeline, and using aggregations to reduce computational load.

• Dynamic query adjustments: ChatGPT's interactive nature allows users to dynamically adjust queries based on evolving requirements. It can assist in crafting queries that adapt to changing data scenarios, ensuring that Power Query transformations remain flexible and responsive to varied datasets.

• Guidance on complex transformations: Power Query often involves intricate transformations. ChatGPT can act as a virtual assistant, guiding users through the process of complex transformations. It can suggest optimal function compositions, advise on conditional logic placement, and assist in structuring transformations to enhance efficiency. The best example of this can be seen in the following two screenshots of an active use case seen in many businesses. The example begins with a user asking the model for a description of what the query is doing. OpenAI then provides a breakdown of what the query is doing in each step to help the user interpret the code. It helps to break down the barriers to coding and also helps to decipher code that has not been documented well by previous employees.

Figure – An example request to OpenAI to help translate my query

Figure – An example response from OpenAI to help describe my query

• Error handling strategies: Optimizing Power Query also entails robust error handling. ChatGPT can provide recommendations for anticipating and handling errors gracefully within a query. This includes strategies for logging errors, implementing fallback mechanisms, and ensuring the stability of the overall data preparation process.

In this section, you learned how to optimize Power Query transformations with Azure OpenAI efficiently. Key takeaways include using ChatGPT for natural-language-based query planning and effective caching strategies. Insights include handling large datasets through parallel processing, early filtering, and aggregations. This knowledge equips you to streamline and enhance your Power Query processes effectively.

In the next section, you will learn about Microsoft Copilot, how to set up a Power BI instance with Copilot activated, and also how you can use this new AI technology to help clean and prepare your data.

Conclusion

In conclusion, Azure OpenAI with ChatGPT presents a game-changing solution for maximizing Power BI's potential. From query optimization to error-handling strategies, this integration streamlines processes and enhances productivity. As users navigate complex data transformations, the guidance provided fosters efficient decision-making and empowers users to tackle challenges with confidence. With Azure OpenAI and ChatGPT, the possibilities for revolutionizing Power BI workflows are endless, offering a glimpse into the future of data transformation and analytics.

Author Bio

Gus Frazer is a seasoned Analytics Consultant focused on Business Intelligence solutions. With over 7 years of experience working for the two market-leading platforms, Power BI & Tableau, he has amassed a wealth of knowledge and expertise. Gus has helped hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and most important of all, cleanse data ready for analysis. Most recently, he helped to set up, organize, and run the Power BI UK community at Microsoft. He holds 6 Azure and Power BI certifications, including the PL-300 and DP-500 certifications. In this book, Gus offers readers invaluable guidance on ingesting, preparing, and cleansing data for analysis in Power BI.
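Since this article stresses testing and validating generated M before using it in production, here is one possible way to do that: reproduce the filter, group, and sort logic from the example conversation in a few lines of plain Python and compare the result with what Power Query returns for the same small sample. The sample rows are made up, and the function name total_sales_report is hypothetical; only the column names ProductCategory, Sales, and TotalSales come from the conversation.

```python
from collections import defaultdict


def total_sales_report(rows, min_sales=1000):
    """Mirror the generated M query on a list of row dictionaries."""
    totals = defaultdict(float)
    for row in rows:
        # Same condition as Table.SelectRows(Source, each [Sales] >= 1000)
        if row["Sales"] >= min_sales:
            # Same aggregation as Table.Group with List.Sum([Sales])
            totals[row["ProductCategory"]] += row["Sales"]
    # Same ordering as Table.Sort on TotalSales with Order.Descending
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample small enough to verify by hand.
sample_rows = [
    {"ProductCategory": "Bikes", "Sales": 1500},
    {"ProductCategory": "Bikes", "Sales": 400},
    {"ProductCategory": "Helmets", "Sales": 2500},
    {"ProductCategory": "Helmets", "Sales": 1000},
    {"ProductCategory": "Gloves", "Sales": 999},
]

report = total_sales_report(sample_rows)
for category, total in report:
    print(category, total)
```

Working through a sample like this also surfaces a subtlety in the generated query: the filter removes individual rows with sales under $1,000 rather than whole products whose combined sales fall under that amount, which may or may not be what the original question intended.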

Luca Zanna
23 Feb 2024
7 min read

Setting Up Polars for Data Analysis

This article is an excerpt from the book, Data Analysis with Polars, by Luca Zanna. Leverage Polars, the lightning-fast dataframe library, to take your Python data analysis skills to the next level.

Introduction

In the ever-evolving landscape of data analysis, harnessing the right tools and methodologies can make all the difference. Welcome to a world where Polars, a powerful data manipulation library, takes center stage. This article is your gateway to unlocking the potential of Polars, and it begins by unraveling the essential components of the data analysis journey. From setting up virtual environments to simplifying data analysis in the cloud with Google Colab, we explore how Polars streamlines your path to insights. Whether you're a seasoned data analyst or just starting your journey, this guide will equip you with the knowledge and tools needed to make your data analysis endeavors efficient and rewarding. Join us as we delve into the fascinating realm of Polars and embrace a new era of data exploration.

Installation and virtual environments

We will not go through the installation of Python as that is outside the scope of the book. A visit to python.org will give all the information necessary to install Python. Now on to virtual environments.

Understanding Virtual Environments and Their Benefits

Imagine you have built a fantastic data analysis project using Polars. Your project uses:
• Python 3.8
• Polars version 0.15.1
• NumPy 1.23.0

Now, you start a new project, and you want to use a newer Polars (0.16.14), along with NumPy and Arrow. So, the new project requires:
• Python 3.10
• Polars 0.16.14
• NumPy 1.24.0
• PyArrow 11.0.0

Upgrading the Polars and NumPy libraries globally isn't a good idea.
If Polars functions have changed between versions, your first project might stop working or give incorrect results with the new version. This is where virtual environments come in. Virtual environments create separate 'spaces' for each project: one for your first data analysis project and another for your new data pipeline project. You can set up a virtual environment manually or have your IDE set-up a virtual environment for you. If you decide to set it up manually, you can check out the guide at https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment. Installing and using Polars on a machine To install Polars, first make sure you are in a virtual environment. Then, type: pip install polars If you already have Polars installed and want to upgrade it, type: pip install polars --upgrade In the book we will use other libraries, including numpy, pandas, matplotlib. You can install them with the syntax above, and you can also install multiple libraries at the same tine:  pip install numpy pandas matplotlib Let’s now get our development environment set-up. We will use Visual Studio code, but you are free to use any other IDE that you like. 1. Type code . in the command line to open Visual Studio Code. 2. Right-click on the left, choose New File, and create first_dataframe.ipynb.                                                                    Figure – Creating a new file in Visual Studio Code Files with extension .ipynb are Jupyter Notebook files, which are great for data analysis. To work with these files you need to install the Jupyter extension on VS Code. You can do that by clicking on ‘Extensions’ on the left bar, searching for Jupyter, installing it, and activating it.  Figure – Install Jupyter extension in Visual Studio Code 3. Now back to our file. The first thing to ensure is that we are using Python from our virtual environment. 
Click on Select Kernel at the top right, then click on the Python that starts with env/: that will be the Python for our virtual environment. Avoid the paths starting with /usr and /bin, as those belong to the system Python rather than our virtual environment.

Figure – Select the Python interpreter in Visual Studio Code

Now, we're ready for Polars.

4. Type import polars as pl in the first cell and press Shift + Enter to run it.

5. Create a dataframe in the next cell by typing:

    df = pl.DataFrame({
        'a': ['Hello', 'World!']
    })

6. Press Shift + Enter to run the cell. This creates a dataframe called df with one column named 'a' and two rows: 'Hello' and 'World!'. To see the dataframe, type df in the next cell and run it.

Figure – Visual Studio Code with first Polars dataframe

We created our first Polars dataframe.

Using Polars on the cloud with Google Colab

Instead of installing Polars on your computer, you can also use it in the cloud. One popular cloud service for running code is Google Colab. This way, you don't need to install anything on your machine.

To access Google Colab, visit https://colab.research.google.com/ in your web browser. Click on "New Notebook," and you'll see a page that looks similar to VS Code.

Now, let's create the same Polars dataframe example in Google Colab:

1. In the first cell, type the following command to ensure we have the latest version of Polars:

    %pip install polars --upgrade

2. Next, enter this code to import Polars and create a dataframe:

    import polars as pl

    df = pl.DataFrame({
        'a': ['Hello', 'World!']
    })

3. Finally, display the dataframe by typing:

    df

And that's it! You now have your first Polars dataframe in Google Colab.

Figure – Google Colab with first Polars dataframe

Conclusion

In closing, Polars offers a bridge to the future of data analysis.
With the knowledge and hands-on experience gained from this article, you're well-prepared to conquer the intricacies of data manipulation and visualization. The ability to effortlessly create, manipulate, and analyze data using Polars is a powerful tool in your arsenal. Whether you're a data enthusiast or a seasoned analyst, embracing Polars sets you on a path toward efficiency, precision, and data-driven success. As the data landscape continues to evolve, you're now equipped to stay ahead, make informed decisions, and revolutionize your approach to data exploration.

Author Bio

Luca Zanna is a Data Engineer and Data Analyst with over 15 years of experience. He started his career as a financial data analyst after a Master's in Management and passing the Certified Public Accountant (CPA) exam. Luca spent a decade working on financial analysis systems at L’Oréal: developing the systems and training financial analysts across Europe and Asia.

Currently, Luca helps companies with building data infrastructure to better leverage their data. Luca is also a corporate teacher for topics such as data analysis, SQL, Python, and cloud data engineering.
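
As a companion to the Polars article's advice to "first make sure you are in a virtual environment" before running pip, here is a minimal sketch of how that check could be done from Python itself. The helper names in_virtual_environment and pip_install_command are hypothetical and not part of the book; the sketch only builds the pip command and does not run it.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def pip_install_command(*packages: str, upgrade: bool = False) -> list[str]:
    """Build a pip command that targets the interpreter running this script.

    Hypothetical helper: using `python -m pip` through sys.executable keeps
    the install inside whichever environment is currently active.
    """
    command = [sys.executable, "-m", "pip", "install", *packages]
    if upgrade:
        command.append("--upgrade")
    return command


if __name__ == "__main__":
    if not in_virtual_environment():
        print("Warning: no virtual environment is active; packages would install globally.")
    # Show the command instead of running it, so the sketch has no side effects.
    print(" ".join(pip_install_command("polars", upgrade=True)))
```

Passing the returned list to subprocess.run(command, check=True) would perform the install. Building the command from sys.executable is one way to avoid the situation the article warns about, where a package lands in the system Python instead of the project's environment.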

Prakhar Mishra
21 Feb 2024
8 min read

Enhancing Data Quality with Cleanlab

Introduction

It is a well-established fact that your machine-learning model is only as good as the data it is fed. An ML model trained on bad-quality data usually has a number of issues. Here are a few ways that bad data might affect machine-learning models:

1. Wrong predictions may result from errors, missing values, or other irregularities in low-quality data. The model's predictions are likely to be inaccurate if the data used for training is unreliable.

2. Bad data can also bias the model. If the data is not representative of real-world situations, the ML model can learn and reinforce these biases, which can result in discriminatory predictions.

3. Poor data also limits the ability of an ML model to generalize to fresh data, because it may not effectively depict the underlying patterns and relationships.

4. Models trained on bad-quality data might need more retraining and maintenance. The overall cost and complexity of model deployment could rise as a result.

As a result, it is critical to devote time and effort to data preprocessing and cleaning in order to decrease the impact of bad data on ML models. Furthermore, to ensure the model's dependability and performance, it is often necessary to use domain knowledge to recognize and address data quality issues.

It might come as a surprise, but gold-standard datasets like ImageNet, CIFAR, MNIST, 20News, and more also contain labeling issues. Here are some examples for reference.

The first example is from the Amazon sentiment review dataset, where the original label was Neutral in both cases, whereas Cleanlab and Mechanical Turk labeled them positive (which is correct).

The second example is from the MNIST dataset, where the original labels were 8 and 0 respectively, which is incorrect.
Instead, both Cleanlab and Mechanical Turk labeled them 9 and 6 (which is correct).

Feel free to check out labelerrors to explore more such cases in similar datasets.

Introducing Cleanlab

This is where Cleanlab can come in handy. It automatically identifies problems in your ML dataset and assists you in cleaning both data and labels. This data-centric AI software uses your existing models to estimate dataset problems that can be fixed to train even better models. The graphic below depicts the typical data-centric AI model development cycle:

Apart from the standard way of coding all the way through finding data issues, it also offers Cleanlab Studio, a no-code platform for fixing all your data errors. For the purpose of this blog, we will go the former way on our sample use case.

Getting Hands-on with Cleanlab

Installation

Installing cleanlab is as easy as doing a pip install. I recommend installing the optional dependencies as well, since you never know what you will need and when. I also installed sentence transformers, as I will be using them for vectorizing the text. Sentence transformers come with many models; we use 'all-mpnet-base-v2' for vectorizing text sequences. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Feel free to check out the list of all models and their comparisons.

    pip install 'cleanlab[all]'
    pip install sentence-transformers

Dataset

We picked the SMS Spam Detection dataset for the experimentation. It is a public set of labeled SMS messages that have been collected for mobile phone spam research, with roughly 5.5k instances in total. The graphic below gives a sneak peek of some of the samples from the dataset.

Data Preview

Code

Let’s now delve into the code.
For demonstration purposes, we inject 5% noise into the dataset and see whether we are able to detect it and eventually train a better model.

Note: I have also annotated every segment of the code wherever necessary for better understanding.

    import pandas as pd
    from sklearn.model_selection import train_test_split, cross_val_predict
    from sklearn.preprocessing import LabelEncoder
    from sklearn.linear_model import LogisticRegression
    from sentence_transformers import SentenceTransformer
    from cleanlab.classification import CleanLearning
    from sklearn.metrics import f1_score

    # Reading and renaming data. Here we set sep='\t' because the data is tab separated.
    # header=None keeps the first row as data and gives the columns the integer
    # names 0 and 1 that the rename below expects.
    data = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
    data.rename({0: 'label', 1: 'text'}, inplace=True, axis=1)

    # Dropping any instance of duplicates that could exist
    data.drop_duplicates(subset=['text'], keep=False, inplace=True)

    # Original data distribution for spam and not spam (ham) categories
    print(data['label'].value_counts(normalize=True))

Output:

    ham     0.865937
    spam    0.134063

    # Adding noise. Switching 5% of ham data to 'spam' label
    tmp_df = data[data['label'] == 'ham']
    examples_to_change = int(tmp_df.shape[0] * 0.05)
    print(f'Changing examples: {examples_to_change}')

    examples_text_to_change = tmp_df.head(examples_to_change)['text'].tolist()
    changed_df = pd.DataFrame([[i, 'spam'] for i in examples_text_to_change])
    changed_df.rename({0: 'text', 1: 'label'}, axis=1, inplace=True)

    left_data = data[~data['text'].isin(examples_text_to_change)]
    final_df = pd.concat([left_data, changed_df])
    final_df.reset_index(drop=True, inplace=True)

    # Modified data distribution for spam and not spam (ham) categories
    print(final_df['label'].value_counts(normalize=True))

Output:

    Changing examples: 216
    ham     0.840016
    spam    0.159984

    raw_texts, raw_labels = final_df["text"].values, final_df["label"].values

    # Splitting into train and test sets. The split call is missing from this
    # listing; test_size and random_state below are assumed values.
    raw_train_texts, raw_test_texts, raw_train_labels, raw_test_labels = train_test_split(
        raw_texts, raw_labels, test_size=0.2, random_state=42
    )

    # Converting label into integers
    encoder = LabelEncoder()
    encoder.fit(raw_train_labels)
    train_labels = encoder.transform(raw_train_labels)
    test_labels = encoder.transform(raw_test_labels)

    # Vectorizing text sequence using sentence-transformers
    transformer = SentenceTransformer('all-mpnet-base-v2')
    train_texts = transformer.encode(raw_train_texts)
    test_texts = transformer.encode(raw_test_texts)

    # Instantiating model instance
    model = LogisticRegression(max_iter=200)

    # Wrapping the scikit-learn model in CleanLearning
    cl = CleanLearning(model)

    # Finding label issues in the train set
    label_issues = cl.find_label_issues(X=train_texts, labels=train_labels)

    # Picking the 50 samples with the lowest label-quality scores
    identified_issues = label_issues[label_issues["is_label_issue"] == True]
    lowest_quality_labels = label_issues["label_quality"].argsort()[:50].to_numpy()

    # Pretty-print the label issues detected by Cleanlab
    def print_as_df(index):
        return pd.DataFrame(
            {
                "text": raw_train_texts,
                "given_label": raw_train_labels,
                "predicted_label": encoder.inverse_transform(label_issues["predicted_label"]),
            },
        ).iloc[index]

    print_as_df(lowest_quality_labels[:5])

As we can see, Cleanlab assisted us in automatically removing the incorrect labels and training a better model with the same parameters and settings. In my experience, people frequently ignore data concerns in favor of building more sophisticated models to increase accuracy numbers. Improving data, on the other hand, is a pretty simple performance win. And, thanks to products like Cleanlab, it has become really simple and convenient.

Feel free to access and play around with the above code in the accompanying Colab notebook.

Conclusion

In conclusion, Cleanlab offers a straightforward solution to enhance data quality by addressing label inconsistencies, a crucial step in building more reliable and accurate machine learning models. By focusing on data integrity, Cleanlab simplifies the path to better performance and underscores the significance of clean data in the ever-evolving landscape of AI. Elevate your model's accuracy by investing in data quality, and explore the provided code to see the impact for yourself.

Author Bio

Prakhar has a Master’s in Data Science with over 4 years of experience in industry across various sectors like Retail, Healthcare, Consumer Analytics, etc. His research interests include Natural Language Understanding and Generation, and he has published multiple research papers in reputed international publications in the relevant domain. Feel free to reach out to him on LinkedIn.
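
The noise-injection step in the Cleanlab walkthrough above can also be sketched without pandas, which makes it easier to keep track of which rows were corrupted and later check how many of them a detector recovers. This is an illustrative variant, not the author's code: the helper names inject_label_noise and detection_recall are hypothetical, and it flips a seeded random sample of 'ham' rows, whereas the article takes the first 5% with head().

```python
import random
from collections import Counter


def inject_label_noise(labels, source="ham", target="spam", fraction=0.05, seed=0):
    """Flip `fraction` of the `source` labels to `target`.

    Returns the noisy label list and the sorted indices that were flipped.
    """
    rng = random.Random(seed)  # seeded so the corrupted rows are reproducible
    candidates = [i for i, label in enumerate(labels) if label == source]
    n_flip = int(len(candidates) * fraction)
    flipped = sorted(rng.sample(candidates, n_flip))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = target
    return noisy, flipped


def detection_recall(flipped, flagged):
    """Share of the deliberately flipped rows that a detector also flagged."""
    if not flipped:
        return 0.0
    return len(set(flipped) & set(flagged)) / len(flipped)


if __name__ == "__main__":
    labels = ["ham"] * 100 + ["spam"] * 20
    noisy, flipped = inject_label_noise(labels)
    print(Counter(noisy))
    # Pretend a detector flagged only the first three corrupted rows.
    print(detection_recall(flipped, flagged=flipped[:3]))
```

In a setup like the one in the article, the flagged indices could come from the rows where the is_label_issue column returned by find_label_issues is True, assuming the index bookkeeping is carried through the train/test split.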
Henry Habib
20 Feb 2024
8 min read

Leveraging Google Cloud for Custom Endpoint with OpenAI

This article is an excerpt from the book, OpenAI API Cookbook, by Henry Habib. Integrate the ChatGPT API into various domains ranging from simple wrappers to knowledge-based assistants, multi-model, and conversational applications.

Introduction

In the realm of application development, the integration of OpenAI's capabilities through custom backend endpoints on Google Cloud is a pivotal step towards unlocking intelligent solutions. This article explores the process of creating such endpoints using Google Cloud's Cloud Functions, allowing for control, customization, and the seamless integration of OpenAI's features. Through this fusion of technologies, developers can craft innovative applications that leverage the power of artificial intelligence to enhance user experiences and drive creativity in their projects.

Creating a public endpoint server that calls the OpenAI API

There are many important benefits of creating your own public endpoint server that calls the OpenAI API, instead of connecting to the OpenAI API directly. The biggest are control and customization, which we will explore in this recipe and the next recipe.

In this recipe, we will use GCP to host our public endpoint. When this endpoint is called, it will make a request to OpenAI for a slogan for an ice cream company and then return the answer to the user. Making a public endpoint for this sounds simple and almost unnecessary, but it is the final step we need to build a truly intelligent application that leverages OpenAI.

To do this, we will create a GCP resource called Cloud Functions, which we will explore later in the How it works… section of the recipe.

Getting ready

Ensure you have an OpenAI platform account with available usage credits.
If you don’t, please follow the Setting up your OpenAI API Playground environment recipe in Chapter 1. Furthermore, ensure you have created a GCP account. To do this, navigate to https://cloud.google.com/, then select Start Free from the top right, and follow the instructions that you see.

You may need to provide a billing profile as well to create any GCP resources. Note that GCP does have a free tier, and in this recipe, we will not go above the free tier (so, essentially, you should not be billed for anything).

You may need to create a project if this is your first time logging into Google Cloud Platform. After you log in, select Select a project from the top left and then select New Project. Provide a project name and then select Create.

The next recipe in this chapter will also have this same requirement.

How to do it…

1. Navigate to https://console.cloud.google.com/. In the Search field at the top of the page, type in Cloud Functions and select the top choice from the drop-down menu, Cloud Functions.

Figure – Cloud Functions in the dropdown

2. Select Create Function from the top of the page. This will begin to create our custom backend endpoint and start the configuration steps. On the Configuration page, fill in the following fields:

- Environment: Select 2nd gen from the drop-down menu.
- Function name: Since we’re creating a backend endpoint that will produce company slogans, the function name will be slogan_creator.
- Region: Choose the environment location nearest you.
- In the Trigger menu, choose HTTPS. In the Authentication sub-menu, select Allow unauthenticated invocation. We need to check this as we are going to create a public endpoint that will be accessible from our frontend services.

Figure – Sample configuration settings of a Google Cloud Function

3. Select the Next button at the bottom of the page to move on to the Code section.

4. From the Runtime dropdown, select Python 3.12.
This ensures that our backend endpoint will be coded using the Python programming language.

5. For the Entry point option, type in create_slogan. This refers to the name of the function in Python that is called when the public endpoint is reached and triggered.

6. On the left-hand side menu, you will see two files: main.py and requirements.txt. Select the requirements.txt file. This will list all the Python packages that need to be installed for our Cloud Function to operate.

7. In the center of the screen where the contents of requirements.txt are displayed, enter a new line and type in openai. This will ensure that the latest openai library package is installed. Your screen should look like what’s displayed in the figure below.

Figure – Snapshot of the requirements.txt file

8. From the left-hand side menu, select main.py. Copy and paste the following code into the center of the screen (where the content for that file is displayed). These are the instructions that the public endpoint will run when it is triggered:

    import functions_framework
    from openai import OpenAI

    @functions_framework.http
    def create_slogan(request):
        # Replace the placeholder with your own OpenAI API key
        client = OpenAI(api_key='<API Key here>')
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "You are an AI assistant that creates one slogan based on company descriptions"
                },
                {
                    "role": "user",
                    "content": "A company that sells ice cream"
                }
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
        # Return only the generated slogan text to the caller
        return response.choices[0].message.content

As you can see, it simply calls the OpenAI endpoint, requests a chat completion, and then returns the output to the user. You will also need your OpenAI API key.

9. Next, deploy the function by selecting the Deploy button at the bottom of your page.

10. Wait for your function to be fully deployed, which typically takes two minutes.
You can verify whether the function has been deployed by observing the progress in the top-left section of the page (shown in the figure below). Once it is green and checkmarked, the build is successful, and your function has been deployed.

Figure – The Cloud Function deployment page

11. Now, let’s verify that our function works. Select the endpoint URL, found at the top of the page near URL. It’s typically in the form https://[location]-[project-name].cloudfunctions.net/[function-name]. It is also highlighted in the figure above.

12. This will open a new web page that will trigger our custom public endpoint and return a chat completion, which, in this case, is the slogan for an ice cream business. Note that this is a public endpoint: it will work on your computer, phone, or any device connected to the internet.

Figure – Output of a Google Cloud Function

How it works…

In this recipe, we created a public endpoint. This endpoint can be accessed by anyone (including your application in future recipes). The logic of the endpoint is simple and something we have covered before: return a slogan for a company that sells ice cream. What’s new, however, is that this is our very own public endpoint that is hosted in Google Cloud, using the Cloud Function resource.

Note that we used the free tier of Google Cloud Functions, which does have limitations such as a cap on the number of function invocations per month, limited execution time, and constrained computational resources. However, for our current purposes, these limitations are not a hindrance, allowing us to deploy and test our functions effectively without incurring costs.
This setup is ideal for small-scale applications or for learning and experimentation purposes, providing a practical way to understand cloud functionalities and serverless architecture in a cost-effective manner.

Conclusion

In conclusion, the synergy between Google Cloud's infrastructure and OpenAI's capabilities offers developers a powerful platform for creating intelligent applications. By leveraging Cloud Functions to build custom backend endpoints, developers can unlock a world of possibilities for innovation and creativity. This article has provided a comprehensive guide to integrating OpenAI into Google Cloud, empowering developers to craft intelligent solutions that enhance user experiences and drive the evolution of application development. With this knowledge, developers are well-equipped to embark on their journey of building intelligent applications that push the boundaries of what is possible in the digital landscape.

Author Bio

Henry Habib is a Manager at one of the world's top management consulting firms, advising F500 companies on analytics and operations, with a particular focus on building intelligent AI-driven solutions and tools to create impact. He is a passionate online instructor and educator, amassing more than 150K paid students and facilitating technical programs at large banks and governmental organizations.

A proponent of the no-code and LLM revolution, he believes that anyone can now create powerful and intelligent applications without any deep technical skills. Henry resides in Toronto, Canada with his wife, and enjoys reading AI research papers and playing tennis in his free time.
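
The recipe above hard-codes the API key and the company description inside create_slogan. Since the stated benefit of a custom endpoint is control and customization, here is a minimal sketch of two small helpers that could support that: one reads the key from an environment variable, the other builds the messages list from any description. The helper names load_api_key and build_messages are hypothetical, and the assumption that the key is supplied through a variable named OPENAI_API_KEY is mine, not the book's.

```python
import os

# System prompt copied from the recipe's main.py
SYSTEM_PROMPT = (
    "You are an AI assistant that creates one slogan based on company descriptions"
)


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Read the API key from an environment mapping instead of the source code.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


def build_messages(description):
    """Build the chat messages for one company description."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": description},
    ]


if __name__ == "__main__":
    # No network call is made here; this only shows the request payload.
    for message in build_messages("A company that sells ice cream"):
        print(message["role"], "->", message["content"])
```

Inside create_slogan, the client could then be created with OpenAI(api_key=load_api_key()) and the request made with messages=build_messages(...), keeping the secret out of main.py. Where the description comes from (for example, a query parameter on the incoming request) is a design choice this sketch leaves open.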

Merlyn Shelley
16 Feb 2024
10 min read

AI_Distilled 37: Cutting-Edge Updates and Expert Guidance

👋 Hello,

“[Sovereign AI] codifies your culture, your society's intelligence, your common sense, your history – you own your own data…[Use AI to] activate your industry, build the infrastructure, as fast as you can.”
- Jensen Huang, NVIDIA founder and CEO

Huang strongly advocated for countries to rapidly develop their own national AI capabilities and systems. NVIDIA's dominance in GPUs positions it to be a major beneficiary of the AI revolution, as its technologies are fundamental for running advanced AI applications. No wonder NVIDIA's market value recently surpassed Amazon’s.

Embark on a new AI journey with AI_Distilled, a curated digest of the most recent developments in AI/ML, LLMs, NLP, GPT, and Gen AI. We’ll kick things off by tapping into the latest news and developments in the AI sector:

- OpenAI updates ChatGPT with memory retention
- Microsoft unveils new Copilot features
- Google updates Gemini and unveils mobile app
- Apple's new AI model called MGIE
- New open-source AI model Aya converses in 100+ languages
- Reka AI introduces two new state-of-the-art AI models
- DeepMind and USC develop new technique to improve LLMs’ reasoning abilities
- New open-source AI model Smaug-72B achieves top spot
- NVIDIA unveils new chatbot Chat with RTX
- AI helps identify birds and conserve an English wetland
- USPTO issues new guidance stating AI alone can't be named as an inventor

We’ve also handpicked GPT and LLM resources and secret knowledge that’ll come in handy for your next project:

- Building a Scalable Foundation Model Platform for Your Enterprise
- Making Bridges Between AI and Business
- Evaluating Large Language Models: A Guide to Benchmark Tests
- Code Generation Gets Smarter with Context

Looking for hands-on tips and strategies straight from the developer community? We’ve got you covered with some incredible tutorials to get you started:

- Building a Question Answering Bot from Scratch
- Creating SMS Apps with Next.js and AI Assistants
- Harness the Power of LLMs Without GPUs
- Making the Switch to Open-Source Models

Finally, feel free to check out our curated list of smoking hot GitHub repositories:

- arplaboratory/learning-to-fly
- time-series-foundation-models/lag-llama
- noahfarr/rlx
- uclaml/SPIN
- phidatahq/phidata

📥 Feedback on the Weekly Edition

Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."

📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.

We appreciate your input and hope you enjoy the book! Share your thoughts and opinions here!

Writer’s Credit: Special shout-out to Vidhu Jain for their valuable contribution to this week’s newsletter content!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

⚡ TechWave: AI/GPT News & Analysis

OpenAI (whose annual revenue hit $2 billion in December) has updated its ChatGPT chatbot to retain information from past conversations, allowing it to apply context from previous discussions to new chats. This could help the bot respond with more relevant, personalized replies. Microsoft's AI assistant Copilot has also received upgrades like improved models and image editing tools. Google updated its conversational AI too with a new mobile app and advanced model (now called Gemini). A new paid tier for Gemini Ultra provides developers and users access to more advanced features and capabilities for $20 per month.
Apple's new AI model called MGIE allows editing images through natural language commands, performing tasks from color adjustments to complex manipulations. Interestingly, a new open-source AI model called Aya can converse in over 100 languages, potentially increasing access for many. Reka AI has introduced two new state-of-the-art AI models, Flash and Edge, which achieve top performance on language and vision tasks while Edge maintains a smaller size. Researchers from DeepMind and USC have developed a new technique called SELF-DISCOVER to improve LLMs’ reasoning abilities. A new open-source AI model (available for all) called Smaug-72B has achieved the top spot on the leaderboard for language models, demonstrating skills that surpass proprietary competitors.

NVIDIA's market value surpassed Amazon's for the first time since 2002 thanks to strong demand for its AI chips. The company also released a new chatbot called Chat with RTX, allowing users to run personalized generative AI models locally on PCs. Chatbots aside, AI is making waves across fields. Conservationists are using AI to identify birds by their songs to help restore an English wetland. Scientists are speeding discoveries and tackling climate change better as new multimodal and smaller language models enhance technologies. That said, the USPTO has issued new guidance stating that while AI alone can't be named as an inventor on patents, humans can use AI in the invention process as long as they make a significant creative contribution.

🔮 Expert Insights from Packt Community

LLMs Under the Hood – Building Models for Your Unique Use Cases [Video] - By Maxime Labonne, Denis Rothman, Abi Aryan

This video course is an invaluable resource for AI developers looking to master the art of building enterprise-grade Large Language Models (LLMs). Here's why it's a must-watch:

Key Takeaways:

1. Expert-Led Guidance: Learn from industry experts Maxime Labonne, Denis Rothman, and Abi Aryan, who bring a wealth of experience in LLM development.
2. End-to-End Coverage: Gain comprehensive insights into the entire LLM lifecycle, from architecture to deployment.
3. Advanced Skills Development: Acquire advanced skills to architect high-performing LLMs tailored to your specific business needs.
4. Hands-On Learning: Engage in practical exercises that reinforce key concepts and techniques for building, refining, and deploying LLMs.
5. Real-World Impact: Learn how to create LLMs that deliver tangible business value and solve complex organizational challenges.

Course Highlights:

- Making informed architecture decisions for optimal performance.
- Selecting the right model types and configuring hyperparameters effectively.
- Curating high-quality training data for better model outcomes.
- Mastering pre-training, fine-tuning, and rigorous model evaluation techniques.
- Strategies for smooth productionization, proactive monitoring, and post-deployment maintenance.

By the end of this masterclass, you'll be equipped with the practical knowledge and skills needed to develop and deploy LLMs that drive real-world impact for your organization.

Watch Here

🌟 Secret Knowledge: AI/LLM Resources

🌀 Building a Scalable Foundation Model Platform for Your Enterprise: This post outlines how enterprises can provide different teams governed access to powerful foundation models through a centralized API layer. The solution described captures model usage and costs for each team to enable chargebacks. It also allows controlling access and throttling usage on a per-team basis. Building on serverless AWS services ensures the solution scales to meet demand. Whether you need transparent access for innovation or just want to understand how teams are leveraging AI, implementing a solution like this can help unlock the potential of generative AI for your whole organization.
🌀 Making Bridges Between AI and Business: This article discusses how businesses can develop an AI platform to integrate generative technologies like RAG and CRAG safely. It covers collecting data, querying knowledge bases, and using prompt engineering to guide AI models. The goal is to leverage AI's potential while avoiding risks through a blended strategy of retrieval and generation. This overview provides a solid foundation for aligning cutting-edge models with your organization's needs.

🌀 Evaluating Large Language Models: A Guide to Benchmark Tests: As AI language models become more advanced, it's important we have proper ways to assess their abilities. This article outlines several benchmark tests that evaluate language models on tasks like reasoning, code generation and more. Tests like WinoGrande, Hellaswag and GLUE provide insights into models' strengths and weaknesses. The benchmarks also allow for comparisons between different models. They give us a more complete picture of a model's skills.

🌀 Code Generation Gets Smarter with Context: Google's Codey APIs now enhance code completion and generation using Retrieval Augmented Generation, which retrieves relevant code from repositories to provide more accurate responses. This "RAG" technique allows LLMs to leverage external context. The blog post explores how RAG works and demonstrates its ability to inject appropriate code snippets. While not perfect, RAG is a useful tool to explore coding variations and adapt to custom styles when used with Codey on Vertex AI.

Partnering with Notion

Ever tried Notion? It's a workspace that helps you do things better and faster. You get AI for notes and teamwork, easy drag-and-drop for content, and cool new features to help manage projects and share knowledge.

Give it a Try!
🔛 Masterclass: AI/LLM Tutorials

🌀 Building a Question Answering Bot from Scratch: This tutorial shows you how to create a basic question answering bot by processing text from Wikipedia, generating embeddings with OpenAI, and storing the data in Momento Vector Index. It covers initializing clients, loading and chunking data, generating embeddings, indexing the embeddings, and searching to return answers. The bot is enhanced by using GPT-3 to provide concise responses instead of raw text. Following these steps will give you hands-on experience constructing a QA system from the ground up.

🌀 Creating SMS Apps with Next.js and AI Assistants: This article shows you how to build a texting app that uses Next.js for the frontend and backend. OpenAI's GPT-3 is utilized to generate meeting invite messages. Twilio handles sending the texts. React components collect invite details. API routes fetch GPT-3 responses and send data to Twilio. It's a clever way to enhance workflows with AI.

🌀 Harness the Power of LLMs Without GPUs: Google's new localllm tool allows developers to run large language models locally using just a CPU, eliminating the need for expensive GPUs. With localllm and Cloud Workstations, you can build AI-powered apps right in your browser-based development environment. Quantized models optimize performance on CPUs while localllm handles downloading and running the models. The post provides instructions for setting up a Cloud Workstation with localllm pre-installed to get started with this new way to develop with LLMs.

🌀 Making the Switch to Open-Source Models: Hugging Face's new Messages API allows developers to easily transition chatbots and conversational models from OpenAI to open-source options like Mixtral. The API maintains compatibility with OpenAI libraries so your code doesn't need updating. You can also deploy these models to Hugging Face's Inference Endpoints and use them with frameworks like LangChain and LlamaIndex. This unlocks greater control, lower costs and more transparency compared to closed-source options.

🚀 HackHub: Trending AI Tools

🌀 arplaboratory/learning-to-fly: Train end-to-end quadrotor control policies using deep reinforcement learning on a laptop in seconds

🌀 time-series-foundation-models/lag-llama: Open-source foundation model for probabilistic time series forecasting that performs zero-shot predictions on new time series data and can be fine-tuned for specific forecasting needs

🌀 noahfarr/rlx: Implements reinforcement learning algorithms using Apple's MLX framework, making the code optimized to run on M-series chips

🌀 uclaml/SPIN: Implement Self-Play Fine-Tuning to enhance language models through self-supervised learning

🌀 phidatahq/phidata: Open-source toolkit for building AI assistants using function calling

Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support!