
How-To Tutorials

7019 Articles
Huage Chen
02 Jul 2024
12 min read

Creating and Using Kibana Dashboards

This article is an excerpt from the book, Elastic Stack 8.x Cookbook, by Huage Chen and Yazid Akadiri. Unlock the full potential of Elastic Stack for search, analytics, security, and observability and manage substantial data workloads in both on-premise and cloud environments.

Introduction

In this guide, we will integrate all previously created visualizations into a comprehensive dashboard consisting of multiple panels. Additionally, we will explore how to enhance user interaction using control-based drilldowns.

Getting ready

Make sure to complete the following recipes from this chapter:

Creating visualizations with Kibana Lens
Creating visualizations from runtime fields
Creating Kibana maps

At the end of this recipe, you will have dashboards composed of the various visualizations and elements built in the aforementioned recipes.

How to do it...

Building dashboards is very straightforward in Kibana, especially if you’ve already created some visualizations. Follow these steps:

1. Go to Kibana | Analytics | Dashboard and click on Create dashboard.

This will bring you to a blank canvas, where you can start adding some visualizations.

2. We will start by adding a nice image! You can be creative, but we provided a sample picture:
A. Click on Add panel | Image.
B. Select the Use link tab and set Link to image with the following URL: https://upload.wikimedia.org/wikipedia/commons/6/60/Ville_de_RENNES_Noir.svg. Then, click on Save:

Figure 6.54 – Adding an image for a logo

The logo will be added to the panel. Including a picture is a great way to add some personalization and branding to your dashboards. Let’s add some proper visualizations from the ones we’ve built in the last three recipes.

3. Click on Add from library and select the [Rennes Traffic] Number of locations visualization. Make sure to align it to the right with the image panel.
4.
Let’s add another visualization; this time, we’ll pick [Rennes Traffic] Average speed gauge.

At this stage, your dashboard should look like the one shown in Figure 6.55:

Figure 6.55 – Rennes traffic dashboard – first step

You can easily rearrange the position of the different panels by clicking on the title section and moving the panel with your mouse anywhere you want on the canvas. To adjust the size and fit of the panel, position your mouse on the small arrow at the bottom right of the panel. Let’s keep adding more panels to our dashboard.

5. Click on Add from library and add the following visualizations in the respective order:
I. [Rennes Traffic] Traffic status waffle
II. [Rennes Traffic] Speed by road hierarchy
III. [Rennes Traffic] Average speed & Traffic Status
IV. [Rennes Traffic] Traffic status by hour
6. Finally, let’s add a Map visualization for a real-time view of the traffic; select the one named [Rennes Traffic] Traffic fluidity.

By now, your dashboard should look like the one shown in Figure 6.56:

Figure 6.56 – Rennes traffic dashboard – more visualizations

You can start playing around with the dashboard to see the built-in interactivity of the panels. For example, clicking on a specific road hierarchy will automatically apply the filter to the entire dashboard.

You can also have dedicated panels to filter and display only the data you are interested in with Controls. Let’s add some to our dashboard.

7. On the dashboard toolbar, click on Controls:

Figure 6.57 – Adding controls to the dashboard

8. From the drop-down list, select Add control; the Create control flyout will appear on the right of the screen.
9. Select the traffic_status field and click on Save and close.
10. Back to the dashboard, you now have a new panel on top of the visualization named traffic_status. By clicking on it, you will see a drop-down list where you can select the values associated with the status of the traffic you want to filter, as shown in Figure 6.58.
Select congested as an example:

Figure 6.58 – Using controls in the dashboard

11. You can see on your dashboard that all the panels have been updated according to the value selected in the traffic_status control.

Imagine you want to filter your traffic data to analyze it within a specific time range, such as early in the morning or late in the afternoon, to better understand traffic patterns. This is where the time slider control proves to be incredibly useful.

12. Go to the Controls menu again in the dashboard toolbar and select Add time slider control.

You’ll see a new panel to the right of traffic_status:

Figure 6.59 – Time slider control

By clicking the play icon, you will see your dashboard animate and your data change over the defined time range. You can advance the time range forward as well as backward, which is especially useful when working with time series data.

Your dashboard should now look as shown in Figure 6.60, with our two controls:

Figure 6.60 – Rennes traffic dashboard with controls

13. Save the dashboard by clicking the Save button in the upper-right corner. Name it [Rennes Traffic] Overview.

To enhance our dashboard further, consider this: users frequently manage multiple dashboards, and the ability to navigate seamlessly from one to another is crucial, especially when aiming to refine analysis or focus on more detailed panels related to a specific dataset. Dashboard drilldowns are invaluable in this scenario as they allow you to transition between dashboards while maintaining the overall context. Let’s explore how to implement and use this feature effectively!

For this exercise, we have already built a drilldown dashboard. Download and save the NDJSON file of the exported dashboard from the following location: https://github.com/PacktPublishing/Elastic-Stack-8.x-Cookbook/blob/main/Chapter6/kibana-objects/rennesdata-drilldown-dashboard.ndjson. Then, follow these steps:

1.
To import the dashboard, go to Stack Management | Saved Objects.
2. Click on Import and select the NDJSON file you have previously downloaded from the GitHub repository. Upon completing the import process, you will notice a warning in the flyout about data view conflicts. The reason is straightforward: our saved objects rely on an existing data view. To resolve the conflict, simply click on the drop-down list under the New data view column and select metrics-rennes_traffic-raw, as shown in Figure 6.61, then click on Confirm all changes to finalize the import procedure:

Figure 6.61 – Importing saved objects and selecting the right data view

3. Once all the objects have been imported, you will get a recap as shown in the following screenshot:

Figure 6.62 – Saved objects successfully imported from the file

Return to the [Rennes Traffic] Overview dashboard. Then, open the menu for the [Rennes Traffic] Speed by road hierarchy panel and select Create drilldown:

Figure 6.63 – Creating drilldown from the panel

4. Navigate to the drilldowns page and select the Go to Dashboard option. Here, you will need to name your drilldown; consider View Details for Road Hierarchy as a suggestion. Then, from the Choose destination dashboard drop-down menu, select [Rennes Traffic] Detailed traffic drilldown dashboard, which you have recently imported. This process sets up a targeted navigation path within your dashboard environment, allowing for a seamless transition between your overview and detailed analysis dashboards:

Figure 6.64 – Configuring dashboard drilldown

5. Click on Create drilldown, then save the dashboard. To test the drilldown, click on one of the five charts in the [Rennes Traffic] Speed by road hierarchy panel. You will be redirected to the detailed dashboard filtered on the value you have selected.

Figure 6.65 – Dashboard view after drilldown

Et voilà!
You have just built your first dashboard with a nice touch of interactivity thanks to controls and drilldowns.

How it works...

In Kibana, a dashboard is a collection of visualizations and saved searches that you can arrange and customize to display the data that is most important to you. You can create multiple dashboards for different use cases, and each dashboard can have its own set of visualizations and searches.

Dashboards are a powerful tool for data analysis because they allow you to see multiple visualizations side by side and quickly identify patterns and trends in your data. You can also use dashboards to monitor key metrics in real time, which is especially useful for operational use cases. Kibana provides a wide range of visualization types that you can use to create custom dashboards, including bar charts, line charts, pie charts, tables, and more.

The following table outlines a framework for choosing the right visualization:

| Use case | Recommended type of visualization |
| --- | --- |
| Comparison and correlation | Many items: Horizontal bar. Few items: Vertical bar |
| Comparison over time | Few periods and categories: Stacked bar. Few time periods but many categories: Line graph |
| Distribution of values | Few numbers of points: Vertical bar histogram. Many points: Line histogram |
| Composition of a whole | Simple compositions with few items: Waffle or Treemap. Multiple grouping dimensions for a few bottom-level items: Mosaic. Multiple grouping dimensions for many bottom-level items: Treemap |
| Eye-catching summary | One value: Metric. Many values: Table with color styling |
| Visualizing goals or targets | Vertical bar or Line with reference lines. Metric |

Table 6.2 – Choosing the right visualization

In addition to visualizations, Kibana dashboards also support saved searches, which allow you to quickly filter your data based on specific criteria. You can save searches that you use frequently and add them to your dashboard for easy access.

Overall, Kibana dashboards are a powerful tool for data analysis and monitoring.
They allow you to quickly identify patterns and trends in your data, monitor key metrics in real time, and customize your view of the data to suit your needs.

There’s more...

In our recipe, we have used dashboard drilldowns, but you can also create URL and Discover drilldowns. With the former, you can link to data outside of Kibana, and with the latter, you can open Discover from a Lens panel while keeping all the contextual information.

Dashboards are great when used in Kibana, but you can also share them with teams and colleagues outside of Kibana. You have many options that are easily accessible from the Share menu in the toolbar when it comes to sharing dashboards: you can interactively embed dashboards as an iFrame, export them as reports in various formats (PNG, CSV, PDF, etc.), and share them as direct links for easy access.

When building dashboards, design thinking is a good practice. Start by asking yourself the following questions:

What is the outcome or the goal of the dashboard? Is it about understanding high-level behaviors, visually correlating specific metrics at the same time, or finding the root cause of an issue?
Who is using this dashboard to do their job? If you are building it for a team or someone else, step into their shoes to visualize their perspective when they will need that data.

See also

Looking for more design tips to elevate your dashboards?
Look no further and check out this blog: https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer

If you’re interested in delving deeper into the topics of creating dashboards more efficiently, be sure to check out this technical blog: https://www.elastic.co/blog/building-kibana-dashboards-more-efficiently

For developers interested in debugging their Kibana dashboard, the following article will be very useful: https://www.elastic.co/blog/debugging-kibana-dashboards

Conclusion

In this guide, we've explored the process of integrating various visualizations into a comprehensive Kibana dashboard, enhancing user interaction through control-based drilldowns. By following the steps outlined, you should now have a functional and interactive dashboard that can provide valuable insights into your data.

We began by preparing the necessary visualizations and then moved on to assembling the dashboard by adding images for personalization and aligning various traffic visualizations. We also incorporated control panels for dynamic filtering, allowing for more precise data analysis. The final touch was adding drilldowns to enable seamless navigation between detailed and overview dashboards.

Kibana dashboards offer powerful tools for data analysis and real-time monitoring. By displaying multiple visualizations side by side, you can quickly identify patterns and trends, making dashboards invaluable for operational and analytical use cases.

Remember, the key to a successful dashboard is thoughtful design: consider the goals, the audience, and the specific data insights needed. Utilize the wide range of visualization types that Kibana offers and don't hesitate to leverage the sharing options to collaborate with your team effectively.

For further reading and advanced tips on designing intuitive dashboards, building them efficiently, or debugging, check out the additional resources provided.
Happy dashboarding!

Author Bio

Huage Chen is a member of Elastic's customer engineering team and has been with Elastic for over five years, helping users throughout Europe to innovate and implement cloud-based solutions for search, data analysis, observability, and security. Before joining Elastic, he worked for 10 years in web content management, web portals, and digital experience platforms.

Yazid Akadiri has been a solutions architect at Elastic for over four years, helping organizations and users solve their data and most critical business issues by harnessing the power of the Elastic Stack. At Elastic, he works with a broad range of customers, with a particular focus on Elastic observability and security solutions. He previously worked in web services-oriented architecture, focusing on API management and helping organizations build modern applications.
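As a companion to the import step in the Kibana recipe above, it can help to inspect an exported NDJSON file before importing it, for example to see which data views its objects reference (the source of the data view conflict warning). The following is a minimal sketch, assuming the usual export layout of one JSON object per line with `type` and `references` keys, data views stored under the `index-pattern` type, and a trailing summary record without a `type` key. The function name `summarize_export` and the sample objects are hypothetical and made up for illustration; they are not taken from the book's file.

```python
import json
from collections import Counter


def summarize_export(ndjson_text):
    """Count saved objects by type and collect the data view ids they reference."""
    type_counts = Counter()
    data_view_ids = set()
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if "type" not in obj:
            # Assumption: the export ends with a summary record that has no "type" key.
            continue
        type_counts[obj["type"]] += 1
        for ref in obj.get("references", []):
            # Assumption: data views are referenced with the "index-pattern" type.
            if ref.get("type") == "index-pattern":
                data_view_ids.add(ref["id"])
    return dict(type_counts), sorted(data_view_ids)


# Hypothetical two-object export, invented for this example.
SAMPLE_EXPORT = "\n".join([
    json.dumps({
        "id": "lens-1",
        "type": "lens",
        "attributes": {"title": "Speed by road"},
        "references": [{"id": "dv-1", "name": "layer-a", "type": "index-pattern"}],
    }),
    json.dumps({
        "id": "dash-1",
        "type": "dashboard",
        "attributes": {"title": "Drilldown"},
        "references": [{"id": "lens-1", "name": "panel_0", "type": "lens"}],
    }),
    json.dumps({"exportedCount": 2, "missingRefCount": 0, "missingReferences": []}),
])

counts, data_views = summarize_export(SAMPLE_EXPORT)
print(counts)
print(data_views)
```

To try this on a real export, read the downloaded file with `open(path, encoding="utf-8").read()` and pass the text to `summarize_export`; any data view id it reports that does not exist in your Kibana space is a candidate for the conflict prompt during import.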

Ned Letcher
28 Jun 2024
7 min read

Hands-On Exploratory Data Analysis with DuckDB

This article is an excerpt from the book, Getting Started with DuckDB, by Simon Aubury and Ned Letcher.

Introduction

DuckDB is a versatile and highly optimized database management system designed for efficient data analysis workflows. Its capabilities allow practitioners to scale their data analysis efforts beyond traditional tools, making it an excellent choice for local machine data processing. In this excerpt, we will explore how to use DuckDB for hands-on exploratory data analysis, leveraging Python, Jupyter Notebooks, and Plotly for interactive data visualizations.

Technical Requirements

To follow along with the examples in this guide, you will need the following setup:

Python environment
Jupyter Notebook
DuckDB installed
JupySQL library
Plotly library

You can find the necessary code examples in the chapter_11 folder in the book’s GitHub repository at [PacktPublishing](https://github.com/PacktPublishing/Getting-Started-with-DuckDB/tree/main/chapter_11).

Obtaining the Dataset

We will be using a pedestrian counting system dataset from the city of Melbourne, containing hourly pedestrian counts from sensors located in and around the Melbourne Central Business District (CBD). This dataset provides a comprehensive view of pedestrian traffic patterns over several years.

To download the dataset, visit the dataset’s home page [Melbourne Pedestrian Counting System](https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour) and locate the ZIP file containing the 2009 to 2022 archive.

Setting Up the Environment

Before diving into the code, ensure your Python environment is set up with the necessary dependencies. You will need to:

1. Set up a Python virtual environment:

```bash
python -m venv duckdb_env
source duckdb_env/bin/activate
```

2.
Install the required libraries:

```bash
pip install jupyter duckdb plotly jupysql pandas
```

3. Start Jupyter Notebook:

```bash
jupyter notebook
```

Loading and Cleaning Data

First, we will load our dataset from a CSV file and perform some data cleaning steps before writing it to a DuckDB database.

Loading CSV Data into DuckDB

```python
import duckdb
import pandas as pd

# Load the dataset into a pandas DataFrame
data_url = "path_to_downloaded_zip_file/2022/2022.csv"
pedestrian_counts = pd.read_csv(data_url)

# Display the first few rows of the dataframe
print(pedestrian_counts.head())

# Create a DuckDB connection and write the DataFrame to a DuckDB table.
# DuckDB resolves the name in the FROM clause to the pandas DataFrame in scope.
con = duckdb.connect(database=':memory:')
con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")
```

Data Cleaning Steps

Perform necessary data cleaning operations such as handling missing values, correcting data types, and filtering irrelevant records.

```python
# Convert the 'Date_Time' column to datetime format
pedestrian_counts['Date_Time'] = pd.to_datetime(pedestrian_counts['Date_Time'])

# Handle missing values by filling them with 0
pedestrian_counts = pedestrian_counts.fillna(0)

# Replace the DuckDB table with the cleaned data
con.execute("DROP TABLE pedestrian_counts")
con.execute("CREATE TABLE pedestrian_counts AS SELECT * FROM pedestrian_counts")

# Verify the cleaned data
result = con.execute("SELECT * FROM pedestrian_counts LIMIT 5").fetchdf()
print(result)
```

Using JupySQL for SQL Queries

JupySQL is a powerful library that allows you to run SQL queries directly in Jupyter Notebooks. This makes it easy to interact with your DuckDB database without switching contexts.

Example JupySQL Query

Load the extension and pass the existing con object to %sql, so that queries run against the database that holds the pedestrian_counts table. Connecting to duckdb:///:memory: instead would open a separate, empty in-memory database.

```python
%load_ext sql
%sql con --alias duckdb
```

Then, in a new cell (the %%sql cell magic must be the first line of its cell), view the first few rows of the dataset:

```sql
%%sql
SELECT * FROM pedestrian_counts LIMIT 5;
```

Visualizing Data with Plotly

Plotly is a versatile data visualization library that integrates well with Jupyter Notebooks.
We will use it to create interactive visualizations of our dataset.

Total Pedestrian Counts Over Time

```python
import plotly.express as px

# Aggregate pedestrian counts by year
yearly_counts = con.execute("""
    SELECT strftime(Date_Time, '%Y') AS Year, SUM(Counts) AS Total_Counts
    FROM pedestrian_counts
    GROUP BY Year
    ORDER BY Year
""").fetchdf()

# Create a bar chart
fig = px.bar(yearly_counts, x='Year', y='Total_Counts', title='Total Pedestrian Counts by Year')
fig.show()
```

Monthly Traffic Counts

```python
# Aggregate pedestrian counts by month for the years 2019 and 2020
monthly_counts = con.execute("""
    SELECT strftime(Date_Time, '%Y-%m') AS Month, SUM(Counts) AS Monthly_Counts
    FROM pedestrian_counts
    WHERE strftime(Date_Time, '%Y') IN ('2019', '2020')
    GROUP BY Month
    ORDER BY Month
""").fetchdf()

# Create a line chart to compare the two years
fig = px.line(monthly_counts, x='Month', y='Monthly_Counts', title='Monthly Pedestrian Counts for 2019 and 2020')
fig.show()
```

Hourly Traffic Patterns

```python
# Aggregate pedestrian counts by hour of the day
hourly_counts = con.execute("""
    SELECT strftime(Date_Time, '%H') AS Hour, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Hour
    ORDER BY Hour
""").fetchdf()

# Create a line chart for hourly patterns
fig = px.line(hourly_counts, x='Hour', y='Average_Counts', title='Average Hourly Pedestrian Counts')
fig.show()
```

Exploratory Data Analysis

With our dataset loaded and visualized, we can perform a more detailed exploratory data analysis.

Comparing Traffic on Weekdays vs. Weekends

```python
# Derive the day name in SQL so the query does not rely on a column that is
# missing from the DuckDB table; isodow() orders Monday (1) through Sunday (7)
daily_counts = con.execute("""
    SELECT dayname(Date_Time) AS Day_of_Week, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Day_of_Week
    ORDER BY MIN(isodow(Date_Time))
""").fetchdf()

# Create a bar chart for daily patterns
fig = px.bar(daily_counts, x='Day_of_Week', y='Average_Counts', title='Average Pedestrian Counts by Day of the Week')
fig.show()
```

Peak Hours of Pedestrian Traffic

```python
# Identify peak hours by finding the hours with the highest average counts
peak_hours = con.execute("""
    SELECT strftime(Date_Time, '%H') AS Hour, AVG(Counts) AS Average_Counts
    FROM pedestrian_counts
    GROUP BY Hour
    ORDER BY Average_Counts DESC
    LIMIT 5
""").fetchdf()

# Create a bar chart for peak hours
fig = px.bar(peak_hours, x='Hour', y='Average_Counts', title='Peak Hours of Pedestrian Traffic')
fig.show()
```

Conclusion

DuckDB, combined with JupySQL and Plotly, provides a robust framework for performing hands-on exploratory data analysis. By leveraging DuckDB’s high-performance SQL capabilities and integrating with powerful visualization tools, you can efficiently uncover insights from your data. We encourage you to further explore DuckDB’s features and apply these techniques to your datasets.

For a deeper dive into DuckDB's powerful data analysis capabilities and to explore more advanced topics, we highly recommend reading the book 'Getting Started with DuckDB' by Simon Aubury and Ned Letcher.

Author Bio

Simon Aubury has been working in the IT industry since 2000 as a data engineering specialist. He has an extensive background in building large, flexible, highly available distributed data systems.
Simon has delivered critical data systems for finance, transport, healthcare, insurance, and telecommunications clients in Australia, Europe, and Asia Pacific. In 2019, Simon joined ThoughtWorks as a principal data engineer and today is associate director of data platforms at Simple Machines in Sydney, Australia. Simon is active in the data community, a regular conference speaker, and the organizer of local and international meetups and data engineering conferences.

Ned Letcher has worked as a data science and software engineering consultant since completing his PhD in computational linguistics in 2018 and currently works at Thoughtworks. He has designed and developed data-powered products and services across a range of industries and helped organizations and teams improve the effectiveness of their data processes and workflows. Ned has also worked as a Python trainer, supporting both tertiary students and data professionals across various organizations. He is active in the data community, speaking at and helping organize meetups and conferences, as well as contributing to a range of open source projects.

Merlyn Shelley
27 Jun 2024
14 min read

Top 100+ Essential Data Science Tools & Repos: Streamline Your Workflow Today!

Introduction

As data professionals, navigating the vast sea of Big Data often leaves us searching for the right tools to harness its potential. Whether we're defining intricate problems, identifying emerging trends, or crafting innovative solutions, the challenge is undeniable. Too often, this quest has us wandering aimlessly through the web, seeking elusive answers. Here at the DataPro Newsletter team, we understand this all too well. That's why, in celebration of our 100th edition, we're thrilled to present a special gift to our valued readers: a thorough reference module brimming with resources.

This carefully curated collection features over 100 of the most popular tools and GitHub repositories. Each one is not only widely used and trusted but is also consistently updated with the latest breakthroughs to enhance your data processing capabilities. Think of this module as your treasure chest, designed to streamline your workflow and inspire innovative solutions. Bookmark this page for quick access whenever you encounter challenges in any area of data science and machine learning, from DataOps to Recommender Systems to Quantitative Finance. We've got it all covered! So, dive into this one-stop reference module, explore its depths, and let the spirit of data kinship propel you forward. Here's to more empowering tools and transformative insights from your DataPro team. Cheers!

DataOps/MLOps

kestra-io/kestra: Kestra is an open-source orchestrator for scheduled and event-driven workflows, leveraging Infrastructure as Code for reliable management.
open-metadata/OpenMetadata: OpenMetadata is a unified platform for data discovery, observability, and governance, featuring a central repository, column lineage, and team collaboration.
dolthub/dolt: Dolt is a SQL database with Git-like version control features, accessible via MySQL or a command line interface.
iterative/dvc: DVC is a tool for reproducible machine learning, enabling data and model versioning, lightweight pipelines, experiment tracking, and easy sharing.
quiltdata/quilt: Quilt allows creating versioned datasets with Python and an S3 bucket. It supports data-driven teams, aiding rapid experimentation and collaboration.

Real-time Data Processing

allinurl/goaccess: GoAccess is a real-time web log analyzer for *nix systems and browsers, offering fast HTTP statistics. More details: goaccess.io.
feathersjs/feathers: Feathers is a TypeScript/JavaScript framework for building APIs and real-time apps, compatible with various backends and frontends.
apache/age: Apache AGE extends PostgreSQL with graph database capabilities, supporting both relational SQL and openCypher graph queries seamlessly.
zephyrproject-rtos/zephyr: Real-time OS for diverse hardware, from IoT sensors to smart watches, emphasizing scalability, security, and resource efficiency.
hazelcast/hazelcast: Hazelcast integrates stream processing and fast data storage for real-time insights, enabling immediate action on data-in-motion within a unified platform.

Data Quality Management

WeBankFinTech/Qualitis: Qualitis manages data quality through verification, notification, and management across various data sources, solving data processing-related quality issues.
raystack/optimus: Optimus is a robust workflow orchestrator for data transformation, modeling, pipelines, and quality management, emphasizing ease of use and reliability.
Toloka/crowd-kit: Crowd-Kit is a Python library for crowdsourced annotation, featuring aggregation methods, metrics, and datasets to simplify working with crowd data.
ydataai/ydata-profiling: ydata-profiling offers a streamlined, fast EDA solution akin to pandas' df.describe(), providing detailed DataFrame analysis exportable in formats like HTML and JSON.
cleanlab/cleanlab: cleanlab automates data and label cleaning by detecting issues in ML datasets, enhancing model training with real-world data.

Predictive Analytics

spring-cloud/spring-cloud-dataflow: Spring Cloud Data Flow enables microservices-driven data processing pipelines on Cloud Foundry and Kubernetes, supporting diverse use cases like streaming and batch processing.
ScottfreeLLC/AlphaPy: AlphaPy, a Python ML framework, caters to speculators and data scientists with scikit-learn, pandas, and additional tools for feature engineering and visualization.
retentioneering/retentioneering-tools: Retentioneering simplifies analyzing clickstreams and user paths, offering deeper insights than funnel analysis, benefiting data and marketing analysts.
genular/pandora: PANDORA offers advanced analytics for biomedical research, employing machine learning tools like clustering, PCA, UMAP, and interpretable models for discovery.
nabeel-oz/qlik-py-tools: Qlik's SSE integrates modern data science into Qlik Sense, enabling business users to leverage advanced analytics through Python-based functions.

Deep Learning

Lightning-AI/pytorch-lightning: Lightning 2.0 simplifies PyTorch workflows with a stable API, enabling scalable training and deployment of AI models efficiently.
ultralytics/yolov5: YOLOv5 by Ultralytics is a leading vision AI model, built on extensive open-source research and development for advanced performance.
hpcaitech/ColossalAI: Colossal-AI simplifies distributed deep learning with user-friendly tools, enabling easy parallel training and inference similar to local model development.
naptha/tesseract.js: Tesseract.js simplifies OCR with a WebAssembly-based Tesseract engine, supporting both browser and Node.js environments with easy integration and setup.
microsoft/DeepSpeed: DeepSpeed enables efficient training of models like ChatGPT with significant speed improvements and cost reductions across all scales.
Reinforcement Learning

ray-project/ray: Ray is a unified framework that scales AI and Python applications with a distributed runtime and specialized AI libraries.
d2l-ai/d2l-en: An open-source book using Jupyter notebooks to make deep learning accessible, blending concepts, context, and interactive code examples.
Unity-Technologies/ml-agents: Unity ML-Agents enables games and simulations for training intelligent agents with deep reinforcement learning and imitation learning, fostering innovation in AI.
google/trax: Trax is a Google Brain-endorsed deep learning library known for clear code and speed, demonstrated in a Colab notebook.
wandb/wandb: The repository includes a CLI and Python API for visualizing and tracking machine learning experiments effectively.
VowpalWabbit/vowpal_wabbit: Vowpal Wabbit advances machine learning with online, hashing, allreduce, and active learning techniques, pushing the frontier of ML capabilities.

Time Series Analysis

taosdata/TDengine: TDengine is a high-performance, open-source time-series database designed for IoT, connected cars, industrial IoT, and DevOps environments.
timescale/timescaledb: An open-source SQL database for time-series data, optimized for rapid data ingestion and complex querying, available as a PostgreSQL extension.
influxdata/telegraf: Telegraf is an agent for gathering and processing metrics, logs, and data, featuring 300+ plugins and community-driven development for flexibility.
questdb/questdb: QuestDB is an open-source time-series database known for high throughput ingestion, fast SQL queries, and operational simplicity, ideal for various high-cardinality datasets.
ccfos/nightingale: Nightingale is an all-in-one, open-source, cloud-native monitoring system combining data collection, visualization, and alerting capabilities seamlessly.
Data Engineering

PrefectHQ/prefect: Prefect simplifies Python data pipeline orchestration, transforming scripts into dynamic workflows that react to changes and ensure resilience.
airbytehq/airbyte: Airbyte, an open-source data integration platform, offers 300+ connectors for seamless ELT pipelines between diverse data sources and destinations.
argoproj/argo-workflows: Argo Workflows orchestrates parallel jobs on Kubernetes via container-native workflows, supporting DAGs and accelerating compute-intensive tasks like ML and data processing.
dagster-io/dagster: Dagster is a cloud-native data pipeline orchestrator with integrated lineage, observability, declarative programming, and robust testability across the lifecycle.
Avaiga/taipy: Taipy simplifies web app development for data scientists & ML engineers using Python, focusing on AI algorithms with no extra languages.

Business Intelligence

ankane/blazer: SQL-based tool for data exploration, chart creation, and dashboard sharing. Supports various data sources, variables, checks, audits, and security integrations.
evidence-dev/evidence: Open-source BI tool that uses Markdown with SQL queries for data sourcing, rendering charts, and generating templated, dynamic web pages.
lightdash/lightdash: Empower teams with self-service data insights using dbt: define metrics, visualize data, and share dashboards seamlessly across your organization.
TuiQiao/CBoard: User-friendly open BI platform for self-service reporting and dashboards, simplifying data insights and sharing across teams effortlessly.
quarylabs/quary: BI platform for engineers to connect databases, write SQL for table transformations, create charts, dashboards, and reports with collaboration and deployment capabilities.

Data Visualization

netdata/netdata: Real-time metrics collection and visualization for servers, cloud, Kubernetes, and edge/IoT devices, scaling effortlessly across diverse environments.
directus/directus: Open-source API and dashboard for managing SQL database content with REST & GraphQL interfaces, supporting various databases, and customizable for on-premises or cloud deployment. airbnb/visx: Reusable low-level visualization components combining d3's power with React's DOM updating capabilities for dynamic data visualization. uber/react-vis: React component library for diverse data visualizations: line, bar, scatter, heatmaps, pie charts, sunbursts, radar charts, and more. bokeh/bokeh: Interactive visualization library for web browsers, offering versatile graphics creation and high-performance interactivity for large datasets and dashboards. apache/echarts: Free JavaScript library for intuitive, interactive, and customizable charts, ideal for enhancing commercial products with powerful visualizations. Recommender Systems NicolasHug/Surprise: Python scikit for building recommender systems with explicit rating data, emphasizing experiment control, dataset handling, and diverse prediction algorithms. gorse-io/gorse: Open-source recommendation system in Go, designed for universal integration into online services, automating model training based on user interaction data. recommenders-team/recommenders: Recommenders, a Linux Foundation project, offers Jupyter notebooks for building classic and cutting-edge recommendation systems, covering data prep, modeling, evaluation, optimization, and production deployment on Azure. alibaba/Alink: Alink, developed by Alibaba's PAI team, integrates Flink for ML algorithms. PyAlink supports various Flink versions, maintaining compatibility up to Flink 1.13. RUCAIBox/RecBole: RecBole, built on Python and PyTorch, facilitates research with 91 recommendation algorithms across general, sequential, context-aware, and knowledge-based categories. 
Quantitative Finance AI4Finance-Foundation/FinGPT: FinGPT is a cost-effective, adaptable financial large language model for quick updates and fine-tuning, enhancing accessibility compared to BloombergGPT. google/tf-quant-finance: This library leverages TensorFlow's hardware acceleration and automatic differentiation for high-performance mathematical methods, mid-level functions, and pricing models support. goldmansachs/gs-quant: GS Quant, a Python toolkit by Goldman Sachs, aids in developing quantitative trading strategies and risk management solutions with robust market experience. domokane/FinancePy: A Python finance library specializing in pricing and managing financial derivatives across fixed-income, equity, FX, and credit markets. romanmichaelpaolucci/Q-Fin: QFin is evolving with enhanced object-oriented principles, deprecating old modules like PDEs/SDEs, introducing 'stochastics' for model calibration and option pricing. avhz/RustQuant: This Rust library for quantitative finance covers diverse modules from autodiff and data handling to instruments pricing and stochastic processes. Responsible AI microsoft/responsible-ai-toolbox: Responsible AI Toolbox offers interfaces and libraries for model and data exploration, enabling developers to monitor and improve AI responsibly. Giskard-AI/giskard: Giskard, an open-source Python library, detects performance, bias, and security issues in AI applications, spanning LLMs to traditional ML models. fairlearn/fairlearn: Fairlearn, a Python package, helps developers assess and mitigate fairness issues in AI systems with algorithms and assessment metrics provided. Azure/PyRIT: PyRIT is an open-access Python tool for generative AI, aiding security professionals and ML engineers in identifying system risks. ModelOriented/DALEX: DALEX enhances model transparency to prevent failure through its explainability tools, supporting understanding and trust in complex AI systems. 
JohnSnowLabs/langtest: LangTest simplifies testing of AI models with over 60 tests in one line, covering robustness, bias, fairness, and accuracy across various NLP frameworks. Explainable AI (XAI) SeldonIO/alibi: Alibi is a Python library focused on machine learning model inspection, offering diverse explanation methods for classification and regression models. Trusted-AI/AIX360: AI Explainability 360 offers an open-source Python toolkit for detailed model interpretability across various data types, supporting diverse explanation methods. dssg/aequitas: Aequitas is an open-source toolkit for bias auditing and Fair ML, aiding data scientists and researchers in assessing and correcting model biases. albermax/innvestigate: iNNvestigate is a Python library providing a unified interface for various methods to analyze neural networks' predictions and understand their internal workings. mindsdb/lightwood: Lightwood is an AutoML framework simplifying machine learning pipelines with JSON-AI syntax, allowing customization and automation across diverse data types. Anomaly Detection SeldonIO/alibi-detect: Alibi Detect is a Python library for detecting outliers, adversarial attacks, and drift in tabular, text, image, and time series data. datamllab/tods: TODS automates outlier detection in multivariate time-series data with modules for data processing, feature analysis, and diverse detection algorithms. pygod-team/pygod: PyGOD is a Python library using PyTorch Geometric for graph outlier detection, offering 10+ algorithms and easy integration with PyOD. Jingkang50/OpenOOD: This repository replicates methods from the Generalized Out-of-Distribution Detection Framework for fair comparison across anomaly, novelty, and out-of-distribution detection methods. yzhao062/pyod: PyOD is a Python library for detecting anomalies in multivariate data, offering diverse algorithms for various project scales and datasets. 
chaos-genius/chaos_genius: Chaos Genius is an open-source ML-powered analytics engine for outlier detection and root cause analysis at scale. Supply Chain Analytics guacsec/guac: GUAC creates a high fidelity graph database for software security, facilitating organizational outcomes like audit, policy, and risk management. owasp-dep-scan/blint: BLint is a Binary Linter using lief to verify executable security and capabilities, now supporting SBOM generation for compatible binaries. samirsaci/picking-route: This repository focuses on improving warehouse productivity through Python-based tools and methodologies, particularly addressing order batching and optimizing picking routes using the Single Picker Routing Problem (SPRP). ragamarkely/scanalytics: Scanalytics automates Supply Chain Analytics & Design tasks in Python, streamlining analyses and reducing manual spreadsheet work for assignments. aitechtools/SunFlow: SunFlow optimizes supply chain design with comprehensive modeling of materials, components, suppliers, manufacturers, and customers, integrating costs, capacities, and constraints. CIOL-SUST/SupplyGraph: This repository introduces a benchmark dataset for applying Graph Neural Networks (GNNs) to supply chain networks, enabling research in optimization and prediction. Network Optimization ray-project/ray: Ray is a scalable framework with a distributed runtime and AI libraries designed to accelerate AI and Python applications. svg/svgo: SVGO optimizes SVG files by removing redundant metadata, comments, and hidden elements to improve file efficiency and rendering performance. zeux/meshoptimizer: meshoptimizer is a C/C++ library optimizing GPU rendering by reducing mesh complexity and storage overhead, compatible with Rust via meshopt crate. cvxpy/cvxpy: CVXPY is a Python-based modeling language designed for convex optimization problems, providing a natural expression format aligned with mathematical conventions. 
guofei9987/scikit-opt: The repository provides Python implementations of various swarm intelligence algorithms such as Genetic Algorithm, Particle Swarm Optimization, and others for optimization tasks. Speech Processing espnet/espnet: ESPnet is a detailed speech processing toolkit using PyTorch, covering recognition, synthesis, translation, enhancement, diarization, and understanding tasks. mozilla/DeepSpeech: DeepSpeech is an open-source Speech-To-Text engine based on Baidu's research, implemented using TensorFlow for accessibility and performance. microsoft/SpeechT5: The repository proposes SpeechT5, adapting T5's text-to-text approach for self-supervised speech and text representation learning using shared encoders and modality-specific nets. sloria/TextBlob: Python library simplifying NLP tasks like POS tagging, sentiment analysis, and classification with a straightforward API for textual data. pytorch/audio: Torchaudio integrates PyTorch with audio processing, emphasizing GPU acceleration, trainable features via autograd, and maintaining a consistent tensor-based style. Graph Data Science neo4j/graph-data-science: The Neo4j Graph Data Science (GDS) library offers graph algorithms, transformations, and ML pipelines, accessible via Cypher within Neo4j. cncf/landscape-graph: This repository explores open source project dynamics, evolution, and collaboration using a Graph Data Model for insightful community analysis. BlueBrain/nexus: Blue Brain Nexus organizes and enhances data with a Knowledge Graph ecosystem, featuring various products, libraries, and tools for comprehensive use. lynxkite/lynxkite: LynxKite is a robust graph data science platform with a user-friendly interface and powerful Python API for large datasets. dgraph-io/dgraph: Dgraph is a scalable GraphQL database optimized for performance, offering ACID transactions and distributed architecture for real-time queries. 
arangodb/arangodb: ArangoDB is a versatile multi-model database supporting documents, graphs, and key-values, empowering high-performance applications with SQL-like queries and JavaScript extensions. ETL/ELT (Extract, Transform, Load / Extract, Load, Transform) redpanda-data/connect: Redpanda Connect is a robust stream processor for seamless data integration, featuring a powerful mapping language and easy deployment options. turbot/steampipe: Steampipe simplifies data access from APIs with CLI, Postgres FDWs, SQLite extensions, export tools, and cloud-based Turbot Pipes. risingwavelabs/risingwave: RisingWave is a cost-efficient streaming database compatible with Postgres, designed for real-time event streaming data processing and analysis. apache/dolphinscheduler: Apache DolphinScheduler is a modern data orchestration platform with low-code workflow creation, robust task management, and cloud-native capabilities. rudderlabs/rudder-server: RudderStack is a privacy-focused, Segment-alternative platform in Golang and React. It simplifies data collection and integrates with warehouses and tools for enriched customer data pipelines. We hope this extensive collection of tools and techniques proves to be a valuable asset in your daily data practice. May it help you achieve smoother workflows and better outcomes! 

Serge Gershkovich
27 Jun 2024
7 min read

Mastering Semi-Structured Data in Snowflake

This article is an excerpt from the book, Data Modeling with Snowflake, by Serge Gershkovich. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.

Introduction

In the era of big data, the ability to efficiently manage and analyze semi-structured data is crucial for businesses. Snowflake, a leading cloud-based data platform, offers robust features to handle semi-structured data formats like JSON, Avro, and Parquet. This article explores the benefits of using the VARIANT data type in Snowflake and provides a hands-on guide to managing semi-structured data.

The Benefits of Semi-Structured Data in Snowflake

Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, where a precise entity structure must be predefined, semi-structured data can adapt to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects.

For example, consider the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. Some contacts may have multiple phone numbers, while others have only one. Some entries might include an email address and street address, while others have just a number and a vague description.

To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational table. Snowflake optimizes how VARIANT data is stored internally, ensuring better compression and faster access. Semi-structured data can sit alongside relational data in the same table, and users can access it using basic extensions to standard SQL, achieving similar performance.

Another compelling reason to use the VARIANT data type is its adaptability to change.
If columns are added or removed from semi-structured data, there is no need to modify ELT (extract, load, and transform) pipelines. The VARIANT data type does not require schema changes, and read operations will not fail for an attribute that no longer exists; they return NULL instead.

Getting Hands-On with Semi-Structured Data

Let's delve into a practical example of working with semi-structured data in Snowflake. This example uses JSON data representing information about pirates, such as details about the crew, weapons, and their ship. All this information is stored in a single VARIANT data type column. In relational data, a row represents a single entity; in semi-structured data, a row can represent an entire file containing multiple entities.

Creating a Table for Semi-Structured Data

Here is a sample SQL script to create a table with a semi-structured data column:

```sql
CREATE TABLE pirates_data (
    id NUMBER AUTOINCREMENT PRIMARY KEY,
    load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    data VARIANT
);
```

In this example, the `AUTOINCREMENT` keyword generates a unique ID for each record inserted, and the `VARIANT` column stores the semi-structured JSON data.

Loading Semi-Structured Data

To load semi-structured data into Snowflake, you can use the `COPY INTO` command. Here’s an example of how to load JSON data from an external stage into the `pirates_data` table. Because the table has more than one column, the statement names the target column and selects the parsed document as `$1`; `id` and `load_time` fall back to their defaults:

```sql
-- Load each JSON document into the VARIANT column only
COPY INTO pirates_data (data)
FROM (SELECT $1 FROM @my_stage/pirates_data.json)
FILE_FORMAT = (TYPE = 'JSON');
```

Querying Semi-Structured Data

Once the data is loaded, you can query it using standard SQL.
For instance, to extract specific attributes from the JSON data, you can use colon notation to reach a top-level key (with dot notation for deeper levels) and `::` to cast the result:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:crew AS crew,
    data:weapons AS weapons
FROM pirates_data;
```

This query extracts the `id`, `crew`, and `weapons` fields from the JSON data stored in the `data` column.

Converting Semi-Structured Data into Relational Data

Although semi-structured data offers flexibility, converting it into a relational format can provide better performance for certain queries. Snowflake allows you to transform VARIANT data into relational columns using the `FLATTEN` table function. Here's an example of how to flatten a JSON array into a relational table:

```sql
SELECT
    value:id::NUMBER AS pirate_id,
    value:name::STRING AS name,
    value:rank::STRING AS rank
FROM pirates_data,
    LATERAL FLATTEN(input => data:crew);
```

This query converts the `crew` array from the JSON data into individual rows in a relational format, making it easier to query and analyze.

Schema-on-Read vs. Schema-on-Write

One of the main advantages of using the VARIANT data type in Snowflake is the flexibility of schema-on-read. This approach allows you to ingest data without a predefined schema, and then define the schema at the time of reading the data.
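Outside Snowflake, the same idea can be pictured in a few lines of Python. The sketch below is only an analogy and not how Snowflake works internally: the sample documents and the helpers `get_path`, `project`, and `flatten_crew` are hypothetical names introduced for illustration. Raw JSON is kept untouched, attributes are picked out when a row is read, a missing attribute comes back as `None` rather than raising an error, and an array is expanded into one row per element in the spirit of `FLATTEN`.

```python
import json

# Hypothetical raw JSON documents, stored untouched like rows of a VARIANT column.
RAW_ROWS = [
    '{"id": 1, "ship": {"name": "Sea Fox", "type": "sloop"}, '
    '"crew": [{"id": 10, "name": "Ann", "rank": "captain"}, '
    '{"id": 11, "name": "Bo", "rank": "gunner"}]}',
    '{"id": 2, "crew": []}',  # this document has no ship attribute at all
]


def get_path(doc, *keys):
    """Follow nested keys; return None (like SQL NULL) if any level is missing."""
    current = doc
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def project(raw):
    """Apply a schema at read time by picking out named attributes."""
    doc = json.loads(raw)
    return {
        "pirate_id": get_path(doc, "id"),
        "ship_name": get_path(doc, "ship", "name"),
        "ship_type": get_path(doc, "ship", "type"),
    }


def flatten_crew(raw):
    """Emit one row per element of the crew array, similar in spirit to FLATTEN."""
    doc = json.loads(raw)
    return [
        {"pirate_id": m.get("id"), "name": m.get("name"), "rank": m.get("rank")}
        for m in get_path(doc, "crew") or []
    ]


if __name__ == "__main__":
    for raw in RAW_ROWS:
        print(project(raw), flatten_crew(raw))
```

The second document lacks a `ship` object, yet `project` still returns a row with empty ship fields. The stored documents never change; the structure is imposed only when a record is read.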
This contrasts with the traditional schema-on-write approach, where the schema must be defined before data ingestion.

Benefits of Schema-on-Read

Flexibility: You can ingest data without worrying about its structure, which is particularly useful for unstructured or semi-structured data sources.
Adaptability: Schema changes do not require re-ingestion of data, as the schema is applied at read time.
Speed: Data can be loaded more quickly, as there is no need to enforce a schema during the ingestion process.

Example: Using Schema-on-Read with VARIANT Data

Here’s an example demonstrating schema-on-read with semi-structured data in Snowflake:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:ship.name::STRING AS ship_name,
    data:ship.type::STRING AS ship_type
FROM pirates_data;
```

In this query, the schema is defined at read time, allowing you to extract specific attributes from the nested JSON data.

Handling Nested and Repeated Data

Snowflake’s support for semi-structured data also extends to handling nested and repeated data structures. The FLATTEN function is particularly useful for working with such data, enabling you to transform nested arrays into a more manageable relational format.

Example: Flattening Nested Data

Consider a JSON structure where each pirate has a nested array of previous voyages. To flatten this nested data, you can use the following query:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    value:date::DATE AS voyage_date,
    value:destination::STRING AS voyage_destination
FROM pirates_data,
    LATERAL FLATTEN(input => data:previous_voyages);
```

This query extracts the nested `previous_voyages` array and converts it into individual rows in a relational format.

Performance Considerations

When working with semi-structured data in Snowflake, it’s important to consider performance implications.
While the VARIANT data type offers flexibility, it can also introduce overhead if not managed properly.

Tips for Optimizing Performance

Use Caching: Take advantage of Snowflake’s caching mechanisms to reduce query times for frequently accessed data.
Optimize Queries: Write efficient SQL queries, avoiding unnecessary complexity and ensuring that only the required data is processed.
Monitor Usage: Regularly monitor your Snowflake usage and performance metrics to identify and address potential bottlenecks.

Conclusion

Handling semi-structured data in Snowflake using the VARIANT data type provides immense flexibility and performance benefits. Whether you are dealing with dynamically changing schemas or integrating semi-structured data with relational data, Snowflake’s capabilities can significantly enhance your data management and analytics workflows. By leveraging the techniques outlined in this article, you can efficiently model and transform semi-structured data, unlocking new insights and value for your organization.

For more detailed guidance and advanced techniques, refer to the book "Data Modeling with Snowflake," which provides comprehensive insights into modern data modeling practices and Snowflake’s powerful features.

Author Bio

Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.

Shankar Narayanan SGS
06 May 2024
10 min read

Prompt Engineering with Azure Prompt Flow

Introduction

The ability to craft relevant and creative prompts is an essential part of working with natural language processing systems, and it matters more as the landscape of artificial intelligence evolves. Microsoft's Azure prompt flow addresses this need by helping data scientists and developers engineer prompts effectively. Here, let us explore the nuances of Azure prompt flow while delving into the realm of prompt engineering.

Significance of Prompt Engineering

Prompt engineering is the practice of constructing prompts that guide machine learning models effectively. It involves formulating contextually relevant and specific questions or statements that elicit the desired responses from artificial intelligence models. Azure prompt flow is a tool from Microsoft Azure that simplifies this intricate process, enabling developers to create prompts that produce meaningful and accurate outcomes.

Getting started with Azure prompt flow

Before exploring the practical applications of Azure prompt flow, it is necessary to understand a few of its essential components. Prompt flow connects to large language models such as GPT-3.5 to generate relevant responses to prompts. With the Azure integration, one can expect a secure and seamless environment for prompt engineering.

Let us consider a practical example of a chatbot application.
```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Set up Azure Text Analytics client
key = "YOUR_AZURE_TEXT_ANALYTICS_KEY"
endpoint = "YOUR_AZURE_TEXT_ANALYTICS_ENDPOINT"
credential = AzureKeyCredential(key)
text_analytics_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

# User input
user_input = "Tell me a joke."

# Build the chatbot prompt from the user input
prompt = f"User: {user_input}\nChatbot:"

# Score the sentiment of the prompt (analyze_sentiment expects a list of documents)
response = text_analytics_client.analyze_sentiment(documents=[prompt])

# Output the sentiment label of the first (and only) document
print(f"Chatbot: {response[0].sentiment}")
```

In this particular example, the user inputs a request and the code wraps it in the form the chatbot expects. Note that the call shown returns a sentiment analysis label for the prompt from Azure Text Analytics; it does not ask a language model for a reply. Here is the output reported for this example:

Chatbot: Positive

Tuning prompts using Azure Promptflow

Crafting good prompts can be a challenging task. With the concept of variants, the user can test the behavior of the model under various conditions.

Example: If the user wants to create a chatbot using Azure Promptflow, this example might help it respond creatively to queries about movies.

Prompt Tuning:

User: "Tell me about your favorite movie."
Chatbot: "Certainly! One of my favorite movies is 'Inception.' Directed by Christopher Nolan, it's a mind-bending sci-fi thriller that explores the depths of the human mind."

Python code:

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Set up Azure Text Analytics client
key = "YOUR_AZURE_TEXT_ANALYTICS_KEY"
endpoint = "YOUR_AZURE_TEXT_ANALYTICS_ENDPOINT"
credential = AzureKeyCredential(key)
text_analytics_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

# User input
user_input = "Tell me about your favorite movie."

# Build a prompt that already contains the creative chatbot reply
prompt = (
    f"User: {user_input}\n"
    "Chatbot: Certainly! One of my favorite movies is 'Inception.' "
    "Directed by Christopher Nolan, it's a mind-bending sci-fi thriller "
    "that explores the depths of the human mind."
)

# Score the sentiment of the prompt (analyze_sentiment expects a list of documents)
response = text_analytics_client.analyze_sentiment(documents=[prompt])

# Output the sentiment label of the first (and only) document
print(f"Chatbot: {response[0].sentiment}")
```

In this example, the prompt is tailored to a specific user query and already contains a creative, contextually relevant reply. The `analyze_sentiment` method of the Azure Text Analytics client is then used to assess the sentiment of the generated prompt. Replace "YOUR_AZURE_TEXT_ANALYTICS_KEY" and "YOUR_AZURE_TEXT_ANALYTICS_ENDPOINT" with your actual Azure Text Analytics API key and endpoint.

Here are a few examples, in the form of a few-shot prompt that classifies a web page from its URL and text content:

URL: https://music.apple.com/us/app/apple-music/id1108187390
Text Content: Apple Music is a comprehensive music streaming app that boasts an extensive library of songs, albums, and playlists. Users can enjoy curated playlists, radio shows, and exclusive content from their favorite artists. Apple Music allows offline downloads and offers a family plan for multiple users. It also integrates with the user's existing music library, making it seamless to access purchased and uploaded music.
OUTPUT: {"category": "App", "evidence": "Both"}

URL: https://www.youtube.com/user/premierleague
Text Content: Premier League Pass, in collaboration with the English Premier League, delivers live football matches, highlights, and exclusive behind-the-scenes content on YouTube. Football aficionados can stay updated with their favorite teams and players through this official channel. Subscribing to Premier League Pass on YouTube ensures fans never miss a moment from the most exciting football league in the world.
OUTPUT: {"category": "Channel", "evidence": "URL"}

URL: https://arxiv.org/abs/2305.06858
Text Content: This research paper explores the realm of image captioning, where advanced algorithms generate descriptive captions for images. The study delves into techniques that combine computer vision and natural language processing to achieve accurate and contextually relevant image captions. The paper discusses various models, evaluates their performance, and presents findings that contribute to the field of image captioning technology.
OUTPUT: {"category": "Academic", "evidence": "Text content"}

URL: https://exampleconstructionsite.com/
Text Content: This website is currently under construction. Please check back later for updates and exciting content.
OUTPUT: {"category": "None", "evidence": "None"}

For a given URL: {{url}}, and text content: {{text_content}}.
Classified Category: Travel
Evidence: The text contains information about popular tourist destinations, travel itineraries, and hotel recommendations.
OUTPUT:

After summarizing, here is the final Promptflow with 2 variants for the summarize_text_content node.

Benefits of using Azure ML prompt flow

Among its wider range of benefits, Azure ML prompt flow helps users move from ideation to experimentation and, ultimately, to production-ready LLM-based applications.

Prompt engineering agility

Azure prompt flow offers a visual representation of the flow's structure. It allows users to understand and navigate their projects while offering a notebook-like coding experience for debugging and efficient flow development. At the same time, users can create and compare more than one prompt variant, which facilitates an iterative refinement process.

Enterprise readiness

Prompt flow streamlines the entire prompt engineering process and builds on robust enterprise readiness solutions. It thus offers a secure, reliable, and scalable foundation for experimentation and development. Besides, it supports team collaboration, where multiple users can work together, share knowledge, and maintain version control.
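The few-shot classification prompt and the idea of comparing prompt variants, both described above, can be sketched in plain Python. This is a minimal illustration under stated assumptions and not the prompt flow SDK: the `render` helper is a small stand-in for filling the `{{url}}` and `{{text_content}}` placeholders, the two entries in `VARIANTS` are hypothetical wordings, and `build_prompt` and `parse_output` are names introduced only for this example. No model is called; the sketch only shows how a variant is assembled and how a reply in the article's JSON format could be validated.

```python
import json
import re

# Few-shot examples taken from the article's prompt (text content shortened).
EXAMPLES = [
    {
        "url": "https://arxiv.org/abs/2305.06858",
        "text_content": "This research paper explores the realm of image captioning.",
        "output": {"category": "Academic", "evidence": "Text content"},
    },
    {
        "url": "https://exampleconstructionsite.com/",
        "text_content": "This website is currently under construction.",
        "output": {"category": "None", "evidence": "None"},
    },
]

# Two hypothetical variants of the final instruction, to be compared side by side.
VARIANTS = {
    "variant_0": "For a given URL: {{url}}, and text content: {{text_content}}.\nOUTPUT:",
    "variant_1": (
        "Classify the page at {{url}} using its text: {{text_content}}.\n"
        "Reply with JSON only.\nOUTPUT:"
    ),
}


def render(template, **values):
    """Replace {{name}} placeholders; a small stand-in for template rendering."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(values[m.group(1)]), template)


def build_prompt(variant, url, text_content):
    """Prepend the few-shot examples to the chosen variant's instruction."""
    shots = "\n".join(
        f"URL: {ex['url']}\nText Content: {ex['text_content']}\n"
        f"OUTPUT: {json.dumps(ex['output'])}"
        for ex in EXAMPLES
    )
    return shots + "\n" + render(VARIANTS[variant], url=url, text_content=text_content)


def parse_output(raw):
    """Return the reply as a dict, or None if it is not the expected JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and {"category", "evidence"} <= data.keys():
        return data
    return None


if __name__ == "__main__":
    for name in VARIANTS:
        print(f"--- {name} ---")
        print(build_prompt(name, "https://example.com", "Sample page text."))
```

Keeping the examples fixed and changing only the final instruction means that any difference in results between two variants can be attributed to the wording being tested.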
Application development

The well-defined process of Azure prompt flow facilitates the development of AI applications. By leveraging it, the user can progress through the consecutive stages of developing, testing, tuning, and deploying flows, resulting in a fully fledged AI application. This methodical and structured approach lets users develop, fine-tune, and test rigorously, and then deploy with confidence.

Real-world applications of Azure Promptflow

Content creation

One application of Azure prompt flow lies in content creation. Content creators can generate outlines and creative ideas by engineering prompts tailored to specific topics. One can even generate entire paragraphs this way. This helps streamline the content creation process and makes it more efficient.

Language Translation

Developers are leveraging Azure prompt flow to build language translation applications on large language models. By constructing prompts in the source language, one can have the system translate the input into the desired language. This can help break down language barriers in a globalized world.

Customer support chatbots

By integrating Azure prompt flow into customer support chatbots, one can enhance the user experience. Prompt engineering techniques help ensure that queries are accurately understood, which leads to relevant and precise responses. This can reduce response time while improving customer satisfaction.

Azure prompt flow simplifies prompt engineering

Prompt engineering is an iterative and challenging process. With the help of Azure prompt flow, one can simplify the development, comparison, and evaluation of prompts.
The process makes it easier for the user to find the best prompt for a given use case. Besides, developing a chatbot that utilizes large language models, including GPT-3.5, can help companies provide personalized product recommendations based on customer input. Here, Azure prompt flow allows users to create, evaluate, and even deploy flows built on machine learning models. It speeds up the whole process of developing and deploying artificial intelligence solutions. At the same time, it also allows the user to create connections to large language models. Supported connections include Azure OpenAI, which provides models such as GPT-3.5. Users can use these models for different purposes, including chat completion or creating embeddings.

Designing and modifying prompts

Designing and modifying prompts for effective use is crucial, especially with large language models. Azure prompt flow enables users to create, test, and deploy various prompt versions for recommendation purposes. To use a large language model effectively, especially when dealing with multiple prompts, it is imperative to design and modify them for better results. Once you have created the prompts, it is time to evaluate and test them in multiple scenarios. For instance, if you are creating prompts for a product company, the prompts should spell out how the flow handles user queries. The design should also account for any custom code and for deploying the end-to-end solution with Azure's prompt flow feature.

Conclusion

With powerful prompt engineering capabilities, Azure prompt flow enables developers to construct contextually relevant prompts. It enhances the efficiency and accuracy of AI applications across various domains. The potential of prompt engineering makes the future of AI development promising, with Azure AI helping to lead the way.
Author Bio

Shankar Narayanan (aka Shanky) has worked on numerous cloud and emerging technologies like Azure, AWS, Google Cloud, IoT, Industry 4.0, and DevOps, to name a few. He has led the architecture design and implementation for many enterprise customers and helped them break the barrier and take the first step towards a long and successful cloud journey. He was one of the early adopters of Microsoft Azure and Snowflake Data Cloud. Shanky likes to contribute back to the community. He contributes to open source, is a frequently sought-after speaker, and has delivered numerous talks on Microsoft technologies and Snowflake. He is recognized as a Data Superhero by Snowflake and as an SAP Community Topic Leader by SAP.

Merlyn Shelley
29 Apr 2024
12 min read

Fabric’s Code-First AutoML and Hyperparameter Tuning, Google Cloud Cortex Framework, Snowflake’s Data Metric Functions, Qlik's AI Accelerator

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!

👋 Hello,

Welcome to BI-Pro #54: Your Premier Destination for Data and Business Intelligence Insights! 🌟

In this edition, we dive deep into the cutting-edge solutions of business intelligence, data modeling, and advanced analytics. Prepare to explore an array of transformative topics and industry insights that will redefine how you interact with technology and data.

🧩 Highlights of This Issue:
Python Practice Platforms: The top 7 platforms where you can sharpen your Python skills.
Innovative Experiments: Dive into hands-on experiments with MLFlow and Microsoft Fabric to enhance your project’s efficiency.
SAP Expertise: Master the complex data models of SAP and leverage them for optimal performance.
AI-Powered Business Management: Learn how to integrate AI to streamline and enhance business management functions.
Snowflake’s Surveillance: Monitor your data pipelines effectively using Snowflake’s Data Metric Functions.

🧬 Stay Informed with Industry Highlights:
Power BI: Learn about the significant deprecation of AutoML in Power BI using Dataflows V1.
Microsoft Fabric: Get the scoop on the new code-first AutoML and hyperparameter tuning, now available in public preview.
AWS BI: Discover how to build SAP Golden AMIs with EC2 Image Builder and Ansible and explore the transformative impact of Amazon Q on business experiences.
Google Cloud Data: Catch up with the latest updates from the Google Cloud Cortex Framework.
Tableau: Uncover how Einstein Copilot for Tableau is building the next generation of AI-driven analytics.
From the Experts at Packt Community: Gain insights from industry leaders on the fundamentals of Analytics Engineering.

🧮 What’s the Latest from the BI Community?
Explore real-time AI capabilities with Datorios’ new observability tool.
Learn about Snowflake's launch of Arctic, an enterprise-grade LLM.
Discover how Qlik's AI Accelerator is integrating generative AI to deliver customer outcomes.
Witness the future of AI with Avant Technologies’ new supercomputing advancements.

Join us as we unpack these topics to keep you at the forefront of the data and BI world. Stay curious, stay informed!

📥 Feedback on the Weekly Edition
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."

📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.

We appreciate your input and hope you enjoy the book! Share your thoughts and opinions here!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

Packt BI-Pro is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

🚀 GitHub's Most Sought-After Repos
🧩 pixiedust/pixiedust: PixieDust is an open-source library enhancing Jupyter notebooks, improving data work experience, particularly for cloud-hosted notebooks without configuration access.
🧩 plotly/plotly.py: plotly.py is an interactive, open-source graphing library for Python, offering over 30 chart types, including scientific, 3D, statistical, and financial charts.
🧩 AykutSarac/jsoncrack.com: JSON Crack is a free, open-source data visualization app for JSON, YAML, XML, CSV, etc., offering interactive graphs for easy data exploration and analysis.
🧩 apexcharts/apexcharts.js: ApexCharts is a JavaScript charting library with a simple API, 100+ samples, and over a dozen chart types for beautiful, responsive visualizations in apps and dashboards.
🧩 antvis/G2: G2 is a visualization library inspired by "The Grammar of Graphics," offering an introduction, examples, tutorials, and API reference for learning and using its core concepts. 🧩 visgl/deck.gl: deck.gl simplifies high-performance, WebGL2/WebGPU-based visualization of large datasets. It offers pre-built layers for easy setup or customizable architecture for tailored needs. Email Forwarded? Join BI-Pro Here!🔮 Revolutionizing Analytics: New BI Tools🧬 7 Best Platforms to Practice Python: The article lists seven platforms—Practice Python, Edabit, CodeWars, Exercism, PYnative, LeetCode, and HackerRank—that offer various levels of programming challenges for learning and practicing Python, particularly for coding interviews and skill improvement. 🧬 Experimenting with MLFlow and Microsoft Fabric: The blog discusses the importance of systematic experimentation in machine learning (ML) to improve model performance, highlighting the use of MLFlow within Fabric for managing ML experiments. It covers setting up experiments, running them, logging results, and analyzing them, emphasizing the importance of tracking configurations and outcomes for iterative improvement in ML models. 🧬 Mastering SAP’s data models: The article discusses challenges faced in understanding SAP data models for analytics, focusing on integrating procurement data. It explains SAP's ERP software, data architecture basics, table types (master vs. transaction), and data mapping for procurement tables. 🧬 Building an AI-Powered Business Manager: The post explores the concept of consolidating business management into a single, chat-based platform powered by Large Language Models (LLMs). It discusses the advantages for small businesses, outlines project structure, sets up the database, and updates the Tool class to handle SQLModel instances. 
🧬 Monitor Data Pipelines Using Snowflake’s Data Metric Functions: The author emphasizes the importance of data quality in gaining trust with stakeholders and focuses on using Google's Site Reliability Engineering principles to measure the health of data systems. It discusses defining service level indicators and objectives for data quality dimensions and provides a technical implementation example in Snowflake. ⚡Stay Informed with Industry HighlightsPower BI🧮 Deprecation of AutoML in Power BI using Dataflows V1: The update announces the deprecation of Power BI Automated Machine Learning (AutoML) models for Dataflows V1 in all regions as of the third week of April. Customers are encouraged to migrate to the AutoML solution based on Synapse Data Science in Microsoft Fabric, offering a more customizable AutoML experience with advanced tools and features. Microsoft Fabric🧮 Introducing Code-First AutoML and Hyperparameter Tuning: Now in Public Preview for Fabric Data Science: The update introduces code-first automated machine learning (AutoML) and hyperparameter tuning in Public Preview for Fabric Data Science. Users can access both AutoML and Tune capabilities seamlessly within the Fabric 1.2 runtime, enhancing machine learning model optimization and accessibility. 🧮 Fabric Change the Game: Embracing Azure Cosmos DB for NoSQL. The post explores setting up Azure Cosmos DB for NoSQL and leveraging Vector Search capabilities of AI Search Services through Microsoft Fabric's Lakehouse features. It also discusses integrating Cosmos DB Mirror and using Python coding facilitated through Lakehouse, highlighting Fabric's integration capabilities for search or data mirroring. 
🧮 Microsoft Fabric April 2024 Update: The April 2024 update brings various enhancements and previews to Microsoft Fabric, including new visuals like the 100% Stacked Area Chart, improvements to reporting, data connectivity, administration features, analytics, real-time analytics, data factory, and data pipelines. Additionally, the update includes the availability of Exam DP-600 for Fabric Analytics Engineer certification and free learning sessions. AWS BI  🧮 Build SAP Golden AMIs with EC2 Image Builder and Ansible: This blog post guides users on building a reusable Amazon Machine Image (AMI) for deploying Amazon Elastic Compute Cloud (EC2) instances for SAP installations. It covers using Terraform and Ansible to automate the process and provides sample code. 🧮 Transforming Business Experiences: The Impact of Amazon Q and Generative BI for AWS Partners. This post highlights how advances in AI, particularly Amazon Q and generative BI, are transforming business operations. It showcases how AWS partners like ZS Associates, Tiger Analytics, and Compass UOL are leveraging these innovations for industry-specific solutions. Google Cloud Data 🧮 What’s new with Google Cloud Cortex Framework? The article discusses Google Cloud Cortex Framework, emphasizing its role in unifying enterprise data for AI-driven insights. It highlights new solutions for marketing, sustainability management, and finance, showcasing how Cortex Framework accelerates innovation, enhances decision-making, and drives business efficiency in the AI era. Tableau🧮 Einstein Copilot for Tableau: Building the Next Generation of AI-Driven Analytics. The post delves into the development of Einstein Copilot for Tableau, an AI-driven tool revolutionizing data analysis. It highlights the challenges and solutions in building its infrastructure, improving accuracy and efficiency, and enhancing AI and core capabilities through collaboration and continuous improvement. 
✨ Expert Insights from Packt Community
Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more

The role of dbt in analytics engineering
dbt emerged as a solution to the challenges relating to data transformation faced in data analysis. Initially crafted as an open-source Python package, dbt aimed to bring software engineering best practices to the world of analytics. Over time, dbt matured beyond just a package, becoming a versatile cloud service. While the open-source package remains available and actively supported, dbt now offers a cloud-based version, packed with features such as an integrated development environment (IDE), scheduling tools, data lineage trackers, and hosted documentation. This is especially valuable for analysts who might not have a deep software engineering background. For more information on dbt’s history, read https://www.getdbt.com/blog/what-exactly-is-dbt.

We will use dbt Cloud, which offers a free tier for a single developer: that’s you! You can learn more about its pricing here: https://www.getdbt.com/pricing.

dbt seamlessly integrates into the ELT architecture. It does not store or process data but serves as a bridge between analysts and the data warehouse.

Figure: dbt’s position in a data stack as an intermediary in the transformation layer

This is how it works: analysts draft SQL queries, enhanced with dbt’s unique capabilities. dbt then translates this specialized SQL into the native SQL of the data warehouse and dispatches it for execution. All the transformed data and results remain within the data warehouse, making dbt a lightweight yet powerful tool in the analytics toolkit. Because of dbt’s pivotal position in analytics engineering, we will spend more time discussing its features and zooming in on best practices. First, we will set up dbt for our use case.

Setting up dbt Cloud
The following steps are required for dbt:
1. Creating a dbt Cloud account.
2. Setting up a connection from dbt Cloud to BigQuery.
3. Testing the connection by querying the data using dbt Cloud.
Follow the step-by-step instructions here: https://github.com/PacktPublishing/Fundamentals-of-Analytics-Engineering/blob/main/chapter_8/guides/setting_up_dbt_cloud.md.
Now, let’s focus on the various data layers in dbt.

Data layers in dbt
It is a widespread practice to separate the data we use for analytics into layers. This helps data practitioners communicate the distinct parts of the data transformation process. Broadly speaking, the process falls into three layers in dbt: raw, preparation, and business. Let’s take a closer look:
Raw layer: The source data is stored in the form it arrives in. Whenever you receive data, it should be stored as-is so that you have a backup in case something goes wrong during the transformations. When you copied the Excel sheets using Airbyte, they became part of the raw layer inside BigQuery.
Preparation layer: In the second layer, the raw data is cleaned, deduplicated, and transformed to conform to naming conventions and other rules. For our data, this could mean renaming fields for readability and standardizing sales figures from cents to euros.
Business layer: In the final layer, business rules are applied to the prepared data, and different data is joined and modeled into datasets that are ready for consumption by BI tools and stakeholders. In our case, we might add a business rule to disregard negative sales amounts when summing the total stroopwafels sold, as these are likely an error. The resulting data can then be served to the BI tool for dashboarding.

Discover more insights from Fundamentals of Analytics Engineering - By Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic and 4 more.

💡 What's the Latest Scoop from the BI Community?
🧠 Datorios unleashes real-time AI with the first observability tool for streaming data: Datorios introduces the first observability tool for Apache Flink, offering deep insights into streaming data processing. It enables faster AI innovation and thorough auditability, providing developers with event visualization, event search, state monitoring, window analysis, and more. Datorios is now publicly available for free. 🧠 Snowflake Launches Arctic: The Most Open, Enterprise-Grade Large Language Model: Snowflake introduces Snowflake Arctic, an open, enterprise-grade large language model (LLM) with a Mixture-of-Experts architecture, optimized for complex enterprise workloads. Arctic sets new openness standards for AI technology, offering weights under an Apache 2.0 license and enhancing AI innovation. 🧠 Introducing Qlik's AI Accelerator - Delivering Tangible Customer Outcomes in Generative AI Integration: Qlik is at the forefront of integrating generative AI, particularly Large Language Models (LLMs), into data analysis and decision-making. They address key challenges like data privacy, technical complexity, and cost, offering seamless integration of popular LLMs and an AI Accelerator program to quickly prove the benefits of AI integration with minimal barriers to entry. 🧠 Avant Technologies Launches Advanced AI Supercomputing: Avant Technologies, an AI company, introduces a supercomputing network and licensable dataset with Wired4Tech, aiming to accelerate AI adoption. The offerings include a versatile AI dataset, dynamic resource scaling, accelerated AI processing, robust security measures, and seamless integration, designed to empower developers and drive innovation across industries. See you next time!
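Returning to the dbt data layers described in the Fundamentals of Analytics Engineering excerpt above, the three layers can be illustrated with a small, self-contained sketch. It uses plain Python rather than dbt or SQL, and every name in it (the sample rows, the field names, `prepare`, `total_sales_eur`) is hypothetical. It only mirrors the examples given in the text: deduplicating, renaming fields, converting cents to euros, and ignoring negative sales amounts.

```python
# Raw layer: hypothetical rows stored exactly as they arrived.
# The field names and values are invented for illustration.
RAW_SALES = [
    {"ORDER_ID": 1, "AMT_CENTS": 250},
    {"ORDER_ID": 1, "AMT_CENTS": 250},   # exact duplicate
    {"ORDER_ID": 2, "AMT_CENTS": -100},  # negative amount, likely an error
    {"ORDER_ID": 3, "AMT_CENTS": 475},
]


def prepare(raw_rows):
    """Preparation layer: deduplicate, rename fields, convert cents to euros."""
    seen = set()
    prepared = []
    for row in raw_rows:
        key = (row["ORDER_ID"], row["AMT_CENTS"])
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        prepared.append(
            {"order_id": row["ORDER_ID"], "amount_eur": row["AMT_CENTS"] / 100}
        )
    return prepared


def total_sales_eur(prepared_rows):
    """Business layer: sum sales, disregarding negative amounts."""
    return sum(r["amount_eur"] for r in prepared_rows if r["amount_eur"] >= 0)


if __name__ == "__main__":
    prepared = prepare(RAW_SALES)
    print(prepared)
    print(total_sales_eur(prepared))
```

Note that the raw rows are never modified: each layer reads from the one before it, which is what keeps the raw data usable as a backup.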
Jakov Semenski
25 Apr 2024
6 min read

ChatGPT for Coding

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction
ChatGPT's coding style is terrible: verbose, complex, and outdated. Let's change that.
ChatGPT promised to be our coding savior, but sometimes it feels more like a blast from the past. Remember those early 2000s coding books? Yep, it's giving those vibes.
It's like having a sports car with a tractor engine. Great potential, but the performance? Not quite there.
Imagine harnessing the power of ChatGPT but with the finesse of a master coder. Ready for the upgrade?
Here are 12 pro prompts that will get you the right results.

Tip #1: Specificity is king
As soon as you ask ChatGPT for a coding snippet, by default you will get the most basic HelloWorld example. The vaguer your prompt is, the more mediocre your results will be. Instead, specify exactly the language, version, and framework.

    Write backend code for Library app that uses Rest to communicate
    Cover endpoints for adding, removing, and filtering books by category and date published
    Use Java latest version.
    Use lambda streams instead of for loops
    Use Spring framework

Tip #2: Avoid code vomit
ChatGPT loves to write a lot of code, or as I like to call it, “code vomit”. We are no longer rewarded for the amount of code we produce, but for the clarity and principles we follow. Give ChatGPT instructions to write clean code, use the latest principles, and cover logging and exception handling.

    Write clean code
    Code needs to be covered with Logging and proper exception handling
    Use principles: Kiss & DRY, SOLID
    Keep in mind to use design patterns where it is applicable
    Using coding instructions I gave you, give me code for each class

Tip #3: Make it easy to use with IDE
Every time ChatGPT writes code, you get explanations, import statements, and comments. This can be good for a beginner but is not something we need in our IDE. Our IDE is already good at importing all the right packages, so let ChatGPT know.

    When writing code, avoid detailed explanations, just simple bullet points
    Don't add import statements, as IDE will do this instead

Tip #4: Write tests
Your code is not complete until you are done with tests. But not just any tests. We want unit and integration tests that:
are in a readable format (given/when/then)
cover the happy and unhappy paths
use the latest testing libraries such as AssertJ and BDDMockito

    For each class write a unit and integration tests
    Use given when then format
    For libraries use BDDMockito and AssertJ
    Cover happy and unhappy paths

Tip #5: Give REST call request examples
What good is an app if we cannot test it with some examples? Instead of creating them manually, ask ChatGPT to create curl examples we can easily copy to Postman.

    For each request, generate curl examples

Now go ahead and use your terminal or copy/paste them to Postman.

Tip #6: Create documentation
We don’t want just plain text. Instead, we need a quick start guide for developers.

    Write a quick start guide for developers using markdown.
    Imagine this app has been published to github repository
    Cover
    - Introduction
    - how to install app
    - how to run it
    - how to use it

Tip #7: Prepare deployment script for Cloud
This app cannot live just in your local environment. Instead, we need a deployment script. Depending on where you want to deploy your changes, it might be:
a Kubernetes cluster script
Google-specific Terraform scripts
an AWS CloudFormation script
an Azure-specific deployment script
Or ask ChatGPT to suggest a deployment script.

    Provide me deployment script for one of most popular cloud providers

Tip #8: Version Control
For now, our code lives only locally. Let’s ask ChatGPT to give us instructions on how to set up version control.

    Provide Github version control setup instructions

Tip #9: Define CI/CD pipeline
CI/CD, or continuous integration and continuous deployment, is a must-have step for any serious development. There are plenty of options to choose from, such as:
Jenkins
GitHub Actions
Bamboo
With CI, we guarantee that we can:
safely merge our changes by running the build and tests
check whether our code changes comply with Sonar policies
With CD, we guarantee that we can safely deploy our changes.

    Provide github actions that for each open pull request we run the build and run all the tests
    Also automatically include sonarqube scans
    Also create github action to run deployment on every code merge

Tip #10: Performance optimization
Our backend REST service is now running, but the questions we need to ask ourselves are:
how fast is it?
how many requests can it handle?
what is the maximum limit of requests?
For that, we need to execute performance tests, e.g. using JMeter or Gatling.

    We need to test what is the limit of our app.
    Write a load test script for gatling that tests how many book searches we can execute

Tip #11: Run a security audit
How can we ensure our app is secure and not open to any threats? The best way is to run security scans.

    Our application might be open for security threats.
    Which security scan tools we can use for free and how can we use them.
    Give me step-by-step instruction on how to use it.

Tip #12: Optimize for observability
You have your app running somewhere in the cloud. But did you optimize it for observability? How can you easily troubleshoot issues? How can you trace requests between different services? Did you set up monitoring?

    We want to make sure our application is optimized for observability
    Create guideline and configuration for the cloud environment for
    Traceability - tracing request from start to finish
    Monitoring - monitoring key performance metrics
    Logging - have a centralized logging system

Conclusion
You can find the full prompt here: https://chat.openai.com/share/f0bef1ca-062d-4a22-96aa-9711615329a5
ChatGPT is a tool, and like any tool, it shines when used the right way. With these prompts, you get a coding assistant that keeps up with the latest trends, ensuring your code is not just functional but also follows modern standards.

Author Bio
Jakov Semenski is an IT Architect working at IBMiX with almost 20 years of experience. He is also a ChatGPT Speaker at the WeAreDevelopers conference and shares valuable tech stories on LinkedIn.
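One way to apply the tips above consistently is to assemble the prompt from explicit parts instead of typing it freehand each time. The following is a minimal Python sketch and not something from the article: `build_prompt` and its parameters are hypothetical names, and the resulting text is meant to be pasted into ChatGPT by hand (no API call is made).

```python
def build_prompt(task, language, framework, rules, extras=()):
    """Assemble a specific, structured coding prompt from its parts.

    All parameter names are illustrative; blank entries are dropped.
    """
    lines = [task, f"Use {language}", f"Use {framework}"]
    lines += list(rules)
    lines += list(extras)
    # One instruction per line keeps the prompt easy to review and edit.
    return "\n".join(line.strip() for line in lines if line.strip())


if __name__ == "__main__":
    prompt = build_prompt(
        task="Write backend code for Library app that uses Rest to communicate",
        language="Java latest version",
        framework="Spring framework",
        rules=[
            "Write clean code",
            "Code needs to be covered with Logging and proper exception handling",
            "Use principles: Kiss & DRY, SOLID",
        ],
        extras=["Don't add import statements, as IDE will do this instead"],
    )
    print(prompt)
```

Keeping the language, framework, and rules as separate arguments makes it harder to forget one of them, which is the failure mode Tip #1 warns about.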

Dan MacLean
23 Apr 2024
5 min read

Writing unit tests for R functions with ChatGPT

This article is an excerpt from the book, R Bioinformatics Cookbook - Second Edition, by Dan MacLean. Discover over 80 recipes for modeling and handling real-life biological data using modern libraries from the R ecosystem.

Introduction
ChatGPT is beneficial for generating R unit tests because of its knowledge of programming patterns and best practices. It can understand code snippets, including functions and their expected behaviors, and generate relevant unit tests. By leveraging its training on diverse programming contexts, ChatGPT can assist in creating comprehensive test cases that cover different scenarios and edge cases. It helps ensure code correctness, identify potential bugs, and improve code reliability. ChatGPT’s ability to generate unit tests saves developers time and effort, promotes software quality, and contributes to building robust and well-tested R code. In this recipe, we’ll learn how to get ChatGPT to do the tedious work of generating test cases for code that we are working on.

Getting ready
We just need a ChatGPT account: https://chat.openai.com/.

How to do it…
To write unit tests for an R function with ChatGPT, do the following:
1. Paste the following request into ChatGPT:

    write unit tests for the following R code
    factorise_cols <- function(df, col_list ){
      not_all_na <- function(x) any(!is.na(x))
      for ( i in col_list){
        df <- df %>% dplyr::mutate(!!(i) := as.factor(!!(i)) )
      }
      df %>% dplyr::select_if(not_all_na)
    }

2. Read and check the output:

    library(dplyr)
    library(testthat)

    # Create a test data frame
    test_df <- data.frame(
      col1 = c("A", "B", "C", NA),
      col2 = c(1, 2, NA, 4),
      col3 = c("X", "Y", "Z", NA)
    )

    # Define the expected output for the test data frame
    expected_df <- data.frame(
      col1 = factor(c("A", "B", "C", NA)),
      col3 = factor(c("X", "Y", "Z", NA))
    )

    # Start defining the unit tests
    test_that("factorise_cols correctly factorises selected columns", {
      # Call the factorise_cols function
      result_df <- factorise_cols(test_df, c("col1", "col3"))

      # Check if the selected columns are factorised
      expect_equal(result_df$col1, expected_df$col1)
      expect_equal(result_df$col3, expected_df$col3)

      # Check if other columns are not modified
      expect_equal(result_df$col2, test_df$col2)
    })

    test_that("factorise_cols returns a data frame with non-NA columns", {
      # Call the factorise_cols function
      result_df <- factorise_cols(test_df, c("col1", "col3"))

      # Check if the returned data frame only contains non-NA columns
      expect_true(all(!is.na(result_df$col1)))
      expect_true(all(!is.na(result_df$col3)))
      expect_true(all(is.na(result_df$col2)))
    })

And with that, we should have an output that can test some of the facets of our code at least and a good start to a more dependable and workable code base.

How it works…
The recipe here takes advantage of ChatGPT’s representations of the structure of the R programming language in order to write code that will effectively test some example code. In step 1, we simply define the function we wish to test and ask for tests.
In step 2, we see the output that ChatGPT generated in this instance. It has given us a pretty good set of unit tests. As with everything to do with ChatGPT, there isn’t a guarantee that they are correct, but we can read and verify them very easily – certainly in much less time than it would take to write them. One thing to note is that, in this case at least, ChatGPT hasn’t generated tests for the case with only NA in a column, which we may decide we need. It is true that this isn’t clear in the initial code, so generating the test has given us a new thought on the safe running of this function.

Conclusion
In conclusion, leveraging ChatGPT for unit testing R functions offers a transformative approach. Its adept understanding of programming nuances simplifies the arduous task of generating comprehensive tests, fostering code reliability and quality assurance. By effortlessly crafting diverse test cases, ChatGPT significantly reduces developers' workload, ensuring code correctness, identifying potential bugs, and fortifying the codebase against edge cases. While it doesn't guarantee absolute correctness, its output provides a solid foundation for enhancing code robustness. Embracing ChatGPT's capabilities not only saves time and effort but also contributes profoundly to building more dependable and well-tested R code, elevating the development process to new levels of efficiency and reliability.

Author Bio
Professor Dan MacLean has a Ph.D. in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now Head of Bioinformatics at the world-leading Sainsbury Laboratory in Norwich, UK where he works on bioinformatics, genomics, and machine learning. He teaches undergraduates, post-graduates, and post-doctoral students in data science and computational biology. His research group has developed numerous new methods and software in R, Python, and other languages with over 100,000 downloads combined.
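The gap noted in the recipe above, a missing test for a column that contains only NA, is easy to state as a test of its own. The article's code is R; purely as an illustration of that test case, the sketch below restates the idea in Python using only the standard library. `drop_all_missing` is a hypothetical stand-in for the column-dropping step of `factorise_cols`, with a table modelled as a dict of lists and `None` standing in for NA. It is not a translation of the dplyr code.

```python
import unittest


def drop_all_missing(table):
    """Keep only columns that have at least one non-missing (non-None) value."""
    return {
        name: values
        for name, values in table.items()
        if any(v is not None for v in values)
    }


class DropAllMissingTest(unittest.TestCase):
    def test_all_missing_column_is_dropped(self):
        # col2 holds only missing values, so it should disappear.
        table = {"col1": ["A", None], "col2": [None, None]}
        self.assertEqual(drop_all_missing(table), {"col1": ["A", None]})

    def test_partially_missing_column_is_kept(self):
        # A column with at least one real value must survive unchanged.
        table = {"col1": ["A", None]}
        self.assertEqual(drop_all_missing(table), table)


if __name__ == "__main__":
    unittest.main()
```

The same two cases, one all-missing column and one partially missing column, are the ones worth adding to the generated testthat suite in R.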

Merlyn Shelley
19 Apr 2024
14 min read

Gemini 1.0 Pro Vision in BigQuery, Python UI Library, Feature Engineering with Fabric and PySpark, Power analytics with Redshift, Amazon RDS for MySQL

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!Get the first look at Sigma's new features and functionality at our virtual product launch on May 2nd at 12pm ET/9am PT.The virtual event will showcase talks and demos from Sigma's CEO, co-founders, and product managers about what's next in the future of analytics.Don't miss out. See how Sigma is reinventing BI.👋 Hello,Welcome to BI-Pro #52: Your Premier Destination for Data and BI Insights! 🌟 In This Edition: 🔮 Data Viz with Python Libraries Exploring causality with Python. Meet NiceGUI: Your Soon-to-be Favorite Python UI Library. Feature Engineering with Microsoft Fabric and PySpark. 10 GitHub Repositories to Master Python. 🔌 Power BI On-premises data gateway April 2024 release. Copilot in Power BI expansion. 🛠️ Microsoft Fabric Introducing Optimistic Job Admission for Fabric Spark. Introducing Job Queueing for Notebook in Microsoft Fabric. ☁️ AWS BI Meet Amazon QuickSight expert Sanjeeb Mohapatra. Handle tables without primary keys for Amazon Aurora MySQL and Amazon RDS for MySQL. Power analytics with Amazon Redshift. 🌐 Google Cloud Data Gemini 1.0 Pro Vision in BigQuery. BigQuery data canvas. Gemini in Looker AI-powered BI. Memorystore for Redis Cluster updates. Firestore launch updates. 📊Tableau Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever. ✨ Expert Insights from Packt Community ChatGPT for Cybersecurity Cookbook by Clint Bodungen. 💡 What's the Latest Scoop from the BI Community? Geospatial Data Analysis with Geemap. Microsoft Fabric Table Maintenance - Checkpoint and Statistics. Identifying Customer Buying Pattern in Power BI - Part 1. Full vs. Incremental Loads – Data Engineering with Fabric. Joining Queries in Azure Data Factory on Cosmos DB Sources. Feature Engineering with Microsoft Fabric and Dataflow Gen2. 
Stay ahead in the ever-evolving landscape of business intelligence with BI-Pro. Unleash the full potential of your data today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos 🐾 altair - Vega-Altair is a Python library for statistical visualization, offering simplicity, friendliness, and consistency for creating beautiful and effective visualizations. 🐾 bokeh - Bokeh is a Python library for creating interactive plots and data applications in web browsers, offering elegant and versatile graphics. 🐾 bqplot - bqplot is a 2-D visualization system for Jupyter, based on the Grammar of Graphics, enabling interactive plots with other Jupyter widgets. 🐾 cartopy - Cartopy simplifies map drawing in Python, offering easy projection definitions, point transformations, and integration with Matplotlib for advanced mapping. 🐾 diagrams - Diagrams simplifies cloud system architecture design in Python, supporting major providers and frameworks, allowing prototyping and visualization of existing architectures. Email Forwarded? Join BI-Pro Here!🔮 Data Viz with Python Libraries   🐍 Exploring causality with Python. Difference-in-differences: The series dives into causal inference, crucial in modern analytics, explaining tools like difference-in-differences. It explores how events impact outcomes, using examples such as minimum wage effects on employment. 
The setup involves treatment and control groups to establish cause-and-effect relationships in diverse real-world scenarios. 🐍 Meet the NiceGUI: Your Soon-to-be Favorite Python UI Library. NiceGUI is a Python UI framework for web and desktop apps, offering a simple interface for small projects, dashboards, and robotics. It simplifies state management and interaction, boasting features like easy layout, visualization tools, and integration with popular libraries. 🐍 Feature Engineering with Microsoft Fabric and PySpark: The post delves into feature engineering in Microsoft Fabric, emphasizing its importance in ML development. It explores PySpark's role in handling large datasets and provides a basic overview and example of using PySpark for feature engineering. 🐍 10 GitHub Repositories to Master Python: The blog explores 10 essential GitHub repositories for mastering Python, emphasizing hands-on experience and real-world projects to enhance skills. It covers a range of topics, from beginner to advanced, including machine learning, web development, and data analysis. Asabeneh/30-Days-Of-Python  trekhleb/learn-python  Avik-Jain/100-Days-Of-ML-Code  realpython/python-guide  zhiwehu/Python-programming-exercises  geekcomputers/Python  practical-tutorials/project-based-learning  avinashkranjan/Amazing-Python-Scripts  TheAlgorithms/Python  vinta/awesome-python   ⚡Stay Informed with Industry Highlights Power BI 📊 On-premises data gateway April 2024 release: This update to the on-premises data gateway aligns it with the April 2024 release of Power BI Desktop, ensuring consistency in query execution. Additionally, the gateway now supports refreshes longer than one hour, allowing tokens to be refreshed mid-stream for continuous operation.  📊 Copilot in Power BI: Soon available to more users in your organization. The update introduces changes to Copilot in Power BI, including enabling Copilot by default for all tenants starting May 20th, 2024. 
It also addresses features reported by customers and community, updates abuse monitoring to not store prompts, and improves geo mapping for EU data boundary customers. Microsoft Fabric📊 Introducing Optimistic Job Admission for Fabric Spark: The post introduces Optimistic Job Admission for Spark in Microsoft Fabric, a new feature aimed at improving concurrency and job admission experience. It explains how this feature optimizes resource allocation and increases the number of concurrent jobs that can be admitted to the cluster. 📊 Introducing Job Queueing for Notebook in Microsoft Fabric: Microsoft Fabric introduces Job Queueing for Notebook Jobs to streamline data engineering and data science processes. This feature automatically queues notebook jobs when Fabric capacity is maxed out, eliminating manual retries and improving user experience. Jobs are retried when resources become available, enhancing efficiency for enterprise users. AWS BI  📊 Meet one of Amazon QuickSight’s Top Community Experts: Sanjeeb Mohapatra. The Amazon QuickSight Community, launched in 2022, is a hub for BI authors and developers to collaborate, ask and answer questions, and learn about QuickSight. Sanjeeb Mohapatra, the top Community Expert for 2023, exemplifies the community's spirit by providing over 1,700 replies and 235 solutions in one year. 📊 Handle tables without primary keys while creating Amazon Aurora MySQL or Amazon RDS for MySQL zero-ETL integrations with Amazon Redshift: AWS is advancing its zero-ETL vision with Amazon Aurora zero-ETL integration to Amazon Redshift, combining transactional data with analytics capabilities. This integration, along with four new ones announced at re:Invent 2023, empowers customers to implement near real-time analytics for various use cases. 📊 Power analytics as a service capabilities using Amazon Redshift: Analytics as a service (AaaS) leverages cloud-based analytic capabilities to enable cost-effective, scalable solutions for organizations. 
Amazon Redshift, a cloud data warehouse service, facilitates real-time insights and predictive analytics, empowering AaaS providers to embed rich data analytics capabilities. Delivery models include managed, bring-your-own-Redshift (BYOR), and hybrid options, offering flexibility to meet customer needs. Google Cloud Data 📊 How to use Gemini 1.0 Pro Vision in BigQuery? BigQuery integrates with Vertex AI to leverage Gemini 1.0 Pro, PaLM, Vision AI, Speech AI, Doc AI, Natural Language AI, enabling analysis of unstructured data like images, audio, and documents. New integrations support multimodal generative AI, enhancing capabilities for object recognition, info seeking, captioning, digital content understanding, and structured content generation, allowing structured data output for deeper analysis. 📊 Get to know BigQuery data canvas: BigQuery Data Canvas simplifies the data-to-insights journey by offering a natural language-driven experience. It centralizes data tasks, accelerates analysis, and fosters collaboration, all within a unified workspace, enabling faster and more efficient data analytics. 📊 Gemini in Looker to bring intelligent AI-powered BI to everyone: Gemini in Looker introduces Conversational Analytics, transforming how businesses engage with data. It offers a natural language-driven experience, simplifying data analytics and fostering collaboration, all within a unified workspace. 📊 Memorystore for Redis Cluster updates at Next ‘24: The article elaborates on the rapid adoption and recent enhancements of Google Cloud's Memorystore for Redis Cluster. It features customer testimonials from companies like Statsig, Character.AI, and AXON Networks, showcasing the service's performance, scalability, and cost-effectiveness. It also highlights new features such as data persistence, new node types, and ultra-fast vector search. 📊 Firestore launches at Next ‘24: Firestore is beloved by developers for its speed in app development. 
Updates include improved developer productivity, AI-enabled app building, richer queries, and enterprise-level scalability. Gemini Code Assist now supports Firestore, allowing natural language queries and data model definitions, enhancing the development experience. Firestore also supports AI applications and integrations with LangChain and LlamaIndex for generative AI. Tableau📊 Tableau vs Power BI: A Comparison of AI-Powered Analytics Tools. The comparison delves into the unique strengths of Tableau and Power BI, showcasing how each excels in different areas of data visualization and analytics. It outlines Tableau's robust visualizations and analytics capabilities, especially for large datasets, contrasting with Power BI's integration with Microsoft services and affordability for small to medium-sized businesses. 📊 Salesforce-Informatica Deal Could Transform Enterprise GenAI Forever: Salesforce is reportedly in advanced talks to acquire Informatica, a data-management software provider, for $11 billion. This aligns with Salesforce's strategy to expand beyond CRM, bolstered by recent AI advancements like Einstein Copilot, complementing Informatica's data integration expertise and potential synergy with Tableau and MuleSoft. Additionally, it aligns with Salesforce's strategy to expand beyond CRM and become a comprehensive data journey platform. ✨ Expert Insights from Packt Community ChatGPT for Cybersecurity Cookbook - By Clint Bodungen Sending API Requests and Handling Responses with PythonIn this recipe, we will explore how to send requests to the OpenAI GPT API and handle the responses using Python. We’ll walk through the process of constructing API requests, sending them, and processing the responses using the openai module. Getting ready Ensure you have Python installed on your system. 
Install the OpenAI Python module by running the following command in your Terminal or command prompt:

    pip install openai

How to do it…
The importance of using the API lies in its ability to communicate with and get valuable insights from ChatGPT in real time. By sending API requests and handling responses, you can harness the power of GPT to answer questions, generate content, or solve problems in a dynamic and customizable way. In the following steps, we’ll demonstrate how to construct API requests, send them, and process the responses, enabling you to effectively integrate ChatGPT into your projects or applications:

Start by importing the required modules:

    import openai
    from openai import OpenAI
    import os

Set up your API key by retrieving it from an environment variable, as we did in the Setting the OpenAI API key as an Environment Variable recipe:

    openai.api_key = os.getenv("OPENAI_API_KEY")

Define a function to send a prompt to the OpenAI API and receive a response:

    client = OpenAI()

    def get_chat_gpt_response(prompt):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=2048,
            temperature=0.7
        )
        return response.choices[0].message.content.strip()

Call the function with a prompt to send a request and receive a response:

    prompt = "Explain the difference between symmetric and asymmetric encryption."
    response_text = get_chat_gpt_response(prompt)
    print(response_text)

How it works…
First, we import the required modules. The openai module is the OpenAI API library, and the os module helps us retrieve the API key from an environment variable. We set up the API key by retrieving it from an environment variable using the os module; the OpenAI() client also reads OPENAI_API_KEY from the environment by default. Next, we define a function called get_chat_gpt_response() that takes a single argument: the prompt. This function sends a request to the OpenAI API using the client.chat.completions.create() method.
This method has several parameters:
model: Here, we specify the model (in this case, gpt-3.5-turbo).
messages: The conversation sent to the model, as a list of dictionaries with a role and content. Here, it is a single user message containing the prompt.
max_tokens: The maximum number of tokens in the generated response. A token can be as short as one character or as long as one word.
n: The number of generated responses you want to receive from the model. We do not set it in the code above, so the default of 1 applies and we receive a single response.
stop: An optional sequence of tokens that, if encountered by the model, will stop the generation process. This can be useful for limiting the response’s length or stopping at specific points, such as the end of a sentence or paragraph. It is not set in the code above.
temperature: A value that controls the randomness of the generated response. A higher temperature (for example, 1.0) will result in more random responses, while a lower temperature (for example, 0.1) will make the responses more focused and deterministic. In the code above, it is set to 0.7.

Discover more insights from ChatGPT for Cybersecurity Cookbook - By Clint Bodungen. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

💡 What's the Latest Scoop from the BI Community?
🧠 Geospatial Data Analysis with Geemap: This article introduces geospatial data analysis, focusing on raster data from Google Earth Engine, accessed and analyzed using the Geemap Python library. Earth Engine offers a vast catalog of geospatial datasets, and Geemap simplifies access and analysis, making it easier to work with such data in Python.
🧠 Microsoft Fabric Table Maintenance - Checkpoint and Statistics: This article discusses the maintenance requirements for warehouse tables in Microsoft Fabric, particularly focusing on tasks like updating statistics, removing fragmentation, and managing log files. While some maintenance tasks, such as data compaction and log file checkpointing, are automated, others, like managing statistics, may require manual intervention.
🧠 Identifying Customer Buying Pattern in Power BI - Part 1: This article is part 1 of a retail analytics analysis in Power BI, focusing on customer purchasing frequency for various products over the years. It includes identifying data elements, creating calculated columns, and analyzing trends to aid in business decision-making. 🧠 Full vs. Incremental Loads – Data Engineering with Fabric: This article discusses using Apache Spark in Microsoft Fabric to achieve data quality zones (bronze and silver) in a data lake. It explores loading weather data, transforming it with Spark SQL and DataFrames, and implementing full and incremental load patterns. 🧠 Joining Queries in Azure Data Factory on Cosmos DB Sources: This article provides a detailed guide on joining two queries in Azure Data Factory (ADF). It covers prerequisites, creation of data sources, defining queries for each dataset, and using the "Join" transformation in ADF to merge data. Different join types such as inner, left outer, right outer, and full outer joins are explained. 🧠 Feature Engineering with Microsoft Fabric and Dataflow Gen2: This article introduces Dataflow Gen2 as a low-code data transformation and integration engine for creating data pipelines in Microsoft Fabric. It focuses on using Dataflow Gen2 to create features needed for training a machine learning model with college basketball game data, offering different approaches from no code to all code. See you next time!

Merlyn Shelley
18 Apr 2024
13 min read

Elevate Your LLM Mastery

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,🚀 Welcome to DataPro Newsletter #84!  Dive into the dynamic world of data science and AI, where breakthroughs and trends shape our future.   🔍 Highlights:  Google's Genie   Meta AI's Priority Sampling   DeepMind's Hawk and Griffin   CMU's OmniACT   Qualcomm's GPTVQ   Azure PyRIT   Microsoft's ChunkAttention   ✨ Data Community Blogs:  ML Workflow with Scikit-learn Pipelines   Text Embeddings   AI System Design   Mixture of Thought LLM Cascades   GNN with Pytorch Implementation  Vertex AI MLOps Platform   🏭 Industry Updates:  Anthropic’s Claude 3 Sonnet in Amazon Bedrock    Anthropic’s Claude 3 models in Vertex AI    Microsoft’s Orca-Math   Table Meets LLM  OpenAI and Elon Musk   📚 New in Packt Library:  "Building AI Applications with ChatGPT APIs" by Martin Yanev   DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."We appreciate your input and hope you enjoy the book!Share your Feedback!Cheers,Merlyn ShelleyEditor-in-Chief, Packt Sign Up | Advertise | Archives🔰 GitHub Finds: Any of These Repos in Your Toolbox?🛠️ VAST-AI-Research/TripoSR: TripoSR, developed by Tripo AI and Stability AI, is an open-source model for fast 3D reconstruction from a single image. It outperforms others in speed and quality, generating 3D models in under 0.5 seconds on NVIDIA A100 GPUs. 🛠️ facebookresearch/ViewDiff: ViewDiff creates consistent, high-quality images of 3D objects in real-world settings from multiple angles. 
🛠️ YubiaoYue/MedMamba: MedMamba, inspired by visual state space models, sets a new baseline for medical image classification, excelling across diverse datasets. 🛠️ BAAI-Agents/Cradle: Cradle framework pioneers General Computer Control, enhancing agent capabilities for any task through reasoning and self-improvement. 📚 Expert Insights from Packt CommunityBuilding AI Applications with ChatGPT APIs - By Martin Yanev Setting Up the Code Bug Fixer Project Open PyCharm: Double-click on the PyCharm icon on your desktop or search for it in your applications folder to open it. On the PyCharm welcome screen, click on Create New Project or go to File | New Project. Choose the directory where you want to save your project. You can either create a new directory or select an existing one. Select the Python interpreter: Choose the version of Python you want to use for your project. Configure project settings: Give your project the name CodeBugFixer, and choose a project location. Once you’ve configured all the settings, click Create to create your new PyCharm project. After creating a new PyCharm project, the next step is to create the necessary files and folders for the CodeBugFixer project. Firstly, create two new Python files, called app.py and config.py, in the root directory of the project. The app.py file is where the main code for the CodeBugFixer app will be written, and the config.py file will contain any sensitive information such as API keys and passwords. Next, create a new folder called templates in the root directory of the project. This folder will contain the HTML templates that the Flask app will render. Inside the templates folder, create a new file called index.html. This file will contain the HTML code for the home page of the CodeBugFixer app. 
The project structure should look like the following:

    CodeBugFixer/
    ├── config.py
    ├── app.py
    ├── templates/
    │   └── index.html

By following these steps, you have created the necessary files and folders for your CodeBugFixer project in your PyCharm project. You can now start writing the code for your Flask app in the app.py file and the HTML code in the index.html file. Once you have the correct interpreter, you can open the terminal within PyCharm by going to View | Tool Windows | Terminal. Check your terminal and ensure that you can see the (venv) indicator to confirm that you are working within your virtual environment. This is an essential step to prevent conflicting package installations between projects and guarantee that you are using the correct set of dependencies. In the terminal window, you can install any necessary libraries as follows:

    (venv)$ pip install flask
    (venv)$ pip install openai

Finally, in order to establish the foundation for utilizing the ChatGPT API in your CodeBugFixer app, you’ll need to add the following code to config.py and app.py:

config.py

    API_KEY = "<Your API Key>"

app.py

    from flask import Flask, request, render_template
    import openai
    import config

    app = Flask(__name__)

    # API Token
    openai.api_key = config.API_KEY

    @app.route("/")
    def index():
        return render_template("index.html")

    if __name__ == "__main__":
        app.run()

The config.py file will securely hold your OpenAI API key. Make sure to replace <Your API Key> with the actual API key that you obtained from OpenAI.

Discover more insights from 'Building AI Applications with ChatGPT APIs' by Martin Yanev. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!

Message from our Partners!
👉 Octane AI Insights Analyst: Explore how Octane AI is revolutionizing ecommerce. Over 3,000 Shopify merchants have harnessed AI Quiz Funnels and Insights, generating over $500 million in revenue.
It's more than growth; it's understanding and engaging customers on a new level. Join the community and see the difference.  👉 Cognism: Transform your sales strategy with Cognism. Experience a 3x boost in connect rate, gain access to verified B2B contacts, and enjoy seamless integration with your CRM tools. Expand globally with our comprehensive data coverage. Streamline your outreach for better conversions. 👉 Freshdesk: Revolutionize your customer service with Freshworks Smart Suite's focus on analytics. Unlock actionable insights, anticipate needs, and streamline support through AI-driven dashboard. Empower your team with the tools to excel in efficiency and personalization. Start with a free trial and transform your service today! 👉 Murf AI: Enhance your projects with Murf's AI-powered voices, offering a range of realistic options for any use case. From corporate presentations to entertainment, find the perfect voice in over 20 languages. With Murf Studio, seamlessly integrate voice with your videos, music, or images, bringing your creative vision to life. Start your free trial and experience the difference. Thanks for reading Packt DataPro! Subscribe for free to receive new posts and support my work.⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 Anthropic’s Claude 3 Sonnet foundation model is now available in Amazon Bedrock: Amazon announced a collaboration with Anthropic to accelerate the development of Claude foundation models, making them accessible to AWS customers. Recently, Claude 3 was introduced, offering three models with varying levels of intelligence, speed, and cost. Claude 3 Sonnet is now available in Amazon Bedrock, providing faster speeds, increased steerability, and image-to-text vision capabilities. 
Mastering ML with Google 🌀 Announcing Anthropic’s Claude 3 models in Google Cloud Vertex AI: Google Cloud is enhancing customer choice and innovation in Vertex AI with the addition of Anthropic's Claude 3, a new family of state-of-the-art AI models. These models, optimized for various enterprise applications, include the highly capable Claude 3 Opus, the balanced Claude 3 Sonnet, and the fast, compact Claude 3 Haiku. Customers can soon access all three models via API in Vertex AI Model Garden, starting with private preview access to Claude 3 Sonnet. The Claude 3 models offer improved reasoning, content creation, language fluency, and vision capabilities, enabling customers to focus on applications while benefiting from flexible scaling, cost optimization, and Google Cloud's security and compliance. Microsoft Research Insights🌀 Orca-Math: Demonstrating the potential of SLMs with model specialization. The study on Orca and Orca 2 demonstrated how improved training methods can enhance the reasoning abilities of smaller language models, bringing them closer to larger models. Orca-Math, a 7 billion parameter model, specializes in solving math problems and outperforms larger models in this area. The research highlights the value of smaller models in specialized tasks and the potential of continual learning. The dataset and training procedure are available for further research. 🌀 Table Meets LLM: Improving LLM understanding of structured data and exploring advanced prompting methods: This paper explores how large language models (LLMs) understand structured table data. It investigates effective prompts, inherent structured data detection, leveraging existing knowledge, and trade-offs among input designs for better understanding and utilization of table-based data in LLMs. OpenAI Updates 🌀 OpenAI and Elon Musk: In a recent blog post, OpenAI shared its mission to ensure AGI benefits all of humanity, emphasizing the need for substantial resources. 
The post recounts disagreements with Elon Musk over funding and control, leading to his departure. OpenAI highlights its efforts to create widely available beneficial tools, such as GPT-4, and addresses ongoing legal disputes with Musk while reaffirming its commitment to its mission. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🧞 Google’s Genie: Generative Interactive Environments. Genie introduces a new generative AI paradigm for creating interactive, playable environments from a single image prompt. It can generate virtual worlds from unseen images, including real-world photos or sketches. Trained on a large dataset of Internet videos without action labels, Genie learns fine-grained controls, identifying controllable parts of an observation and inferring consistent latent actions across different environments.  🌀 Meta AI's Priority Sampling: Revolutionizing Machine Learning with Deterministic Code Generation. This research introduces Priority Sampling, a deterministic sampling technique for large language models that generates unique and confident code samples. It aims to improve code generation and optimization by providing a more structured and controllable exploration process, outperforming traditional sampling methods and enhancing model performance. 🌀 Google DeepMind Launches Hawk and Griffin: Efficient Language Models with Advanced Attention Mechanisms. This paper introduces Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model combining gated linear recurrences and local attention. Hawk outperforms Mamba on downstream tasks, while Griffin matches Llama-2's performance with significantly less training data. Both models are hardware-efficient, with Griffin showing exceptional scalability and the ability to extrapolate on long sequences. The study also details efficient distributed training for large-scale models. 🌀 CMU Unveils OmniACT: Groundbreaking AI Dataset for Measuring Program Execution Skills. 
OmniACT is a new dataset and benchmark designed to test if virtual agents can automate computer tasks by creating executable scripts. Initial tests show a significant gap between agent and human performance, highlighting the challenge and encouraging advancements in multimodal AI models. 🌀 Qualcomm's GPTVQ: Speeding Up Large AI Networks with Vector Quantization. GPTVQ is a new fast method for post-training vector quantization of Large Language Models (LLMs), improving size vs. accuracy trade-offs. It uses column-wise quantization and updates with Hessian information, efficient codebook initialization, and further compression techniques. GPTVQ sets new standards in LLM quantization efficiency and latency, even on mobile CPUs.   🌀 Azure PyRIT: Elevating ML Engineers with Python's Generative AI Risk Tool. PyRIT, a Python Risk Identification Tool for generative AI, automates AI Red Teaming tasks to assess the security of Language Model (LLM) endpoints. It employs proactive methods, categorizes risks, and offers detailed metrics, enabling researchers to mitigate potential risks in LLM deployment effectively. 🌀 Microsoft Introduces ChunkAttention: Accelerating Self-Attention for LLMs! This research introduces ChunkAttention, a novel self-attention module for large language models (LLMs) that optimizes compute and memory operations by detecting shared prefixes in LLM requests. It breaks key/value tensors into chunks and uses a prefix tree to share them, speeding up the self-attention kernel by 3.2-4.8×. ✨ On the Radar: Catch Up on What's Fresh🌀 Streamline Your Machine Learning Workflow with Scikit-learn Pipelines: This blog explores the benefits of using Scikit-learn pipelines for simplifying machine learning workflows. It covers how pipelines can streamline preprocessing, modeling, hyperparameter tuning, and workflow organization, making code more efficient and maintaining consistency in data preprocessing. 🌀 Do text embeddings perfectly encode text? 
The rapid advancement of generative AI has led to the widespread adoption of Retrieval Augmented Generation (RAG) systems, where AI retrieves relevant documents from a database to generate responses. This has given rise to vector databases, designed to store and search through embeddings, vector representations of documents. The paper "Text Embeddings Reveal as Much as Text" explores the security of embedding vectors, questioning whether they can be inverted back to text, posing challenges for privacy and information security. 🌀 End to End AI Use Case-Driven System Design: This blog explores the complexities of AI system performance beyond TOPs (Tera Operations Per Second), focusing on real AI use cases. It dives into optimizing an AI system for an infinite zoom feature, emphasizing power efficiency through model and memory optimizations, dynamic power scaling, and specialized hardware accelerators. 🌀 Navigating Cost-Complexity: Mixture of Thought LLM Cascades Illuminate a Path to Efficient Large Language Model Deployment: This post discusses how to significantly reduce costs while maintaining accuracy in utilizing Large Language Models (LLMs), crucial for various applications. It introduces a novel approach called Mixture of Thought (MoT) Cascades, employing a blend of weaker and stronger LLMs, along with innovative prompting techniques and consistency measurements.🌀 Structure and Relationships: Graph Neural Networks and a Pytorch Implementation. This article introduces Graph Neural Networks (GNNs), a powerful method for modeling spatial and graphical structures in data, such as molecular structures, social networks, and city designs. It covers the mathematical description of GNNs, including graph convolution networks (GCNs) and graph attention networks (GATs), and provides a regression example using the PyTorch library. The article aims to make GNNs more accessible by explaining their principles and demonstrating their potential applications. 
🌀 Extensible and Customisable Vertex AI MLOps Platform: The article describes the development of an MLOps platform for scalable machine learning models on Vertex AI using Kubeflow pipelines. It aims to provide a modular, flexible, and integrated solution for building operationalized ML models, serving as an educational resource and foundation for teams. The platform addresses common challenges and emphasizes testing, configuration, and CI/CD orchestration. See you next time!Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support! 
Mostafa Ibrahim
16 Apr 2024
6 min read

LLMOps in Action

Dive deeper into the world of AI innovation and stay ahead of the AI curve! Subscribe to our AI_Distilled newsletter for the latest insights. Don't miss out – sign up today!

Introduction
In an era dominated by the rise of artificial intelligence, the power and promise of Large Language Models (LLMs) stand distinct. These colossal architectures, designed to understand and generate human-like text, have revolutionized the realm of natural language processing. However, with great power comes great responsibility: the onus of managing, deploying, and refining these models in real-world scenarios. This article delves into the world of Large Language Model Operations (LLMOps), an emerging field that bridges the gap between the potential of LLMs and their practical application.

Background
The last decade has seen a significant evolution in language models, with models growing in size and capability. Starting with smaller models like Word2Vec and LSTM-based networks, we've advanced to behemoths like GPT-3, BERT, and T5. As these models grew in size and complexity, so did their operational challenges. Deploying, maintaining, and updating these models requires substantial computational resources, expertise, and effective management strategies.

MLOps vs LLMOps
If you've ventured into the realm of machine learning, you've undoubtedly come across the term MLOps. MLOps, or Machine Learning Operations, encapsulates best practices and methodologies for deploying and maintaining machine learning models throughout their lifecycle. It caters to the wide spectrum of models that fall under the machine learning umbrella.
On the other hand, with the growth of vast and intricate language models, a more specialized operational domain has emerged: LLMOps. While both MLOps and LLMOps share foundational principles, the latter specifically zeroes in on the challenges and nuances of deploying and managing large-scale language models.
Given the colossal size, data-intensive nature, and unique architecture of these models, LLMOps brings to the fore bespoke strategies and solutions that are fine-tuned to ensure the efficiency, efficacy, and sustainability of such linguistic powerhouses in real-world scenarios.

Core Concepts of LLMOps
Large Language Model Operations (LLMOps) focuses on the management, deployment, and optimization of large language models (LLMs). One of its foundational concepts is model deployment, emphasizing scalability to handle varied loads, reducing latency for real-time responses, and maintaining version control. As these LLMs demand significant computational resources, efficient resource management becomes pivotal. This includes the use of optimized hardware like GPUs and TPUs, effective memory optimization strategies, and techniques to manage computational costs.
Continuous learning and updating, another core concept, revolves around fine-tuning models with new data, avoiding the pitfall of 'catastrophic forgetting', and effectively managing data streams for updates. In parallel, LLMOps emphasizes the importance of continuous monitoring for performance, bias, and fairness, along with iterative feedback loops for model improvement. To cater to the vastness of LLMs, model compression techniques like pruning, quantization, and knowledge distillation become crucial.

How does LLMOps work

Pre-training Model Development
Large Language Models typically start their journey through a process known as pre-training. This involves training the model on vast amounts of text data. The objective during this phase is to capture a broad understanding of language, learning from billions of sentences and paragraphs. This foundational knowledge helps the model grasp grammar, vocabulary, factual information, and even some level of reasoning.
This massive-scale training, together with the size of the models themselves, is what makes them "large" and gives them a broad understanding of language.
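To make the compression techniques named among the core concepts more concrete before the next section discusses them, here is a minimal sketch of magnitude pruning followed by symmetric 8-bit quantization on a plain list of weights. It is an illustration under assumptions, not how any particular LLM toolkit does it: the function names (prune_by_magnitude, quantize_int8, dequantize), the per-tensor scale, and the toy weight values are all hypothetical and chosen only for readability.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One shared scale for the whole tensor (an assumption of this sketch).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Toy example: prune half of the weights, then quantize what remains.
weights = [0.5, -1.27, 0.01, 0.9]
pruned = prune_by_magnitude(weights, sparsity=0.5)
codes, scale = quantize_int8(pruned)
restored = dequantize(codes, scale)
print(pruned, codes, restored)
```

The order mirrors the one the article describes: pruning first, then quantization. Real systems usually apply both per layer or per channel and re-evaluate accuracy after each step; the single shared scale here keeps the arithmetic easy to follow.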
Optimization & Compression
Models trained to this extent are often so large that they become impractical for daily tasks.
To make these models more manageable without compromising much on performance, techniques like model pruning, quantization, and knowledge distillation are employed.
Model Pruning: After training, pruning is typically the first optimization step. This begins with trimming model weights and may advance to more intensive methods like neuron or channel pruning.
Quantization: Following pruning, the model's weights, and potentially its activations, are converted to lower-precision representations. Though weight quantization is generally a post-training process, for deeper reductions, such as very low-bit quantization, one might adopt quantization-aware training from the beginning.
Additional recommendations are:
Optimizing the model specifically for the intended hardware can elevate its performance.
Before initiating training, selecting inherently efficient architectures with fewer parameters is beneficial.
Approaches that adopt parameter sharing or tensor factorization prove advantageous.
For those planning to train a new model or fine-tune an existing one with an emphasis on sparsity, starting with sparse training is a prudent approach.

Deployment Infrastructure
After training and compressing our LLM, we will use technologies like Docker and Kubernetes to deploy models scalably and consistently. This approach allows us to flexibly scale using as many pods as needed. To conclude the deployment process, we'll implement edge deployment strategies. This positions our models nearer to the end devices, which is crucial for applications that demand real-time responses.

Continuous Monitoring & Feedback
The process starts with the Active model in production.
As it interacts with users and as language evolves, it can become less accurate, leading to the phase where the Model becomes stale as time passes. To address this, feedback and interactions from users are captured, forming a vast range of new data. Using this data, adjustments are made, resulting in a New fine-tuned model. As user interactions continue and the language landscape shifts, the current model is replaced with the new model. This iterative cycle of deployment, feedback, refinement, and replacement keeps the model relevant and effective.
Importance and Benefits of LLMOps
Much like the operational paradigms of AIOps and MLOps, LLMOps brings a wealth of benefits to the table when managing Large Language Models.
Maintenance
LLMs are computationally intensive, and LLMOps streamlines their deployment, ensuring they run smoothly and responsively in real-time applications. This involves optimizing infrastructure, managing resources effectively, and ensuring that models can handle a wide variety of queries without hiccups. Consider the significant investment of effort, time, and resources required to maintain a Large Language Model like ChatGPT, especially given its vast user base.
Continuous Improvement
LLMOps emphasizes continuous learning, allowing LLMs to be updated with fresh data. This ensures that models remain relevant, accurate, and effective, adapting to the evolving nature of language and user needs. Building on the foundation of GPT-3, the newer GPT-4 model brings enhanced capabilities. Furthermore, while ChatGPT was previously trained on data up to 2021, it has now been updated to encompass information through 2022. It's important to recognize that constructing and sustaining large language models is an intricate endeavor, necessitating meticulous attention and planning.
Conclusion
The ascent of Large Language Models marks a transformative phase in the evolution of machine learning.
But it's not just about building them; it's about harnessing their power efficiently, ethically, and sustainably. LLMOps emerges as the linchpin, ensuring that these models not only serve their purpose but also evolve with the ever-changing dynamics of language and user needs. As we continue to innovate, the principles of LLMOps will play a pivotal role in shaping the future of language models and their place in our digital world.
Author Bio
Mostafa Ibrahim is a dedicated software engineer based in London, where he works in the dynamic field of Fintech. His professional journey is driven by a passion for cutting-edge technologies, particularly in the realms of machine learning and bioinformatics. When he's not immersed in coding or data analysis, Mostafa loves to travel.
Louis Owen
12 Apr 2024
12 min read
AI for Investment

Introduction
One of the most important activities for an investor is to keep up to date with the latest relevant news. Usually, this is done by reading at least a dozen news articles covering macroeconomic issues, political issues, news related to the sector of the corresponding stock, analyst reports, and so on. This, of course, takes a lot of time and can be overwhelming for new investors, since there is too much information to process.
Many ML developers have tried to solve this issue by building a traditional ML workflow usually called a sentiment analyzer. This system takes text from the news as the input and returns a sentiment score as the output. This is no doubt helpful for the investor, but it doesn’t solve the bigger problem, which is the need to curate relevant articles and to know the impact of each news item on their investment decision. In other words, it lacks broader insight.
What if there’s an AI assistant that can act as our personal investment news analyst? What if there’s an AI assistant that is able to analyze dozens of news articles and generate an insights summary along with an investment recommendation? And what if I told you that this AI assistant is personalized toward your risk appetite and investment portfolio allocation?
In this article, I’ll guide you on how to build an AI assistant that can do all of the above with only a few lines of code, thanks to GPT-4! We’ll discuss several ways to get the news data in bulk and in real time. We’ll discuss which search keywords we need to use to get relevant news data. We’ll also discuss how to construct the prompt to fulfill all of the above-mentioned criteria while also getting a great generated output.
Finally, we’ll see how to put all of this together to build our AI assistant! Without wasting any more time, take a deep breath, make yourself comfortable, and get ready to learn how to build your personal AI investment news analyst!
News Data Sources
Getting as much news data as possible is important since we don’t want to miss any important information out there. Once we have all the information, we just need to filter it with the help of our AI assistant.
SerpAPI is one of the best all-in-one scraping tools that we can utilize to get news data from Google, Yahoo, Bing, DuckDuckGo, and many other search engines. It also provides a free plan with a limit of 100 searches/month. However, this limit is not enough for our use case. If you don’t mind spending some money and want to get multiple search results from different search engines, then this tool is suitable for you.
A more budget-friendly solution is to use the DuckDuckGo search engine API directly. DuckDuckGo is a search engine that offers data privacy as its main unique selling point. No search history is stored. Moreover, its search engine API is available for free. We will use DuckDuckGo in this article and learn how to utilize it via Python!
The more effective way to widen our search results is actually not by using different search engines but by having a diverse yet mutually exclusive set of search keywords. The goal of our AI investment assistant is to summarize the important insights that are relevant to a particular stock that we’re interested in. Hence, we need to provide relevant news data to achieve that goal.
The following are some of the search keywords that we can use. Note that this list is not exhaustive; you can expand the search keywords based on your own needs. We’ll use AAPL as the ticker example.
You can change it to any ticker you want.
$AAPL stock
$AAPL industry and competitors
$AAPL business model and strategy
$AAPL management and leadership
Besides ticker-specific search keywords, we can also search for more general information that is not ticker-specific. The following is an example list of such keywords.
economic growth this year
monetary and fiscal policies today
politics today
economy today
inflation rate today
interest rate today
real estate today
DuckDuckGo API
Once we have the keywords list, we can easily get the news data using DuckDuckGo via Python. First, we need to install the duckduckgo-search package by running the following command.
pip install duckduckgo-search
Once it is installed, we can create a general Python function that takes the search keyword as the input and returns the search results.
from duckduckgo_search import DDGS

ddgs = DDGS()

def web_search(query: str, num_results: int = 4, debug: bool = True) -> list:
    """Useful for general internet search queries."""
    if debug:
        print("Searching with query {0}...".format(query))
    search_results = []
    # Always return a list so callers can extend() the result safely.
    if not query:
        return search_results
    results = ddgs.text(query)
    if not results:
        return search_results
    # Keep only the text snippet ("body") of the first num_results hits.
    for result in results:
        search_results.append(result.get("body", ""))
        if len(search_results) >= num_results:
            break
    return search_results
Using this function is very simple. We just need to pass the search keyword along with the number of search results to this function and get the list of search results.
apple_competitors_news = web_search("$AAPL industry and competitors", num_results=10)
Prompt Engineering
The next important thing to do is to build our AI assistant. Here, we’ll utilize GPT-4 to build our assistant. Since it’s an LLM, we just need to provide the prompt without the need to train it from scratch. However, creating the prompt itself is not an easy task.
I have published another article regarding prompt engineering if you’re interested in learning more about it.
Remember that the goal of our assistant is to analyze the provided news data dump and return the summary insights along with a recommendation as the output. However, to be able to give a recommendation, our assistant needs to know our risk appetite along with our portfolio condition. The following is an example of the system prompt that we can give to GPT-4.
system_prompt = """You are an expert in giving recommendations to BUY / SELL / HOLD for {} ({}).
You can only return in JSON format with 4 fields:
"Investment Thesis" (dictionary of string. Consists of elaborated decision reasoning (in bullet points) based on the risk profile of the investor, unrealized profit, and all of the factors as the basis of your recommendation. Provide numbers to justify your assertions, a lot ideally. The deeper the analysis the better.),
"Investor Profiling" (dictionary of string. Connect the investment thesis with each of the investor profiles, including risk profile and unrealized profit.),
"Summary Thesis" (string. Summary of your entire investment thesis as the basis of the given recommendation. You have to take into account all factors in the investment thesis as well as the investor profiles.),
"recommendation" ("BUY"/"SELL"/"HOLD")
In the investment thesis, please cover the following factors. If a particular factor needed to write the investment thesis does not exist, don't try to make up the answer, just write "The information needed is unavailable".
(1) Industry and Competitive Analysis: Assess the company's position within its industry and analyze industry trends, competition, barriers to entry, and market dynamics.
(2) News and Events: Stay updated on relevant news, earnings announcements, product launches, regulatory changes, and other events that can impact the company or the overall market.
(3) Market and Economic Conditions: Assess broader macroeconomic factors from news, including economic growth, interest rates, inflation, monetary and fiscal policies, geopolitical events, gold price, bond price, index price, real estate."""
And here’s an example of the user prompt that consists of all necessary data points. Risk profiles can be "Moderate", "Aggressive", or "Conservative".
user_prompt = """<INVESTOR PROFILE>
Risk Profile: {}
Unrealized Profit: {}%
{}"""
Putting All Together
Now, we just need to create the main function that will act as our personal AI investment assistant.
import json
import os

import requests

def personal_investment_assistant(company_name: str, ticker: str, risk_profile: str, unrealized_profit_perc: float):
    # Search keywords from the "News Data Sources" section, built for the given ticker.
    search_kwrds_lst = [
        f"${ticker} stock",
        f"${ticker} industry and competitors",
        f"${ticker} business model and strategy",
        f"${ticker} management and leadership",
        "economic growth this year",
        "monetary and fiscal policies today",
        "politics today",
        "economy today",
        "inflation rate today",
        "interest rate today",
        "real estate today",
    ]
    news_data = []
    for search_keyword in search_kwrds_lst:
        news_data.extend(web_search(search_keyword))
    news_data = "\n".join(news_data)
    messages = [
        {
            "role": "system",
            "content": system_prompt.format(company_name, ticker),
        },
        {
            "role": "user",
            "content": user_prompt.format(risk_profile, unrealized_profit_perc, news_data),
        },
    ]
    response = get_gpt_response("gpt-4", temperature=0.0, messages=messages)
    return response["choices"][0]["message"]["content"].strip()

def get_gpt_response(model: str, temperature: float, messages: list):
    headers = {
        "content-type": "application/json",
        "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
    }
    endpoint = "https://api.openai.com/v1/chat/completions"
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    })
    # Let request and HTTP errors propagate instead of returning an empty
    # string, which the caller could not index into.
    response = requests.post(endpoint, data=payload, headers=headers)
    response.raise_for_status()
    return response.json()
Conclusion
Congratulations on making it to this point! Throughout this article, you have learned how to build your own personal AI investment analyst based on news data. You have learned how to get the news data, seen a list of useful search keywords, and walked through the code implementation of the AI assistant. Best of luck on your investment journey, and see you in the next article!
Author Bio
Louis Owen is a data scientist/AI engineer from Indonesia who is always hungry for new knowledge. Throughout his career journey, he has worked in various fields of industry, including NGOs, e-commerce, conversational AI, OTA, Smart City, and FinTech. Outside of work, he loves to spend his time helping data science enthusiasts to become data scientists, either through his articles or through mentoring sessions. He also loves to spend his spare time doing his hobbies: watching movies and conducting side projects. Currently, Louis is an NLP Research Engineer at Yellow.ai, the world’s leading CX automation platform. Check out Louis’ website to learn more about him! Lastly, if you have any queries or any topics to be discussed, please reach out to Louis via LinkedIn.
Merlyn Shelley
08 Apr 2024
12 min read
Apple’s ReALM, Google DeepMind’s Gecko, X.ai's Grok 1.5, Salesforce AI’s Moirai, Stability AI’s Stable Audio 2.0, TWIN-GPT, ChatGPT Instant usage

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,Welcome to DataPro#88 – Your portal to the innovations in Data Science & Machine Learning! 🚀 In this edition, you'll find: ⚙️ LLMs & GPTs Unleashed TWIN-GPT: Digital Twins for Clinical Trials. Apple’s ReALM: AI with contextual understanding. Stability AI’s Stable Audio 2.0: Audio synthesis revolution. Salesforce AI’s Moira: Enhancing customer engagement. Google DeepMind’s Gecko: Versatile Text Embeddings. X.ai's Grok 1.5: Enhanced reasoning and context. ✨ What's Fresh & Exciting Distribute LLMs with llamafile: 5 Simple Steps. Dockerized Python Environment: The Elegant Way. Knowledge Distillation: Clone Powerful LLMs. Sora’s Diffusion Transformer (DiT): A Deep Dive. Generative AI: Copyright Reckoning. OpenAI Agent: Function Calling Capabilities. ⚡ Industry Pulse AWS & Mistral AI: Democratizing generative AI. Amazon SageMaker: No-code to code-first ML. Google Cloud Next: Database success stories. Google’s SEEDS in Weather Forecasting: AI quantifies uncertainty. Microsoft’s LLMs in the Imaginarium: Tool Learning. OpenAI: Fine-tuning API and custom models. ChatGPT: Instant usage. Synthetic Voices: Challenges and Opportunities. 📚 Packt's Latest Gem MATLAB for Machine Learning - Second Edition, By Giuseppe Ciaburro. DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 
📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."We appreciate your input and hope you enjoy the book!Share your Feedback!Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🔰 GitHub Finds: Any of These Repos in Your Toolbox?🛠️ UpstageAI/dataverse: Dataverse simplifies ETL pipelines in Python, providing a user-friendly solution for data processing and management, accessible to all. 🛠️ GAP-LAB-CUHK-SZ/gaustudio: GauStudio is a modular framework for 3D Gaussian Splatting, providing streamlined pipelines and tools for easier implementation and deployment. 🛠️ TencentARC/BrushNet: BrushNet is a text-guided image inpainting model that enhances pre-trained diffusion models, focusing on divided features and dense control. 🛠️ agiresearch/AIOS: AIOS embeds LLMs into OS, enhancing resource allocation, context switch, concurrent execution, tool service, access control, and toolkit availability for developers. 🛠️ jasonppy/VoiceCraft: VoiceCraft excels in speech editing and zero-shot text-to-speech, requiring only a few seconds of reference to clone or edit voices. 📚 Expert Insights from Packt CommunityMATLAB for Machine Learning - Second Edition, By Giuseppe Ciaburro.Anomaly Detection in MATLAB Throughout the life cycle of a physical system, the occurrence of failures or malfunctions poses a potential threat to its normal functioning. To safeguard against critical interruptions, it becomes imperative to implement an anomaly detection system within the facility. Termed as a fault diagnosis system, this mechanism is designed to identify potential malfunctions within the monitored system. The pursuit of fault detection stands as a pivotal and defining phase in maintenance interventions, demanding a systematic and deterministic approach to comprehensively analyze all conceivable causes that might have led to the malfunction. 
Anomaly detection overview Anomaly detection is a technique used in data analysis and ML to identify data points or patterns that deviate significantly from the expected or normal behavior within a dataset. Anomalies, also known as outliers, are data points that do not conform to most of the data and may indicate errors, fraud, unusual events, or other important information. Anomaly detection has various applications across different domains, such as cybersecurity, industrial quality control (QC), finance, healthcare, and more. We can start to get an overview of different types of anomalies to understand what is intended with this term, we will list some types of anomalies: Point anomalies: These are individual data points that are considered anomalies, such as a single fraudulent transaction in a credit card dataset. Contextual anomalies: These are anomalies that are context-dependent. A data point might not be an anomaly on its own but is unusual in a particular context or time, such as a sudden spike in web traffic during a holiday sale. Collective anomalies: These are anomalies that are identified by examining a group of data points collectively. These anomalies involve patterns or relationships between data points. There are several methods for addressing anomaly detection problems, ranging from simple statistical techniques to complex ML algorithms. The choice of method depends on the nature of the data and the specific problem you are trying to solve. Here, we are listing the most used ones: Statistical methods: Statistical techniques such as z-scores, percentiles, and boxplots can be used to identify anomalies based on deviations from the mean or median of the data distribution. ML: Supervised, unsupervised, and semi-supervised ML algorithms can be used for anomaly detection. Some popular methods include Isolation Forest, One-Class Support Vector Machine (One-Class SVM), autoencoders (AEs), and k-means clustering. 
Time series analysis: Specialized techniques are used for detecting anomalies in time series data, such as autoregressive (AR) models, exponential smoothing, and moving averages (MAs). Density estimation: Methods such as kernel density estimation (KDE) and Gaussian Mixture Models (GMMs) are used to estimate the probability density function of the data and identify anomalies as low-density regions. Deep learning (DL): Neural networks (NNs), especially deep AEs (DAEs) and recurrent NNs (RNNs), are used for anomaly detection in high-dimensional data or sequences. Ensemble methods: Combining multiple anomaly detection models can improve overall performance and robustness. In addressing anomaly detection problems, we have to face some challenges. For example, determining an appropriate threshold for defining anomalies can be challenging. Imbalanced datasets, where anomalies are rare, can make model training and evaluation tricky. Handling high-dimensional data and noisy datasets can also be challenging. Anomaly detection is a valuable tool for identifying rare but potentially important events or patterns in large datasets. The choice of method depends on the specific domain, data characteristics, and the nature of anomalies that need to be detected. Discover more insights from "MATLAB for Machine Learning - Second Edition" by Giuseppe Ciaburro. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!Read Here!⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 AWS and Mistral AI commit to democratizing generative AI with a strengthened collaboration: The article discusses the growing use of generative AI applications across industries, facilitated by Amazon Bedrock. It highlights Mistral AI's Mistral Large model, now available on Amazon Bedrock, offering advanced language capabilities. 
This collaboration aims to provide customers with diverse model options to suit their specific business needs, promoting innovation in AI technology. 🌀 Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio: This post discusses Amazon SageMaker Studio, an integrated ML development environment, and SageMaker Canvas, a no-code ML tool, highlighting their features and integration for seamless collaboration between non-ML and ML experts. Google Research 🌀 Get inspired: Database success stories at Google Cloud Next. This blog post previews Google Cloud Next '24, focusing on customers using Google Cloud databases for transformative purposes. It highlights sessions featuring Nuro, Lightricks, Bayer, Yahoo!, and Statsig, showcasing their innovative use cases.🌀 Generative AI to quantify uncertainty in weather forecasting: Google is advancing weather forecasting with innovations like MetNet-3 and SEEDS, a generative AI model. SEEDS efficiently generates probabilistic ensembles, addressing the butterfly effect's uncertainty, and offers cost-effective solutions for extreme weather events. Microsoft Research🌀 LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error. This research enhances large language models' (LLMs) tool usage accuracy through simulated trial and error (STE), inspired by biological systems. STE improves learning by simulating tool use scenarios, interacting with tools, and leveraging short and long-term memory. Results show significant performance boosts over existing methods.OpenAI Updates🌀 Introducing improvements to the fine-tuning API and expanding our custom models program: This update discusses techniques to improve model performance, such as retrieval-augmented generation (RAG) and fine-tuning and introduces new API features for developers to control their fine-tuning jobs, enhancing model quality, reducing costs, and latency. 
🌀 Start using ChatGPT instantly: This new initiative aims to make AI more accessible by allowing instant access to ChatGPT without the need to sign up. It targets those curious about AI's potential but hesitant to set up an account, offering a seamless experience for learning, creative inspiration, and answering questions. 🌀 Navigating the Challenges and Opportunities of Synthetic Voices: Voice Engine is a model by OpenAI that generates natural-sounding speech from text input and a short audio sample, closely resembling the original speaker. They're sharing insights from a small-scale preview, highlighting its potential for various applications like reading assistance and personalized responses in education. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🌀 TWIN-GPT: Digital Twins for Clinical Trials via LLM. The research explores virtual clinical trials' benefits in healthcare, emphasizing patient safety and cost reduction. Existing methods struggle with prediction accuracy due to limited data. TWIN-GPT, a proposed approach, uses large language models to create personalized digital twins, improving predictions and showcasing digital twins' potential in healthcare. 🌀 Apple’s ReALM: AI that can “See” to understand the context: ReALM (Reference Resolution As Language Modeling) addresses the challenge of context understanding, including non-conversational entities like on-screen elements. By leveraging Language Models (LLMs), it demonstrates significant improvements in reference resolution, even outperforming GPT-4, offering over 5% gains for on-screen references. 🌀 Stability AI’s Stable Audio 2.0: Stable Audio 2.0 introduces a groundbreaking AI-generated audio standard, offering high-quality, full tracks up to three minutes long at 44.1kHz stereo. It features audio-to-audio generation, honoring creator rights, and expands creative possibilities, available for free on the Stable Audio website. 
🌀 Salesforce AI’s Moirai: Moirai is a universal time series forecasting model designed to address diverse forecasting tasks across various domains, frequencies, and variables in a zero-shot manner. It tackles key challenges in forecasting and offers robust performance, making it valuable for IT operations, sales forecasting, and more. 🌀 Google DeepMind’s Gecko: Versatile Text Embeddings Distilled from LLMs. Gecko is a compact text embedding model that achieves strong retrieval performance by distilling knowledge from large language models (LLMs). Its two-step distillation process, generating synthetic paired data and then refining data quality, outperforms larger models on the Massive Text Embedding Benchmark. Gecko with 256 dimensions outperforms all entries with 768 dimensions; Gecko with 768 dimensions competes with models that are 7x larger and use 5x higher-dimensional embeddings. 🌀 X.ai Unveils Grok 1.5: Enhanced Reasoning and Long Context Features. Grok-1.5, the latest version of x.ai's Grok model, offers improved reasoning and long context capabilities. It excels in coding and math tasks, scoring 50.6% on MATH and 90% on GSM8K benchmarks. Grok-1.5 can process long contexts up to 128K tokens and boasts robust infrastructure for large-scale training. Early testers and existing Grok users on the x.ai platform will soon have access to Grok-1.5, with further features expected to roll out gradually. ✨ On the Radar: Catch Up on What's Fresh🌀 Distribute and Run LLMs with llamafile in 5 Simple Steps: This blog introduces llamafile, a framework that simplifies using large language models (LLMs) by providing a one-file executable that runs locally without installation. It explains how to use llamafile with the LLaVA model, a 7-billion-parameter model quantized to 4 bits, for tasks like chat, image uploading, and question-answering. 🌀 Setting A Dockerized Python Environment — The Elegant Way. 
This blog post demonstrates a more elegant method for setting up a dockerized Python development environment using VScode and the Dev Containers extension. It provides step-by-step instructions and prerequisites, including Docker Desktop, a Docker Hub account, and VScode with the Dev Containers extension installed. The tutorial focuses on using the official Python image (`python:3.10`) and explains the Dev Containers extension's role in creating an isolated VScode session inside a docker container. 🌀 Clone the Abilities of Powerful LLMs into Small Local Models Using Knowledge Distillation: This post explores the use of specialized, smaller-scale language models for specific NLP tasks, such as grammatical error correction. It discusses the process of constructing tailored models through data annotation and fine-tuning, and the use of knowledge distillation to automate labeling. The post provides a workflow for distilling knowledge from a large language model to a smaller one, using prompts and APIs, and demonstrates this process in the context of building a grammatical error correction model. 🌀 Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand: This blog introduces Sora, OpenAI's text-to-video model, explaining its unique approach combining diffusion transformer and transformer strength for video prediction. It explores key concepts like diffusion, dimension reduction, and noise addition, offering insights into how Sora converts text prompts into realistic videos. Ideal for AI enthusiasts and those interested in video generation technologies. 🌀 The Coming Copyright Reckoning for Generative AI: This blog explores the complexities of copyright law in America, particularly in the context of generative AI. It discusses key concepts like original works, fair use, and the implications of generative AI on copyright. It also delves into legal cases and future considerations, offering insights for data scientists and AI enthusiasts. 
🌀 Create an Agent with OpenAI Function Calling Capabilities: This article explores the advancements and challenges in developing AI-powered applications in 2024. It discusses how AI streamlines app features for a better user experience and introduces OpenAI's Function Calling to simplify structured data extraction. The article also highlights the ongoing innovations and the future of AI applications. See you next time!
Merlyn Shelley
04 Apr 2024
11 min read
BI-Pro#49: Microsoft Fabric Lifecycle Management, Data Factory Adds CI/CD to Fabric Data Pipelines, Database Mirroring, AWS Well-Architected Data Analytics Lens

Subscribe to our BI Pro newsletter for the latest insights. Don't miss out – sign up today!👋 Hello,Welcome to BI-Pro #49, your ultimate guide to data and BI insights! 🚀 ⏩ What's Inside? Python Simplified: Master data validation with Pydantic. Visualize Like a Pro: 30+ tools for stunning data visuals. R for Bioinformatics: Custom visuals for bio data. Interactive Data: JavaScript meets Handsontable. Seaborn Stories: Craft data tales with line plots. MetaGPT Insights: Next-gen data solutions unveiled. 🏭 Industry Scoop: Power BI’s Latest: March's must-know features. Fabric Innovations: Updates and new tools from Microsoft Fabric. AWS Well-Architected Data Analytics Lens: Analytics strategies for the real world. Google Cloud Savings: Cut costs on ETL workflows. Tableau Journeys: From student to BI analyst. 💎 Expert Takes: Deep Dive into Python Deep Learning: The latest from Packt. 👉 Community Buzz: Twitch Chat Analysis, Graph Networks, LLM Data Quality, and Ethical AI: Key conversations this week! Dive into the trends shaping data and BI today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."📣 And here's the twist – we're tuning into YOUR frequency! Inspired by a reader's request, we're launching a column just for you. Got a burning question or a topic you're itching to dive into? Drop your suggestions in our content box – because your journey of discovery is our blueprint.We appreciate your input and hope you enjoy the book!Share your thoughts and opinions here! Cheers,Merlyn ShelleyEditor-in-Chief, PacktThanks for reading Packt BI-Pro! Subscribe for free to receive new posts and support our work.Pledge your supportSign Up | Advertise | Archives🚀 GitHub's Most Sought-After Repos🌀 man-group/ArcticDB: ArcticDB is a high-performance DataFrame database designed for Python Data Science, with a Python-centric API for Pandas DataFrames. 
🌀 gradio-app/gradio: Gradio is an open-source Python package for building demos or web apps for ML models or Python functions, with easy sharing via built-in features. 🌀 Sinaptik-AI/pandas-ai: PandasAI is a Python library using generative AI to explore, clean, and analyze data with natural language queries.🌀 OpenRefine/OpenRefine: OpenRefine is a powerful Java-based tool for loading, understanding, cleaning, reconciling, and augmenting data, accessible from a web browser. 🌀 Kanaries/pygwalker: PyGWalker simplifies Jupyter Notebook workflows by converting pandas dataframes into interactive user interfaces for data analysis and visualization. 🌀 cleanlab/cleanlab: cleanlab aids in data and label cleaning by identifying issues in ML datasets automatically, enabling better model training with real-world data.Email Forwarded? Join BI-Pro Here!🔮 Data Viz with Python Libraries  🌀 Pydantic Tutorial: Data Validation in Python Made Simple. This blog tutorial explains how to use Pydantic, a data validation and serialization library in Python, to validate and serialize data classes, offering support for custom validators and Python's type hints for field validation. 🌀 30+ Data Visualization Libraries, Frameworks and Apps, Mastering Data Presentation: Explore over 30 data visualization tools like Metabase, Gephi, and Grafana, offering a range of features to transform raw data into meaningful visualizations for better decision-making in industries like tech, healthcare, finance, and marketing. 🌀 Mastering Data Visualization in R for Bioinformatics:  The article delves into data visualization in R for bioinformatics, stressing its role in understanding complex biological data, communicating findings, hypothesis generation, and decision-making. It also discusses Anscombe's Quartet, highlighting the importance of visualizing data before analysis and the limitations of summary statistics. 
🌀 Integrating JavaScript charting libraries with Handsontable: The article guides developers on integrating Highcharts, Recharts, and Chart.js with Handsontable for data visualization. It explains the features of each library and provides demos for creating a stock portfolio with interactive charts. 🌀 Data Visualization with Seaborn Line Plot: The article introduces Seaborn, a Python library for data visualization, built on top of Matplotlib. It covers installation and demonstrates creating single line plots and customizing styles for better presentation of data. 🌀 MetaGPT’s Data Interpreter: SOTA Open Source LLM-based Data Solutions. MetaGPT introduces its Data Interpreter, a new agent for streamlined data interpretation and analysis. The Data Interpreter employs advanced techniques for real-time data adaptability, tool integration, and logical inconsistency identification, showcasing superior performance in machine learning tasks. ⚡Stay Informed with Industry HighlightsPower BI 🌀 Power BI March 2024 Feature Summary: The Power BI update introduces visual calculation editing, data model editing in the Power BI Service, and report subscription delivery to OneDrive SharePoint. A new Microsoft Fabric certification exam, DP-600, is also available, with free certification opportunities through the Fabric AI Skills Challenge. 🌀 Announcing the Public Preview of Database Mirroring in Microsoft Fabric: Mirroring, now in Public Preview, allows seamless integration of databases into Microsoft Fabric's OneLake, providing real-time insights without ETL. It simplifies data replication and warehousing, enabling easy data access and analysis across different sources, including data lakes and warehouses. 🌀 Get data with Power Query available in Power BI Report Builder (Preview): Power BI Report Builder now allows connecting to 100+ data sources like Snowflake, Databricks, and AWS Redshift. You can transform data using M-Query for paginated reports. 
Install the latest version and connect from the "Data" tab. Microsoft Fabric🌀 Microsoft Fabric March 2024 Update: This update brings new features like OneLake File Explorer, Autotune Query Tuning, and Test Framework for Power Query SDK in VS Code to Power BI, enhancing reporting, modeling, service, mobile, and developer experiences. 🌀 Data Factory Adds CI/CD to Fabric Data Pipelines: Fabric engineers with Azure Synapse Analytics and Azure Data Factory experience can now utilize Git integration and built-in Deployment Pipelines in Data Factory data pipelines in Fabric. This public preview offers source control, CI/CD features, and collaborative development environments, enhancing data analytics projects. 🌀 Microsoft Fabric Lifecycle Management – Getting started with Git Integration and Deployment Pipelines: Microsoft Fabric makes Lifecycle Management easy, enabling continuous releases through Git and Deployment Pipelines. Git allows reliable updates for supported items like Lakehouse, Notebooks, and Reports, while Deployment Pipelines clone content between stages like DEV, TEST, UAT, and PROD. AWS BI  🌀 Announcing the AWS Well-Architected Data Analytics Lens: The Data Analytics Lens helps assess and improve analytics platforms on AWS. It offers best practices, such as building ACID-compliant data lakes and leveraging Serverless for data pipelines, aligned with the AWS Well-Architected Framework's pillars for secure, efficient, and cost-effective solutions. 🌀 Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics. 
The post discusses how healthcare providers can improve patient care by leveraging AWS services for real-time analytics and personalized healthcare, focusing on a zero-ETL approach to data integration.

Google Cloud Data
🌀 Enrich streaming data in Bigtable with Dataflow: The post discusses the importance of event stream processing in data engineering and introduces Apache Beam's Enrichment transform, which simplifies the process of enriching streaming data with Bigtable, improving data context and enabling more meaningful analysis.
🌀 Dataflow at-least-once vs. exactly-once streaming modes: The post compares exactly-once and at-least-once processing modes in Dataflow Streaming Engine for streaming jobs. It explains the trade-offs between the two modes and provides guidance on choosing the right mode based on use case requirements.

Tableau
🌀 Data is both art and science - My Tableau Story: Andy Cotgreave. The post highlights Andy Cotgreave's journey from a data analyst at Oxford to becoming a Senior Technical Evangelist at Tableau. It emphasizes the importance of community engagement, innovation, building a portfolio, and having fun in data visualization.
🌀 Student to BI Analyst, How Tableau Can Lead to a Successful Data Career: This blog discusses Karolina Grodzinska's data visualization journey, from discovering Tableau to winning Iron Viz: Student Edition and becoming a Business Intelligence Analyst at Schneider Electric. Karolina emphasizes the importance of an active Tableau Public profile in career development and shares tips for building a strong portfolio and networking with the Tableau Community.

✨ Expert Insights from Packt Community
Python Deep Learning - Third Edition - By Ivan Vasilev

Developing NN models for edge devices with TF Lite
TF Lite is a TF-derived set of tools that allows us to run models on mobile, embedded, and edge devices.
Its versatility is part of TF’s appeal for industrial applications (as opposed to research applications, where PyTorch dominates).

The key paradigm of TF Lite is that the models run on-device, in contrast to a client-server architecture, where the model is deployed on remote, more powerful hardware. This organization has the following implications (both good and bad):
Low-latency execution: The lack of a server round trip significantly reduces the model inference time and allows us to run real-time applications.
Privacy: The user data never leaves the device.
Internet connectivity: Internet connectivity is not required.
Small model size: The devices have limited computational ability, hence the need for small and computationally efficient models. More specifically, TF Lite models are stored in the FlatBuffers (https://flatbuffers.dev/) special efficient portable format, identified by the .tflite file extension. Besides its small size, it allows us to access data directly without parsing/unpacking it first. TF Lite models support a subset of the TF Core operations and allow us to define custom ones.
Low power consumption: The devices often run on battery.
Divergent training and inference: NN training is a lot more computationally intensive compared to inference. Because of this, the model training runs on a different, more powerful, piece of hardware than the actual devices, where the models will run inference.

In addition, TF Lite has the following key features:
Multi-platform and multi-language support, including Android (Java), iOS (Objective-C and Swift) devices, web (JavaScript), and Python for all other environments. Google provides a TF Lite wrapper API called MediaPipe Solutions (https://developers.google.com/mediapipe, https://github.com/google/mediapipe/), which supersedes the previous TF Lite API.
Optimized for performance.
It has end-to-end solution pipelines. TF Lite is oriented toward practical applications, rather than research.
Because of this, it includes different pipelines for common ML tasks such as image classification, object detection, text classification, and question answering among others. The computer vision pipelines use modified versions of EfficientNet or MobileNet, and the natural language processing pipelines use BERT-based models. So, how does TF Lite model development work? First, we’ll select a model in one of the following ways:  An existing pre-trained .tflite model (https://tfhub.dev/s?deployment-format=lite). Use MediaPipe Model Maker (https://developers.google.com/mediapipe/solutions/model_maker) to apply feature engineering transfer learning on an existing .tflite model with a custom training dataset. Model Maker only works with Python. Convert a full-fledged TF model into .tflite format. Discover more insights from 'Python Deep Learning - Third Edition' by Ivan Vasilev. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today! Read Here💡 What's the Latest Scoop from the BI Community? 🌀 Real-Time Twitch Chat Sentiment Analysis with Apache Flink: This blog explores building a real-time sentiment analysis application for Twitch chat using Apache Flink. It covers setting up the project, reading Twitch chat messages, performing sentiment analysis, and concludes with a demo. 🌀 Entity Type Prediction with Relational Graph Convolutional Network (PyTorch): This post discusses a Python setup for predicting entity types on heterogeneous graphs using the Relational Graph Convolutional Network (R-GCN) and the RGCNConv module from PyTorch. It explains knowledge graphs, entity type prediction, and the R-GCN model. 🌀 Data Quality Error Detection powered by LLMs: This article explores automating the identification of data errors in tabular datasets using Large Language Models (LLMs). It discusses the Data Dirtiness Score, challenges in data cleaning, and the potential of LLMs in detecting data quality issues. 
🌀 Building Ethical AI Starts with the Data Team — Here’s Why: This article discusses the ethical considerations of AI, focusing on model bias, AI usage, and data responsibility. It emphasizes the role of data teams in ensuring ethical AI and suggests steps for data teams to take towards a more ethical future. See you next time!

Merlyn Shelley
02 Apr 2024
10 min read

Databricks' DBRX, Stability AI's Stable Code Instruct 3B, SambaNova's Samba CoE v0.2, FrugalGPT, Advanced RAG Patterns on Amazon SageMaker

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!

👋 Hello,

Welcome to DataPro #87 – Your Gateway to the Cutting-Edge of Data Science & Machine Learning! 🚀

Dive into this edition to explore:

⚙️ LLMs & GPTs Unleashed
Samba CoE v0.2: SambaNova's Speedy AI Models
Efficient Training of Language Models with OpenAI
AI21's Revolutionary SSM-Transformer Model: Jamba
Databricks' DBRX: The New Open LLM Benchmark
Stable Code Instruct 3B: Stability AI's Latest Offering
HyperLLaVA: Boosting Multimodal Language Models

✨ What's Fresh & Exciting
FrugalGPT: Cutting LLM Operating Costs
Building a Reliable AI Agent from Scratch with OpenAI Tool Calling
Fine-Tuning Instruct Models over Raw Text Data
Crafting an OpenAI-Compatible API

⚡ Industry Pulse:
Deciphering Advanced RAG Patterns on Amazon SageMaker
Unveil the Future with AutoBNN: Mastering Probabilistic Time Series Forecasting!
Engaging with Microsoft Copilot (web): Learning from Interaction

📚 Packt's Latest Gem
"Principles of Data Science - Third Edition" by Sinan Ozdemir

DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today!

📥 Feedback on the Weekly Edition
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."

We appreciate your input and hope you enjoy the book!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

🔰 GitHub Finds: Any of These Repos in Your Toolbox?
🛠️ Zejun-Yang/AniPortrait: AniPortrait is a new framework for creating high-quality animations using audio input and a reference portrait image, with face reenactment capabilities.
🛠️ agiresearch/AIOS: AIOS embeds large language models into operating systems, enabling smarter resource allocation, context switching, and concurrent agent execution, advancing AGI.
🛠️ lichao-sun/Mora: Mora is a multi-agent framework for video generation, enhancing OpenAI's Sora capabilities through collaborative visual agents for diverse tasks.
🛠️ jasonppy/VoiceCraft: VoiceCraft is a high-performing neural codec language model for speech editing and zero-shot text-to-speech, excelling with diverse real-world data.
🛠️ dvlab-research/MiniGemini: Mini-Gemini enhances LLMs (Large Language Models) from 2B to 34B, integrating image understanding, reasoning, and generation, inspired by LLaVA.
🛠️ Picsart-AI-Research/StreamingT2V: StreamingT2V is a technique for creating long videos with rich motion dynamics, ensuring temporal consistency and high image quality.

📚 Expert Insights from Packt Community
"Principles of Data Science - Third Edition" by Sinan Ozdemir

The Five Steps of Data Science
A question I’ve gotten at least once a month for the past decade is: What’s the difference between data science and data analytics? One could argue that there is no difference between the two; others will argue that there are hundreds of differences!
I believe that, regardless of how many differences there are between the two terms, the following applies: Data science follows a structured, step-by-step process that, when followed, preserves the integrity of the results and leads to a deeper understanding of the data and the environment the data comes from. As with any other scientific endeavor, this process must be adhered to, or else the analysis and the results are in danger of scrutiny. On a simpler level, following a strict process can make it much easier for any data scientist, hobbyist, or professional to obtain results faster than if they were exploring data with no clear vision. While these steps are a guiding lesson for amateur analysts, they also provide the foundation for all data scientists, even those in the highest levels of business and academia. Every data scientist recognizes the value of these steps and follows them in some way or another. Overview of the five steps The process of data science involves a series of steps that are essential for effectively extracting insights and knowledge from data. These steps are presented as follows: Asking an interesting question: The first step in any data science project is to identify a question or challenge that you want to address with your analysis. This involves finding a topic that is relevant, important, and that can be addressed with data. Obtaining the data: Once you have identified your question, the next step is to collect the data that you will need to answer it. This can involve sourcing data from a variety of sources, such as databases, online platforms, or through data scraping or data collection methods. Exploring the data: After you have collected your data, the next step is to explore it and get a better understanding of its characteristics and patterns. This might involve examining summary statistics, visualizing the data, or applying statistical or machine learning (ML) techniques to identify trends or relationships. 
Modeling the data: Once you have explored your data, the next step is to build models that can be used to make predictions or inform decision-making. This might involve applying ML algorithms, building statistical models, or using other techniques to find patterns in the data. Communicating and visualizing the results: Finally, it’s important to communicate your findings to others in a clear and effective way. This might involve creating reports, presentations, or visualizations that help to explain your results and their implications. By following these five essential steps, you can effectively use data science to solve real-world problems and extract valuable insights from data. It’s important to note that different data scientists may have different approaches to the data science process, and the steps outlined previously are just one way of organizing the process. Some data scientists might group the steps differently or include additional steps such as feature engineering or model evaluation. Despite these differences, most data scientists agree that the steps listed previously are essential to the data science process. Whether they are organized in this specific way or not, these steps are all crucial for effectively using data to solve problems and extract valuable insights. Let’s dive into these steps one by one.Discover more insights from "Principles of Data Science - Third Edition" by Sinan Ozdemir. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!    Read Here!⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 Advanced RAG patterns on Amazon SageMaker: This post discusses how customers across various industries are utilizing large language models (LLMs) like Mixtral-8x7B Instruct to build generative AI applications such as QnA chatbots and search engines. 
It highlights the challenges and solutions in improving the accuracy and performance of these applications, focusing on Retrieval Augmented Generation (RAG) patterns implemented with LangChain.Google Research 🌀 AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks. This research introduces AutoBNN, an open-source package for automated, interpretable time series forecasting using Bayesian neural networks (BNNs). It addresses limitations of traditional methods like Gaussian processes (GPs) and Structural Time Series by combining the interpretability of GPs with the scalability and flexibility of neural networks. AutoBNN automates model discovery, provides high-quality uncertainty estimates, and scales effectively for large datasets. Microsoft Research🌀 Learning from interaction with Microsoft Copilot (web): This research focuses on how AI systems like Bing and Microsoft Copilot learn and improve from user interactions, particularly through reinforcement learning from human feedback (RLHF). It also explores how Bing has evolved its search capabilities and how Copilot is changing user interactions to be more conversational and workflow oriented. The research introduces frameworks like TnT-LLM and SPUR to improve taxonomy generation and user satisfaction estimation in AI interactions. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🌀 Samba CoE v0.2 from SambaNova delivers accurate AI models at blazing speeds: This blog post highlights Samba's advancements in AI architecture, specifically focusing on the introduction of Samba-1, a CoE architecture for enterprise AI. It discusses the features and benefits of Samba-1, its performance benchmarks, and plans for future releases, emphasizing the role of RDUs in driving efficiency and speed in AI models. 
🌀 OpenAI’s Efficient Training of Language Models to Fill in the Middle: OpenAI demonstrates that autoregressive language models can effectively learn to infill text by moving a span of text from the middle of a document to its end, without harming generative capability. They propose training models with this method by default and provide benchmarks and best practices.
🌀 Jamba: AI21's Groundbreaking SSM-Transformer Model. Jamba is a groundbreaking model that merges Mamba SSM with Transformer elements, offering a 256K context window and outperforming similar models. Released under Apache 2.0, it will be available in the NVIDIA API catalog. Jamba optimizes memory, throughput, and performance, delivering remarkable efficiency.
🌀 Databricks’ DBRX: A New State-of-the-Art Open LLM. Databricks introduces DBRX, an open LLM setting new benchmarks in language understanding, programming, and math. With a 32K context window, it outperforms GPT-3.5 and competes with Gemini 1.0 Pro. DBRX is about 40% of the size of Grok-1 and offers up to 2x faster inference than LLaMA2-70B.
🌀 Introducing Stable Code Instruct 3B - Stability AI: Stable Code Instruct 3B, built on Stable Code 3B, offers state-of-the-art performance in code completion and natural language interactions for programming tasks. It outperforms Codellama 7B Instruct and matches StarChat 15B, with a focus on popular languages like Python and Java. Available for commercial use with a Stability AI Membership, the model is accessible on Hugging Face.
🌀 HyperLLaVA: Enhancing Multimodal Language Models with Dynamic Visual and Language Experts. This blog explores the advancements in Multimodal Large Language Models (MLLMs) and introduces HyperLLaVA, a dynamic model that improves performance by adaptively tuning parameters for handling diverse multimodal tasks, surpassing existing benchmarks and opening new avenues for multimodal learning systems.
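To make the fill-in-the-middle item above more concrete, here is a minimal sketch of the data transformation it describes: a span is cut from the middle of a document and moved to the end, so an ordinary left-to-right model learns to generate the middle given the prefix and suffix. The sentinel strings `<PRE>`, `<SUF>`, and `<MID>` and both function names are assumptions made for illustration; real tokenizers define their own special tokens, and the sketch works on characters rather than tokens.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim_example(document: str, rng: random.Random) -> str:
    """Rearrange a document as prefix, suffix, middle so the middle comes last."""
    n = len(document)
    # Two random cut points split the text into prefix / middle / suffix.
    i, j = sorted(rng.randint(0, n) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def from_fim_example(example: str) -> str:
    """Invert the transformation (assumes the text contains no sentinels)."""
    prefix, rest = example[len(PRE):].split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "def add(a, b):\n    return a + b\n"
    example = to_fim_example(doc, rng)
    print(repr(example))
    print(from_fim_example(example) == doc)
```

At inference time, a model trained on such examples would be prompted with everything up to and including the `<MID>` sentinel and asked to continue, which yields the missing middle.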
✨ On the Radar: Catch Up on What's Fresh🌀 FrugalGPT and Reducing LLM Operating Costs: The blog discusses the high cost of running Large Language Models (LLMs) and introduces the "FrugalGPT" framework, which reduces operating costs significantly while maintaining quality. It explains how different models cost different amounts and proposes using a cascade of LLMs to minimize costs while maximizing answer quality. 🌀 Leverage OpenAI Tool calling: Building a reliable AI Agent from Scratch. The blog discusses the future role of AI in everyday tasks, focusing on text creation, correction, and brainstorming. It highlights the importance of Retrieval-Augmented Generation (RAG) pipelines and aims to provide Large Language Models with better context to generate more valuable content. 🌀 Fine-tune an Instruct model over raw text data: The blog explores the challenges of integrating modern chatbots with large datasets, focusing on context window sizes and the use of Retrieval-Augmented Generation (RAG) techniques. It proposes a lighter approach to fine-tuning chatbots on smaller datasets, aiming to bridge the gap between the constraints of a 128K context window and the complexities of models fine-tuned on billions of tokens. The experiment involves fine-tuning a model on The Guardian's dataset and aims to provide reproducible instructions for cost-effective model training using accessible hardware. 🌀 How to build an OpenAI-compatible API: The blog discusses the dominance of OpenAI in the Gen AI market, and the reasons developers might choose alternative LLM providers. It explores implementing a Python FastAPI server compatible with the OpenAI API specs to wrap any LLM, aiming for flexibility and cost-effectiveness. See you next time!
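The LLM cascade mentioned in the FrugalGPT item above can be sketched roughly as follows: try the cheapest model first and escalate to a more expensive one only when a scoring function judges the answer unreliable. Everything here is hypothetical and for illustration only: the `Tier` class, the toy "models", the costs, and the scorer are stand-ins, not part of FrugalGPT or any provider's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost: float                    # hypothetical cost per call
    answer: Callable[[str], str]   # stand-in for a real model call
    threshold: float               # minimum score needed to accept the answer


def cascade(query: str, tiers: List[Tier],
            scorer: Callable[[str, str], float]) -> Tuple[str, str, float]:
    """Query tiers from cheapest to priciest; stop at the first accepted answer."""
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    for tier in tiers:
        result = tier.answer(query)
        spent += tier.cost
        if scorer(query, result) >= tier.threshold:
            break
    # If no tier passes, the last tier's answer is returned as a fallback.
    return tier.name, result, spent


# Toy stand-ins: the small model only knows one question.
def small_model(q: str) -> str:
    return "4" if q == "2+2?" else "unsure"


def large_model(q: str) -> str:
    return "detailed answer"


def toy_scorer(q: str, a: str) -> float:
    return 0.0 if a == "unsure" else 1.0


TIERS = [
    Tier("small", 1.0, small_model, 0.5),
    Tier("large", 20.0, large_model, 0.0),
]

if __name__ == "__main__":
    print(cascade("2+2?", TIERS, toy_scorer))
    print(cascade("Explain RAG", TIERS, toy_scorer))
```

The savings come from easy queries never reaching the expensive tier; the quality of the scorer and the choice of thresholds decide how often the cascade escalates.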