👋 Hello,
Welcome to DataPro#87 – Your Gateway to the Cutting-Edge of Data Science & Machine Learning! 🚀
Dive into this edition to explore:
⚙️ LLMs & GPTs Unleashed
Samba CoE v0.2: SambaNova's Speedy AI Models
Efficient Training of Language Models with OpenAI
AI21's Revolutionary SSM-Transformer Model: Jamba
Databricks' DBRX: The New Open LLM Benchmark
Stable Code Instruct 3B: Stability AI's Latest Offering
HyperLLaVA: Boosting Multimodal Language Models
✨ What's Fresh & Exciting
FrugalGPT: Cutting LLM Operating Costs
Building a Reliable AI Agent from Scratch with OpenAI Tool Calling
Fine-Tuning Instruct Models over Raw Text Data
Crafting an OpenAI-Compatible API
⚡ Industry Pulse
Deciphering Advanced RAG Patterns on Amazon SageMaker
Unveil the Future with AutoBNN: Mastering Probabilistic Time Series Forecasting!
Engaging with Microsoft Copilot (web): Learning from Interaction
📚 Packt's Latest Gem
"Principles of Data Science - Third Edition" by Sinan Ozdemir
DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today!
📥 Feedback on the Weekly Edition
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."
We appreciate your input and hope you enjoy the book!
Cheers,
Merlyn Shelley
Editor-in-Chief, Packt
🛠️ Zejun-Yang/AniPortrait: AniPortrait is a new framework for creating high-quality animations using audio input and a reference portrait image, with face reenactment capabilities.
🛠️ agiresearch/AIOS: AIOS embeds large language models into operating systems, enabling smarter resource allocation, context switching, and concurrent agent execution, advancing AGI.
🛠️ lichao-sun/Mora: Mora is a multi-agent framework for generalist video generation that aims to match and extend the capabilities of OpenAI's Sora through collaborative visual agents handling diverse tasks.
🛠️ jasonppy/VoiceCraft: VoiceCraft is a high-performing neural codec language model for speech editing and zero-shot text-to-speech, excelling with diverse real-world data.
🛠️ dvlab-research/MiniGemini: Mini-Gemini enhances a family of large language models (LLMs) ranging from 2B to 34B parameters, integrating image understanding, reasoning, and generation, inspired by LLaVA.
🛠️ Picsart-AI-Research/StreamingT2V: StreamingT2V is a technique for creating long videos with rich motion dynamics, ensuring temporal consistency and high image quality.
"Principles of Data Science - Third Edition" by Sinan Ozdemir.
The Five Steps of Data Science
A question I’ve gotten at least once a month for the past decade is, "What's the difference between data science and data analytics?" One could argue that there is no difference between the two; others will argue that there are hundreds of differences! I believe that, regardless of how many differences there are between the two terms, the following applies:
Data science follows a structured, step-by-step process that, when followed, preserves the integrity of the results and leads to a deeper understanding of the data and the environment the data comes from.
As with any other scientific endeavor, this process must be adhered to, or else the analysis and the results will not stand up to scrutiny. On a simpler level, following a strict process can make it much easier for any data scientist, hobbyist or professional, to obtain results faster than if they were exploring data with no clear vision.
While these steps are a guiding lesson for amateur analysts, they also provide the foundation for all data scientists, even those at the highest levels of business and academia. Every data scientist recognizes the value of these steps and follows them in one way or another.
Overview of the five steps
The process of data science involves a series of steps that are essential for effectively extracting insights and knowledge from data. These steps are presented as follows:
Asking an interesting question: The first step in any data science project is to identify a question or challenge that you want to address with your analysis. This involves finding a topic that is relevant, important, and answerable with data.
Obtaining the data: Once you have identified your question, the next step is to collect the data that you will need to answer it. This can involve gathering data from a variety of sources, such as databases and online platforms, or through data scraping and other collection methods.
Exploring the data: After you have collected your data, the next step is to explore it and get a better understanding of its characteristics and patterns. This might involve examining summary statistics, visualizing the data, or applying statistical or machine learning (ML) techniques to identify trends or relationships.
Modeling the data: Once you have explored your data, the next step is to build models that can be used to make predictions or inform decision-making. This might involve applying ML algorithms, building statistical models, or using other techniques to find patterns in the data.
Communicating and visualizing the results: Finally, it’s important to communicate your findings to others in a clear and effective way. This might involve creating reports, presentations, or visualizations that help to explain your results and their implications.
By following these five essential steps, you can effectively use data science to solve real-world problems and extract valuable insights from data.
It’s important to note that different data scientists may have different approaches to the data science process, and the steps outlined previously are just one way of organizing the process. Some data scientists might group the steps differently or include additional steps such as feature engineering or model evaluation.
Despite these differences, most data scientists agree that the steps listed previously are essential to the data science process. Whether they are organized in this specific way or not, these steps are all crucial for effectively using data to solve problems and extract valuable insights. Let’s dive into these steps one by one; the short sketch below shows how steps 2 through 5 fit together in code.
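To make the steps concrete, here is a minimal Python sketch (not from the book; the toy dataset and column names are invented for illustration) that walks through steps 2 to 5 with pandas and scikit-learn:

```python
# Steps 2-5 on a toy dataset; in practice, step 2 would pull from a
# database, an API, or scraped files rather than an inline DataFrame.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 2 - Obtain the data (toy stand-in for a real source).
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "passed":        [0, 0, 0, 0, 1, 1, 1, 1],
})

# Step 3 - Explore the data: summary statistics and correlations.
print(df.describe())
print(df.corr())

# Step 4 - Model the data: a simple classifier.
X_train, X_test, y_train, y_test = train_test_split(
    df[["hours_studied"]], df["passed"],
    test_size=0.25, stratify=df["passed"], random_state=0,
)
model = LogisticRegression().fit(X_train, y_train)

# Step 5 - Communicate the results: report a clear, single metric.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```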
Discover more insights from "Principles of Data Science - Third Edition" by Sinan Ozdemir. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!
🌀 Advanced RAG patterns on Amazon SageMaker: This post discusses how customers across various industries are using large language models (LLMs) like Mixtral-8x7B Instruct to build generative AI applications such as Q&A chatbots and search engines. It highlights the challenges and solutions in improving the accuracy and performance of these applications, focusing on Retrieval-Augmented Generation (RAG) patterns implemented with LangChain.
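For readers new to the pattern, here is a generic, self-contained sketch of the core RAG loop (a plain scikit-learn retriever standing in for the post's SageMaker and LangChain stack; the documents and the final LLM call are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder knowledge base; real systems index far larger corpora.
documents = [
    "Mixtral-8x7B Instruct is a sparse mixture-of-experts model.",
    "RAG retrieves relevant documents and adds them to the LLM prompt.",
    "Amazon SageMaker hosts and serves machine learning models.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

question = "What does RAG add to a prompt?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# The assembled prompt would now go to the LLM endpoint (e.g. Mixtral-8x7B
# Instruct hosted on SageMaker); the call itself is omitted here.
print(prompt)
```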
🌀 AutoBNN: Probabilistic time series forecasting with compositional Bayesian neural networks. This research introduces AutoBNN, an open-source package for automated, interpretable time series forecasting using Bayesian neural networks (BNNs). It addresses limitations of traditional methods like Gaussian processes (GPs) and Structural Time Series by combining the interpretability of GPs with the scalability and flexibility of neural networks. AutoBNN automates model discovery, provides high-quality uncertainty estimates, and scales effectively to large datasets.
🌀 Learning from interaction with Microsoft Copilot (web): This research focuses on how AI systems like Bing and Microsoft Copilot learn and improve from user interactions, particularly through reinforcement learning from human feedback (RLHF). It also explores how Bing has evolved its search capabilities and how Copilot is making user interactions more conversational and workflow-oriented. The research introduces frameworks like TnT-LLM and SPUR to improve taxonomy generation and user-satisfaction estimation in AI interactions.
🌀 Samba CoE v0.2 from SambaNova delivers accurate AI models at blazing speeds: This blog post highlights SambaNova's advancements in AI architecture, specifically the introduction of Samba-1, a Composition of Experts (CoE) architecture for enterprise AI. It discusses the features and benefits of Samba-1, its performance benchmarks, and plans for future releases, emphasizing the role of Reconfigurable Dataflow Units (RDUs) in driving efficiency and speed in AI models.
🌀 OpenAI’s Efficient Training of Language Models to Fill in the Middle: OpenAI demonstrates that autoregressive language models can effectively learn to infill text by moving a span of text from the middle of a document to its end, without harming generative capability. They propose training models with this method by default and provide benchmarks and best practices.
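The core data transformation is simple enough to show in a few lines: split a document into a prefix, middle, and suffix, then move the middle to the end so a left-to-right model can learn to infill. A sketch (the sentinel strings are illustrative placeholders, not OpenAI's actual special tokens):

```python
import random

# Illustrative sentinel strings, not OpenAI's actual special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix, then emit it in
    prefix-suffix-middle (PSM) order so the middle is predicted last."""
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(to_fim_example("def add(a, b):\n    return a + b\n", rng))
```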
🌀 Jamba: AI21's Groundbreaking SSM-Transformer Model. Jamba merges the Mamba SSM with Transformer elements, offering a 256K context window and outperforming similarly sized models. Released under Apache 2.0, it will be available in the NVIDIA API catalog. Jamba optimizes memory, throughput, and performance, delivering remarkable efficiency.
🌀 Databricks’ DBRX: A New State-of-the-Art Open LLM. Databricks introduces DBRX, an open LLM setting new benchmarks in language understanding, programming, and math. With a 32K-token context window, it outperforms GPT-3.5 and competes with Gemini 1.0 Pro. DBRX is about 40% of the size of Grok-1, offering up to 2x faster inference than LLaMA2-70B.
🌀 Introducing Stable Code Instruct 3B — Stability AI: Stable Code Instruct 3B, built on Stable Code 3B, offers state-of-the-art performance in code completion and natural language interactions for programming tasks. It outperforms CodeLlama 7B Instruct and matches StarChat 15B, with a focus on popular languages like Python and Java. Available for commercial use with a Stability AI Membership, the model is accessible on Hugging Face.
🌀 HyperLLaVA: Enhancing Multimodal Language Models with Dynamic Visual and Language Experts. This blog explores advancements in multimodal large language models (MLLMs) and introduces HyperLLaVA, a dynamic model that improves performance by adaptively tuning parameters to handle diverse multimodal tasks. It surpasses existing benchmarks and opens new avenues for multimodal learning systems.
🌀 FrugalGPT and Reducing LLM Operating Costs: The blog discusses the high cost of running Large Language Models (LLMs) and introduces the "FrugalGPT" framework, which reduces operating costs significantly while maintaining quality. It explains how different models cost different amounts and proposes using a cascade of LLMs to minimize costs while maximizing answer quality.
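The cascade idea at the heart of FrugalGPT fits in a short sketch. In this hypothetical example, call_model, score_answer, and the model names are placeholders; the point is the control flow of trying cheap models first and escalating only when needed:

```python
# A minimal LLM-cascade sketch in the spirit of FrugalGPT: query cheap
# models first, escalate only when a quality scorer is not confident.
from typing import Callable

CASCADE = ["cheap-model", "mid-model", "expensive-model"]  # cheapest first

def cascade_answer(
    question: str,
    call_model: Callable[[str, str], str],
    score_answer: Callable[[str, str], float],
    threshold: float = 0.8,
) -> str:
    """Return the first answer whose quality score clears the threshold;
    the most expensive model's answer is the final fallback."""
    answer = ""
    for model in CASCADE:
        answer = call_model(model, question)
        if score_answer(question, answer) >= threshold:
            break  # a cheaper model was good enough; stop paying more
    return answer

# Toy stand-ins so the sketch runs end to end.
quality = {"cheap-model": 0.5, "mid-model": 0.9, "expensive-model": 0.95}
print(cascade_answer(
    "What is 2 + 2?",
    call_model=lambda model, q: f"{model} says: 4",
    score_answer=lambda q, a: quality[a.split(" says")[0]],
))  # prints "mid-model says: 4" - escalated once, then stopped
```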
🌀 Leverage OpenAI Tool calling: Building a reliable AI Agent from Scratch. The blog discusses the future role of AI in everyday tasks, focusing on text creation, correction, and brainstorming. It highlights the importance of Retrieval-Augmented Generation (RAG) pipelines and aims to provide Large Language Models with better context to generate more valuable content.
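As a taste of the approach, here is a compact sketch of the OpenAI tool-calling loop such an agent is built around (openai>=1.0 Python SDK; the weather tool and the model name are illustrative, not taken from the post):

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_weather(city: str) -> str:
    """Toy tool; a real agent would call an actual weather API."""
    return f"It is sunny in {city}."

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, tools=tools
)
msg = response.choices[0].message

# If the model requested a tool, execute it and send the result back.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": result}
        )
    final = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages
    )
    print(final.choices[0].message.content)
```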
🌀 Fine-tune an Instruct model over raw text data: The blog explores the challenges of integrating modern chatbots with large datasets, focusing on context window sizes and the use of Retrieval-Augmented Generation (RAG) techniques. It proposes a lighter approach to fine-tuning chatbots on smaller datasets, aiming to bridge the gap between the constraints of a 128K context window and the complexities of models fine-tuned on billions of tokens. The experiment involves fine-tuning a model on The Guardian's dataset and aims to provide reproducible instructions for cost-effective model training using accessible hardware.
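For orientation, a bare-bones version of this kind of raw-text fine-tune with Hugging Face transformers might look like the following (the model name and inline texts are placeholders, not the post's exact setup):

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # small placeholder; the post uses a larger model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 family has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in for the raw articles; no instruction formatting, just text.
raw_texts = ["First raw news article ...", "Second raw news article ..."]
dataset = Dataset.from_dict({"text": raw_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetune-out",
        num_train_epochs=1,
        per_device_train_batch_size=1,
    ),
    train_dataset=dataset,
    # mlm=False gives the causal (next-token) objective with shifted labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```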
🌀 How to build an OpenAI-compatible API: The blog discusses OpenAI's dominance of the Gen AI market and the reasons developers might choose alternative LLM providers. It explores implementing a Python FastAPI server compatible with the OpenAI API spec to wrap any LLM, aiming for flexibility and cost-effectiveness.
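A minimal sketch of the idea: a FastAPI route that mimics the /v1/chat/completions request and response shapes around a stubbed model (the stub reply and the server filename are placeholders):

```python
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # Stub "LLM": echo the last message; swap in any real backend here.
    reply = f"You said: {req.messages[-1].content}"
    # Response shape mirrors OpenAI's chat.completion object.
    return {
        "id": "chatcmpl-demo",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0,
                  "total_tokens": 0},
    }

# Run with: uvicorn server:app --reload
# then point an OpenAI client at base_url="http://localhost:8000/v1".
```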
See you next time!