Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases now! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon

Transforming Web Data with Browse AI

Save for later
  • 14 min read
  • 26 Mar 2024

article-image

Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!

Partnering with Browse AI 

Turn Web Data into Your Business Superpower!

👉 Train a robot in 2 minutes, no coding needed. 🤖 

👉 Ideal for web scraping and data monitoring. 🌐 

Here’s what you get: 

Monitor Websites for Changes ✅ 

Download Data from Any Website ✅ 

Turn Any Website into an API ✅ 

Product data extraction ✅ 

Also, extract data from news, stocks, jobs, social media, and more. 

Check out this 1-minute explainer video on how to extract data to Excel, Airtable, and connect to 5,000+ apps using Zapier! 

Start for free with up to 50 credits, and for a limited time, enjoy free setup and onboarding for Team and Company plans, saving up to 20% on Annual plans. 

Get Scraping Today!

👋 Hello,

Welcome to DataPro#85 – Your one-stop shop for the latest in Data Science and ML Algorithms! 🚀 

In this issue:

⚙️ Keeping Up with LLMs & GPTs  

Meet Devin: The pioneering AI software engineer. 

Google's Croissant: A fresh take on metadata for ML-ready datasets. 

INSTRUCTIR by Kaist AI: Setting new standards in instruction-following for information retrieval models. 

Spyx by Sussex AI: Turbocharging spiking neural networks with just-in-time compiled optimization. 

SynCode by VMware: Enhancing LLM code generation with a touch of grammar. 

Chatbot Arena: The ultimate battleground for evaluating LLMs by human preference. 

Apollo: Bringing medical AI to the masses with a multilingual medical LLM. 

On the Radar

Top AI tools for code generation in 2024. 

Setting up a Pypi mirror in AWS with Terraform. 

Ensuring safer code changes with custom pre-commit hooks. 

Deciphering the AQLM Quantization Algorithm

AI's role in revolutionizing web browsing. 

Tackling tensors through three tricky errors. 

Running RStudio inside a container. 

Harnessing PyTorch and MLX for Apple Silicon. 

🏭 Industry Highlights 

Google Research: Boosting LLMs with Cappy, evolving tables with Chain-of-table, and Scalable Instructable Multiworld Agent (SIMA)

AWS: Streamlining code review with generative AI using Amazon Bedrock

OpenAI Updates: Leadership continuity and global news partnerships

📚 New in Packt Library 

Practical Guide to Applied Conformal Prediction in Python by Valery Manokhin. 

DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 

📥 Feedback on the Weekly Edition

Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."

We appreciate your input and hope you enjoy the book!

Share your Feedback!

Cheers,
Merlyn Shelley
Editor-in-Chief, Packt

Sign Up | Advertise | Archives

🔰 GitHub Finds: Any of These Repos in Your Toolbox?

🛠️ deepseek-ai/DeepSeek-VL: Open-source Vision-Language (VL) model for real-world tasks, handling logical diagrams, web pages, formulas, scientific literature, and more. 

Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at R$50/month. Cancel anytime

🛠️ OpenGVLab/VideoMamba: VideoMamba enhances 3D CNNs and video transformers, excelling in long-term video understanding with scalability and modality compatibility. 

🛠️ showlab/DragAnything: DragAnything uses entity representation for motion control in video generation, offering user-friendly interaction and outperforming existing methods. 

🛠️ pkunlp-icler/FastV: FastV accelerates large vision language models by pruning redundant visual tokens, achieving 45% FLOPs reduction without performance loss. 

🛠️ cnulab/RealNet: RealNet introduces SDAS for anomaly strength control, AFS for feature selection, and RRS for anomaly region identification. 

Partnering with Surfshark

transforming-web-data-with-browse-ai-img-0

Surfshark is allowing our readers to enjoy a full 2 years of their award-winning VPN protection for 79% off, plus 2 months free

With Surfshark One, you get: 

Unlimited devices and connections ✅ 

One account for the entire household ✅ 

Your online activity, made safe, secure, and invisible ✅ 

Plus, identity protection, ad blocking, antivirus, and data breach monitoring.

Claim your VPN protection today! 

📚 Expert Insights from Packt Community

Practical Guide to Applied Conformal Prediction in Python - By Valery Manokhin 

Basic components of a conformal predictor 

We will now look at the basic components of a conformal predictor: 

Nonconformity measure: The nonconformity measure is a function that evaluates how much a new data point differs from the existing data points. It compares the new observation to either the entire dataset (in the full transductive version of conformal prediction) or the calibration set (in the most popular variant – ICP. The selection of the nonconformity measure is based on a particular machine learning task, such as classification, regression, or time series forecasting, as well as the underlying model. This will examine several nonconformity measures suitable for classification and regression tasks. 

Calibration set: The calibration set is a portion of the dataset used to calculate nonconformity scores for the known data points. These scores are a reference for establishing prediction intervals or regions for new test data points. The calibration set should be a representative sample of the entire data distribution and is typically randomly selected. The calibration set should contain a sufficient number of data points (at least 500). If the dataset is small and insufficient to reserve enough data for the calibration set, the user should consider other variants of conformal prediction – including TCP (see, for example, Mastering Classical Transductive Conformal Prediction in Action – https://medium.com/@valeman/how-to-use-full-transductive-conformal-prediction-7ed54dc6b72b). 

Test set: The test set contains new data points for generating predictions. For every data point in the test set, the conformal prediction model calculates a nonconformity score using the nonconformity measure and compares it to the scores from the calibration set. Using this comparison, the conformal predictor generates a prediction region that includes the target value with a user-defined confidence level. 

All these components work in tandem to create a conformal prediction framework that facilitates valid and efficient uncertainty quantification in a wide range of machine learning tasks. 

Discover more insights from 'Practical Guide to Applied Conformal Prediction in Python' by Valery Manokhin. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!   

Read Here!

⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! 

AWS ML Made Easy 

🌀 Enhance code review and approval efficiency with generative AI using Amazon Bedrock: This post discusses the challenges faced by managers in overseeing code review and approval processes in software development, such as lack of technical expertise, time constraints, volume of change requests, manual effort, and the need for documentation. It also introduces a solution that leverages generative artificial intelligence and integrates it with AWS deployment tools to streamline the review and approval process. The solution includes automated change analysis, summarization, and an approval workflow. 

Google Research 

🌀 Cappy: Outperforming and boosting large multi-task language models with a small scorer. This blog discusses advancements in large language models (LLMs) and their use in natural language processing (NLP). It introduces the concept of multi-task LLMs, such as T0, FLAN, and OPT-IML, which excel at understanding and solving various tasks. It also presents a new approach called Cappy, a lightweight pre-trained scorer that enhances the performance and efficiency of multi-task LLMs. 

🌀 Chain-of-table: Evolving tables in the reasoning chain for table understanding. This research focuses on improving how large language models (LLMs) reason over tabular data, which is challenging due to the structured nature of tables. The proposed framework, Chain-of-Table, trains LLMs to iteratively update tables, mimicking human reasoning, resulting in improved performance on table understanding tasks. 

🌀 Talk like a graph: Encoding graphs for large language models. This research explores how to teach large language models (LLMs) to reason with graph information, crucial for understanding interconnected data. They introduce GraphQA, a benchmark to evaluate LLMs on graph problems, revealing insights into effective graph encoding methods and improving LLM performance on graph tasks by up to 60%. 

🌀 Scalable Instructable Multiworld Agent (SIMA): A generalist AI agent for 3D virtual environments. Google DeepMind has developed SIMA, a versatile AI agent trained on multiple video games to follow natural-language instructions, akin to human behavior. Collaborating with game studios, SIMA navigates various environments, showcasing potential for AI to understand and execute diverse tasks. 

OpenAI Updates 

🌀 Review completed & Altman, Brockman to continue to lead OpenAI: The OpenAI Board completed a review by WilmerHale, expressing full confidence in Sam Altman and Greg Brockman's leadership. They also elected new board members and adopted governance enhancements. WilmerHale's review found a breakdown in trust between the prior Board and Mr. Altman, leading to his removal, but concluded that his conduct did not mandate removal. Following the review, the Board endorsed the decision to rehire Mr. Altman and Mr. Brockman. 

🌀 Global news partnerships: Le Monde and Prisa Media: OpenAI has partnered with Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT. This partnership aims to enhance user interaction with news content and contribute to the training of OpenAI's models. Through these partnerships, users will access summaries and links to original articles, expanding their news consumption experience. This collaboration supports the news industry and its role in providing reliable information globally. 

Email Forwarded? Join DataPro Here!

🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 

🌀 Introducing Devin, the first AI software engineer: Meet Devin, the autonomous AI software engineer, skilled in long-term reasoning and planning. Devin can learn new technologies, build and deploy apps, find and fix bugs, train AI models, and contribute to open source. Devin excels in resolving real-world GitHub issues, outperforming previous models. Cognition, the AI lab behind Devin, aims to unlock new possibilities beyond coding. 

🌀 Google’s Croissant: a metadata format for ML-ready datasets. Croissant is a new metadata format for ML datasets, aiming to simplify the use of existing datasets for training ML models. It standardizes dataset descriptions and organization, supporting responsible AI practices. Croissant builds upon schema.org and is supported by major tools and repositories like Kaggle, Hugging Face, and OpenML. It includes a specification, example datasets, a Python library, and a visual editor to facilitate dataset usage and publication. 

🌀 Kaist AI’s INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models. This research focuses on enhancing search accuracy by improving retrievers to understand users' intentions, similar to language models. It introduces INSTRUCTIR, a benchmark for evaluating retrievers' ability to follow user-aligned instructions in retrieval tasks. The study addresses limitations in existing benchmarks and highlights potential overfitting issues in instruction-aware retrieval datasets.  

🌀 Sussex AI’s Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks. Advancements in large neural architectures have led to powerful AI accelerators for training deep neural networks. However, these networks often incur high costs. Neuromorphic computing with Spiking Neural Networks (SNNs) offers energy-efficient alternatives, but training SNNs is challenging. Spyx, a new lightweight SNN simulation and optimization library designed in JAX, aims to facilitate SNN architecture investigation by bridging Python-based deep learning frameworks with custom compute kernels, achieving optimal hardware utilization. 

🌀 VMware’s SynCode: Improving LLM Code Generation with Grammar Augmentation. SynCode is a novel framework for efficient syntactical decoding of code with large language models (LLMs). It leverages grammar of a programming language using an offline-constructed efficient lookup table called Deterministic Finite Automaton (DFA) mask store. SynCode seamlessly integrates with any context-free grammar (CFG) defined language, reducing syntax errors by 96.07% when combined with LLMs. 

🌀 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Chatbot Arena is an open platform designed to evaluate Large Language Models (LLMs) by considering human preferences. Utilizing a pairwise comparison method and crowdsourced input, it assesses LLMs' alignment with user preferences. The platform, operational for months with over 240K votes, provides a credible and valuable resource for ranking LLMs. Check out the tool here

🌀 Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. The project aims to develop medical Large Language Models (LLMs) in the six most spoken languages, benefiting 6.1 billion people. This includes creating the ApolloCorpora multilingual medical dataset and the XMedBench benchmark, with Apollo models achieving top performance among models of similar sizes. The project will open-source training data, code, model weights, and evaluation benchmarks. You can check for the demo here

✨ On the Radar: Catch Up on What's Fresh

🌀 Top Artificial Intelligence (AI) Tools That Can Generate Code To Help Programmers (2024): The article discusses how AI is changing programming, with tools like OpenAI Codex and GitHub Copilot generating code. It explores AI's impact on code quality and development speed, showcasing various AI-powered tools like Tabnine, CodeT5, and Polycoder. Additionally, it mentions AI tools for code review, static code analysis, and AI-assisted coding in IDEs like PyCharm and Visual Studio. 

🌀 Pypi mirror in a private AWS environment Terraform: This article explains how to install Python packages in an AWS Sagemaker Studio environment without internet access. It covers setting up Sagemaker in VPC Only mode, using VPC Endpoint interfaces for network communications, and accessing the Pypi package repository through AWS Codeartifact, which allows defining Pypi as an upstream repository. 

🌀 Custom pre-commit hooks for safer code changes: This blog post explains the importance of using pre-commit hooks in software development, particularly with the git version control system. It discusses the challenges of maintaining coding standards in collaborative projects and provides a step-by-step tutorial on how to set up and use custom pre-commit hooks for a Python project, using the example of validating dataflow definitions for the Hamilton library. 

🌀 AQLM Quantization Algorithm, explained: A new quantization algorithm, AQLM (Additive Quantization of Language Models), was recently released and integrated into HuggingFace Transformers and HuggingFace PEFT. AQLM sets a new state-of-the-art for 2-bit quantization while providing improvements for 3-bit and 4-bit ranges, pushing the boundaries of model accuracy and memory footprint. 

🌀 Revolutionize Web Browsing with AI: This article explores creating an AI agent using the gpt-4-vision-preview model from OpenAI, enabling it to navigate the web like a human. It discusses the agent's browser control, content browsing, and decision-making processes, showcasing potential use cases such as aiding visually challenged users and automating web browsing tasks. 

🌀 Understanding Tensors: Learning a Data Structure Through 3 Pesky Errors. This article discusses transitioning from managing tabular data to working with tensors in TensorFlow, offering debugging tips and code recipes. It covers visualizing TensorFlow datasets, understanding tensor specs, and augmenting model summaries, while addressing common errors related to tensor rank and shape. 

🌀 Running RStudio Inside a Container: This tutorial focuses on setting up RStudio using Docker, particularly leveraging the Rocker RStudio image. It covers pulling the image, launching RStudio in a container, and ensuring persistence of data by using volume mapping. The tutorial provides step-by-step instructions and explanations for each stage. 

🌀 PyTorch and MLX for Apple Silicon: The blog discusses Apple's MLX framework, which is optimized for Apple Silicon and serves as a bridge between PyTorch, NumPy, and Jax. It details a comparison between MLX and PyTorch through a custom convolutional neural network implementation for image classification tasks. The discussion includes insights into MLX's features, such as its array class, lazy computation, and compilation for performance optimization. The post also highlights the ease of converting PyTorch code to MLX, despite some differences in API compatibility and coding conventions. 

See you next time!

Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support!