Exploring the data ingredients
Important note
If you have a background in data science, you may skip this section.
However, if you do not, this basic introduction is essential to understand the concepts and tools discussed throughout the book.
Data science is an interdisciplinary field that combines mathematics, statistics, programming, and machine learning with specific subject matter knowledge to extract meaningful insights.
Imagine you work at a top-tier bank that is considering making its first investment in a blockchain protocol, and they have asked you to present a shortlist of protocols to invest in based on relevant metrics. You may have some ideas about what metrics to consider, but how do you know which metric and value is the most relevant to determine which protocol should make it on your list? And once you know the metric, how do you find the data and calculate it?
This is where data science comes in. By analyzing transaction data (on-chain data) and data that is not on-chain (off-chain data), we can identify patterns and insights that will help us make informed decisions. For example, we might find that certain protocols are more active during business hours in a time zone different from where the bank is located. In this case, the bank can decide whether it is ready to invest in a product serving clients in a different time zone. We may also check the value locked in the protocol to assess investors’ general trust in that smart contract, among many other metrics.
But data science is not just about analyzing past data. We can also use predictive modeling to forecast future trends and add those trends to our assessment. For instance, we could use machine learning algorithms to predict the price range of the token issued by the protocol based on its price history.
For this data analysis, we require the right tools, skills, and business knowledge. We’ll need to know how to collect and clean our data, how to analyze it using statistical techniques, how to separate what is business-relevant from what is not, and how to visualize our findings so we can communicate them effectively. Making data-driven decisions is one of the most effective ways to improve the metrics that matter to a business, which is more valuable than ever in this competitive world.
Due to the fast pace of data creation and the shortage of data scientists on the market, the role of data scientist has been referred to as “the sexiest job of the 21st century” by the Harvard Business Review. The data economy has opened the door to multiple roles, such as data analyst, data scientist, data engineer, data architect, Business Intelligence (BI) analyst, and machine learning engineer. Depending on the complexity of the problem and the size of the data, any of these roles may take part in a typical data science project.
A typical Web3 data science project involves the following steps:
- Problem definition: At this stage, we try to answer the question of whether the problem can be solved with data, and if so, what data would be useful to answer it. Collaboration between data scientists and business users is crucial in defining the problem, as the latter are the specialists and those who will use what the data scientist produces. BI tools such as Tableau, Looker, and Power BI, or Python data visualization libraries such as Seaborn and Matplotlib, are useful in meetings with business stakeholders. It is worth noting that while many BI tools currently provide optimization packages for commonly used data sources, such as Facebook Ads or HubSpot, as of the time of writing, I have not seen any optimization for on-chain data. Therefore, it is preferable to choose highly flexible data visualization tools that can adapt to any visualization needs.
- Investigation and data ingestion: At this stage, we try to answer the question: where can we find the data we need for this project? Throughout this book, especially in Chapters 2 and 3, we will list multiple data sources related to Web3 that will help answer this question. Once we find where the data is, we need to build an ingestion pipeline so that the data scientist can consume it. This process is called ETL, which stands for extract, transform, and load. These steps are necessary to make clean and organized data available to the data analyst or data scientist.
Data collection or extraction is the first step of the ETL process and can include manual entry, web scraping, live streaming from devices, or a connection to an API. Data can be presented in a structured format, meaning that it is stored in a predefined way, or an unstructured format, meaning that it has no predefined storage format and is simply stored in its native way. Transformation consists of modifying the raw data so that it can be stored or analyzed. Some of the activities that data transformation can involve include data normalization, data deduplication, and data cleaning. Finally, loading is the act of moving the transformed data into data storage and making it available. There are a few additional aspects to consider when referring to data availability, such as storing data in the correct format, including all related metadata, providing the correct access privileges to the right team members, and ensuring the data is up to date, accurate, and sufficient to fulfill the data scientist’s needs. The ETL process is generally led by the data engineer, but the business owner and the data scientist will have a say when identifying the data source.
- Analysis/modeling: In this stage, we analyze the data to extract conclusions and may need to model it to try to predict future outcomes. Once the data is available, we can perform the following:
  - Descriptive analysis: This uses data analysis methods to describe what the data shows, gaining insights into trends, composition, distribution, and more. For example, a descriptive analysis of a Decentralized Finance (DeFi) protocol can reveal when its clients are most active, what its Total Value Locked (TVL) is, and how the locked value has evolved over time.
  - Diagnostic analysis: This uses data analysis to explain why certain events occurred. Techniques such as data composition, correlations, and drill-down are used in these types of analyses. For example, a blockchain analyst may try to understand the correlation between a peak in new addresses and the activity of certain addresses to identify what these users use the chain for.
  - Predictive analysis: This uses historical data to make forecasts about trends or events in the future. Techniques can include machine learning, cluster analysis, and time series forecasting. For example, a trader may try to predict the evolution of a certain cryptocurrency based on its historical performance.
  - Prescriptive analysis: This uses the result of predictive analysis as an input to suggest the optimum response or best course of action. For example, a bot can suggest whether to sell or buy a certain cryptocurrency.
  - Generative AI: This uses machine learning techniques and huge amounts of data to learn patterns and generate new and original outputs. Artificial intelligence can create images, videos, audio, text, and more. Applications of generative models include ChatGPT, Leonardo AI, and Midjourney.
- Evaluation: In this stage, the result of our analysis or modeling is evaluated and tested to confirm it meets the project goals and provides value to the business. Any bias or weakness of our models is identified, and if necessary, the process starts again to address those errors.
- Presentation/deployment: The final stage of the process depends on the problem. If it is an analysis from which the company will make a decision, our job will probably conclude with a presentation and explanation of our findings. Alternatively, if we are working as part of a larger software pipeline, our model will most likely be deployed or integrated into the data pipeline.
This is an iterative process, meaning that many times, especially in the evaluation stage (step 4), we will receive valuable feedback from the business team about our analysis, and we will revise our initial conclusions accordingly. What is true for traditional data science is reinforced for the Web3 industry, as this is one of the industries where data plays a key role in building trust, leading investments, and, in general, unlocking new value.
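To make the extract, transform, and load steps of the ingestion stage more concrete, here is a minimal sketch that uses only the Python standard library. The records, field names, and table layout are hypothetical and invented for illustration. A real pipeline would extract from an API or a data provider and would usually load into a shared database instead of an in-memory one.

```python
import sqlite3

# Hypothetical raw records, standing in for an API response or a CSV export.
RAW_TRANSFERS = [
    {"tx_hash": "0xaa", "from": "0xABC", "to": "0xDEF", "value_wei": "1000000000000000000"},
    {"tx_hash": "0xaa", "from": "0xABC", "to": "0xDEF", "value_wei": "1000000000000000000"},  # duplicate
    {"tx_hash": "0xbb", "from": "0xdef", "to": "0xABC", "value_wei": "250000000000000000"},
    {"tx_hash": "0xcc", "from": "0x123", "to": "0xabc", "value_wei": None},  # missing value
]

WEI_PER_ETH = 10**18  # 1 ether = 10^18 wei


def extract():
    """Extract: return the raw records (a real pipeline would query a source here)."""
    return list(RAW_TRANSFERS)


def transform(records):
    """Transform: drop incomplete rows, deduplicate by hash, and normalize fields."""
    seen = set()
    clean = []
    for rec in records:
        if rec["value_wei"] is None:   # data cleaning
            continue
        if rec["tx_hash"] in seen:     # data deduplication
            continue
        seen.add(rec["tx_hash"])
        clean.append((
            rec["tx_hash"],
            rec["from"].lower(),       # consistent address casing
            rec["to"].lower(),
            int(rec["value_wei"]) / WEI_PER_ETH,  # convert wei to ether
        ))
    return clean


def load(rows, conn):
    """Load: store the transformed rows so they can be queried with SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transfers ("
        "tx_hash TEXT PRIMARY KEY, sender TEXT, receiver TEXT, value_eth REAL)"
    )
    conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)

total_rows, total_eth = conn.execute(
    "SELECT COUNT(*), SUM(value_eth) FROM transfers"
).fetchone()
print(f"{total_rows} clean transfers totalling {total_eth} ETH")
```

Keeping the three steps in separate functions means each one can be tested or replaced independently, for instance by swapping the hardcoded list for a call to a data provider.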
Although data science is not a programming career, it heavily relies on programming because of the large amount of data available. In this book, we will work with the Python language and some SQL to query databases. Python is a general-purpose programming language commonly used by the data science community, and it is easy to learn due to its simple syntax. An alternative to Python is R, which is a statistical programming language commonly used for data analysis, machine learning, scientific research, and data visualization. A simple way to access Python or R and their associated libraries and tools is to install the Anaconda distribution. It includes popular data science libraries (such as NumPy, pandas, and Matplotlib for Python) and simplifies the process of setting up an environment to start working on data analysis and machine learning projects.
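As a first taste of the kind of Python we will write, the following sketch performs a small descriptive analysis of the sort mentioned earlier: finding the hours at which a protocol is most active, as seen from two time zones. The timestamps are invented for illustration, and the `activity_by_hour` helper and the UTC-5 offset for the bank's location are assumptions of this example. It relies only on the standard library. In practice, pandas would be a natural choice for the same task.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

BASE = 1_699_920_000  # 2023-11-14 00:00:00 UTC, as a Unix timestamp in seconds

# Hypothetical transaction timestamps, given as (hour, minute) offsets in UTC.
TIMESTAMPS = [
    BASE + hour * 3600 + minute * 60
    for hour, minute in [(14, 5), (14, 40), (15, 10), (14, 55), (3, 30), (15, 45), (14, 20)]
]


def activity_by_hour(timestamps, utc_offset_hours=0):
    """Count transactions per hour of the day in a fixed-offset time zone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return Counter(datetime.fromtimestamp(ts, tz).hour for ts in timestamps)


utc_counts = activity_by_hour(TIMESTAMPS)
local_counts = activity_by_hour(TIMESTAMPS, utc_offset_hours=-5)  # assumed bank location

for label, counts in [("UTC", utc_counts), ("UTC-5", local_counts)]:
    busiest_hour, n_tx = counts.most_common(1)[0]
    print(f"{label}: busiest hour is {busiest_hour}:00 with {n_tx} transactions")
```

A fixed offset ignores daylight saving time. For real data, a named time zone from the standard `zoneinfo` module would be more accurate.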
The activities in this book will be carried out in three work environments:
- Notebooks: For example, Anaconda Jupyter notebooks or Google Colaboratory (also frequently referred to as Colab). These files are saved in .ipynb format and are very useful for data analysis or training models. We will use Colab notebooks in the machine learning chapters due to the access it provides to GPU resources in its free tier.
- IDEs: PyCharm, Visual Studio Code, or any other IDE that supports Python. Their files are saved in .py format and are very useful for building applications. Most IDEs allow the user to download extensions to work with notebook files.
- Query platforms: In Chapter 2, we will access on-chain data platforms that have built-in query systems. Examples of those platforms are Dune Analytics, Flipside, Footprint Analytics, and Increment.
Anaconda Jupyter notebooks and IDEs use our computer resources (e.g., RAM), while Google Colaboratory uses cloud services (more on resources can be found in Appendix 1).
Please refer to Appendix 1 to install any of the environments mentioned previously.
Once we have a clean notebook, we will warm up our Python skills with the Chapter01/Python_warm_up notebook, which follows the tutorial at https://learnxinyminutes.com/docs/python/. For a more thorough study of Python, we encourage you to check out Data Science with Python, by Packt Publishing, or Python Data Science Handbook, both of which are listed in the Further reading section of this chapter.
Once we have completed the warm-up exercise, we will initiate the Web3 client using the Web3.py library. Let’s learn about these concepts in the following section.