Data pull
The amount of data we collect through the GitHub API is small enough to fit in memory, so we can work with it directly in a pandas dataframe. If more data were required, we would recommend storing it in a database such as MongoDB.
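For the larger-data case, a minimal sketch of persisting the results to MongoDB with pymongo might look like the following. The connection string, the database name (github), and the collection name (repos) are placeholders, and we assume a MongoDB instance is running locally and that results is a list of dictionaries returned by the API client.

from pymongo import MongoClient

# Hypothetical names: adjust the connection string, database,
# and collection to your own setup.
client = MongoClient("mongodb://localhost:27017/")
collection = client["github"]["repos"]

# results is assumed to be a list of dictionaries from the GitHub API.
collection.insert_many(results)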
We use JSON tools to convert the results into clean JSON and to create a dataframe from it.
# json_normalize moved to the top-level pandas namespace;
# the old pandas.io.json path is removed in recent pandas releases.
from pandas import json_normalize
import json
import pandas as pd
import bson.json_util as json_util

# Round-trip through bson.json_util to serialize MongoDB-specific
# types (ObjectId, datetimes) into plain JSON.
sanitized = json.loads(json_util.dumps(results))

# Flatten the nested JSON records into a flat table.
normalized = json_normalize(sanitized)
df = pd.DataFrame(normalized)
The dataframe df contains a column for every field returned by the GitHub API. We can list them by typing the following:
df.columns

Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url',
       'clone_url', 'collaborators_url', 'comments_url', 'commits_url',
       'compare_url', 'contents_url', 'contributors_url', 'default_branch',
       'deployments_url', 'description', 'downloads_url', 'events_url',
       'fork', ...
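Many of these columns hold API endpoint URLs rather than repository data, so a common next step is to set them aside. A minimal sketch, assuming the naming convention visible above (hyperlink columns end in _url):

# Keep only columns that are not API hyperlinks (those ending in '_url').
df_data = df.loc[:, ~df.columns.str.endswith('_url')]
print(df_data.columns)

This leaves the descriptive fields, such as description, default_branch, and fork, for the analysis that follows.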