The initial code base
Let's now go over all of the code that we'll be optimizing later on, based on the earlier description.
The first piece is quite simple: a single-file script that takes care of scraping and saving the data in JSON format, as we discussed earlier. The flow is straightforward, and the order is as follows (a minimal sketch of this loop appears right after the list):
1. It will query the list of questions, page by page.
2. For each page, it will gather the links to the individual questions.
3. Then, for each link, it will gather the information listed in the preceding points.
4. It will move on to the next page and start over again.
5. It will finally save all of the data into a JSON file.
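Before looking at the actual script, here is a minimal sketch of that loop. It assumes that the question list accepts a ?page= query parameter, that question links on the list page carry the question-hyperlink CSS class, and an output file named questions.json; none of these details come from the script itself, and the full code below defines its own helpers.

import json
import requests
from bs4 import BeautifulSoup

SO_URL = "http://scifi.stackexchange.com"
QUESTION_LIST_URL = SO_URL + "/questions"
MAX_PAGE_COUNT = 20

results = []
for page in range(1, MAX_PAGE_COUNT + 1):
    # Query one page of the question list
    list_html = requests.get(QUESTION_LIST_URL, params={"page": page}).text
    list_soup = BeautifulSoup(list_html, "html.parser")
    # Gather the links to the individual questions (assumed CSS class)
    for anchor in list_soup.select(".question-hyperlink"):
        # Fetch each question page; the real script extracts the author,
        # votes, answers, and so on, but here the page title stands in
        question_html = requests.get(SO_URL + anchor["href"]).text
        question_soup = BeautifulSoup(question_html, "html.parser")
        results.append({
            "link": anchor["href"],
            "title": str(question_soup.title.string),
        })
# Finally, save all of the gathered data into a JSON file
with open("questions.json", "w") as f:
    json.dump(results, f)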
The code is as follows:
from bs4 import BeautifulSoup
import requests
import json

SO_URL = "http://scifi.stackexchange.com"
QUESTION_LIST_URL = SO_URL + "/questions"
MAX_PAGE_COUNT = 20

global_results = []
initial_page = 1  # first page is page 1

def get_author_name(body):
    link_name = body.select(".user-details a")
    if len(link_name) == 0:
        text_name = body.select(".user-details")
        ...
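The listing breaks off inside get_author_name. Its visible logic queries the .user-details block for a profile link and, when none exists (a deleted account renders without one, for instance), falls back to the block's plain text. The following self-contained snippet uses made-up sample markup to show how BeautifulSoup's select() drives that fallback:

from bs4 import BeautifulSoup

# Made-up markup mimicking Stack Exchange's .user-details block
html = """
<div class="user-details">
    <a href="/users/42/jane">Jane</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

link_name = soup.select(".user-details a")  # CSS selector: <a> inside .user-details
if len(link_name) == 0:
    # No profile link: fall back to the block's raw text
    text_name = soup.select(".user-details")
    print(text_name[0].text.strip() if text_name else "N/A")
else:
    print(link_name[0].text)  # prints: Jane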