In this project, we will build a mobile app that, when pointed at any scene, generates captions describing it. Such an app is highly beneficial for people with visual impairments, as it can serve both as an assistive technology on the web and as a day-to-day aid when paired with a voice interface such as Alexa or Google Home. The app will call a hosted API that produces captions for any image passed to it. The API returns the three best candidate captions for the image, and the app displays them directly below the camera view.
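To make the request/response contract concrete before we dive in, here is a minimal Python sketch of how a client might call such a captioning API. The endpoint URL, the `image` form field, and the `captions` JSON key are all assumptions for illustration, not the actual API we build later in the project:

```python
import requests

# Hypothetical endpoint; the project's actual URL and schema may differ.
CAPTION_API_URL = "https://example.com/generate_caption"

def get_captions(image_path):
    """Send an image to the captioning API and return its top-3 captions."""
    with open(image_path, "rb") as f:
        response = requests.post(CAPTION_API_URL, files={"image": f})
    response.raise_for_status()
    # Assume the API responds with JSON like {"captions": ["...", "...", "..."]}
    return response.json()["captions"]

if __name__ == "__main__":
    for caption in get_captions("scene.jpg"):
        print(caption)
```

The mobile app would issue an equivalent HTTP request from the device and render the returned captions below the camera view.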
From a bird's-eye view, the project architecture can be illustrated with the following diagram:
The input will be the camera feed captured on the smartphone, which is sent to an image caption generation model hosted as a web API. The model is hosted as a Docker...
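To illustrate what hosting the model as a web API might look like on the server side, here is a minimal sketch using Flask; the framework choice, the `/generate_caption` route, and the stubbed `generate_captions` helper are assumptions for illustration, not the project's actual implementation:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_captions(image_bytes, top_k=3):
    """Placeholder for the actual caption generation model."""
    # In the real project, this would run the trained image captioning
    # model on the uploaded image and return its top-k candidate captions.
    return ["caption one", "caption two", "caption three"][:top_k]

@app.route("/generate_caption", methods=["POST"])
def generate_caption():
    # Read the uploaded image from the "image" form field.
    image_bytes = request.files["image"].read()
    return jsonify({"captions": generate_captions(image_bytes)})

if __name__ == "__main__":
    # Inside a Docker container, the server would typically bind to 0.0.0.0
    # so the mapped port is reachable from outside the container.
    app.run(host="0.0.0.0", port=5000)
```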