An architecture like the one defined so far, which has no information about the class of the object it is localizing, is called a region proposal.
It is possible to perform object detection and localization using a single neural network. In fact, nothing stops us from adding a second head on top of the feature extractor and training it to classify the image, while at the same time training the regression head to regress the bounding box coordinates.
Solving multiple tasks at the same time is the goal of multitask learning.
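The two-headed idea can be illustrated with a minimal, framework-free sketch. It assumes the feature extractor has already produced a feature vector and only shows how a classification head and a regression head share those features and how their losses might be combined into one objective. The names `linear`, `init_layer`, `multitask_loss`, and `box_weight`, as well as the sizes and values used, are hypothetical choices for illustration and are not taken from the text.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def multitask_loss(features, class_head, box_head, true_class, true_box, box_weight=1.0):
    """Run both heads on the shared features and sum their losses."""
    # Classification head: class probabilities, scored with cross-entropy.
    class_probs = softmax(linear(features, *class_head))
    classification_loss = -math.log(class_probs[true_class])
    # Regression head: four box coordinates, scored with mean squared error.
    pred_box = linear(features, *box_head)
    regression_loss = sum((p - t) ** 2 for p, t in zip(pred_box, true_box)) / len(true_box)
    # One scalar objective; `box_weight` balances the two tasks (an assumption).
    total = classification_loss + box_weight * regression_loss
    return total, classification_loss, regression_loss


rng = random.Random(0)
NUM_FEATURES, NUM_CLASSES = 8, 3

# Stand-in for the output of the shared feature extractor.
features = [rng.random() for _ in range(NUM_FEATURES)]
class_head = init_layer(NUM_CLASSES, NUM_FEATURES, rng)
box_head = init_layer(4, NUM_FEATURES, rng)

total, cls_loss, reg_loss = multitask_loss(
    features, class_head, box_head, true_class=1, true_box=[0.1, 0.2, 0.6, 0.7]
)
print(f"classification={cls_loss:.4f} regression={reg_loss:.4f} total={total:.4f}")
```

In a real model, the gradient of the summed loss would flow through both heads into the shared feature extractor, which is what lets a single network learn both tasks at once. Summing the two losses with a fixed weight is only one possible way to combine them.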