Amazon Comprehend—Topic Modeling Guidelines
The most accurate results are obtained if you provide Comprehend with the largest possible corpus. More specifically:
- Use at least 1,000 documents in each topic modeling job.
- Each document should be at least three sentences long.
- If a document consists mostly of numeric data, remove it from the corpus.
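The guidelines above can be applied as a pre-processing pass before a corpus is uploaded. The following is a minimal sketch, not part of Comprehend itself: the helper names are hypothetical, the sentence splitter is deliberately naive, and the 50% threshold for "mostly numeric" is an assumption chosen for illustration.

```python
import re

MIN_DOCUMENTS = 1000      # guideline: at least 1,000 documents per job
MIN_SENTENCES = 3         # guideline: at least three sentences per document
MAX_NUMERIC_RATIO = 0.5   # assumption: "mostly numeric" = more than half the tokens

# Tokens made only of digits and common numeric punctuation, e.g. "12", "3.5%", "10:30"
NUMERIC_TOKEN = re.compile(r"[\d.,%$+\-/:]+")


def count_sentences(text):
    """Rough sentence count: split on runs of '.', '!' or '?'."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def is_mostly_numeric(text):
    """True if more than MAX_NUMERIC_RATIO of whitespace-separated tokens are numeric."""
    tokens = text.split()
    if not tokens:
        return True  # treat empty documents as unusable
    numeric = sum(
        1 for t in tokens
        if NUMERIC_TOKEN.fullmatch(t) and any(c.isdigit() for c in t)
    )
    return numeric / len(tokens) > MAX_NUMERIC_RATIO


def filter_corpus(documents):
    """Keep documents that are long enough and not mostly numeric."""
    return [
        d for d in documents
        if count_sentences(d) >= MIN_SENTENCES and not is_mostly_numeric(d)
    ]


def corpus_is_large_enough(documents):
    """Check the corpus against the 1,000-document guideline."""
    return len(documents) >= MIN_DOCUMENTS
```

A real pipeline would likely use a proper sentence tokenizer; the regular expression here miscounts abbreviations and decimal numbers, so treat the result as a rough screen only.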
Currently, Topic Modeling is limited to two document languages: English and Spanish.
A Topic Modeling job accepts input data in two formats (see Figure 3.1). This lets users process both collections of large documents (for example, newspaper articles or scientific journals) and collections of short documents (for example, tweets or social media posts).
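As a rough illustration of how the input format is chosen, the sketch below builds the request parameters for a topic modeling job. It assumes the Comprehend API's `ONE_DOC_PER_FILE` value (one document per file, suited to large documents) and `ONE_DOC_PER_LINE` value (one document per line, suited to short texts). The helper name, bucket paths, and role ARN are hypothetical placeholders, and the request is only constructed here, not submitted.

```python
def build_topics_job_request(input_s3_uri, output_s3_uri, role_arn,
                             one_doc_per_line, number_of_topics=10):
    """Build keyword arguments for a topic detection job (hypothetical helper)."""
    # Short texts such as tweets: one document per line.
    # Large texts such as articles: one document per file.
    input_format = "ONE_DOC_PER_LINE" if one_doc_per_line else "ONE_DOC_PER_FILE"
    return {
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": input_format},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "NumberOfTopics": number_of_topics,
    }


# Placeholder values for illustration only
request = build_topics_job_request(
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
    "arn:aws:iam::123456789012:role/ExampleComprehendRole",
    one_doc_per_line=True,
)

# With boto3 installed and AWS credentials configured, the request could be
# submitted along these lines:
#   import boto3
#   boto3.client("comprehend").start_topics_detection_job(**request)
```

Keeping the request as a plain dictionary makes the format choice easy to inspect or test before any AWS call is made.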
Input Format Options:
Output Format Options: