In this article, we will explore how to set up a PyCharm project and install the docx Python library to extract text from Word documents. The docx library is a Python package that allows us to read and write Microsoft Word (.docx) files and provides a convenient interface to access information stored in these files.
The first step is to create a new PyCharm project. This gives you a dedicated place to write and organize your Translation app code.
The Translation app uses three libraries:
openai: The openai library allows you to interact with the OpenAI API and perform various natural language processing tasks.
docx: The docx library allows you to read and write Microsoft Word (.docx) files using Python.
tkinter: The tkinter library is a built-in Python library that allows you to create graphical user interfaces (GUIs) for your desktop app.
As tkinter is a built-in library, it does not need to be installed; it already exists within your Python environment. To install the openai and docx libraries, open the PyCharm terminal by clicking View | Tool Windows | Terminal, and then run the following commands (the docx library is published on PyPI as python-docx):
pip install openai
pip install python-docx
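Once the commands finish, it can be useful to confirm that the libraries are visible to the interpreter PyCharm is using. The following is a minimal sketch, not part of the original project: is_installed is a hypothetical helper name, and the check relies on the import names (docx, not python-docx).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None


# Use import names here, not PyPI names: python-docx is imported as "docx".
for name in ("openai", "docx", "tkinter"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a library is reported as missing, the usual cause is that pip installed it into a different interpreter than the one selected for the project.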
To access and read the contents of a Word document, you will need to create a sample Word file inside your PyCharm project. Here are the steps to create a new Word file in PyCharm:
1. Create a folder named files in the project directory.
2. Right-click the files folder and select New | File.
3. Enter a file name with the .docx extension, for example, info.docx.
You can now add some text or content to this file, which we will later access and read using the docx library in Python. For this example, we have created an article about New York City. However, you can choose any Word document containing text that you want to analyze.
The United States' most populous city, often referred to as New York City or NYC, is New York. In 2020, its population reached
8,804,190 people across 300.46 square miles, making it the most densely populated major city in the country and over two times
more populous than the nation's second-largest city, Los Angeles. The city's population also exceeds that of 38 individual U.S.
states. Situated at the southern end of New York State, New York City serves as the Northeast megalopolis and New York
metropolitan area's geographic and demographic center - the largest metropolitan area in the country by both urban area and
population. Over 58 million people also live within 250 miles of the city. A significant influencer on commerce, health care and
life sciences, research, technology, education, politics, tourism, dining, art, fashion, and sports, New York City is a global
cultural, financial, entertainment, and media hub. It houses the headquarters of the United Nations, making it a significant
center for international diplomacy, and is often referred to as the world's capital.
Now that you have created the Word file inside your PyCharm project, you can move on to the next step, which is to create a new Python file called app.py inside the Translation App root directory. This file will contain the code to read and manipulate the contents of the Word file using the docx library. With the Word file and the Python file in place, you are ready to start writing the code to extract data from the document and use it in your application.
To test whether we can read Word files with the docx Python library, we can add the following code to our app.py file:
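import docx

# Open the Word document (replace the placeholder with a real path)
doc = docx.Document("<full_path_to_docx_file>")

# Collect the text of every paragraph into a single string
text = ""
for para in doc.paragraphs:
    text += para.text

print(text)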
Make sure to replace <full_path_to_docx_file> with the actual path to your Word document. To obtain the path, right-click your .docx file in PyCharm and select Copy Path/Reference… from the context menu.
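Instead of pasting an absolute path, the path can also be built in code. The sketch below assumes the layout described earlier (a files folder containing info.docx under the project root); sample_doc_path is a hypothetical helper name, not something the docx library provides.

```python
from pathlib import Path


def sample_doc_path(project_root):
    """Build the path to files/info.docx under the given project root."""
    return Path(project_root) / "files" / "info.docx"


# In app.py, the project root can be taken as the folder containing the script:
#   doc_path = sample_doc_path(Path(__file__).resolve().parent)
doc_path = sample_doc_path(Path.cwd())

print(doc_path)           # pass str(doc_path) to docx.Document
print(doc_path.exists())  # True only if the file is already there
```

If you do paste a Windows path directly, write it as a raw string (for example, r"C:\..."), because backslash sequences such as \U are otherwise treated as escape codes.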
Once you have done that, run the app.py file and verify the output. This code reads the contents of your Word document and prints them to the Run window. If the text extraction works correctly, you should see the text of your document printed in the console (see figure below). The text variable now holds the data from info.docx as a Python string.
Figure: Word text extraction console output
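Note that the extraction loop appends each paragraph with no separator, so in a multi-paragraph document the last word of one paragraph runs straight into the first word of the next. One possible variation is sketched below. The name join_paragraphs is hypothetical, and the helper accepts any iterable of objects with a .text attribute, which is what doc.paragraphs yields; the stand-in objects here only keep the example runnable without a Word file.

```python
from types import SimpleNamespace


def join_paragraphs(paragraphs, separator="\n"):
    """Join the text of paragraph-like objects, skipping blank paragraphs."""
    pieces = [p.text.strip() for p in paragraphs]
    return separator.join(piece for piece in pieces if piece)


# Stand-ins for the Paragraph objects that doc.paragraphs would yield.
sample = [
    SimpleNamespace(text="New York City is a global hub."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="It houses the United Nations headquarters."),
]
joined = join_paragraphs(sample)
print(joined)
```

With a real document, the call would be text = join_paragraphs(doc.paragraphs), using the doc object created in app.py.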
This section provided a step-by-step guide on how to set up a PyCharm project and install the docx Python library to extract text from Word documents. The section also included instructions on how to create a new Word file in PyCharm and use the docx library to read and manipulate its contents using Python.
Martin Yanev is an experienced Software Engineer who has worked in the aerospace and medical industries for over 8 years. He specializes in developing and integrating software solutions for air traffic control and chromatography systems. Martin is a well-respected instructor with over 280,000 students worldwide, and he is skilled in using frameworks like Flask, Django, Pytest, and TensorFlow. He is an expert in building, training, and fine-tuning AI systems with the full range of OpenAI APIs. Martin has dual master's degrees in Aerospace Systems and Software Engineering, which demonstrates his commitment to both practical and theoretical aspects of the industry.