Splitting sentences into clauses
When we work with text, we frequently deal with compound sentences (sentences with two parts that are equally important) and complex sentences (sentences with one part depending on another). It is sometimes useful to split these composite sentences into their component clauses for easier processing down the line. This recipe uses the dependency parse from the previous recipe.
Getting ready
You will only need the spacy package in this recipe.
How to do it…
We will work with two sentences, "He eats cheese, but he won't eat ice cream" and "If it rains later, we won't be able to go to the park". Other sentences may turn out to be more complicated to deal with, and I leave it as an exercise for you to split such sentences. Follow these steps:
- Import the spacy package:
    import spacy
- Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
- Set the sentence to "He eats cheese, but he won't eat ice cream":
    sentence = "He eats cheese, but he won't eat ice cream."
- Process the sentence with the spacy engine:
    doc = nlp(sentence)
- It is instructive to look at the structure of the input sentence by printing out the part of speech, dependency tag, ancestors, and children of each token. This can be accomplished using the following code:
    for token in doc:
        ancestors = [t.text for t in token.ancestors]
        children = [t.text for t in token.children]
        print(token.text, "\t", token.i, "\t", token.pos_, "\t",
              token.dep_, "\t", ancestors, "\t", children)
- We will use the following function to find the root token of the sentence, which is usually the main verb. In instances where there is a dependent clause, it is the verb of the independent clause:
    def find_root_of_sentence(doc):
        root_token = None
        for token in doc:
            if token.dep_ == "ROOT":
                root_token = token
        return root_token
- We will now find the root token of the sentence:
    root_token = find_root_of_sentence(doc)
- We can now use the following function to find the other verbs in the sentence:
    def find_other_verbs(doc, root_token):
        other_verbs = []
        for token in doc:
            ancestors = list(token.ancestors)
            if (token.pos_ == "VERB" and len(ancestors) == 1
                    and ancestors[0] == root_token):
                other_verbs.append(token)
        return other_verbs
- Use the preceding function to find the remaining verbs in the sentence:
    other_verbs = find_other_verbs(doc, root_token)
- We will use the following function to find the token spans for each verb:
    def get_clause_token_span_for_verb(verb, doc, all_verbs):
        first_token_index = len(doc)
        last_token_index = 0
        this_verb_children = list(verb.children)
        for child in this_verb_children:
            if child not in all_verbs:
                if child.i < first_token_index:
                    first_token_index = child.i
                if child.i > last_token_index:
                    last_token_index = child.i
        return (first_token_index, last_token_index)
- We will put together all the verbs in one array and process each using the preceding function. This will return a tuple of start and end indices for each verb's clause:
    token_spans = []
    all_verbs = [root_token] + other_verbs
    for other_verb in all_verbs:
        (first_token_index, last_token_index) = \
            get_clause_token_span_for_verb(other_verb, doc, all_verbs)
        token_spans.append((first_token_index, last_token_index))
- Using the start and end indices, we can now put together token spans for each clause. We sort the sentence_clauses list at the end so that the clauses are in the order they appear in the sentence:
    sentence_clauses = []
    for token_span in token_spans:
        start = token_span[0]
        end = token_span[1]
        if start < end:
            clause = doc[start:end]
            sentence_clauses.append(clause)
    sentence_clauses = sorted(sentence_clauses, key=lambda span: span.start)
- Now, we can print the final result of the processing for our initial sentence; that is, "He eats cheese, but he won't eat ice cream":
    clauses_text = [clause.text for clause in sentence_clauses]
    print(clauses_text)
The result is as follows:
    ['He eats cheese,', "he won't eat ice cream"]
Important note
The code in this section will work for some cases, but not others; I encourage you to test it out on different cases and amend the code.
How it works…
The way the code works is based on the way complex and compound sentences are structured. Each clause contains a verb, and one of the verbs is the main verb of the sentence (root). The code looks for the root verb, always marked with the ROOT dependency tag in spaCy processing, and then looks for the other verbs in the sentence.
The code then uses the information about each verb's children to find the left and right boundaries of the clause. Using this information, the code then constructs the text of the clauses. A step-by-step explanation follows.
In step 1, we import the spacy package and in step 2, we load the spacy engine. In step 3, we set the sentence variable and in step 4, we process it using the spacy engine. In step 5, we print out the dependency parse information, which will help us determine how to split the sentence into clauses.
In step 6, we define the find_root_of_sentence function, which returns the token that has a dependency tag of ROOT. In step 7, we find the root of the sentence we are using as an example.
In step 8, we define the find_other_verbs function, which will find other verbs in the sentence. In this function, we look for tokens that have the VERB part of speech tag and have the root token as their only ancestor. In step 9, we apply this function.
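To make the "root token as only ancestor" test concrete, here is a minimal sketch that does not need spaCy. It uses a hand-written toy parse of the example sentence (PARSE, with head indices chosen to be consistent with the output shown in this recipe); the parse and the ancestors helper are assumptions for illustration, not actual spaCy output or API.

```python
# Toy stand-in for a dependency parse: (text, pos, dep, head_index).
# Hand-written for illustration (an assumption), not real spaCy output.
PARSE = [
    ("He", "PRON", "nsubj", 1),
    ("eats", "VERB", "ROOT", 1),      # the root is its own head
    ("cheese", "NOUN", "dobj", 1),
    (",", "PUNCT", "punct", 1),
    ("but", "CCONJ", "cc", 1),
    ("he", "PRON", "nsubj", 8),
    ("wo", "AUX", "aux", 8),
    ("n't", "PART", "neg", 8),
    ("eat", "VERB", "conj", 1),
    ("ice", "NOUN", "compound", 10),
    ("cream", "NOUN", "dobj", 8),
    (".", "PUNCT", "punct", 8),
]


def ancestors(i):
    """Return the head indices from token i up to the root, nearest first."""
    chain = []
    while PARSE[i][3] != i:           # stop once we reach the root
        i = PARSE[i][3]
        chain.append(i)
    return chain


# The root is the token whose dependency label is ROOT.
root = next(i for i, (_, _, dep, _) in enumerate(PARSE) if dep == "ROOT")

# Other clause verbs: VERB tokens whose only ancestor is the root.
other_verbs = [
    i for i, (_, pos, _, _) in enumerate(PARSE)
    if pos == "VERB" and ancestors(i) == [root]
]
print([PARSE[i][0] for i in other_verbs])
```

With this toy parse, only eat qualifies. A verb nested more deeply (with two or more ancestors) would be skipped by the same test, which is one reason the recipe does not handle every sentence.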
In step 10, we define the get_clause_token_span_for_verb function, which will find the beginning and ending index of the clause for a given verb. The function goes through all the verb's children, skipping any that are themselves clause verbs; the leftmost remaining child's index is the beginning index, while the rightmost remaining child's index is the ending index for this verb's clause.
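The span computation can be sketched without spaCy as follows. TOKENS is a hand-written toy parse of the example sentence (each entry is a word and the index of its head), and VERBS lists the indices of the two clause verbs; both are assumptions for illustration rather than real parser output.

```python
# (text, head_index); hand-written toy parse (an assumption), not spaCy output.
TOKENS = [
    ("He", 1), ("eats", 1), ("cheese", 1), (",", 1), ("but", 1),
    ("he", 8), ("wo", 8), ("n't", 8), ("eat", 1),
    ("ice", 10), ("cream", 8), (".", 8),
]
VERBS = [1, 8]  # indices of "eats" (root) and "eat"


def children(verb):
    """Indices of the tokens whose direct head is the given verb."""
    return [i for i, (_, head) in enumerate(TOKENS) if head == verb and i != verb]


def clause_span(verb):
    """Leftmost and rightmost child index, ignoring other clause verbs."""
    kids = [i for i in children(verb) if i not in VERBS]
    # The defaults mirror the recipe's initial values when a verb has no children.
    return min(kids, default=len(TOKENS)), max(kids, default=0)


for verb in VERBS:
    print(TOKENS[verb][0], clause_span(verb))
```

Only direct children are considered, and the verb's own index is never included, so a verb with no children yields a span whose start is greater than its end.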
In step 11, we use the preceding function to find the clause indices for each verb. The token_spans variable contains the list of tuples, where the first tuple element is the beginning clause index and the second tuple element is the ending clause index.
In step 12, we create Span objects for each clause in the sentence using the list of beginning and ending index pairs we created in step 11. We get each Span object by slicing the Doc object and then append the resulting Span objects to a list. As a final step, we sort the list to make sure that the clauses in the list are in the same order as in the sentence.
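The slicing and sorting in this step can be illustrated with a plain Python list standing in for the Doc object. WORDS and the two index pairs below are hand-written assumptions matching the example sentence; the point is that a slice's end index is exclusive, so the token at the end index itself is not part of the clause.

```python
# Plain-list stand-in for the Doc; the tokenization is an assumption.
WORDS = ["He", "eats", "cheese", ",", "but",
         "he", "wo", "n't", "eat", "ice", "cream", "."]

# (start, end) pairs, deliberately out of sentence order.
token_spans = [(5, 11), (0, 4)]

clauses = []
for start, end in token_spans:
    if start < end:
        # The end index is exclusive, so WORDS[end] is left out.
        clauses.append((start, WORDS[start:end]))

# Sort by start index so clauses follow their order in the sentence.
clauses.sort(key=lambda pair: pair[0])
for start, words in clauses:
    print(start, words)
```

Here the tokens at indices 4 (but) and 11 (the final period) are the rightmost children of their verbs, and both fall just outside their slices.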
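In step 13, we print the clauses in our sentence. You will notice that the word but is missing. Its parent is the root verb eats, so it is not among the children of eat and falls outside the second clause, even though it reads as part of that clause. It is also absent from the first clause, because it is the rightmost child of eats and the end index of a slice is exclusive. The exercise of including but is left to you.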
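As one possible starting point for that exercise, the sketch below moves a clause's start back by one token when the preceding token is a coordinating conjunction (dependency label cc). It runs on a hand-written toy parse rather than real spaCy output, and extend_over_conjunction is a hypothetical helper, not part of the recipe's code.

```python
# (text, dependency label); hand-written toy parse (an assumption).
TOKENS = [
    ("He", "nsubj"), ("eats", "ROOT"), ("cheese", "dobj"), (",", "punct"),
    ("but", "cc"), ("he", "nsubj"), ("wo", "aux"), ("n't", "neg"),
    ("eat", "conj"), ("ice", "compound"), ("cream", "dobj"), (".", "punct"),
]


def extend_over_conjunction(start, end):
    """Move the clause start back one token if a conjunction precedes it."""
    if start > 0 and TOKENS[start - 1][1] == "cc":
        return start - 1, end
    return start, end


# (5, 11) is the span of the second clause in the toy parse.
start, end = extend_over_conjunction(5, 11)
words = [text for text, _ in TOKENS[start:end]]
print(words)
```

With spaCy tokens, the equivalent check would compare doc[start - 1].dep_ against "cc". This only covers a conjunction that sits directly before the clause, so other sentence shapes would need their own handling.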