Understanding the Output of str_locate_all
In R, the str_locate_all function from the stringr package finds all occurrences of a pattern in each element of a character vector. It returns a list with one element per input string; each element is an integer matrix with two columns, start and end, and one row per match. When the result is stored in a tibble (a type of data frame) it becomes a list column, which makes it awkward to extract the motif positions for manipulation or plotting.
In this article, we will explore how to deal with the output of str_locate_all in tibbles. We’ll create a function called make_vec that takes a sequence and a pattern as input and returns an integer vector marking the matched positions, which can serve as one row of a matrix.
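As a quick check of what the function returns, here is a minimal sketch. The input strings are made up for illustration and are not the article's data.

```r
library(stringr)

# Hypothetical example strings, not taken from the article's data
motif <- strrep("A", 7)
x <- c(paste0(motif, strrep("T", 7)), "CCCC")

loc <- str_locate_all(x, motif)

class(loc)        # a list, one element per input string
loc[[1]]          # integer matrix with columns "start" and "end"
nrow(loc[[2]])    # a string without a match gives a matrix with zero rows
```

Because the matrices can have different numbers of rows, the result cannot be collapsed into a single rectangular table without some reshaping.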
The Problem
The problem arises when a sequence contains several motifs, each with its own start and end positions. For example, let’s say we have a few sequences that contain the motifs AAAAAAA and TTTTTTT, such as:
AAAAAAATTTTTTT
When using str_locate_all on these sequences, we get a table like this:
| id | seq | AAAAAAA | TTTTTTT |
|---|---|---|---|
| 1 | … | | |
| 2 | … | | |
| 3 | … | | |
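Here, each motif column is a list column: every cell holds the matrix of start and end positions of that motif in the sequence. Because the number of matches and their positions differ from row to row, these cells cannot be used directly as ordinary numeric columns.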
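To see how such a table comes about, the following sketch builds a small tibble and stores the str_locate_all results in it. The sequences and the column names a_loc and t_loc are assumptions made for this example only.

```r
library(tibble)
library(dplyr)
library(stringr)

motif_a <- strrep("A", 7)
motif_t <- strrep("T", 7)

# Hypothetical sequences for illustration only
tbl <- tibble(
  id  = 1:3,
  seq = c(
    paste0(motif_a, motif_t),
    paste0("CC", motif_t, "GG", motif_a),
    "CCCCGGGG"
  )
) %>%
  mutate(
    a_loc = str_locate_all(seq, motif_a),  # list column of matrices
    t_loc = str_locate_all(seq, motif_t)
  )

# The number of matches differs from cell to cell
sapply(tbl$a_loc, nrow)
# The start/end matrix for the second sequence
tbl$a_loc[[2]]
```

Each cell has to be pulled out with [[ ]] before its positions can be used, which is what motivates turning the matches into fixed-length vectors.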
The Solution
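To solve this problem, we create a function called make_vec that takes a sequence and a pattern as input and returns a fixed-length integer vector (length 100 here) with 1 at every position covered by a match and 0 elsewhere. Each vector can then be used as one row of a matrix. Here’s how it works:
library(stringr)

make_vec <- function(seq, pattern) {
  # Find all occurrences of the pattern in the sequence (matrix of start/end)
  loc <- str_locate_all(seq, pattern)[[1]]
  # Create an integer vector of zeros to store the result
  # (assumes sequences are at most 100 characters long)
  res <- integer(100)
  # Expand each start position into all positions covered by the match
  # (assumes a literal, fixed-width pattern)
  idx <- as.integer(unlist(lapply(loc[, "start"], function(x) x + seq_len(nchar(pattern)) - 1L)))
  # Keep only indices that fit inside the result vector, then set them to 1
  res[idx[idx <= length(res)]] <- 1L
  # Return the result vector
  res
}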
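The index-expansion step inside make_vec is the least obvious part, so here it is in isolation on a small worked example. The start positions, motif width, and vector length are made-up values.

```r
# Hypothetical start positions of two matches and the motif width
starts <- c(3L, 12L)
width  <- 7L

# Each start s expands to the positions s, s + 1, ..., s + width - 1
idx <- unlist(lapply(starts, function(s) s + seq_len(width) - 1L))

# Mark the covered positions in a zero vector of length 20
res <- integer(20)
res[idx] <- 1L
res
```

Positions 3 to 9 and 12 to 18 are set to 1, and every other position stays 0.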
Creating a Table with Motif Positions
Next, we can use this function for every sequence and extract the results. Here’s how:
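library(dplyr)
library(purrr)

# `table` is the tibble of sequences shown above (columns id and seq)
# Create two new list columns in the table, one per motif
table2 <- table %>%
  mutate(
    a = map(seq, make_vec, pattern = "AAAAAAA"),
    t = map(seq, make_vec, pattern = "TTTTTTT")
  )
# Stack the per-sequence vectors into a matrix, one row per sequence
matrix(unlist(table2$a), nrow = nrow(table2), byrow = TRUE)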
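The conversion from a list of equal-length vectors to a matrix can be checked on a toy list. The sketch below uses made-up vectors and also shows do.call(rbind, ...), an alternative that avoids having to specify the number of rows.

```r
# Hypothetical list of equal-length indicator vectors, one per sequence
rows <- list(
  c(1L, 1L, 0L, 0L),
  c(0L, 0L, 1L, 1L),
  c(0L, 0L, 0L, 0L)
)

# byrow = TRUE is needed because unlist() concatenates the vectors end to end
m1 <- matrix(unlist(rows), nrow = length(rows), byrow = TRUE)

# Alternative: bind the vectors together as rows
m2 <- do.call(rbind, rows)

m1
```

Both approaches require every vector to have the same length, which make_vec guarantees by always returning a vector of fixed length.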
Plotting Motif Positions
Finally, we can reshape the results into long format, with one row per sequence position, and plot the motif positions using ggplot. Here’s how:
library(tidyr)
library(ggplot2)

# Add the sequence length and a position index for each element of the vectors
table3 <- table2 %>%
  mutate(
    seq_len = nchar(seq),
    pos = map(a, seq_along)
  )
# Expand the list columns so that each row is one position of one sequence
table4 <- unnest(table3, cols = c(a, t, pos))
# Plot the motif positions, one panel per sequence
ggplot(table4, aes(x = pos, y = a)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ id, ncol = 1) +
  theme_classic()
Conclusion
In this article, we learned how to deal with the output of str_locate_all in tibbles. We created a function called make_vec that takes a sequence and a pattern as input and returns an integer vector marking the matched positions. We then applied this function to every sequence, stored the results as list columns, and converted them to a matrix with one row per sequence. Finally, we plotted the motif positions using ggplot.
Last modified on 2024-05-26