Understanding the Output of str_locate_all in Tibbles: A Step-by-Step Guide to Manipulating Motif Positions

Understanding the Output of `str_locate_all`

In R, the str_locate_all function from the stringr package finds all occurrences of a pattern in each element of a character vector. It returns a list with one element per input string; each element is an integer matrix with two columns, start and end, and one row per match. When the results are stored in a tibble (a type of data frame), each cell of the resulting list column holds a whole matrix, which makes it awkward to extract values from the motif columns for manipulation or plotting.
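
A quick way to check the shape of the return value is to call the function on a couple of strings. The sketch below assumes the stringr package is installed; the two sequences are made up for illustration and do not come from the original data.

```r
library(stringr)

# Hypothetical sequences, invented for this illustration
seqs <- c("GGAAAAAAACC", "CCCC")

loc <- str_locate_all(seqs, "AAAAAAA")

# The result is a list with one element per input string
length(loc)

# Each element is an integer matrix with columns "start" and "end"
colnames(loc[[1]])
loc[[1]]

# A string with no match yields a matrix with zero rows
nrow(loc[[2]])
```

Because the result is a list of matrices rather than a flat table, it has to be unpacked before it can be used like ordinary columns.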

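In this article, we will explore how to deal with the output of str_locate_all in tibbles. We’ll create a function called make_vec that takes a sequence and a pattern as input and returns a 0/1 integer vector marking the positions covered by the pattern. Stacked together, these vectors form the rows of a matrix.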

The Problem

The problem arises when a sequence contains several motifs, or the same motif more than once, each at different start and end positions. For example, let’s say we have a table of sequences and want to locate the following two motifs in each of them:

AAAAAAA
TTTTTTT

When using str_locate_all on these sequences, we get a table like this:

id	seq	AAAAAAA	TTTTTTT
1	…
2	…
3	…

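Here, each cell of the AAAAAAA and TTTTTTT columns holds the start and end positions of that motif in the sequence. Because a motif can occur zero, one, or several times in a sequence, the cells differ in size from row to row, so they cannot be used directly as ordinary numeric columns.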
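
To make the shape of such a table concrete, here is a minimal sketch that stores the match matrices in a list column. It assumes the stringr and tibble packages are installed; the tibble name tbl and its three sequences are hypothetical stand-ins for the elided data above.

```r
library(stringr)
library(tibble)

# Hypothetical table; the sequences are invented for illustration
tbl <- tibble(
  id  = 1:3,
  seq = c("GGAAAAAAACC", "AAAAAAATTTTTTTAAAAAAA", "CCCC")
)

# Store the match matrices for one motif as a list column
tbl$AAAAAAA <- str_locate_all(tbl$seq, "AAAAAAA")

# The number of matches differs from row to row
n_matches <- sapply(tbl$AAAAAAA, nrow)
n_matches

# Values have to be pulled out of each cell's matrix individually
starts_2 <- tbl$AAAAAAA[[2]][, "start"]
starts_2
```

Every cell has to be indexed twice, once to pick the matrix and once to pick the column, which is what makes direct manipulation cumbersome.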

The Solution

To solve this problem, we create a function called make_vec that takes a sequence and a pattern as input and returns a fixed-length integer vector with 1 at every position covered by a match and 0 elsewhere. Here’s how it works:

library(stringr)

make_vec <- function(seq, pattern, len = 100) {
  # Find all occurrences of the pattern in the sequence;
  # str_locate_all returns a list, so take the matrix for this one string
  loc <- str_locate_all(seq, pattern)[[1]]
  
  # Integer vector of zeros, one element per position
  # (assumes sequences are at most `len` characters long)
  res <- integer(len)
  
  # Expand each match into the full run of positions from start to end
  idx <- as.integer(unlist(Map(seq.int, loc[, "start"], loc[, "end"])))
  
  # Mark covered positions with 1, ignoring any position beyond `len`
  res[idx[idx <= len]] <- 1L
  
  # Return the result vector
  res
}
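
The index-expansion step is the least obvious part of the function, so here it is in isolation as a base R sketch. The start positions, motif width, and vector length are hypothetical values chosen for illustration.

```r
# Hypothetical match starts and motif width, chosen for illustration
starts <- c(3L, 15L)
width  <- 7L

# Each start expands to the run start, start + 1, ..., start + width - 1;
# sapply builds a width-by-matches matrix, which as.integer flattens
idx <- as.integer(sapply(starts, function(x) x + seq_len(width) - 1L))
idx

# Mark those positions in a zero vector of fixed length
res <- integer(25)
res[idx] <- 1L
res
```

Using a fixed length for every vector is what later allows the vectors from different sequences to be stacked into a rectangular matrix.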

Creating a Table with Motif Positions

Next, we can use this function for every sequence and extract the results. Here’s how:

library(dplyr)
library(purrr)

# Add two list columns holding the 0/1 position vector for each motif
table2 <- table %>% 
  mutate(
    a = map(seq, make_vec, pattern = "AAAAAAA"),
    t = map(seq, make_vec, pattern = "TTTTTTT")
  )

# Stack the vectors for one motif into a matrix, one row per sequence
matrix(unlist(table2$a), nrow = nrow(table2), byrow = TRUE)
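
The conversion from a list of equal-length vectors to a matrix can be checked on a small example. This base R sketch uses three made-up 0/1 vectors; it also shows do.call(rbind, ...) as an alternative that does not need the row count.

```r
# Hypothetical 0/1 vectors of equal length, invented for illustration
vecs <- list(
  c(1L, 1L, 0L, 0L),
  c(0L, 0L, 0L, 0L),
  c(0L, 1L, 1L, 0L)
)

# unlist() concatenates the vectors; byrow = TRUE puts each one in its own row
m1 <- matrix(unlist(vecs), nrow = length(vecs), byrow = TRUE)

# Alternative: bind the vectors together row by row
m2 <- do.call(rbind, vecs)

dim(m1)
all(m1 == m2)
```

Without byrow = TRUE the values would be filled column by column, which would scramble positions across sequences.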

Plotting Motif Positions

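Finally, we can add the sequence length, reshape the position vectors into long format (one row per sequence position), and plot the motif positions using ggplot. Here’s how:

library(dplyr)
library(tidyr)
library(ggplot2)

# Create a new column in the table for sequence length
table3 <- table2 %>% 
  mutate(
    seq_len = nchar(seq)
  )

# Reshape to long format and drop the padding beyond each sequence's length
table4 <- table3 %>% 
  select(id, seq_len, a) %>% 
  unnest(a) %>% 
  group_by(id) %>% 
  mutate(pos = row_number()) %>% 
  ungroup() %>% 
  filter(pos <= seq_len)

# Plot the motif positions, one panel per sequence
ggplot(table4, aes(x = pos, y = a)) +
  geom_col() +
  facet_wrap(~ id, ncol = 1) +
  theme_classic()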
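
ggplot2 generally works best with long-format data, where each row is one observation. As a rough sketch of that reshaping step in base R, the example below turns a small 0/1 matrix into a data frame with one row per sequence position. The matrix values and the column names id, pos, and motif are assumptions made for illustration.

```r
# Hypothetical 0/1 motif matrix: one row per sequence, one column per position
m <- rbind(
  c(1L, 1L, 0L, 0L),
  c(0L, 0L, 0L, 0L),
  c(0L, 1L, 1L, 0L)
)

# row(m) and col(m) give the row and column index of every cell
long <- data.frame(
  id    = as.vector(row(m)),
  pos   = as.vector(col(m)),
  motif = as.vector(m)
)

head(long)
```

A data frame in this shape can be passed to ggplot with pos on the x axis and motif on the y axis.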

Conclusion

In this article, we learned how to deal with the output of str_locate_all in tibbles. We created a function called make_vec that takes a sequence and a pattern as input and returns a 0/1 vector of motif positions, which forms one row of a matrix. We then applied this function to every sequence, stored the results as list columns in the table, and plotted the motif positions using ggplot.

Last modified on 2024-05-26