## Finding Conserved Structure in Time Series Sets

STUMPY is a powerful and scalable Python library for modern time series analysis and, at its core, efficiently computes something called a matrix profile. The goal of this multi-part series is to explain what the matrix profile is and how you can start leveraging STUMPY for all of your modern time series data mining tasks!

Note: These tutorials were originally featured in the STUMPY documentation.

Part 1: The Matrix Profile
Part 2: STUMPY Basics
Part 3: Time Series Chains
Part 4: Semantic Segmentation
Part 5: Fast Approximate Matrix Profiles with STUMPY
Part 6: Matrix Profiles for Streaming Time Series Data
Part 7: Fast Pattern Searching with STUMPY
Part 8: AB-Joins with STUMPY
Part 9: Time Series Consensus Motifs

This tutorial utilizes the main takeaways from the Matrix Profile XV paper.

Matrix profiles can be used to find conserved patterns within a single time series (self-join) and across two time series (AB-join). In both cases, these conserved patterns are often called “motifs”. And, when considering a set of three or more time series, one common trick for identifying a conserved motif across the entire set is to:

1. Append a `np.nan` to the end of each time series. This is used to identify the boundary between neighboring time series and ensures that any identified motif will not straddle multiple time series.
2. Concatenate all of the time series into a single long time series
3. Compute the matrix profile (self-join) on the aforementioned concatenated time series

However, this is not guaranteed to find patterns that are conserved across all of the time series within the set. This idea of a finding a conserved motif that is common to all of the time series in a set is referred to as a “consensus motif”. In this tutorial, we will introduce the “Ostinato” algorithm, which is an efficient way to find the consensus motif amongst a set of time series.

Let’s import the packages that we’ll need to load, analyze, and plot the data.

`%matplotlib inlineimport stumpyimport matplotlib.pyplot as pltimport numpy as npimport pandas as pdfrom itertools import cycle, combinationsfrom matplotlib.patches import Rectanglefrom scipy.cluster.hierarchy import linkage, dendrogramfrom scipy.special import combfig_size = plt.rcParams["figure.figsize"]fig_size[0] = 20fig_size[1] = 6plt.rcParams["figure.figsize"] = fig_sizeplt.rcParams['xtick.direction'] = 'out'`

In the following dataset, a volunteer was asked to “spell” out different Japanese sentences by performing eye movements that represented writing strokes of individual Japanese characters. Their eye movements were recorded by an electrooculograph (EOG) and they were given one second to “visually trace” each Japanese character. For our purposes we’re only using the vertical eye positions and, conceptually, this basic example reproduced Figure 1 and Figure 2 of the Matrix Profile XV paper.

`sentence_idx = [6, 7, 9, 10, 16, 24]Ts = [None] * len(sentence_idx)fs = 50  # eog signal was downsampled to 50 Hzfor i, s in enumerate(sentence_idx):Ts[i] = pd.read_csv(f'https://zenodo.org/record/4288978/files/EOG_001_01_{s:03d}.csv?download=1').iloc[:, 0].values# the literal sentencessentences = pd.read_csv(f'https://zenodo.org/record/4288978/files/test_sent.jp.csv?download=1', index_col=0)`

Below, we plotted six time series that each represent the vertical eye position while a person “wrote” Japanese sentences using their eyes. As you can see, some of the Japanese sentences are longer and contain more words while others are shorter. However, there is one common Japanese word (i.e., a “common motif”) that is contained in all six examples. Can you spot the one second long pattern that is common across these six time series?

`def plot_vertical_eog():fig, ax = plt.subplots(6, sharex=True, sharey=True)prop_cycle = plt.rcParams['axes.prop_cycle']colors = cycle(prop_cycle.by_key()['color'])for i, e in enumerate(Ts):ax[i].plot(np.arange(0, len(e)) / fs, e, color=next(colors))ax[i].set_ylim((-330, 1900))plt.subplots_adjust(hspace=0)plt.xlabel('Time (s)')return axplot_vertical_eog()plt.suptitle('Vertical Eye Position While Writing Different Japanese Sentences', fontsize=14)plt.show()`

To find out, we can use the `stumpy.ostinato` function to help us discover the “consensus motif” by passing in the list of time series, `Ts`, along with the subsequence window size, `m`:

`m = fsradius, Ts_idx, subseq_idx = stumpy.ostinato(Ts, m)print(f'Found Best Radius {np.round(radius, 2)} in time series {Ts_idx} starting at subsequence index location {subseq_idx}.')Found Best Radius 0.87 in time series 4 starting at subsequence index location 1271.`

Now, Let’s plot the individual subsequences from each time series that correspond to the matching consensus motif:

`seed_motif = Ts[Ts_idx][subseq_idx : subseq_idx + m]x = np.linspace(0,1,50)nn = np.zeros(len(Ts), dtype=np.int64)nn[Ts_idx] = subseq_idxfor i, e in enumerate(Ts):if i != Ts_idx:nn[i] = np.argmin(stumpy.core.mass(seed_motif, e))lw = 1label = Noneelse:lw = 4label = 'Seed Motif'plt.plot(x, e[nn[i]:nn[i]+m], lw=lw, label=label)plt.title('The Consensus Motif')plt.xlabel('Time (s)')plt.legend()plt.show()`

There is a striking similarity between the subsequences. The most central “seed motif” is plotted with a thicker purple line.

When we highlight the above subsequences in their original context (light blue boxes below), we can see that they occur at different times:

`ax = plot_vertical_eog()for i in range(len(Ts)):y = ax[i].get_ylim()r = Rectangle((nn[i] / fs, y[0]), 1, y[1]-y[0], alpha=0.3)ax[i].add_patch(r)plt.suptitle('Vertical Eye Position While Writing Different Japanese Sentences', fontsize=14)plt.show()`

The discovered conserved motif (light blue boxes) correspond to writing the Japanese character `ア`, which occurs at different times in different example sentences.

In this next example, we’ll reproduce Figure 9 from the Matrix Profile XV paper.

Mitochondrial DNA (mtDNA) has been successfully used to determine evolutionary relationships between organisms (phylogeny). Since DNAs are essentially ordered sequences of letters, we can loosely treat them as time series and use all of the available time series tools.

`animals = ['python', 'hippo', 'red_flying_fox', 'alpaca']data = {}for animal in animals:data[animal] = pd.read_csv(f"https://zenodo.org/record/4289120/files/{animal}.csv?download=1").iloc[:,0].valuescolors = {'python': 'tab:blue', 'hippo': 'tab:green', 'red_flying_fox': 'tab:purple', 'alpaca': 'tab:red'}`

Naively, using `scipy.cluster.hierarchy` we can cluster the mtDNAs based on the majority of the sequences. A correct clustering would place the two “artiodactyla”, hippo and alpaca, closest and, together with the red flying fox, we would expect them to form a cluster of “mammals”. Finally, the python, a “reptile”, should be furthest away from all of the “mammals”.

`fig, ax = plt.subplots(ncols=2)truncate = 15000for k, v in data.items():ax[0].plot(v[:truncate], label=k, color=colors[k])ax[0].legend()ax[0].set_xlabel('Number of mtDNA Base Pairs')ax[0].set_title('mtDNA Sequences')truncate = 16000dp = np.zeros(int(comb(4, 2)))for i, a_c in enumerate(combinations(data.keys(), 2)):dp[i] = stumpy.core.mass(data[a_c[0]][:truncate], data[a_c[1]][:truncate])Z = linkage(dp, optimal_ordering=True)dendrogram(Z, labels=[k for k in data.keys()], ax=ax[1])ax[1].set_ylabel('Z-Normalized Euclidean Distance')ax[1].set_title('Clustering')plt.show()`

Uh oh, the clustering is clearly wrong! Amongst other problems, the alpaca (a mammal) should not be most closely related to the python (a reptile).

In order to obtain the correct relationships, we need to identify and then compare the parts of the mtDNA that is the most conserved across the mtDNA sequences. In other words, we need to cluster based on their consensus motif. Let’s limit the subsequence window size to 1,000 base pairs and identify the consensus motif again using the `stumpy.ostinato` function:

`m = 1000bsf_radius, bsf_Ts_idx, bsf_subseq_idx = stumpy.ostinato(list(data.values()), m)print(f'Found best radius {np.round(bsf_radius, 2)} in time series {bsf_Ts_idx} starting at subsequence index location {bsf_subseq_idx}.')Found best radius 2.73 in time series 1 starting at subsequence index location 602.`

Now, let’s perform the clustering again but, this time, using the consensus motif:

`consensus_motifs = {}best_motif = list(data.items())[bsf_Ts_idx][1][bsf_subseq_idx : bsf_subseq_idx + m]for i, (k, v) in enumerate(data.items()):if i == bsf_Ts_idx:consensus_motifs[k] = best_motifelse:idx = np.argmin(stumpy.core.mass(best_motif, v))consensus_motifs[k] = v[idx : idx + m]fig, ax = plt.subplots(ncols=2)# plot the consensus motifsfor animal, motif in consensus_motifs.items():ax[0].plot(motif, label=animal, color=colors[animal])ax[0].legend()# cluster consensus motifsdp = np.zeros(int(comb(4, 2)))for i, motif in enumerate(combinations(list(consensus_motifs.values()), 2)):dp[i] = stumpy.core.mass(motif[0], motif[1])Z = linkage(dp, optimal_ordering=True)dendrogram(Z, labels=[k for k in consensus_motifs.keys()])ax[0].set_title('Consensus mtDNA Motifs')ax[0].set_xlabel('Number of mtDNA Base Pairs')ax[1].set_title('Clustering Using the Consensus Motifs')ax[1].set_ylabel('Z-normalized Euclidean Distance')plt.show()`

Now this looks much better! Hierarchically, the python is “far away” from the other mammals and, amongst the mammalia, the red flying fox (a bat) is less related to both the alpaca and the hippo which are the closest evolutionary relatives in this set of animals.

And that’s it! You have now learned how to search for a consensus motif amongst a set of times series using the awesome `stumpy.ostinato` function. You can now import this package and use it in your own projects. Happy coding!

Matrix Profile XV

STUMPY Matrix Profile Documentation

STUMPY Matrix Profile Github Code Repository