Suffix arrays in optimal compressed space and δ-SA with Tomasz Kociumaka and Dominik Kempa

Released Friday, 29th September 2023

Good episode? Give it some love!

Suffix arrays in optimal compressed space and δ-SA with Tomasz Kociumaka and Dominik Kempa

Friday, 29th September 2023

Good episode? Give it some love!

Rate Episode

Today on the podcast we have Tomasz Kociumaka and Dominik Kempa,the authors of the preprintCollapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space.

The suffix array is one of the foundational data structures in bioinformatics,serving as an index that allows fast substring searches in a large text.However, in its raw form, the suffix array occupies the space proportional to (andseveral times larger than) the original text.

In their paper, Tomasz and Dominik construct a new index, δ-SA, which on theone hand can be used in the same way (answer the same queries) as the suffixarray and the inverse suffix array, and on the other hand, occupies the spaceroughly proportional to the gzip’ed text (or, more precisely, to the measure δthat they define — hence the name).

Moreover, they mathematically prove that this index is optimal, in the sensethat any index that supports these queries — or even much weaker queries, suchas simply accessing the i-th character of the text — cannot be significantlysmaller (as a function of δ) than δ-SA.

Links: