Building a Semantic Search for Subtitles with Python, Streamlit, and PostgreSQL

Since I’ve already got the tools to create GIFs, I needed a way to find the right clips in videos. Take a movie like Office Space: I remember a scene where Peter is talking to his neighbour about getting out of working the weekend, and the phrase “I’m doing the drywall at the new McDonald’s” was in there somewhere. That could very well be the exact line, but I don’t know for sure, and how would you search for it if you didn’t?

Semantic search is a great fit for this. Instead of strict keyword matching, it compares the meaning of sentences, so as long as you remember the gist of a line from a scene, you can find it.

So we need the text from the videos in order to search them. You could get fancy and use something like Whisper to generate a transcript from the audio. But why generate it when it already exists?

That’s where subtitles come in. There are plenty of them on the internet. They may not match your video exactly, since they were probably timed for a torrent release, but they can get you within seconds of the right spot in your own media library.

I won’t go over every piece of code, since it’s still changing as I improve this personal project, but I’ll cover the necessities. The first is vectorization.

I use sentence_transformers to handle the vectorizing. You simply pick a model by name and encode a string into a vector. What’s important is that the vector’s size matches your database column definition. Here’s a quick example using sentence_transformers:

from sentence_transformers import SentenceTransformer

class SubtitleParser:
    def __init__(self):
        # Load the pre-trained model
        self.model = SentenceTransformer('all-mpnet-base-v2')

    def vectorize_text(self, text):
        # Generate a vector for the subtitle text
        vector = self.model.encode(text)
        return vector
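
As a quick sanity check, you can confirm the output size matches the column definition we’ll create next (this snippet just exercises the SubtitleParser class above):

parser = SubtitleParser()
vector = parser.vectorize_text("I'm doing the drywall at the new McDonald's")
print(len(vector))  # 768 for all-mpnet-base-v2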

This model produces vectors with 768 dimensions, so we need a table that can hold them. The vector column type comes from the pgvector extension, which has to be enabled first. Here’s an example PostgreSQL script to create such a table:

-- Enable the pgvector extension, which provides the vector type
CREATE EXTENSION IF NOT EXISTS vector;

-- Table to store subtitle data with vectorization
CREATE TABLE subtitles (
    id SERIAL PRIMARY KEY,
    episode_id INT,  -- Nullable because movies won't have episodes
    media_id INT,  -- Nullable because TV show subtitles link through episodes instead
    start_time TEXT NOT NULL,
    end_time TEXT NOT NULL,
    text TEXT NOT NULL,
    vector vector(768),  -- Vectorized representation of the subtitle text
    FOREIGN KEY (episode_id) REFERENCES episodes(id) ON DELETE CASCADE,
    FOREIGN KEY (media_id) REFERENCES media(id) ON DELETE CASCADE
);
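
Once the table grows to a full library’s worth of subtitles, sequential scans over the vectors get slow. pgvector supports approximate nearest-neighbour indexes; as one option (my own addition here, assuming a recent pgvector version), an HNSW index on cosine distance looks like this:

-- Optional: approximate nearest-neighbour index for cosine distance
CREATE INDEX ON subtitles USING hnsw (vector vector_cosine_ops);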

You can find many resources on how to insert data into tables, so I’ll keep it brief. A minimal insert might look like the rough sketch below (assuming psycopg2, though any DB-API driver works similarly; pgvector accepts a bracketed string cast to vector):
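
def insert_subtitle(self, media_id, start_time, end_time, text, vector):
    """Sketch of inserting one subtitle row; assumes a psycopg2 connection."""
    cursor = self.conn.cursor()

    # pgvector accepts a bracketed string like '[0.1,0.2,...]' cast to vector
    vector_str = "[" + ",".join(map(str, vector)) + "]"

    cursor.execute(
        "INSERT INTO subtitles (media_id, start_time, end_time, text, vector) "
        "VALUES (%s, %s, %s, %s, %s::vector);",
        (media_id, start_time, end_time, text, vector_str),
    )
    self.conn.commit()
    cursor.close()

Querying, however, is slightly different from a simple SELECT statement. Check it out: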

def search_subtitles(self, query_vector, top_n=10):
    """Perform a vector similarity search on the subtitles using pgvector."""
    cursor = self.conn.cursor()

    # Convert the query vector to the bracketed string pgvector expects
    query_vector_str = "[" + ",".join(map(str, query_vector)) + "]"

    # SQL query to find the closest matches; the vector is passed as a
    # parameter rather than interpolated into the SQL string
    cursor.execute("""
        SELECT subtitles.text, media.title, subtitles.start_time, subtitles.end_time, media.media_type,
            1 - (subtitles.vector <=> %s::vector) AS similarity_score
        FROM subtitles
        JOIN media ON media.id = subtitles.media_id
        ORDER BY subtitles.vector <=> %s::vector
        LIMIT %s;
    """, (query_vector_str, query_vector_str, top_n))

    # Fetch results and create a DataFrame (requires `import pandas as pd`)
    results = cursor.fetchall()
    df = pd.DataFrame(results, columns=['Subtitle', 'Media Title', 'Start Time', 'End Time', 'Media Type', 'Similarity Score'])

    cursor.close()
    return df

This compares your query vector (encode the text you want to search for into a vector before calling this) against each vector in the subtitles table. The <=> operator is pgvector’s cosine distance, so 1 minus that distance gives a similarity score, and ordering by the distance returns the closest matches first.
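
If the operator feels opaque, here’s roughly what it computes, written out in plain numpy (just an illustration, not what pgvector executes internally):

import numpy as np

def cosine_distance(a, b):
    # pgvector's <=> is cosine distance: 1 - cosine similarity
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))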

Now we can put it all together with Streamlit, a great framework for quick prototypes. Here’s a short example of using it to fetch the subtitles:

import streamlit as st
from sentence_transformers import SentenceTransformer

# User input to search for subtitles
query = st.text_input("Enter your query:")

if st.button("Search") and query:
    # Encode the query into a vector for searching
    model = SentenceTransformer('all-mpnet-base-v2')
    query_vector = model.encode(query)

    # Perform similarity search via pgvector (db_manager wraps the DB methods above)
    query_results = db_manager.search_subtitles(query_vector)

    # Display search results in a DataFrame
    if not query_results.empty:
        st.write(query_results)
    else:
        st.write("No results found.")
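
One optional refinement: the snippet above reloads the model on every click. Streamlit’s st.cache_resource decorator can keep it in memory across reruns (a small tweak, assuming a reasonably recent Streamlit version):

@st.cache_resource
def load_model():
    # Loaded once and reused across Streamlit reruns
    return SentenceTransformer('all-mpnet-base-v2')

model = load_model()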

I will share my personal project shortly, but in the meantime I hope this helps some people get started with semantic search, a very powerful tool that also pairs well with Retrieval Augmented Generation, which I plan to play with more in the future.

