{"id":83,"date":"2024-09-14T05:39:30","date_gmt":"2024-09-14T05:39:30","guid":{"rendered":"https:\/\/kylehayhurst.com\/?p=83"},"modified":"2024-09-14T05:39:30","modified_gmt":"2024-09-14T05:39:30","slug":"building-a-semantic-search-for-subtitles-with-python-streamlit-and-postgresql","status":"publish","type":"post","link":"https:\/\/kylehayhurst.com\/?p=83","title":{"rendered":"Building a Semantic Search for Subtitles with Python, Streamlit, and PostgreSQL"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Since I&#8217;ve got the tools to create GIFs, I needed a way to find these clips in videos. In a movie for instance, Office Space, I remember a scene where Peter was talking to his neighbour about getting out of working the weekend. The phrase &#8220;I&#8217;m doing the drywall at the new McDonald&#8217;s&#8221; was somewhere in there. I don&#8217;t know the exact line, which that could very well be it, but how would you search for it if you didn&#8217;t.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Semantic search is a great solution for this. It tries to get the meaning of sentences instead of a strict keyword search. This way, if you get the gist of the sentence you&#8217;re looking for in a scene, you can find it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So we need the text from the videos in order to search them. You could get fancy and use something like Whisper in order to generate the script from the video. But, why generate it when it exists already.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That&#8217;s where the subtitles come in. There&#8217;s plenty of them on the internet. They may not match your video exactly because they were probably meant for a torrent, but they can get you within seconds of your own library media.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I won&#8217;t go over every piece of code because it&#8217;s currently changing as I improve this personal project, but I&#8217;ll go over some of the necessities. The first is vectorization.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I use sentence_transformers to handle the vectorizing. You simply use the model name you wish to use and encode a vector from a string. What&#8217;s important is to match the size of the vector with your database table definition. Here&#8217;s a quick example of using the sentence_transformers:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">from sentence_transformers import SentenceTransformer\n\nclass SubtitleParser:\n    def __init__(self):\n        # Load the pre-trained model\n        self.model = SentenceTransformer(&#039;all-mpnet-base-v2&#039;)\n\n    def vectorize_text(self, text):\n        # Generate a vector for the subtitle text\n        vector = self.model.encode(text)\n        return vector<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This example creates a vector with a size of 768, so we need a table that can hold it. Here&#8217;s an example of a PostgreSQL script to create such table:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">-- Table to store subtitle data with vectorization\nCREATE TABLE subtitles (\n    id SERIAL PRIMARY KEY,\n    episode_id INT,  -- Nullable because movies won&#039;t have episodes\n    media_id INT,  -- Nullable because TV shows have episodes\n    start_time TEXT NOT NULL,\n    end_time TEXT NOT NULL,\n    text TEXT NOT NULL,\n    vector vector(768),  -- Vectorized representation of the subtitle text\n    FOREIGN KEY (episode_id) REFERENCES episodes(id) ON DELETE CASCADE,\n    FOREIGN KEY (media_id) REFERENCES media(id) ON DELETE CASCADE\n);<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You can find many resources on how to insert data into tables so I&#8217;ll skip that. Querying however is slightly different than simple select statements. Check it out:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\">def search_subtitles(self, query_vector, top_n=10):\n    &quot;&quot;&quot;Perform a vector similarity search on the subtitles using pg_vector.&quot;&quot;&quot;\n    cursor = self.conn.cursor()\n\n    # Convert the query vector to a proper string representation with brackets\n    query_vector_str = &quot;[&quot; + &quot;,&quot;.join(map(str, query_vector)) + &quot;]&quot;\n\n    # SQL query to find the closest matches\n    cursor.execute(f&quot;&quot;&quot;\n        SELECT subtitles.text, media.title, subtitles.start_time, subtitles.end_time, media.media_type, \n            1 - (subtitles.vector &lt;=&gt; &#039;{query_vector_str}&#039;::vector) AS similarity_score\n        FROM subtitles\n        JOIN media ON media.id = subtitles.media_id\n        ORDER BY subtitles.vector &lt;=&gt; &#039;{query_vector_str}&#039;::vector\n        LIMIT %s;\n    &quot;&quot;&quot;, (top_n,))\n\n    # Fetch results and create a DataFrame\n    results = cursor.fetchall()\n    df = pd.DataFrame(results, columns=[&#039;Subtitle&#039;, &#039;Media Title&#039;, &#039;Start Time&#039;, &#039;End Time&#039;, &#039;Media Type&#039;, &#039;Similarity Score&#039;])\n\n    cursor.close()\n    return df\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This will take your query vector (the text you want to search with, create a vector of it before comparison) with each vector in the subtitles table.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now we can put it all together with streamlit. Streamlit is a great framework for creating prototypes. Here&#8217;s a quick example of using it to fetch the subtitles:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"\" data-line=\"\"># User input to search for subtitles\nquery = st.text_input(&quot;Enter your query:&quot;)\n\nif st.button(&quot;Search&quot;) and query:\n    # Query vector for searching\n    model = SentenceTransformer(&#039;all-mpnet-base-v2&#039;)\n    query_vector = model.encode(query)\n\n    # Perform search on pg_vector with vector similarity\n    query_results = db_manager.search_subtitles(query_vector)\n\n    # Display search results in a DataFrame\n    if not query_results.empty:\n        st.write(query_results)\n    else:\n        st.write(&quot;No results found.&quot;)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I will share my personal project on this shortly, but in the mean time I hope this helps some people get started with semantic search, a very powerful tool that can be used with Retrieval Augmented Generation, which I play more with in the future.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since I&#8217;ve got the tools to create GIFs, I needed a way to find these clips in videos. In a movie for instance, Office Space, I remember a scene where Peter was talking to his neighbour about getting out of working the weekend. The phrase &#8220;I&#8217;m doing the drywall at the new McDonald&#8217;s&#8221; was somewhere [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[14,13,12,11,15],"class_list":["post-83","post","type-post","status-publish","format-standard","hentry","category-semantic-search","tag-postgresql","tag-python","tag-semantic-search","tag-streamlit","tag-vector"],"_links":{"self":[{"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=\/wp\/v2\/posts\/83","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=83"}],"version-history":[{"count":2,"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=\/wp\/v2\/posts\/83\/revisions"}],"predecessor-version":[{"id":85,"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=\/wp\/v2\/posts\/83\/revisions\/85"}],"wp:attachment":[{"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=83"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=83"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kylehayhurst.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=83"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}