Author | Liu Guangdong, Apache SeaTunnel Committer
Existing book search solutions, such as those used in public libraries, rely heavily on keyword matching rather than on a semantic understanding of what a book title actually means. As a result, search results often fail to meet our needs, or differ vastly from what we expect. Keyword matching alone cannot capture meaning, so it cannot recognize the searcher’s true intent.
So, is there a way to search for books more accurately and efficiently? The answer is yes! In this article, I will show how to combine Apache SeaTunnel, Milvus, and OpenAI for similarity search, achieving a semantic understanding of the entire book title and making search results more accurate.
Using trained models to represent input data as vectors, and then searching by meaning rather than by exact wording, is called semantic search. This approach extends to many text-based use cases, including anomaly detection and document search. The technology introduced in this article can therefore noticeably improve book search.
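To make the difference between keyword matching and semantic search concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-made, hypothetical values chosen only for illustration; they do not come from a real model, and the helper names `keyword_match` and `cosine_similarity` are my own. A real embedding model would produce vectors with many more dimensions.

```python
import math


def keyword_match(query, title):
    """Return True only if every query word appears literally in the title."""
    title_words = set(title.lower().split())
    return all(word in title_words for word in query.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumed values, not produced by any model).
toy_embeddings = {
    "feline care handbook": [0.9, 0.1, 0.0],
    "introduction to tax law": [0.0, 0.2, 0.9],
}
query = "cat guide"
query_vector = [0.8, 0.2, 0.1]  # assumed embedding of the query

for title, vector in toy_embeddings.items():
    matched = keyword_match(query, title)
    similarity = cosine_similarity(query_vector, vector)
    print(f"{title!r}: keyword match={matched}, similarity={similarity:.3f}")
```

Neither title contains the words "cat" or "guide", so keyword matching finds nothing, while the vector comparison still ranks the cat-related title well above the unrelated one.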
Before we start, I will briefly introduce several concepts, tools, and platforms that the rest of this article relies on.
Apache SeaTunnel is an open-source, high-performance, distributed data integration platform. It is a top-level project of the Apache Software Foundation, capable of handling massive amounts of data, synchronizing it in both batch and real-time modes, and supporting multiple data sources and formats. The goal of SeaTunnel is to provide a scalable, enterprise-level data integration platform that meets a wide range of large-scale data processing needs.
Milvus is an open-source vector similarity search engine that supports storing, indexing, and searching massive numbers of vectors. It is a high-performance, low-cost solution for large-scale vector data, and can be used in scenarios such as recommendation systems, image search, and music recommendation.
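The core idea of vector similarity search can be sketched as a brute-force scan: compute the distance from the query vector to every stored vector and keep the closest ones. The sketch below is plain Python and deliberately does not use the Milvus API; the names `l2_distance`, `top_k`, and `stored` are hypothetical. A vector search engine typically replaces this full scan with an index so that it does not have to compare the query against every vector.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, vectors, k):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = ((vec_id, l2_distance(query, vec)) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# A tiny in-memory "collection": id -> vector (made-up values).
stored = {
    1: [0.0, 0.0],
    2: [1.0, 1.0],
    3: [5.0, 5.0],
}

results = top_k([0.9, 1.2], stored, k=2)
for vec_id, distance in results:
    print(f"id={vec_id} distance={distance:.4f}")
```

The scan costs time proportional to the number of stored vectors for every query, which is why dedicated engines matter once the collection grows to millions of entries.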
ChatGPT is a conversational AI system developed by OpenAI and based on the Generative Pre-trained Transformer (GPT) family of models. It uses natural language processing and deep learning to generate text that resembles human conversation. ChatGPT has a wide range of applications, including intelligent customer service, chatbots, intelligent assistants, and language model research and development. Since its release in late 2022, ChatGPT has become one of the research hotspots in the field of natural language processing.
A Large Language Model (LLM) is a natural language processing model based on deep learning that can analyze and understand a given text and generate related text. Large language models typically use deep neural networks to learn the grammatical and semantic regularities of natural language, and they convert text into vector representations in a continuous vector space. During training, they learn language patterns and statistical regularities from large amounts of text, which enables them to generate high-quality content such as articles, news, and conversations. Large language models have a wide range of applications, including machine translation, text generation, and question-answering systems. Open-source deep learning frameworks such as TensorFlow and PyTorch are widely used to build and train them.
Here we go! I will show you how to combine Apache SeaTunnel, OpenAI’s Embe