Google BigQuery Vector Search
Google Cloud BigQuery Vector Search lets you use GoogleSQL to do semantic search, using vector indexes for fast approximate results, or using brute force for exact results.
This tutorial illustrates how to work with an end-to-end data and embedding management system in LangChain, and provides a scalable semantic search in BigQuery using theBigQueryVectorStore
class. This class is part of a set of 2 classes capable of providing a unified data storage and flexible vector search in Google Cloud:
- BigQuery Vector Search: with
BigQueryVectorStore
class, which is ideal for rapid prototyping with no infrastructure setup and batch retrieval. - Feature Store Online Store: with
VertexFSVectorStore
class, enables low-latency retrieval with manual or scheduled data sync. Perfect for production-ready user-facing GenAI applications.
Getting startedโ
Install the libraryโ
%pip install --upgrade --quiet langchain langchain-google-vertexai "langchain-google-community[featurestore]"
To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Before you beginโ
Set your project IDโ
If you don't know your project ID, try the following:
- Run
gcloud config list
. - Run
gcloud projects list
. - See the support page: Locate the project ID.
PROJECT_ID = "" # @param {type:"string"}
# Set the project id
! gcloud config set project {PROJECT_ID}
Set the regionโ
You can also change the REGION
variable used by BigQuery. Learn more about BigQuery regions.
REGION = "us-central1" # @param {type: "string"}
Set the dataset and table namesโ
They will be your BigQuery Vector Store.
DATASET = "my_langchain_dataset" # @param {type: "string"}
TABLE = "doc_and_vectors" # @param {type: "string"}
Authenticating your notebook environmentโ
- If you are using Colab to run this notebook, uncomment the cell below and continue.
- If you are using Vertex AI Workbench, check out the setup instructions here.
# from google.colab import auth as google_auth
# google_auth.authenticate_user()
Demo: BigQueryVectorStoreโ
Create an embedding class instanceโ
You may need to enable Vertex AI API in your project by running
gcloud services enable aiplatform.googleapis.com --project {PROJECT_ID}
(replace {PROJECT_ID}
with the name of your project).
You can use any LangChain embeddings model.
from langchain_google_vertexai import VertexAIEmbeddings
embedding = VertexAIEmbeddings(
model_name="textembedding-gecko@latest", project=PROJECT_ID
)
Initialize BigQueryVectorStoreโ
BigQuery Dataset and Table will be automatically created if they do not exist. See class definition here for all optional paremeters.
from langchain_google_community import BigQueryVectorStore
store = BigQueryVectorStore(
project_id=PROJECT_ID,
dataset_name=DATASET,
table_name=TABLE,
location=REGION,
embedding=embedding,
)
Add textsโ
all_texts = ["Apples and oranges", "Cars and airplanes", "Pineapple", "Train", "Banana"]
metadatas = [{"len": len(t)} for t in all_texts]
store.add_texts(all_texts, metadatas=metadatas)
Search for documentsโ
query = "I'd like a fruit."
docs = store.similarity_search(query)
print(docs)
Search for documents by vectorโ
query_vector = embedding.embed_query(query)
docs = store.similarity_search_by_vector(query_vector, k=2)
print(docs)
Searching Documents with Metadata Filtersโ
The vectorstore supports two methods for applying filters to metadata fields when performing document searches:
- Dictionary-based Filters
- You can pass a dictionary (dict) where the keys represent metadata fields and the values specify the filter condition. This method applies an equality filter between the key and the corresponding value. When multiple key-value pairs are provided, they are combined using a logical AND operation.
- SQL-based Filters
- Alternatively, you can provide a string representing an SQL WHERE clause to define more complex filtering conditions. This allows for greater flexibility, supporting SQL expressions such as comparison operators and logical operators.
# Dictionary-based Filters
# This should only return "Banana" document.
docs = store.similarity_search_by_vector(query_vector, filter={"len": 6})
print(docs)
# SQL-based Filters
# This should return "Banana", "Apples and oranges" and "Cars and airplanes" documents.
docs = store.similarity_search_by_vector(query_vector, filter={"len = 6 AND len > 17"})
print(docs)
Batch searchโ
BigQueryVectorStore offers a batch_search
method for scalable Vector similarity search.
results = store.batch_search(
embeddings=None, # can pass embeddings or
queries=["search_query", "search_query"], # can pass queries
)
Add text with embeddingsโ
You can also bring your own embeddings with the add_texts_with_embeddings
method.
This is particularly useful for multimodal data which might require custom preprocessing before the embedding generation.
items = ["some text"]
embs = embedding.embed(items)
ids = store.add_texts_with_embeddings(
texts=["some text"], embs=embs, metadatas=[{"len": 1}]
)
Low-latency serving with Feature Storeโ
You can simply use the method .to_vertex_fs_vector_store()
to get a VertexFSVectorStore object, which offers low latency for online use cases. All mandatory parameters will be automatically transferred from the existing BigQueryVectorStore class. See the class definition for all the other parameters you can use.
Moving back to BigQueryVectorStore is equivalently easy with the .to_bq_vector_store()
method.
store.to_vertex_fs_vector_store() # pass optional VertexFSVectorStore parameters as arguments
Relatedโ
- Vector store conceptual guide
- Vector store how-to guides