{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "94a7aa12-e8cc-4ebc-8412-10d1a301bb60",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "_ = load_dotenv(find_dotenv())\n",
    "openai_api_key = os.environ[\"OPENAI_API_KEY\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fd15744-823a-490f-8840-e385a2153485",
   "metadata": {},
   "source": [
    "## Splitters\n",
    "* We use splitters to divide documents in small chunks for the RAG technique.\n",
    "* The way we split one document is very relevant, since it has a big impact in the quality of the RAG retrieval.\n",
    "* It is important to understand how to optimize the splitting process.\n",
    "* Two important techniques to optimize splitting are:\n",
    "    * How we build the chunks (ideally, whole sentences or paragraphs).\n",
    "    * The metadata we add to each chunk."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04c59a93-1989-451b-a39c-79bd3a405dcd",
   "metadata": {},
   "source": [
    "## Character Splitter\n",
    "* This splits based on characters (by default \"\\n\\n\") and measure chunk length by number of characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f1e9f23a-9b57-4b53-94df-3ba67cf48349",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e83634bf-3f08-4aa8-aa04-9e750706c40f",
   "metadata": {},
   "outputs": [],
   "source": [
    "chunk_size =26\n",
    "chunk_overlap = 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d17b722e-8ad9-4a31-957d-74e498cba01a",
   "metadata": {},
   "outputs": [],
   "source": [
    "character_splitter = CharacterTextSplitter(\n",
    "    chunk_size=chunk_size,\n",
    "    chunk_overlap=chunk_overlap\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "79989c01-ba63-4636-86c1-1abb991b6f4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "text1 = 'abcdefghijklmnopqrstuvwxyzabcdefg'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "97c3ae3f-a935-44ae-87de-93e4b509b456",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['abcdefghijklmnopqrstuvwxyzabcdefg']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "character_splitter.split_text(text1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4bdcfbe9-4b65-42e6-b181-83157ceb9836",
   "metadata": {},
   "outputs": [],
   "source": [
    "text2 = \"\"\"\n",
    "Data that Speak\n",
    "LLM Applications are revolutionizing industries such as \n",
    "banking, healthcare, insurance, education, legal, tourism, \n",
    "construction, logistics, marketing, sales, customer service, \n",
    "and even public administration.\n",
    "\n",
    "The aim of our programs is for students to learn how to \n",
    "create LLM Applications in the context of a business, \n",
    "which presents a set of challenges that are important \n",
    "to consider in advance.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8a6b2dc2-ad64-4aee-8198-28552b523b36",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Created a chunk of size 227, which is longer than the specified 26\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['Data that Speak\\nLLM Applications are revolutionizing industries such as \\nbanking, healthcare, insurance, education, legal, tourism, \\nconstruction, logistics, marketing, sales, customer service, \\nand even public administration.',\n",
       " 'The aim of our programs is for students to learn how to \\ncreate LLM Applications in the context of a business, \\nwhich presents a set of challenges that are important \\nto consider in advance.']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "character_splitter.split_text(text2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e120cb1-8400-4b0a-b124-3d9a59dd8382",
   "metadata": {},
   "source": [
    "## Recursive Character Splitter\n",
    "This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is [\"\\n\\n\", \"\\n\", \" \", \"\"]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "a59e8873-966f-4600-9c18-b607b100047c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "6d080c3f-c5b0-4e68-a581-760d3ad21594",
   "metadata": {},
   "outputs": [],
   "source": [
    "recursive_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size,\n",
    "    chunk_overlap=chunk_overlap\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e9e510e5-6bd9-4086-a6ae-5d2948585e93",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recursive_splitter.split_text(text1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "2eae4742-977f-4f05-a123-feb2fafa348d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Data that Speak',\n",
       " 'LLM Applications are',\n",
       " 'are revolutionizing',\n",
       " 'industries such as',\n",
       " 'banking, healthcare,',\n",
       " 'insurance, education,',\n",
       " 'legal, tourism,',\n",
       " 'construction, logistics,',\n",
       " 'marketing, sales,',\n",
       " 'customer service,',\n",
       " 'and even public',\n",
       " 'administration.',\n",
       " 'The aim of our programs',\n",
       " 'is for students to learn',\n",
       " 'how to',\n",
       " 'create LLM Applications',\n",
       " 'in the context of a',\n",
       " 'a business,',\n",
       " 'which presents a set of',\n",
       " 'of challenges that are',\n",
       " 'are important',\n",
       " 'to consider in advance.']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recursive_splitter.split_text(text2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "dcbeea81-e6be-4b02-8920-be55ed69bd0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "second_recursive_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=150,\n",
    "    chunk_overlap=0,\n",
    "    separators=[\"\\n\\n\", \"\\n\", \"(?<=\\. )\", \" \", \"\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "2384ba71-5334-486d-913c-255f495e5674",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Data that Speak\\nLLM Applications are revolutionizing industries such as \\nbanking, healthcare, insurance, education, legal, tourism,',\n",
       " 'construction, logistics, marketing, sales, customer service, \\nand even public administration.',\n",
       " 'The aim of our programs is for students to learn how to \\ncreate LLM Applications in the context of a business,',\n",
       " 'which presents a set of challenges that are important \\nto consider in advance.']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "second_recursive_splitter.split_text(text2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "ec980872-7d93-4e09-ae25-31a7f9461b9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks = second_recursive_splitter.split_text(text2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "6061ac23-c9e5-4812-a185-69e749ff504e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de2b0937-8a8a-4f1c-988f-076260425aa4",
   "metadata": {},
   "source": [
    "## Adding helpful metadata to the text chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "325f1b20-dccc-427f-becc-0008a4497d20",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import MarkdownHeaderTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "fd4659e2-a471-48fb-b356-c47855753589",
   "metadata": {},
   "outputs": [],
   "source": [
    "document_with_markdown = \"\"\"\n",
    "# Title: My book\\n\\n \\\n",
    "\n",
    "## Chapter 1: The day I was born\\n\\n \\\n",
    "I was born in a very sunny day of summer...\\n\\n \\\n",
    "\n",
    "### Section 1.1: My family \\n\\n \\\n",
    "My father had a big white car... \\n\\n \n",
    "\n",
    "## Chapter 2: My school\\n\\n \\\n",
    "My first day at the school was...\\n\\n \\\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4a2b1442-cf99-4b60-8261-feac11137c62",
   "metadata": {},
   "outputs": [],
   "source": [
    "headers_to_split_on = [\n",
    "    (\"#\", \"Book title\"),\n",
    "    (\"##\", \"Chapter\"),\n",
    "    (\"###\", \"Section\"),\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "8a06c979-da25-4b04-807a-d6c259753142",
   "metadata": {},
   "outputs": [],
   "source": [
    "markdown_splitter = MarkdownHeaderTextSplitter(\n",
    "    headers_to_split_on=headers_to_split_on\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "ee59ad05-07bc-4736-89a4-719e1ee11f2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "md_header_splits = markdown_splitter.split_text(document_with_markdown)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "2c3581c4-c3d4-4cde-b86a-08e23012eba0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='I was born in a very sunny day of summer...', metadata={'Book title': 'Title: My book', 'Chapter': 'Chapter 1: The day I was born'})"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "md_header_splits[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "5d8b6d67-f1f2-425f-807f-78123439e7e6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='My father had a big white car...', metadata={'Book title': 'Title: My book', 'Chapter': 'Chapter 1: The day I was born', 'Section': 'Section 1.1: My family'})"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "md_header_splits[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "726faf2c-4180-4c97-bfb9-b9662e2f2b8c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Document(page_content='My first day at the school was...', metadata={'Book title': 'Title: My book', 'Chapter': 'Chapter 2: My school'})"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "md_header_splits[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f433f91-6f03-484a-9cf3-7016309a984e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
