
Unlocking Legal Insights: Effortless Document Summarization with OpenAI's LLM and LangChain

Shreyash Panchal

Artificial Intelligence / Machine Learning

The Rising Demand for Legal Document Summarization:

  • In a world where data, information, and legal complexity are prevalent, the volume of legal documents is growing rapidly. Law firms, legal professionals, and businesses are dealing with an ever-increasing number of legal texts, including contracts, court rulings, statutes, and regulations. 
  • These documents contain important insights, but understanding them can be overwhelming. This is where the demand for legal document summarization comes in. 
  • In this blog, we'll discuss the increasing need for summarizing legal documents and how modern technology is changing the way we analyze legal information, making it more efficient and accessible.

Overview of OpenAI and LangChain

  • We'll use the LangChain framework to build our application with LLMs. These models, powered by deep learning, have been extensively trained on large text datasets. They excel in various language tasks like translation, sentiment analysis, chatbots, and more. 
  • LLMs can understand complex text, identify entities, establish connections, and generate coherent content. We could use Meta's LLaMA models, OpenAI's models, or others; in this case, we will use OpenAI's LLM.

  • OpenAI is a leader in the field of artificial intelligence and machine learning. They have developed powerful Large Language Models (LLMs) that are capable of understanding and generating human-like text.
  • These models have been trained on vast amounts of textual data and can perform a wide range of natural language processing tasks.

LangChain is an innovative framework designed to simplify and enhance the development of applications and systems that involve natural language processing (NLP) and large language models (LLMs). 

It provides a structured and efficient approach for working with LLMs like OpenAI's GPT-3 and GPT-4 to tackle various NLP tasks. Here's an overview of LangChain's key features and capabilities:

  • Modular NLP Workflow: Build flexible NLP pipelines using modular blocks. 
  • Chain-Based Processing: Define processing flows using chain-based structures. 
  • Easy Integration: Seamlessly integrate LangChain with other tools and libraries.
  • Scalability: Scale NLP workflows to handle large datasets and complex tasks. 
  • Extensive Language Support: Work with multiple languages and models. 
  • Data Visualization: Visualize NLP pipeline results for better insights.
  • Version Control: Track changes and manage NLP workflows efficiently. 
  • Collaboration: Enable collaborative NLP development and experimentation.

Setting Up Environment

Setting Up Google Colab

Google Colab provides a powerful and convenient platform for running Python code with the added benefit of free GPU support. To get started, follow these steps:

  1. Visit Google Colab: Open your web browser and navigate to Google Colab.
  2. Sign In or Create a Google Account: You'll need to sign in with your Google account to use Google Colab. If you don't have one, you can create an account for free.
  3. Create a New Notebook: Once signed in, click on "New Notebook" to create a new Colab notebook.
  4. Select a Runtime: In the notebook, click "Runtime" in the menu and select "Change runtime type." The default Python 3 runtime works for this project; optionally set the hardware accelerator to "GPU."

OpenAI API Key Generation:

  1. Visit the OpenAI Website: Go to the OpenAI website.
  2. Sign In or Create an Account: Sign in or create a new OpenAI account.
  3. Generate a New API Key: Access the API section and generate a new API key.
  4. Name Your API Key: Give your API key a name that reflects its purpose.
  5. Copy the API Key: Copy the generated API key to your clipboard.
  6. Store the API Key Safely: Securely store the API key and do not share it publicly.
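A common way to keep the key out of the notebook source is to read it from the environment at runtime. A minimal sketch, assuming the key has been exported under the (hypothetical) variable name OPENAI_API_KEY:

```python
import os

# Assumption: the key was exported beforehand, e.g.
#   export OPENAI_API_KEY="sk-..."
# Reading it at runtime keeps the secret out of the notebook itself.
API_KEY = os.environ.get("OPENAI_API_KEY", "")

if not API_KEY:
    print("OPENAI_API_KEY is not set; add it before running the notebook.")
```
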

Understanding Legal Document Summarization Workflow

1. Map Step:

  • At the heart of our legal document summarization process is the Map-Reduce paradigm.
  • In the Map step, we treat each legal document individually. Think of it as dissecting a large puzzle into smaller, manageable pieces.
  • For each document, we employ a sophisticated Language Model (LLM). This LLM acts as our expert, breaking down complex legal language and extracting meaningful content.
  • The LLM generates concise summaries for each document section, essentially translating legalese into understandable insights.
  • These individual summaries become our building blocks, our pieces of the puzzle.

2. Reduce Step:

  • Now, let's shift our focus to the Reduce step.
  • Here's where we bring everything together. We've generated summaries for all the document sections, and it's time to assemble them into a cohesive whole.
  • Imagine the Reduce step as the puzzle solver. It takes all those individual pieces (summaries) and arranges them to form the big picture.
  • The goal is to produce a single, comprehensive summary that encapsulates the essence of the entire legal document.

3. Compression - Ensuring a Smooth Fit:

  • One challenge we encounter is the potential length of these individual summaries. Some legal documents can produce quite lengthy summaries.
  • To ensure a smooth flow within our summarization process, we've introduced a compression step.

4. Recursive Compression:

  • In some cases, even the compressed summaries might need further adjustment.
  • That's where the concept of recursive compression comes into play.
  • If necessary, we'll apply compression multiple times, refining and optimizing the summaries until they seamlessly fit into our summarization pipeline.
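The recursive compression described above can be sketched in plain Python. This is an illustration only, assuming a non-empty list of summaries: summarize() is a stub standing in for the LLM call, and tokens are estimated with a rough 4-characters-per-token heuristic (both are assumptions, not part of the real pipeline):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token (an assumption, not exact).
    return len(text) // 4

def summarize(texts: list[str]) -> str:
    # Stub for an LLM call: concatenate the group and truncate to ~50 "tokens".
    return " ".join(texts)[:200]

def collapse_summaries(summaries: list[str], token_max: int = 100) -> str:
    # Pack summaries into groups that fit the token budget, summarize each
    # group, and repeat until a single summary remains.
    while len(summaries) > 1:
        groups, current, size = [], [], 0
        for s in summaries:
            if current and size + estimate_tokens(s) > token_max:
                groups.append(current)
                current, size = [], 0
            current.append(s)
            size += estimate_tokens(s)
        groups.append(current)
        summaries = [summarize(g) for g in groups]
    return summaries[0]
```

LangChain's ReduceDocumentsChain performs a similar collapse when its input exceeds its token_max, except that the real chain calls the LLM to compress each group.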

Let’s Get Started

Step 1: Installing Python libraries

Create a new notebook in Google Colab and install the required Python libraries.

!pip install openai langchain tiktoken

OpenAI: Installed to access OpenAI's powerful language models for legal document summarization.

LangChain: Essential for implementing document mapping, reduction, and combining workflows efficiently.

Tiktoken: Helps manage token counts within text data, ensuring efficient usage of language models and avoiding token limit issues.

Step 2: Adding OpenAI API key to Colab

Integrate your OpenAI API key via Google Colab Secrets.

from google.colab import userdata

API_KEY = userdata.get("YOUR_SECRET_KEY_NAME")

Step 3: Initializing OpenAI LLM

Here, we import the OpenAI module from LangChain and initialize it with the provided API key to utilize advanced language models for document summarization.

from langchain.llms import OpenAI
llm = OpenAI(openai_api_key=API_KEY)

Step 4: Splitting text by Character

The Text Splitter, in this case, overcomes the token limit by breaking down the text into smaller chunks that are each within the token limit. This ensures that the text can be processed effectively by the language model without exceeding its token capacity. 

The "chunk_overlap" parameter allows for some overlap between chunks to ensure that no information is lost during the splitting process.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=120
)
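To see concretely what chunk_size and chunk_overlap mean, here is a plain-Python illustration of fixed-size splitting with overlap (LangChain's splitter is more sophisticated, measuring tokens and preferring separator boundaries, but the sliding-window idea is the same):

```python
# Illustration only: fixed-size character chunks with overlap.
# Assumes chunk_size > chunk_overlap.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks_out = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# Each chunk starts 2 characters before the previous one ends:
# ["abcd", "cdef", "efgh", "ghij"]
```

The repeated characters at each boundary are what prevent a sentence from being cut in half with no context on either side.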

Step 5: Loading PDF documents

from langchain.document_loaders import PyPDFLoader

def chunks(pdf_file_path):
    loader = PyPDFLoader(pdf_file_path)
    docs = loader.load_and_split()
    return docs

It initializes a PyPDFLoader object named "loader" using the provided PDF file path. This loader is responsible for loading and processing the contents of the PDF file. 

It then uses the "loader" to load and split the PDF document into smaller "docs" or document chunks. These document chunks likely represent different sections or pages of the PDF file. 

Finally, it returns the list of document chunks, making them available for further processing or analysis.

Step 6: Map Reduce Prompt Templates

Import libraries required for the implementation of LangChain MapReduce.

from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

map_template = """The following is a set of documents:
{docs}
Based on this list of documents, please write a concise summary of the main themes.
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

reduce_template = """The following is a set of summaries:
{doc_summaries}
Take these and distill them into a final, consolidated summary with a title (mandatory) in bold and the important key points.
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

Template Definition

The code defines two templates, map_template and reduce_template, which serve as structured prompts instructing the language model on how to process and summarize sets of documents. 

LLMChains for Mapping and Reduction

Two LLMChains, map_chain and reduce_chain, are configured with these templates to execute the mapping and reduction steps in the document summarization process, making it more structured and manageable.
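PromptTemplate.from_template parses the {docs} placeholder and fills it at run time; the substitution is essentially Python string formatting. A hypothetical example (the document text is invented for illustration) of what the map prompt looks like once filled:

```python
map_template = """The following is a set of documents:
{docs}
Based on this list of documents, please write a concise summary of the main themes.
Helpful Answer:"""

# Hypothetical document text, for illustration only.
filled = map_template.format(docs="Doc 1: The lessee shall not sublet the premises.")
# `filled` is the literal string that map_chain sends to the LLM.
```
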

Step 7: Map and Reduce LLM Chains

combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)

reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=5000,
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
    return_intermediate_steps=False,
)

Combining Documents Chain (combine_documents_chain): 

  • This chain plays a crucial role in the document summarization process. It takes the individual legal document summaries, generated in the "Map" step, and combines them into a single, cohesive text string. 
  • By consolidating the summaries, it prepares the data for further processing in the "Reduce" step. The resulting combined document string is assigned the variable name "doc_summaries." 

Reduce Documents Chain (reduce_documents_chain): 

  • This chain represents the final phase of the summarization process. Its primary function is to take the combined document string from the combine_documents_chain and perform in-depth reduction and summarization. 
  • To address potential issues related to token limits (where documents may exceed a certain token count), this chain offers a clever solution. It can recursively collapse or compress lengthy documents into smaller, more manageable chunks. 
  • This ensures that the summarization process remains efficient and avoids token limit constraints. The maximum token limit for each chunk is set at 5,000 tokens, helping control the size of the summarization output. 

Map-Reduce Documents Chain (map_reduce_chain): 

  • This chain follows the well-known MapReduce paradigm, a framework often used in distributed computing for processing and generating large datasets. In the "Map" step, it employs the map_chain to process each individual legal document. 
  • This results in initial document summaries. In the subsequent "Reduce" step, the chain uses the reduce_documents_chain to consolidate these initial summaries into a final, comprehensive document summary. 
  • The variable name "docs" tells the chain where to place the input documents in the map prompt; the chain's output is the final summary, representing the distilled insights from the legal documents. 
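The overall shape of the pipeline can be shown with stubs in place of the LLM calls. Here map_fn and reduce_fn are hypothetical stand-ins for map_chain and reduce_chain (a real run would prompt the model instead):

```python
def map_fn(doc: str) -> str:
    # Stub "summary": take the first sentence of each document.
    return doc.split(".")[0] + "."

def reduce_fn(summaries: list[str]) -> str:
    # Stub "consolidation": concatenate the per-document summaries.
    return " ".join(summaries)

def map_reduce(docs: list[str]) -> str:
    # Map each document to a summary, then reduce to one final summary.
    return reduce_fn([map_fn(d) for d in docs])

final = map_reduce([
    "Clause 1 limits liability. Details follow.",
    "Clause 2 sets the term. Details follow.",
])
# final == "Clause 1 limits liability. Clause 2 sets the term."
```

The real chains replace both stubs with LLM calls, but the data flow (per-document map, then a single reduce) is exactly this.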

Step 8: Summarization Function

def summarize_pdf(file_path):
    split_docs = text_splitter.split_documents(chunks(file_path))
    return map_reduce_chain.run(split_docs)

result_summary = summarize_pdf(file_path)  # file_path: path to your PDF
print(result_summary)

Our summarization process centers around the 'summarize_pdf' function. This function takes a PDF file path as input and follows a two-step approach. 

First, it splits the PDF into manageable sections using the 'text_splitter' module. Then, it runs the 'map_reduce_chain,' which handles the summarization process. 

By providing the PDF file path as input, you can easily generate a concise summary of the legal document within the Google Colab environment, thanks to LangChain and LLM.

Output

1. Original Document - https://www.safetyforward.com/docs/legal.pdf

Summarization - This document prohibits using a mobile phone while driving a motor vehicle and forbids disabling the phone's motion-restriction features.

2. Original Document - https://static.abhibus.com/ks/pdf/Loan-Agreement.pdf

Summarization - India and the International Bank for Reconstruction and Development have formed an agreement for the Sustainable Urban Transport Project, focusing on sustainable transportation while adhering to anti-corruption guidelines.

Limitations:

Complex Legal Terminology: 

LLMs may struggle with accurately summarizing documents containing intricate legal terminology, which requires domain-specific knowledge to interpret correctly. 

Loss of Context: 

Summarization processes, especially in lengthy legal documents, may result in the loss of important contextual details, potentially affecting the comprehensiveness of the summaries. 

Inherent Bias: 

LLMs can inadvertently introduce bias into summaries based on the biases present in their training data. This is a critical concern when dealing with legal documents that require impartiality. 

Document Structure: 

Summarization models might not always understand the hierarchical or structural elements of legal documents, making it challenging to generate summaries that reflect the intended structure.

Limited Abstraction: 

LLMs excel at generating detailed summaries, but they may struggle with abstracting complex legal arguments, which is essential for high-level understanding.

Conclusion:

  • In a nutshell, this project uses LangChain and OpenAI's LLM to bring in a fresh way of summarizing legal documents. This collaboration makes legal document management more accurate and efficient.
  • However, we faced some big challenges, like handling lots of legal documents and dealing with AI bias. As we move forward, we need to find new ways to make our automated summarization even better and meet the demands of the legal profession.
  • In the future, we're committed to improving our approach. We'll focus on fine-tuning algorithms for more accuracy and exploring new techniques, like combining different methods, to keep enhancing legal document summarization. Our aim is to meet the ever-growing needs of the legal profession.

Unlocking Legal Insights: Effortless Document Summarization with OpenAI's LLM and LangChain

The Rising Demand for Legal Document Summarization:

  • In a world where data, information, and legal complexities is prevalent, the volume of legal documents is growing rapidly. Law firms, legal professionals, and businesses are dealing with an ever-increasing number of legal texts, including contracts, court rulings, statutes, and regulations. 
  • These documents contain important insights, but understanding them can be overwhelming. This is where the demand for legal document summarization comes in. 
  • In this blog, we'll discuss the increasing need for summarizing legal documents and how modern technology is changing the way we analyze legal information, making it more efficient and accessible.

Overview OpenAI and LangChain

  • We'll use the LangChain framework to build our application with LLMs. These models, powered by deep learning, have been extensively trained on large text datasets. They excel in various language tasks like translation, sentiment analysis, chatbots, and more. 
  • LLMs can understand complex text, identify entities, establish connections, and generate coherent content. We can use meta LLaMA LLMs, OpenAI LLMs and others as well. For this case, we will be using OpenAI’s LLM.

  • OpenAI is a leader in the field of artificial intelligence and machine learning. They have developed powerful Large Language Models (LLMs) that are capable of understanding and generating human-like text.
  •  These models have been trained on vast amounts of textual data and can perform a wide range of natural language processing tasks.

LangChain is an innovative framework designed to simplify and enhance the development of applications and systems that involve natural language processing (NLP) and large language models (LLMs). 

It provides a structured and efficient approach for working with LLMs like OpenAI's GPT-3 and GPT-4 to tackle various NLP tasks. Here's an overview of LangChain's key features and capabilities:

  • Modular NLP Workflow: Build flexible NLP pipelines using modular blocks. 
  • Chain-Based Processing: Define processing flows using chain-based structures. 
  • Easy Integration: Seamlessly integrate LangChain with other tools and libraries.
  • Scalability: Scale NLP workflows to handle large datasets and complex tasks. 
  • Extensive Language Support: Work with multiple languages and models. 
  • Data Visualization: Visualize NLP pipeline results for better insights.
  • Version Control: Track changes and manage NLP workflows efficiently. 
  • Collaboration: Enable collaborative NLP development and experimentation.

Setting Up Environment

Setting Up Google Colab

Google Colab provides a powerful and convenient platform for running Python code with the added benefit of free GPU support. To get started, follow these steps:

  1. Visit Google Colab: Open your web browser and navigate to Google Colab.
  2. Sign In or Create a Google Account: You'll need to sign in with your Google account to use Google Colab. If you don't have one, you can create an account for free.
  3. Create a New Notebook: Once signed in, click on "New Notebook" to create a new Colab notebook.
  4. Choose Python Version: In the notebook, click on "Runtime" in the menu and select "Change runtime type." Choose your preferred Python version (usually Python 3) and set the hardware accelerator to "GPU." Also, make sure to turn on the "Internet" toggle.

OpenAI API Key Generation:-

  1. Visit the OpenAI Website Go to the OpenAI website.
  2.  Sign In or Create an Account Sign in or create a new OpenAI account. 
  3. Generate a New API Key Access the API section and generate a new API key. 
  4. Name Your API Key Give your API key a name that reflects its purpose. 
  5. Copy the API Key Copy the generated API key to your clipboard. 
  6. Store the API Key Safely Securely store the API key and do not share it publicly.

Understanding Legal Document Summarization Workflow

1. Map Step:

  • At the heart of our legal document summarization process is the Map-Reduce paradigm.
  • In the Map step, we treat each legal document individually. Think of it as dissecting a large puzzle into smaller, manageable pieces.
  • For each document, we employ a sophisticated Language Model (LLM). This LLM acts as our expert, breaking down complex legal language and extracting meaningful content.
  • The LLM generates concise summaries for each document section, essentially translating legalese into understandable insights.
  • These individual summaries become our building blocks, our pieces of the puzzle.

2. Reduce Step:

  • Now, let's shift our focus to the Reduce step.
  • Here's where we bring everything together. We've generated summaries for all the document sections, and it's time to assemble them into a cohesive whole.
  • Imagine the Reduce step as the puzzle solver. It takes all those individual pieces (summaries) and arranges them to form the big picture.
  • The goal is to produce a single, comprehensive summary that encapsulates the essence of the entire legal document.

3. Compression - Ensuring a Smooth Fit:

  • One challenge we encounter is the potential length of these individual summaries. Some legal documents can produce quite lengthy summaries.
  • To ensure a smooth flow within our summarization process, we've introduced a compression step.

4. Recursive Compression:

  • In some cases, even the compressed summaries might need further adjustment.
  • That's where the concept of recursive compression comes into play.
  • If necessary, we'll apply compression multiple times, refining and optimizing the summaries until they seamlessly fit into our summarization pipeline.

Let’s Get Started

Step 1: Installing python libraries

Create a new notebook in Google Colab and install the required Python libraries.

!pip install openai langchain tiktoken
view raw .py hosted with ❤ by GitHub

OpenAI: Installed to access OpenAI's powerful language models for legal document summarization.

LangChain: Essential for implementing document mapping, reduction, and combining workflows efficiently.

Tiktoken: Helps manage token counts within text data, ensuring efficient usage of language models and avoiding token limit issues.

Step 2: Adding OpenAI API key to Colab

Integrate your openapi key in Google Colab Secrets.

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
API_KEY= user_secrets.get_secret("YOUR_SECRET_KEY_NAME")
view raw .py hosted with ❤ by GitHub

Step 3: Initializing OpenAI LLM

Here, we import the OpenAI module from LangChain and initialize it with the provided API key to utilize advanced language models for document summarization.

from langchain.llms import OpenAI
llm = OpenAI(openai_api_key=API_KEY)
view raw .py hosted with ❤ by GitHub

Step 4: Splitting text by Character

The Text Splitter, in this case, overcomes the token limit by breaking down the text into smaller chunks that are each within the token limit. This ensures that the text can be processed effectively by the language model without exceeding its token capacity. 

The "chunk_overlap" parameter allows for some overlap between chunks to ensure that no information is lost during the splitting process.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
chunk_size=1000, chunk_overlap=120
)
view raw .py hosted with ❤ by GitHub

Step 5 : Loading PDF documents

from langchain.document_loaders import PyPDFLoader
def chunks(pdf_file_path):
loader = PyPDFLoader(pdf_file_path)
docs = loader.load_and_split()
return docs
view raw .py hosted with ❤ by GitHub

It initializes a PyPDFLoader object named "loader" using the provided PDF file path. This loader is responsible for loading and processing the contents of the PDF file. 

It then uses the "loader" to load and split the PDF document into smaller "docs" or document chunks. These document chunks likely represent different sections or pages of the PDF file. 

Finally, it returns the list of document chunks, making them available for further processing or analysis.

Step 6: Map Reduce Prompt Templates

Import libraries required for the implementation of LangChain MapReduce.

from langchain.chains.mapreduce import MapReduceChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
view raw .py hosted with ❤ by GitHub

map_template = """The following is a set of documents
{docs}
Based on this list of docs, summarised into meaningful
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)
reduce_template = """The following is set of summaries:
{doc_summaries}
Take these and distil it into a final consolidated summary with title(mandatory) in bold with important key points .
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)
view raw .py hosted with ❤ by GitHub

Template Definition

The code defines two templates, map_template and reduce_template, which serve as structured prompts for instructing a language model on how to process and summarise sets of documents. 

LLMChains for Mapping and Reduction

Two LLMChains, map_chain, and reduce_chain, are configured with these templates to execute the mapping and reduction steps in the document summarization process, making it more structured and manageable.

Step 7 : Map and Reduce LLM Chains

combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=5000,
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
    return_intermediate_steps=False,
)

Combining Documents Chain (combine_documents_chain): 

  • This chain plays a crucial role in the document summarization process. It takes the individual legal document summaries, generated in the "Map" step, and combines them into a single, cohesive text string. 
  • By consolidating the summaries, it prepares the data for further processing in the "Reduce" step. The resulting combined document string is assigned the variable name "doc_summaries." 

Reduce Documents Chain (reduce_documents_chain): 

  • This chain represents the final phase of the summarization process. Its primary function is to take the combined document string from the combine_documents_chain and perform in-depth reduction and summarization. 
  • To address potential issues related to token limits (where documents may exceed a certain token count), this chain offers a clever solution. It can recursively collapse or compress lengthy documents into smaller, more manageable chunks. 
  • This ensures that the summarization process remains efficient and avoids token limit constraints. The maximum token limit for each chunk is set at 5,000 tokens, helping control the size of the summarization output. 

Map-Reduce Documents Chain (map_reduce_chain): 

  • This chain follows the well-known MapReduce paradigm, a framework often used in distributed computing for processing and generating large datasets. In the "Map" step, it employs the map_chain to process each individual document chunk, producing an initial summary per chunk. 
  • In the subsequent "Reduce" step, the chain uses the reduce_documents_chain to consolidate these initial summaries into a final, comprehensive document summary. 
  • The input chunks are passed into the map prompt through the variable named "docs" (the chain's document_variable_name), and the chain returns the distilled summary of the legal documents as its output. 
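The interplay of the three chains can be hard to visualise. Stripped to its essentials, the flow they implement looks like this pure-Python sketch, where fake_llm is a hypothetical stand-in for the real model and a character count stands in for token counting:

```python
def fake_llm(prompt):
    # Hypothetical stand-in for the real LLM: just truncate to 60 characters.
    return prompt[:60]

def map_step(docs):
    # "Map": summarise every chunk independently (map_chain's role).
    return [fake_llm(f"Summarise the main themes of: {d}") for d in docs]

def reduce_step(summaries, token_max=5000):
    # "Reduce": combine summaries; if the combined text is too long,
    # first collapse adjacent pairs (the collapse_documents_chain
    # behaviour of ReduceDocumentsChain), then consolidate.
    combined = "\n".join(summaries)
    while len(combined) > token_max and len(summaries) > 1:
        summaries = [fake_llm("Collapse: " + "\n".join(summaries[i:i + 2]))
                     for i in range(0, len(summaries), 2)]
        combined = "\n".join(summaries)
    return fake_llm(f"Consolidated summary: {combined}")

final = reduce_step(map_step(["chunk one...", "chunk two...", "chunk three..."]))
```

The recursive collapse is what keeps the final reduce call under the token limit regardless of how many chunks the PDF produced.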

Step 8: Summarization Function

def summarize_pdf(file_path):
    split_docs = text_splitter.split_documents(chunks(file_path))
    return map_reduce_chain.run(split_docs)

result_summary = summarize_pdf(file_path)
print(result_summary)

Our summarization process centers around the 'summarize_pdf' function. This function takes a PDF file path as input and follows a two-step approach. 

First, it splits the PDF into manageable sections using the 'text_splitter'. Then it runs 'map_reduce_chain', which handles the summarization process. 

By providing the PDF file path as input, you can easily generate a concise summary of the legal document within the Google Colab environment, thanks to LangChain and LLM.
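Note that 'text_splitter' is instantiated outside the excerpt shown here. A typical configuration would look like the following; the specific chunk size and overlap values are an assumption, not necessarily the author's exact settings:

```python
from langchain.text_splitter import CharacterTextSplitter

# Hypothetical settings: roughly 1,000 characters per chunk with a small
# overlap so that context is not lost at chunk boundaries.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
```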

Output

1. Original Document - https://www.safetyforward.com/docs/legal.pdf

This document is about not using mobile phones while driving a motor vehicle and prohibits disabling its motion restriction features.

Summarization -

2. Original Document - https://static.abhibus.com/ks/pdf/Loan-Agreement.pdf

India and the International Bank for Reconstruction and Development have formed an agreement for the Sustainable Urban Transport Project, focusing on sustainable transportation while adhering to anti-corruption guidelines.

Summarization -

Limitations:

Complex Legal Terminology: 

LLMs may struggle with accurately summarizing documents containing intricate legal terminology, which requires domain-specific knowledge to interpret correctly. 

Loss of Context: 

Summarization processes, especially in lengthy legal documents, may result in the loss of important contextual details, potentially affecting the comprehensiveness of the summaries. 

Inherent Bias: 

LLMs can inadvertently introduce bias into summaries based on the biases present in their training data. This is a critical concern when dealing with legal documents that require impartiality. 

Document Structure: 

Summarization models might not always understand the hierarchical or structural elements of legal documents, making it challenging to generate summaries that reflect the intended structure.

Limited Abstraction: 

LLMs excel at generating detailed summaries, but they may struggle with abstracting complex legal arguments, which is essential for high-level understanding.

Conclusion:

  • In a nutshell, this project uses LangChain and OpenAI's LLM to bring in a fresh way of summarizing legal documents. This collaboration makes legal document management more accurate and efficient.
  • However, we faced some big challenges, like handling lots of legal documents and dealing with AI bias. As we move forward, we need to find new ways to make our automated summarization even better and meet the demands of the legal profession.
  • In the future, we're committed to improving our approach. We'll focus on fine-tuning algorithms for more accuracy and exploring new techniques, like combining different methods, to keep enhancing legal document summarization. Our aim is to meet the ever-growing needs of the legal profession.