JohnSnowLabs
diff --git a/data/ocr/MT_OCR_00.pdf b/data/ocr/MT_OCR_00.pdf
358 KB
diff --git a/finance-nlp/90.3.Financial_Table_Signature_Extraction.ipynb b/finance-nlp/90.3.Financial_Table_Signature_Extraction.ipynb
+1,247
diff --git a/healthcare-nlp/00.SparkNLP_for_Healthcare_3h_Notebook.ipynb b/healthcare-nlp/00.SparkNLP_for_Healthcare_3h_Notebook.ipynb
+1
diff --git a/healthcare-nlp/04.0.Clinical_DeIdentification.ipynb b/healthcare-nlp/04.0.Clinical_DeIdentification.ipynb
+1
diff --git a/healthcare-nlp/24.0.Medical_Text_Summarization.ipynb b/healthcare-nlp/24.0.Medical_Text_Summarization.ipynb
+1
diff --git a/healthcare-nlp/25.0.Biogpt_Chat_JSL.ipynb b/healthcare-nlp/25.0.Biogpt_Chat_JSL.ipynb
+1
diff --git a/healthcare-nlp/25.1.Medical_Text_Generation.ipynb b/healthcare-nlp/25.1.Medical_Text_Generation.ipynb
+1
diff --git a/healthcare-nlp/slides/Spark NLP Healthcare Training - April 2023.pdf b/healthcare-nlp/slides/Spark NLP Healthcare Training - April 2023.pdf
17 MB
diff --git a/jupyter/enterprise/healthcare/5.Spark_OCR.ipynb b/jupyter/enterprise/healthcare/5.Spark_OCR.ipynb
+1
@@ -0,0 +1 @@
+{"cells":[{"cell_type":"markdown","source":["![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n"],"metadata":{"id":"X64RTULpsvUT"}},{"cell_type":"markdown","source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/25.0.Biogpt_Chat_JSL.ipynb)\n","\n"],"metadata":{"id":"TGSKYVuqsuE5"}},{"cell_type":"markdown","metadata":{"id":"5BC5b1eU_QDg"},"source":["# **BioGPT - Chat JSL - Closed Book Question Answering**"]},{"cell_type":"markdown","metadata":{"id":"sEjqwB2PHuuS"},"source":["The objective of this notebook is to explore the Biomedical Generative Pre-trained Transformer (BioGPT) models `biogpt_chat_jsl`, `biogpt_chat_jsl_conversational`, and `biogpt_chat_jsl_conditions` for closed book question answering. These models are pre-trained on large biomedical text corpora and can generate coherent and relevant responses to biomedical questions.\n","\n","📖 Learning Objectives:\n","\n","- Learn how to use the BioGPT models in Spark NLP for closed book question answering tasks, including loading pre-trained models and configuring the pipeline.\n","\n","- Understand the parameters and options available for the BioGPT models to customize the text generation process based on specific use cases."]},{"cell_type":"markdown","metadata":{"id":"okhT7AcXxben"},"source":["# ⚒️ Setup and Import Libraries"]},{"cell_type":"markdown","metadata":{"id":"G7dOaR_TlgE-"},"source":["📌 To run this yourself, you will need to upload your license keys to the notebook. Run the cell below to do that. 
Alternatively, you can open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.\n","Otherwise, you can look at the example outputs at the bottom of the notebook."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tQLe_InJtnzA"},"outputs":[],"source":["# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.\n","! pip install -q johnsnowlabs"]},{"cell_type":"code","source":["from google.colab import files\n","print('Please Upload your John Snow Labs License using the button below')\n","license_keys = files.upload()"],"metadata":{"id":"Jjj9gCdWMXyF"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["from johnsnowlabs import nlp, medical, visual\n","\n","# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark Session JVM\n","nlp.install()"],"metadata":{"id":"L1LFkCjFMyxi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["from johnsnowlabs import nlp, medical, visual\n","import pandas as pd\n","\n","# Automatically load license data and start a session with all jars the user has access to\n","spark = nlp.start()"],"metadata":{"id":"fCy9pQxhhIkD"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["from pyspark.sql import DataFrame\n","import pyspark.sql.functions as F\n","import pyspark.sql.types as T\n","import pyspark.sql as SQL\n","from pyspark import keyword_only\n","import textwrap"],"metadata":{"id":"gTVeDWGmhKuk","executionInfo":{"status":"ok","timestamp":1686634803290,"user_tz":240,"elapsed":20,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"execution_count":5,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"-62Qs6RAIC1V"},"source":["# \t📎🏥 `biogpt_chat_jsl`"]},{"cell_type":"markdown","metadata":{"id":"twxRx_PGIGm7"},"source":["This model is based on BioGPT fine-tuned on medical conversations held in a clinical 
setting and can answer clinical questions related to symptoms, drugs, tests, and diseases."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"DCFN2tYF3X-Z"},"outputs":[],"source":["document_assembler = nlp.DocumentAssembler() \\\n","    .setInputCol(\"text\") \\\n","    .setOutputCol(\"documents\")\n","\n","gpt_qa = medical.TextGenerator().pretrained(\"biogpt_chat_jsl\", \"en\", \"clinical/models\")\\\n","    .setInputCols(\"documents\")\\\n","    .setOutputCol(\"answer\")\\\n","    .setMaxNewTokens(299)\\\n","    .setStopAtEos(True)\\\n","    .setDoSample(False)\\\n","    .setTopK(3)\\\n","    .setRandomSeed(42)\\\n","    .setCustomPrompt(\"QUESTION: {DOCUMENT} ANSWER:\")\n","\n","pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])\n","\n","TEXT = \"What medications are commonly used to treat emphysema?\"\n","data = spark.createDataFrame(pd.DataFrame({\"text\": [TEXT]}))\n","\n","result = pipeline.fit(data).transform(data)\n","result.show(truncate=False)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"s9vKgjtjIjLA","executionInfo":{"status":"aborted","timestamp":1686634961458,"user_tz":240,"elapsed":15,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[],"source":["result.select(\"answer.result\").show(truncate=False)"]},{"cell_type":"markdown","metadata":{"id":"Dv3a3Mm8aTh6"},"source":["## **📍 LightPipeline**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"FIMoP1-fvfF1","executionInfo":{"status":"aborted","timestamp":1686634961460,"user_tz":240,"elapsed":17,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[],"source":["gpt_qa = medical.TextGenerator().pretrained(\"biogpt_chat_jsl\", \"en\", \"clinical/models\")\\\n","    .setInputCols(\"documents\")\\\n","    .setOutputCol(\"answer\")\\\n","    .setMaxNewTokens(299)\\\n","    .setStopAtEos(True)\\\n","    .setDoSample(False)\\\n","    .setTopK(3)\\\n","    .setRandomSeed(42)\\\n","    
.setCustomPrompt(\"QUESTION: {DOCUMENT} ANSWER:\")\n","\n","pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"YSfsStyhCUKa","executionInfo":{"status":"aborted","timestamp":1686634961461,"user_tz":240,"elapsed":18,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[],"source":["TEXT = \"What are the risk factors for developing heart disease?\"\n","\n","model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n","light_model = nlp.LightPipeline(model)\n","light_result = light_model.annotate(TEXT)\n","answer_text = light_result[\"answer\"]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"RfIJU_0ACYfU","executionInfo":{"status":"aborted","timestamp":1686634961462,"user_tz":240,"elapsed":19,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[],"source":["# Extract the text after 'answer:' (the whole string is kept if the marker is absent)\n","final_answer = answer_text[0].split(\"answer:\", 1)[-1].strip()\n","\n","# Format the text into paragraphs\n","wrapped_text = textwrap.fill(final_answer, width=120)\n","\n","print(\"➤ Answer: \\n{}\".format(wrapped_text))\n","print(\"\\n\")"]},{"cell_type":"markdown","metadata":{"id":"Ta3HNysnaYy4"},"source":["## 🚩 `setMaxNewTokens`"]},{"cell_type":"markdown","metadata":{"id":"Fpp0aiEzbH37"},"source":["- This parameter sets the maximum number of new tokens that the GPT model will generate for the output, constraining the length of the generated response and managing the computational cost."]},{"cell_type":"markdown","metadata":{"id":"RrdPu1wK8-Pw"},"source":["Pipeline with `setMaxNewTokens(128)` and `setMaxNewTokens(299)`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3jbbh6hy_w11","executionInfo":{"status":"aborted","timestamp":1686634961463,"user_tz":240,"elapsed":19,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[],"source":["# Shared generator settings; setMaxNewTokens is set separately for each run\n","gpt_qa = 
medical.TextGenerator().pretrained(\"biogpt_chat_jsl\", \"en\", \"clinical/models\") \\\n","    .setInputCols(\"documents\") \\\n","    .setOutputCol(\"answer\") \\\n","    .setStopAtEos(True)\\\n","    .setDoSample(False)\\\n","    .setTopK(3) \\\n","    .setRandomSeed(42)\\\n","    .setCustomPrompt(\"QUESTION: {DOCUMENT} ANSWER:\")\n","\n","\n","MaxNewTokens = [128, 299]\n","\n","\n","# Sample question\n","TEXT = \"How can asthma be treated?\"\n","\n","for j in MaxNewTokens:\n","    print(\"Question:\", TEXT)\n","    print(\"Parameters:\")\n","    print(f\"\\nsetMaxNewTokens({j}):\")\n","    gpt_qa.setMaxNewTokens(j)\n","    pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])\n","\n","    light_model = nlp.LightPipeline(pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\")))\n","    answer_default = light_model.annotate(TEXT)\n","\n","    # Keep only the text after 'answer:' (the whole string is kept if the marker is absent)\n","    answer_text = answer_default[\"answer\"][0].split(\"answer:\", 1)[-1].strip()\n","    wrapped_answer_text = textwrap.fill(answer_text, width=150)\n","    # Whitespace-separated words, used as a rough proxy for the token count\n","    word_count = len(answer_text.split())\n","    print(\"➤ Answer:\")\n","    print(wrapped_answer_text)\n","    print(f\"Number of words generated: {word_count}\")\n","    print(\"-\" * 40)  # Separator line\n"]},{"cell_type":"markdown","source":["<b><h1><font color='darkred'>!!! ATTENTION !!! </font></h1></b>\n","\n","<b>Before running the following cells, <font color='darkred'>RESTART the COLAB RUNTIME</font>, then start your session and continue.</b>"],"metadata":{"id":"M6sjuM3NW-ZS"}},{"cell_type":"markdown","metadata":{"id":"UYOrd_2OLSyD"},"source":["# \t📎🏥 `biogpt_chat_jsl_conversational`"]},{"cell_type":"markdown","metadata":{"id":"z4vCaKWzyBCX"},"source":["This model is based on BioGPT fine-tuned on medical conversations held in a clinical setting and can answer clinical questions related to symptoms, drugs, tests, and diseases. 
The difference between this model and `biogpt_chat_jsl` is that this model produces more concise responses."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"uXW20vHBLd3u","outputId":"dd01f049-9c93-4345-a5a4-be0bdb6915c4"},"outputs":[{"output_type":"stream","name":"stdout","text":["biogpt_chat_jsl_conversational download started this may take some time.\n","[OK!]\n"]}],"source":["document_assembler = nlp.DocumentAssembler() \\\n","    .setInputCol(\"text\") \\\n","    .setOutputCol(\"documents\")\n","\n","gpt_qa = medical.TextGenerator().pretrained(\"biogpt_chat_jsl_conversational\", \"en\", \"clinical/models\")\\\n","    .setInputCols(\"documents\")\\\n","    .setOutputCol(\"answer\")\\\n","    .setMaxNewTokens(399)\\\n","    .setStopAtEos(True)\\\n","    .setDoSample(False)\\\n","    .setTopK(1)\\\n","    .setRandomSeed(42)\\\n","    .setCustomPrompt(\"QUESTION: {DOCUMENT} ANSWER:\")\n","\n","pipeline = nlp.Pipeline().setStages([document_assembler, gpt_qa])\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"zy_Jt_MkCfvn"},"outputs":[],"source":["TEXT = \"What is the difference between melanoma and sarcoma?\"\n","\n","model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n","light_model = nlp.LightPipeline(model)\n","light_result = light_model.annotate(TEXT)\n","answer_text = light_result[\"answer\"]\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"-pqxGMy4Chmz","colab":{"base_uri":"https://localhost:8080/"},"outputId":"2d22d05a-4ff7-4ce4-f260-f6fda510ec46"},"outputs":[{"output_type":"stream","name":"stdout","text":["➤ Answer: \n","Both are blood - borne cancers. Melanoma is a type of skin cancer that arises from melanocytes, the pigment - producing\n","cells in the skin. Sarcoma is a type of bone cancer that arises from bone. 
Both are blood - borne cancers and therefore\n","have very different treatment options.\n","\n","\n"]}],"source":["# Extract the text after 'answer:' (the whole string is kept if the marker is absent)\n","final_answer = answer_text[0].split(\"answer:\", 1)[-1].strip()\n","\n","# Format the text into paragraphs\n","wrapped_text = textwrap.fill(final_answer, width=120)\n","\n","print(\"➤ Answer: \\n{}\".format(wrapped_text))\n","print(\"\\n\")"]},{"cell_type":"markdown","metadata":{"id":"lTKvyKGnyCBP"},"source":["# \t📎🏥 `biogpt_chat_jsl_conditions`"]},{"cell_type":"markdown","metadata":{"id":"5OtNKmjcyCBQ"},"source":["This model is based on BioGPT fine-tuned on medical conversations held in a clinical setting and can answer clinical questions related to symptoms, drugs, tests, and diseases. The difference between this model and `biogpt_chat_jsl` is that this model produces more concise responses."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"outputId":"ace4c4e7-0480-4f55-bf4d-2313b993c614","id":"1zydEm1ZyCBQ","executionInfo":{"status":"ok","timestamp":1686632142634,"user_tz":240,"elapsed":97942,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["biogpt_chat_jsl_conditions download started this may take some time.\n","[OK!]\n"]}],"source":["document_assembler = nlp.DocumentAssembler() \\\n","    .setInputCol(\"text\") \\\n","    .setOutputCol(\"documents\")\n","\n","gpt_qa = medical.TextGenerator().pretrained(\"biogpt_chat_jsl_conditions\", \"en\", \"clinical/models\")\\\n","    .setInputCols(\"documents\")\\\n","    .setOutputCol(\"answer\")\\\n","    .setMaxNewTokens(399)\\\n","    .setStopAtEos(True)\\\n","    .setDoSample(False)\\\n","    .setTopK(1)\\\n","    .setRandomSeed(42)\\\n","    .setCustomPrompt(\"QUESTION: {DOCUMENT} ANSWER:\")\n","\n","pipeline = nlp.Pipeline().setStages([document_assembler, 
gpt_qa])\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Of_DSovWyCBR"},"outputs":[],"source":["TEXT = \"What are the potential causes and risk factors for developing cardiovascular disease?\"\n","\n","\n","model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n","light_model = nlp.LightPipeline(model)\n","light_result = light_model.annotate(TEXT)\n","answer_text = light_result[\"answer\"]\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"outputId":"8eeb4455-bba3-4479-9a51-d1ce56efd9c3","id":"PJeDRasmyCBR","executionInfo":{"status":"ok","timestamp":1686632150598,"user_tz":240,"elapsed":17,"user":{"displayName":"Vildan Sarıkaya","userId":"07789644790967768983"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["➤ Answer: \n","estion: What are the potential causes and risk factors for developing cardiovascular disease ? answer: Cardiovascular\n","disease ( CVD ) is a general term for conditions affecting the heart or blood vessels. It can be caused by a variety of\n","factors, including smoking, high blood pressure, diabetes, high cholesterol, and obesity. Certain medical conditions,\n","such as chronic kidney disease, can also increase the risk of developing CVD.\n","\n","\n"]}],"source":["# Extract the text after 'answer:' (the whole string is kept if the marker is absent)\n","final_answer = answer_text[0].split(\"answer:\", 1)[-1].strip()\n","\n","# Format the text into paragraphs\n","wrapped_text = textwrap.fill(final_answer, width=120)\n","\n","print(\"➤ Answer: \\n{}\".format(wrapped_text))\n","print(\"\\n\")"]}],"metadata":{"accelerator":"GPU","colab":{"machine_shape":"hm","provenance":[]},"gpuClass":"standard","kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
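Note on answer extraction: the recorded output for `biogpt_chat_jsl_conditions` begins with `estion:` and still contains `answer:`, which suggests the generator echoes the prompt in lowercase and that trimming a fixed number of leading characters does not remove it. The following is a minimal sketch of marker-based extraction in plain Python, with no Spark session needed. The helper name `extract_answer` is hypothetical, and the sample string is reconstructed from the recorded output on the assumption that the two trimmed characters were `qu`.

```python
import re
import textwrap


def extract_answer(generated: str, marker: str = "answer:") -> str:
    """Return the text after the first case-insensitive `marker`.

    If the marker does not occur, the whole string is returned (stripped),
    so outputs that do not echo the prompt pass through unchanged.
    """
    parts = re.split(re.escape(marker), generated, maxsplit=1, flags=re.IGNORECASE)
    return parts[-1].strip()


# Assumed raw generator output, rebuilt from the recorded notebook output.
raw = (
    "question: What are the potential causes and risk factors for developing "
    "cardiovascular disease ? answer: Cardiovascular disease ( CVD ) is a general "
    "term for conditions affecting the heart or blood vessels."
)

print(textwrap.fill(extract_answer(raw), width=120))
```

For comparison, an index of the form `len(TEXT[0]) + 1` is always 2, because `TEXT[0]` is a single character, so a slice built on it drops only two characters regardless of the question length.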