# Text Processing Guide

## Model

Models from Hugging Face, Ollama, OpenAI, and Anthropic are all supported. Hugging Face and Ollama models are free of charge and can be run locally, while OpenAI and Anthropic models are hosted externally and require an API key for access. Keep in mind that using non-local models may raise privacy concerns.

### Requirements

- **Hugging Face**: To download models from Hugging Face, you might need a Hugging Face token. Go to [Hugging Face](https://huggingface.co/join) to generate one.
- **Ollama**: If you're using Ollama with Docker, no additional setup is required. Otherwise, install Ollama locally by following the instructions on [Ollama](https://github.com/ollama/ollama). For Linux, it should be as simple as:

  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ```

- **OpenAI**: To access OpenAI models, create an OpenAI account and get an API key.
- **Anthropic**: To use Anthropic models, sign up at [Anthropic's Console](https://console.anthropic.com/) and obtain an API key.

### Command Line Arguments

When using a Text tool, you can configure the model with the following command line arguments:

- **`--provider`**: Specifies the model provider you want to use. Options include:
  - `hf` (short for Hugging Face)
  - `ollama`
  - `openai`
  - `anthropic`
- **`--model`**: The name of the specific language model you would like to use. Make sure the model is compatible with the provider you have selected.
- **`--model_config`**: An optional argument pointing to a `.yaml` configuration file. In this file you can set specific runtime settings (e.g., temperature), as well as the model and provider (so you do not have to specify them each time you run a command).
- **`--api_key`**: If the provider you are using requires a subscription or token for access, supply the API key here. You can also supply the API key as the environment variable **HF_TOKEN**, **OPENAI_API_KEY**, or **ANTHROPIC_API_KEY** instead of including it on the command line.

> **Note**: To better protect your API key, do not store it in the `.yaml` configuration file.

#### Example `.yaml` Model Configuration File

Below is an example of a `.yaml` file for the model configuration:

```yaml
provider: ollama
model: llama3.3
temperature: 0.7
```

In this example:

- `provider` specifies the model provider.
- `model` sets the name of the language model.
- `temperature` adjusts the randomness of the output. A value closer to 0 makes the output more deterministic, while a higher value increases creativity.

---

## Usage

### Chat

This feature is intended for benchmarking and testing LLMs (either local or third-party) on language processing tasks. The user can interact with an LLM in the command terminal, specifying a prompt and an output file in which to store the conversation.

```bash
psifx text chat \
    [--prompt chat_history.txt] \
    [--output file.txt] \
    [--provider ollama] \
    [--model llama3.1] \
    [--model_config model_config.yaml] \
    [--api_key api_key]
```

- `--prompt`: Prompt or path to a `.txt` file containing the prompt / chat history.
- `--output`: Path to a `.txt` save file.
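For example, a quick local test with Ollama might look like this (the model and file names here are illustrative):

```bash
# Chat interactively with a local Llama model served by Ollama,
# seeding the conversation with an inline prompt and saving the
# transcript to conversation.txt.
psifx text chat \
    --prompt "You are a helpful assistant." \
    --output conversation.txt \
    --provider ollama \
    --model llama3.1
```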
### Instruction

This feature is intended to allow general usage of LLMs (either local or third-party) for language processing tasks.

```bash
psifx text instruction \
    --instruction instruction.yaml \
    --input input.txt \
    --output output.txt \
    [--provider ollama] \
    [--model llama3.1] \
    [--model_config model_config.yaml] \
    [--api_key api_key]
```

- `--instruction`: Path to a `.yaml` file containing the prompt and parser, detailed below.
- `--input`: Path to the input file.
- `--output`: Path to the output file.

Supported format combinations:

- `.txt` input → `.txt` output
- `.vtt` input → `.txt` output
- `.csv` input → `.csv` output

> **Note**: The `.txt` and `.vtt` formats are suited for simpler use cases.
> The `.csv` format, however, allows you to process multiple rows of data and to use complex prompts that combine several pieces of information.

#### Instruction files

Both the prompt and the parser are specified in a `.yaml` file. The model generates an answer to the prompt; this answer is then processed by the parser, whose result you get as output.

```yaml
prompt: |
  user: Here is a semi-structured interview transcript between an interviewer
  denoted 'INTERVIEWER' and a patient denoted 'PATIENT' who is reviewing a
  mobile app: {text}.
  I am interested in the following question with the desired response types
  in parentheses. Do not make anything up that is not in the original
  transcript. If there is no information to answer the question, just write
  NA. If they barely say anything about a question, then the certainty
  should be very low.
  1) Does the patient find the app useful? (Two integers: Rating out of 10
  with certainty out of 10 where 10 is maximally certain)
parser:
  to_lower: True
```

##### Prompt

The prompt enables you to tell the model what you want and to guide its generation.

```yaml
prompt: |
  system: You are an expert doctor.
  user: You take care of a new patient.
  assistant: What are the patient symptoms?
  user: The patient has the following symptoms {text}.
```

Prompts can be customized with the headers **system**, **user**, and **assistant**. In prompts, **{text}** is a placeholder for the content of a `.txt` or `.vtt` file. When using `.csv` files, you instead use placeholders for the content of columns, specified as **{column_name}**. Hence, with a `.csv` file you can have placeholders referring to different elements, e.g., **{city}** and **{county}**.

```yaml
prompt: |
  user: A patient stayed in hospital {hospital_name}.
  He was asked to fill in this satisfaction questionnaire: {questionary_content}
  Here are the answers he gave: {patient_answers}
  On a scale out of 10, how satisfied was the patient? What did he think
  could be improved?
```

##### Parser

The parser enables you to post-process the generated text. It is optional; to use it, specify a parser in the `.yaml` file.

```yaml
prompt: |
  user: ...
parser:
  start_after: 'ANSWER:'
  regex: '<(.*)>'
  to_lower: True
  expect:
    - 'yes'
    - 'no'
```

The steps are all optional and are applied in the following order:

- `start_after` (*str*, optional): Retains only the portion of the generated text that follows the last occurrence of the specified string. _If the string is not found, the full text is retained, and an error message is displayed._
- `regex` (*str*, optional): Applies a regular expression search to the retained text. If capturing groups are present, only the matched groups are returned; otherwise, the full match is used. _If no match is found, the full text is retained, and an error message is displayed._
- `to_lower` (*bool*, default=`False`): Converts the final output to lowercase if set to `True`.
- `expect` (*list[str]*, optional): A list of expected output values.
_If the final result is not found in this list, an error message is displayed._
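To make these steps concrete, consider the parser shown above applied to a hypothetical generated text (the model answer below is made up for the example):

```
Let me reason step by step. The patient sounds satisfied. ANSWER: <Yes>
```

Here `start_after` keeps everything after the last `ANSWER:`, `regex` extracts `Yes` from the capturing group, `to_lower` turns it into `yes`, and `expect` accepts it, so `yes` is written to the output.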