Experimenting with Local LLMs

I have been experimenting with local LLMs on a Tesla P40 GPU with 24GB of VRAM (without my 980 Ti), and it is reasonably responsive with 13B-parameter models. I am using a Linux Mint VM on Proxmox and was able to pass through the GPU and install the drivers.

[Screenshot: LLM-Test-GPU]
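
For anyone attempting the same setup, the Proxmox side of the passthrough boils down to a few host settings, sketched below. The PCI address and VM ID are placeholders for my machine, AMD hosts use amd_iommu=on instead of intel_iommu=on, and the exact steps can vary by Proxmox version:

```
# /etc/default/grub — enable IOMMU on the host, then run `update-grub` and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

# /etc/modules — load the VFIO modules at boot
vfio
vfio_iommu_type1
vfio_pci

# /etc/pve/qemu-server/<vmid>.conf — attach the P40 to the VM
# (replace 01:00.0 with the address from `lspci -nn | grep -i nvidia`;
#  pcie=1 assumes the VM uses the q35 machine type)
hostpci0: 01:00.0,pcie=1
```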

I was able to import the models and interact with them locally using an open-source web UI:

https://github.com/oobabooga/text-generation-webui
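
Getting the UI running was mostly a matter of cloning the repo and using its start script; something like the following works on recent versions (the script name and first-run behavior vary by platform and release, so treat this as a sketch):

```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh            # first run installs dependencies
# downloaded model folders go under text-generation-webui/models/
```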

I used one of the most popular Llama 2 models with 13B parameters and compared it to a newer variant (Tiefighter) that has been modified to remove censorship and fine-tuned to improve accuracy.

[Screenshot: LLM-Models-Test]

A simple way to see whether information is censored is to ask how to break into a car. The Tiefighter model does provide an answer after a few warnings. The other Llama 2 model refused to give any answer, and I did not try any techniques to trick it into answering.

Llama 2 GPTQ

[Screenshot: Llama2-Car-Censored]

Tiefighter

[Screenshot: TieFighter-Car-Uncensored]

I also found that the Tiefighter model was generally more accurate on difficult questions. The toughest examples seem to be math problems: I was eventually able to get Tiefighter to solve a simple problem that required understanding PEMDAS, but the other model never could, and neither of them could multiply large numbers.

Llama 2 GPTQ

[Screenshot: Llama2-Math]

Tiefighter

[Screenshot: TieFighter-Math]
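
Python makes a handy referee for these kinds of questions since it follows standard operator precedence; the expressions below are illustrative stand-ins, not the exact prompts I used:

```python
# Ground truth for the kinds of math both models struggled with.
print(2 + 3 * 4)        # 14 — a model that answers 20 is ignoring PEMDAS
print(123456 * 789012)  # 97408265472 — large multiplications stumped both models
```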

This Tiefighter model looks promising and may be a good choice for further fine-tuning. One use would be to train it further on a private dataset, such as proprietary information from within a business. Another opportunity would be to combine it with other tools and select which questions each is best suited to answer: math could be routed to a calculator, knowledge-based questions could be routed to a tool like Wolfram Alpha, and anything more descriptive, or not fitting the previous categories, could be routed to the LLM.
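
As a rough sketch of that routing idea (the keyword heuristics below are naive placeholders, not a real classifier; a production system might use a small model or the LLM itself to pick the route):

```python
import ast
import operator
import re

# Operators allowed in the "calculator" route.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Evaluate a pure-arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("not pure arithmetic")
    return walk(ast.parse(expr, mode="eval"))

def route(question: str) -> str:
    # 1. Pure arithmetic goes to the calculator, never the LLM.
    expr = question.strip().rstrip("?").replace("x", "*")
    try:
        return f"calculator: {safe_eval(expr)}"
    except (ValueError, SyntaxError):
        pass
    # 2. Fact-lookup phrasing goes to a knowledge tool (Wolfram Alpha, etc.).
    if re.match(r"(?i)\s*(who|when|where|what is the)\b", question):
        return "knowledge tool (e.g. Wolfram Alpha)"
    # 3. Everything descriptive or open-ended falls through to the local LLM.
    return "local LLM"

print(route("2 + 3 * 4"))                      # calculator: 14
print(route("Who wrote The Hobbit?"))          # knowledge tool
print(route("Draft an email to my landlord"))  # local LLM
```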

Creating Long Written Content from Scratch

I have started experimenting with having it generate long-form written content, and the results can vary significantly based on how you present the request. There has already been a lot of testing around summarizing articles or documents, so I am focusing on writing new content. Here is what I have found to be an efficient way to get the LLMs to help:

  • Type out all your thoughts about what you are going to write, without worrying about structure or editing
  • Create a quick rough draft of the document or other written content with minimal effort
  • Tell the LLM that you want it to improve the draft and that you will also provide additional related information
  • In the same prompt, insert the draft and the additional info with clear separation, and then submit (see the sketch after this list)
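
Here is a minimal sketch of that final prompt, assuming the web UI was started with its API enabled. Newer builds of text-generation-webui expose an OpenAI-compatible completions route on port 5000, but the exact endpoint, port, and flags depend on your version, so adjust for your install:

```python
import requests

draft = open("draft.txt").read()   # the quick, messy rough draft
notes = open("notes.txt").read()   # the unstructured brain-dump of related info

# One prompt, with clear separators so the model can tell the parts apart.
prompt = (
    "Improve the draft below into polished long-form writing. "
    "Use the additional information where it is relevant.\n\n"
    "=== DRAFT ===\n" + draft + "\n\n"
    "=== ADDITIONAL INFORMATION ===\n" + notes + "\n\n"
    "=== IMPROVED VERSION ===\n"
)

# Endpoint and payload follow the OpenAI-compatible API in newer
# text-generation-webui builds; this is an assumption, not a guarantee.
resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    json={"prompt": prompt, "max_tokens": 1500, "temperature": 0.7},
    timeout=600,
)
print(resp.json()["choices"][0]["text"])
```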

This method may seem like a lot of work, but I am seeing excellent results from this Tiefighter variant. It is harder to organize your thoughts and perfect your writing than it is to chaotically type up everything like this. It would be risky to do this with online LLMs if any of the information used is not meant to be public, so running it locally is a huge benefit.

Do not enter multiple prompts or ask follow-up questions for long-form content like this; it will quickly cause hallucinations. If you do not get the results that you want, immediately start a new session and change the initial prompt. I will continue testing and create some examples in the future.