Something is wrong with the chat - it breaks after some time
#4
by alexaione - opened
Using the default llama.cpp web UI, the chat works at first but stops responding randomly after a few conversations.
It also stops working with Open WebUI.
Tested with KiloCode in VS Code: it does not respond at all.
On the log side, I am not sure what to look for, so I am adding the output below (not sure if it is helpful):
======================
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot print_timing: id 3 | task 0 |
prompt eval time = 574.50 ms / 16 tokens ( 35.91 ms per token, 27.85 tokens per second)
eval time = 355.56 ms / 26 tokens ( 13.68 ms per token, 73.12 tokens per second)
total time = 930.06 ms / 42 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 41, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 27 | processing task, is_child = 0
slot update_slots: id 2 | task 27 | new prompt, n_ctx_slot = 25600, n_keep = 0, task.n_tokens = 9373
slot update_slots: id 2 | task 27 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 27 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.218500
slot update_slots: id 2 | task 27 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 2 | task 27 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.437000
slot update_slots: id 2 | task 27 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 2 | task 27 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.655500
slot update_slots: id 2 | task 27 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 2 | task 27 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.874000
slot update_slots: id 2 | task 27 | n_tokens = 8192, memory_seq_rm [8192, end)
slot init_sampler: id 2 | task 27 | init sampler, took 0.72 ms, tokens: text = 9373, total = 9373
slot update_slots: id 2 | task 27 | prompt processing done, n_tokens = 9373, batch.n_tokens = 1181
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
srv stop: cancel task, id_task = 27
slot release: id 2 | task 27 | stop processing: n_tokens = 12938, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.661 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 3599 | processing task, is_child = 0
slot update_slots: id 3 | task 3599 | new prompt, n_ctx_slot = 25600, n_keep = 0, task.n_tokens = 62
slot update_slots: id 3 | task 3599 | n_tokens = 41, memory_seq_rm [41, end)
slot init_sampler: id 3 | task 3599 | init sampler, took 0.02 ms, tokens: text = 62, total = 62
slot update_slots: id 3 | task 3599 | prompt processing done, n_tokens = 62, batch.n_tokens = 21
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
======================
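For what it's worth, the per-request timing lines in the log can be parsed to spot the last request that completed before the hang. A minimal sketch (the regex targets the `print_timing` format shown in the excerpt above; anything beyond that format is an assumption):

```python
import re

# Matches llama-server timing lines such as:
#   prompt eval time = 574.50 ms / 16 tokens ( 35.91 ms per token, 27.85 tokens per second)
#   eval time        = 355.56 ms / 26 tokens
#   total time       = 930.06 ms / 42 tokens
TIMING_RE = re.compile(
    r"(?P<phase>prompt eval|eval|total) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens"
)

def parse_timings(log_text):
    """Return a list of (phase, milliseconds, token_count) tuples found in log_text."""
    return [
        (m.group("phase"), float(m.group("ms")), int(m.group("tokens")))
        for m in TIMING_RE.finditer(log_text)
    ]
```

A request that logs `launch_slot_`/`processing task` but never produces a matching `print_timing` block would be a candidate for the one that hung.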
llama.cpp has introduced regressions in the parser over the past few weeks, and this could be a consequence. I can't run Mistral Small 3.2 on the current llama.cpp code anymore (I get 500 errors from llama-server; it has trouble with the template/parser), so I have to use an old commit from early March. The additional problem with mistral 4 is that, I think, some recent fixes were necessary to make it run at all, so you probably can't revert cleanly either.
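Until the regression is tracked down, a quick way to tell a hung server from a client-side issue is a timed probe against the completions endpoint. A sketch, assuming the server listens on 127.0.0.1:8080 (adjust host, port, and timeout to your setup):

```python
import json
import urllib.error
import urllib.request

def probe_server(base_url="http://127.0.0.1:8080", timeout=10.0):
    """POST a tiny chat request; return True if the server answers within the timeout."""
    payload = json.dumps({
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # A timeout here, while the server log shows no new activity,
        # points at the server (or its parser) being stuck rather than the client.
        return False
```

If this returns False while the web UI also hangs, the problem is on the server side; if it returns True, the issue is more likely in the specific client (Open WebUI, KiloCode, etc.) or its chat template handling.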