Chonky is a Python library that intelligently segments text into meaningful semantic chunks using a fine-tuned transformer model. It can be used in RAG (retrieval-augmented generation) systems.
from chonky import TextSplitter
# on the first run it will download the transformer model
splitter = TextSplitter(device="cpu")
text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, dis
10 Comments
jaggirs
Did you evaluate it on a RAG benchmark?
suddenlybananas
I feel you could improve your README.md considerably just by showing the actual output of the little snippet you show.
mentalgear
I applaud the FOSS initiative, but as with anything ML: benchmarks, please, so we can see what test cases are covered and how well they align with a project's needs.
petesergeant
Love that people are trying to improve chunkers, but just some examples of how it chunked some input text in the README would go a long way here!
mathis-l
You might want to take a look at https://github.com/segment-any-text/wtpsplit
It uses a similar approach but the focus is on sentence/paragraph segmentation generally and not specifically focused on RAG. It also has some benchmarks. Might be a good source of inspiration for where to take chonky next.
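For reference, a minimal sketch of the wtpsplit interface as described in its README (the model name and keyword argument below are assumptions taken from that README, not verified against chonky):

from wtpsplit import SaT

# "sat-3l-sm" is one of the smaller Segment-any-Text checkpoints
sat = SaT("sat-3l-sm")

print(sat.split("This is a test This is another test."))
# per the wtpsplit README, this yields something like ["This is a test ", "This is another test."]

# paragraph-level segmentation: a list of paragraphs, each a list of sentences
print(sat.split("First topic sentence one. Sentence two. A new topic starts here.",
                do_paragraph_segmentation=True))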
oezi
Just to understand: the model is trained to insert paragraph breaks into text, and the training dataset is books (as opposed to, for instance, scientific articles or advertising flyers).
It shouldn't break sentences at commas, right?
sushidev
So I could use this to index, e.g., a fiction book in a vector DB, right?
And the semantic chunking will possibly provide better results at query time for RAG; did I understand that correctly?
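For context, a minimal sketch of that kind of pipeline, with chonky used for chunking as in the snippet above; the embedding model, file name, and brute-force cosine search are illustrative assumptions rather than anything chonky ships with:

import numpy as np
from chonky import TextSplitter
from sentence_transformers import SentenceTransformer

splitter = TextSplitter(device="cpu")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

book_text = open("novel.txt").read()   # hypothetical fiction book
chunks = list(splitter(book_text))     # semantic chunks from chonky

# embed once at index time; normalized vectors make dot product = cosine similarity
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=5):
    # brute-force search; a real setup would store chunk_vecs in a vector DB
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    return [(float(scores[i]), chunks[i]) for i in np.argsort(-scores)[:k]]

for score, chunk in retrieve("Where does the protagonist grow up?"):
    print(f"{score:.3f}  {chunk[:80]}")

In a real deployment the normalized chunk vectors would go into a vector database (FAISS, Qdrant, etc.) instead of an in-memory matrix, but the retrieval idea is the same.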
acstorage
You mention that fine-tuning took half a day; have you thought about reducing that time?
dmos62
Pretty cool. What use case did you have for this? Text with paragraph breaks missing seems fairly exotic.
cmenge
> I took the base distilbert model
I read "the base Dilbert model", all sorts of weird ideas going through my head, concluded I should re-read and made the same mistake again XD
Guess I better take a break and go for a walk now…