Chonky is a Python library that intelligently segments text into meaningful semantic chunks using a fine-tuned transformer model. It can be used in RAG (retrieval-augmented generation) systems.
from chonky import TextSplitter
# on the first run it will download the transformer model
splitter = TextSplitter(device="cpu")
text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, dis
10 Comments
jaggirs
Did you evaluate it on a RAG benchmark?
suddenlybananas
I feel you could improve your README.md considerably just by showing the actual output of the little snippet you show.
mentalgear
I applaud the FOSS initiative, but as with anything ML: benchmarks, please, so we can see what test cases are covered and how well they align with a project's needs.
petesergeant
Love that people are trying to improve chunkers, but just some examples of how it chunked some input text in the README would go a long way here!
mathis-l
You might want to take a look at https://github.com/segment-any-text/wtpsplit
It uses a similar approach but the focus is on sentence/paragraph segmentation generally and not specifically focused on RAG. It also has some benchmarks. Might be a good source of inspiration for where to take chonky next.
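For reference, a minimal sketch of the wtpsplit interface as described in its README (the model name and keyword argument below are assumptions taken from that README, not verified against chonky):

from wtpsplit import SaT

# "sat-3l-sm" is one of the smaller Segment-any-Text checkpoints
sat = SaT("sat-3l-sm")

print(sat.split("This is a test This is another test."))
# per the wtpsplit README, this yields something like ["This is a test ", "This is another test."]

# paragraph-level segmentation: a list of paragraphs, each a list of sentences
print(sat.split("First topic sentence one. Sentence two. A new topic starts here.",
                do_paragraph_segmentation=True))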
oezi
Just to understand: the model is trained to insert paragraph breaks into text, and the training dataset is books (as opposed to, for instance, scientific articles or advertising flyers).
It shouldn't break sentences at commas, right?
sushidev
So I could use this to index, e.g., a fiction book in a vector DB, right?
And the semantic chunking will possibly provide better results at query time for RAG; did I understand that correctly?
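For context, a minimal sketch of that kind of pipeline, with chonky used for chunking as in the snippet above; the embedding model, file name, and brute-force cosine search are illustrative assumptions rather than anything chonky ships with:

import numpy as np
from chonky import TextSplitter
from sentence_transformers import SentenceTransformer

splitter = TextSplitter(device="cpu")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

book_text = open("novel.txt").read()   # hypothetical fiction book
chunks = list(splitter(book_text))     # semantic chunks from chonky

# embed once at index time; normalized vectors make dot product = cosine similarity
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query, k=5):
    # brute-force search; a real setup would store chunk_vecs in a vector DB
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    return [(float(scores[i]), chunks[i]) for i in np.argsort(-scores)[:k]]

for score, chunk in retrieve("Where does the protagonist grow up?"):
    print(f"{score:.3f}  {chunk[:80]}")

In a real deployment the normalized chunk vectors would go into a vector database (FAISS, Qdrant, etc.) instead of an in-memory matrix, but the retrieval idea is the same.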
acstorage
You mention that fine-tuning took half a day; have you thought about reducing that time?
dmos62
Pretty cool. What use case did you have for this? Text with paragraph breaks missing seems fairly exotic.
cmenge
> I took the base distilbert model
I read "the base Dilbert model", all sorts of weird ideas going through my head, concluded I should re-read and made the same mistake again XD
Guess I better take a break and go for a walk now…