Show HN: Chonky – a neural approach for text semantic chunking by hessdalenlight

10 Comments

  • jaggirs
    Posted April 13, 2025 at 8:10 am

    Did you evaluate it on a RAG benchmark?

  • suddenlybananas
    Posted April 13, 2025 at 8:26 am

    I feel you could improve your README.md considerably just by showing the actual output of the little snippet you include.

  • mentalgear
    Posted April 13, 2025 at 8:28 am

    I applaud the FOSS initiative, but as with anything ML: benchmarks, please, so we can see what test cases are covered and how well they align with a project's needs.

  • petesergeant
    Posted April 13, 2025 at 8:30 am

    Love that people are trying to improve chunkers, but just some examples of how it chunked some input text in the README would go a long way here!

  • mathis-l
    Posted April 13, 2025 at 9:10 am

    You might want to take a look at https://github.com/segment-any-text/wtpsplit

    It uses a similar approach, but it focuses on sentence/paragraph segmentation in general rather than specifically on RAG. It also has some benchmarks. Might be a good source of inspiration for where to take chonky next. [A wtpsplit usage sketch follows the thread.]

  • oezi
    Posted April 13, 2025 at 9:44 am

    Just to understand: the model is trained to insert paragraph breaks into text, and the training dataset is books (as opposed to, for instance, scientific articles or advertising flyers).

    It shouldn't break sentences at commas, right? [A sketch of the paragraph-break approach follows the thread.]

  • sushidev
    Posted April 13, 2025 at 10:59 am

    So I could use this to index, e.g., a fiction book in a vector DB, right?
    And the semantic chunking would possibly provide better results at query time for RAG; did I understand that correctly? [A sketch of that retrieval loop follows the thread.]

  • acstorage
    Posted April 13, 2025 at 12:45 pm

    You mention that fine-tuning took half a day; have you thought about reducing that time?

  • dmos62
    Posted April 13, 2025 at 1:20 pm

    Pretty cool. What use case did you have for this? Text with missing paragraph breaks seems fairly exotic.

  • cmenge
    Posted April 13, 2025 at 1:26 pm

    > I took the base distilbert model

    I read "the base Dilbert model", all sorts of weird ideas going through my head, concluded I should re-read and made the same mistake again XD

    Guess I better take a break and go for a walk now…
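
Following up on mathis-l's wtpsplit pointer, a minimal usage sketch based on wtpsplit's README; the "sat-3l-sm" model name is an assumption and may not match the checkpoints that are currently published.

    # Minimal sketch of wtpsplit's "Segment any Text" API (pip install wtpsplit).
    # The model name "sat-3l-sm" is an assumption; check the wtpsplit repository
    # for the checkpoints that are actually published.
    from wtpsplit import SaT

    sat = SaT("sat-3l-sm")

    text = "this is a test this is another test"
    print(sat.split(text))  # expected: a list of sentence strings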
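
On oezi's reading of the approach: the general recipe is a token-classification model that predicts where paragraph breaks belong, and the text is split at those predicted positions rather than at punctuation such as commas. Below is a minimal sketch of that idea with the Hugging Face transformers pipeline; the model ID is a hypothetical placeholder, and Chonky's own wrapper API may differ.

    # Sketch of paragraph-break prediction via token classification with the
    # Hugging Face `transformers` pipeline. The model ID is a hypothetical
    # placeholder for a DistilBERT checkpoint fine-tuned on paragraph breaks.
    from transformers import pipeline

    MODEL_ID = "your-org/distilbert-paragraph-splitter"  # hypothetical

    splitter = pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",
    )

    text = "First topic sentence one. First topic sentence two. A new topic starts here."

    # Each prediction marks a span where the model places a paragraph break;
    # split the text at the end offsets of those spans.
    break_offsets = sorted({p["end"] for p in splitter(text)})
    chunks, start = [], 0
    for end in break_offsets:
        chunks.append(text[start:end].strip())
        start = end
    chunks.append(text[start:].strip())

    for chunk in chunks:
        if chunk:
            print(chunk)
            print("--")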
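
On sushidev's question: the comment describes the usual RAG loop, split the book into chunks, embed each chunk, and retrieve the closest chunks at query time. Here is a minimal sketch of that loop with sentence-transformers and plain cosine similarity standing in for a vector DB; the example chunks are invented and would normally come from the splitter.

    # Sketch of the chunk -> embed -> retrieve loop for RAG, with NumPy cosine
    # similarity standing in for a real vector DB. The chunks below are made up
    # and would normally be produced by a semantic splitter such as Chonky.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    chunks = [
        "Elizabeth refuses Mr. Darcy's first proposal.",
        "Mr. Collins is set to inherit the Longbourn estate.",
        "Lydia elopes with Wickham, alarming the family.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)

    query = "Who runs away with Wickham?"
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # With normalized vectors, cosine similarity is a dot product.
    scores = chunk_vecs @ query_vec
    best = int(np.argmax(scores))
    print(chunks[best], float(scores[best]))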
