A love letter to the CSV format by Yomguithereal

23 Comments

  • Qem
    Posted March 26, 2025 at 5:18 pm

      9. Excel hates CSV
      It clearly means CSV must be doing something right.
    

    This is one area where LibreOffice Calc shines in comparison to Excel. Importing CSVs is much more convenient.

  • Der_Einzige
    Posted March 26, 2025 at 5:27 pm

    I'm on the "shit on Microsoft for hard-to-use formats" train, but as someone who did a LOT of .docx parsing, it turned into zen when I realized I can just convert my docs into easily parsed HTML5 using something like pandoc.
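
    A minimal sketch of that conversion in Python, assuming pandoc is on the PATH (the file names are hypothetical):

        import subprocess

        # Shell out to pandoc to turn a .docx into easily parsed HTML5.
        subprocess.run(
            ["pandoc", "report.docx", "-t", "html5", "-o", "report.html"],
            check=True,  # raise if pandoc exits non-zero
        )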

    This is a good blog post and Xan is a really neat terminal tool.

  • tengwar2
    Posted March 26, 2025 at 5:30 pm

    I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV – it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.

  • nayuki
    Posted March 26, 2025 at 5:32 pm
  • mjw_byrne
    Posted March 26, 2025 at 5:36 pm

    CSV is ever so elegant but it has one fatal flaw – quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:

    1. It's tricky to parallelise processing of CSV.
    2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).

    So these days, for serialisation of simple tabular data, I prefer plain escaping, e.g. comma, newline and backslash are all backslash-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
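
    A minimal sketch of that scheme in Python, assuming comma as the delimiter:

        # Backslash-escape the delimiter, newline and backslash itself, so a
        # record is always one line and any byte can be decoded locally,
        # without knowing whether some distant quote was ever opened.
        def escape(field: str) -> str:
            return (field.replace("\\", "\\\\")
                         .replace(",", "\\,")
                         .replace("\n", "\\n"))

        def split_record(line: str) -> list[str]:
            fields, cur, escaped = [], [], False
            for ch in line:
                if escaped:
                    cur.append("\n" if ch == "n" else ch)
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == ",":  # only unescaped commas split fields
                    fields.append("".join(cur))
                    cur = []
                else:
                    cur.append(ch)
            fields.append("".join(cur))
            return fields

        row = ["a,b", "line\nbreak", "back\\slash"]
        assert split_record(",".join(escape(f) for f in row)) == row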

  • polyrand
    Posted March 26, 2025 at 5:39 pm

    As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.

    Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too. Otherwise, you can:

        zcat data.jsonl.gz | grep ...

    Another advantage of JSONL is that it's easier to chunk into smaller files.
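
    A minimal sketch of that chunking in Python (the part-file naming is hypothetical); any line boundary is a safe cut point, so no parser state has to carry across chunks:

        import itertools

        def chunk_jsonl(path: str, lines_per_chunk: int = 100_000) -> None:
            # Copy fixed-size batches of lines into numbered part files.
            with open(path, encoding="utf-8") as src:
                for i in itertools.count():
                    batch = list(itertools.islice(src, lines_per_chunk))
                    if not batch:
                        break
                    with open(f"{path}.part{i:04d}", "w", encoding="utf-8") as dst:
                        dst.writelines(batch)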

  • jszymborski
    Posted March 26, 2025 at 5:43 pm

    > 4. CSV is streamable

    This is what keeps me coming back.

  • lxe
    Posted March 26, 2025 at 5:48 pm

    TSV > CSV

    Way easier to parse
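
    A quick sketch of why, assuming the strict flavour of TSV in which tabs and newlines are simply forbidden inside fields:

        def parse_tsv_line(line: str) -> list[str]:
            # No quoting state to track: a field can never contain the
            # delimiter, so a plain split is a complete parser.
            return line.rstrip("\r\n").split("\t")

        assert parse_tsv_line("a\tb\tc\n") == ["a", "b", "c"]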

  • brazzy
    Posted March 26, 2025 at 5:50 pm

    Funny how the "specification holds in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.

  • circadian
    Posted March 26, 2025 at 5:52 pm

    Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P

    CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.

    I work in a world of VERY low-power devices, and CSV is sometimes all you need for a good time.

    If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself that CSV fits, and that is what makes it a legend. Are those the times when I want to parallelise or deal with gigs of data in one sitting? Nope. There are more complex formats for that. But CSV has a place in my heart too.

    Thanks for reminding me of the beauty of this legendary format… :)

  • mccanne
    Posted March 26, 2025 at 5:52 pm

    Relevant discussion from a few years back

    https://news.ycombinator.com/item?id=28221654

  • Maro
    Posted March 26, 2025 at 5:53 pm

    I hate CSV (but not as much as XML).

    Most reasonably large CSV files will run into parsing issues when moved to another system.

  • TrackerFF
    Posted March 26, 2025 at 5:55 pm

    Excel hates CSV only if you don't use the "From Text/CSV" function (under the Data tab).

    For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data with datestamps/formats into Excel.

    Just drag-and-dropping a CSV file onto a spreadsheet, or "open with Excel", sucks.

  • mitchpatin
    Posted March 26, 2025 at 5:55 pm

    CSV still quietly powers the majority of the world’s "data plumbing."

    At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.

    It’s just so damn adaptable and easy to understand.

  • primitivesuave
    Posted March 26, 2025 at 5:55 pm

    One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.

  • inglor_cz
    Posted March 26, 2025 at 6:03 pm

    "the controversial ex-post RFC 4180"

    I looked at the RFC. What is controversial about it?

  • slg
    Posted March 26, 2025 at 6:10 pm

    > This is so simple you might even invent it yourself without knowing it already exists while learning how to program.

    As someone who has in the past had to handle CSVs from a variety of different third-party sources, this is a double-edged sword. The "you might even invent it yourself" simplicity means that lots of different places do end up inventing their own version, rather than standardizing on RFC 4180 or whatever, when it comes to quoting values containing commas, values containing quotes, values containing newlines, etc. And that simplicity means these kinds of non-standard implementations can go completely undetected until a problematic value happens to be used. Sometimes added complexity that forces attention to standards, and quickly surfaces deviations from them, is helpful.
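
    A minimal sketch of how two "reasonable" readings of the same line diverge, using Python's csv module for the RFC 4180 reading:

        import csv

        line = 'a,"hello, world",c'

        naive = line.split(",")         # ['a', '"hello', ' world"', 'c'] -- four fields
        rfc = next(csv.reader([line]))  # ['a', 'hello, world', 'c'] -- three fields

        assert naive != rfc  # both parsers look plausible, and they disagree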

  • owlstuffing
    Posted March 26, 2025 at 6:12 pm

    CSV is everywhere. I use manifold-csv[1]; it's amazing.

    1. https://github.com/manifold-systems/manifold/tree/master/man…

  • hajile
    Posted March 26, 2025 at 6:15 pm

    The argument against JSON isn't very compelling. Adding a name to every field as they do in their strawman example isn't necessary.

    Compare this CSV

        field1,field2,fieldN
        "value (0,0)","value (0,1)","value (0,n)"
        "value (1,0)","value (1,1)","value (1,n)"
        "value (2,0)","value (2,1)","value (2,n)"
    

    To the directly-equivalent JSON

        [["field1","field2","fieldN"],
         ["value (0,0)","value (0,1)","value (0,n)"],
         ["value (1,0)","value (1,1)","value (1,n)"],
         ["value (2,0)","value (2,1)","value (2,n)"]]
    

    The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement.

        person,val2,val3,valN
        fname=john&lname=doe&age=55&children=[jill|jim|joey],v2,v3,vN
    

    And in these cases, JSON's objects are WAY better.

    Because CSV is so simple, it's common for developers to skip a parsing/encoding library. Over the years, I've run into this particular kind of issue a bunch.

        // outputs `val1,val2,unexpected,comma,valN`, which has one too many items
        ["val1", "val2", "unexpected,comma", "valN"].join(',')
    

    JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).
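
    The same values round-tripped through a real JSON codec, sketched in Python:

        import json

        values = ["val1", "val2", "unexpected,comma", "valN"]

        encoded = json.dumps(values)
        assert json.loads(encoded) == values  # still four items, comma intact

        naive = ",".join(values)              # 'val1,val2,unexpected,comma,valN'
        assert naive.split(",") != values     # five items now -- data corrupted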

    Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc.? Does it use CR, LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it UTF-8, UCS-2, or something different? JSON doesn't have these issues, because these are all laid out in the spec.

    JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.

    While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.


  • boricj
    Posted March 26, 2025 at 6:17 pm

    I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.

    I've just checked, and strings are escaped using the same mechanism as JSON, with backslashes. I should've double-checked against RFC 4180, but thankfully that mechanism isn't currently triggered anywhere (it's used for log export, and no data there triggers that code path). I've also checked the code from other teams, and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine, for the same reason (log export).
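
    For comparison, RFC 4180 escapes by doubling the quote character rather than backslashing it; a minimal sketch of what Python's csv module emits:

        import csv, io

        buf = io.StringIO()
        csv.writer(buf).writerow(['he said "hi"', "a,b", "plain"])
        print(buf.getvalue())
        # "he said ""hi""","a,b",plain  (quotes doubled, no backslashes)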

    I've also written serializers for JSON, BSON and YAML, and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification; it's a bunch of loosely-related formats that look similar at a glance. There's a reason fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I almost added my own by accident); that's simply not a thing for properly specified file formats.

  • meemo
    Posted March 26, 2025 at 6:20 pm

    Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?

    My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
