Or why people claiming CSV is dead are wrong
Every month or so, a new blog article declaring the near demise of CSV in favor of some “obviously superior” format (parquet, newline-delimited JSON, MessagePack records, etc.) finds its way to readers’ eyes. Sadly, those articles often offer a very narrow and biased comparison and fail to understand what makes CSV a seemingly unkillable staple of data serialization.
It is therefore my intention, through this article, to write a love letter to this data format, so often criticized for the wrong reasons, even more so when it is somehow deemed “cool” to hate on it. My point is not to claim that CSV is a silver bullet, far from it, but rather to shine a light on some of the format’s sometimes overlooked strengths.
The whole specification of CSV fits in its name: “comma-separated values”. Okay, that’s a bit of a lie, but still, the specification fits in a tweet and can be explained to anybody in seconds: commas separate values, newlines separate rows. Now quote values containing commas or line breaks, double the quotes inside quoted values, and that’s it. This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
Of course, this does not mean you should skip using a dedicated CSV parser/writer, because you will inevitably mess something up on your own.
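As a small illustration of those quoting rules in practice, here is a minimal sketch using Python’s standard `csv` module (the sample values are made up):

```python
import csv, io

# Round-trip a value containing a comma, a quote and a newline through
# Python's standard csv module; the writer quotes and doubles as needed.
rows = [["name", "quote"],
        ["Doe, Jane", 'She said "hi"\nthen left']]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
# name,quote
# "Doe, Jane","She said ""hi""
# then left"

buf.seek(0)
assert list(csv.reader(buf)) == rows  # parses back to the original values
```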
No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.
Like JSON, YAML or XML, CSV is just plain text that you are free to encode however you like. CSV is not a binary format; it can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can be both read and edited by humans directly, somehow.
CSV can be read row by row very easily without requiring more memory than what is needed to fit a single row. This also means that a trivial program that anyone can write is able to read gigabytes of CSV data with only some kilobytes of RAM.
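A minimal sketch of what that looks like (Python’s standard `csv` module; `huge.csv` is a placeholder file name):

```python
import csv

# Stream a potentially multi-gigabyte CSV while holding only one row in
# memory at a time ("huge.csv" is a placeholder file name).
count = 0
with open("huge.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        count += 1  # do whatever per-row work you need here
print(f"{count} rows, {len(header)} columns")
```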
By comparison, column-oriented data formats such as parquet are not able to stream files row by row without requiring you to jump here and there in the file or to buffer the memory cleverly so you don’t tank read performance.
But of course, CSV is terrible if you are only interested in specific columns, because you will indeed need to read each row in full just to access the parts you care about.
23 Comments
Qem
This is one area where LibreOffice Calc shines in comparison to Excel. Importing CSVs is much more convenient.
Der_Einzige
I'm in on the "shit on microsoft for hard to use formats train" but as someone who did a LOT of .docx parsing – it turned into zen when I realized that I can just convert my docs into the easily parsed .html5 using something like pandoc.
This is a good blog post and Xan is a really neat terminal tool.
tengwar2
I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV – it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.
nayuki
I greatly prefer TSV over CSV. https://en.wikipedia.org/wiki/Tab-separated_values
mjw_byrne
CSV is ever so elegant but it has one fatal flaw – quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:
1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days, for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and backslash are all backslash-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
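Something along these lines, presumably (a rough sketch of the backslash-escaping idea, not the commenter's actual code):

```python
# Backslash-escape the backslash itself, the delimiter and the newline, so
# every record is exactly one physical line and a corrupted byte can only
# damage the field it sits in (no non-local effects).
def escape_field(value: str) -> str:
    return (value.replace("\\", "\\\\")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))

def write_record(fields: list[str]) -> str:
    return ",".join(escape_field(f) for f in fields)

print(write_record(["a,b", "two\nlines", "back\\slash"]))
# a\,b,two\nlines,back\\slash
```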
polyrand
As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too; otherwise, you can decompress on the fly and pipe the output into the usual tools.
Another advantage of JSONL is that it's easier to chunk into smaller files.
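A small sketch of that streaming-plus-compression combination (`events.jsonl.gz` is a placeholder name):

```python
import gzip, json

# Stream a gzip-compressed JSONL file one record at a time; each line is an
# independent JSON document, so only one record is ever held in memory.
with gzip.open("events.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # ... process `record` here
```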
jszymborski
> 4. CSV is streamable
This is what keeps me coming back.
lxe
TSV > CSV
Way easier to parse
brazzy
Funny how the specification "fits in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.
circadian
Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a world of VERY low power devices and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself that CSV fits, and that is what makes it a legend. Are those the times when I want to parallelise or deal with gigs of data in one sitting? Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format… :)
mccanne
Relevant discussion from a few years back
https://news.ycombinator.com/item?id=28221654
Maro
I hate CSV (but not as much as XML).
Most reasonably large CSV files will have issues parsing on another system.
TrackerFF
Excel hates CSV only if you don't use the "From text / csv" function (under the data tab).
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
mitchpatin
CSV still quietly powers the majority of the world’s "data plumbing."
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
primitivesuave
One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.
inglor_cz
"the controversial ex-post RFC 4180"
I looked at the RFC. What is controversial about it?
slg
>This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
As someone who has in the past had to handle CSVs from a variety of different third-party sources, this is a double-edged sword. The "you might even invent it yourself" simplicity means that lots of different places do end up just inventing their own version rather than standardizing to RFC-4180 or whatever when it comes to "quote values containing commas", values containing quotes, values containing newlines, etc. And the simplicity means these kinds of non-standard implementations can go completely undetected until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards, and quickly surfaces a divergence from those standards, is helpful.
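A tiny illustration of how such a divergence stays invisible until the wrong value shows up (my example, not the commenter's):

```python
import csv, io

# A hand-rolled split(",") looks fine until a quoted comma appears, and then
# it silently produces the wrong fields; a real parser handles it.
line = '1,"Smith, Jane",NY'

print(line.split(","))                      # ['1', '"Smith', ' Jane"', 'NY']
print(next(csv.reader(io.StringIO(line))))  # ['1', 'Smith, Jane', 'NY']
```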
owlstuffing
CSV is everywhere. I use manifold-csv[1]; it's amazing.
1. https://github.com/manifold-systems/manifold/tree/master/man…
hajile
The argument against JSON isn't very compelling. Adding a name to every field as they do in their strawman example isn't necessary.
Compare a plain CSV table to the directly-equivalent JSON, i.e. an array of arrays rather than an array of objects.
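Something along these lines, presumably (the concrete rows are mine, purely illustrative):

```python
import csv, io, json

# The same table as CSV versus a directly-equivalent JSON array of arrays,
# with no field names repeated on every row.
rows = [["id", "name", "age"], [1, "Alice", 34], [2, "Bob", 27]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()
json_text = json.dumps(rows, separators=(",", ":"))

print(csv_text)    # id,name,age / 1,Alice,34 / 2,Bob,27
print(json_text)   # [["id","name","age"],[1,"Alice",34],[2,"Bob",27]]
print(len(csv_text), len(json_text))  # the JSON overhead is mostly brackets and quotes
```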
The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement.
And in these cases, JSON's objects are WAY better.
Because CSV is so simple, it's common for people to skip a parsing/encoding library entirely. Over the years, I've run into this particular kind of issue a bunch.
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).
Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc? Does it use CR, LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it utf-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
boricj
I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.
I've just checked and strings are escaped using the same mechanism as JSON, with backslashes. I should've double-checked against RFC-4180, but thankfully that mechanism isn't currently triggered anywhere (it's used for log exportation and none of that data hits the escaping code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
meemo
Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
williamcotton
Essential CSV shell tools:
csvtk: https://bioinf.shenwei.me/csvtk/
gawk: https://www.gnu.org/software/gawk/manual/html_node/Comma-Sep…
awk: https://github.com/onetrueawk/awk?tab=readme-ov-file#csv