Or why people claiming CSV is dead are wrong
Every month or so, a new blog article declaring the near demise of CSV in favor of some “obviously superior” format (parquet, newline-delimited JSON, MessagePack records, etc.) finds its way to readers’ eyes. Sadly, those articles often offer a very narrow and biased comparison and fail to understand what makes CSV a seemingly unkillable staple of data serialization.
It is therefore my intention, through this article, to write a love letter to this data format, so often criticized for the wrong reasons, even more so when it is somehow deemed “cool” to hate on it. My point is not to claim that CSV is a silver bullet, far from it, but rather to shine a light on some of the format’s sometimes overlooked strengths.
The whole specification of CSV fits in its name: “comma-separated values”. Okay, that’s a bit of a lie, but still, the specification fits in a tweet and can be explained to anybody in seconds: commas separate values, newlines separate rows. Now quote values containing commas or line breaks, double the quotes inside quoted values, and that’s it. This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
Of course, this does not mean you should skip using a dedicated CSV parser/writer, because you will inevitably mess something up on your own.
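As a small illustration of those quoting rules in practice, here is a minimal sketch using Python’s standard `csv` module (the sample values are made up):

```python
import csv, io

# Round-trip a value containing a comma, a quote and a newline through
# Python's standard csv module; the writer quotes and doubles as needed.
rows = [["name", "quote"],
        ["Doe, Jane", 'She said "hi"\nthen left']]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
# name,quote
# "Doe, Jane","She said ""hi""
# then left"

buf.seek(0)
assert list(csv.reader(buf)) == rows  # parses back to the original values
```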
No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.
Like JSON, YAML or XML, CSV is just plain text that you are free to encode however you like. CSV is not a binary format; it can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can be both read and edited by humans directly, somehow.
CSV can be read row by row very easily without requiring more memory than what is needed to fit a single row. This also means that a trivial program that anyone can write is able to read gigabytes of CSV data with only some kilobytes of RAM.
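A minimal sketch of what that looks like (Python’s standard `csv` module; `huge.csv` is a placeholder file name):

```python
import csv

# Stream a potentially multi-gigabyte CSV while holding only one row in
# memory at a time ("huge.csv" is a placeholder file name).
count = 0
with open("huge.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        count += 1  # do whatever per-row work you need here
print(f"{count} rows, {len(header)} columns")
```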
By comparison, column-oriented data formats such as parquet are not able to stream files row by row without requiring you to jump here and there in the file or to buffer the memory cleverly so you don’t tank read performance.
But of course, CSV is terrible if you are only interested in specific columns, because you will indeed need to read each row in full just to access the parts you care about.
23 Comments
Qem
This is one area where LibreOffice Calc shines in comparison to Excel. Importing CSVs is much more convenient.
Der_Einzige
I'm in on the "shit on microsoft for hard to use formats train" but as someone who did a LOT of .docx parsing – it turned into zen when I realized that I can just convert my docs into the easily parsed .html5 using something like pandoc.
This is a good blog post and Xan is a really neat terminal tool.
tengwar2
I'm not really sure why "Excel hates CSV". I import into Excel all the time. I'm sure the functionality could be expanded, but it seems to work fine. The bit of the process I would like improved is nothing to do with CSV – it's that the exporting programs sometimes rearrange the order of fields, and you have to accommodate that in Excel after the import. But since you can have named columns in Excel (make the data in to a table), it's not a big deal.
nayuki
I greatly prefer TSV over CSV. https://en.wikipedia.org/wiki/Tab-separated_values
mjw_byrne
CSV is ever so elegant but it has one fatal flaw – quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:
1. It's tricky to parallelise processing of CSV.
2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).
So these days, for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and backslash are all backslash-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
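Something along these lines, presumably (a rough sketch of the backslash-escaping idea, not the commenter's actual code):

```python
# Backslash-escape the backslash itself, the delimiter and the newline, so
# every record is exactly one physical line and a corrupted byte can only
# damage the field it sits in (no non-local effects).
def escape_field(value: str) -> str:
    return (value.replace("\\", "\\\\")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))

def write_record(fields: list[str]) -> str:
    return ",".join(escape_field(f) for f in fields)

print(write_record(["a,b", "two\nlines", "back\\slash"]))
# a\,b,two\nlines,back\\slash
```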
polyrand
As someone who likes modern formats like parquet, when in doubt, I end up using CSV or JSONL (newline-delimited JSON). Mainly because they are plain-text (fast to find things with just `grep`) and can be streamed.
Most features listed in the document are also shared by JSONL, which is my favourite format. It compresses really well with gzip or zstd. Compression removes some plain-text advantages, but ripgrep can search compressed files too; otherwise, you can decompress on the fly and pipe the output into the usual tools.
Another advantage of JSONL is that it's easier to chunk into smaller files.
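A small sketch of that streaming-plus-compression combination (`events.jsonl.gz` is a placeholder name):

```python
import gzip, json

# Stream a gzip-compressed JSONL file one record at a time; each line is an
# independent JSON document, so only one record is ever held in memory.
with gzip.open("events.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # ... process `record` here
```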
jszymborski
> 4. CSV is streamable
This is what keeps me coming back.
lxe
TSV > CSV
Way easier to parse
brazzy
Funny how the specification "fits in a tweet" yet manages to miss at least three things: 1) character encoding, 2) BOM or not, 3) header or no header.
circadian
Kudos for writing this, it's always worth flagging up the utility of a format that just is what it is, for the benefit of all. Commas can also create fun ambiguity, as that last sentence demonstrates. :P
CSV is lovely. It isn't trying to be cool or legendary. It works for the reasons the author proposes, but isn't trying to go further.
I work in a world of VERY low power devices and CSV sometimes is all you need for a good time.
If it doesn't need to be complicated, it shouldn't be. There are always times when I think to myself that CSV fits, and that is what makes it a legend. Are those the times when I want to parallelise or deal with gigs of data in one sitting? Nope. There are more complex formats for that. CSV has a place in my heart too.
Thanks for reminding me of the beauty of this legendary format… :)
mccanne
Relevant discussion from a few years back
https://news.ycombinator.com/item?id=28221654
Maro
I hate CSV (but not as much as XML).
Most reasonably large CSV files will have issues parsing on another system.
TrackerFF
Excel hates CSV only if you don't use the "From text / csv" function (under the data tab).
For whatever reason, it flawlessly manages to import most CSV data using that functionality. It is the only way I can reliably import data to excel with datestamps / formats.
Just drag/dropping a CSV file onto a spreadsheet, or "open with excel" sucks.
mitchpatin
CSV still quietly powers the majority of the world’s "data plumbing."
At any medium+ sized company, you’ll find huge amounts of CSVs being passed around, either stitched into ETL pipelines or sent manually between teams/departments.
It’s just so damn adaptable and easy to understand.
primitivesuave
One thing that has changed the game with how I work with CSVs is ClickHouse. It is trivially easy to run a local database, import CSV files into a table, and run blazing-fast queries on it. If you leave the data there, ClickHouse will gradually optimize the compression. It's pretty magical stuff if you work in data science.
inglor_cz
"the controversial ex-post RFC 4180"
I looked at the RFC. What is controversial about it?
slg
>This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
As someone who has in the past had to handle CSVs from a variety of different third-party sources, this is a double-edged sword. The "you might even invent it yourself" simplicity means that lots of different places do end up just inventing their own version rather than standardizing to RFC-4180 or whatever when it comes to "quote values containing commas", values containing quotes, values containing newlines, etc. And the simplicity means these kinds of non-standard implementations can go completely undetected until a problematic value happens to be used. Sometimes added complexity that forces paying more attention to standards, and quickly surfaces a divergence from those standards, is helpful.
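A tiny illustration of how such a divergence stays invisible until the wrong value shows up (my example, not the commenter's):

```python
import csv, io

# A hand-rolled split(",") looks fine until a quoted comma appears, and then
# it silently produces the wrong fields; a real parser handles it.
line = '1,"Smith, Jane",NY'

print(line.split(","))                      # ['1', '"Smith', ' Jane"', 'NY']
print(next(csv.reader(io.StringIO(line))))  # ['1', 'Smith, Jane', 'NY']
```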
owlstuffing
CSV is everywhere. I use manifold-csv[1]; it's amazing.
1. https://github.com/manifold-systems/manifold/tree/master/man…
hajile
The argument against JSON isn't very compelling. Adding a name to every field as they do in their strawman example isn't necessary.
Compare a plain CSV table to the directly-equivalent JSON, i.e. an array of arrays rather than an array of objects.
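Something along these lines, presumably (the concrete rows are mine, purely illustrative):

```python
import csv, io, json

# The same table as CSV versus a directly-equivalent JSON array of arrays,
# with no field names repeated on every row.
rows = [["id", "name", "age"], [1, "Alice", 34], [2, "Bob", 27]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()
json_text = json.dumps(rows, separators=(",", ":"))

print(csv_text)    # id,name,age / 1,Alice,34 / 2,Bob,27
print(json_text)   # [["id","name","age"],[1,"Alice",34],[2,"Bob",27]]
print(len(csv_text), len(json_text))  # the JSON overhead is mostly brackets and quotes
```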
The JSON version is only marginally bigger (just a few brackets), but those brackets represent the ability to be either simple or complex. This matters because you wind up with terrible ad-hoc nesting in CSV ranging from entries using query string syntax to some entirely custom arrangement.
And in these cases, JSON's objects are WAY better.
Because CSV is so simple, it's common for people to skip a parsing/encoding library entirely. Over the years, I've run into this particular kind of issue a bunch.
JSON parsers will not only output the expected values every time, but your language likely uses one of the super-efficient SIMD-based parsers under the surface (probably faster than what you are doing with your custom CSV parser).
Another point is standardization. Does that .csv file use commas, spaces, semicolons, pipes, etc? Does it use CR, LF, or CRLF? Does it allow escaping quotations? Does it allow quotations to escape commas? Is it utf-8, UCS-2, or something different? JSON doesn't have these issues because these are all laid out in the spec.
JSON is typed. Sure, it's not a LOT of types, but 6 types is better than none.
While JSON isn't perfect (I'd love to see an official updated spec with some additional features), it's generally better than CSV in my experience.
boricj
I've recently written a library at work to run visitors on data models bound to data sets. One of these visitors is a CSV serializer that dumps a collection as a CSV document.
I've just checked and strings are escaped using the same mechanism as JSON, with backslashes. I should've double-checked against RFC-4180, but thankfully that mechanism isn't currently triggered anywhere (it's used for log exportation and none of that data hits the escaping code path). I've also checked the code from other teams and it's just handwritten C++ stream statements inside a loop that doesn't even try to escape data. It also happens to be fine for the same reason (log exportation).
I've also written serializers for JSON, BSON and YAML and they actually output spec-compliant documents, because there's only one spec to pay attention to. CSV isn't a specification, it's a bunch of loosely-related formats that look similar at a glance. There's a reason why fleshed-out CSV parsers usually have a ton of knobs to deal with all the dialects out there (and I've almost added my own by accident), that's simply not a thing for properly specified file formats.
meemo
Quick question while we’re on the topic of CSV files: is there a command-line tool you’d recommend for handling CSV files that are malformed, corrupted, or use unexpected encodings?
My experience with CSVs is mostly limited to personal projects, and I generally find the format very convenient. That said, I occasionally (about once a year) run into issues that are tricky to resolve.
williamcotton
Essential CSV shell tools:
csvtk: https://bioinf.shenwei.me/csvtk/
gawk: https://www.gnu.org/software/gawk/manual/html_node/Comma-Sep…
awk: https://github.com/onetrueawk/awk?tab=readme-ov-file#csv