Dear Lazyweb,
I have a CSV file: fields surrounded by double quotes, comma-separated. There are a bit more than 230K rows.
Some of the fields have hard returns in them, which breaks my rows, and my importer (phpMyAdmin) chokes.
What’s the easiest way to remove only those hard returns? I’m thinking something like a fancy regex in vim to remove hard returns from the ends of lines that don’t end in double quotes, but my regex-fu isn’t that awesome.
Anyone know how to do that, or have a better idea?
#!/usr/bin/perl
use strict;
use warnings;

# A line is complete when it is nothing but fully quoted, comma-separated fields.
sub is_legit_csv {
    return $_[0] =~ /^("[^"]*",)*("[^"]*")\s*$/;
}

while (my $line = <>) {
    # Keep pulling in the next physical line until the quotes balance out.
    while (not is_legit_csv($line)) {
        chomp $line;
        my $next = <>;
        last unless defined $next;    # stop at end of file
        $line .= $next;
    }
    print $line;
}
===============
~ ed$ cat foobar.csv
"foo","bar
baz","bang"
"fim","fam","foom"
~ ed$ ./fixer.pl foobar.csv
"foo","barbaz","bang"
"fim","fam","foom"
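For what it's worth, a CSV-aware parser will also cope with quoted fields that contain hard returns, so there's no need to glue the physical lines back together by hand. A minimal sketch, assuming the Text::CSV module is installed (the field flattening and printing to STDOUT are just illustrative choices):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 lets quoted fields contain embedded newlines;
# always_quote keeps the output in the same all-quoted style as the input.
my $csv = Text::CSV->new({ binary => 1, always_quote => 1, eol => "\n" });

open my $in, '<', 'foobar.csv' or die "can't open foobar.csv: $!";
while (my $row = $csv->getline($in)) {
    s/\n//g for @$row;            # strip the hard returns inside each field
    $csv->print(\*STDOUT, $row);
}
close $in;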
Looks like Ed beat me to it, but in vi, you should be able to do this:
:%sm/\([^"]\)\n/\1 /gc
It doesn't do any fancy checking as to whether it's valid CSV or not; it just replaces each newline not preceded by a " with a space.
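For anyone without vim handy, roughly the same substitution can be run from the shell as a Perl one-liner (a sketch only; fixed.csv is just a placeholder output name):

perl -0777 -pe 's/([^"])\n/$1 /g' foobar.csv > fixed.csv

The -0777 switch slurps the whole file into one string so the substitution can see the newlines; like the vim command, it replaces each newline not preceded by a double quote with a space.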
Mine wasn’t quite right, but Topher and I are hashing it out in IMs. 🙂
Can you post whether or not this is solved? I don't know what the heck Vim is or anything 😉 but I can dance with you regex boys to an extent.
Ed’s method may have worked, but was taking a very long time.
Mark’s did the trick with this:
:%sm/\([^"]\)\n/\1 /g
in about 3/4 of a second. Vim rocks.