CSV Parser method

Hey folks. I always cringe about importing CSV files from some outside source. Typically I cry about it for 15 seconds, then open the file in Excel and save it to XML or some other format I trust, just so I don't have to deal with commas, quotes, and carriage returns. Of course, today I had a situation where I couldn't do that (silly clients), so I broke down and used those 15 seconds to whip up a CSV parser. While there is no real "standard" for CSV documents, this parser handles embedded characters (strings with quotes, commas, and returns inside them), rows that end with carriage returns, new lines, or a combination of the two, and it even gives you the choice of stripping out the white space or not. I was going to do this with regexes, but then I remembered how terrible I am with them (also, this ends up being faster than some good regexes other folks suggested). The code is below; if you're interested, I based my CSV rules on the ones found here: Comma-separated values - Wikipedia

/********************************************************************************
	@method: parseCSV(sData, bTrim)
	
	@arg sData : [string] The csv data you want to parse
	@opt bTrim : [boolean] optional argument used if you want to trim the white spaces off the fields
	
	@return aParsedData : [array] multi-dimensional array of rows and columns

	@description : Takes a string and applies csv rules to it to form an array of rows and columns. 
				   As there aren't hard standards for csv I applied the rules from the wiki http://en.wikipedia.org/wiki/Comma-separated_values
	
	@note:
	
	@history:	01/07/2010	MSF	Created parseCSV method
********************************************************************************/
var sData;
var bTrim = false;

var aParsedData;
var aRow;
var aSearch;
var bInQuotes;
var bSpecial;
var sColumn;


// If no string to parse was passed there is no reason to run this method so return
if (!arguments[0])
{
	return;
}
sData = arguments[0];

// If column data should be trimmed set the marker
if (arguments[1])
{
	bTrim = true;
}

// Set the default character search array made up of the special characters that csv has to deal with
aSearch = ['"', ",", "\n", "\r"];

// Initialize the return array, the first column, and the first row array
aParsedData = new Array();
sColumn = "";
aRow = new Array();

// Loop through the characters in the string
for (var i = 0; i < sData.length; i++)
{
	// Reset the special character marker to false
	bSpecial = false;
	
	// Loop through the current special character search array
	for (var s = 0; s < aSearch.length; s++)
	{
		// And if we hit a special character, note that
		if ( sData[i] == aSearch[s])
		{
			bSpecial = true;
			break;
		}
	}
	
	// If this is a special character that we are currently looking for, process it separately
	if ( bSpecial )
	{
		// Switch on the current character (we can only get to commas, carriage returns, and line feeds if we aren't in double-quotes, which is taken care of by the aSearch array)
		switch (sData[i])
		{
			// If the character is a comma then we know this is the end of a column of data so push it to the row and reinitialize the column string
			case ",":
				if (bTrim)
				{
					sColumn = utils.stringTrim(sColumn);
				}
				aRow.push(sColumn);
				sColumn = "";
				break;
			
			// If this is a carriage return then it is the end of a row so push the column to the row and then push the row to the ParsedData array
			case "\r":
				if (bTrim)
				{
					sColumn = utils.stringTrim(sColumn);
				}
				aRow.push(sColumn);
				aParsedData.push(aRow);
				aRow = new Array();
				sColumn = "";
				
				// If the next character happens to be a new line then skip it because this is a csv whose rows end with \r\n
				if ( sData[i+1] && "\n" == sData[i+1] )
				{
					i++;
				}
				break;
		

			// If this is a new line then it is the end of a row so push the column to the row and then push the row to the ParsedData array
			case "\n":
				if (bTrim)
				{
					sColumn = utils.stringTrim(sColumn);
				}
				aRow.push(sColumn);
				aParsedData.push(aRow);
				aRow = new Array();
				sColumn = "";
				break;		
			
			// Quotes are highly special, read on
			case '"':
				// Check to see if we are already inside a set of double-quotes
				if ( bInQuotes )
				{
					// If we are inside double-quotes, are currently on a quote, AND the next character is a quote it means we have found an escaped quote that we want to allow
					// So add a single quote to the column and then skip the next character by incrementing i
					if (sData[i+1] && '"' == sData[i+1])
					{
						sColumn += '"';
						i++;
					}
				// If we are inside quotes but the next character is not another double quote, it means we have reached the end of the quoted data, so change the search array back to our normal set and move on
					else
					{
						aSearch = ['"', ",", "\n", "\r"];
						bInQuotes = false;
					}
				}
			// If we are not inside a set of double-quotes, this is our first set, which means the next special character we care about is our closing double-quote
				else
				{
					bInQuotes = true;
					aSearch = ['"'];
				}
				break;
		}
	}
	// If this is NOT a special character just add it to the current column's data
	else
	{
		sColumn += sData[i];
	}
}

// Finalize the data by adding the last column and row to our return data
if (bTrim)
{
	sColumn = utils.stringTrim(sColumn);
}
aRow.push(sColumn);
aParsedData.push(aRow);

// Enjoy your data
return aParsedData;
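For anyone who wants to sanity-check these rules outside Servoy, here is a compact, self-contained restatement of the same parsing logic (a sketch, not the method above verbatim; a regex trim stands in for `utils.stringTrim`):

```javascript
// Compact restatement of the CSV rules used above: quoted fields,
// "" as an escaped quote, and \r\n, \n, or \r as row endings.
function parseCSV(sData, bTrim) {
    var rows = [], row = [], col = "", inQuotes = false;
    function pushCol() { row.push(bTrim ? col.replace(/^\s+|\s+$/g, "") : col); col = ""; }
    for (var i = 0; i < sData.length; i++) {
        var c = sData[i];
        if (inQuotes) {
            if (c === '"') {
                if (sData[i + 1] === '"') { col += '"'; i++; } // escaped quote
                else { inQuotes = false; }                     // closing quote
            } else { col += c; }
        } else if (c === '"') {
            inQuotes = true;                                   // opening quote
        } else if (c === ',') {
            pushCol();                                         // end of column
        } else if (c === '\r' || c === '\n') {
            pushCol(); rows.push(row); row = [];               // end of row
            if (c === '\r' && sData[i + 1] === '\n') i++;      // swallow \r\n pair
        } else {
            col += c;
        }
    }
    pushCol(); rows.push(row);                                 // final column and row
    return rows;
}

// A quoted field with an embedded comma and an escaped quote:
var r = parseCSV('a,"b, ""x""",c\r\nd,e', true);
// r is [['a', 'b, "x"', 'c'], ['d', 'e']]
```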

Hi Matt,

Thanks a lot ! Much appreciated !

For my own use I added a third optional parameter, columnDelimiter, because we work a lot with ';' (Excel CSV in Europe).

Regards,

Matt,

You da’ man! That’s a really nice, compact piece of code! Thanks very much for sharing it! :D

Glad to hear some folks can use it. I might add a couple more options soon but we shall see. Have a good week all.

Thanks Matt!
Really helpful! :-)

Just as I was about to do a very similar task (for the nth time) this comes along “just in time”. :D
Thanks - it even gets rid of the double quotes that I was going to have to get rid of as well.

Hello everyone,

I’ve been playing around with Matt’s method, and I’ve adapted it to specify a separator character (as per my module). Here is the modified code, which will be included in my module:

function parseCSV()
{
	/*
   @method: parseCSV(sData, bTrim, sSeparator)

   @arg sData : [string] The csv data you want to parse
   @opt bTrim : [boolean] optional argument used if you want to trim the white spaces off the fields
   @opt sSeparator : [string] optional separator character used to delimit columns

   @return aParsedData : [array] multi-dimensional array of rows and columns

   @description : Takes a string and applies csv rules to it to form an array of rows and columns.
               As there aren't hard standards for csv I applied the rules from the wiki http://en.wikipedia.org/wiki/Comma-separated_values

   @note:	Original by Matt Frizzell of adBlocks, modified by Ben Savignac of LOGIsoft

   @history:   01/07/2010   MSF   Created parseCSV method
               01/21/2010   BRS   Modified parseCSV method
	 ********************************************************************************/
   var sData;
   var bTrim = false;

   var aParsedData;
   var aRow;
   var aSearchTemplate;
   var aSearch;
   var bInQuotes;
   var bSpecial;
   var sColumn;
   var sSeparator;


   // If no string to parse was passed there is no reason to run this method so return
   if (!arguments[0])
   {
	   return;
   }
   sData = arguments[0];

   // If column data should be trimmed set the marker
   if (arguments[1])
   {
	   bTrim = true;
   }

   // If a separator has been defined, set it; otherwise default to a comma
   if (arguments[2])
   {
	   sSeparator = arguments[2];
   } else {
	   sSeparator = ",";
   }
   aSearchTemplate = ['"', "\n", "\r", sSeparator];
   
   // Set the default character search array made up of the special characters that csv has to deal with
   aSearch = aSearchTemplate.slice(0);

   // Initialize the return array, the first column, and the first row array
   aParsedData = new Array();
   sColumn = "";
   aRow = new Array();

   // Loop through the characters in the string
   for (var i = 0; i < sData.length; i++)
   {
	   // Reset the special character marker to false
	   bSpecial = false;

	   // Loop through the current special character search array
	   for (var s = 0; s < aSearch.length; s++)
	   {
		   // And if we hit a special character, note that
		   if ( sData[i] == aSearch[s])
		   {
			   bSpecial = true;
			   break;
		   }
	   }

	   // If this is a special character that we are currently looking for, process it separately
	   if ( bSpecial )
	   {
		   // Switch on the current character (we can only get to separator characters, carriage returns, and line feeds if we aren't in double-quotes, which is taken care of by the aSearch array)
		   switch (sData[i])
		   {
		   // If the character is the separator OUTSIDE double-quotes then we know this is the end of a column of data, so push it to the row and reinitialize the column string
		   case sSeparator:
			   if ( !bInQuotes )
			   {
				   if (bTrim)
				   {
					   sColumn = utils.stringTrim(sColumn);
				   }
				   aRow.push(sColumn);
				   sColumn = "";
			   }
			   // If we are inside a set of double-quotes, this is not really a separator, so just add the character to the column data
			   else
			   {
				   sColumn += sData[i];
			   }
			   break;

		   // If this is a carriage return then it is the end of a row so push the column to the row and then push the row to the ParsedData array
		   case "\r":
		   if (bTrim)
		   {
			   sColumn = utils.stringTrim(sColumn);
		   }
		   aRow.push(sColumn);
		   aParsedData.push(aRow);
		   aRow = new Array();
		   sColumn = "";

		   // If the next character happens to be a new line then skip it because this is a csv whose rows end with \r\n
		   if ( sData[i+1] && "\n" == sData[i+1] )
		   {
			   i++;
		   }
		   break;


		   // If this is a new line then it is the end of a row so push the column to the row and then push the row to the ParsedData array
		   case "\n":
		   if (bTrim)
		   {
			   sColumn = utils.stringTrim(sColumn);
		   }
		   aRow.push(sColumn);
		   aParsedData.push(aRow);
		   aRow = new Array();
		   sColumn = "";
		   break;      

		   // Quotes are highly special, read on
		   case '"':
		   // Check to see if we are already inside a set of double-quotes
		   if ( bInQuotes )
		   {
			   // If we are inside double-quotes, are currently on a quote, AND the next character is a quote it means we have found an escaped quote that we want to allow
			   // So add a single quote to the column and then skip the next character by incrementing i
			   if (sData[i+1] && '"' == sData[i+1])
			   {
				   sColumn += '"';
				   i++;
			   }
			   // If we are inside quotes but the next character is not another double quote, it means we have reached the end of the quoted data, so change the search array back to our normal set and move on
			   else
			   {
				   aSearch = aSearchTemplate.slice(0);
				   bInQuotes = false;
			   }
		   }
		   // If we are not inside a set of double-quotes, this is our first set and means that the next special characters that we care about are closing double-quotes and the separator character
		   else
		   {
			   bInQuotes = true;
			   aSearch = ['"', sSeparator];
		   }
		   break;
		   }
	   }
	   // If this is NOT a special character just add it to the current column's data
	   else
	   {
		   sColumn += sData[i];
	   }
   }

   // Finalize the data by adding the last column and row to our return data
   if (bTrim)
   {
	   sColumn = utils.stringTrim(sColumn);
   }
   aRow.push(sColumn);
   aParsedData.push(aRow);

   // Enjoy your data
   return aParsedData;
}

This is run using the following method:

function btn_ParseCSV()
{
	var target = plugins.file.showFileOpenDialog(1);  // Get the file to import from
	if ( target.isFile() )  // Check if we selected a file or a folder
	{
		var targetFullPath = target.getAbsolutePath();  // Obtain the full file path
		var fileContents = application.readTXTFile(targetFullPath);
		var sSeparator = ";";  // Change to the separator character used in the definition
		var result = globals.parseCSV(fileContents, false, sSeparator);
	}
}

Enjoy!

Ben

Excellent stuff Ben - really looking forward to seeing this in your module.

Best of both worlds!


Hi Ben,

What are your performance/memory experiences with large CSV files? Is it then necessary to read the file line by line and process it line by line (if that is even possible in Servoy)?

Lambert.

mattfrizzell:
Hey folks. I always cringe about importing csv files from some outside source… [snip: full post and parser code quoted above]

Hi!

Thanks for sharing.

I am experiencing a slowdown when parsing large csv files. Could you share how you manage this?

Thanks and Regards.

rogel:
I am experiencing a slowdown when parsing large csv files. Would you like to share how to manage this?

Probably because the data is built up in memory.
Try writing each line directly to a file, see: http://forum.servoy.com/viewtopic.php?f=22&t=13866&p=72648&hilit=java.io.filewriter#p72648
This improved the speed of my exports very much!

Hope this helps.

mboegem:
Try writing each line directly to a file, see: http://forum.servoy.com/viewtopic.php?f=22&t=13866&p=72648&hilit=java.io.filewriter#p72648

Thanks Marc!

I did use bufferedreader to read the file.

Marc talked about fileWrite() and not fileRead()…

What do you do with the import data? Collect it in memory? If so, there will be a slowdown with large CSV files.
If you write the import data to a Servoy foundset then it's up to Servoy to do it the speedy way (balancing speed and memory).

We are doing some testing at the moment because we see that the foundset filling speed (rows added per second) drops at a certain (memory?) point. We save the data to the database during the import at a regular interval of 500 records. (Yes, autosave is off.)

Unfortunately I cannot look into the black box called JSFoundSet, so I don't know what happens under the hood (foundset memory management)…

Regards,

lwjwillemsen:
Marc talked about fileWrite() and not fileRead()…

Sorry about that. But I will use the file write in my backup database utility. What I do with the import data (CSV) is create an XML file. I have autosave(false) as well. I create records in the foundset one at a time and save each one right away before going to the next record in the XML.

Hope this helps.

You could run a test without filling the foundset and see what happens to the time consumption…

Regards,

rogel:
We are doing some testing this moment because we see that the foundset filling speed (rows added per second) drops at a certain (memory ?) moment. We save the data during the import to the database at a regular interval of 500 records. (Yes, autosave is off)

Are you using a transaction? Are you clearing the foundset after your 500 records? I think both would speed up things and prevent memory issues…

Hi!

I am importing a large amount of data using the smart client, and it slows down a lot, averaging 2 records per second in developer. :(

Here is my logic:
read the csv file line by line using the parser
for each line

  1. create the foundset records and related foundsets (it's 5 tables deep)
  2. databaseManager.saveData()

Are there any suggestions on how to speed things up?

I am thinking of using a headless client. Would it perform any differently than testing in developer?

Thanks!

Rogel,

I would not call saveData for each record.
Try saving every 100th record or so.

Rob
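Rob's suggestion, sketched with a mocked-up databaseManager so it runs outside Servoy (the mock and importLines are stand-ins; in Servoy you would call the real databaseManager.saveData()):

```javascript
// Batch the saves: one saveData() per 100 records instead of one per record.
var saveCount = 0;
var databaseManager = { saveData: function () { saveCount++; } }; // mock

function importLines(lines, batchSize) {
    for (var i = 0; i < lines.length; i++) {
        // ... create the foundset record for lines[i] here ...
        if ((i + 1) % batchSize === 0) {
            databaseManager.saveData(); // flush a full batch
        }
    }
    databaseManager.saveData(); // flush whatever is left
}

importLines(new Array(250), 100);
// saveCount is now 3: two full batches plus the final flush
```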

patrick:
Are you using a transaction? Are you clearing the foundset after your 500 records? I think both would speed up things and prevent memory issues…

What do you mean by “clearing the foundset”?

If you add many records to a foundset, also do a foundset.clear() once in a while;
that will keep the speed up and the memory low.
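And the periodic clear, sketched the same way with a mocked foundset (newRecord and clear here are stand-ins for the Servoy foundset API):

```javascript
// Periodically clear the foundset so loaded rows don't pile up in memory.
var held = 0, cleared = 0;
var foundset = {                       // mock foundset
    newRecord: function () { held++; },
    clear: function () { cleared++; held = 0; }
};

for (var i = 0; i < 1200; i++) {
    foundset.newRecord();
    if ((i + 1) % 500 === 0) {
        foundset.clear();              // rows are saved already; drop them
    }
}
// held is now 200 and cleared is 2: memory stayed capped at 500 rows
```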