Page 1 of 2

CSV Parser method

PostPosted: Fri Jan 08, 2010 2:54 am
by mattfrizzell
Hey folks. I always cringe about importing csv files from some outside source and typically I cry about it for 15 seconds and then open it up in excel and save it to xml or some other format that I trust just so I don't have to deal with commas, quotes, and carriage returns. Of course today I had a situation where I couldn't just do that (silly clients), so I broke down and used those 15 seconds to whip up a csv parser. While there is no real "standard" for csv documents this parser will handle embedded characters such as strings with quotes, commas, and returns inside of them, rows that end with carriage returns, new lines, or a combination of the two, and it even gives you the choice of stripping out the white spaces or not. I was going to do this with regexes but then I remembered how terrible I am with them (also, this ends up being faster than some good regexes some other folks suggested). Code is below and if interested, I based my csv standards on the rules found here http://en.wikipedia.org/wiki/Comma-separated_values


Code: Select all
/********************************************************************************
   @method: parseCSV(sData, bTrim)
   
   @arg sData : [string] The csv data you want to parse
   @opt bTrim : [boolean] optional argument used if you want to trim the white spaces off the fields
   
   @return aParsedData : [array] multi-dimensional array of rows and columns

   @description : Takes a string and applies csv rules to it to form an array of rows and columns.
               As there aren't hard standards for csv I applied the rules from the wiki http://en.wikipedia.org/wiki/Comma-separated_values
   
   @note:
   
   @history:   01/07/2010   MSF   Created parseCSV method
********************************************************************************/
var sData;
var bTrim = false;

var aParsedData;
var aRow;
var aSearch;
var bInQuotes;
var bSpecial;
var sColumn;


// If no string to parse was passed there is no reason to run this method so return
if (!arguments[0])
{
   return;
}
sData = arguments[0];

// If column data should be trimmed set the marker
if (arguments[1])
{
   bTrim = true;
}

// Set the default character search array made up of the special characters that csv has to deal with
aSearch = ['"', ",", "\n", "\r"];

// Initialize the return array, the first column, and the first row array
aParsedData = new Array();
sColumn = "";
aRow = new Array();

// Loop through the characters in the string
for (var i = 0; i < sData.length; i++)
{
   // Reset the special character marker to false
   bSpecial = false;
   
   // Loop through the current special character search array
   for (var s = 0; s < aSearch.length; s++)
   {
      // And if we hit a speical character note that
      if ( sData[i] == aSearch[s])
      {
         bSpecial = true;
         break;
      }
   }
   
   // If this is a special character that we are currently looking for it process it separately
   if ( bSpecial )
   {
      // Switch on the current character (we can only get to commas, carriage returns, and line feeds if we aren't in double-quotes which is take care of by the aSearch array)
      switch (sData[i])
      {
         // If the character is a comma then we know this is the end of a column of data so push it to the row and reinitialize the column string
         case ",":
            if (bTrim)
            {
               sColumn = utils.stringTrim(sColumn);
            }
            aRow.push(sColumn);
            sColumn = "";
            break;
         
         // If this is a carriage return then it is the end of a row so push the column to the row and then push the row to the ParsedData array
         case "\r":
            if (bTrim)
            {
               sColumn = utils.stringTrim(sColumn);
            }
            aRow.push(sColumn);
            aParsedData.push(aRow);
            aRow = new Array();
            sColumn = "";
            
            // If the next character happens to be a new line then skip it because this is a csv whose rows end with \r\n
            if ( sData[i+1] && "\n" == sData[i+1] )
            {
               i++;
            }
            break;
      

         // If this is a new line then it is the end of a row so push the column to the row and then push the row to the ParsedData array
         case "\n":
            if (bTrim)
            {
               sColumn = utils.stringTrim(sColumn);
            }
            aRow.push(sColumn);
            aParsedData.push(aRow);
            aRow = new Array();
            sColumn = "";
            break;      
         
         // Quotes are highly special, read on
         case '"':
            // Check to see if we are already inside a set of double-quotes
            if ( bInQuotes )
            {
               // If we are inside double-quotes, are currently on a quote, AND the next character is a quote it means we have found an escaped quote that we want to allow
               // So add a single quote to the column and then skip the next character by incrementing i
               if (sData[i+1] && '"' == sData[i+1])
               {
                  sColumn += '"';
                  i++;
               }
               // If we are inside quotes but the next character is not another set of double quotes it means we have reached the quoted data so change the Search array back to our normal set and move on
               else
               {
                  aSearch = ['"', ",", "\n", "\r"];
                  bInQuotes = false;
               }
            }
            // If we are not inside a set of double-quotes, this is out first set and means that the next special character that we care about is out closing double-quotes
            else
            {
               bInQuotes = true;
               aSearch = ['"'];
            }
            break;
      }
   }
   // If this is NOT a special character just add it to the current column's data
   else
   {
      sColumn += sData[i];
   }
}

// Finalize the data by adding the last column and row to our return data
if (bTrim)
{
   sColumn = utils.stringTrim(sColumn);
}
aRow.push(sColumn);
aParsedData.push(aRow);

// Enjoy your data
return aParsedData;

Re: CSV Parser method

PostPosted: Fri Jan 08, 2010 10:51 am
by lwjwillemsen
Hi Matt,

Thanks a lot ! Much appreciated !

For my own use I added a third optional parameter columnDelimiter because we work a lot with ; (Excel CSV, Europe)

Regards,

Re: CSV Parser method

PostPosted: Mon Jan 11, 2010 5:51 pm
by bobcusick
Matt,

You da' man! That's a really nice, compact piece of code! Thanks very much for sharing it! :D

Re: CSV Parser method

PostPosted: Mon Jan 11, 2010 6:27 pm
by mattfrizzell
Glad to hear some folks can use it. I might add a couple more options soon but we shall see. Have a good week all.

Re: CSV Parser method

PostPosted: Mon Jan 11, 2010 6:54 pm
by Harjo
Thanks Matt!
really helpfull! :-)

Re: CSV Parser method

PostPosted: Wed Jan 13, 2010 2:00 pm
by Thomas Parry
Just as I was about to do a very similar task (for the nth time) this comes along "just in time". :D
Thanks - it even gets rid of the double quotes that I was going to have to get rid of as well.

Re: CSV Parser method

PostPosted: Thu Jan 21, 2010 5:48 pm
by LOGIsoft
Hello everyone,

I've been playing around with Matt's method, and I've adapted it to specify a separator character (as per my module). Here is the modified code, which will be included in my module:

Code: Select all
function parseCSV()
{
   /*
   @method: parseCSV(sData, bTrim, sSeparator)

   @arg sData : [string] The csv data you want to parse
   @opt bTrim : [boolean] optional argument used if you want to trim the white spaces off the fields
   @arg sSeparator : [string] The separator character for columns delimitation

   @return aParsedData : [array] multi-dimensional array of rows and columns

   @description : Takes a string and applies csv rules to it to form an array of rows and columns.
               As there aren't hard standards for csv I applied the rules from the wiki http://en.wikipedia.org/wiki/Comma-separated_values

   @note:   Original by Matt Frizzell of adBlocks, modified by Ben Savignac of LOGIsoft

   @history:   01/07/2010   MSF   Created parseCSV method
               01/21/2010   BRS   Modified parseCSV method
    ********************************************************************************/
   var sData;
   var bTrim = false;

   var aParsedData;
   var aRow;
   var aSearchTemplate;
   var aSearch;
   var bInQuotes;
   var bSpecial;
   var sColumn;
   var sSeparator;


   // If no string to parse was passed there is no reason to run this method so return
   if (!arguments[0])
   {
      return;
   }
   sData = arguments[0];

   // If column data should be trimmed set the marker
   if (arguments[1])
   {
      bTrim = true;
   }

   // If a separator has been defined, set it
   if (arguments[2])
   {
      sSeparator = arguments[2];
      aSearchTemplate = ['"', "\n", "\r", sSeparator];
   } else {
      aSearchTemplate = ['"', "\n", "\r"];
   }
   
   // Set the default character search array made up of the special characters that csv has to deal with
   aSearch = aSearchTemplate.slice(0);

   // Initialize the return array, the first column, and the first row array
   aParsedData = new Array();
   sColumn = "";
   aRow = new Array();

   // Loop through the characters in the string
   for (var i = 0; i < sData.length; i++)
   {
      // Reset the special character marker to false
      bSpecial = false;

      // Loop through the current special character search array
      for (var s = 0; s < aSearch.length; s++)
      {
         // And if we hit a speical character note that
         if ( sData[i] == aSearch[s])
         {
            bSpecial = true;
            break;
         }
      }

      // If this is a special character that we are currently looking for it process it separately
      if ( bSpecial )
      {
         // Switch on the current character (we can only get to separator characters, carriage returns, and line feeds if we aren't in double-quotes which is take care of by the aSearch array)
         switch (sData[i])
         {
         // If the character is the separator OUTSIDE DOUBLE QUOTES then we know this is the end of a column of data, so push it to the row and reinitialize the column string
         case sSeparator:
         if ( !bInQuotes )
         {
            // If we are not inside double-quotes, are currently on the separator, AND the next character is a quote,
            // then we know this is the end of a column of data so push it to the row and reinitialize the column string
            if (sData[i+1] && '"' == sData[i+1])
            {
               if (bTrim)
               {
                  sColumn = utils.stringTrim(sColumn);
               }
               aRow.push(sColumn);
               sColumn = "";
               aSearch = aSearchTemplate.slice(0);
            }
            // If we are not inside quotes but the next character is not another set of double quotes it means this is not really a separator and we just add the character to the data
            else
            {
               sColumn += sData[i];
               aSearch = aSearchTemplate.slice(0);
            }
         }
         // If we are inside a set of double-quotes, this is not really a separator and we just add the character to the data
         else
         {
            sColumn += sData[i];
            aSearch = aSearchTemplate.slice(0);
         }
         break;

         // If this is a carriage return then it is the end of a row so push the column to the row and then push the row to the ParsedData array
         case "\r":
         if (bTrim)
         {
            sColumn = utils.stringTrim(sColumn);
         }
         aRow.push(sColumn);
         aParsedData.push(aRow);
         aRow = new Array();
         sColumn = "";

         // If the next character happens to be a new line then skip it because this is a csv whose rows end with \r\n
         if ( sData[i+1] && "\n" == sData[i+1] )
         {
            i++;
         }
         break;


         // If this is a new line then it is the end of a row so push the column to the row and then push the row to the ParsedData array
         case "\n":
         if (bTrim)
         {
            sColumn = utils.stringTrim(sColumn);
         }
         aRow.push(sColumn);
         aParsedData.push(aRow);
         aRow = new Array();
         sColumn = "";
         break;     

         // Quotes are highly special, read on
         case '"':
         // Check to see if we are already inside a set of double-quotes
         if ( bInQuotes )
         {
            // If we are inside double-quotes, are currently on a quote, AND the next character is a quote it means we have found an escaped quote that we want to allow
            // So add a single quote to the column and then skip the next character by incrementing i
            if (sData[i+1] && '"' == sData[i+1])
            {
               sColumn += '"';
               i++;
            }
            // If we are inside quotes but the next character is not another set of double quotes it means we have reached the quoted data so change the Search array back to our normal set and move on
            else
            {
               aSearch = aSearchTemplate.slice(0);
               bInQuotes = false;
            }
         }
         // If we are not inside a set of double-quotes, this is our first set and means that the next special characters that we care about are closing double-quotes and the separator character
         else
         {
            bInQuotes = true;
            aSearch = ['"', sSeparator];
         }
         break;
         }
      }
      // If this is NOT a special character just add it to the current column's data
      else
      {
         sColumn += sData[i];
      }
   }

   // Finalize the data by adding the last column and row to our return data
   if (bTrim)
   {
      sColumn = utils.stringTrim(sColumn);
   }
   aRow.push(sColumn);
   aParsedData.push(aRow);

   // Enjoy your data
   return aParsedData;
}


This is run using the following method:

Code: Select all
function btn_ParseCSV()
{
   var target = plugins.file.showFileOpenDialog(1);  // Get the file to import from
   if ( target.isFile() )  // Check if we selected a file or a folder
   {
      var targetFullPath = target.getAbsolutePath();  // Obtain the full file path
      var fileContents = application.readTXTFile(targetFullPath);
      var sSeparator = ";";  // Change to the separator character used in the definition
      var result = globals.parseCSV(fileContents, false, sSeparator);
   }
}


Enjoy!

Ben

Re: CSV Parser method

PostPosted: Thu Jan 21, 2010 6:03 pm
by Kahuna
Excellent stuff Ben - really looking forward to seeing this in your module.

Best of both worlds!

Re: CSV Parser method

PostPosted: Sat Jan 23, 2010 3:18 pm
by lwjwillemsen
This is run using the following method:

Code: Select all
function btn_ParseCSV()
{
   var target = plugins.file.showFileOpenDialog(1);  // Get the file to import from
   if ( target.isFile() )  // Check if we selected a file or a folder
   {
      var targetFullPath = target.getAbsolutePath();  // Obtain the full file path
      var fileContents = application.readTXTFile(targetFullPath);
      var sSeparator = ";";  // Change to the separator character used in the definition
      var result = globals.parseCSV(fileContents, false, sSeparator);
   }
}


Enjoy!

Ben


Hi Ben,

What are your performance/memory experiences with large csv-files ? Is it than necessary to read the file line by line and process it line by line (if possible in Servoy ?) ?

Lambert.

Re: CSV Parser method

PostPosted: Tue Sep 20, 2011 8:41 am
by rogel
mattfrizzell wrote:Hey folks. I always cringe about importing csv files from some outside source and typically I cry about it for 15 seconds and then open it up in excel and save it to xml or some other format that I trust just so I don't have to deal with commas, quotes, and carriage returns. Of course today I had a situation where I couldn't just do that (silly clients), so I broke down and used those 15 seconds to whip up a csv parser. While there is no real "standard" for csv documents this parser will handle embedded characters such as strings with quotes, commas, and returns inside of them, rows that end with carriage returns, new lines, or a combination of the two, and it even gives you the choice of stripping out the white spaces or not. I was going to do this with regexes but then I remembered how terrible I am with them (also, this ends up being faster than some good regexes some other folks suggested). Code is below and if interested, I based my csv standards on the rules found here http://en.wikipedia.org/wiki/Comma-separated_values


Code: Select all
/********************************************************************************
   @method: parseCSV(sData, bTrim)
   
   @arg sData : [string] The csv data you want to parse
   @opt bTrim : [boolean] optional argument used if you want to trim the white spaces off the fields
   
   @return aParsedData : [array] multi-dimensional array of rows and columns

   @description : Takes a string and applies csv rules to it to form an array of rows and columns.
               As there aren't hard standards for csv I applied the rules from the wiki http://en.wikipedia.org/wiki/Comma-separated_values
   
   @note:
   
   @history:   01/07/2010   MSF   Created parseCSV method
********************************************************************************/
var sData;
var bTrim = false;

var aParsedData;
var aRow;
var aSearch;
var bInQuotes;
var bSpecial;
var sColumn;


// If no string to parse was passed there is no reason to run this method so return
if (!arguments[0])
{
   return;
}
sData = arguments[0];

// If column data should be trimmed set the marker
if (arguments[1])
{
   bTrim = true;
}

// Set the default character search array made up of the special characters that csv has to deal with
aSearch = ['"', ",", "\n", "\r"];

// Initialize the return array, the first column, and the first row array
aParsedData = new Array();
sColumn = "";
aRow = new Array();

// Loop through the characters in the string
for (var i = 0; i < sData.length; i++)
{
   // Reset the special character marker to false
   bSpecial = false;
   
   // Loop through the current special character search array
   for (var s = 0; s < aSearch.length; s++)
   {
      // And if we hit a speical character note that
      if ( sData[i] == aSearch[s])
      {
         bSpecial = true;
         break;
      }
   }
   
   // If this is a special character that we are currently looking for it process it separately
   if ( bSpecial )
   {
      // Switch on the current character (we can only get to commas, carriage returns, and line feeds if we aren't in double-quotes which is take care of by the aSearch array)
      switch (sData[i])
      {
         // If the character is a comma then we know this is the end of a column of data so push it to the row and reinitialize the column string
         case ",":
            if (bTrim)
            {
               sColumn = utils.stringTrim(sColumn);
            }
            aRow.push(sColumn);
            sColumn = "";
            break;
         
         // If this is a carriage return then it is the end of a row so push the column to the row and then push the row to the ParsedData array
         case "\r":
            if (bTrim)
            {
               sColumn = utils.stringTrim(sColumn);
            }
            aRow.push(sColumn);
            aParsedData.push(aRow);
            aRow = new Array();
            sColumn = "";
            
            // If the next character happens to be a new line then skip it because this is a csv whose rows end with \r\n
            if ( sData[i+1] && "\n" == sData[i+1] )
            {
               i++;
            }
            break;
      

         // If this is a new line then it is the end of a row so push the column to the row and then push the row to the ParsedData array
         case "\n":
            if (bTrim)
            {
               sColumn = utils.stringTrim(sColumn);
            }
            aRow.push(sColumn);
            aParsedData.push(aRow);
            aRow = new Array();
            sColumn = "";
            break;      
         
         // Quotes are highly special, read on
         case '"':
            // Check to see if we are already inside a set of double-quotes
            if ( bInQuotes )
            {
               // If we are inside double-quotes, are currently on a quote, AND the next character is a quote it means we have found an escaped quote that we want to allow
               // So add a single quote to the column and then skip the next character by incrementing i
               if (sData[i+1] && '"' == sData[i+1])
               {
                  sColumn += '"';
                  i++;
               }
               // If we are inside quotes but the next character is not another set of double quotes it means we have reached the quoted data so change the Search array back to our normal set and move on
               else
               {
                  aSearch = ['"', ",", "\n", "\r"];
                  bInQuotes = false;
               }
            }
            // If we are not inside a set of double-quotes, this is out first set and means that the next special character that we care about is out closing double-quotes
            else
            {
               bInQuotes = true;
               aSearch = ['"'];
            }
            break;
      }
   }
   // If this is NOT a special character just add it to the current column's data
   else
   {
      sColumn += sData[i];
   }
}

// Finalize the data by adding the last column and row to our return data
if (bTrim)
{
   sColumn = utils.stringTrim(sColumn);
}
aRow.push(sColumn);
aParsedData.push(aRow);

// Enjoy your data
return aParsedData;


Hi!

Thanks for sharing.

I am experiencing a slowdown when parsing large csv files. Would you like to share how to manage this?

Thanks and Regards.

Re: CSV Parser method

PostPosted: Tue Sep 20, 2011 9:11 am
by mboegem
rogel wrote:I am experiencing a slowdown when parsing large csv files. Would you like to share how to manage this?


Probably because the data is build in memory.
Try writing each line directly to a file, see: http://forum.servoy.com/viewtopic.php?f=22&t=13866&p=72648&hilit=java.io.filewriter#p72648
This improved the speed of my exports very much!

Hope this helps.

Re: CSV Parser method

PostPosted: Wed Sep 21, 2011 5:14 pm
by rogel
mboegem wrote:
rogel wrote:I am experiencing a slowdown when parsing large csv files. Would you like to share how to manage this?


Probably because the data is build in memory.
Try writing each line directly to a file, see: http://forum.servoy.com/viewtopic.php?f=22&t=13866&p=72648&hilit=java.io.filewriter#p72648
This improved the speed of my exports very much!

Hope this helps.

Thanks Marc!

I did use bufferedreader to read the file.

Re: CSV Parser method

PostPosted: Wed Sep 21, 2011 6:20 pm
by lwjwillemsen
Marc talked about fileWrite() and not fileRead()...

What do you do with the import data ? Collect in memory ? If so then there will be a slow down with large csv files.
If you write the import data to a Servoy foundset then it's up to Servoy to do it the speedy way (balance speed / memory).

We are doing some testing this moment because we see that the foundset filling speed (rows added per second) drops at a certain (memory ?) moment. We save the data during the import to the database at a regular interval of 500 records. (Yes, autosave is off)

Unfortunately I cannot look into the black box called JSFoundSet so I don't know what happens under the hood (foundset memory management)...

Regards,

Re: CSV Parser method

PostPosted: Thu Sep 22, 2011 8:31 am
by rogel
lwjwillemsen wrote:Marc talked about fileWrite() and not fileRead()...

What do you do with the import data ? Collect in memory ? If so then there will be a slow down with large csv files.
If you write the import data to a Servoy foundset then it's up to Servoy to do it the speedy way (balance speed / memory).

We are doing some testing this moment because we see that the foundset filling speed (rows added per second) drops at a certain (memory ?) moment. We save the data during the import to the database at a regular interval of 500 records. (Yes, autosave is off)

Unfortunately I cannot look into the black box called JSFoundSet so I don't know what happens under the hood (foundset memory management)...

Regards,


Sorry about that. But I will use the filewrite in my backup database utility. what i do with the import data(csv) is i create an xml file. i have autosave(false) as well. i do create record in the foundset one at a time and save it right away before going to the next record in the xml.

hope this helps.

Re: CSV Parser method

PostPosted: Thu Sep 22, 2011 1:12 pm
by lwjwillemsen
You could run a test without filling the foundset and look what happens with regard to the time consumption...

Regards,