Unusual size of saved variable as .mat file
Hello everyone,
Can anybody explain why my .mat file gets so large when I save, for example, a variable that contains cells?
Here is my case:
Imagine I have a text file of only 50 MB in which each line looks like this:
1567461735.497778;5051030001001903;800001E1;Sttel_ZLG1;INV;0.093750
1567461735.498107;40403C3C00000000;80000201;Sttel_ZLG;TKK1;24.000000
.
.
.
When I read this text file with textscan() and then save the resulting variable C with save() as below:
fid = fopen('samples.txt','rt');
C = textscan(fid,'%f %s %s %s %s %f','delimiter',';');
fclose(fid);
save('test.mat','C')
the test.mat file is over 1 GB. How can it become more than 20 times bigger when the original file is just 50 MB? Is there any solution for this?
I hope you can help me.
Thank you.
6 Comments
dpb on 5 Mar 2020 (Edited: dpb on 5 Mar 2020)
			"I'm sure the problem is in textscan and how it read my text..."
Nothing is wrong with textscan itself; it's how you're using it. It returned a 1x6 cell array, one cell for each of the six columns in the text file. Since the data aren't all of the same class, and you didn't use the 'CollectOutput' parameter to merge adjacent columns of the same data type into the minimum number of cells, it built one for each.
The 977K lines of data are the rows in those cell arrays; you can treat them as you wish, as demonstrated below; a table would be one way.
The memory use comes from the data structure you pick and is the implementation overhead of whichever one you select; that is overhead of those datatype classes and not the fault of textscan.
It's your responsibility to use the proper functionality within textscan, or to otherwise manipulate the data returned from it, to match what the data actually are and to pick the most appropriate structure for the problem; MATLAB can't necessarily do that automagically for you. The most efficient approach is always to use the base numeric classes for absolutely everything possible; that's not always the most convenient way to program, however.
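As a rough illustration of the one-cell-per-column point, the sketch below parses the two sample lines from the question (repeated 1000 times to mimic a longer file, which is an assumption made only for this example) and compares the bytes held by a numeric column and a text column. It also repeats the call with 'CollectOutput' so the two results can be compared. Names such as sampleLines and Cmerged are made up for the sketch.

```matlab
% Two sample lines from the question, repeated to mimic a longer file.
sampleLines = { ...
    '1567461735.497778;5051030001001903;800001E1;Sttel_ZLG1;INV;0.093750', ...
    '1567461735.498107;40403C3C00000000;80000201;Sttel_ZLG;TKK1;24.000000'};
txt = strjoin(repmat(sampleLines, 1, 1000), sprintf('\n'));

fmt = '%f %s %s %s %s %f';

% Default call: one cell per column of the format.
C = textscan(txt, fmt, 'Delimiter', ';');

% Same data, asking textscan to merge adjacent columns of the same class.
Cmerged = textscan(txt, fmt, 'Delimiter', ';', 'CollectOutput', true);

zeit = C{1};   % numeric column (double)
hx   = C{2};   % text column (cell array of character vectors)
infoNum = whos('zeit');
infoTxt = whos('hx');

fprintf('cells without CollectOutput: %d, with: %d\n', numel(C), numel(Cmerged));
fprintf('numeric column: %d bytes, text column: %d bytes\n', ...
        infoNum.bytes, infoTxt.bytes);
```

The text column costs far more than the numeric one because every cell carries its own header on top of the characters it stores.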
Accepted Answer
dpb on 29 Feb 2020 (Edited: dpb on 1 Mar 2020)
      "I expect ,the size of this structure ... not to be much much bigger than my data..."
That's perhaps a reasonable expectation on the surface; unfortunately, as the famous aphorism has it, "In theory, there is no difference between practice and theory. In practice, there is."
The overhead in the struct is significant compared with the raw data the structure contains; for organizing the data by signal as you have, you've paid a large price in memory. Perhaps TMW should have found a more efficient implementation for the struct class, but that's what it is, so to use it, that's the price, unfortunately.
I tried one optimization that cut the overhead on the sample case by over a factor of two; whether you can make use of it in the real application I don't know. I turned the HEX string variable into a categorical one, but only for the categories in each message, NOT the entire file. Doing the latter actually increases the storage significantly, because a categorical array carries the full list of categories in every instance, so when it gets duplicated across the struct elements that overhead also proliferates. However, if you can process the data such that you're looking at only one set at a time, then you should be able to get by with this.
The only line I changed in your code is:
 can(i).hex = categorical(hx(i==ix));
 which resulted in the following comparison; CELLCAN is a copy of the original cell struct saved for the comparison.
>> whos CELLCAN can
  Name         Size              Bytes  Class     Attributes
  CELLCAN      1x149            477380  struct              
  can          1x149            210680  struct              
>> 
>> CELLCAN(1)
ans = 
  struct with fields:
       signal_name: 'AKHZ3S1'
    telegramm_name: 'Stat_NEM_1'
      decimal_wert: [17×1 double]
              zeit: [17×1 double]
               hex: {17×1 cell}
>> can(1)
ans = 
  struct with fields:
       signal_name: 'AKHZ3S1'
    telegramm_name: 'Stat_NEM_1'
      decimal_wert: [17×1 double]
              zeit: [17×1 double]
               hex: [17×1 categorical]
>> 
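The difference between per-group and whole-file categorical conversion described above can be sketched on synthetic data. Everything here (50 made-up signals, 20 rows each, one hex payload per signal, the names perGroup and sliced) is an assumption for illustration and not the original data set.

```matlab
% Synthetic stand-in data: 50 signals, 20 rows each, one hex payload per signal.
nGroups = 50;
nRows   = 20;
hexVals = arrayfun(@(k) sprintf('%016X', k), 1:nGroups, 'UniformOutput', false);
ix = repmat((1:nGroups).', nRows, 1);      % group index for every row
hx = reshape(hexVals(ix), [], 1);          % hex string for every row

hxAll = categorical(hx);                   % whole-file categorical

perGroup = struct('hex', cell(1, nGroups));
sliced   = struct('hex', cell(1, nGroups));
for i = 1:nGroups
    perGroup(i).hex = categorical(hx(ix == i));  % categories of this group only
    sliced(i).hex   = hxAll(ix == i);            % slice keeps every category
end

infoPer = whos('perGroup');
infoSl  = whos('sliced');
fprintf('per-group: %d bytes, sliced from whole file: %d bytes\n', ...
        infoPer.bytes, infoSl.bytes);
```

If the whole-file categorical already exists, wrapping each slice in removecats() should drop the unused categories and bring the two variants closer together.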
Another alternative you might consider is to not build the struct at all but do the processing on the fly. Whether that's practical depends on just what you're after in the end, I suspect, but I don't have any idea what the next step(s) is (are).
t=table(zt, hx, tn, sn, dw);        % make a table of the desired variables
t.hx=categorical(t.hx);             % turn those appropriate into categorical
t.tn=categorical(t.tn);             % that saves memory in single location
t.sn=categorical(t.sn);             % as long as don't duplicate the whole thing
This shows a distinct memory savings over both the struct and the input file as far as saving the data in usable form...
>> whos t CELLCAN
  Name            Size              Bytes  Class     Attributes
  CELLCAN         1x149            477380  struct              
  t            2426x5               93910  table               
>> t(1:10,:)
ans =
  10×5 table
        zt               hx                 tn                sn           dw   
    __________    ________________    _______________    ____________    _______
    1.5675e+09    400100C080000000    Stat_NEM_1         HEAT_ON               0
    1.5675e+09    400100C080000000    Stat_NEM_1         COOL_ON               1
    1.5675e+09    400100C080000000    Stat_NEM_1         CLIM_ON               0
    1.5675e+09    0E000C0014000000    Stat_Bat_BMU_09    BMU_UBAT_09     0.69641
    1.5675e+09    0E000C0014000000    Stat_Bat_BMU_9     BMU_IBAT_09     0.17234
    1.5675e+09    0E000C0014000000    Stat_Bat_BMU_9     BMU_TKK_09           20
    1.5675e+09    0E000C0014000000    Stat_Bat_BMU_9     BMU_ANLU1_09          0
    1.5675e+09    0E000C0014000000    Stat_Bat_BMU_9     BMU_ANLU2_09          1
    1.5675e+09    7F00A800F4010000    Bat1_Aw_T1         BATIMAXELA1         509
    1.5675e+09    7F00A800F4010000    Bat1_Aw_T1         BATIMAXLA1            0
>> 
To work with this format, which isn't organized by sn the way your struct is, use:
>> [g,ig]=findgroups(t.sn);
>> whos g ig
  Name         Size            Bytes  Class          Attributes
  g         2426x1             19408  double                   
  ig         149x1             19261  categorical              
>>
Here, ig is the same set of 149 unique values of sn, and g is the lookup vector for finding each group in t, just as the index variable returned by unique was before.
splitapply and/or rowfun let you process each group and return whatever results you wish for each as you go. You trade some processing for storage, although you can create a new summary table of length 149 as output, which may be where you're trying to get to anyway...
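A minimal sketch of that findgroups/splitapply pattern, on a small made-up table (tDemo, with invented values) laid out like t. The per-group statistics chosen here, a row count and a mean, are only placeholders for whatever processing is actually needed.

```matlab
% Small made-up table in the same layout as t (signal name and decoded value).
sn = categorical({'HEAT_ON'; 'COOL_ON'; 'HEAT_ON'; 'BMU_TKK_09'; 'COOL_ON'; 'BMU_TKK_09'});
dw = [0; 1; 1; 20; 1; 22];
tDemo = table(sn, dw);

[g, ig] = findgroups(tDemo.sn);            % g: group number per row, ig: group labels

nObs   = splitapply(@numel, tDemo.dw, g);  % rows per signal
meanDw = splitapply(@mean,  tDemo.dw, g);  % mean value per signal

% One summary row per signal, without ever building a struct.
groupSummary = table(ig, nObs, meanDw, 'VariableNames', {'sn', 'nObs', 'meanDw'});
disp(groupSummary)
```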
ADDENDUM:
tCAN=struct2table(can,'asarray',1);  % convert the struct to table with arrays
results in a table organized similarly to the struct...
>> tCAN(1:10,:)
ans =
  10×5 table
    signal_name    telegramm_name     decimal_wert         zeit                hex        
    ___________    _______________    _____________    _____________    __________________
    AKHZ3S1        Stat_NEM_1         {17×1 double}    {17×1 double}    {17×1 categorical}
    AKHZ3S2        Stat_NEM_1         {17×1 double}    {17×1 double}    {17×1 categorical}
    AKHZ4S1        Stat_NEM_1         {17×1 double}    {17×1 double}    {17×1 categorical}
    AKHZ4S2        Stat_Bat_BMU_09    {17×1 double}    {17×1 double}    {17×1 categorical}
    BATIMAXELA1    Stat_Bat_BMU_9     {33×1 double}    {33×1 double}    {33×1 categorical}
    BATIMAXELA2    Stat_Bat_BMU_9     {33×1 double}    {33×1 double}    {33×1 categorical}
    BATIMAXLA1     Stat_Bat_BMU_9     {33×1 double}    {33×1 double}    {33×1 categorical}
    BATIMAXLA2     Stat_Bat_BMU_9     {33×1 double}    {33×1 double}    {33×1 categorical}
    BATUMAXELA1    Bat1_Aw_T1         {17×1 double}    {17×1 double}    {17×1 categorical}
    BATUMAXELA2    Bat1_Aw_T1         {17×1 double}    {17×1 double}    {17×1 categorical}
>> save canDECstruct2table tCAN  -v7.3
>> !dir can*.mat
...
02/29/2020  10:08 AM         1,462,856 CANDec(day_1_9)_2.mat 
02/29/2020  02:12 PM         1,081,744 canDECcateg.mat 
03/01/2020  07:47 AM           802,303 canDECstruct2table.mat 
02/29/2020  02:28 PM           195,628 canDECtable.mat 
               5 File(s)      3,561,171 bytes 
               0 Dir(s)  819,800,248,320 bytes free 
>>
>> whos t tCAN
  Name         Size             Bytes  Class    Attributes
  t         2426x5              93911  table              
  tCAN       149x5             196162  table              
>>
Although the use of the cell arrays inside the table also doubles the memory, the footprint is still much better than the structure. But the -v7.3 storage problem comes back with the cell arrays in the table: not quite as bad as the struct, but the file is still 4X the in-memory size.
Guess the moral of the story is there is no free lunch but this one is an expensive dinner!
ADDENDUM DEUX:
This might be a place where the 'RowNames' property is useful -- dunno, again depends on what the next step(s) is(are)
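One way the 'RowNames' idea might look, as a sketch on a made-up two-row miniature of tCAN (the table tDemo and its values are invented): with the signal names as row names, a signal can be looked up by name rather than by position.

```matlab
% Made-up miniature of tCAN: two signals, each with its own time vector.
signal_name = {'AKHZ3S1'; 'AKHZ3S2'};
zeit = {(1:17).'; (1:33).'};
tDemo = table(zeit, 'RowNames', signal_name);

% Look a signal up by name instead of by position.
row = tDemo('AKHZ3S2', :);
z   = row.zeit{1};
fprintf('%s has %d samples\n', row.Properties.RowNames{1}, numel(z));
```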
1 Comment
Simon on 7 Sep 2023 (Edited: Simon on 7 Sep 2023)
			>> The only line I changed in your code is:
>> can(i).hex = categorical(hx(i==ix));
I use the same method to downsize cell arrays. In my case, they hold a mixture of double and string data, but mainly strings. The string variables can reasonably be treated as categorical. Though a couple of string variables generate a large number of categories, recasting to categorical still reduces storage size significantly.
However, some processing is much slower on categorical variables than on strings. In that case I recast them back to string for that job.
More Answers (1)
Walter Roberson on 29 Feb 2020
Cell arrays and structs are represented inefficiently in v7.3 files. Although HDF5 does provide a compound data type, it does not provide much in the way of nested data types. The compression is also not as effective in HDF5.
Basically, if you have a large object to save and it is not pure numeric, you will certainly get a much larger v7.3 file.
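A quick way to see this effect on your own machine is to save the same cell array in both formats and compare the file sizes. The sketch below uses a made-up cell array of 2000 short strings and temporary file names; the actual ratio will depend on the data and the MATLAB release.

```matlab
% Made-up cell array of short strings, similar in spirit to a textscan column.
C = arrayfun(@(k) sprintf('%016X', k), (1:2000).', 'UniformOutput', false);

f7  = [tempname '_v7.mat'];
f73 = [tempname '_v73.mat'];
save(f7,  'C', '-v7');      % classic compressed MAT-file format
save(f73, 'C', '-v7.3');    % HDF5-based MAT-file format

d7  = dir(f7);
d73 = dir(f73);
fprintf('-v7: %d bytes, -v7.3: %d bytes\n', d7.bytes, d73.bytes);

% Remove the temporary files again.
delete(f7);
delete(f73);
```

If no single variable exceeds the size limit that -v7 imposes, saving with -v7 is one way to sidestep the overhead.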
1 Comment
dpb on 29 Feb 2020 (Edited: dpb on 29 Feb 2020)
Yeah, that's disappointing, indeed. Not being able to do anything about that problem, I looked for ways to shrink the storage internally, hoping that would spill over to the external storage as well. It seemed to help at least some...
>> !dir can*.mat
... 
02/29/2020  10:08 AM         1,462,856 CANDec(day_1_9)_2.mat 
02/29/2020  02:12 PM         1,081,744 canDECcateg.mat 
               2 File(s)      2,544,600 bytes 
               0 Dir(s)  819,996,459,008 bytes free 
>> 
where the second is the above struct with the categorical variable for the HEX field.
I suppose another memory-saving device might be to use single() for the decimal_wert floating point field instead of double(); it doesn't look as though it holds high-precision values, although the POSIX time requires double, so that can't be used with it.
But the .mat-file storage explosion with the struct is really amazing when compared to the table form above... as is, it would appear, the effectiveness of compression without the struct.
>> whos t
  Name         Size            Bytes  Class    Attributes
  t         2426x5             93910  table              
>> save canDECtable t -v7.3
>> !dir can*.mat
 ...
02/29/2020  10:08 AM         1,462,856 CANDec(day_1_9)_2.mat 
02/29/2020  02:12 PM         1,081,744 canDECcateg.mat 
02/29/2020  02:28 PM           195,628 canDECtable.mat 
               3 File(s)      2,740,228 bytes 
               0 Dir(s)  819,996,258,304 bytes free 
>> 
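The single() suggestion in this comment can be checked with a short sketch. The values are copied from the two sample lines in the question; the variable names are only for illustration.

```matlab
% Values copied from the two sample lines in the question.
zeit = [1567461735.497778; 1567461735.498107];   % POSIX time stamps
dw   = [0.093750; 24.000000];                    % decoded signal values

dwSingle = single(dw);
infoD = whos('dw');
infoS = whos('dwSingle');
fprintf('dw as double: %d bytes, as single: %d bytes\n', infoD.bytes, infoS.bytes);

% Round-trip errors: how far converting to single moves each value.
errDw   = max(abs(double(dwSingle) - dw));
errZeit = max(abs(double(single(zeit)) - zeit));
fprintf('max error in dw: %g, max error in zeit: %g seconds\n', errDw, errZeit);
```

Near 1.5e9 adjacent single values are 128 apart, so time stamps with sub-second resolution cannot survive the conversion, while small signal values like these two pass through unchanged.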