How to extract data from a long set of strings and put it into one cell/array/matrix?

Dear Colleagues,
I have been stuck on this for too long and think I need your help. Problem:
I crawled a website with search queries and wrote each response into a string in MATLAB. Thus I have a data folder containing ~1000 .mat files.
Inside, a string looks e.g. like this [see end of post for data]. My problem is that I want to go through all the data and extract the information given by
  • dc:identifier
  • dc:title
  • dc:creator
  • prism:publicationName
  • prism:coverDate
  • prism:coverDisplayDate
  • prism:doi
  • citedby-count
  • prism:aggregationType
from the strings. That means I want to search for the dc:identifier entry, extract the data after that entry, delete all the surrounding characters such as " _ ' and put the information into a cell/matrix.
Each of the ~1000 strings usually contains more than one dataset (mostly 200). Therefore I would like to extract all data into, perhaps, a cell array where the header column holds "dc:identifier" etc. and each following column contains one dataset, ending up with ~147,000 datasets in one array/cell.
So far I have tried strsplit and regex, but my MATLAB knowledge is coming to an end.
Another attempt is to put the following into a huge for loop, read one string after another, and try to get the data out:
somestring = COMPdata2010res7;
% Title: text between '"dc:title":"' and the following '","dc:creator"' (last occurrence only)
title_start = strfind(somestring,'"dc:title":"');
title_end = strfind(somestring,'","dc:creator"');
title = somestring(title_start(end)+12:title_end(end)-1);
% Creator: text between 'creator":"' and '","prism:publication' (last occurrence only)
creator_start = strfind(somestring,'creator":"');
creator_end = strfind(somestring,'","prism:publication');
creator = somestring(creator_start(end)+10:creator_end(end)-1);
[DATA Example from ONE String] [Data filename: "COMPdata2010res7.mat"]
{"search-results":{"opensearch:totalResults":"3127","opensearch:startIndex":"1201","opensearch:itemsPerPage":"200","opensearch:Query":{"@role": "request", "@searchTerms": "%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29", "@startPage": "1201"},"link": [{"@_fa": "true", "@ref": "self", "@href": "http://api.elsevier.com:80/content/search/scopus?start=1201&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"},{"@_fa": "true", "@ref": "first", "@href": "http://api.elsevier.com:80/content/search/scopus?start=0&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"},{"@_fa": "true", "@ref": "prev", "@href": "http://api.elsevier.com:80/content/search/scopus?start=1001&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": 
"application/json"},{"@_fa": "true", "@ref": "next", "@href": "http://api.elsevier.com:80/content/search/scopus?start=1401&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"},{"@_fa": "true", "@ref": "last", "@href": "http://api.elsevier.com:80/content/search/scopus?start=2927&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"}],"entry": [{"@_fa": "true", "link": [{"@_fa": "true", "@ref": "self", "@href": "http://api.elsevier.com/content/abstract/scopus_id/78049372368"},{"@_fa": "true", "@ref": "author-affiliation", "@href": "http://api.elsevier.com/content/abstract/scopus_id/78049372368?field=author,affiliation"},{"@_fa": "true", "@ref": "scopus", "@href": "http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=78049372368&origin=inward"},{"@_fa": "true", "@ref": "scopus-citedby", "@href": "http://www.scopus.com/inward/citedby.url?partnerID=HzOxMe3b&scp=78049372368&origin=inward"},{"@_fa": "true", "@ref": "full-text", "@href": "http://api.elsevier.com/content/article/eid/1-s2.0-S0360835210002287"}],"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/78049372368","dc:identifier":"SCOPUS_ID:78049372368","eid":"2-s2.0-78049372368","dc:title":"Developing Oregon's renewable energy portfolio using fuzzy goal programming model","dc:creator":"Daim 
T.","prism:publicationName":"Computers and Industrial Engineering","prism:issn":"03608352","prism:volume":"59","prism:issueIdentifier":"4","prism:pageRange":"786-793","prism:coverDate":"2010-11-01","prism:coverDisplayDate":"November 2010","prism:doi":"10.1016/j.cie.2010.08.004","pii":"S0360835210002287","citedby-count":"16","affiliation": [{"@_fa": "true", "affilname":"Portland State University","affiliation-city":"Portland","affiliation-country":"United States"}],"prism:aggregationType":"Journal","subtype":"ar","subtypeDescription":"Article","source-id":"18164"},{"@_fa": "true", "link": [{"@_fa": "true", "@ref": "self", "@href": "http://api.elsevier.com/content/abstract/scopus_id/79953798579"},{"@_fa": "true", "@ref": "author-affiliation", "@href": "http://api.elsevier.com/content/abstract/scopus_id/79953798579?field=author,affiliation"},{"@_fa": "true", "@ref": "scopus", "@href": "http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=79953798579&origin=inward"},{"@_fa": "true", "@ref": "scopus-citedby", "@href": "http://www.scopus.com/inward/citedby.url?partnerID=HzOxMe3b&scp=79953798579&origin=inward"}],"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/79953798579","dc:identifier":"SCOPUS_ID:79953798579","eid":"2-s2.0-79953798579","dc:title":"Discovery and analysis of tightly knit communities in telecom social networks","dc:creator":"Modani N.","prism:publicationName":"IBM Journal of Research and Development","prism:issn":"00188646","prism:eIssn":"00188646","prism:volume":"54","prism:issueIdentifier":"6","prism:coverDate":"2010-11-01","prism:coverDisplayDate":"November 2010","prism:doi":"10.1147/JRD.2010.2081230","citedby-count":"2","affiliation": [{"@_fa": "true", "affilname":"IBM India Research Laboratory New Delhi","affiliation-city":"New Delhi","affiliation-country":"India"}],"prism:aggregationType":"Journal","subtype":"ar","subtypeDescription":"Article","article-number":"5643246","source-id":"15099"},{"@_fa": "true", "link": [{"@_fa": 
"true", "@ref": "self", "@href": "http://api.elsevier.com/content/abstract/scopus_id/77958558552"},{"@_fa": "true", "@ref": "author-affiliation", "@href": "http://api.elsevier.com/content/abstract/scopus_id/77958558552?field=author,affiliation"},{"@_fa": "true", "@ref": "scopus", "@href": "http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=77958558552&origin=inward"},{"@_fa": "true", "@ref": "scopus-citedby", "@href": "http://www.scopus.com/inward/citedby.url?partnerID=HzOxMe3b&scp=77958558552&origin=inward"}],"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/77958558552","dc:identifier":"SCOPUS_ID:77958558552","eid":"2-s2.0-77958558552","dc:title":"The development of the display terminal system used in PHEV based on CAN bus"
1 Comment
Stephen23 on 14 Jul 2015
It is easier for everyone if you simply upload your sample data, rather than pasting it into your question. You can edit your question, delete that huge block of text, and then upload the text using the paperclip button and then pressing both Choose file and Attach file.


Answers (1)

Abhishek Pandey on 16 Jul 2015
Hello Marcus,
I understand that you’re trying to extract information like identifier, title, creator, and so on from a search string, and organize it into a cell/matrix.
Although it would be easier for the community to help if you attached sample data to your question, I believe you might be able to do this using the “strsplit” and “strfind” functions.
The “strsplit” function takes a string and a delimiter as input arguments and returns a cell array containing the pieces of the string between occurrences of the delimiter. A multi-character string can be used as the delimiter. The “strfind” function, in contrast, searches the string for occurrences of a pattern and returns a vector with the starting index of each occurrence.
For example, for the following lines of code,
str = 'abcdabcdcdefabcd';
A = strsplit(str, 'ab')
The output is:
A =
'' 'cd' 'cdcdef' 'cd'
On the other hand, for the following lines of code,
str = 'abcdabcdcdefabcd';
B = strfind(str, 'ab')
The output is:
B =
1 5 13
Since the information that you are seeking seems to be in a specific order, you could store the separated strings for different delimiters in different vectors and associate them accordingly using their indices.
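The idea above could be sketched roughly as follows for a single string. This is only a minimal sketch under assumptions: raw is a made-up, shortened stand-in for one crawled string, the names fieldKeys, pieces and result are illustrative, every record is assumed to contain "dc:identifier" before the other keys of interest, and values are assumed not to contain escaped double quotes.

```matlab
% Hypothetical, shortened stand-in for one crawled string (not real data).
raw = ['{"search-results":{"entry": [' ...
    '{"prism:url":"http://example.org/1","dc:identifier":"SCOPUS_ID:1",' ...
    '"dc:title":"First title","dc:creator":"Daim T.","citedby-count":"16"},' ...
    '{"prism:url":"http://example.org/2","dc:identifier":"SCOPUS_ID:2",' ...
    '"dc:title":"Second title","citedby-count":"2"}]}}'];

% Keys to extract; the first key is assumed to open every record.
fieldKeys = {'dc:identifier', 'dc:title', 'dc:creator', 'citedby-count'};

% Split on the first key: piece 1 is the header, every later piece is one
% record that starts with the identifier value.
pieces = strsplit(raw, ['"' fieldKeys{1} '":"']);
nRecords = numel(pieces) - 1;

% Row 1 holds the keys, each following row holds one record.
result = cell(nRecords + 1, numel(fieldKeys));
result(1, :) = fieldKeys;

for r = 1:nRecords
    rec = pieces{r + 1};
    quotePos = strfind(rec, '"');               % positions of all double quotes
    result{r + 1, 1} = rec(1:quotePos(1) - 1);  % identifier ends at the first quote
    for k = 2:numel(fieldKeys)
        tag = ['"' fieldKeys{k} '":"'];
        hit = strfind(rec, tag);
        if isempty(hit)
            result{r + 1, k} = '';              % key missing in this record
            continue
        end
        first = hit(1) + numel(tag);            % first character of the value
        closing = quotePos(find(quotePos >= first, 1));  % next quote closes the value
        result{r + 1, k} = rec(first:closing - 1);
    end
end

disp(result)
```

Here result has the keys in the first row and one record per following row; result.' would give the layout with one record per column. For all ~1000 files, the same steps could be repeated inside a loop and the new rows appended to one cell array.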
I hope that helps!
- Abhishek
