OCLC Merging of the 7 library sets
Brian Lavoie at OCLC:
I'd like to start disseminating some of the analysis I've finished. To
start with, here's a spreadsheet containing a characterization of the 7
individual sets of records submitted by Smithsonian, MOBOT, NYBG, NHML,
Kew, AMNH, and Harvard. For each set, I've broken the records down by
Type of Record, Bibliographic Level, Type of Material, Format, Language,
and Publication Date. I also performed the same analysis for all records
combined (but not merged). I did not do subject classification for these
record sets, but I will do it for the merged set. I think the results
are self-explanatory, but please take a look and let me know what you
think. Comments, questions, pointing out of errors, etc. are all
welcome. There is some scope for revision and additional work if that
would be helpful, but I think this gives a pretty good sense of what was
provided.
As I mentioned last week, I merged the language-based monograph records
into a set of unique records (in MARC format). The resulting file is
nearly 700 MB in size, so obviously I can't send it as an attachment. I
think what I'll do is write up a short document describing how I did the
merge, what the merged records look like and so forth, and then we can
figure out what the best way to share the records is. So more on this
soon ...
LavoieReport8_8_06BHLreport.xls
Here is a description of how I merged the records, and what the merged
file looks like. Hopefully this explanation is reasonably clear; if not,
I'm happy to answer any questions. I've also attached a few sample
records in human-readable form from the merged file, to illustrate
various things covered in the explanation (since the sample is taken
from the beginning of the file, which is in alphabetical order by ID,
all of the records are from AMNH).
As I mentioned in an earlier message, the file of merged records is
nearly 700 MB in size. So we need some ideas for how best to share these
records.
I'll try to distribute some statistics on overlap and also the subject
analysis by the end of the week.
[[OCLC Merging of the 7 library sets|sample_merged.txt]]
[[OCLC Merging of the 7 library sets|mergeDescription.doc]]