
Crystallography Open Database

Posted: 26 Sep 2009, 18:16
by Rudi
Hi everyone,

Inspired by a recent article in J. Appl. Cryst., I would like to start this thread.

http://journals.iucr.org/j/issues/2009/ ... index.html

The Crystallography Open Database can be found here:

http://www.crystallography.net/

What do you think about it? Do you already have experience with it? Have you ever used it, either for searching or for depositing crystal structures?

Any comments are welcome.

Re: Crystallography Open Database

Posted: 26 Sep 2009, 18:30
by johnewarren
Hi Rudi

Yeah, COD is an interesting thing; I've posted a bit about it before: CDS&COD. The CDS part will have little meaning, as it is a UK-specific resource, but perhaps the COD stuff will also give the debate another jumping-off point.

I used to be very vocal about the project and "bang on" about it, as they say over here. I've gone a bit quieter in recent years, as I heard some rumours that it was started to try to force the CSD to release its structures or database to the general public (something I do not disagree with; in other words, I think you should be able to download the database, but pay for tools or write your own tools to access it), but also for some of the founders of COD to use it in commercial software? Perhaps it was just a drunken discussion at a conference, you know how it goes, but they did back it up by pointing out that one of the COD founders now runs their own commercial powder software with database searching ability? They found this out from the way the domain name changed ownership!

I also contacted them via email a year ago and asked if they wanted to post on/join the forum, but I still have not had a reply :cry: I think they had a period of dormancy for a year or two?

I should also post a full citation link:
J. Appl. Cryst. (2009). 42, 726-729 [doi:10.1107/S0021889809016690]
Crystallography Open Database - an open-access collection of crystal structures
S. Grazulis, D. Chateigner, R. T. Downs, A. T. Yokochi, M. Quirós, L. Lutterotti, E. Manakova, J. Butkus, P. Moeck and A. Le Bail

To my shame, I have not set up a link to their website via the forum links area or on my website links area!

Re: Crystallography Open Database

Posted: 26 Sep 2009, 19:06
by Rudi
Sorry, I should have used the search function thoroughly. You may attach this thread to the previous one.

johnewarren wrote:Yeah, COD is an interesting thing; I've posted a bit about it before: CDS&COD. The CDS part will have little meaning, as it is a UK-specific resource, but perhaps the COD stuff will also give the debate another jumping-off point.

Is it possible to search the COD by means of a chemical diagram, as in CSD ConQuest?

johnewarren wrote:I used to be very vocal about the project and "bang on" about it, as they say over here. I've gone a bit quieter in recent years, as I heard some rumours that it was started to try to force the CSD to release its structures or database to the general public (something I do not disagree with; in other words, I think you should be able to download the database, but pay for tools or write your own tools to access it), but also for some of the founders of COD to use it in commercial software?

I also think that an open-access database is a good idea. The PDB is accessible for free, isn't it? So, why not for small molecules?

johnewarren wrote:Perhaps it was just a drunken discussion at a conference, you know how it goes, but they did back it up by pointing out that one of the COD founders now runs their own commercial powder software with database searching ability? They found this out from the way the domain name changed ownership!

Are you referring to Pearson's Crystal Data?

http://www.crystalimpact.com/pcd/Default.htm

johnewarren wrote:I also contacted them via email a year ago and asked if they wanted to post on/join the forum, but I still have not had a reply :cry: I think they had a period of dormancy for a year or two?

Well, it seems that not everyone likes internet forums very much. I have also invited some colleagues to the forum, but to my knowledge none of them has joined thus far.

Re: Crystallography Open Database

Posted: 26 Sep 2009, 23:07
by johnewarren
Hi, Rudi

Don't worry about the search; I think your post is great. I had not read the recent article, so that is good, and I think cross-posting in this case is beneficial rather than duplication, so I will leave it as is.

If the CDS guys (different from CSD, just in case someone thinks I've got my letters the wrong way around) think they could search it, then it should be doable if someone would invest the time in the search engine. That is where the CSD brings added value to the data, and why the CDS brings even more value through the CrystalWeb portal.

The problem is getting around the possible duplication and/or conflicts in ownership of information regarding structures. In an ideal world you would like to post your structure to both parties, but I have a feeling that there was some lawsuit, or something potentially not good, about doing so?

I'll need to read the website again to see about that.

Yep, some people just do not want to join the forum. Others see things on the forum and then privately email me about them rather than joining and posting in a public debate about such matters? Strange, eh! It must have taken them more time to find my private work email than it would have to just join the forum!

Speaking of open access, did you see that Nature has launched/is launching an open journal under a Creative Commons-like structure?

Re: Crystallography Open Database

Posted: 27 Sep 2009, 12:45
by Rudi
johnewarren wrote:If the CDS guys (different from CSD, just in case someone thinks I've got my letters the wrong way around) think they could search it, then it should be doable if someone would invest the time in the search engine. That is where the CSD brings added value to the data, and why the CDS brings even more value through the CrystalWeb portal.

So, at the moment it is not possible to search in the COD by means of a chemical diagram? Unfortunately, I do not have access to the CDS as my institution is not within the UK.

johnewarren wrote:The problem is getting around the possible duplication and/or conflicts in ownership of information regarding structures. In an ideal world you would like to post your structure to both parties, but I have a feeling that there was some lawsuit, or something potentially not good, about doing so?

In the cited article it's mentioned that the fragmentation of structural data into the ICSD, CSD and CRYSTMET is for historical reasons. This is probably true. However, I think any further fragmentation is not a good idea. Therefore, I'm somewhat critical about the COD, too.

As a matter of fact the IUCr supports the COD; all of the structural data published in IUCr journals are piped into the COD.
However, the COD still seems to be rather uncommon, and publication in other journals will normally require data deposition with the CCDC.

Just to put the sizes in perspective: the COD contains ~80,000 structures, while the CSD contains ~450,000.

Fortunately, one can request any data from the CCDC free of charge. However, without knowing which publication a structure is attached to, this doesn't help very much.

johnewarren wrote:Speaking of open access, did you see that Nature has launched/is launching an open journal under a Creative Commons-like structure?

I'm afraid not. What kind of journal? "Structure" is for proteins or biomolecules, isn't it?

Re: Crystallography Open Database

Posted: 28 Sep 2009, 13:32
by johnewarren
Rudi wrote:Fortunately, one can request any data from the CCDC free of charge. However, without knowing which publication a structure is attached to, this doesn't help very much.

This is true, but only in ones and twos; it would be very hard to get the whole database out of the CCDC. The database itself always seemed to me to be something that the IUCr should control and house, and that the CCDC should get from them to add their "value-added tools" to.

Simon Coles has some database thing; I cannot remember what it is or what it is for, but I think it may be linked to the CCDC? (Yes, at this moment I am too lazy and/or busy to go off and find out what the thing Simon did is called or where it is located; ironically, I may have already posted it here somewhere on the forum and I'm still not going to look) :twisted: (Why? A rather long Nature paper sitting on my desk wants to be read, and I am doing whatever I can to not read it. I must draw a line under it right NOW and read the bloody thing.)

Re: Crystallography Open Database

Posted: 28 Sep 2009, 13:36
by johnewarren
I nearly forgot about posting this open access Nature thing http://www.nature.com/press_releases/naturecommunications.html

Re: Crystallography Open Database

Posted: 17 Oct 2009, 19:05
by alexandr
hi guys!

The database is cool! I like open source =)
But can I express some opinions about it:
1. I don't like the search. I mean, I guess a really cool database needs a more powerful search interface. I guess the query could be in some kind of SHELXL format; yes, I said SHELXL. I can develop my idea if there is some interest.

2. For me, one of the meanings of open is downloadable; yes, if it is open, anyone should be able to download this database.

3. I have a crazy idea about this database, mostly about search, since I guess a database without search sucks =) Yeah, Google rocks =)

PS: I like this project very much and I want to try to be part of it, if COD will allow me =)

Sasha.

Re: Crystallography Open Database

Posted: 17 Oct 2009, 21:13
by johnewarren
Hi Sasha

Not a clue how to join COD; they do not even return my emails!

Yep, completely agree: open should also mean I can download snapshots.

I too think it would be cool to write a search engine for it; I was thinking of a plugin for Olex2 rather than SHELXL. It would be a useful tool for comparing your structure against known published results.

I also suggested a similar thing between Olex2 and CDS.

Re: Crystallography Open Database

Posted: 17 Oct 2009, 23:57
by alexandr
naaaa, I was wrong,
svn is accessible: svn://www.crystallography.net/cod and also http://www.crystallography.net/cif/<COD number>
so it is possible to play with CIFs =)

@johnewarren, anyway, their search sucks =). Your approach (for Olex2) is more GUI-ish, mine is more console (can I say, e.g., 'strict'? :roll: ).

alllllll'right, because my girl is angry with me, I will reveal my crazy idea =)

I want to import full CIF files into a database (yes, with all distances, angles, etc., etc.)[1].
If we go deep into the details of the db, the tables will look like this:
* a table for header info: everything from data_ up to the first loop (loop_ _atom_site_label), plus the last lines about angle of diffraction, etc.
* a table for each big loop: 5, I guess.
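
The table layout described above could be sketched roughly as follows. This is only an illustration under assumptions: the table and column names are hypothetical, only two of the loop tables are shown (the others would follow the same pattern), and Python's built-in sqlite3 module stands in for PostgreSQL so that the sketch runs without a server.

```python
import sqlite3

# Hypothetical layout: one header table plus one table per CIF loop.
# sqlite3 is used here only as a stand-in for PostgreSQL.
SCHEMA = """
CREATE TABLE entry (
    cod_id      INTEGER PRIMARY KEY,   -- same number COD uses for the CIF
    formula_sum TEXT,
    space_group TEXT,
    cell_a REAL, cell_b REAL, cell_c REAL,
    cell_alpha REAL, cell_beta REAL, cell_gamma REAL,
    wavelength  REAL
);
CREATE TABLE atom_site (
    cod_id INTEGER REFERENCES entry(cod_id),
    label TEXT, element TEXT,
    fract_x REAL, fract_y REAL, fract_z REAL,
    PRIMARY KEY (cod_id, label)
);
CREATE TABLE geom_bond (
    cod_id INTEGER REFERENCES entry(cod_id),
    label_1 TEXT, label_2 TEXT,
    site_symmetry_2 TEXT,
    distance REAL
);
"""


def create_database(path=":memory:"):
    """Create an empty database with the header and loop tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
# One header row, using values from COD entry 1000000.
conn.execute(
    "INSERT INTO entry (cod_id, formula_sum, space_group, cell_a) VALUES (?, ?, ?, ?)",
    (1000000, "C5 H17 Al N2 O8 P2", "P2(1)/n", 7.8783),
)
```

Keeping the COD number as the primary key means every loop row can be traced back to the CIF it came from.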

After that (that's not so simple, but still not so hard =)), a very cool search is needed. I'm thinking about a file format similar to the export file after ORTH in the XP software (yes, this is proprietary, from GM).

small example:
Code:
TITL     a in P2(1)/c                       
CELL 1
SFAC C  H  N  O
O1    4    0.00000    0.00000    0.00000
O2    4    0.58725   -2.08931   -1.90980
O3    4    3.34897   -2.80072   -2.93412
O4    4    4.79309   -1.67345   -0.71235
O5    4    4.72131    0.87606    1.01439
O6    4    1.83065    1.30171    1.24343
[lots of text]
LINK O1   C1     1
LINK O1   C2     1
LINK C2   H2A    1
LINK C2   H2B    1
LINK O2   C3     1
LINK C2   C3     1
END


So as you can see, it is simple and powerful. Coordinates should not be necessary, but someone should still be able to define them. For unimportant coordinates the second variable comes in (oops, there can be coordinates like 20.0000, need to think about it). For the LINK command, I don't know what the 4th column means, but we definitely need bond type (because CCDC has it :P). Several things are missing in this file: symmetry, for which I suggest the SHELXL format LATT + SYMM; it is long but more machine-ish than the CIF one. Another missing thing is temperature factors (Ueq); I don't know if we really need them =)
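
A minimal sketch of how such a query file might be read, assuming (as in SHELX-style atom lines) that the second column of an atom line is the position of the element in the SFAC list. The function name is hypothetical, and the flag on LINK lines is kept as text because its meaning is not known here.

```python
def parse_query(text):
    """Split an XP-style listing into atoms and LINK records.

    Returns (atoms, links): atoms maps label -> (element, x, y, z),
    links is a list of (label_1, label_2, flag) tuples.
    """
    sfac, atoms, links = [], {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("TITL", "CELL"):
            continue
        if fields[0] == "END":
            break
        if fields[0] == "SFAC":
            sfac = fields[1:]
        elif fields[0] == "LINK" and len(fields) >= 4:
            # The meaning of the last field is unknown; keep it verbatim.
            links.append((fields[1], fields[2], fields[3]))
        elif len(fields) == 5:
            label, number, x, y, z = fields
            # Assumption: 'number' is a 1-based index into the SFAC list.
            element = sfac[int(number) - 1]
            atoms[label] = (element, float(x), float(y), float(z))
    return atoms, links


SAMPLE = """\
TITL     a in P2(1)/c
CELL 1
SFAC C  H  N  O
O1    4    0.00000    0.00000    0.00000
O2    4    0.58725   -2.08931   -1.90980
[lots of text]
LINK O1   C1     1
LINK O1   C2     1
END
"""

atoms, links = parse_query(SAMPLE)
```

Lines that fit neither pattern, such as the elided middle of the example, are simply skipped.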

About search: now I'm thinking that it is really hard to define a bond length between two atoms in a search, because only coordinates are available; maybe we can somehow extend the LINK line, need to think about it. We also need an X atom type, so that someone can search for something like C-1-X-2-O, with bond lengths 1 and 2 known (CCDC will suck after that kind of search =)) )
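
One possible way to handle the C-1-X-2-O kind of search is to compute the bond lengths from the coordinates at search time. The sketch below assumes the coordinates are Cartesian and in angstroms; the function names, the tolerance and the small test molecule are all made up for illustration.

```python
import math


def bond_length(atoms, a, b):
    """Distance between two atoms; atoms maps label -> (element, x, y, z)."""
    return math.dist(atoms[a][1:], atoms[b][1:])


def find_chains(atoms, links, elements, lengths, tol=0.05):
    """Yield label chains matching elements[0]-lengths[0]-elements[1]-...

    'X' in elements matches any atom type; lengths[i] is the wanted
    distance between chain positions i and i + 1, within +/- tol.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def matches(label, element):
        return element == "X" or atoms[label][0] == element

    def extend(chain):
        step = len(chain)
        if step == len(elements):
            yield tuple(chain)
            return
        for nxt in sorted(neighbours.get(chain[-1], ())):
            if nxt in chain or not matches(nxt, elements[step]):
                continue
            if abs(bond_length(atoms, chain[-1], nxt) - lengths[step - 1]) <= tol:
                yield from extend(chain + [nxt])

    for start in sorted(atoms):
        if matches(start, elements[0]):
            yield from extend([start])


# Made-up fragment: C2-C1-N1-O1 with idealised bond lengths.
demo_atoms = {
    "C1": ("C", 0.0, 0.0, 0.0),
    "C2": ("C", -1.54, 0.0, 0.0),
    "N1": ("N", 1.47, 0.0, 0.0),
    "O1": ("O", 1.47, 1.40, 0.0),
}
demo_links = [("C1", "C2"), ("C1", "N1"), ("N1", "O1")]
hits = list(find_chains(demo_atoms, demo_links, ["C", "X", "O"], [1.47, 1.40]))
```

Here only the chain C1, N1, O1 satisfies both the element pattern and the two bond lengths.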
==========================
technical stuff:
database engine: PostgreSQL (in 2 words: it is better than MySQL)
scripts: 2 types: database populator and search (maybe we need some db bot scripts)

database populator:
1. cif parser (obvious)
2. database writer

search:
1. request parser
2. database searcher
3. output

db bots:
1. maybe some kind of validation in database
2. statistics (ex: interatomic bond length (angles) statistical data) (we can then print this data and sell it as International Tables, he he he)
===========================

For the technical stuff I propose to use Perl.
database populator:
1. cif parser (obvious) STAR::Parser (http://pdb.sdsc.edu/STAR/index.html)
2. database writer (Perl DBI, it can work with any database engine, ok, almost any ;) )

search:
1. request parser (some new perl code ;) )
2. database searcher (Perl DBI)
3. output (some CGI on Perl)
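
The populator and search steps listed above can be sketched end to end. Note the swaps: this is Python rather than Perl, sqlite3 stands in for DBI plus PostgreSQL, and the tiny parser is not STAR::Parser; it only picks up single-line "_tag value" items and skips loops and multi-line text blocks. The table name cif_item and all function names are hypothetical; the sample values come from COD entry 1000000.

```python
import sqlite3


def parse_simple_items(cif_text):
    """Collect single-line '_tag value' items from CIF text."""
    items, in_text_block = {}, False
    for line in cif_text.splitlines():
        if line.startswith(";"):
            # A ';' in column one opens or closes a multi-line text value.
            in_text_block = not in_text_block
            continue
        if in_text_block or not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip().strip("'")
    return items


def write_items(conn, cod_id, items):
    """Store every item as one (cod_id, tag, value) row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cif_item ("
        "cod_id INTEGER, tag TEXT, value TEXT, PRIMARY KEY (cod_id, tag))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO cif_item VALUES (?, ?, ?)",
        [(cod_id, tag, value) for tag, value in items.items()],
    )
    conn.commit()


def search(conn, tag, value):
    """Return the COD ids whose item 'tag' equals 'value'."""
    rows = conn.execute(
        "SELECT cod_id FROM cif_item WHERE tag = ? AND value = ? ORDER BY cod_id",
        (tag, value),
    )
    return [row[0] for row in rows]


SAMPLE_CIF = """\
_publ_section_title
;
[H~3~N(CH~2~)~5~NH~3~].AlP~2~O~8~H, a one-dimensional aluminophosphate
;
_journal_name_full               'Acta Crystallographica, Section C'
_journal_year                    2000
_chemical_formula_sum            'C5 H17 Al N2 O8 P2'
"""

conn = sqlite3.connect(":memory:")
items = parse_simple_items(SAMPLE_CIF)
write_items(conn, 1000000, items)
found = search(conn, "_journal_year", "2000")
```

A generic tag/value table like this is easy to fill but stores everything as text; typed columns would be needed for numeric range searches.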

============================
Conclusions:
It is possible =) I want to hear what you think about it. I will also publish table schemas that describe the db structure. Thank you for reading all that crap =)

Sasha.
PS: I don't like to write new code, I'd rather use someone else's ;)

PS2: testing would be easy: we need the same id (COD id) as there. So we can randomly export entries from the db, compare them with their CIFs and write out all the differences.
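
The round-trip test from PS2 might look something like this sketch. It assumes both the original CIF and the exported entry have already been reduced to tag/value dictionaries; the function names and the example values are hypothetical.

```python
import random


def pick_entries(cod_ids, count, seed=None):
    """Choose up to 'count' COD ids at random (a seed makes it repeatable)."""
    rng = random.Random(seed)
    pool = sorted(cod_ids)
    return sorted(rng.sample(pool, min(count, len(pool))))


def diff_items(original, exported):
    """Return {tag: (original_value, exported_value)} for every mismatch.

    A tag missing on one side is reported with None for that side.
    """
    differences = {}
    for tag in sorted(set(original) | set(exported)):
        before, after = original.get(tag), exported.get(tag)
        if before != after:
            differences[tag] = (before, after)
    return differences


# Made-up example: the export lost the uncertainty on the cell volume.
from_cif = {"_journal_year": "2000", "_cell_volume": "1319.90(5)"}
from_db = {"_journal_year": "2000", "_cell_volume": "1319.9"}
report = diff_items(from_cif, from_db)
```

An empty report for every sampled entry would suggest that the populator and the export agree.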

Re: Crystallography Open Database

Posted: 18 Oct 2009, 01:10
by johnewarren
Digesting, seems ok

I like MySQL :cry: You made me sad!

In regard to Olex2, it is a GUI but also a command line; it is a GUICMDline :roll: It is also modular and has database support built in (in test) for other projects, also svn support, etc. So it could possibly keep a copy of COD locally as well. Hmmm, how big is COD? I just started to download the entire SVN! Looks big!

Bond order is quite tricky but not impossible. No reason why we cannot also include a SMILES version of the structural information?

I also want wavelength. Why? Synchrotron data: it is good to know where the data came from and whether the user actually refined against the real wavelength; it also helps to detect errors in the bond lengths if there are outliers.
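
As a rough illustration of what a stored wavelength allows, the sketch below labels a value as one of two common laboratory sources or as something else. The reference values are the usual Mo and Cu K-alpha wavelengths in angstroms; the tolerance, the labels and the function name are assumptions, and a match cannot prove where the data were actually collected.

```python
# Common laboratory K-alpha wavelengths in angstroms.
LAB_SOURCES = {"Mo K-alpha": 0.71073, "Cu K-alpha": 1.54184}


def classify_wavelength(wavelength, tolerance=0.001):
    """Label a wavelength as a known lab source or as 'other'."""
    for name, reference in LAB_SOURCES.items():
        if abs(wavelength - reference) <= tolerance:
            return name
    return "other (possibly synchrotron)"


lab = classify_wavelength(0.71073)
unusual = classify_wavelength(0.6889)  # made-up non-standard value
```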

Are we suggesting hosting said system?

Perhaps we should get in touch with the CDS guys as a possible home? But under some sort of international rather than local agreement?

PS

I could host a MySQL-based system alongside the forum, as a possible test platform with a subset of CIF files digested.

Re: Crystallography Open Database

Posted: 18 Oct 2009, 01:49
by alexandr
About hosting I don't really care; I guess it is NOT *so* important.

johnewarren wrote:Bond order is quite tricky but not impossible. No reason why we cannot also include a SMILES version of the structural information?

What is SMILES structural information?

johnewarren wrote:I also want wavelength. Why? Synchrotron data: it is good to know where the data came from and whether the user actually refined against the real wavelength; it also helps to detect errors in the bond lengths if there are outliers.

Code:
_diffrn_radiation_wavelength
?

johnewarren wrote:Are we suggesting hosting said system?
Perhaps we should get in touch with the CDS guys as a possible home? But under some sort of international rather than local agreement?

I don't know what we are suggesting; probably a good open source crystal database? =))

As I said, I want the entire CIF in the db.
PS: an example of a COD CIF: http://www.crystallography.net/cif/1/1000000.cif

Re: Crystallography Open Database

Posted: 18 Oct 2009, 09:20
by pascalp
Is a traditional relational database the key?

Just to find a 6-atom ring, it takes an SQL query with 6 joins.
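
To make the point concrete, here is a sketch of what such a query could look like. It assumes a hypothetical table bond(a, b) that stores every bond in both directions, and it uses sqlite3 only so that the example runs as is. The ring needs one copy of the bond table per ring atom, so the query grows with the ring size.

```python
import sqlite3


def ring_query(n):
    """Build SQL that finds n-membered rings in a table bond(a, b)."""
    sql = [
        "SELECT " + ", ".join(f"b{i}.a" for i in range(1, n + 1)),
        "FROM bond AS b1",
    ]
    for i in range(2, n + 1):
        # Each copy of the table continues from the end of the previous bond.
        sql.append(f"JOIN bond AS b{i} ON b{i - 1}.b = b{i}.a")
    where = [f"b{n}.b = b1.a"]  # the last bond closes the ring
    # Start every ring at its smallest atom so it is reported only once.
    where += [f"b1.a < b{i}.a" for i in range(2, n + 1)]
    # No atom may be visited twice.
    where += [
        f"b{i}.a <> b{j}.a"
        for i in range(2, n + 1)
        for j in range(i + 1, n + 1)
    ]
    where.append(f"b1.b < b{n}.a")  # keep one of the two directions
    return "\n".join(sql) + "\nWHERE " + " AND ".join(where)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bond (a INTEGER, b INTEGER)")
# Made-up molecule: a six-membered ring (atoms 1-6) with one substituent (7).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 7)]
conn.executemany("INSERT INTO bond VALUES (?, ?)", edges + [(b, a) for a, b in edges])
rings = conn.execute(ring_query(6)).fetchall()
```

The generated statement for a 6-ring references the bond table six times, which is the cost being pointed out; whether that is acceptable for a first iteration would depend on the size of the bond table and its indexes.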

Re: Crystallography Open Database

Posted: 18 Oct 2009, 15:50
by johnewarren
Sorry Sasha, I missed the entire CIF in the database; I'm happy now! The entire CIF is a good thing, hence my random request for wavelength. Yes, it is in the CIF, and so should be the machine the data came from, another interesting statistic.

I guess I'm showing my age with this one? SMILES - A Simplified Chemical Language
Resources: http://www.journal.chemistrycentral.com ... t/2/S1/P40, http://opensmiles.org/

SMILES
SMILES Website- http://www.daylight.com/dayhtml/doc/the ... miles.html wrote:SMILES contains the same information as might be found in an extended connection table. The primary reason SMILES is more useful than a connection table is that it is a linguistic construct, rather than a computer data structure. SMILES is a true language, albeit with a simple vocabulary (atom and bond symbols) and only a few grammar rules. SMILES representations of structure can in turn be used as "words" in the vocabulary of other languages designed for storage of chemical information (information about chemicals) and chemical intelligence (information about chemistry).

Part of the power of SMILES is that unique SMILES exist. With standard SMILES, the name of a molecule is synonymous with its structure; with unique SMILES, the name is universal. Anyone in the world who uses unique SMILES to name a molecule will choose the exact same name.

One other important property of SMILES is that it is quite compact compared to most other methods of representing structure. A typical SMILES will take 50% to 70% less space than an equivalent connection table, even binary connection tables. For example, a database of 23,137 structures, with an average of 20 atoms per structure, uses only 1.6 bytes per atom when represented with SMILES. In addition, ordinary compression of SMILES is extremely effective. The same database cited above was reduced to 27% of its original size by Ziv-Lempel compression (i.e. 0.42 bytes per atom).


The idea being that, once the information in the database had been harvested, we could then use the connectivity from a SMILES string to give a local information source, a portable dataset of information? It could also contain additional information about average bond lengths etc.: a pocket database of information derived from the larger set, always with the option to mine the bigger database if required.

Re: Crystallography Open Database

Posted: 21 Oct 2009, 01:25
by alexandr
REVISITED (after searching in CCDC =))

ok guys,

I've changed my mind a little bit =) For the first iteration, not the entire CIF will go into the DB. Please help me decide what needs to go into the db.

This is what I think needs to go into the db (from 1000000.cif):
Code:
_publ_author_name
'Phan Thanh, S.'
'Marrot, J.'
'Renaudin, J.'
'Maisonneuve, V.'
_publ_section_title
;
[H~3~N(CH~2~)~5~NH~3~].AlP~2~O~8~H, a one-dimensional aluminophosphate
;
_journal_issue                   9
_journal_name_full               'Acta Crystallographica, Section C'
_journal_page_first              1073
_journal_page_last               1074
_journal_volume                  56
_journal_year                    2000
_chemical_formula_moiety         '(C5 H16 N2 )[AlHP2 O8 ]'
_chemical_formula_sum            'C5 H17 Al N2 O8 P2'
_chemical_formula_weight         322.13
_symmetry_cell_setting           Monoclinic
_symmetry_space_group_name_H-M   P2(1)/n
_audit_creation_method           SHELXL-97
_cell_angle_alpha                90.00
_cell_angle_beta                 95.1470(10)
_cell_angle_gamma                90.00
_cell_formula_units_Z            4
_cell_length_a                   7.8783(2)
_cell_length_b                   10.46890(10)
_cell_length_c                   16.0680(4)
_cell_measurement_reflns_used    5007
_cell_measurement_temperature    296(2)
_cell_measurement_theta_max      29.83
_cell_measurement_theta_min      2.32
_cell_volume                     1319.90(5)
_computing_cell_refinement       SMART
_computing_data_collection       'SMART (Siemens, 1996a)'
_computing_data_reduction        'SHELXTL96 (Siemens, 1996b)'
_computing_molecular_graphics    'DIAMOND (Bergerhoff, 1996)'
_computing_publication_material  SHELXTL
_computing_structure_refinement  'SHELXL93 (Sheldrick, 1993)'
_computing_structure_solution    'SHELXS86 (Sheldrick, 1990)'
_diffrn_ambient_temperature      296(2)
_diffrn_measurement_device       'Siemens SMART diffractometer'
_diffrn_measurement_method       '\w scans'
_diffrn_radiation_monochromator  graphite
_diffrn_radiation_source         'fine-focus sealed tube'
_diffrn_radiation_type           MoK\a
_diffrn_radiation_wavelength     .71073
_diffrn_reflns_av_R_equivalents  .0383
_diffrn_reflns_av_sigmaI/netI    .0532
_diffrn_reflns_limit_h_max       10
_diffrn_reflns_limit_h_min       -10
_diffrn_reflns_limit_k_max       13
_diffrn_reflns_limit_k_min       -14
_diffrn_reflns_limit_l_max       9
_diffrn_reflns_limit_l_min       -21
_diffrn_reflns_number            8939
_diffrn_reflns_theta_max         29.83
_diffrn_reflns_theta_min         2.32
_exptl_absorpt_coefficient_mu    .429
_exptl_absorpt_correction_T_max  .978
_exptl_absorpt_correction_T_min  .844
_exptl_absorpt_correction_type   semi-empirical
_exptl_absorpt_process_details   'SADABS (Sheldrick, 1996)'
_exptl_crystal_colour            colorless
_exptl_crystal_density_diffrn    1.621
_exptl_crystal_density_meas      'not measured'
_exptl_crystal_description       parallelepiped
_exptl_crystal_F_000             672
_exptl_crystal_size_max          .12
_exptl_crystal_size_mid          .06
_exptl_crystal_size_min          .05
_refine_diff_density_max         1.357
_refine_diff_density_min         -.604
_refine_ls_extinction_coef       .013(8)
_refine_ls_extinction_method     'SHELXL93 (Sheldrick, 1993)'
_refine_ls_goodness_of_fit_all   1.055
_refine_ls_goodness_of_fit_ref   1.080
_refine_ls_hydrogen_treatment    constr
_refine_ls_matrix_type           full
_refine_ls_number_parameters     167
_refine_ls_number_reflns         2521
_refine_ls_number_restraints     4
_refine_ls_restrained_S_all      1.370
_refine_ls_restrained_S_obs      1.096
_refine_ls_R_factor_all          .1073
_refine_ls_R_factor_gt           .0584
_refine_ls_shift/esd_mean        .000
_refine_ls_shift/su_max          <0.001
_refine_ls_structure_factor_coef Fsqd
_refine_ls_weighting_scheme
'calc w = 1/[\s^2^(Fo^2^)+(0.0573P)^2^+3.0698P] where P=(Fo^2^+2Fc^2^)/3'
_refine_ls_wR_factor_all         .2069
_refine_ls_wR_factor_ref         .1362
_reflns_number_gt                1901
_reflns_number_total             3421
_reflns_threshold_expression     I>2\s(I)
_[local]_cod_data_source_file    gs1096.cif
loop_
_symmetry_equiv_pos_as_xyz
'x, y, z'
'-x+1/2, y+1/2, -z+1/2'
'-x, -y, -z'
'x-1/2, -y-1/2, z-1/2'

_geom_bond_atom_site_label_1
_geom_bond_atom_site_label_2
_geom_bond_site_symmetry_2
_geom_bond_distance
_geom_bond_publ_flag
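
Many of the values listed above carry a standard uncertainty in parentheses, so a populator would have to split value and uncertainty before storing numbers. The sketch below does that and then, as an example of a validation bot, recomputes the cell volume from the cell parameters of this entry. The function names are hypothetical; the volume expression is the standard general (triclinic) cell formula.

```python
import math
import re

_NUMBER = re.compile(r"^([-+]?\d*\.?\d+)(?:\((\d+)\))?$")


def parse_cif_number(text):
    """Split '7.8783(2)' into (7.8783, 0.0002); the s.u. is None if absent."""
    match = _NUMBER.match(text.strip())
    if match is None:
        raise ValueError(f"not a plain CIF number: {text!r}")
    value_text, su_digits = match.groups()
    value = float(value_text)
    if su_digits is None:
        return value, None
    # The s.u. digits apply to the last decimal places of the value.
    decimals = len(value_text.split(".")[1]) if "." in value_text else 0
    return value, int(su_digits) / 10 ** decimals


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from lengths (angstroms) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


# Cell parameters exactly as they appear in the listing above.
cell = {
    "a": "7.8783(2)", "b": "10.46890(10)", "c": "16.0680(4)",
    "alpha": "90.00", "beta": "95.1470(10)", "gamma": "90.00",
}
values = {name: parse_cif_number(text)[0] for name, text in cell.items()}
volume = cell_volume(**values)
reported, reported_su = parse_cif_number("1319.90(5)")
consistent = abs(volume - reported) < 0.05
```

Values such as `<0.001` are deliberately rejected by the parser; they would need separate handling.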


pascalp wrote:Is a traditional relational database the key?

Just to find a 6-atom ring, it takes an SQL query with 6 joins.

Yeah, that's a good point! For the first iteration, raw SQL search could be the solution: easy and powerful!

Sasha.